Cost Function in Logistic Regression: A Beginner-Friendly Guide
Last Updated: Jun 04, 2026
Imagine you are learning to throw darts.
Every time you throw, someone tells you how far off you were from the bullseye. That feedback is what helps you adjust your technique and get better with each attempt.
The cost function in logistic regression does exactly the same thing. Every time the model makes a prediction, the cost function measures how wrong that prediction was and sends feedback that nudges the model toward better answers.
Without a cost function, the model has no way of knowing whether it is getting better or worse. It is just throwing darts in the dark.
This guide explains what the logistic regression cost function is, why it is designed the way it is, how it works mathematically in plain terms, and how gradient descent uses it to train a model that actually makes good predictions. No advanced math background required.
Table of contents
- Quick TL;DR Summary
- Why Logistic Regression Needs Its Own Cost Function
- Understanding Log Loss Step by Step
- Maximum Likelihood Estimation: The Theory Behind Log Loss
- Gradient Descent: How the Cost Function Drives Training
- Regularization: Preventing Overfitting Through the Cost Function
- Implementing Logistic Regression Cost Function in Python
- Final Thoughts
- FAQs
- Why can we not use mean squared error for logistic regression?
- What is the difference between log loss and binary cross-entropy?
- How do I know if my log loss value is good?
- What does regularization do to the cost function?
- How does sklearn handle the cost function internally?
Quick TL;DR Summary
- This guide explains the cost function in logistic regression, the mathematical formula that measures how wrong the model’s predictions are and provides the feedback signal training uses to improve parameters.
- You will learn why logistic regression uses log loss instead of squared error, and how log loss produces a convex optimization landscape that allows gradient descent to reliably find the best parameters.
- The guide covers how log loss heavily penalizes confident wrong predictions and gently rewards confident correct ones, creating exactly the right incentive for a classification model to learn well-calibrated probabilities.
- Step-by-step explanations show you how gradient descent minimizes the cost function by repeatedly computing the gradient and updating weights in the direction that reduces prediction error across the training set.
- You will understand how regularization prevents overfitting by adding a weight penalty to the cost function, how to choose between L1 and L2 regularization, and how to implement the full training pipeline correctly in Python using sklearn.
What is the Cost Function in Logistic Regression?
The cost function in logistic regression is a mathematical formula used to measure how far the model’s predictions are from the actual labels. It quantifies prediction error and helps the algorithm determine how the model parameters should be adjusted during training. By minimizing the cost function through optimization techniques such as gradient descent, logistic regression learns the parameter values that produce the most accurate predictions.
Why Logistic Regression Needs Its Own Cost Function
- The Problem With Using Squared Error for Classification
Linear regression uses mean squared error as its cost function and it works well because linear regression outputs continuous values on an unbounded scale.
Logistic regression is different. It outputs probabilities between zero and one through the sigmoid function. When you apply squared error to these probability outputs, the resulting cost function is non-convex in the weights: it can have flat plateaus and more than one local minimum.
Gradient descent needs a smooth bowl-shaped convex function to reliably find the global minimum. A non-convex landscape means gradient descent can get stuck at a local minimum and never find the best possible parameters.
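One way to see the difference is to trace both losses along a single weight and check their curvature. The sketch below assumes one training example with feature x = 1, label y = 1, and no bias; the weight grid and variable names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One training example: feature x = 1, true label y = 1 (illustrative values).
w = np.linspace(-6, 6, 241)        # candidate weights
p = sigmoid(w * 1.0)               # predicted probability for each weight

squared_error = (p - 1.0) ** 2
log_loss_curve = -np.log(p)

# Discrete second derivative: a negative value means the curve bends
# downward somewhere, which a convex function never does.
curv_se = np.diff(squared_error, n=2)
curv_ll = np.diff(log_loss_curve, n=2)

print("squared error has a concave region:", bool((curv_se < 0).any()))
print("log loss has a concave region:", bool((curv_ll < -1e-12).any()))
```

Even in this one-example case, squared error bends the wrong way for strongly negative weights, while log loss curves upward everywhere on the grid.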
- The Sigmoid Function and What It Outputs
Before understanding the cost function, you need to understand what logistic regression actually predicts.
The sigmoid function takes any real number and squashes it into a value between zero and one:
sigmoid(z) = 1 / (1 + e^(-z))
Where z is the linear combination of inputs and weights: z = w1x1 + w2x2 + … + b
The output is interpreted as the probability that the input belongs to the positive class. An output of 0.9 means the model is 90 percent confident the example is class 1. An output of 0.1 means it is 90 percent confident it is class 0.
The cost function needs to evaluate how good these probability predictions are, not just whether the final classification was right or wrong.
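As a small illustration, the sigmoid and the linear combination can be computed for a single example like this. The weights, bias, and input values are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

weights = np.array([0.8, -0.4])    # hypothetical weights w1, w2
bias = 0.1                         # hypothetical bias b
x = np.array([2.0, 1.0])           # one input example x1, x2

z = np.dot(weights, x) + bias      # 0.8*2 - 0.4*1 + 0.1 = 1.3
p = sigmoid(z)                     # probability of class 1
print(f"z = {z:.2f}, P(class 1) = {p:.3f}")
```

A z of 0 maps to exactly 0.5, positive z values map above 0.5, and negative values map below it.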
- Why Probability Calibration Matters
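A model that predicts 0.51 for an example that is clearly class 1 and a model that predicts 0.99 for the same example both classify it correctly at a 0.5 threshold. But the first is barely more confident than a coin flip, while the second commits to the right answer. Calibration means that predicted probabilities match how often the event actually occurs, so a loss that scores the probability itself, not just the final label, is what pushes a model toward calibrated outputs.
The cost function needs to distinguish between these two cases, rewarding the confident correct prediction and penalizing the hesitant one. Squared error does this only weakly, because its penalty for even a completely wrong prediction is capped at 1. Log loss does it well, because its penalty grows without bound as a prediction becomes confidently wrong.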
Understanding Log Loss Step by Step
- The Intuition Behind Logarithmic Penalty
Log loss uses the logarithm to create an asymmetric penalty that grows rapidly as confident wrong predictions are made.
Think of it this way. If the true label is 1 and your model predicts 0.99, you were right and the loss is tiny. If your model predicts 0.5, you were uncertain and the loss is moderate. If your model predicts 0.01, you were confidently wrong and the loss is enormous.
The logarithm naturally produces this behavior. As the predicted probability of the correct class approaches zero, the negative log of that probability approaches infinity. As it approaches one, the loss approaches zero.
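The three cases above can be checked numerically. This is a minimal sketch that assumes the natural logarithm, which is the usual convention for log loss.

```python
import math

# True label is 1, so the loss is the negative natural log of the
# probability the model assigned to class 1.
losses = {p: -math.log(p) for p in (0.99, 0.5, 0.01)}

for p, loss in losses.items():
    print(f"p = {p:<4}  loss = {loss:.3f}")
```

The loss climbs from about 0.010 to about 0.693 to about 4.605, so the confidently wrong prediction costs several hundred times more than the confidently right one.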
- The Log Loss Formula for One Example
For a single training example with true label y (either 0 or 1) and predicted probability p:
If y = 1: Loss = -log(p)
If y = 0: Loss = -log(1 - p)
These two cases can be combined into one elegant formula:
Loss = -[y * log(p) + (1 - y) * log(1 - p)]
When y = 1, the second term drops out and you get -log(p), penalizing low predicted probability for the positive class.
When y = 0, the first term drops out and you get -log(1 - p), penalizing high predicted probability when the true label is negative.
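A short check that the combined formula reduces to each branch. The helper name log_loss_single is made up for this illustration, and p is assumed to lie strictly between 0 and 1.

```python
import math

def log_loss_single(y, p):
    """Combined log loss for one example; y is 0 or 1, p is P(class 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = 0.8
print(log_loss_single(1, p), -math.log(p))        # y = 1 branch: same value
print(log_loss_single(0, p), -math.log(1 - p))    # y = 0 branch: same value
```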
- The Full Cost Function Across All Training Examples
The cost function averages the log loss across all m training examples:
J(w, b) = -(1/m) * sum[ y*log(p) + (1-y)*log(1-p) ]
This average gives a single number that summarizes how wrong the model is across the entire training set. When this number is large, the model is making many bad predictions. When it approaches zero, the model is making accurate, well-calibrated probability estimates.
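To make the averaging concrete, here is a sketch that evaluates the formula on five made-up predictions and cross-checks the result against scikit-learn's log_loss. The labels and probabilities are invented for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])               # true labels
p = np.array([0.9, 0.2, 0.6, 0.3, 0.1])     # made-up predicted P(class 1)

# J = -(1/m) * sum[ y*log(p) + (1-y)*log(1-p) ]
cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(f"manual cost:  {cost:.4f}")
print(f"sklearn cost: {log_loss(y, p):.4f}")
```

Most of the total comes from the fourth example, where the model gave only 0.3 to a true positive.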
- Visualizing What the Cost Function Looks Like
For the case where y = 1, the loss curve starts at zero when p = 1 and rises steeply as p approaches zero. Predicting 0.5 when the answer is 1 gives a loss of about 0.69. Predicting 0.1 when the answer is 1 gives a loss of about 2.3.
For the case where y = 0, the curve is mirrored. Predicting 0.5 when the answer is 0 gives the same loss of 0.69. Predicting 0.9 when the answer is 0 gives a loss of about 2.3.
The asymmetry of the logarithm is what makes this cost function so well-suited to probability estimation.
Log loss, also known as cross-entropy loss, is not an arbitrary error function. It emerges directly from applying maximum likelihood estimation to the Bernoulli distribution. When a binary classification model predicts probabilities, the statistically optimal parameters are the ones that maximize the likelihood of the observed training labels. Mathematically, this optimization problem is equivalent to minimizing log loss, which is why it is such a principled and widely used objective function in modern machine learning and deep learning systems.
Maximum Likelihood Estimation: The Theory Behind Log Loss
- What Maximum Likelihood Estimation Means
Maximum likelihood estimation is a principled approach to finding model parameters. It asks: given the training data I have observed, what parameter values make that data most probable under my model?
For logistic regression, the model assumes each training example is generated independently by a Bernoulli process where the probability of class 1 is given by the sigmoid function applied to the linear combination of features.
- From Likelihood to Log Likelihood
The likelihood of the entire training set is the product of individual example likelihoods. Because probabilities are between zero and one, multiplying many of them together produces very small numbers that cause numerical underflow in computers.
Taking the logarithm of the likelihood converts products into sums, which are numerically stable and easier to work with. Since the logarithm is a monotonically increasing function, maximizing the log likelihood is equivalent to maximizing the likelihood.
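The underflow problem is easy to reproduce. The sketch below uses 2,000 made-up per-example likelihoods drawn uniformly between 0.4 and 0.9; the sample size and range are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2,000 made-up per-example likelihoods, each between 0.4 and 0.9.
likelihoods = rng.uniform(0.4, 0.9, size=2000)

product = np.prod(likelihoods)            # too small for float64, becomes 0.0
log_sum = np.sum(np.log(likelihoods))     # a finite negative number

print("product of likelihoods:", product)
print("sum of log likelihoods:", log_sum)
```

Once the product has collapsed to zero there is nothing left to optimize, while the sum of logs still carries the full information.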
- From Log Likelihood to Cost Function
Maximizing the log likelihood is the same as minimizing the negative log likelihood. Gradient descent minimizes functions, so the cost function is defined as the negative average log likelihood across all training examples.
This is exactly the log loss formula. The connection between maximum likelihood estimation and log loss means that minimizing log loss is not just a computational trick. It is the statistically principled approach to finding the best parameters for a probabilistic classification model.
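The equivalence can be verified on a handful of examples. This sketch uses four invented labels and predictions, builds the Bernoulli likelihood directly, and compares its negative average log with the log loss formula.

```python
import numpy as np

y = np.array([1, 0, 1, 0])
p = np.array([0.7, 0.2, 0.9, 0.4])    # made-up predicted P(class 1)

# Bernoulli likelihood of each observed label: p if y = 1, (1 - p) if y = 0.
per_example = np.where(y == 1, p, 1 - p)
likelihood = np.prod(per_example)

neg_avg_log_likelihood = -np.log(likelihood) / len(y)
log_loss_value = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(neg_avg_log_likelihood, log_loss_value)    # the two values agree
```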
Gradient Descent: How the Cost Function Drives Training
- The Goal of Gradient Descent
Gradient descent is an optimization algorithm that finds the parameter values minimizing the cost function by iteratively taking small steps in the direction that reduces the cost most steeply.
Imagine standing on a hilly landscape and wanting to reach the lowest point. You look around, identify the direction that goes downhill most steeply, take a step in that direction, and repeat. That is gradient descent.
For logistic regression, the landscape is the cost function surface defined over all possible weight values. The lowest point is the set of weights that minimizes prediction error across the training set.
- Computing the Gradient
The gradient is the vector of partial derivatives of the cost function with respect to each weight. It points in the direction of steepest increase in the cost.
For logistic regression, the gradient with respect to weight wj is:
dJ/dwj = (1/m) * sum[ (p - y) * xj ]
The gradient with respect to the bias has the same form without the feature factor:
dJ/db = (1/m) * sum[ (p - y) ]
Where p is the predicted probability, y is the true label, and xj is the value of feature j for each training example.
This formula has a beautiful interpretation: the gradient is the average prediction error (p - y) weighted by the feature values. When the model predicts correctly, (p - y) is small and the gradient contribution is small. When the model is confidently wrong, (p - y) is large and drives a larger weight update.
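A common sanity check for a gradient formula is to compare it with a finite-difference estimate. The sketch below does that on synthetic data; the data, the starting weights, and the step size eps are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(X, y, w, b):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(X, y, w, b):
    errors = sigmoid(X @ w + b) - y       # (p - y) for every example
    return X.T @ errors / len(y)          # dJ/dw, one entry per feature

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))              # synthetic features
y = (rng.uniform(size=50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.2

# Central finite differences as an independent estimate of dJ/dw.
eps = 1e-6
basis = np.eye(3)
numeric = np.array([
    (cost(X, y, w + eps * basis[j], b) - cost(X, y, w - eps * basis[j], b)) / (2 * eps)
    for j in range(3)
])

max_gap = np.max(np.abs(numeric - analytic_gradient(X, y, w, b)))
print("largest difference:", max_gap)
```

If the two estimates disagree by more than a tiny amount, the gradient code (or the cost code) has a bug.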
- The Weight Update Rule
At each training step, weights are updated by subtracting a fraction of the gradient:
w = w - learning_rate * dJ/dw
b = b - learning_rate * dJ/db
The learning rate controls how large each step is. Too large and the updates overshoot the minimum, causing training to diverge. Too small and training is unnecessarily slow.
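The effect of a learning rate that is too small can be seen by running the same number of steps with two different rates. This sketch uses synthetic data and two arbitrary rates (0.001 and 0.5); it does not try to demonstrate divergence, which depends heavily on feature scaling.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def final_cost(X, y, learning_rate, steps=200):
    """Run plain batch gradient descent and return the final log loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        errors = sigmoid(X @ w + b) - y
        w -= learning_rate * (X.T @ errors) / len(y)
        b -= learning_rate * errors.mean()
    p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))             # synthetic, already on a unit scale
y = (X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200) > 0).astype(float)

cost_small = final_cost(X, y, learning_rate=0.001)
cost_moderate = final_cost(X, y, learning_rate=0.5)
print(f"lr = 0.001: cost = {cost_small:.4f}")
print(f"lr = 0.5:   cost = {cost_moderate:.4f}")
```

After 200 steps the tiny rate has barely moved away from the starting cost of about 0.693, while the moderate rate has made far more progress.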
- Batch, Stochastic, and Mini-Batch Gradient Descent
Batch gradient descent computes the gradient using the entire training set at each step. It is accurate but slow on large datasets because every update requires processing all examples.
Stochastic gradient descent computes the gradient using a single randomly selected example at each step. It is fast and noisy, making erratic progress but often finding good solutions quickly.
Mini-batch gradient descent computes the gradient on a small random subset of examples at each step, balancing the accuracy of batch gradient descent with the speed of stochastic gradient descent. It is the standard approach in practice.
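The three variants differ only in how many examples feed each update, which a single batch_size parameter can express. The function below is a minimal sketch on synthetic data; the function name and default values are assumptions rather than a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def minibatch_gd(X, y, learning_rate=0.1, epochs=20, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        order = rng.permutation(len(y))               # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            errors = sigmoid(X[idx] @ w + b) - y[idx]
            w -= learning_rate * (X[idx].T @ errors) / len(idx)
            b -= learning_rate * errors.mean()
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=500) > 0).astype(float)

w, b = minibatch_gd(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"training accuracy: {accuracy:.3f}")
```

Setting batch_size to len(y) gives batch gradient descent, and setting it to 1 gives stochastic gradient descent.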
The gradient update rule for logistic regression using log loss has the same mathematical form as the update rule for linear regression using squared error loss. In both cases, the weights are adjusted using the difference between the prediction and the true target multiplied by the input features. Even though logistic regression uses a sigmoid activation and predicts probabilities while linear regression predicts continuous values directly, the resulting gradient equations end up looking remarkably similar due to how the derivatives simplify mathematically.
Regularization: Preventing Overfitting Through the Cost Function
- What Overfitting Looks Like in Logistic Regression
An unregularized logistic regression model trained on a small dataset can learn to assign extreme probabilities close to zero or one to training examples by making weights very large. This produces near-perfect training accuracy but terrible generalization to new data.
Large weights mean the model is extremely sensitive to small changes in input features, creating a decision boundary that is overly complex and specific to the training data rather than capturing the true underlying pattern.
- L2 Regularization (Ridge)
L2 regularization adds a penalty term to the cost function proportional to the sum of squared weights:
J_regularized = J + (lambda / (2 * m)) * sum(wj^2)
The lambda parameter controls regularization strength. Higher lambda means stronger penalty for large weights and simpler decision boundaries. Lower lambda allows larger weights and more complex boundaries.
L2 regularization shrinks all weights toward zero without eliminating any completely, producing a model that uses all features but with reduced sensitivity to each one.
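In code, the L2 penalty adds one term to the cost and one term to the weight gradient. The sketch below follows the formula above and leaves the bias unpenalized, since the penalty sums over the weights wj only; the function name, data, and lambda value are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def l2_cost_and_gradients(X, y, w, b, lam):
    m = len(y)
    p = sigmoid(X @ w + b)
    p_safe = np.clip(p, 1e-15, 1 - 1e-15)             # avoid log(0)
    data_cost = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
    cost = data_cost + (lam / (2 * m)) * np.sum(w ** 2)
    dw = X.T @ (p - y) / m + (lam / m) * w            # penalty adds (lambda/m) * w
    db = np.mean(p - y)                               # bias is not penalized
    return cost, dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)
w, b = np.array([1.5, -2.0, 0.5]), 0.1

cost_plain, dw_plain, _ = l2_cost_and_gradients(X, y, w, b, lam=0.0)
cost_reg, dw_reg, _ = l2_cost_and_gradients(X, y, w, b, lam=10.0)
print(f"unregularized cost: {cost_plain:.4f}")
print(f"regularized cost:   {cost_reg:.4f}")
```

The extra gradient term is proportional to each weight, so every update pulls large weights back toward zero harder than small ones.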
- L1 Regularization (Lasso)
L1 regularization adds a penalty proportional to the sum of absolute weight values:
J_regularized = J + (lambda / m) * sum(|wj|)
L1 regularization has the important property of driving some weights exactly to zero, effectively performing automatic feature selection by eliminating features that do not contribute meaningfully to prediction.
When you have many features and suspect only a subset are truly relevant, L1 regularization is often preferable to L2.
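The sparsity difference can be observed by fitting both penalties on data where most features are noise. This sketch uses synthetic data from make_classification and the liblinear solver, which supports both penalties. The dataset settings and C value are arbitrary, and the way the penalty is selected has changed across scikit-learn releases, so treat the constructor arguments as an assumption to check against your installed version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# Small C means strong regularization for both models.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

zeros_l1 = int(np.sum(l1_model.coef_ == 0))
zeros_l2 = int(np.sum(l2_model.coef_ == 0))
print("weights exactly zero with L1:", zeros_l1)
print("weights exactly zero with L2:", zeros_l2)
```

L1 typically zeroes out many of the noise features here, while L2 only shrinks them.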
- Choosing the Regularization Strength
Lambda is a hyperparameter tuned through cross-validation rather than learned during training. The typical approach tests several lambda values, evaluates validation performance for each, and selects the value that produces the best generalization.
In scikit-learn, the regularization parameter is expressed as C = 1/lambda, meaning larger C values correspond to weaker regularization and smaller C values correspond to stronger regularization.
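A cross-validated search over C might look like the following sketch. It uses GridSearchCV with log loss as the selection metric on synthetic data; the candidate grid and dataset settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(
    n_samples=400, n_features=10, n_clusters_per_class=1, random_state=0
)

candidate_C = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": candidate_C},
    scoring="neg_log_loss",     # sklearn maximizes scores, so log loss is negated
    cv=5,
)
search.fit(X, y)

best_C = search.best_params_["C"]
print("best C:", best_C, "which corresponds to lambda of about", 1 / best_C)
print("cross-validated log loss:", -search.best_score_)
```

Selecting on log loss rather than accuracy keeps the tuning criterion aligned with the cost function the model is trained on.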
Implementing Logistic Regression Cost Function in Python
- Computing Log Loss From Scratch
import numpy as np

def sigmoid(z):
    # Map any real number to the interval (0, 1)
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, weights, bias):
    # Average log loss over all m training examples
    m = len(y)
    z = np.dot(X, weights) + bias
    predictions = sigmoid(z)
    # Clip predictions to avoid log(0)
    predictions = np.clip(predictions, 1e-15, 1 - 1e-15)
    cost = -(1/m) * np.sum(
        y * np.log(predictions) +
        (1 - y) * np.log(1 - predictions)
    )
    return cost

def compute_gradients(X, y, weights, bias):
    # Gradients of the cost with respect to the weights and the bias
    m = len(y)
    z = np.dot(X, weights) + bias
    predictions = sigmoid(z)
    errors = predictions - y
    dw = (1/m) * np.dot(X.T, errors)
    db = (1/m) * np.sum(errors)
    return dw, db
- Training With Gradient Descent
def train_logistic_regression(X, y, learning_rate=0.01, epochs=1000):
    # Start from all-zero parameters
    weights = np.zeros(X.shape[1])
    bias = 0.0
    cost_history = []
    for epoch in range(epochs):
        # One batch gradient descent step
        dw, db = compute_gradients(X, y, weights, bias)
        weights -= learning_rate * dw
        bias -= learning_rate * db
        # Record and report the cost every 100 epochs
        if epoch % 100 == 0:
            cost = compute_cost(X, y, weights, bias)
            cost_history.append(cost)
            print(f"Epoch {epoch}: Cost = {cost:.4f}")
    return weights, bias, cost_history
- Using Sklearn Logistic Regression
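from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Placeholder data so the snippet runs; substitute your own X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
# predict_proba returns one column of probabilities per class
y_pred_proba = model.predict_proba(X_test)
loss = log_loss(y_test, y_pred_proba)
print(f"Log Loss on test set: {loss:.4f}")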
Final Thoughts
The cost function is not just a technical detail of logistic regression. It is the definition of what the model is trying to achieve.
Log loss captures exactly the right objective for a probabilistic classifier: be confident when you are right, be uncertain when the evidence is mixed, and never be confidently wrong. Every training step is the model reading that feedback and adjusting to do better on the next attempt.
Understanding why logistic regression uses log loss instead of squared error, how maximum likelihood estimation connects to the cost function mathematically, and how gradient descent navigates the cost surface to find good parameters gives you the foundation to diagnose training problems, choose regularization strength wisely, and understand what your model is actually optimizing when you call model.fit().
The math is simpler than it looks. The intuition is more powerful than the equations.
FAQs
1. Why can we not use mean squared error for logistic regression?
Squared error on sigmoid outputs creates a non-convex cost function that can have multiple local minima where gradient descent may get stuck. Log loss produces a convex surface, so any minimum gradient descent reaches is a global one, which makes optimization reliable.
2. What is the difference between log loss and binary cross-entropy?
They are the same formula with different names. Log loss is the machine learning term, and binary cross-entropy is the deep learning and information theory term. Both compute the negative average log likelihood of the correct class label.
3. How do I know if my log loss value is good?
A model always predicting 0.5 produces log loss of 0.693. Any value meaningfully below that means the model is learning something useful. Compare against a baseline predicting the class prior probability for context.
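Both baselines are quick to compute. The sketch below uses a made-up label vector with 20 percent positives to show that the class-prior baseline can sit well below 0.693 on imbalanced data.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1] * 20 + [0] * 80)     # made-up labels, 20% positive

# Baseline 1: always predict 0.5.
coin_flip = log_loss(y, np.full(len(y), 0.5))
# Baseline 2: always predict the fraction of positives in the data.
prior = log_loss(y, np.full(len(y), y.mean()))

print(f"always 0.5:        {coin_flip:.4f}")
print(f"always class prior: {prior:.4f}")
```

On this example the class-prior baseline is about 0.500, so a model scoring 0.65 would beat the coin flip yet still be worse than ignoring the features entirely.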
4. What does regularization do to the cost function?
It adds a penalty proportional to weight size, increasing cost for large weights and pushing gradient descent toward simpler parameters that generalize better. Lambda controls the trade-off between fit and simplicity.
5. How does sklearn handle the cost function internally?
Sklearn minimizes regularized log loss using efficient algorithms like L-BFGS and saga rather than basic gradient descent. The C parameter sets regularization strength as the inverse of lambda, with L2 applied by default.


