###### 1. Introduction

Despite the “regression” in its name, Logistic Regression is used for classification problems in which the dependent (target) variable has two possible outcomes. The model can also be extended to multiclass classification problems, which we discuss at the end of this article.

###### 2. Key Terms

The Logistic Regression algorithm uses odds to model probabilities:

$odds\left(p\right)=\frac{p}{1-p}\phantom{\rule{1em}{0ex}}\left(1\right)$

As you can see from formula (1), $odds(p) \in [0; \infty)$ given that $p \in [0; 1)$. However, we want our model to take any real number from $(-\infty; \infty)$ (as our features can have any values) and output a number in the range $(0; 1)$ that describes a probability. The logistic function (also called the sigmoid) has all of these traits. It can be derived as the inverse of the log-odds function, which is also called the logit.

$logit\left(p\right)=log\left(odds\left(p\right)\right)=log\left(\frac{p}{1-p}\right)\phantom{\rule{1em}{0ex}}\left(2\right)$

We can achieve the required properties by reflecting the logit function about the line $y = x$. This transformation can be performed by calculating the inverse of expression (2), which is called the logistic function:

$logistic\left(y\right)={logit}^{-1}\left(y\right)$

To calculate it, we solve the equation $logit(p) = y$ for $p$:

$logit\left(p\right)=y\to log\left(\frac{p}{1-p}\right)=y\to \frac{p}{1-p}={e}^{y}\to p={e}^{y}\left(1-p\right)\to p\left({e}^{y}+1\right)={e}^{y}\to p=\frac{{e}^{y}}{{e}^{y}+1}=\frac{1}{1+{e}^{-y}}$
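The derivation above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `logit` and `logistic` are chosen here for illustration and simply mirror the formulas in the text.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p taken from the open interval (0, 1)."""
    return np.log(p / (1.0 - p))

def logistic(y):
    """Sigmoid: maps any real number y to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

# Illustrative probabilities (arbitrary values)
p = np.array([0.1, 0.25, 0.5, 0.9])

y = logit(p)             # real-valued log-odds
recovered = logistic(y)  # applying the inverse should give p back

print(y)
print(recovered)
```

Applying `logistic` to the output of `logit` returns the original probabilities (up to floating-point error), which is what being inverse functions means.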

Thus, the expression for the logistic function (sigmoid function) is the following:

$logistic\left(y\right)=\frac{1}{1+{e}^{-y}}\phantom{\rule{1em}{0ex}}\left(3\right)$

###### 3. Model Training

Logistic Regression represents the logit function as a linear combination of the predictors plus an intercept:

$logit\left(p\right)={\theta }_{0}+{\theta }_{1}{X}_{1}+{\theta }_{2}{X}_{2}+...+{\theta }_{k}{X}_{k},\phantom{\rule{1em}{0ex}}\left(4\right)$

where

• $X_i$ is the value of the $i$-th predictor
• $\theta_i$ is the fitted coefficient ($\theta_0$ is the intercept)

The coefficient $\theta_i$ indicates the effect of a one-unit change in the predictor $X_i$ on the log-odds of “success”.
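Because the coefficients act on the log-odds, a one-unit increase in a predictor multiplies the odds by $e^{\theta_i}$. The sketch below illustrates this with a single predictor; the coefficient values are hypothetical and chosen only for the example.

```python
import math

# Hypothetical coefficients for a model with one predictor
theta_0 = -1.0  # intercept
theta_1 = 0.7   # coefficient of the predictor X_1

def odds(x1):
    """Odds of "success" implied by logit(p) = theta_0 + theta_1 * x1."""
    return math.exp(theta_0 + theta_1 * x1)

# Increase the predictor by one unit and compare the odds
ratio = odds(3.0) / odds(2.0)

print(ratio, math.exp(theta_1))  # the two values agree
```

The ratio does not depend on the starting value of the predictor, only on the size of the change.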

As our training data contains more than one observation, we will denote by $x$ the column vector of the predictors’ values for a particular observation (we will also add 1 as its first element to account for the intercept term) and by $\theta$ the column vector of coefficients $\theta_0, \ldots, \theta_k$:

$x=\left[\begin{array}{c}1\\ {X}_{1}\\ {X}_{2}\\ ...\\ {X}_{k}\end{array}\right];\phantom{\rule{2em}{0ex}}\theta =\left[\begin{array}{c}{\theta }_{0}\\ {\theta }_{1}\\ {\theta }_{2}\\ ...\\ {\theta }_{k}\end{array}\right]$

Using this notation we can rewrite the expression (4) as follows:

$logit\left(p\right)={\theta }^{T}x\phantom{\rule{1em}{0ex}}\left(5\right)$

If we plug $y = \theta^T x$ into formula (3), we get an expression for the probability that the random variable $Y$ (which represents the output) equals 1, given the observed data $x$ and the model parameters $\theta$:

$Pr\left(Y=1|x,\theta \right)=\frac{1}{1+{e}^{-{\theta }^{T}x}}\phantom{\rule{1em}{0ex}}\left(6\right)$

As we are dealing with a two-class problem, the probability $Pr(Y=0|x,\theta)$ can be expressed as follows:

$Pr\left(Y=0|x,\theta \right)=1-Pr\left(Y=1|x,\theta \right)\phantom{\rule{1em}{0ex}}\left(7\right)$
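Expressions (6) and (7) can be evaluated directly for a single observation. This is a minimal sketch assuming NumPy; the values of `theta` and `x` are hypothetical.

```python
import numpy as np

# Hypothetical parameters: intercept followed by two coefficients
theta = np.array([-1.0, 0.5, 2.0])

# One observation with a leading 1 that accounts for the intercept
x = np.array([1.0, 2.0, 0.0])

z = theta @ x                      # theta^T x = -1 + 0.5 * 2 + 2 * 0 = 0
p_class1 = 1.0 / (1.0 + np.exp(-z))  # expression (6)
p_class0 = 1.0 - p_class1            # expression (7)

print(p_class1, p_class0)
```

Here the linear combination equals zero, so both classes are equally likely.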

We can combine probabilities used in expressions (6) and (7) into one formula:

$Pr\left(Y|x,\theta \right)=Pr\left(Y=1|x,\theta {\right)}^{Y}\left(1-Pr\left(Y=1|x,\theta \right){\right)}^{1-Y}\phantom{\rule{1em}{0ex}}\left(8\right)$

One can notice that formula (8) reduces to expression (6) when $Y = 1$ and to expression (7) when $Y = 0$.

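Our goal is to determine the coefficients $\theta = \theta_0, \ldots, \theta_k$ from formula (4). The intuition here is that we want these coefficients to maximize the probability of observing the correct label for every training observation. Denoting the $n$ training observations by $(x^{(j)}, Y^{(j)})$ and assuming they are independent, this can be written as maximizing the likelihood function:

$L\left(\theta |x\right)=\prod _{j=1}^{n}Pr\left({Y}^{\left(j\right)}|{x}^{\left(j\right)},\theta \right)$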

This expression can be maximized through various optimization techniques, such as the Newton-Raphson algorithm or gradient descent (which is usually applied to the negative log-likelihood).
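To make the training step concrete, here is a minimal sketch of plain batch gradient descent on the average negative log-likelihood, assuming NumPy. The synthetic data, the learning rate, the number of iterations, and all function names are assumptions made for this illustration; production libraries use more sophisticated solvers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """Negative log-likelihood of labels y under the parameters theta."""
    p = sigmoid(X @ theta)
    eps = 1e-12  # keeps log() away from zero
    return -np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def fit_gradient_descent(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the average negative log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Gradient of the average negative log-likelihood: X^T (p - y) / n
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

# Synthetic data generated from known (hypothetical) coefficients
rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])  # leading column of ones = intercept
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ true_theta)).astype(float)

theta_hat = fit_gradient_descent(X, y)
print(theta_hat)
```

The fitted coefficients should land reasonably close to the values used to generate the data, with the remaining gap due to sampling noise.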

###### 4. Making Predictions

Now that we have the vector of model parameters $\theta$, we can calculate the predicted value of the logit function for any new observation $x$ (we will use the hat symbol for predicted values):

$logit\left(p\right)=\stackrel{^}{y}={\theta }^{T}x$

Then we plug this value into the logistic function in order to determine the probability of the data belonging to Class 1 (True, “Yes”, etc.):

$\stackrel{^}{p}=logistic\left(\stackrel{^}{y}\right)=\frac{1}{1+{e}^{-\stackrel{^}{y}}}$

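The last step is to choose a threshold $T \in [0; 1]$ that is compared with the predicted probability of Class 1, $\stackrel{^}{p}$, in order to make a prediction:

$\stackrel{^}{Y}=\left\{\begin{array}{ll}1,& \text{if }\stackrel{^}{p}\ge T\\ 0,& \text{otherwise}\end{array}\right.$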
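The prediction steps can be sketched as a single function, assuming NumPy. The name `predict`, the parameter values, and the convention that a probability exactly equal to the threshold is assigned to Class 1 are assumptions made for this example.

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Return 0/1 labels: 1 where the predicted probability reaches the threshold."""
    probs = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probs >= threshold).astype(int)

# Hypothetical parameters: zero intercept, one coefficient equal to 1
theta = np.array([0.0, 1.0])

# Three observations, each with a leading 1 for the intercept
X = np.array([[1.0, -2.0],
              [1.0,  0.0],
              [1.0,  2.0]])

labels_default = predict(theta, X)                # threshold 0.5
labels_strict = predict(theta, X, threshold=0.9)  # stricter threshold

print(labels_default, labels_strict)
```

Raising the threshold makes the model more conservative about predicting Class 1: the third observation has a probability of roughly 0.88, so it is labelled 1 at the default threshold but 0 at a threshold of 0.9.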

By default the threshold is set to 0.5, but you can adjust it based on your needs (usually based on the trade-off between the True Positive Rate and the False Positive Rate).

###### 5. Regularization

Regularization means making the model less complex, which can allow it to generalize better (i.e. avoid overfitting) and perform better on new data.

As was mentioned above, the coefficients of logistic regression are usually fitted by maximizing the log-likelihood. As many optimization techniques are aimed at finding the minimum of a function we can redefine our goal as minimizing the negative log-likelihood:

$\stackrel{^}{\theta }=\underset{\theta }{arg\,min}\left[-log\left(L\left(\theta |x\right)\right)\right]$

We can penalize the model for having coefficients that are far from zero by adding a regularization term $R(\theta)$ multiplied by a parameter $\lambda$, which is called the regularization strength:

$\stackrel{^}{\theta }=\underset{\theta }{arg\,min}\left[-log\left(L\left(\theta |x\right)\right)+\lambda R\left(\theta \right)\right]$

The two most popular regularizations are L1 and L2:

$L1:\phantom{\rule{1em}{0ex}}R\left(\theta \right)=\sum _{i=0}^{k}|{\theta }_{i}|$

$L2:\phantom{\rule{1em}{0ex}}R\left(\theta \right)=\frac{1}{2}\sum _{i=0}^{k}{\theta }_{i}^{2}$

The factor $\frac{1}{2}$ in L2 regularization is used to simplify the derivative calculations. Through $\lambda$ we can control the impact of the regularization term. Higher values of $\lambda$ lead to smaller coefficients (more regularization), but values that are too high can lead to underfitting.

In the scikit-learn package, L2 regularization is used by default. Instead of the regularization strength $\lambda$, its inverse is used: the C parameter (the default is C=1.0). In contrast to $\lambda$, smaller values of C mean stronger regularization and lead to smaller coefficients, and values that are too small can lead to underfitting.

It is important to normalize the data before fitting a regularized logistic regression, so that all features are on a comparable scale and the regularization term penalizes each coefficient in a similar manner.
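The effect of the C parameter can be observed with a short experiment. This is a sketch under stated assumptions: the data set is synthetic (generated with `make_classification`), the three values of C are arbitrary, and features are standardized with `StandardScaler` as one common way to normalize them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Scale the features first, then fit logistic regression with the default L2 penalty
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=1000))
    model.fit(X, y)
    coef = model.named_steps["logisticregression"].coef_
    coef_norms[C] = float(np.linalg.norm(coef))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

The smallest C (strongest regularization) should produce the smallest coefficient norm, while large values of C approach the unregularized fit.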

###### 6. Logistic Regression For Multinomial Problems

Logistic regression can be generalized to handle problems with more than two possible outcomes. The most popular approach is called “One-vs-Rest” logistic regression where we split our multinomial problem with M classes into M binary classification problems (see Figure 5). In this case we generate different coefficients θ for each binary classification problem (basically we train M separate Logistic Regression models). When we have to classify a new observation, we calculate the probabilities of the data belonging to each class (which are the outputs of our models) and select the class that has the highest probability.
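The One-vs-Rest procedure described above can be sketched by hand with scikit-learn's binary `LogisticRegression`. The choice of the Iris data set (three classes) and the variable names are assumptions made for this illustration, and the accuracy is computed on the training data only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)  # M = 3 classes

# Train one binary model per class: "this class" (1) versus "the rest" (0)
models = []
for c in classes:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, (y == c).astype(int))
    models.append(clf)

# Column m holds the probability that each observation belongs to class m
scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Select the class whose binary model reports the highest probability
predictions = classes[np.argmax(scores, axis=1)]
accuracy = float(np.mean(predictions == y))

print(accuracy)
```

Note that the per-class probabilities come from separate models, so they do not need to sum to 1 for a given observation; only their ranking is used.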