What is Logistic Regression and how does it work?
Logistic Regression is comparable to linear regression and may be used to estimate the likelihood of a class or event. It is suitable when the dependent variable is categorical or binary; it is not used to predict continuous values such as age or size. Logistic Regression models the relationship between one binary dependent variable and one or more independent variables, which may be nominal, ordinal, interval, or ratio-level variables.
Difference between Logistic and Linear Regression
The key difference between Logistic and Linear Regression is that the outcome of Logistic Regression is ‘discrete’ whereas the outcome of Linear Regression is ‘continuous’.
To understand discrete and continuous outcomes, let’s look at an example of each. A model that predicts the price of a share is an example of a continuous outcome, whereas a model that predicts whether someone has cancer or not is an example of a discrete outcome.
Figure 1: The above figure represents the difference between Linear Regression and Logistic Regression through a graph.
(Reference: https://www.javatpoint.com/linear-regression-vs-logistic-regression-in-machine-learning)
Three main types of Logistic Regression
Binary Logistic Regression
In Binary Logistic Regression, there are only two possible values for the outcome. For example, if a student is taking an examination and you want to determine whether that student will ‘pass’ or ‘fail’, there are only two outcome values, i.e. ‘pass’ and ‘fail’. Hence, the student will either ‘pass’ or ‘fail’ the examination.
Multinomial Logistic Regression
Multinomial Logistic Regression involves three or more possible outcome values that are not ordered. Here, ‘ordered’ means that the values have a meaningful rank relative to each other. For example, if you are required to predict tomorrow’s weather, the possible outcomes are ‘sunny’, ‘rainy’, ‘cloudy’, ‘partially cloudy’, etc.
Ordinal Logistic Regression
Ordinal Logistic Regression involves three or more possible outcome values that are ordered. Here, ‘ordered’ means that the values have a meaningful rank relative to each other. For example, you are using a web application in which you are asked to rate a product from ‘1’ to ‘5’, where ‘1’ represents very poor and ‘5’ represents excellent.
Some of the cases in which Logistic Regression can be used are listed below:
· Predicting whether an email is spam or not.
· Estimating the probability of a heart attack from a person’s calorie and fat consumption.
· Predicting whether a person is likely to have diabetes from their sugar consumption and the calories they burn.
· Predicting whether a user of an application is likely to click on an advertisement link or not.
How does Logistic Regression work?
Logistic Regression uses a function known as the ‘sigmoid function’ or the ‘logistic function’. The sigmoid function maps any real-valued input to a value between 0 and 1, so it is used to turn the model’s output into a predicted probability.
Figure 2: The above figure represents a sigmoid function.
(Reference: https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148)
Mathematical equation
The mathematical equation of the sigmoid function is presented below:
F(z) = 1 / (1 + e^(−z))
In the above equation, z = w0 + w1·x1 + w2·x2 + ⋯ + wn·xn.
Here, w0, w1, w2, …, wn represent the regression coefficients of the model, which are estimated through Maximum Likelihood Estimation, and x1, x2, …, xn represent the features or independent variables. F(z) gives the probability of the binary outcome, and this probability is used to classify the given data point (x) into one of the two categories.
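The equation above can be sketched in a few lines of plain Python. This is only an illustration: the coefficients and the data point are hypothetical values (not fitted to any dataset), and the 0.5 cut-off used to pick a class is an assumed, commonly used default.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, features):
    """Compute F(z) with z = w0 + w1*x1 + ... + wn*xn.

    weights  -- [w0, w1, ..., wn], where w0 is the intercept
    features -- [x1, ..., xn]
    """
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


# Hypothetical coefficients and data point, for illustration only.
weights = [-1.0, 0.5, 0.25]
features = [2.0, 4.0]

probability = predict_probability(weights, features)  # z = -1 + 1 + 1 = 1
predicted_class = 1 if probability >= 0.5 else 0      # assumed 0.5 threshold

print(probability, predicted_class)
```

Note that math.exp() can overflow for very large negative z, so production code usually relies on a library implementation instead.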
Now, let’s develop a model that can predict heart disease using a Logistic Regression classifier. The data for predicting heart disease can be obtained from the link provided below:
https://www.kaggle.com/ronitf/heart-disease-uci
Loading the dataset using Pandas
First, let’s load the dataset using the ‘pandas’ library.
Figure 3: Importing the pandas library as pd.
Figure 4: Reading the CSV file named ‘heart.csv’.
Figure 5: Presenting the first five rows of the DataFrame using the head() function.
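The loading steps named in the captions above can be sketched as follows. So that the snippet runs without the Kaggle download, it reads a small in-memory stand-in instead of ‘heart.csv’; the three rows are made up, and the column names other than ‘target’ are assumptions about the dataset.

```python
import io

import pandas as pd

# Stand-in for 'heart.csv': three made-up rows, used only so that this
# sketch runs without the real file. Column names other than 'target'
# are assumed.
sample_csv = io.StringIO(
    "age,sex,cp,trestbps,chol,target\n"
    "63,1,3,145,233,1\n"
    "45,0,1,130,204,1\n"
    "57,1,0,140,192,0\n"
)

# With the real file this would be: heart_data = pd.read_csv("heart.csv")
heart_data = pd.read_csv(sample_csv)

# head() returns the first five rows (here, all three stand-in rows).
print(heart_data.head())
```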
Choosing the feature columns from the dataset
Here, the columns of a dataset are divided into two variables:
i. Dependent variable or Target variable
ii. Independent variable or Feature variable
In the heart disease dataset, the target variable is provided as the ‘target’ column and the remaining columns are used as ‘feature’ columns.
Figure 6: Defining the feature variables in feature_columns.
Figure 7: Storing feature variables in ‘x’ and target variable in ‘y’.
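The column selection described above can be sketched as follows. The small DataFrame is a made-up stand-in for the one read from ‘heart.csv’, and taking every column except ‘target’ as a feature is one possible way to build feature_columns, not necessarily the exact code behind the figures.

```python
import pandas as pd

# Tiny made-up frame standing in for the DataFrame read from 'heart.csv'.
heart_data = pd.DataFrame(
    {
        "age": [63, 45, 57, 52],
        "chol": [233, 204, 192, 212],
        "thalach": [150, 172, 148, 168],
        "target": [1, 1, 0, 0],
    }
)

# Every column except 'target' is treated as a feature.
feature_columns = [col for col in heart_data.columns if col != "target"]

x = heart_data[feature_columns]  # independent (feature) variables
y = heart_data["target"]         # dependent (target) variable

print(feature_columns)
```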
Splitting the data into training and test set
The dataset needs to be split into two parts, i.e. a training set and a testing set. Here, the dataset is split in a ratio of 70:30, which means that 70% of the data is used for training the model whereas 30% of the data is used for testing the model. Now, let’s dive into the code:
Figure 8: Splitting 70% of data into a training set and 30% of data into a testing set.
First, the ‘train_test_split’ function is imported from ‘sklearn.model_selection’. Then, the ‘train_test_split’ function is called with three arguments, i.e. features, target, and test_size. Furthermore, random_state can be set to seed the random shuffling of the records, so that the same split is obtained on every run.
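A minimal sketch of the 70:30 split is shown below. The ten records are made-up stand-ins for the heart disease features and target, and the random_state value of 42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-in data: ten records with two features each.
x = [[i, i * 2] for i in range(10)]
y = [0, 1] * 5

# test_size=0.3 keeps 30% of the records for testing; random_state fixes the
# shuffle so that the same split is produced on every run.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)

print(len(x_train), len(x_test))
```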
Developing a model and carrying out the prediction
Here, let’s dive straight into the code for developing the model and carrying out the prediction.
Figure 9: Developing a logistic regression model and carrying out the prediction.
First, the ‘LogisticRegression’ class is imported from ‘sklearn.linear_model’. Then, a Logistic Regression classifier object is instantiated as ‘logistic_reg’ using ‘LogisticRegression()’. Next, the ‘fit()’ function is used to fit the model, i.e. ‘logistic_reg’, to the training dataset. Finally, the prediction is performed on the testing set using the ‘predict()’ function and its outcome is stored in ‘y_prediction’.
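The steps described above can be sketched as follows. Synthetic data from make_classification stands in for the heart disease dataset, and max_iter=1000 is an assumed setting added to reduce the chance of convergence warnings; the walkthrough itself only mentions ‘LogisticRegression()’.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the heart disease features and target.
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

# max_iter is an assumption, not a setting taken from the article.
logistic_reg = LogisticRegression(max_iter=1000)
logistic_reg.fit(x_train, y_train)            # train on the 70% split
y_prediction = logistic_reg.predict(x_test)   # predict classes for the 30% split

print(y_prediction[:10])
```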
Confusion Matrix
The confusion matrix may be used to measure the performance of the model. First, let’s look at the following table, which shows the possible outcomes of a classification problem.
| | Predicted Value Positive (P) | Predicted Value Negative (N) |
| Actual Value Positive (P) | True Positive (TP) | False Negative (FN) |
| Actual Value Negative (N) | False Positive (FP) | True Negative (TN) |
Defining terminologies of above table
· Positive (P): The actual value is positive.
· Negative (N): The actual value is not positive.
· True Positive (TP): The actual value is positive and is predicted to be positive.
· False Negative (FN): The actual value is positive but is predicted to be negative.
· True Negative (TN): The actual value is negative and is predicted to be negative.
· False Positive (FP): The actual value is negative but is predicted to be positive.
Now, Classification Rate or Accuracy is determined from the following relation:
Accuracy = (TP+TN) / (TP+TN+FP+FN)
Likewise, Recall is determined from the following relation:
Recall = (TP) / (TP+FN)
Also, Precision is determined from the following relation:
Precision = (TP) / (TP+FP)
Similarly, F-measure can be calculated from the following relation:
F-measure = (2*Recall*Precision) / (Recall+Precision)
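The four relations above translate directly into plain Python. The counts used here are hypothetical and were chosen only to make the arithmetic easy to follow; the function name is also just illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, recall, precision and F-measure from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = (2 * recall * precision) / (recall + precision)
    return accuracy, recall, precision, f_measure


# Hypothetical counts, for illustration only.
accuracy, recall, precision, f_measure = classification_metrics(
    tp=40, tn=30, fp=10, fn=20
)

print(f"Accuracy:  {accuracy:.3f}")   # (40 + 30) / 100
print(f"Recall:    {recall:.3f}")     # 40 / (40 + 20)
print(f"Precision: {precision:.3f}")  # 40 / (40 + 10)
print(f"F-measure: {f_measure:.3f}")
```

This sketch does not guard against division by zero, which occurs when a model makes no positive predictions or the data contains no positive cases.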
Using Confusion Matrix to evaluate the model
Now, let’s have a look at the Python code for presenting the Confusion Matrix in order to measure the performance of the model.
Figure 10: Evaluating the performance of model using confusion matrix.
Here, first of all, the metrics module is imported from sklearn. Then, the ‘confusion_matrix()’ function is used to obtain a confusion matrix, which is stored in confusion_matrix_outcome. The actual target values of the test dataset and the predictions made by the trained model are passed as parameters to the ‘confusion_matrix()’ function.
After the values are passed to the ‘confusion_matrix()’ function, a 2 × 2 matrix is obtained as output, since the developed model is a binary classifier. In the current scenario, there are only two classes, 0 and 1. In the obtained matrix, the diagonal elements represent accurate predictions whereas the off-diagonal elements represent inaccurate predictions. Hence, 33 and 41 are accurate predictions whereas 11 and 6 are inaccurate predictions.
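The call described above can be sketched without the dataset by rebuilding label lists that reproduce the reported counts. The lists below are a reconstruction, not the original test data. Placing 11 as false positives and 6 as false negatives is an assumption, chosen because it agrees with the precision and recall figures reported later in the article.

```python
from sklearn import metrics

# Reconstructed labels (an assumption): 44 actual negatives followed by
# 47 actual positives.
y_test = [0] * 44 + [1] * 47

# Predictions arranged to give 33 TN, 11 FP, 6 FN and 41 TP.
y_prediction = [0] * 33 + [1] * 11 + [0] * 6 + [1] * 41

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
confusion_matrix_outcome = metrics.confusion_matrix(y_test, y_prediction)
tn, fp, fn, tp = confusion_matrix_outcome.ravel()

print(confusion_matrix_outcome)
```

Note that scikit-learn places the negative class (0) in the first row and column, so the layout differs from the table above, which lists the positive class first.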
Visualizing Confusion Matrix through Heatmap
A topic is often easier to understand when it is presented visually. Therefore, the confusion matrix of the model can be presented as a heatmap in order to provide a better overview of the model’s performance.
Figure 11: Generating a heatmap to present the outcome of Confusion Matrix.
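One way to draw such a heatmap is sketched below using matplotlib only; the plotting library behind the original figure is not stated, so this choice is an assumption. The matrix values are the counts reported above, with the off-diagonal placement (11 false positives, 6 false negatives) also assumed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np

# Counts reported in the article; rows are actual classes, columns predicted.
confusion_matrix_outcome = np.array([[33, 11], [6, 41]])

fig, ax = plt.subplots()
image = ax.imshow(confusion_matrix_outcome, cmap="Blues")
fig.colorbar(image, ax=ax)

# Write each count inside its cell.
for row in range(2):
    for col in range(2):
        ax.text(col, row, str(confusion_matrix_outcome[row, col]),
                ha="center", va="center")

ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xlabel("Predicted label")
ax.set_ylabel("Actual label")
ax.set_title("Confusion matrix")
fig.tight_layout()
# Call fig.savefig(...) or plt.show() here to view the result.
```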
Classification rate, precision and recall of the model
Now, the classification rate, precision and recall of the model can be calculated.
Here, two parameters are passed to the accuracy_score(), precision_score() and recall_score() functions, i.e. the actual target values of the test dataset and the predictions made by the trained model. Now, let’s interpret the results obtained after executing the accuracy_score(), precision_score() and recall_score() functions.
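The three calls can be sketched on labels rebuilt from the confusion matrix counts reported earlier (33, 41, 11 and 6). These lists are a reconstruction, not the original test data, and treating 11 as false positives and 6 as false negatives is an assumption that matches the percentages discussed in the following sections.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Reconstructed labels (an assumption): 33 TN, 11 FP, 6 FN and 41 TP.
y_test = [0] * 44 + [1] * 47
y_prediction = [0] * 33 + [1] * 11 + [0] * 6 + [1] * 41

accuracy = accuracy_score(y_test, y_prediction)     # (33 + 41) / 91
precision = precision_score(y_test, y_prediction)   # 41 / (41 + 11)
recall = recall_score(y_test, y_prediction)         # 41 / (41 + 6)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```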
Classification rate or accuracy
As per the obtained outcome, the accuracy of the developed model is almost 81%, which can be considered a good accuracy rate.
Precision
Likewise, precision refers to the proportion of positive class predictions that actually belong to the positive class. So, the developed logistic regression model has a precision of almost 79%.
Recall
Recall refers to the proportion of all actual positive cases in the dataset that are predicted as positive. Here, the developed model identifies the patients having heart disease in the test dataset 87% of the time.
The AUC-ROC curve is used for measuring the performance of classification models at various threshold settings. Here, ROC is a curve that plots the true positive rate against the false positive rate at each threshold, and AUC (the area under that curve) measures the degree of separability. So, a model with a higher AUC is better at predicting 0 as 0 and 1 as 1. Hence, for the developed logistic regression model, the higher the AUC, the better the model is at differentiating patients with and without heart disease.
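A minimal sketch of computing the ROC curve and AUC with scikit-learn is shown below. Synthetic data from make_classification stands in for the heart disease dataset, and max_iter=1000 is an assumed setting; the resulting AUC therefore says nothing about the article’s model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the heart disease features and target.
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

logistic_reg = LogisticRegression(max_iter=1000)  # max_iter is an assumption
logistic_reg.fit(x_train, y_train)

# ROC and AUC need scores rather than hard class labels, so the predicted
# probability of class 1 is used.
y_probability = logistic_reg.predict_proba(x_test)[:, 1]

# One (false positive rate, true positive rate) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_probability)
auc = roc_auc_score(y_test, y_probability)

print(f"AUC: {auc:.3f}")
```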
Get the code from GitHub
Complete code can be obtained from the GitHub link provided below: https://github.com/hrishavtandukar/Logistic-Regression
Advantages of Logistic Regression
The Logistic Regression algorithm is simple to implement and efficient to train. Hence, it does not require high computational power to train a model. Furthermore, Logistic Regression is less prone to over-fitting on a low-dimensional dataset that has an adequate number of training examples. It also tends to provide better results once the attributes that are unrelated to the output variable and the attributes that are correlated with each other are removed.
Similarly, Logistic Regression is efficient when the data has features that are linearly separable. Since this algorithm is simple and can be implemented quickly, it can also be considered as a baseline for measuring the performance of other algorithms. Also, Logistic Regression models can be easily updated to reflect new data. Stochastic gradient descent can be used for carrying out the update.
Disadvantages of Logistic Regression
The Logistic Regression algorithm tends to over-fit the training set on high-dimensional datasets. Hence, the developed model may not be able to predict accurate results on the test set. Another disadvantage of the Logistic Regression algorithm is that its decision boundary is linear, so it cannot solve non-linear problems unless the features are transformed first. Furthermore, Logistic Regression is highly reliant on the presentation of data. Hence, all the essential independent variables should be properly identified before implementing Logistic Regression. Additionally, it requires an adequately sized dataset with enough training examples for each of the categories that are required to be identified.
Conclusion
Hence, the Logistic Regression algorithm is a simple yet widely used technique for carrying out various predictions. The outcome of the Logistic Regression algorithm is discrete. Furthermore, due to its simplicity and low computational requirements, it is often considered a benchmark model for measuring the performance of other algorithms. Finally, the Logistic Regression algorithm works best when a large dataset is provided with adequate training examples for all the categories that are required to be identified.