Logistic Regression - Tell us "Yes" or "No"
Logistic regression is another regression model. It uses the logistic function to describe the relationship between a categorical dependent variable and one or more independent variables, and to estimate the probability of a binary response. The output variable in logistic regression is categorical rather than interval. You may have heard the term classification, one of the most common applications in data mining and automatic pattern recognition. Logistic regression can classify binary, ordinal or nominal target variables. This tutorial uses only a binary target variable ("Yes" or "No") as the classification example, since that case is the easiest for beginners to grasp.
Formula:
$$\ln\left(\frac{p}{1-p}\right) = w_0 + w_1x_1 + w_2x_2+...+w_nx_n=\sum_{i=0}^{n}{w_ix_i}=w^Tx, \quad x_0 = 1$$
The probability of "Yes", $p$, is obtained by rearranging the formula above. If the probability exceeds the threshold (generally 0.5), the data instance is classified as "Yes"; otherwise it is classified as "No".
$$p=\frac{1}{1+e^{-w^Tx}}$$
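To make the two formulas concrete, here is a minimal sketch that turns a weighted sum into a probability and then into a class. The weights and the data instance are made-up values for illustration only, and $x_0 = 1$ is assumed so that $w_0$ acts as the intercept.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and one data instance (illustrative values only).
w = [-1.0, 0.8, 0.5]   # w_0 (intercept), w_1, w_2
x = [1.0, 2.0, -1.0]   # x_0 = 1 pairs with the intercept w_0

z = sum(w_i * x_i for w_i, x_i in zip(w, x))   # w^T x, the log-odds
p = sigmoid(z)                                 # probability of "Yes"
label = "Yes" if p > 0.5 else "No"             # apply the 0.5 threshold

print("log-odds = {0:.2f}, p = {1:.3f}, class = {2}".format(z, p, label))
```

Taking $\ln(p/(1-p))$ of the result gives back the weighted sum, which is a quick way to check that the two formulas agree.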
The cost function of logistic regression is defined below, where $m$ is the number of training instances and $h_w(x)$ is the predicted probability $p$. As with other models, the objective is to minimize the cost function and thereby obtain the optimal weights, $w_i$.
$$J(w) = -\frac{1}{m}\left[\sum_{j=1}^{m}{y^{(j)}\log(h_w(x^{(j)})) + (1-y^{(j)})\log(1-h_w(x^{(j)}))}\right]$$
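The sketch below evaluates this cost in its standard cross-entropy form, where the second term uses $\log(1-h_w(x))$. The labels and predicted probabilities are made-up values, chosen to show that confident correct predictions give a low cost and confident wrong ones a high cost.

```python
import math

def cross_entropy_cost(y_true, h):
    """Average cross-entropy J(w) for labels in {0, 1} and predicted probabilities h."""
    m = len(y_true)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, h))
    return -total / m

# Hypothetical labels and two sets of predicted probabilities.
y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # high probability on the correct class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability on the wrong class

cost_right = cross_entropy_cost(y_true, mostly_right)
cost_wrong = cross_entropy_cost(y_true, mostly_wrong)
print("cost (right): {0:.4f}, cost (wrong): {1:.4f}".format(cost_right, cost_wrong))
```

Minimizing $J(w)$ therefore pushes the predicted probabilities toward the actual labels.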
From a machine learning perspective, logistic regression is computationally inexpensive, easy to implement and easy to interpret. However, it is prone to underfitting, which results in higher classification error.
Implementation in Python
1. Installation of Scikit-Learn
Refer to the previous articles to install the machine learning package.
2. Raw Data Collection & Data Visualization
Assume we have customer data that includes personal information, consumption amount and membership renewal (Download here). Our aim is to predict whether members will renew their membership based on this historical data (training data). When using a regression model for classification, pay close attention to the proportion of each class in the target variable. Biased samples (say, many "Yes" and few "No") can hurt model accuracy, because logistic regression tends to learn the "Yes" instances at the expense of the "No" instances. Fortunately, after exploring and visualizing the proportion of the target variable, renewal, in a pie chart, we find that the distribution is quite balanced, so no related preprocessing is required.
    import matplotlib.pyplot as plt

    # train_data is the pandas DataFrame that holds the training data
    renewal_prop = train_data['renewal'].value_counts()
    # value_counts() sorts by frequency, so derive the labels from its index
    labels = ["Yes" if v == "Y" else "No" for v in renewal_prop.index]
    plt.pie(renewal_prop, explode=(0, 0.1), labels=labels,
            colors=["lightskyblue", "lightcoral"], radius=1,
            autopct='%1.1f%%', shadow=True)
    plt.show()
Balanced proportion of the target variable: renewal
3. Data Preparation
If you have not installed Pandas, refer to this tutorial to install it. Once it is installed, create a new Python file named "log_reg.py" and import the Pandas package. First, read the raw data into a data frame with Pandas, one of the best-known Python packages. As the target variable is in the seventh column, its index is 6 (indexing starts at 0). We choose age, gender, salary and consumption amount (indices 2, 3, 4, 5) as input variables. To turn a DataFrame into a NumPy array, use the syntax "anyDataFrame.values", as in the code below:
    import pandas as pd
    from sklearn import preprocessing

    # train_data is assumed to be loaded already from the training CSV with pd.read_csv
    train_input = train_data.iloc[:, 2:6].values             # columns 2-5 as a NumPy array
    le = preprocessing.LabelEncoder()                        # LabelEncoder from the preprocessing module
    train_target = le.fit_transform(train_data.iloc[:, 6])   # encode each class as an integer label
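As a small illustration of what LabelEncoder does, the sketch below encodes a made-up list of "Y"/"N" values (le_demo and the sample values are for illustration only). The classes are sorted, so "N" becomes 0 and "Y" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["Y", "N", "N", "Y"])   # hypothetical renewal values

print(list(le_demo.classes_))                    # ['N', 'Y']  (sorted, so N -> 0, Y -> 1)
print(list(encoded))                             # [1, 0, 0, 1]
print(list(le_demo.inverse_transform([1, 0])))   # ['Y', 'N']
```

This mapping matters later: label 1 stands for "Yes" and label 0 for "No" whenever the encoded target is passed to Scikit-learn functions.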
4. Model Building & Assessment
Nice! With the input and target of the training data prepared, we can build a classifier to determine whether members will renew their memberships. First, we declare a new logistic regression classifier and fit it to the training data.
    from sklearn import linear_model

    classifier = linear_model.LogisticRegression()  # defaults: fit_intercept=True, C=1.0
    classifier.fit(train_input, train_target)
The parameters of the LogisticRegression class are left at their defaults. Two key parameters worth highlighting are fit_intercept and C. The former is "True" since we assume there is a bias (aka intercept) term in the model. The latter (C) is the inverse of the regularization strength, which is used to adjust the model to avoid overfitting. Generally, the regularization term ($\lambda\sum_{i=1}^{n}{w_i^2}$) reduces model complexity so that the model does not over-learn the training data and overfit. Therefore, we should set C lower to strengthen the regularization effect (a larger $\lambda$).
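The effect of C can be seen on synthetic data. The sketch below is not part of the membership example: the data, the "true" weights and the three C values are assumptions chosen for illustration. It fits the same model with different C values and compares the size of the learned weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 instances, 4 features, labels from a noisy linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])   # hypothetical weights
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(int)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C).fit(X, y)
    norms[C] = float(np.linalg.norm(clf.coef_))   # overall size of the weights
    print("C={0:<6} ||w|| = {1:.3f}".format(C, norms[C]))
```

A smaller C means stronger regularization, so the weights are pulled closer to zero; a larger C lets them grow to fit the training data more closely.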
As we have 4 independent variables ($n$ = 4) and 20 training instances ($m$ = 20), the trained model can be expressed as:
$$h_w(x)=\frac{1}{1+e^{-(w_0+w_1x_1+w_2x_2+w_3x_3+w_4x_4)}}$$
Then we can review the intercept ($w_0$) and the coefficients ($w_1$ to $w_4$) through the attributes of the classifier. The classification rate is the proportion of data instances that are classified as their actual classes.
    accuracy = classifier.score(train_input, train_target)  # mean accuracy on the training data
    print("Intercept:", classifier.intercept_)
    print("Coefficients:", classifier.coef_)
    print("Classification Rate: {0:.0f}%".format(accuracy * 100))
Unfortunately, this example has four independent variables as model inputs, so the complete decision boundary can only be seen in a 4-D graph. A 2-D scatter plot is an alternative way to observe how a projection of the classifier separates the data points. We select salary, $x_3$, and consumption amount, $x_4$, as the x- and y-axes respectively and ignore the effects of the other variables, $x_1$ and $x_2$. The equation of the classifier (threshold $p$ = 0.5) is as follows:
$$ln(\frac{0.5}{1-0.5})=w_0+w_3x_3+w_4x_4$$
$$0=w_0+w_3x_3+w_4x_4$$
$$x_4=\frac{-w_0-w_3x_3}{w_4}$$
    import numpy as np

    i = 4  # column index of salary; column i+1 is consumption amount
    weight = classifier.coef_[0]
    yes_data = train_data[train_data['renewal'] == 'Y']
    no_data = train_data[train_data['renewal'] == 'N']
    plt.figure()
    plt.scatter(yes_data.iloc[:, i], yes_data.iloc[:, i+1], marker="x", color="blue", label="Yes")
    plt.scatter(no_data.iloc[:, i], no_data.iloc[:, i+1], marker="x", color="red", label="No")
    x = np.arange(4500, 45000, 4500)
    # decision boundary at p = 0.5: x_4 = (-w_0 - w_3*x_3) / w_4
    y = (-classifier.intercept_[0] - weight[i-2]*x) / weight[i-1]
    plt.xlabel(train_data.columns[i])    # label of x-axis
    plt.ylabel(train_data.columns[i+1])  # label of y-axis
    plt.plot(x, y, color="black")
    plt.legend()
    plt.show()
To evaluate classification performance, a confusion matrix is constructed with Scikit-learn and visualized with Matplotlib. The columns of the matrix represent the predicted classes while the rows represent the actual classes. Each element of the matrix is the number of data instances that belong to a particular actual class and are classified as a specific predicted class. The matrix not only shows the classification rate and error in a well-formatted table but also allows us to assess model performance through Accuracy, Precision, Recall (Sensitivity) and Specificity. Remember to specify the labels parameter of confusion_matrix; otherwise the rows and columns follow the sorted order of the labels.
    # Confusion matrix
    from sklearn.metrics import confusion_matrix

    predict_train_target = classifier.predict(train_input)
    # labels=[1, 0] puts "Yes" (encoded as 1) in the first row and column
    cm = confusion_matrix(train_target, predict_train_target, labels=[1, 0])
    print(cm)

    # Visualize the matrix
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title("Confusion Matrix")
    plt.colorbar()
    tick_marks = [0, 1]
    plt.xticks(tick_marks, ["Yes", "No"])
    plt.yticks(tick_marks, ["Yes", "No"])
    plt.ylabel('True')
    plt.xlabel('Predicted')
    plt.show()
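The four measures can be read straight off a 2x2 confusion matrix. The sketch below uses a made-up matrix (cm_demo, not the result of this tutorial's model) laid out with "Yes" in the first row and column, as produced by labels=[1, 0].

```python
# Hypothetical confusion matrix: rows are actual classes, columns are predicted classes.
cm_demo = [[8, 2],   # actual Yes: 8 predicted Yes (TP), 2 predicted No (FN)
           [1, 9]]   # actual No:  1 predicted Yes (FP), 9 predicted No (TN)

tp, fn = cm_demo[0]
fp, tn = cm_demo[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all instances classified correctly
precision = tp / (tp + fp)                   # share of predicted "Yes" that are actually "Yes"
recall = tp / (tp + fn)                      # sensitivity: share of actual "Yes" that are found
specificity = tn / (tn + fp)                 # share of actual "No" that are found

print("Accuracy: {0:.2f}, Precision: {1:.2f}, Recall: {2:.2f}, Specificity: {3:.2f}".format(
    accuracy, precision, recall, specificity))
```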
Aside from the confusion matrix, another metric for evaluating model performance is the ROC (Receiver Operating Characteristic) curve. It is useful for comparing the performance of several classification models. In this tutorial only one model is built, so we cannot compare it with others, but we can still construct the ROC curve to see how our logistic regression classifier works on the data. The closer the curve lies to the top-left corner, that is, the higher the True Positive Rate at a low False Positive Rate (FPR), the better the model performs.
    # ROC
    from sklearn.metrics import roc_curve

    fpr, tpr, thresholds = roc_curve(train_target, classifier.decision_function(train_input))
    plt.figure()
    plt.plot(fpr, tpr, label='ROC curve')  # false positive rate vs. true positive rate
    plt.plot([0, 1], [0, 1], 'k--')        # diagonal: performance of random guessing
    plt.xlim([0.0, 1.0])                   # limit x axis
    plt.ylim([0.0, 1.05])                  # limit y axis
    plt.xlabel('1 - Specificity (False Positive Rate)')
    plt.ylabel('Sensitivity (True Positive Rate)')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
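To see where the points on a ROC curve come from, the sketch below sweeps a few thresholds over made-up scores and computes one (FPR, TPR) pair per threshold. The helper roc_point and the sample data are hypothetical; this is a simplified illustration rather than how roc_curve is implemented.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when every score >= threshold is classified as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    pos = sum(y_true)           # number of actual positives
    neg = len(y_true) - pos     # number of actual negatives
    return fp / neg, tp / pos

# Hypothetical labels and classifier scores, sorted by score for readability.
y_demo = [1, 1, 0, 1, 0, 0]
scores_demo = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

roc_points = [roc_point(y_demo, scores_demo, t) for t in (1.0, 0.75, 0.5, 0.0)]
for fpr_t, tpr_t in roc_points:
    print("FPR = {0:.2f}, TPR = {1:.2f}".format(fpr_t, tpr_t))
```

Lowering the threshold moves the point from (0, 0) toward (1, 1). A good classifier gains true positives faster than false positives along the way.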
5. Classifying new data
Assume we have tuned the model and chosen it as the final model to classify new test data. After reading the data and feeding it into the model, the renewal class is predicted, and we can estimate whether each of the five customers will continue their membership.
    test_data = pd.read_csv("test_data.csv")   # read test data
    test_input = test_data.iloc[:, 2:6].values
    test_target = classifier.predict(test_input)
    # use the LabelEncoder to map the predicted labels back to class names
    predict_classes = le.inverse_transform(test_target)
    for name, predict_class in zip(test_data['name'], predict_classes):
        print(name, ":", predict_class)        # show the result
The results are shown below:
Uriah : Y
Vanessa : Y
Wayne : Y
Yoyo : Y
Zack : Y
Great! You have learnt how classification can be achieved with a logistic regression model. The assessment methods introduced in this chapter are not limited to logistic regression; they can also be applied to other models suited to classification. More tutorials about such models and their applications are coming soon. Enjoy~