Logistic Regression - Telling You "Yes" or "No"

- April 18, 2016

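Another regression model, called logistic regression, describes the relationship between a categorical variable and a set of independent variables, and estimates the probability that a given class occurs. The output variable in logistic regression is categorical rather than interval-scaled. You may already have heard of classification in machine learning; it is one of the most common applications in data mining and pattern recognition. The logit function links a linear combination of the inputs to a probability, which is what lets the model work with a categorical target instead of a numeric one. To keep the essentials easy for beginners, this article uses a target variable with only two classes ("yes" or "no") as the example.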

Formula:
$$\ln\left(\frac{p}{1-p}\right) = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n = \sum_{i=0}^{n}{w_ix_i} = w^Tx, \quad x_0 = 1$$

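Here $p$ is the probability of getting "yes". Solving the equation for $p$ gives the new formula below. If we already know all the independent variables $x$ and the weight $w$ of each one, we can calculate the probability of "yes". The probability of "no" is simply $1 - p$.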
$$p=\frac{1}{1+e^{-w^Tx}}$$
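As a quick check of the formula above, here is a minimal sketch in NumPy. The weights and the observation are made-up numbers, and `predict_probability` is a hypothetical helper name, not something from Scikit-Learn.

```python
import numpy as np

def predict_probability(w, x):
    """Return p = 1 / (1 + exp(-w^T x)); x[0] is assumed to be 1 for the intercept."""
    z = np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights (w_0, w_1, w_2) and one observation with x_0 = 1
w = np.array([-1.0, 0.5, 0.25])
x = np.array([1.0, 2.0, 4.0])

p_yes = predict_probability(w, x)          # here w^T x = -1 + 1 + 1 = 1
p_no = 1.0 - p_yes                         # probability of "no"
log_odds = np.log(p_yes / (1.0 - p_yes))   # the logit recovers w^T x

print(p_yes, p_no, log_odds)
```

Taking the logit of the result gives back $w^Tx$, which shows that the two formulas are the same relationship written in two directions.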

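The cost function of logistic regression is defined below, where $m$ is the number of training samples and $h_w(x)$ is the predicted probability $p$ for input $x$. It is derived from maximum likelihood estimation in statistics. As before, the goal is to minimize the cost function and thereby obtain the corresponding weights $w_i$.
$$J(w) = -\frac{1}{m}\sum_{j=1}^{m}{\left[y^{(j)}\log(h_w(x^{(j)})) + (1-y^{(j)})\log(1-h_w(x^{(j)}))\right]}$$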
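To make the cost function concrete, the sketch below evaluates it on a tiny made-up dataset. The helper names `sigmoid` and `cost` and all the numbers are assumptions for illustration; Scikit-Learn does this optimization internally and adds a regularization term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    """Average negative log-likelihood J(w); X has a leading column of ones."""
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data: first column is x_0 = 1, second column is a single feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0, 0, 1, 1])

cost_zero = cost(np.zeros(2), X, y)               # all weights zero, so every h = 0.5
cost_better = cost(np.array([-3.0, 2.0]), X, y)   # weights that separate the classes

print(cost_zero, cost_better)
```

With all weights at zero every prediction is 0.5 and the cost equals $\ln 2$; weights that push the predictions toward the true labels give a lower cost.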

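From a machine learning point of view, logistic regression is computationally cheap and easy to implement and interpret. However, when the independent variables relate to the outcome in a non-linear way, the model's own assumption is violated and its performance drops, meaning it misclassifies more often.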

Implementation in Python
1. Install Scikit-Learn
See the previous article to install the package used for machine learning.

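2. Collect and Visualize the Data
Suppose we have a set of customer data that includes personal details, spending, and whether each customer renewed their membership this year (click here to download). Our goal is to use this historical data (training data) to predict whether newly joined members will renew in the future. When a regression model is used for classification, pay close attention to the proportion of classes in the target variable. Logistic regression learns the "yes" class by contrast with the "no" class, so an imbalanced sample (many "yes" and very few "no", or the reverse) hurts the model's accuracy. The pie chart shows the class proportions. Fortunately, "yes" and "no" are distributed quite evenly here, so no extra handling is needed.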

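import matplotlib.pyplot as plt
# train_data is assumed to be the pandas DataFrame loaded from the downloaded file
# Reindex so the counts line up with the "Yes"/"No" labels whichever class is larger
renewal_prop = train_data['renewal'].value_counts().reindex(['Y', 'N'])
plt.pie(renewal_prop, explode=(0, 0.1), labels=["Yes", "No"], colors=["lightskyblue", "lightcoral"], radius=1, autopct='%1.1f%%', shadow=True)
plt.show()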

Figure: Balanced proportion of the target variable (renewal)

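3. Prepare the Data
If you have not installed Pandas yet, see this article for installation. Once it is installed, create a new Python file named "log_reg.py". First, with the help of Pandas, read the data and convert it into a DataFrame. The target variable is in the seventh column, so its index is 6 (indexing starts at 0). Age, gender, salary, and spending (indexes 2, 3, 4, 5) are used as the input variables. To convert a DataFrame into a NumPy array, you can use "anyDataFrame.values", as in the code below: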

from sklearn import preprocessing
train_input = train_data.iloc[:, 2:6].values  # columns 2-5 as a NumPy array
le = preprocessing.LabelEncoder()  # create a LabelEncoder from the preprocessing module
train_target = le.fit_transform(train_data.iloc[:, 6].values)  # map each class name to an integer label
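It helps to see what LabelEncoder does to the class names before moving on. The sketch below uses a short made-up list of "Y"/"N" values and a separate encoder, `le_demo`, so it does not interfere with the one fitted on the training data.

```python
from sklearn import preprocessing

le_demo = preprocessing.LabelEncoder()
labels = ["Y", "N", "N", "Y"]              # hypothetical renewal values

encoded = le_demo.fit_transform(labels)    # class names -> integer codes
decoded = le_demo.inverse_transform(encoded)  # integer codes -> class names

print(le_demo.classes_)  # classes are stored in sorted order; the position is the code
print(encoded)
print(decoded)
```

Because the classes are sorted, "N" becomes 0 and "Y" becomes 1, which is worth keeping in mind when reading the coefficients and the confusion matrix later.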

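4. Build and Evaluate the Model
Now you can build a classifier to estimate whether members will renew their membership. First, we create a new logistic regression classifier and fit it to the training data.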

from sklearn import linear_model
classifier = linear_model.LogisticRegression()  # defaults: fit_intercept=True, C=1.0
classifier.fit(train_input, train_target)

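LogisticRegression is a class in Scikit-Learn, and we keep the default values for all of its parameters. fit_intercept is set to True because we assume the model has an intercept (bias) term. C is the inverse of the regularization strength: the lower the value of C, the stronger the regularization, which is mainly used to avoid overfitting. In general, the regularization term in the cost function ($\lambda\sum_{i=1}^{n}{w_i^2}$) penalizes highly complex models so that simpler models win, which stops the model from over-learning the training data. To strengthen the regularization (a larger $\lambda$), we should set C to a lower value.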
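The effect of C is easy to see by fitting the same data twice. This is only a sketch on synthetic data generated with a fixed random seed; the values 0.01 and 100 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic two-feature data (not the membership dataset)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

strong = linear_model.LogisticRegression(C=0.01).fit(X, y)   # low C: strong regularization
weak = linear_model.LogisticRegression(C=100.0).fit(X, y)    # high C: weak regularization

norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print(norm_strong, norm_weak)
```

The model fitted with the lower C ends up with smaller weights, which is the penalty on $\sum w_i^2$ at work.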

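Since we have four independent variables ($n$ = 4) and 20 training samples ($m$ = 20), the trained model can be expressed as:
$$h_w(x)=\frac{1}{1+e^{-(w_0+w_1x_1+w_2x_2+w_3x_3+w_4x_4)}}$$
We can then read off the intercept ($w_0$) and the coefficients ($w$) of the logistic regression. The classification rate is the ratio of correctly classified samples to all samples, that is, the number of members predicted correctly divided by the total number of members.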

accuracy = classifier.score(train_input, train_target)  # score returns the accuracy on the given data
print("Intercept: ", classifier.intercept_)
print("Coefficient: ", classifier.coef_)
print("Classification Rate: {0:.0f}%".format(accuracy * 100))

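Unfortunately, this example has four independent variables as model inputs, so we cannot draw a single picture that shows how the classes of members are separated. Instead, I pick two of the variables and show in a 2-D plot how the classifier splits the data. We choose salary $x_3$ and amount spent $x_4$, and ignore the effect of the other variables $x_1$ and $x_2$. The logistic regression equation at the threshold $p$ = 0.5 is:
$$\ln\left(\frac{0.5}{1-0.5}\right)=w_0+w_3x_3+w_4x_4$$
$$0=w_0+w_3x_3+w_4x_4$$
$$x_4=\frac{-w_0-w_3x_3}{w_4}$$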

import numpy as np
i = 4  # column index of salary; column i+1 is the amount spent
weight = classifier.coef_[0]
yes_data = train_data[train_data['renewal'] == 'Y']
no_data = train_data[train_data['renewal'] == 'N']
plt.figure()
plt.scatter(yes_data.iloc[:, i], yes_data.iloc[:, i + 1], marker="x", color="blue", label="Yes")
plt.scatter(no_data.iloc[:, i], no_data.iloc[:, i + 1], marker="x", color="red", label="No")
x = np.arange(4500, 45000, 4500)
# Inputs are columns 2-5, so column i maps to weight[i-2] and column i+1 to weight[i-1]
y = (-classifier.intercept_[0] - weight[i - 2] * x) / weight[i - 1]
plt.xlabel(train_data.columns[i])   # label of x-axis
plt.ylabel(train_data.columns[i + 1])   # label of y-axis
plt.plot(x, y, color="black")  # decision boundary at p = 0.5
plt.legend()
plt.show()
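To convince yourself that the black line really is the $p$ = 0.5 boundary, you can plug points on the line back into the model. The weights below are made-up numbers standing in for the fitted intercept and coefficients, with the other two variables ignored as in the plot.

```python
import numpy as np

# Hypothetical values for w_0, w_3 (salary) and w_4 (amount spent)
w0, w3, w4 = -4.0, 0.0001, 0.0002

x3 = np.array([5000.0, 10000.0, 20000.0])   # a few salary values
x4 = (-w0 - w3 * x3) / w4                   # matching points on the boundary

z = w0 + w3 * x3 + w4 * x4                  # linear part of the model on the boundary
p = 1.0 / (1.0 + np.exp(-z))                # probability of "yes"

print(x4)
print(p)
```

Every point on the line gives a probability of 0.5. Points on one side of it are classified "yes" and points on the other side "no".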

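To evaluate the classifier, I use Scikit-Learn to build a confusion matrix and Matplotlib to plot it. The columns of the matrix are the predicted classes and the rows are the true classes. Each cell holds the number of samples that belong to a given true class and were assigned to a given predicted class. This view shows clearly how the classifier sorts members into those who will renew and those who will not. Besides accuracy, precision, recall (sensitivity), and specificity are also used to evaluate model performance and to describe where the model does better. Remember to specify the labels argument of confusion_matrix! Otherwise Scikit-Learn sorts the labels in ascending order.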

# Confusion Matrix
from sklearn.metrics import confusion_matrix
predict_train_target = classifier.predict(train_input)
cm = confusion_matrix(train_target, predict_train_target, labels=[1, 0])  # remember to set the labels!
print(cm)

# Visualize the matrix
plt.figure()
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title("Confusion Matrix")
plt.colorbar()
tick_marks = np.arange(2)
plt.xticks(tick_marks, ["Yes","No"])
plt.yticks(tick_marks, ["Yes","No"])
plt.ylabel('True')
plt.xlabel('Predicted')
plt.show()
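The metrics mentioned above can be read straight off the confusion matrix. The sketch below uses ten made-up true and predicted labels (1 = "Yes", 0 = "No") and the same `labels=[1, 0]` ordering, so the first row and column refer to "Yes".

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not the membership data
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

cm_demo = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm_demo[0]   # true "Yes": predicted Yes, predicted No
fp, tn = cm_demo[1]   # true "No":  predicted Yes, predicted No

precision = tp / (tp + fp)         # of the predicted "Yes", how many are right
recall = tp / (tp + fn)            # sensitivity: of the true "Yes", how many are found
specificity = tn / (tn + fp)       # of the true "No", how many are found
accuracy = (tp + tn) / cm_demo.sum()

print(cm_demo)
print(precision, recall, specificity, accuracy)
```

If the labels argument were left out, the rows and columns would start with class 0, and the unpacking above would swap the meaning of the four cells.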

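Besides the confusion matrix, another tool for evaluating model performance is the ROC (receiver operating characteristic) curve, which is often used to compare several binary classifiers. In this article we build only one model, so there is nothing to compare it with, but we can still plot the ROC curve to see how the logistic regression classifier performs. The result shows that our model performs better at lower false positive rates (FPR).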

# ROC
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(train_target, classifier.decision_function(train_input))

# ROC-curve
plt.figure()
plt.plot(fpr, tpr, label='ROC curve')  # plot the false positive and true positive rate
plt.plot([0, 1], [0, 1], 'k--')  # diagonal: the baseline of a random classifier
plt.xlim([0.0, 1.0])  # limit x axis
plt.ylim([0.0, 1.05])  # limit y axis
plt.xlabel('1 - Specificity (False Positive Rate)')
plt.ylabel('Sensitivity (True Positive Rate)')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
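A common way to summarize a ROC curve in one number is the area under it (AUC), which the article does not compute but which Scikit-Learn provides as `roc_auc_score`. The sketch below uses four made-up labels and scores rather than the membership data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)   # 1.0 is perfect, 0.5 matches the diagonal

print(fpr_demo, tpr_demo)
print(auc)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75. When comparing several classifiers, the one with the larger area is usually preferred.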

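5. Classify New Data
Suppose we have tuned the model and chosen it as the best one, and we now use it to classify new test data. After feeding the data into the model, we get a set of predicted classes, namely whether each of the five customers will renew their membership.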

import pandas as pd
test_data = pd.read_csv("test_data.csv")  # read test data
test_input = test_data.iloc[:, 2:6].values  # same input columns as the training data
test_target = classifier.predict(test_input)
predict_class = le.inverse_transform(test_target)  # use the LabelEncoder to map labels back to class names

for i in range(test_data.shape[0]):
    print(test_data.name[i], ": ", predict_class[i])  # show the result

The classification results are as follows:
Uriah : Y
Vanessa : Y
Wayne : Y
Yoyo : Y
Zack : Y
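The predicted class alone hides how confident the model is. If you also want the estimated probability behind each prediction, `predict_proba` returns it. The sketch below trains a separate toy model, `clf`, on six made-up one-feature samples, so the numbers have nothing to do with the five customers above.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical training data: one feature, classes split around 3.5
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
clf = linear_model.LogisticRegression().fit(X_train, y_train)

X_new = np.array([[1.5], [5.5]])
proba = clf.predict_proba(X_new)   # one row per sample, one column per class in clf.classes_
pred = clf.predict(X_new)          # the class with the higher probability

print(clf.classes_)
print(proba)
print(pred)
```

Each row of `proba` sums to 1, and `predict` simply picks the class whose probability is above 0.5.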

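Well done! You now know how to use a logistic regression model for classification. In fact, the evaluation methods introduced here also apply to other classifiers, such as the Bayes classifier and KNN, and are not limited to logistic regression. More tutorials on other classifiers and their applications will follow, so stay tuned!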

Data Jungler