Data Jungler

Showing posts from April, 2016

Logistic Regression - Letting You Know "Yes" or "No"

- April 18, 2016

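Logistic regression is another regression model. It describes the relationship between a categorical variable and a set of independent variables, and estimates the probability that a given category occurs. The output variable in logistic regression is categorical rather than interval. Readers may already have heard of classification in machine learning, one of the common applications in data mining and pattern recognition. Through the logit transformation, the target variable becomes categorical rather than numeric. To help beginners grasp the essentials, this article uses a target variable with only two classes ("Yes" or "No") as the example.

Formula: $$\ln\left(\frac{p}{1-p}\right) = w_0 + w_1x_1 + w_2x_2+...+w_nx_n=\sum_{i=0}^{n}{w_ix_i}=w^Tx$$ Here $p$ is the probability of getting "Yes", and $x_0 = 1$ so that $w_0$ acts as the intercept. Rearranging this gives the formula below. Assuming we already know all the independent variables $x$ and the weight $w$ of each one, we can compute the probability of "Yes". The probability of "No" is then simply $1 - p$. $$p=\frac{1}{1+e^{-w^Tx}}$$

The cost function of logistic regression is defined below. It is derived from maximum likelihood estimation in statistics. As before, the goal is to minimize the cost function and thereby obtain the corresponding weights $w_i$. $$J(w) = -\frac{1}{m}\left[\sum_{j=1}^{m}{\left(y^{(j)}\log(h_w(x^{(j)})) + (1-y^{(j)})\log(1-h_w(x^{(j)}))\right)}\right]$$ Here $m$ is the number of training examples, $y^{(j)} \in \{0, 1\}$ is the label of the $j$-th example, and $h_w(x^{(j)})$ is the predicted probability $p$ for that example.

From a machine learning perspective, logistic regression is computationally inexpensive and easy to implement and interpret. However, when the relationship with the independent variables is nonlinear, the model's own assumptions are violated and its performance degrades, that is, it misclassifies more often.

Implementation in Python

1. Install Scikit-Learn: refer to the earlier article to install the package used for machine learning.
2. Collect and visualize the data: suppose we have a set of customer data, including their personal details, spending, and whether they renewed their membership this year (click here to download). Our goal is to use this historical data (training data) to predict some newly joined...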
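To make the cost function concrete, here is a minimal pure-Python sketch of $J(w)$ and of one way to minimize it, batch gradient descent. It is not the post's Scikit-Learn implementation: the function names, the learning rate, and the toy data are all made up for illustration, and each input row is assumed to start with a constant 1 so that $w_0$ acts as the intercept.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """p = 1 / (1 + exp(-w^T x)); x[0] is assumed to be the constant 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def cost(w, X, y):
    """Average negative log-likelihood J(w) over the m training examples."""
    total = 0.0
    for xj, yj in zip(X, y):
        h = predict_proba(w, xj)
        total += yj * math.log(h) + (1 - yj) * math.log(1 - h)
    return -total / len(X)

def gradient_step(w, X, y, lr=0.1):
    """One batch gradient-descent update (lr is an assumed learning rate)."""
    m = len(X)
    grad = [0.0] * len(w)
    for xj, yj in zip(X, y):
        err = predict_proba(w, xj) - yj  # h_w(x) - y for this example
        for i, xi in enumerate(xj):
            grad[i] += err * xi / m
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Made-up toy data: each row is [1 (intercept term), x_1]; 1 = "Yes", 0 = "No".
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

w = [0.0, 0.0]
initial_cost = cost(w, X, y)
for _ in range(200):
    w = gradient_step(w, X, y)
final_cost = cost(w, X, y)
print(initial_cost, final_cost)
```

With all weights at zero every prediction is 0.5, so the starting cost equals $\ln 2$; each update then lowers it. Library solvers typically use more sophisticated optimizers, so treat this only as an illustration of what "minimizing the cost function" means.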

Logistic Regression - Tell us "Yes" or "No"

- April 18, 2016

Logistic regression is another regression model. It uses the logistic function to describe the relationship between a categorical dependent variable and the independent variables, and to estimate the probabilities of a binary response. The output variable in logistic regression is categorical rather than interval. You might have heard the term classification, which is one of the common applications in data mining and automatic pattern recognition. Logistic regression can classify binary, ordinal or nominal target variables. This tutorial uses only a binary target variable ("Yes" or "No") as the classification example, since it is easier for beginners to grasp. Formula: $$\ln\left(\frac{p}{1-p}\right) = w_0 + w_1x_1 + w_2x_2+...+w_nx_n=\sum_{i=0}^{n}{w_ix_i}=w^Tx$$ Here $x_0 = 1$, so $w_0$ is the intercept. The probability of "Yes", $p$, can be expressed as below by rearranging the formula above. If the probability exceeds the threshold (generally 0.5), the...
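As a small illustration of the threshold step, the sketch below inverts the log-odds formula to get $p$ and then labels an observation. It assumes the usual rule that a probability above the threshold is classified as "Yes" and anything else as "No"; the weights and inputs are hypothetical values chosen only so the arithmetic is easy to follow.

```python
import math

def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def probability_of_yes(w, x):
    """Invert the logit: p = 1 / (1 + exp(-w^T x)), with x[0] = 1 for w_0."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(w, x, threshold=0.5):
    """Return "Yes" when the estimated probability exceeds the threshold."""
    return "Yes" if probability_of_yes(w, x) > threshold else "No"

# Hypothetical weights [w_0, w_1, w_2], not fitted to any real data.
w = [-3.0, 1.0, 0.5]

print(classify(w, [1.0, 4.0, 2.0]))  # score -3 + 4 + 1 = 2, p is about 0.88
print(classify(w, [1.0, 1.0, 1.0]))  # score -3 + 1 + 0.5 = -1.5, p is about 0.18
```

A threshold of 0.5 on $p$ is the same as asking whether the linear score $w^Tx$ is positive, since the log-odds are zero exactly when $p = 0.5$.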