Posts

Showing posts from December, 2015

Linear Regression - The foundation of statistics and prediction

Linear Regression is applied in different areas to achieve different goals. It is a method for describing the relationship between the independent variables ($x_i$) and the dependent variable ($y$), or for predicting the target variable ($y$) from the inputs ($x_i$). It also helps us understand how much the dependent variable changes when one or more independent variables are changed. Formula: $$y = w_0+w_1x_1+w_2x_2+ ... +w_nx_n$$ It can also be written as: $$h_w(x)=\sum_{i=0}^{n}{w_ix_i}=w^Tx$$ with the convention $x_0=1$.

The Cost Function is defined as the sum of the squared differences between $h_w(x^{(j)})$ and $y^{(j)}$. In statistics, this is called the Sum of Squared Errors (SSE). To fit a straight line to $n$ data points, the cost function / SSE is minimized to achieve the optimization goal; this procedure is known as OLS Estimation (Ordinary Least Squares Estimation).

Intuitively, the smaller the difference between the predicted values and the actual target values, the closer the model's output is to reality, which means the model is more accurate and the error is smaller. $R^2$ measures how well a linear regression model explains the data: the higher the $R^2$, the more capable the model is of explaining the observed data. It is the ratio of the regression sum of squares, $SSR$, to the total variation, $SST$: $R^2=\frac{SSR}{SST}$, where $0\leq R^2\leq 1$.

However, when the variables have a non-linear relationship, linear regression is not as accurate as other models such as neural networks, since it assumes a linear relationship between inputs and outputs. Another problem with linear regression is its sensitivity to outliers (mainly influence points), which affects the accuracy of the predictions. Before training the model, outliers can be identified in certain plots or statistical summaries and removed to address this problem.

In the previous chapter, I introduced the overall process of data mining. Some of you might not understand it well. It should be fine since the s...
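The OLS estimate and the $R^2$ measure described above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up toy data; the variable names and values are assumptions, not from the original post.

```python
import numpy as np

# Made-up toy data: one independent variable x and a noisy linear target y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with an intercept column (x_0 = 1).
X = np.column_stack([np.ones_like(x), x])

# OLS estimate: w minimizes the SSE; lstsq solves it directly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w
sse = np.sum((y - y_hat) ** 2)     # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)  # total variation
r2 = 1 - sse / sst                 # equals SSR / SST

print(w)   # intercept and slope of the fitted line
print(r2)  # close to 1 here, since the toy data are nearly linear
```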

Linear Regression - Simple model for statistics and prediction

No doubt that Linear Regression is competent to be applied to different aspects and purposes. It can be an approach to describe the relationship between the independent variables ($x_i$) and the dependent variable ($y$), or to predict the target variable ($y$) from the inputs ($x_i$). It also helps us understand how much the dependent variable changes when one or more independent variables are changed. Formula: $$y = w_0+w_1x_1+w_2x_2+ ... +w_nx_n$$ It can also be written as: $$h_w(x)=\sum_{i=0}^{n}{w_ix_i}=w^Tx$$ with the convention $x_0=1$. The Cost Function is defined as the sum of the squared differences between $h_w(x^{(j)})$ and $y^{(j)}$. In statistics, it is called the Sum of Squared Errors (SSE). In order to fit the straight line to $n$ data points, the cost function / SSE needs to be minimized to achieve the optimization goal. $$J(w)=\sum_{j=1}^{m}{(h_w(x^{(j)})-y^{(j)})^2}$$ Thinking about it intuitively, the smaller the difference between predicted and actual target values, the predictive model is giving the result which...
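As a sketch of how the minimization of $J(w)$ can proceed without a closed-form solution, here is plain batch gradient descent on made-up toy data; the learning rate and iteration count are illustrative assumptions, not values from the post.

```python
import numpy as np

# Made-up toy data; X includes an intercept column (x_0 = 1).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.9])
X = np.column_stack([np.ones_like(x), x])

def cost(w):
    """J(w) = sum_j (h_w(x^(j)) - y^(j))^2"""
    residual = X @ w - y
    return residual @ residual

# Batch gradient descent: the gradient of J is 2 X^T (Xw - y).
w = np.zeros(2)
learning_rate = 0.01
for _ in range(5000):
    w -= learning_rate * 2 * X.T @ (X @ w - y)

print(w)        # converges to the least-squares weights
print(cost(w))  # J(w) near its minimum
```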

Scikit-learn - Powerful package of machine learning algorithms in Python

Scikit-Learn is a famous machine learning package in Python, contributed by the community. Most of the processes involved in data mining, including data processing, model training, validation and assessment, are available in this module. Highlighted algorithms such as Regression, Decision Tree and Clustering are available to satisfy our analytic needs. Recently, I was glad to find a new class of Artificial Neural Network for supervised learning in the new version. I really cannot wait for the latest version of scikit-learn! So, install Scikit-Learn first!

Mac: open the terminal and type: pip install scikit-learn
Windows: open the CMD and type: pip install scikit-learn
Ubuntu: open the terminal and type: pip install --user --install-option="--prefix=" -U scikit-learn

Let's enjoy the journey of data mining in the data jungle with Scikit-learn!
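After installing, a first run could look like the minimal sketch below, fitting scikit-learn's LinearRegression on made-up toy data (the data and values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy inputs and a perfectly linear target: y = 2x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)  # roughly [2.0] and 0.0
print(model.predict([[5.0]]))         # roughly [10.0]
```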

Scikit-learn - Making machine learning approachable

Scikit-Learn is one of Python's well-known packages, letting you concisely carry out the data processing and transformation, model training, validation and evaluation involved in data mining. Basic algorithms such as Regression, Decision Tree and Clustering can also satisfy our needs for predictive analytics. Recently, the new version (0.18.0) of Scikit-Learn added artificial neural networks (supervised learning), which makes me really look forward to the latest version of scikit-learn! Without further ado, let's install Scikit-Learn together!

Mac: open the terminal and type: pip install scikit-learn
Windows: open the CMD and type: pip install scikit-learn
Ubuntu: open the terminal and type: pip install --user --install-option="--prefix=" -U scikit-learn

Scikit-Learn covers seven major areas, each containing many models stored as classes and methodologies stored as functions:

1. Supervised Learning
Commonly used supervised models for classification and prediction, such as Regression, Decision Tree, KNN, Naive Bayes and Neural Network, are all available; more advanced ensemble methods, such as Bagging and Boosting, are included as well.

2. Unsupervised Learning
Unsupervised models such as Clustering, Outlier Detection and even Neural Networks are included, letting you carry out analysis without a known target variable.

3. Model Selection & Assessment
Metrics for assessing models, such as the Confusion Matrix, ROC and Lift Chart, are included. In addition, techniques such as Cross Validation and Learning Curves help you choose the best parameters.

4. Data Transformation (...
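As a small sketch tying together a supervised model (area 1) and model selection via cross validation (area 3), the following uses scikit-learn's bundled iris dataset. The choice of model, dataset and 5 folds are illustrative assumptions; note that the `sklearn.model_selection` module layout is the one introduced in version 0.18, the version the post mentions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Supervised learning: a decision tree classifier on the iris dataset.
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Model selection & assessment: 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # typically well above 0.9 on this dataset
```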