Linear Regression - Simple model for statistics and prediction

Linear regression is a versatile method that serves several purposes. It can describe the relationship between the independent variables ($x_i$) and the dependent variable ($y$), or predict the target variable ($y$) from the inputs ($x_i$). It also helps us understand how much the dependent variable changes when one or more independent variables change.

Formula:
$$y = w_0+w_1x_1+w_2x_2+ ... +w_nx_n$$
With $x_0=1$, it can also be written as:
$$h_w(x)=\sum_{i=0}^{n}{w_ix_i}=w^Tx$$
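As a quick illustration, the hypothesis can be evaluated as a dot product once a constant $x_0=1$ is prepended to the inputs, so that $w_0$ acts as the intercept. The weights and inputs below are made-up numbers chosen only for this sketch.

```python
import numpy as np

# Hypothetical weights [w_0, w_1, w_2] and inputs [x_1, x_2], for illustration only
w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0])

# Prepend x_0 = 1 so the intercept w_0 becomes part of the dot product
x_with_bias = np.concatenate(([1.0], x))

h = float(w @ x_with_bias)  # w_0*1 + w_1*x_1 + w_2*x_2
print(h)
```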

The cost function is defined as the sum of the squared differences between $h_w(x^{(j)})$ and $y^{(j)}$. In statistics, it is called the sum of squared errors (SSE). To fit the straight line to $m$ data points, the cost function (SSE) is minimized.
$$J(w)=\sum_{j=1}^{m}{(h_w(x^{(j)})-y^{(j)})^2}$$
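A minimal sketch of this cost function in NumPy is shown here. The function name `sse_cost` and the tiny data set are assumptions made up for illustration.

```python
import numpy as np

def sse_cost(w, X, y):
    """Sum of squared errors J(w) for a design matrix X whose first column is all ones."""
    residuals = X @ w - y          # h_w(x^(j)) - y^(j) for every data point j
    return float(np.sum(residuals ** 2))

# Made-up data: 3 points, one input variable plus the constant column x_0 = 1
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

cost_perfect = sse_cost(np.array([0.0, 2.0]), X, y)  # y = 2x passes through every point
cost_worse = sse_cost(np.array([0.0, 1.0]), X, y)    # y = x leaves residuals of -1, -2, -3
print(cost_perfect, cost_worse)
```

The weights with the smaller cost describe the data better, which is exactly what minimizing $J(w)$ looks for.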

Intuitively, the smaller the difference between the predicted and actual target values, the closer the model's output is to the truth, and the better the model fits. The coefficient of determination, $R^2$, indicates how well the linear regression model fits the data. A higher value of $R^2$ means the model fits the data better. It measures the fraction of the total variation, $SST$, that is accounted for by the regression sum of squares, $SSR$.

$R^2=\frac{SSR}{SST}$, where $0\leq R^2\leq 1$
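For a least-squares line with an intercept, $SST = SSR + SSE$ holds on the training data, so $R^2$ can equivalently be computed as $1 - SSE/SST$. The sketch below checks this on made-up numbers, using `np.polyfit` to fit the line. It is an illustration only and not part of the tutorial script.

```python
import numpy as np

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)        # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)           # unexplained variation (the cost J)

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
print(r2_from_ssr, r2_from_sse)
```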

Nevertheless, linear regression is often less accurate than more flexible models, such as neural networks, because it assumes a linear relationship between the inputs and the output. Another challenge is that it is sensitive to outliers, which can hurt the accuracy of its predictions. This can be mitigated by removing the outliers, which can be identified from charts or summary statistics before we train the model.
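One common way to flag outliers from summary statistics is the 1.5 × IQR rule. This is only one possible criterion among many, and the salary figures below are made up for the sketch.

```python
import numpy as np

# Made-up monthly salaries; the last value is deliberately extreme
salary = np.array([18000, 20000, 21000, 22000, 23000, 25000, 26000, 120000], dtype=float)

q1, q3 = np.percentile(salary, [25, 75])   # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (salary < lower) | (salary > upper)
cleaned = salary[~is_outlier]              # values kept for training
print(salary[is_outlier], cleaned.size)
```

Whether a flagged point should really be dropped is a judgment call, so it is worth looking at the scatter plots as well before removing anything.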


Implementation in Python
In the previous chapter, I introduced the overall process of data mining. Some of you might not have fully understood it. That is fine, since data mining spans many areas of study. To walk you through the details, I will use Python to build a linear regression model, fit it to the data and finally assess its performance.

1. Installation of Scikit-Learn
Refer to the previous articles to install the machine learning package.

Next, open your Python editor. Recently, I have been hooked on coding with PyCharm, as its code completion saves me from memorizing everything and speeds up development. Don't worry. It is just my personal preference, and you are welcome to use your own favorite editor.

2. Raw Data Collection & Data Visualization
You need to have data (not a girlfriend) first before building any model for prediction or classification. Please download the raw data (data.csv) and related materials from my repository on GitHub. You will find train_data.csv, which contains the personal information of 20 members, including name, age, gender (1: Male, 0: Female), monthly salary and consumption amount per month.

Another file, test_data.csv, includes the same personal information without the target variable (consumption amount per month). We will predict that amount with the model built from the training data.

Assuming Matplotlib and Pandas are installed, we can first visualize the data as 2-D scatter plots to show the relationship between the dependent variable (consumption_amount) and each independent variable (age, gender, salary).

import matplotlib.pyplot as plt
import pandas as pd

train_data = pd.read_csv("train_data.csv")  # the same file is loaded again in step 3
for i in range(2, 5):    # column positions 2 to 4: age, gender, salary
    plt.subplot(2, 2, i - 1)    # draw the 1st, 2nd and 3rd graph in a 2x2 grid
    plt.scatter(train_data.iloc[:, i], train_data.consumption_amount, color="black")
    plt.xlabel(train_data.columns[i])    # label of x-axis
    plt.ylabel(train_data.columns[5])    # label of y-axis
plt.subplots_adjust(wspace=0.3, hspace=0.3)  # leave space between the 3 graphs
plt.show()



3. Data Preparation
If you have not installed Pandas, you can install it by referring to this tutorial. Once it is installed, let's create a new Python file named "lin_reg.py" and import the Pandas package. First, we read the raw data into a data frame with the help of Pandas, one of the most popular packages in Python. As the target variable is located in the sixth column, its index is 5 (since 0 is the starting index). We choose only age, gender and salary (index: 2, 3, 4) as input variables and put them into NumPy arrays as well.

import numpy as np
import pandas as pd
train_data = pd.read_csv("train_data.csv")  # pd refers to pandas
# split the data into input and target variables, selecting columns by position
train_input = np.array(train_data.iloc[:, 2:5])  # columns 2, 3, 4: age, gender, salary
train_target = np.array(train_data.iloc[:, [5]])  # column 5, kept 2-D (one column)

4. Model Building & Assessment
Having prepared the input and target data, we can build our first prediction model! Scikit-Learn saves us a lot of complex coding and time. As we have installed the scikit-learn package with pip, we can import linear_model from it. After that, we create a new Linear Regression predictor with suitable parameters. Conveniently, all the parameters we need already have suitable default values, so we do not have to specify any of them.

from sklearn import linear_model
model = linear_model.LinearRegression()  # defaults such as fit_intercept=True and copy_X=True are kept

To train the parameters ($w$, not to be confused with the parameters of a function!) of the linear regression on the training data, we use the "fit" function. We pass the input data and target data to "fit". The parameters ($w$) of "model" are then optimized, giving a linear regression model fitted to the training data.

Note that the number of independent variables is $n$ = 3 and the number of training examples is $m$ = 20.
$$h_w(x)=w_0+w_1x_1+w_2x_2+w_3x_3$$
Then, we can review the intercept ($w_0$) and the other parameters ($w_1$ to $w_3$) by reading the attributes of the model. Additionally, the sum of squared errors and the coefficient of determination, R-square, should be measured, as they are important indicators of how well the model fits the data.

model.fit(train_input, train_target)

coef_of_det = model.score(train_input, train_target)  # score is the Coefficient of Determination
sse_train = sum((model.predict(train_input) - train_target)**2)
print("Intercept: ", model.intercept_)
print("Coefficient: ", model.coef_)
print("Sum of Square Error of Training Data: ", sse_train)
print("R square: ", coef_of_det)
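For intuition about what fitting does, an ordinary least-squares solution can be sketched with `np.linalg.lstsq` on a design matrix that has a leading column of ones. The sketch uses synthetic data generated from known weights, not train_data.csv, so its numbers are unrelated to the results of this tutorial, and it is not meant to describe how scikit-learn is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 20 rows, 3 input variables, generated from known weights without noise
true_intercept = 5.0
true_coef = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(20, 3))
y = true_intercept + X @ true_coef

# Add the constant column x_0 = 1, then find the weights that minimize the SSE
X_design = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, coef = w[0], w[1:]
sse = float(np.sum((X_design @ w - y) ** 2))
print(intercept, coef, sse)
```

Because the synthetic targets contain no noise, the recovered intercept and coefficients should match the known weights up to rounding error.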

Great! If you are using the default IDLE, you should already have the results. If not, run the script from a terminal:
python lin_reg.py

Now we can assess the performance of the model and how well it fits the training data. The result is as follows:
Intercept:  [ 802.27182655]
Coefficient:  [[ -23.92508808  159.8641272     0.16017803]]
Sum of Square Error of Training Data:  [ 2195542.00286537]
R square:  0.904772738595

5. Predicting new data
We can see that R-square (the Coefficient of Determination) is about 0.9, which means the model fits the training data quite well. In general practice, the parameters of a model need to be tuned to get the best model (your own objective determines the indicators of goodness). For now, assume that the model we have trained is good enough to estimate the amount of consumption.

# predict the testing data
test_data = pd.read_csv("test_data.csv")
test_input = np.array(test_data.iloc[:, 2:5])  # same input columns as the training data
test_target = model.predict(test_input)
for i in range(len(test_data)):
    print(test_data.name[i], ": ", test_target[i][0])
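A prediction is simply $h_w(x)$ evaluated with the fitted parameters. As a sketch, the intercept and coefficients printed in step 4 can be applied by hand to a single member. The member below (age 30, male, salary 20000) is hypothetical and is not taken from test_data.csv.

```python
import numpy as np

# Intercept and coefficients as printed in step 4 (order: age, gender, salary)
intercept = 802.27182655
coef = np.array([-23.92508808, 159.8641272, 0.16017803])

# Hypothetical member for illustration: age 30, male (1), monthly salary 20000
member = np.array([30.0, 1.0, 20000.0])

predicted_amount = intercept + float(coef @ member)  # w_0 + w_1*x_1 + w_2*x_2 + w_3*x_3
print(round(predicted_amount, 2))
```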

Running the script prints the predicted monthly consumption of the five members:
Uriah :  5177.66325009
Vanessa :  5153.03590309
Wayne :  8277.20964953
Yoyo :  5666.82497116
Zack :  4872.9185399

Good job! You have built a simple machine learning model for prediction. I hope you now understand more about linear regression and its implementation in machine learning. In fact, there are many algorithms for prediction besides linear regression. So, I will introduce some more advanced algorithms for predicting target variables, such as artificial neural networks.


