Python - Multi-Athlon Programming

There is no doubt that Python makes tool development easier, thanks to its huge collection of libraries. Many friends have asked me whether R or Python is better for data analysis. In fact, R has CRAN and Python has the Python Package Index (PyPI), so almost every function available in R has a counterpart in Python. As both a back-end developer and a data analyst, I find Python the best choice of programming language for handling data extraction, preparation, analysis and modeling in one place.

Download and Installation
The steps to install Python on different operating systems are as follows. They also cover the installation of pip, which is used to download other useful modules.

Mac
1. Download the 2.7.x version from https://www.python.org/downloads/
2. Open the downloaded file and follow the instructions to install Python
3. Run Python in the terminal:
$ python
>>> quit()      # Quit the Python shell


4. Open the terminal and install pip:
$ sudo easy_install pip
$ sudo pip install --upgrade pip
Windows
1. Download the 2.7.x version from https://www.python.org/downloads/
2. Open the downloaded file and follow the instructions to install Python
3. Set the PATH variable (Sorry that I cannot provide an English version, as my Windows laptop uses a Traditional Chinese interface)

Open My Computer, right-click and select the last option


Click the button in the red circle to change the PATH variable

Select the Path variable and append ";C:\Python27\;C:\Python27\Scripts" to the end of its value.


4. Open CMD and type:
> python
>>> quit()     # Quit the Python shell

Python now runs successfully in CMD!
5. Download get-pip.py
6. Open CMD and type (assuming you saved the file on the Desktop):
> python <directory>\get-pip.py


7. Upgrade pip by typing this command in CMD:
> python -m pip install -U pip
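
If "python" or "pip" is not recognized in CMD, the PATH change in step 3 is the usual suspect. The sketch below is one possible way to check it from Python itself. The helper name missing_from_path is my own, and the two folders assume the default Python 2.7 install location used above.

```python
import os

def missing_from_path(required_dirs, path_value=None, sep=os.pathsep):
    """Return the folders in required_dirs that are not listed in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def normalise(folder):
        # Windows ignores case and a trailing slash in folder names.
        return folder.strip().rstrip("\\/").lower()

    entries = set(normalise(p) for p in path_value.split(sep) if p.strip())
    return [d for d in required_dirs if normalise(d) not in entries]

# The two folders from step 3 (default location of a Python 2.7 install).
required = ["C:\\Python27", "C:\\Python27\\Scripts"]

# An empty list means both folders are on PATH.
print(missing_from_path(required))
```

Any folder printed in the list still needs to be added to the PATH variable. Remember to reopen CMD after changing it.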

Ubuntu
1. Install both Python and pip in the terminal:
$ sudo apt-get install python-setuptools python-pip python-dev build-essential
2. Start Python in the terminal:
$ python
Python and pip now run successfully as well
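
Whichever OS you use, a short script can confirm that both the interpreter and pip are in place. This is only a sketch: the function name pip_version is my own, and it simply runs "python -m pip --version" with the interpreter that is executing the script.

```python
import subprocess
import sys

def pip_version():
    """Return pip's version line, or None if pip cannot be run."""
    try:
        out = subprocess.check_output(
            [sys.executable, "-m", "pip", "--version"],
            stderr=subprocess.STDOUT)
    except (subprocess.CalledProcessError, OSError):
        return None
    return out.decode("utf-8", "replace").strip()

# Show which interpreter is running and whether pip answers.
print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
print("pip: %s" % (pip_version() or "not found"))
```

If the second line says "not found", repeat the pip step for your OS.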

Application for Data Analysis
Data wrangling, as data scientists call it, is the work of importing and reading data sources before cleansing and analyzing them. Moreover, Django, a popular web framework written in Python, can also act as the API (application programming interface) between apps and the server. Under such a framework, data are serialized into JSON format and transmitted to the target location.
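
The serialization step can be illustrated without Django, using only the json module from the standard library. The customer record below is made up for illustration, and this is a sketch of the idea rather than how a Django API is actually wired.

```python
import json

# Hypothetical customer record; the field names are illustrative only.
record = {"name": "Alice", "age": 30, "vip": True}

payload = json.dumps(record)      # Python dict -> JSON text, ready to send
restored = json.loads(payload)    # JSON text -> Python dict, on the receiver

print(payload)
print(restored == record)
```

The sender calls json.dumps and the receiver calls json.loads, so both sides end up with the same structure.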

Read JSON
Assume we have downloaded the JSON file to the Desktop. To decode the JSON in Python, we need a function that reads the JSON file.


Please download the source files here (GitHub). The checklist is:
1. data.json: Sample customer data in JSON format.
2. read_json.py: Python file containing the function for importing JSON
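
The actual read_json.py lives in the repository and is not reproduced here. As a rough idea of what such a file might contain, here is a minimal sketch, assuming import_json does nothing more than open the file and decode it with the standard json module. The real file may differ.

```python
import json

def import_json(path):
    """Read the JSON file at `path` and return the decoded Python object."""
    with open(path, "r") as f:
        return json.load(f)
```

A JSON object comes back as a Python dict and a JSON array as a Python list, so the result can be indexed right away.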

Suppose you have saved these files on the Desktop. Open the terminal (command prompt) or the IDLE installed earlier, start Python and type the following (the text after '#' is my comment):
>>> import sys, os, pprint    # Import built-in Python modules
>>> # Build the Desktop path and store it in the variable desktop_dir
>>> desktop_dir = os.path.join(os.path.expanduser("~"), "Desktop")
>>> sys.path.insert(0, desktop_dir)    # Add the Desktop folder to the module search path
>>> import read_json    # Import read_json.py
>>> # Call the function import_json and store its output in the variable raw_data
>>> raw_data = read_json.import_json(os.path.join(desktop_dir, "data.json"))
>>> pprint.pprint(raw_data)    # Pretty-print the imported data
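
Why is sys.path.insert needed? Python only looks for modules in the folders listed in sys.path, and the Desktop is not one of them by default. The sketch below shows the effect with a throwaway folder instead of the Desktop. The module name demo_module_for_path is made up for this illustration.

```python
import os
import sys
import tempfile

# Create a temporary folder holding a tiny module (hypothetical name).
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "demo_module_for_path.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

# Putting the folder first in sys.path lets "import" find the module.
sys.path.insert(0, folder)
import demo_module_for_path

print(demo_module_for_path.GREETING)
```

Without the sys.path.insert line, the import would fail with an ImportError.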

Tips:
For more details on the standard library modules, visit https://docs.python.org/2/library/index.html

Good job! You have imported the data successfully! Next, we can do some analysis on the data. Let's be the data jungler!


Data Statistics Analysis
Let's do some simple statistics!
Please install the NumPy and SciPy modules first.

Mac
$ pip install numpy scipy
Windows
> pip install numpy
Download the SciPy installation package, open it and follow the instructions to install SciPy.
Ubuntu
$ sudo apt-get install python-numpy python-scipy python-matplotlib
Great! Now the data can be summarized. Thanks to these modules, you can skip programming the tedious formulas yourself:
>>> import numpy, scipy
>>> age = []    # Declare an empty list
>>> for i in range(len(raw_data["profile"])):    # Put the ten customers' ages into the list
...     age.append(raw_data["profile"][i]["age"])
...
>>> print(numpy.mean(age))    # Call the NumPy function to get the mean of age
>>> print(numpy.std(age, ddof=1))    # Get the sample standard deviation of age
Tips:
len(array): the built-in function that returns the number of elements in the array (a Python list).
array.append: the list method that adds an element to the end of the array.

In fact, the code in line 3 could be replaced by "for i in range(10):". However, what happens if the number of customers in our data changes? To avoid hard-coding, "10" is replaced by "len(raw_data["profile"])", which returns the number of customers.
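
To see what the modules save you from, here is a sketch of the same two statistics written by hand in plain Python. The three ages are made up (they are not the contents of data.json), and the helper names mean and sample_std are my own. The list comprehension is another way to avoid both the index i and the hard-coded count.

```python
import math

def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return float(sum(values)) / len(values)

def sample_std(values):
    """Sample standard deviation (divides by n - 1, like ddof=1 in NumPy)."""
    m = mean(values)
    squared_deviations = [(v - m) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))

# Made-up data in the same shape as raw_data["profile"][i]["age"].
raw_data = {"profile": [{"age": 20}, {"age": 30}, {"age": 40}]}

# Collect the ages without an index variable or a fixed customer count.
age = [customer["age"] for customer in raw_data["profile"]]

print(mean(age))        # 30.0
print(sample_std(age))  # 10.0
```

With ddof=1, numpy.std divides by n - 1 in the same way, which is why it gives the sample standard deviation rather than the population one.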

You may also want to visit http://docs.scipy.org/doc/ for more statistical function references.


I hope this material is useful for analysts who are learning programming. Should you have any questions, please feel free to leave a comment. I will also demonstrate applications of advanced statistics and machine learning in Python later on.
