Pandas - make Data Frame as easy as R

For a programming beginner, one of the main difficulties is transforming or modifying raw data. Without the support of an installed library, this step is hard and time-consuming. If you are a loyal R user, you will know its special variable type, the data frame, which makes it convenient to retrieve attributes and their corresponding values for further analysis. Thanks to Pandas, a Python package, we can analyze data with the same kind of well-defined structure. Let's install it by entering one of the commands below in CMD or a terminal.

Installation of Pandas
Mac
$ sudo pip install pandas
Windows
> pip install pandas
Ubuntu
$ sudo apt-get install python-pandas
Application
Pandas also makes reading files (.json, .csv, .html, etc.) simpler. A .csv file is the easiest to handle because it already has a table-like structure, so I will use JSON data as the example instead. Recall the data source used in the previous tutorials: a series of members' information saved in a JSON file. Suppose you have saved the "data.json" file on your desktop. Time to open your Python editor and play with Pandas!
> import os, pandas
> desktop_dir = os.path.join(os.path.expanduser("~"),"Desktop")    # get the desktop directory
> jsonfile_dir = os.path.join(desktop_dir,"data.json")    # as data.json saved in desktop
> json_data = pandas.read_json(jsonfile_dir)    # json_data is Data Frame type
> print(json_data)    # parentheses work in both Python 2 and Python 3
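If you do not have data.json at hand, the same call can be tried on an in-memory string. This is only a minimal sketch: the sample records and the nested "profile" layout below are my assumption for illustration, and the real file may be laid out differently.

```python
import io
import pandas

# Hypothetical sample standing in for data.json; the real file's layout may differ.
sample_json = """
[
  {"profile": {"id": 1, "name": "Amy", "gender": "F", "age": 25}},
  {"profile": {"id": 2, "name": "Ben", "gender": "M", "age": 31}},
  {"profile": {"id": 3, "name": "Cat", "gender": "F", "age": 28}}
]
"""

# read_json accepts a file path or a file-like object such as StringIO
json_data = pandas.read_json(io.StringIO(sample_json))

print(type(json_data))   # the result is a pandas DataFrame
print(json_data.shape)   # (number of rows, number of columns)
```

With this sample layout, each record becomes one row and the nested object ends up in a single "profile" column.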



See? The output is a well-structured table. Next, we want the profiles of our members, so we use the syntax json_data.profile to select the profile column and then convert the result into another data frame variable.
> profile_data = json_data.profile    # extract only the profile column
> raw_data = pandas.DataFrame(profile_data)    # convert the data to a Data Frame structure
> print(raw_data)
> raw_data = raw_data[['id','name','gender','age']]    # reorder the attributes as I want
> print(raw_data)
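Whether the conversion above yields separate id, name, gender and age columns depends on how data.json is laid out and on your pandas version. As a hedged sketch, assuming each profile value is a dictionary (the sample rows below are made up for illustration), passing a list of those dictionaries to the DataFrame constructor gives one column per key:

```python
import pandas

# Hypothetical rows; each "profile" value is a dict, as in a nested JSON record.
json_data = pandas.DataFrame({
    "profile": [
        {"name": "Amy", "age": 25, "id": 1, "gender": "F"},
        {"name": "Ben", "age": 31, "id": 2, "gender": "M"},
        {"name": "Cat", "age": 28, "id": 3, "gender": "F"},
    ]
})

profile_data = json_data["profile"]                   # same column as json_data.profile
raw_data = pandas.DataFrame(profile_data.tolist())    # one column per dictionary key
raw_data = raw_data[["id", "name", "gender", "age"]]  # reorder the columns
print(raw_data)
```

Selecting with a list of column names, as in the last assignment, both filters and reorders the columns in one step.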



In the previous example, we needed to create a list or dictionary to store the attributes and their corresponding values. By contrast, a pandas data frame requires nothing of the kind. Suppose we want the average age of our members; it takes only a few lines of code:
> import numpy    # import the useful package for statistical analysis
> numpy.mean(raw_data.age)    # call function of NumPy to get the average age
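The same calculation can be checked on a small table built by hand. The member values here are invented for illustration; the sketch also shows that the column's own mean() method gives the same result as the NumPy call.

```python
import numpy
import pandas

# Hypothetical member table with the same columns as raw_data above.
raw_data = pandas.DataFrame({
    "id": [1, 2, 3],
    "name": ["Amy", "Ben", "Cat"],
    "gender": ["F", "M", "F"],
    "age": [25, 31, 28],
})

mean_np = numpy.mean(raw_data["age"])   # NumPy function applied to the column
mean_pd = raw_data["age"].mean()        # equivalent method on the pandas column
print(mean_np, mean_pd)
```

Either form works; the method version simply saves the extra import when pandas is already loaded.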

In conclusion, pandas makes data processing code simpler and more user-friendly. I strongly recommend using pandas to read and process your data, so that statistical analysis and machine learning can then be carried out with NumPy and scikit-learn.
