Pandas - make Data Frame as easy as R

- September 15, 2015

For a programming beginner, one of the main difficulties is transforming or modifying raw data. Without the support of a library, this part is hard and time-consuming. If you are a loyal R user, you will know that R has a special variable type: the data frame. It makes it convenient to retrieve attributes and their values and to carry out further analysis. Thanks to pandas, a Python package, we can work with the same kind of well-defined data structure in Python. Let's install it by entering the command below in CMD / terminal.

Installation of Pandas
Mac

$ sudo pip install pandas

Windows

> pip install pandas

Ubuntu

$ sudo apt-get install python-pandas
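
Whichever command you used, a quick way to check that the installation worked is to import the package and print its version. This is only a minimal check; the version number you see depends on your system.

```python
import pandas

# If the import succeeds, pandas is installed; the version string varies by machine.
print(pandas.__version__)
```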

Application
Pandas also makes reading files (.json, .csv, .html, etc.) simpler. A .csv file is the easiest to handle, since it is already in a table-like structure, so I will use JSON data as the example instead. As in the data sources used in previous tutorials, a series of members' information is saved in a JSON file. Suppose you have saved the "data.json" file on your own desktop. Time to play with pandas, so open your Python editor!

> import os, pandas
> desktop_dir = os.path.join(os.path.expanduser("~"), "Desktop")    # get the desktop directory
> jsonfile_dir = os.path.join(desktop_dir, "data.json")    # full path to data.json on the desktop
> json_data = pandas.read_json(jsonfile_dir)    # json_data is a DataFrame
> print(json_data)    # print() with parentheses works in both Python 2 and Python 3
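
If you do not have data.json at hand, the same call can be tried on an in-memory sample. The sketch below is only an illustration: the member records and the layout (a single "profile" key holding a list of records) are assumptions, since the actual file is not shown here, and your data.json may be organized differently.

```python
import io
import pandas

# Hypothetical stand-in for data.json; the real file's layout may differ.
sample_json = """
{
  "profile": [
    {"id": 1, "name": "Amy", "gender": "F", "age": 25},
    {"id": 2, "name": "Ben", "gender": "M", "age": 31},
    {"id": 3, "name": "Cat", "gender": "F", "age": 40}
  ]
}
"""

# read_json accepts a file path or a file-like object such as StringIO.
json_data = pandas.read_json(io.StringIO(sample_json))

print(type(json_data))          # the result is a pandas DataFrame
print(list(json_data.columns))  # one column, "profile", holding one record per row
print(json_data)
```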

See? A well-structured table is shown in the output. Next, we want the profiles of our members. The syntax json_data.profile selects the profile column, which we then convert into a new data frame.

> profile_data = json_data.profile    # extract only the profile column
> raw_data = pandas.DataFrame(profile_data)    # convert the data to a DataFrame
> print(raw_data)
> raw_data = raw_data[['id','name','gender','age']]    # reorder the attributes as I want
> print(raw_data)
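
One caveat: if each entry of the profile column is a dictionary, wrapping the column directly in pandas.DataFrame may keep everything in a single column rather than one column per attribute. A hedged sketch of one way around this, using made-up member records (the names and values are assumptions, not the tutorial's real data), is to convert the column to a list of dictionaries first:

```python
import pandas

# Hypothetical profile column: one dictionary per member.
profile_data = pandas.Series(
    [
        {"name": "Amy", "age": 25, "id": 1, "gender": "F"},
        {"name": "Ben", "age": 31, "id": 2, "gender": "M"},
        {"name": "Cat", "age": 40, "id": 3, "gender": "F"},
    ],
    name="profile",
)

# A list of dictionaries becomes a table with one column per key.
raw_data = pandas.DataFrame(profile_data.tolist())

# Selecting with a list of names returns the columns in exactly that order.
raw_data = raw_data[["id", "name", "gender", "age"]]
print(raw_data)
```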

In the previous example, we needed to create a list or dictionary to store the attributes and their values. By contrast, a pandas data frame requires no such bookkeeping. Suppose we want the average age of our members; it takes only a couple of lines:

> import numpy    # import the useful package for statistical analysis
> numpy.mean(raw_data.age)    # call function of NumPy to get the average age
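
For comparison, the same average can also be computed with the column's own mean() method, without going through NumPy. The small table below is hypothetical and only serves to make the sketch runnable on its own.

```python
import numpy
import pandas

# Hypothetical member table with the same attributes as above.
raw_data = pandas.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Amy", "Ben", "Cat"],
        "gender": ["F", "M", "F"],
        "age": [25, 31, 40],
    }
)

mean_numpy = numpy.mean(raw_data["age"])  # NumPy function applied to the column
mean_pandas = raw_data["age"].mean()      # equivalent method on the column itself

print(mean_numpy, mean_pandas)
```

Both approaches give the same number; which one to use is mostly a matter of taste.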

In conclusion, pandas is user-friendly and keeps data-processing code simple. I strongly recommend using pandas to read and process the data, so that statistical analysis and machine learning can then be carried out with NumPy and scikit-learn.
