Pandas - Fast Data Processing and Transformation

- September 15, 2015

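For programming beginners, one common hurdle is converting or modifying raw data. Without the support of a package, this is difficult and time-consuming. If you are a loyal R user, you will know that R has a special data type: the Data Frame. Its structure closely resembles a table in a database, with fields and records, and it makes processing and transformation steps very convenient, so you can move on to further analysis. Python, thanks to the Pandas package, lets us use the same data structure for analysis. We can install it by entering one of the following commands in CMD / Terminal.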

Installing Pandas
Mac

$ sudo pip install pandas

Windows

> pip install pandas

Ubuntu

$ sudo apt-get install python-pandas
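
To confirm that the installation worked, a quick check along these lines can be run from Python. This is only a minimal sketch; the variable name `installed_version` is illustrative and not part of the original walkthrough.

```python
import pandas

# pandas exposes its version string as pandas.__version__
installed_version = pandas.__version__
print("pandas version:", installed_version)
```

If the import fails with an ImportError, the package was most likely installed for a different Python interpreter than the one being run.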

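Usage
Pandas also makes reading files (.json, .csv, .html, etc.) simpler. Files in CSV format are the easiest to handle because they are already laid out as a table, so I will not use a CSV file as the example. Instead, take a JSON file: using the data from our earlier article, the JSON file stores information about a series of members. Assuming you have saved "data.json" on your own Desktop, open your Python editor and see how much Pandas does for you!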

> import os, pandas
> desktop_dir = os.path.join(os.path.expanduser("~"),"Desktop")    # get the desktop directory
> jsonfile_dir = os.path.join(desktop_dir,"data.json")    # as data.json saved in desktop
> json_data = pandas.read_json(jsonfile_dir)    # json_data is Data Frame type
> print(json_data)    # print() with parentheses runs on both Python 2 and Python 3
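
For readers who do not have the members file at hand, the same step can be tried on a small stand-in file. The sketch below is an assumption-laden illustration: the records in `sample_members`, their field values, and the layout (a list of objects, each with a nested "profile") are hypothetical, and the real data.json may be organized differently. The file is written to a temporary folder rather than the Desktop.

```python
import json
import os
import tempfile

import pandas

# Hypothetical sample records standing in for the members file;
# the real data.json may be laid out differently.
sample_members = [
    {"profile": {"id": 1, "name": "Alice", "gender": "F", "age": 28}},
    {"profile": {"id": 2, "name": "Bob", "gender": "M", "age": 35}},
    {"profile": {"id": 3, "name": "Carol", "gender": "F", "age": 42}},
]

# Write the sample to a temporary folder instead of the Desktop
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "data.json")
with open(sample_path, "w") as handle:
    json.dump(sample_members, handle)

# read_json parses the file and returns a DataFrame
sample_frame = pandas.read_json(sample_path)
print(sample_frame)
print(sample_frame.columns.tolist())
```

With this layout the frame has one row per member and a single "profile" column.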

See that? The result is a well-structured table. To get our members' details, we write json_data.profile to select the "profile" column and store it in a variable.

> profile_data = json_data.profile    # only extract the profile of the data 
> raw_data = pandas.DataFrame(profile_data)    # convert the data to Data frame structure
> print(raw_data)
> raw_data = raw_data[['id','name','gender','age']]    # revise the order of attributes as I want
> print(raw_data)
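
The column extraction and reordering can be sketched on a self-contained frame as follows. The data and the names `json_like`, `profile_column`, and `members` are hypothetical. The sketch assumes each entry of the "profile" column is a dict; in that case, passing the Series straight to pandas.DataFrame yields a single column of dicts, so the sketch converts it to a list first so that each key becomes its own column.

```python
import pandas

# Hypothetical frame whose "profile" column holds one dict per member
json_like = pandas.DataFrame({
    "profile": [
        {"name": "Alice", "age": 28, "id": 1, "gender": "F"},
        {"name": "Bob", "age": 35, "id": 2, "gender": "M"},
        {"name": "Carol", "age": 42, "id": 3, "gender": "F"},
    ]
})

# Bracket access is equivalent to json_like.profile and also works
# when a column name clashes with a DataFrame attribute or method
profile_column = json_like["profile"]

# Expand the list of dicts so that every key becomes a column
members = pandas.DataFrame(profile_column.tolist())

# Selecting with a list of names returns the columns in that order
members = members[["id", "name", "gender", "age"]]
print(members)
```

Selecting with a list of column names both filters and reorders, so the same line can be used to drop attributes that are not needed.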

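In the earlier example, we had to create an Array or Dictionary to store the attributes and their values. With a Pandas Data Frame, we no longer need to declare a new Array or Dictionary before applying statistical functions directly. Suppose we want the average age of our members; a few lines of code are enough: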

> import numpy    # import the useful package for statistical analysis
> numpy.mean(raw_data.age)    # call function of NumPy to get the average age
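
As a rough illustration on made-up numbers, the same average can be computed either through NumPy or with the Series' own mean() method. The frame `ages_frame` and its values are hypothetical, not the members data from the article.

```python
import numpy
import pandas

# Hypothetical members table with an "age" column
ages_frame = pandas.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age": [28, 35, 42],
})

# NumPy accepts a pandas Series directly
mean_with_numpy = numpy.mean(ages_frame.age)

# The Series also has its own mean() method
mean_with_pandas = ages_frame["age"].mean()
print(mean_with_numpy, mean_with_pandas)

# describe() returns count, mean, std, min, quartiles, and max in one call
age_summary = ages_frame["age"].describe()
print(age_summary)
```

Both calls give the same value here; describe() is handy when several summary statistics are wanted at once.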

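In short, Pandas is simpler and more user-friendly to work with, especially for data processing. I strongly recommend using Pandas to read and process data, then using NumPy and scikit-learn for statistical analysis and machine learning.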
