Data Jungler

Posts

Showing posts from September, 2015

Pandas - make Data Frame as easy as R

- September 15, 2015

For a programming beginner, one of the main difficulties is transforming or modifying raw data. Without the support of an installed library, this part is hard and time-consuming. If you are a loyal R user, you will know that R has a special variable type: the data frame. It makes it convenient to retrieve attributes and their corresponding values for further analysis. Thanks to Pandas, a Python package, we can analyze data with the same well-defined structure. Install it by entering one of the commands below in CMD / Terminal.

Installation of Pandas

Mac
$ sudo pip install pandas

Windows
> pip install pandas

Ubuntu
$ sudo apt-get install python-pandas

Application

Pandas also simplifies reading files (.json, .csv, .html, etc.). A .csv file is easier to handle because it is already in a table-like structure, so I will use JSON data as the example instead. Refer to t...
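The data frame idea in this excerpt can be sketched in a few lines. This is only an illustration: the member records, column names, and values below are made up, and the R equivalents in the comments are rough analogies rather than exact matches.

```python
import pandas as pd

# Hypothetical member records; names and values are invented for illustration.
members = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cat"],
        "age": [28, 35, 41],
        "city": ["Hong Kong", "Taipei", "Tokyo"],
    }
)

# Attribute (column) names, roughly comparable to names(df) in R.
column_names = list(members.columns)

# The values of one attribute, roughly comparable to df$age in R.
ages = members["age"]

# Records that satisfy a condition, roughly comparable to df[df$age > 30, ] in R.
over_30 = members[members["age"] > 30]

print(column_names)
print(ages.mean())
print(over_30)
```

Each column behaves like a typed vector, so summaries such as the mean and row filters work without writing explicit loops.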

Pandas - Process and Transform Data Quickly

- September 15, 2015

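For a programming beginner, one of the difficulties is transforming or modifying raw data. Without the support of a package, this is hard and time-consuming. If you are a loyal R user, you will know that R has a special data type: the Data Frame. Its structure is much like a table in a database, with fields and records, and it is very convenient for processing and transformation, so you can move on to further analysis. Thanks to the Pandas package, Python lets us use this data structure for analysis too. We can install it by entering one of the following commands in CMD / Terminal.

Installing Pandas

Mac
$ sudo pip install pandas

Windows
> pip install pandas

Ubuntu
$ sudo apt-get install python-pandas

Application

Pandas also makes reading files (.json, .csv, .html, etc.) simpler. A file in csv format is easier to handle because it is already laid out as a table, so I will not use a CSV file as the example. Taking JSON as the example instead, and referring to the data used in our previous article, the json file stores information about a series of members. Assuming you have saved the file "data.json" on your own Desktop, open your Python editor and feel the powerful support of Pandas!

> import os, pandas
> desktop_dir = os.path.join(os.path.expanduser("~"), "Desktop")  # get the desktop directory
> jsonfile_dir = os.path.join(desktop_dir, "data.json")  # data.json is saved on the desktop
> json_data = pandas.read_json(jsonfile_dir)  # json_data is a DataFrame
> print(json_data)

See? The result is a well-structured table. To get our members' details, we write json_data.profile to obtain the "P...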
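Since the layout of data.json is not shown in this excerpt, the sketch below writes a small stand-in file to a temporary folder and reads it back. The record fields ("id", "profile", "name", "age") are assumptions made for illustration only, not the structure of the original file.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for data.json; the real file's layout is not shown.
records = [
    {"id": 1, "profile": {"name": "Ann", "age": 28}},
    {"id": 2, "profile": {"name": "Bob", "age": 35}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "data.json")
    with open(json_path, "w") as handle:
        json.dump(records, handle)

    # read_json returns a DataFrame with one row per record.
    json_data = pd.read_json(json_path, orient="records")

# json_data.profile and json_data["profile"] refer to the same column.
profiles = json_data["profile"]

# Expand the nested dictionaries into their own columns.
profile_table = pd.DataFrame(profiles.tolist())

print(json_data)
print(profile_table)
```

Bracket access (json_data["profile"]) is the safer habit, because attribute access stops working when a column name clashes with an existing DataFrame method or contains spaces.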

Data Mining - Overview of the Data Mining Process

- September 04, 2015

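As mentioned in a previous article, data mining is the process of finding useful information in databases, or of extracting useful information from unstructured data. In practice, data mining models are not limited to applications of machine learning algorithms. So how does the method actually work? Look at the figure below to understand the process:

Data Integration

Data from different databases is first integrated and stored in a data storage system called a data warehouse. The purpose is to prepare the raw data for further processing. The table format used to store the data consists of attributes (columns) and records (rows).

Sampling

Most data mining models involve complex algorithms, and analysis becomes difficult when the volume of data is too large. For this reason, only part of the data is drawn for modeling, to save computation time. There are several sampling methods, such as random sampling, systematic sampling, and clustering, all of which are good ways to reduce the amount of data.

Data Exploration

To observe the structure of the data and check for missing or odd values, we explore the data first, using statistical techniques to analyze, summarize, and describe it. The most commonly used measures are the mean and the variance. Without doubt, presenting data with charts instead of numbers not only lets readers see the distribution more easily, but also helps package the results into a story for the audience.

Data Cleaning

In reality, there is no "perfect" data. Collected data inevitably contains missing values or outliers, so we need to clean it. The aim is to ensure data quality and thereby improve the accuracy of data mining. An earlier article introduces a series of methods for improving data quality.

Data Partitioning

During the mining process, part of the data is set aside to validate the model's accuracy or to check whether the model has overfitted. Because validating the model is not mandatory, whether to partition the data depends on the purpose of the analysis, and partitioning is not a required step. Generally, we can split the data source in a fixed ratio into three parts: training, validation, and test data sets. First, the target variable in the training data is used to train and build the required model. The validation data is then fed into the trained models to measure their performance (for example, classification or prediction accuracy), and the best model is selected. Finally, the test data is put into that best model to obtain and evaluate the results.

Modeling

Modeling is undoubtedly the most anticipated and fascinating part, because different models serve different purposes: some are used for classification, some for prediction, some for clustering, and so on. Models can also be divided into supervised learning, which includes logistic regression, artificial neural networks, K-nearest neighbors, and support vector machines, and unsupervised learning, which includes clustering and self-organizing maps. As for linear regression, ...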
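The training, validation, and test partition described in this excerpt can be sketched with pandas alone. The data set, the column names, and the 60/20/20 ratio below are assumptions chosen for illustration; the excerpt does not state a specific ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical data set: 100 records with one attribute and a target variable.
rng = np.random.default_rng(seed=0)
data = pd.DataFrame(
    {
        "feature": rng.normal(size=100),
        "target": rng.integers(0, 2, size=100),
    }
)

# Shuffle the records so that each partition is a random subset.
shuffled = data.sample(frac=1.0, random_state=0).reset_index(drop=True)

# Cut points for an assumed 60/20/20 split, using integer arithmetic.
n_records = len(shuffled)
train_end = n_records * 6 // 10
valid_end = n_records * 8 // 10

train = shuffled.iloc[:train_end]          # used to fit candidate models
valid = shuffled.iloc[train_end:valid_end]  # used to compare models and pick one
test = shuffled.iloc[valid_end:]            # used once, to evaluate the chosen model

print(len(train), len(valid), len(test))
```

Keeping the test partition untouched until the final step is what makes its score a fair estimate of how the chosen model behaves on unseen data.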

Data Mining - Overview of data mining process

- September 04, 2015

As mentioned in the previous article, data mining, also known as Knowledge Discovery in Databases, is the process of extracting useful information from unstructured data. Data mining models include, but are not limited to, machine learning algorithms. How does it work? To give the big picture of the data mining process, it is visualized below:

Data Integration

Data dispersed across multiple databases is integrated into, say, a data warehouse. The purpose is to prepare the raw data for further processing. In normal practice, the data is stored in a table format with attributes (columns) and records (rows).

Sampling

Most data mining models involve complex algorithms and heavy computation. If the data set is too large to analyze, sampling is indispensable for reducing computation time. Several methods, such as simple random sampling, systematic sampling, and clustering, can reduce the amount of data.

Data Exploring

So as to observe the structure of the data ...
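Two of the sampling methods named in this excerpt, simple random and systematic sampling, can be sketched as follows. The table of 1,000 records and the sample size of 100 are assumptions for illustration, and the systematic sample starts at the first record for simplicity, whereas a random starting point is also common.

```python
import pandas as pd

# Hypothetical table of 1,000 records.
records = pd.DataFrame({"record_id": range(1000)})
sample_size = 100

# Simple random sampling: every record has the same chance of being picked.
random_sample = records.sample(n=sample_size, random_state=42)

# Systematic sampling: take every k-th record, starting from the first one.
step = len(records) // sample_size
systematic_sample = records.iloc[::step]

print(len(random_sample), len(systematic_sample))
print(systematic_sample["record_id"].head(3).tolist())
```

Systematic sampling is cheap and keeps the original ordering, but it can give a biased sample when the data has a periodic pattern that lines up with the step size.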