Posts

Showing posts from July, 2015

Python - 利用Dictionary處理數據

Image
本文章將分享如何利用編程技術更有效地處理原始數據。 請參考 之前的文章 安裝Python和相關模組,並下載原始數據(JSON格式)。 在剛才的例子中,10個成員的年齡被提取並存儲在一個數組中。 在會員資料中,用戶有1個Primary Key(會員ID)和3個Attributes(姓名,性別,年齡), 因此我們需要建立三組陣列(List),並把這些資料儲存在它們中。 > (name, gender, age, ...) = ([], [], [], ...) # multiple declare the arrays > for i in range(len(raw_data["profile"])):        name.append(raw_data["profile"][i]["name"])        gender.append(raw_data["profile"][i]["gender"])        age.append(raw_data["profile"][i]["age"]) Another way is that a dictionary with key value can be used to store the member list from JSON data. 除了List,Dictionary也是用作儲存多個Elements的Data Type,以Key和Value作為一組來儲起每一個會員的資料,其格式跟JSON一樣,所以我們只需輸入已知的Key便能獲得它的Value。 > from pprint import pprint  > members = {}    # declare a dictionary to store the member > for i in range(len(raw_data["profile"])):       mem_id = raw_data["profile"][i]...

Python - Data Processing with Dictionary

Image
This chapter is to share how to leverage the programming techniques to process the raw data more effectively. Please refer the previous articles to install the python, related modules and download the raw data in JSON format. In the previous examples, the age of 10 members are extracted and stored in an array. As we all know, there are 1 key(Member ID) and 3 attributes of a user(Name, Gender, Age). in the members' profile. Three lists/arrays needs to be declared and stored such information. > (name, gender, age, ...) = ([], [], [], ...) # multiple declare the arrays > for i in range(len(raw_data["profile"])):        name.append(raw_data["profile"][i]["name"])        gender.append(raw_data["profile"][i]["gender"])        age.append(raw_data["profile"][i]["age"]) Another way is that a dictionary with key value can be used to store the member list from JSON data. After that, summary st...

數據淨化 - 提高你的數據質量

Image
很多時候,由於系統故障或人為輸入錯誤,產生一些不準確或錯誤的數據。事實上,這類意外的錯誤通常發生在我們的數據庫中,並在整個記錄中佔領一定的比例。 正所謂“高質量的數據能衍生高質的分析結果”,尤其是當我們製作統計報告的時候,任何缺失或重複的記錄也會誤導統計結果。例如,我們可能從交易資料庫中發現找到一些交易為零的記錄。 以機器學習(Machine Learning)的預測模型為例,由於一些極端的數會影響模型的預測表現,所以我們會刪除這些數值,避免影響該模型的性能。 我在唸大學期間,往往忽略了數據淨化的重要性。原因似乎是我看重了研究數據分析的方法而忽略數據的處理。直到我工作以後,才領略到先準備和處理數據的重要性。但在真實的工作環境中,要知道我們是不可能擁有具備完美的原始數據。在資料庫中提取了數據後,你仍然會發現在記錄中會存在一些意想不到的和不合理的數值。為了做統計分析和數據挖掘,並獲取可信賴的結果,我們必須清理和處理這些錯誤的數據。 但是,要怎樣評估數據或信息是否已經足夠好以作分析?數據的屬性可以用該行業知識和經驗去判斷和驗證。儘管一些該行業的專業人士沒有任何統計知識,他們依然有能力去判斷數值的真偽。所以有一些文章指出,擁有相應的行業和業務知識的數據科學家本身也是一項優勢。 數據的質量 以下是判斷數據質量的指標: 1.準確度 準確的數據可以反映在實際環境中的對象或事件。在錯誤的資料中,我們怎麼會獲取準確的結果呢?不準確的數據往往帶來不恰當的和誤導性的信息。 2.有效性 資料可以由各種各樣的方式來驗證,包括利用其類別,格式,邏輯或者總數等。例如,在某物件的數量發現有非整數值,此記錄可被視為無效。 3.一致性 來自不同來源的數據經過整合後,數據應與數據源一致。假設你既是高中或大學畢業生,又是某家商店的會員,但會員證仍然顯示“學生”,這就是不一致的數據。 4.冗餘性 愈有質量的數據含有愈少量的冗餘記錄,在一項記錄中,具獨一無二特性的數值是不能重複的,因為這會導致偏差也影響到結果的準確性。如果一張發票中存在兩個相同的交易記錄,您的客戶當然會埋怨,及有權拒絕支付雙倍費用。當你分析銷售記錄時,交易總額亦會因此而出錯。 5.完整性 完整的數據,內容應該符合使用者的期望。在網上登記表格中總會有一些必填項目,以確保接收者(...

Data cleansing - Enhance your data quality

Image
Most of time, there is inaccurate or wrong data due to system bugs or human-input error. In fact, these kind of unexpected error usually occurs in our database and occupied certain proportion of overall observations. "Quality data, quality results." This statement is valid especially when we generate the summary report with statistics. Any missing or duplicated records could mislead the statistical results. Additionally, we sometimes find the abnormal value for some attribute (say zero transaction amount)  in a record. For predictive modelling in machine learning, outliers with extreme value are kicked out besides the improper data to prevent from affecting the model performance. When studied in the university, I usually neglected the importance of cleansing the data. The reason seems to be my curiosity of studying the analytical method is far away from that of processing the data. Of course, I have learned the lesson from my own jobs and awarded the significance of dat...

MySQL - Cost-effective in RDBMS

Image
SQL, Structured Query Language, is the programming language to manage the data in relational database system (RDBMS). MySQL is also widely used in RDBMS and plays an significant role in LAMP, the software bundle which consists of four open-source software: Linux, Apache, MySQL, PHP. To beginners, SQL & MySQL is relatively monotonous and easier to grasp than other programming language. As a result, developers usually do not pay much attention to it unless some error occurs in the software.  Introduction Data Definition Language (DDL) A family of syntax for defining data structures, e.g. database schema CREATE: To make a new database, table, index or view ALTER: To modify the objects in database DROP: To remove an existing database, table, index, or view Data Manipulation Language (DML) A family of syntax elements to perform read-only queries of data SELECT: To query the record INSERT: To insert a new record UPDATE: To update the record DELETE: To remove the recor...

MySQL - 性價比高的關聯式數據庫

Image
筆者初學資料庫的時候,第一個接觸的數據庫語言正正就是SQL(Structured Query Language),用作跟 關聯式數據庫(Relational Database) 溝通的程式語言,一開始接觸覺得它在眾多程式語言中較容易上手,因為那些Keywords Syntax比較直覺和容易掌握,到現在也發覺這貨根本就是易學難精,尤其是要擷取大量數據又要保持良好的性能。而MySQL是其中一個Open Source的關聯式數據庫系統,效能高、成本低、可靠性好成為了其流行的原因。它也是LA M P(開發Apps/Web服務)的其中一位成員,負責為系統的後台儲存應有的數據。 簡介 數據定義語言 (DDL) 用於定義數據結構的語法系列,例如:database schema CREATE: 建立新的Database, Table, Index or View ALTER: 修改已有的Database設計 DROP: 移除現有的Database, Table, Index or View 數據操作語言 (DML) 讀取和查詢的語法系列 SELECT: 查詢記錄 INSERT: 插入新記錄 UPDATE: 更新記錄 DELETE: 刪除記錄 數據控制語言 (DCL) 控制數據庫中存儲的數據的語法系列 COMMIT: 永久性地更改數據庫 GRANT: 允許用戶執行特定任務 REVOKE: 拒絕先前的批准 ROLLBACK: 若未完成回到最初的起點 Create, modify and Delete 指令 描述 CREATE DATABASE <database_name>; 建立新的Database RENAME DATABASE <old_db_name> TO <new_db_name>; 更新Database的名稱 DROP DATABASE <database_name>; 刪除現有的Database CREATE TABLE <database_name> (<column_name1> <format> NOT NULL PRIMARY KEY...

Excel VBA - 自動生成分析報告

Image
案例分析 上一篇文章介紹了VBA跟Macro,也寫了一個簡單的應用,想必你們會覺得沒有挑戰性,那麼現在一起開發一個更實用和複雜的應月! 假設你已經懂得使用Excel的內置工作表函數(公式),突然你親愛的老闆要求你今天之內提交這個月的顧客分析的報告,儘管你忙於其他事情,但你也不能拒絕,又要保住飯碗,那怎麼辦了?如果你學會了VBA的話,這個問題便迎刃而解了!因為你只要按幾個鍵,便可以自動生成一份報告,你應該會覺得這是瘋了,當然在自動化之前,我們先要把報告的格式先寫好,然後再Macro把生成的動作記錄起來,並作出少許的修改,每次激活這個Macro便完成了!而現在,你手上有一連串的交易記錄(csv格式)和客戶的資料(xlsx格式),目標是用最快的速度去完成這個悶人的報表! 先給大家介紹這個應用的運作,根據下圖,我會先把客戶的資料貼在一個excel活頁簿裡面,之後把這一個月的銷售記錄再貼進去,已經寫好的Excel Formula會把資料會變成一個總結起來,而一些圖表,例如Pie Chart亦會顯示出來。整個過程只需要幾秒,可是製作過程可是大工程了,不過為了未來可以省點時間,這是很值得的! 流程: 準備 首先,先打開新的活頁簿 (Workbook) 和儲存檔案,格式為" 啟用巨集的活頁簿 "。因xlsx的格式不會儲存任何的巨集(Macro),會把你寫好的巨集,在你關掉這本子的時候自動洗掉,所以千萬要記得另存為xlsm的格式啊! Remember to save as Excel Macro-Enabled Workbook! 先從Dropbox下載 原始檔案 : 1. Members.xlsx:會員的資料 2. Transactions.csv: 六月份的交易記錄 製作範本 在寫巨集之前,要先設計好報告的樣式,先把三張工作表的名字改成:"Raw_data", "Member_profile" and "Report"。打開"Raw_data"的工作表,接著根據下圖,把第一列的儲存格改成,G至L欄是透過公式計算出來的,所以我先把它塗成了綠色,在G2至L2的儲存格寫下公式。 小提示:下面的公式是用作尋找和取得會員的資...

Excel VBA - Generate a report in dashboard format

Image
Application with Case Study For the example in previous articles, I bet you guys think it is not much challenging. Let's develop more complicated application! I am going to assume that you are competent to take the advantage of worksheet functions in Excel. Imagine that your boss requested you to submit a monthly report for customer analysis. Nevertheless, you now are busy enough but cannot refuse and even lose the job. Come on~poor guy! To prevent from this happen again, I strongly recommend to build a report template for next monthly reports. In this case, a lists of transaction records (Transaction.csv) and members data (Members.xlsx) in your hand. Of course, our target is to generate a complete report with raw data as soon as possible. Process Flow: Preparation First of all, to open a empty workbook and save as macro-enabled workbook (.xlsm) which allows to save the macros. Otherwise, you could not lose all macros after closing the workbook as excel workb...

Python - Multi-Athlon Programming

Image
It is no doubt that Python can facilitates the development the tools with the support from the huge library. Many friends have raised a question which is better for data analysis with R or Python. Actually, there are CRAN in R and Python Package in Python, so that almost all function of R which Python also could have. As a both back-end developer and data analyst, Python seems to be the best choice of programming language for me to process the data extraction, preparation, analysis and modeling simultaneously. Download and Installation Steps to install the python for different kinds of OS are as follows. Installation of PIP which is used for downloading other useful modules is also include in the steps. Mac 1. Download the 2.7.x version in https://www.python.org/downloads/ 2. Open the downloaded file and follow the instruction to install the python 3. Run the python in terminals $ python >>> quit()      Quit the Python Shell 4. Open term...