Python - Data Processing with Dictionary

This chapter shares how to use programming techniques to process raw data more effectively. Please refer to the previous articles to install Python and the related modules, and to download the raw data in JSON format.

In the previous examples, the ages of 10 members were extracted and stored in an array. Each member profile has 1 key (Member ID) and 3 attributes (Name, Gender, Age). To store this information, three lists/arrays need to be declared:
> name, gender, age = [], [], []    # declare the three lists in one statement
> for i in range(len(raw_data["profile"])):
       name.append(raw_data["profile"][i]["name"])
       gender.append(raw_data["profile"][i]["gender"])
       age.append(raw_data["profile"][i]["age"])
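
For readers without the JSON file at hand, here is a minimal sketch of the same idea on a tiny in-memory sample. The three records in `sample_data` are invented for illustration; only the keys ("profile", "name", "gender", "age", "id") come from the code above. It loops over the profiles directly instead of by index, which is one possible style rather than the required one.

```python
# Hypothetical sample that mimics the structure of raw_data; the values are made up.
sample_data = {
    "profile": [
        {"id": 1, "name": "Amy", "gender": "F", "age": 25},
        {"id": 2, "name": "Ben", "gender": "M", "age": 13},
        {"id": 3, "name": "Cat", "gender": "F", "age": 19},
    ]
}

# One list per attribute, filled in the same order as the records.
name, gender, age = [], [], []
for profile in sample_data["profile"]:
    name.append(profile["name"])
    gender.append(profile["gender"])
    age.append(profile["age"])

print(name)
print(gender)
print(age)
```

Note that the three lists are only linked by position: index 0 of each list belongs to the same member, and nothing ties them to the member ID.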

Alternatively, a dictionary keyed by member ID can store the member list from the JSON data. Summary statistics can then be generated more conveniently.
> from pprint import pprint
> members = {}    # declare a dictionary to store the members
> for i in range(len(raw_data["profile"])):
      mem_id = raw_data["profile"][i]["id"]    # mem_id is the key of the dictionary
      name = raw_data["profile"][i]["name"]
      gender = raw_data["profile"][i]["gender"]
      age = raw_data["profile"][i]["age"]
      members[mem_id] = (name, gender, age)    # store the attributes as a tuple under the key
> members = sorted(members.items())    # sorted() returns a list of (key, value) tuples ordered by key
> pprint(members)    # print the sorted list of (mem_id, (name, gender, age)) pairs
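
The dictionary approach can also be sketched on a small in-memory sample. The records and the ID format ("M001" and so on) are hypothetical; the real IDs in the JSON file may look different. The sketch shows two things: a dictionary gives direct lookup by key, and `sorted()` applied to `items()` returns a list of (key, value) tuples, not a dictionary.

```python
from pprint import pprint

# Hypothetical records in the same shape as raw_data["profile"]; values are invented.
sample_profiles = [
    {"id": "M003", "name": "Cat", "gender": "F", "age": 19},
    {"id": "M001", "name": "Amy", "gender": "F", "age": 25},
    {"id": "M002", "name": "Ben", "gender": "M", "age": 13},
]

# Build the dictionary in one pass: member ID -> (name, gender, age)
members_by_id = {
    p["id"]: (p["name"], p["gender"], p["age"]) for p in sample_profiles
}

# Direct lookup by key, with no need to scan a list
amy = members_by_id["M001"]

# Sorting the items yields a list of (key, value) tuples ordered by key
sorted_members = sorted(members_by_id.items())
pprint(sorted_members)
```

Keeping the dictionary and the sorted list under separate names, as done here, makes it possible to keep looking members up by ID after sorting.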



Great! The data is loaded into the dictionary. Do you remember that we calculated the mean of the ages? Right, the formula is: sum of the values / count (number of values).
> sum_age = 0
> for member in members:
        sum_age += member[1][2]    # member[1] is (name, gender, age); index 2 is the age
> sum_age / len(members)    # len is a built-in function that returns the number of elements in the list
# Loop over the 10 records to sum the ages, then divide by the count to get the average

This is simplified to:
> import numpy
> numpy.mean([member[1][2] for member in members])
# Get the mean of the 10 ages collected by the list comprehension

To understand the variables: after sorting, "members" is a list of (key, value) pairs, and "member" is one of those pairs. member[0] is the key (the member ID), while member[1] holds the values as a tuple: (name, gender, age). The age sits at index 2 of that tuple, so member[1][2] gives the age of that member.
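
As an optional aside, the same indexing can be written with tuple unpacking, which some readers find easier to follow than member[1][2]. This is a sketch on a hypothetical list in the same shape as the sorted "members"; the IDs and values are invented.

```python
# Hypothetical list shaped like the result of sorted(members.items())
sample_members = [
    ("M001", ("Amy", "F", 25)),
    ("M002", ("Ben", "M", 13)),
    ("M003", ("Cat", "F", 19)),
]

first = sample_members[0]
first_id = first[0]        # the key: member ID
first_age = first[1][2]    # the value tuple, then index 2: age

# Index-based access, as in the article
ages_by_index = [member[1][2] for member in sample_members]

# Equivalent access by unpacking each (key, (name, gender, age)) pair
ages_by_unpacking = [age for _mem_id, (_name, _gender, age) in sample_members]

print(ages_by_index)
print(ages_by_unpacking)
```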

In the latter method, [member[1][2] for member in members] is a list comprehension that returns the list [25, 13, 19, 26, 32, 29, 21, 24, 22, 19]. Passing that list to numpy.mean([...]), mentioned in the previous post, returns its mean. Thanks to this NumPy function, we can get the average value in one line.
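
The list of ages quoted above can be checked by hand. The sketch below uses the standard library's `statistics.mean` instead of `numpy.mean`, purely so that it runs without NumPy installed; it is a stand-in, not the method used in the article.

```python
from statistics import mean

# The ten ages quoted in the text above
ages = [25, 13, 19, 26, 32, 29, 21, 24, 22, 19]

# Formula from the article: sum of the values / count
total = sum(ages)
manual_mean = total / len(ages)

# Same calculation via a library function (stdlib stand-in for numpy.mean)
library_mean = mean(ages)

print(total, len(ages), manual_mean, library_mean)
```

Both approaches agree: the ages sum to 230 over 10 members, giving a mean of 23.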


