Data Cleaning 2

Date: 2016-10-21 08:11:50

1. Several of the datasets contain duplicate values in the column we want to use as the shared key (DBN) for combining them. Before that column can serve as a unique key, we filter the DataFrame down to the rows we want.

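　　class_size = data["class_size"]
　　# keep only high-school rows (grades 9 to 12); the column name "GRADE " ends with a space
　　class_size = class_size[class_size["GRADE "] == "09-12"]
　　# keep only the general-education program rows
　　class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]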
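As a rough illustration of the filtering step, the sketch below applies the same two conditions to a toy DataFrame. The rows and the "AVERAGE CLASS SIZE" values are invented for the example, and the name toy_class_size is hypothetical; only the column names "GRADE " and "PROGRAM TYPE" come from the code above.

```python
import pandas as pd

# Toy stand-in for data["class_size"]; every row here is invented.
toy_class_size = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M292", "01M448"],
    "GRADE ": ["09-12", "09-12", "0K", "09-12"],  # column name ends with a space
    "PROGRAM TYPE": ["GEN ED", "CTT", "GEN ED", "GEN ED"],
    "AVERAGE CLASS SIZE": [22.0, 18.0, 19.0, 25.0],
})

# Both conditions combined into one Boolean mask.
# Each comparison needs its own parentheses because & binds tighter than ==.
mask = (toy_class_size["GRADE "] == "09-12") & (toy_class_size["PROGRAM TYPE"] == "GEN ED")
filtered = toy_class_size[mask]
print(filtered)
```

Filtering in two separate steps, as in the original code, gives the same rows; the single mask just avoids building an intermediate DataFrame.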

2. After filtering, we condense the remaining rows that share the same DBN into a single row each, using groupby() and an aggregate function.

　　import numpy as np
　　group_by = class_size.groupby("DBN") # group_by is a GroupBy object, not a DataFrame
　　class_size = group_by.aggregate(np.mean) # aggregate() reduces each group to one row; the index of class_size is now the grouping key (DBN)
　　class_size.reset_index(inplace=True) # reset_index() moves DBN back into a regular column and restores the default 0, 1, 2, ... index
　　data["class_size"] = class_size
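To make the effect on the index concrete, here is a minimal sketch on invented data. It swaps in GroupBy.mean(numeric_only=True) rather than aggregate(np.mean); that substitution is an assumption made so the example also runs on recent pandas versions, which no longer silently skip text columns when averaging. The names toy and averaged are hypothetical.

```python
import pandas as pd

# Invented rows: school 01M292 appears twice, 01M448 once.
toy = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "SCHOOL NAME": ["A", "A", "B"],
    "AVERAGE CLASS SIZE": [20.0, 24.0, 25.0],
})

grouped = toy.groupby("DBN")
# numeric_only=True averages the numeric columns and leaves out text columns.
averaged = grouped.mean(numeric_only=True)
print(averaged.index.name)  # the grouping key is now the index

# Move DBN back into a regular column and restore the default integer index.
averaged = averaged.reset_index()
print(averaged)
```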

3. Convert the numeric strings to numbers with the pd.to_numeric() function:

　　import pandas as pd
　　cols = ["AP Test Takers ", "Total Exams Taken", "Number of Exams with scores 3 4 or 5"]

　　for col in cols:
　　　　# errors="coerce" turns any value that cannot be parsed into NaN
　　　　data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
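A small sketch of what errors="coerce" does, using an invented column in which one entry is a placeholder string instead of a number. The values and the name toy_ap are assumptions for illustration only.

```python
import pandas as pd

# Invented values: "s" stands in for an entry that is not a number.
toy_ap = pd.DataFrame({"AP Test Takers ": ["39", "s", "255"]})

# Unparseable entries become NaN instead of raising an error.
toy_ap["AP Test Takers "] = pd.to_numeric(toy_ap["AP Test Takers "], errors="coerce")

print(toy_ap["AP Test Takers "].dtype)         # the column is now numeric
print(toy_ap["AP Test Takers "].isna().sum())  # number of entries that could not be parsed
```

Because NaN is a floating-point value, a column containing any coerced entries ends up as floats even when the remaining values are whole numbers.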

4. After cleaning each dataset, we would like to combine them so that we can plot them. We normally use the merge() function to combine two datasets.

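　　combined = data["sat_results"]

　　# with no "on" argument, merge() joins on the columns the two DataFrames have in common
　　combined = combined.merge(data["ap_2010"], how="left") # keep every row of combined
　　combined = combined.merge(data["graduation"], how="inner") # keep only rows present in both
　　print(combined.shape)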
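The difference between how="left" and how="inner" is easiest to see on a few rows. The sketch below uses three invented DataFrames (sat, ap, grad) and passes on="DBN" explicitly; the explicit key and all values are assumptions for the example, not taken from the original data.

```python
import pandas as pd

# Invented data: each frame covers a slightly different set of schools.
sat = pd.DataFrame({"DBN": ["01M292", "01M448", "01M450"], "sat_score": [1122, 1172, 1149]})
ap = pd.DataFrame({"DBN": ["01M448", "01M450"], "AP Test Takers ": [39.0, 19.0]})
grad = pd.DataFrame({"DBN": ["01M292", "01M448"], "grad_rate": [0.6, 0.7]})

# Left join: every school in sat is kept; schools missing from ap get NaN.
left = sat.merge(ap, on="DBN", how="left")

# Inner join: only schools present in both frames survive.
both = left.merge(grad, on="DBN", how="inner")

print(left.shape)
print(both.shape)
```

A left join is the safer choice when losing rows would be costly, at the price of NaN values to handle later; an inner join keeps the result complete but may drop schools.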

5. Finally, we want to extract part of the value in each row of a column by using the apply() function:

　　index = combined.index

　　def get_first_two_char(dbn):
　　　　# the first two characters of a DBN are the school district
　　　　return dbn[0:2]

　　combined["school_dist"] = combined["DBN"].apply(get_first_two_char) # when we would otherwise write a for loop over a DataFrame column, apply() is usually the simpler choice
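For plain string slicing like this, pandas also offers the vectorized .str accessor. The sketch below compares the two approaches on a few invented DBN values; it is offered as an alternative, not as the method the original code uses.

```python
import pandas as pd

# Invented DBN values for illustration.
dbn = pd.Series(["01M292", "02M047", "10X141"], name="DBN")

def get_first_two_char(value):
    # Return the first two characters of a single string.
    return value[0:2]

via_apply = dbn.apply(get_first_two_char)  # calls the function once per element
via_str = dbn.str[0:2]                     # slices every string via the .str accessor

print(via_apply.tolist())
print(via_apply.tolist() == via_str.tolist())
```

apply() remains the more general tool, since it accepts any Python function; the .str accessor only covers string operations.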


Original source: http://www.cnblogs.com/kingoscar/p/5983267.html
