Based on the translation by SeanCheney (a well-known author on Jianshu); I adjusted the formatting and restructured the table of contents to suit my own reading, so things are easier to find when I come back to them.
import pandas as pd
import numpy as np
Set the maximum number of columns and rows to display:
pd.set_option('max_columns', 5, 'max_rows', 5)
state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
state_fruit
 | Apple | Orange | Banana |
---|---|---|---|
Texas | 12 | 10 | 40 |
Arizona | 9 | 7 | 12 |
Florida | 0 | 14 | 190 |
DataFrame.stack(level=-1, dropna=True)
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:
The stack method pivots all the column names into one vertical level of the row index:
state_fruit.stack()
Texas Apple 12
Orange 10
...
Florida Orange 14
Banana 190
Length: 9, dtype: int64
Use reset_index() to turn the result back into a DataFrame:
state_fruit_tidy = state_fruit.stack().reset_index()
state_fruit_tidy
 | level_0 | level_1 | 0 |
---|---|---|---|
0 | Texas | Apple | 12 |
1 | Texas | Orange | 10 |
... | ... | ... | ... |
7 | Florida | Orange | 14 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
Rename the columns:
state_fruit_tidy.columns = ['state', 'fruit', 'weight']
state_fruit_tidy
 | state | fruit | weight |
---|---|---|---|
0 | Texas | Apple | 12 |
1 | Texas | Orange | 10 |
... | ... | ... | ... |
7 | Florida | Orange | 14 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
You can also use rename_axis to name the levels of the row index:
state_fruit.stack().rename_axis(['state', 'fruit'])
state fruit
Texas Apple 12
Orange 10
...
Florida Orange 14
Banana 190
Length: 9, dtype: int64
Then use reset_index again:
state_fruit.stack().rename_axis(['state', 'fruit']).reset_index(name='weight')
 | state | fruit | weight |
---|---|---|---|
0 | Texas | Apple | 12 |
1 | Texas | Orange | 10 |
... | ... | ... | ... |
7 | Florida | Orange | 14 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
That is, after stacking, the old column names sometimes need to be split into two or more separate columns according to some rule. When groups of columns share a common prefix (stub), pd.wide_to_long does this in one step; a minimal sketch follows.
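A minimal sketch of pd.wide_to_long on a toy frame (the frame df and the names score_1, score_2, and round are made up for illustration):
df = pd.DataFrame({'id': [0, 1], 'score_1': [10, 20], 'score_2': [30, 40]})
# 'score' is the stub shared by score_1/score_2; the numeric suffix becomes the new 'round' index level
pd.wide_to_long(df, stubnames='score', i='id', j='round', sep='_')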
movie = pd.read_csv('data/movie.csv')
actor = movie[[
    'movie_title', 'actor_1_name', 'actor_2_name',
    'actor_3_name', 'actor_1_facebook_likes',
    'actor_2_facebook_likes', 'actor_3_facebook_likes']]
Create a custom function to rename the columns: wide_to_long requires each group of variables to end with the same numeric suffix:
def change_col_name(col_name):
    # strip the '_name' suffix from the actor name columns
    col_name = col_name.replace('_name', '')
    if 'facebook' in col_name:
        # move the actor number from the middle to the end,
        # e.g. 'actor_1_facebook_likes' -> 'actor_facebook_likes_1'
        fb_idx = col_name.find('facebook')
        col_name = (col_name[:5] + col_name[fb_idx - 1:] +
                    col_name[5:fb_idx - 1])
    return col_name
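A quick trace of the renamer (hypothetical calls, for illustration):
change_col_name('actor_1_name')            # -> 'actor_1'
change_col_name('actor_1_facebook_likes')  # -> 'actor_facebook_likes_1'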
actor2 = actor.rename(columns=change_col_name)
actor2.iloc[:5, :5]
 | movie_title | actor_1 | actor_2 | actor_3 | actor_facebook_likes_1 |
---|---|---|---|---|---|
0 | Avatar | CCH Pounder | Joel David Moore | Wes Studi | 1000.0 |
1 | Pirates of the Caribbean: At World's End | Johnny Depp | Orlando Bloom | Jack Davenport | 40000.0 |
2 | Spectre | Christoph Waltz | Rory Kinnear | Stephanie Sigman | 11000.0 |
3 | The Dark Knight Rises | Tom Hardy | Christian Bale | Joseph Gordon-Levitt | 27000.0 |
4 | Star Wars: Episode VII - The Force Awakens | Doug Walker | Rob Walker | NaN | 131.0 |
Use wide_to_long to stack the actor and actor_facebook_likes column groups at the same time:
stubs = ['actor', 'actor_facebook_likes']
actor2_tidy = pd.wide_to_long(actor2,
                              stubnames=stubs,
                              i=['movie_title'],
                              j='actor_num',
                              sep='_')
actor2_tidy.head(10)
 | | actor | actor_facebook_likes |
---|---|---|---|
movie_title | actor_num | | |
Avatar | 1 | CCH Pounder | 1000.0 |
Pirates of the Caribbean: At World's End | 1 | Johnny Depp | 40000.0 |
... | ... | ... | ... |
Avengers: Age of Ultron | 1 | Chris Hemsworth | 26000.0 |
Harry Potter and the Half-Blood Prince | 1 | Alan Rickman | 25000.0 |
10 rows × 2 columns
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
Read the state_fruit2 dataset:
state_fruit2 = pd.read_csv('data/state_fruit2.csv')
state_fruit2
 | State | Apple | Orange | Banana |
---|---|---|---|---|
0 | Texas | 12 | 10 | 40 |
1 | Arizona | 9 | 7 | 12 |
2 | Florida | 0 | 14 | 190 |
melt moves the original column names into a variable column and the original values into a value column.
The var_name and value_name parameters rename those two newly generated columns:
state_fruit2.melt(id_vars=['State'], value_vars=['Apple', 'Orange', 'Banana'])
 | State | variable | value |
---|---|---|---|
0 | Texas | Apple | 12 |
1 | Arizona | Apple | 9 |
... | ... | ... | ... |
7 | Arizona | Banana | 12 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
Assign an arbitrary row index:
state_fruit2.index = list('abc')
state_fruit2.index.name = 'letter'
state_fruit2
 | State | Apple | Orange | Banana |
---|---|---|---|---|
letter | | | | |
a | Texas | 12 | 10 | 40 |
b | Arizona | 9 | 7 | 12 |
c | Florida | 0 | 14 | 190 |
var_name and value_name rename the newly generated variable and value columns:
var_name labels the column holding the old column names (default 'variable');
value_name labels the column holding the values (default 'value').
state_fruit2.melt(id_vars=['State'],
                  value_vars=['Apple', 'Orange', 'Banana'],
                  var_name='Fruit',
                  value_name='Weight')
 | State | Fruit | Weight |
---|---|---|---|
0 | Texas | Apple | 12 |
1 | Arizona | Apple | 9 |
... | ... | ... | ... |
7 | Arizona | Banana | 12 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
To put all values in one column and all the old column labels in another, call melt directly with no arguments:
state_fruit2.melt()
 | variable | value |
---|---|---|
0 | State | Texas |
1 | State | Arizona |
... | ... | ... |
10 | Banana | 12 |
11 | Banana | 190 |
12 rows × 2 columns
To specify identifier variables, use the id_vars parameter:
state_fruit2.melt(id_vars='State')
 | State | variable | value |
---|---|---|---|
0 | Texas | Apple | 12 |
1 | Arizona | Apple | 9 |
... | ... | ... | ... |
7 | Arizona | Banana | 12 |
8 | Florida | Banana | 190 |
9 rows × 3 columns
Read the college dataset with the school name as the row index, keeping only the undergraduate-demographics columns:
usecol_func = lambda x: 'UGDS_' in x or x == 'INSTNM'
college = pd.read_csv('data/college.csv', index_col='INSTNM', usecols=usecol_func)
Use stack to pivot all the horizontal column names into a vertical row index:
college_stacked = college.stack()
college_stacked.head(18)
INSTNM
Alabama A & M University UGDS_WHITE 0.0333
UGDS_BLACK 0.9353
...
University of Alabama at Birmingham UGDS_NRA 0.0179
UGDS_UNKN 0.0100
Length: 18, dtype: float64
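Note that stack drops missing values by default; to keep the NaN cells, pass dropna=False (a small sketch on the frame above):
college.stack(dropna=False).head(18)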
The unstack method restores the original shape (cells that stack dropped come back as NaN):
college_stacked.unstack().head()
 | UGDS_WHITE | UGDS_BLACK | ... | UGDS_NRA | UGDS_UNKN |
---|---|---|---|---|---|
INSTNM | | | | | |
Alabama A & M University | 0.0333 | 0.9353 | ... | 0.0059 | 0.0138 |
University of Alabama at Birmingham | 0.5922 | 0.2600 | ... | 0.0179 | 0.0100 |
Amridge University | 0.2990 | 0.4192 | ... | 0.0000 | 0.2715 |
University of Alabama in Huntsville | 0.6988 | 0.1255 | ... | 0.0332 | 0.0350 |
Alabama State University | 0.0158 | 0.9208 | ... | 0.0243 | 0.0137 |
5 rows × 9 columns
DataFrame.pivot(index=None, columns=None, values=None)
Given an index and columns, pivot returns a reshaped DataFrame.
Another approach is melt first, then pivot. Reload the data without specifying a row index:
college2 = pd.read_csv('data/college.csv', usecols=usecol_func)
college_melted = college2.melt(id_vars='INSTNM', var_name='Race', value_name='Percentage')
college_melted.head()
 | INSTNM | Race | Percentage |
---|---|---|---|
0 | Alabama A & M University | UGDS_WHITE | 0.0333 |
1 | University of Alabama at Birmingham | UGDS_WHITE | 0.5922 |
2 | Amridge University | UGDS_WHITE | 0.2990 |
3 | University of Alabama in Huntsville | UGDS_WHITE | 0.6988 |
4 | Alabama State University | UGDS_WHITE | 0.0158 |
Restore the original shape with pivot:
melted_inv = college_melted.pivot(index='INSTNM', columns='Race', values='Percentage')
melted_inv.head()
Race | UGDS_2MOR | UGDS_AIAN | ... | UGDS_UNKN | UGDS_WHITE |
---|---|---|---|---|---|
INSTNM | |||||
A & W Healthcare Educators | 0.0000 | 0.0 | ... | 0.0000 | 0.0000 |
A T Still University of Health Sciences | NaN | NaN | ... | NaN | NaN |
ABC Beauty Academy | 0.0000 | 0.0 | ... | 0.0000 | 0.0000 |
ABC Beauty College Inc | 0.0000 | 0.0 | ... | 0.0000 | 0.2895 |
AI Miami International University of Art and Design | 0.0018 | 0.0 | ... | 0.4644 | 0.0324 |
5 rows × 9 columns
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame
pivot_table is similar to pivot, except that duplicate index/columns combinations are reduced to a single value with aggfunc instead of raising an error.
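A toy sketch of the difference (df and its column names are illustrative): pivot raises on duplicate index/columns pairs, while pivot_table aggregates them:
df = pd.DataFrame({'k': ['a', 'a'], 'c': ['x', 'x'], 'v': [1, 3]})
# df.pivot(index='k', columns='c', values='v')   # ValueError: duplicate entries
df.pivot_table(index='k', columns='c', values='v', aggfunc='sum')  # cell ('a', 'x') -> 4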
flights = pd.read_csv('data/flights.csv')
flights.head()
 | MONTH | DAY | ... | DIVERTED | CANCELLED |
---|---|---|---|---|---|
0 | 1 | 1 | ... | 0 | 0 |
1 | 1 | 1 | ... | 0 | 0 |
2 | 1 | 1 | ... | 0 | 0 |
3 | 1 | 1 | ... | 0 | 0 |
4 | 1 | 1 | ... | 0 | 0 |
5 rows × 14 columns
Use pivot_table to compute the total number of cancelled flights for each airline at each origin airport:
fp = flights.pivot_table(index='AIRLINE',
                         columns='ORG_AIR',
                         values='CANCELLED',
                         aggfunc='sum',
                         fill_value=0).round(2)
fp.head()
ORG_AIR | ATL | DEN | ... | PHX | SFO |
---|---|---|---|---|---|
AIRLINE | |||||
AA | 3 | 4 | ... | 4 | 2 |
AS | 0 | 0 | ... | 0 | 0 |
B6 | 0 | 0 | ... | 0 | 1 |
DL | 28 | 1 | ... | 1 | 2 |
EV | 18 | 6 | ... | 0 | 0 |
5 rows × 10 columns
A groupby aggregation cannot reproduce this table directly; first aggregate by every column that appears in the index and columns:
fg = flights.groupby(['AIRLINE', 'ORG_AIR'])['CANCELLED'].sum()
fg.head()
AIRLINE ORG_AIR
AA ATL 3
DEN 4
DFW 86
IAH 3
LAS 3
Name: CANCELLED, dtype: int64
Then use unstack to move the ORG_AIR index level into the column names:
fg_unstack = fg.unstack('ORG_AIR', fill_value=0)
fg_unstack.head()
ORG_AIR | ATL | DEN | ... | PHX | SFO |
---|---|---|---|---|---|
AIRLINE | |||||
AA | 3 | 4 | ... | 4 | 2 |
AS | 0 | 0 | ... | 0 | 0 |
B6 | 0 | 0 | ... | 0 | 1 |
DL | 28 | 1 | ... | 1 | 2 |
EV | 18 | 6 | ... | 0 | 0 |
5 rows × 10 columns
fp.equals(fg_unstack)
True
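A side note on the check: DataFrame.equals requires the element dtypes to match and treats NaNs in the same position as equal, unlike elementwise ==. A rougher, sketch-level comparison would be:
(fp == fg_unstack).all().all()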
pivot_table can also take multiple index, columns, and values variables, and several aggregation functions at once:
fp2 = flights.pivot_table(index=['AIRLINE', 'MONTH'],
                          columns=['ORG_AIR', 'CANCELLED'],
                          values=['DEP_DELAY', 'DIST'],
                          aggfunc=[np.mean, np.sum],
                          fill_value=0)
fp2
 | | mean | mean | ... | sum | sum |
---|---|---|---|---|---|---|
 | | DEP_DELAY | DEP_DELAY | ... | DIST | DIST |
ORG_AIR | | ATL | ATL | ... | SFO | SFO |
CANCELLED | | 0 | 1 | ... | 0 | 1 |
AIRLINE | MONTH | | | | | |
AA | 1 | -3.250000 | 0 | ... | 33483 | 0 |
2 | -3.000000 | 0 | ... | 32110 | 2586 | |
... | ... | ... | ... | ... | ... | ... |
WN | 11 | 5.932203 | 0 | ... | 23235 | 784 |
12 | 15.691589 | 0 | ... | 30508 | 0 |
149 rows × 80 columns
Reproduce the pivot_table above with groupby and unstack:
(flights.groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED'])[['DEP_DELAY', 'DIST']]
        .agg(['mean', 'sum'])
        .unstack(['ORG_AIR', 'CANCELLED'], fill_value=0)
        .swaplevel(0, 1, axis='columns')
        .head())
 | | mean | mean | ... | sum | sum |
---|---|---|---|---|---|---|
 | | DEP_DELAY | DEP_DELAY | ... | DIST | DIST |
ORG_AIR | | ATL | ATL | ... | SFO | SFO |
CANCELLED | | 0 | 1 | ... | 0 | 1 |
AIRLINE | MONTH | | | | | |
AA | 1 | -3.250000 | NaN | ... | 33483.0 | NaN |
2 | -3.000000 | NaN | ... | 32110.0 | 2586.0 | |
3 | -0.166667 | NaN | ... | 43580.0 | NaN | |
4 | 0.071429 | NaN | ... | 51054.0 | NaN | |
5 | 5.777778 | NaN | ... | 40233.0 | NaN |
5 rows × 80 columns
Some data-analysis examples and tricks
Read the college dataset; group it and summarize undergraduate enrollment and SAT math scores:
college = pd.read_csv('data/college.csv')
cg = (college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']]
             .agg(['count', 'min', 'max']).head(6))
cg
 | | UGDS | UGDS | ... | SATMTMID | SATMTMID |
---|---|---|---|---|---|---|
 | | count | min | ... | min | max |
STABBR | RELAFFIL | | | | | |
AK | 0 | 7 | 109.0 | ... | NaN | NaN |
1 | 3 | 27.0 | ... | 503.0 | 503.0 | |
... | ... | ... | ... | ... | ... | ... |
AR | 0 | 68 | 18.0 | ... | 427.0 | 565.0 |
1 | 14 | 20.0 | ... | 495.0 | 600.0 |
6 rows × 6 columns
Both levels of the row index have names, but the column index levels do not. Use rename_axis to name the two column levels:
cg = cg.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
cg
AGG_COLS | | UGDS | UGDS | ... | SATMTMID | SATMTMID |
---|---|---|---|---|---|---|
AGG_FUNCS | | count | min | ... | min | max |
STABBR | RELAFFIL | | | | | |
AK | 0 | 7 | 109.0 | ... | NaN | NaN |
1 | 3 | 27.0 | ... | 503.0 | 503.0 | |
... | ... | ... | ... | ... | ... | ... |
AR | 0 | 68 | 18.0 | ... | 427.0 | 565.0 |
1 | 14 | 20.0 | ... | 495.0 | 600.0 |
6 rows × 6 columns
Move the AGG_FUNCS column level into the row index:
cg.stack('AGG_FUNCS').head()
AGG_COLS | | | UGDS | SATMTMID |
---|---|---|---|---|
STABBR | RELAFFIL | AGG_FUNCS | | |
AK | 0 | count | 7.0 | 0.0 |
min | 109.0 | NaN | ||
max | 12865.0 | NaN | ||
1 | count | 3.0 | 1.0 | |
min | 27.0 | 503.0 |
By default, stack places the new level at the innermost position of the row index; use swaplevel to reorder the levels:
cg.stack('AGG_FUNCS').swaplevel('AGG_FUNCS', 'STABBR', axis='index').head()
AGG_COLS | | | UGDS | SATMTMID |
---|---|---|---|---|
AGG_FUNCS | RELAFFIL | STABBR | | |
count | 0 | AK | 7.0 | 0.0 |
min | 0 | AK | 109.0 | NaN |
max | 0 | AK | 12865.0 | NaN |
count | 1 | AK | 3.0 | 1.0 |
min | 1 | AK | 27.0 | 503.0 |
Building on the previous step, sort with sort_index:
(cg.stack('AGG_FUNCS')
   .swaplevel('AGG_FUNCS', 'STABBR', axis='index')
   .sort_index(level='RELAFFIL', axis='index')
   .sort_index(level='AGG_COLS', axis='columns')
   .head(6))
AGG_COLS | | | SATMTMID | UGDS |
---|---|---|---|---|
AGG_FUNCS | RELAFFIL | STABBR | | |
count | 0 | AK | 0.0 | 7.0 |
AL | 13.0 | 71.0 | ||
... | ... | ... | ... | ... |
min | 0 | AL | 420.0 | 12.0 |
AR | 427.0 | 18.0 |
6 rows × 2 columns
Stack some levels while unstacking others:
cg.stack('AGG_FUNCS').unstack(['RELAFFIL', 'STABBR'])
AGG_COLS | UGDS | UGDS | ... | SATMTMID | SATMTMID |
---|---|---|---|---|---|
RELAFFIL | 0 | 1 | ... | 0 | 1 |
STABBR | AK | AK | ... | AR | AR |
AGG_FUNCS | | | | | |
count | 7.0 | 3.0 | ... | 9.0 | 7.0 |
min | 109.0 | 27.0 | ... | 427.0 | 495.0 |
max | 12865.0 | 275.0 | ... | 565.0 | 600.0 |
3 rows × 12 columns
Stacking all the column levels returns a Series:
cg.stack(['AGG_FUNCS', 'AGG_COLS']).head(12)
STABBR RELAFFIL AGG_FUNCS AGG_COLS
AK 0 count UGDS 7.0
SATMTMID 0.0
...
AL 0 count UGDS 71.0
SATMTMID 13.0
Length: 12, dtype: float64
Remove the names of every level of the row and column indexes:
cg.rename_axis([None, None], axis='index').rename_axis([None, None], axis='columns')
 | | UGDS | UGDS | ... | SATMTMID | SATMTMID |
---|---|---|---|---|---|---|
 | | count | min | ... | min | max |
AK | 0 | 7 | 109.0 | ... | NaN | NaN |
1 | 3 | 27.0 | ... | 503.0 | 503.0 | |
... | ... | ... | ... | ... | ... | ... |
AR | 0 | 68 | 18.0 | ... | 427.0 | 565.0 |
1 | 14 | 20.0 | ... | 495.0 | 600.0 |
6 rows × 6 columns
Tidying when multiple variables are stored as column names
Read the weightlifting dataset:
weightlifting = pd.read_csv('data/weightlifting_men.csv')
weightlifting
 | Weight Category | M35 35-39 | ... | M75 75-79 | M80 80+ |
---|---|---|---|---|---|
0 | 56 | 137 | ... | 62 | 55 |
1 | 62 | 152 | ... | 67 | 57 |
... | ... | ... | ... | ... | ... |
6 | 105 | 210 | ... | 95 | 80 |
7 | 105+ | 217 | ... | 100 | 85 |
8 rows × 11 columns
Use melt to move the sex_age column names into a single column:
wl_melt = weightlifting.melt(id_vars='Weight Category',
                             var_name='sex_age',
                             value_name='Qual Total')
wl_melt.head()
 | Weight Category | sex_age | Qual Total |
---|---|---|---|
0 | 56 | M35 35-39 | 137 |
1 | 62 | M35 35-39 | 152 |
2 | 69 | M35 35-39 | 167 |
3 | 77 | M35 35-39 | 182 |
4 | 85 | M35 35-39 | 192 |
Use the str.split method to break the sex_age column into two columns:
sex_age = wl_melt['sex_age'].str.split(expand=True)
sex_age.head()
 | 0 | 1 |
---|---|---|
0 | M35 | 35-39 |
1 | M35 | 35-39 |
2 | M35 | 35-39 |
3 | M35 | 35-39 |
4 | M35 | 35-39 |
sex_age.columns = ['Sex', 'Age Group']
sex_age.head()
 | Sex | Age Group |
---|---|---|
0 | M35 | 35-39 |
1 | M35 | 35-39 |
2 | M35 | 35-39 |
3 | M35 | 35-39 |
4 | M35 | 35-39 |
Keep only the leading 'M' of each string:
sex_age['Sex'] = sex_age['Sex'].str[0]
sex_age.head()
 | Sex | Age Group |
---|---|---|
0 | M | 35-39 |
1 | M | 35-39 |
2 | M | 35-39 |
3 | M | 35-39 |
4 | M | 35-39 |
Use concat to join sex_age with wl_cat_total:
wl_cat_total = wl_melt[['Weight Category', 'Qual Total']]
wl_tidy = pd.concat([sex_age, wl_cat_total], axis='columns')
wl_tidy.head()
 | Sex | Age Group | Weight Category | Qual Total |
---|---|---|---|---|
0 | M | 35-39 | 56 | 137 |
1 | M | 35-39 | 62 | 152 |
2 | M | 35-39 | 69 | 167 |
3 | M | 35-39 | 77 | 182 |
4 | M | 35-39 | 85 | 192 |
The same result can also be obtained like this:
cols = ['Weight Category', 'Qual Total']
sex_age[cols] = wl_melt[cols]
Another option is assign, which adds the new columns dynamically. The regex r'(\d{2}[-+](?:\d{2})?)' captures two digits, a '-' or '+', and an optional second pair of digits, so it matches both '35-39' and '80+':
age_group = wl_melt.sex_age.str.extract(r'(\d{2}[-+](?:\d{2})?)', expand=False)
sex = wl_melt.sex_age.str[0]
new_cols = {'Sex': sex, 'Age Group': age_group}
wl_tidy2 = wl_melt.assign(**new_cols).drop('sex_age', axis='columns')
wl_tidy2.head()
 | Weight Category | Qual Total | Sex | Age Group |
---|---|---|---|---|
0 | 56 | 137 | M | 35-39 |
1 | 62 | 152 | M | 35-39 |
2 | 69 | 167 | M | 35-39 |
3 | 77 | 182 | M | 35-39 |
4 | 85 | 192 | M | 35-39 |
Read the restaurant_inspections dataset, parsing the Date column as datetime64:
inspections = pd.read_csv('data/restaurant_inspections.csv', parse_dates=['Date'])
inspections.head(10)
 | Name | Date | Info | Value |
---|---|---|---|---|
0 | E & E Grill House | 2017-08-08 | Borough | MANHATTAN |
1 | E & E Grill House | 2017-08-08 | Cuisine | American |
... | ... | ... | ... | ... |
8 | PIZZA WAGON | 2017-04-12 | Grade | A |
9 | PIZZA WAGON | 2017-04-12 | Score | 10.0 |
10 rows × 4 columns
Set Name, Date, and Info as the index, then unstack Info into the columns:
inspections.set_index(['Name', 'Date', 'Info']).unstack('Info').head()
 | | Value | | | | |
---|---|---|---|---|---|---|
Info | | Borough | Cuisine | Description | Grade | Score |
Name | Date | | | | | |
3 STAR JUICE CENTER | 2017-05-10 | BROOKLYN | Juice, Smoothies, Fruit Salads | Facility not vermin proof. Harborage or condit... | A | 12.0 |
A & L PIZZA RESTAURANT | 2017-08-22 | BROOKLYN | Pizza | Facility not vermin proof. Harborage or condit... | A | 9.0 |
AKSARAY TURKISH CAFE AND RESTAURANT | 2017-07-25 | BROOKLYN | Turkish | Plumbing not properly installed or maintained;... | A | 13.0 |
ANTOJITOS DELI FOOD | 2017-06-01 | BROOKLYN | Latin (Cuban, Dominican, Puerto Rican, South &... | Live roaches present in facility's food and/or... | A | 10.0 |
BANGIA | 2017-06-16 | MANHATTAN | Korean | Covered garbage receptacle not provided or ina... | A | 9.0 |
Use reset_index to move the row-index levels into the columns, placing them at the inner column level:
insp_tidy = (inspections.set_index(['Name', 'Date', 'Info'])
                        .unstack('Info')
                        .reset_index(col_level=-1))
insp_tidy.head()
 | | | ... | Value | Value |
---|---|---|---|---|---|
Info | Name | Date | ... | Grade | Score |
0 | 3 STAR JUICE CENTER | 2017-05-10 | ... | A | 12.0 |
1 | A & L PIZZA RESTAURANT | 2017-08-22 | ... | A | 9.0 |
2 | AKSARAY TURKISH CAFE AND RESTAURANT | 2017-07-25 | ... | A | 13.0 |
3 | ANTOJITOS DELI FOOD | 2017-06-01 | ... | A | 10.0 |
4 | BANGIA | 2017-06-16 | ... | A | 9.0 |
5 rows × 7 columns
Drop the outermost column level and remove the name of the remaining level:
insp_tidy.columns = insp_tidy.columns.droplevel(0).rename(None)
insp_tidy.head()
 | Name | Date | ... | Grade | Score |
---|---|---|---|---|---|
0 | 3 STAR JUICE CENTER | 2017-05-10 | ... | A | 12.0 |
1 | A & L PIZZA RESTAURANT | 2017-08-22 | ... | A | 9.0 |
2 | AKSARAY TURKISH CAFE AND RESTAURANT | 2017-07-25 | ... | A | 13.0 |
3 | ANTOJITOS DELI FOOD | 2017-06-01 | ... | A | 10.0 |
4 | BANGIA | 2017-06-16 | ... | A | 9.0 |
5 rows × 7 columns
pivot_table needs an aggregation function in order to reduce each group to a single value:
(inspections.pivot_table(index=['Name', 'Date'],
                         columns='Info',
                         values='Value',
                         aggfunc='first')
            .reset_index()
            .rename_axis(None, axis='columns'))
 | Name | Date | ... | Grade | Score |
---|---|---|---|---|---|
0 | 3 STAR JUICE CENTER | 2017-05-10 | ... | A | 12.0 |
1 | A & L PIZZA RESTAURANT | 2017-08-22 | ... | A | 9.0 |
... | ... | ... | ... | ... | ... |
98 | WANG MANDOO HOUSE | 2017-08-29 | ... | A | 12.0 |
99 | XIAOYAN YABO INC | 2017-08-29 | ... | Z | 49.0 |
100 rows × 7 columns
# inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
# Running pivot raises an error: there is no aggregation function, and the
# ['Name', 'Date'] index combined with columns='Info' can map to multiple values.
Read the texas_cities dataset:
cities = pd.read_csv('data/texas_cities.csv')
cities
 | City | Geolocation |
---|---|---|
0 | Houston | 29.7604° N, 95.3698° W |
1 | Dallas | 32.7767° N, 96.7970° W |
2 | Austin | 30.2672° N, 97.7431° W |
Split Geolocation into four separate columns. The split pattern '. ' is a regex in which '.' matches the degree symbol or the comma preceding each space:
geolocations = cities.Geolocation.str.split(pat='. ', expand=True)
geolocations.columns = ['latitude', 'latitude direction', 'longitude', 'longitude direction']
geolocations
 | latitude | latitude direction | longitude | longitude direction |
---|---|---|---|---|
0 | 29.7604 | N | 95.3698 | W |
1 | 32.7767 | N | 96.7970 | W |
2 | 30.2672 | N | 97.7431 | W |
Convert the data types:
geolocations = geolocations.astype({'latitude': 'float', 'longitude': 'float'})
geolocations.dtypes
latitude float64
latitude direction object
longitude float64
longitude direction object
dtype: object
Concatenate the new columns with the original City column:
cities_tidy = pd.concat([cities['City'], geolocations], axis='columns')
cities_tidy
 | City | latitude | latitude direction | longitude | longitude direction |
---|---|---|---|---|---|
0 | Houston | 29.7604 | N | 95.3698 | W |
1 | Dallas | 32.7767 | N | 96.7970 | W |
2 | Austin | 30.2672 | N | 97.7431 | W |
The to_numeric function can convert each column to integer or float automatically; errors='ignore' leaves columns that cannot be parsed unchanged:
temp = geolocations.apply(pd.to_numeric, errors='ignore')
temp
 | latitude | latitude direction | longitude | longitude direction |
---|---|---|---|---|
0 | 29.7604 | N | 95.3698 | W |
1 | 32.7767 | N | 96.7970 | W |
2 | 30.2672 | N | 97.7431 | W |
The regex alternation symbol '|' lets split cut on several delimiters at once:
cities.Geolocation.str.split(pat='° |, ', expand=True)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 29.7604 | N | 95.3698 | W |
1 | 32.7767 | N | 96.7970 | W |
2 | 30.2672 | N | 97.7431 | W |
A more elaborate extraction with str.extract:
cities.Geolocation.str.extract(r'([0-9.]+). (N|S), ([0-9.]+). (E|W)', expand=True)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 29.7604 | N | 95.3698 | W |
1 | 32.7767 | N | 96.7970 | W |
2 | 30.2672 | N | 97.7431 | W |
Read the sensors dataset:
sensors = pd.read_csv('data/sensors.csv')
sensors
 | Group | Property | ... | 2015 | 2016 |
---|---|---|---|---|---|
0 | A | Pressure | ... | 973 | 870 |
1 | A | Temperature | ... | 1036 | 1042 |
... | ... | ... | ... | ... | ... |
4 | B | Temperature | ... | 1002 | 1013 |
5 | B | Flow | ... | 824 | 873 |
6 rows × 7 columns
Tidy the data with melt:
sensors.melt(id_vars=['Group', 'Property'], var_name='Year').head(6)
 | Group | Property | Year | value |
---|---|---|---|---|
0 | A | Pressure | 2012 | 928 |
1 | A | Temperature | 2012 | 1026 |
... | ... | ... | ... | ... |
4 | B | Temperature | 2012 | 1008 |
5 | B | Flow | 2012 | 887 |
6 rows × 4 columns
Use pivot_table to turn the Property values into new column names:
(sensors.melt(id_vars=['Group', 'Property'], var_name='Year')
        .pivot_table(index=['Group', 'Year'], columns='Property', values='value')
        .reset_index()
        .rename_axis(None, axis='columns'))
 | Group | Year | Flow | Pressure | Temperature |
---|---|---|---|---|---|
0 | A | 2012 | 819 | 928 | 1026 |
1 | A | 2013 | 806 | 873 | 1038 |
... | ... | ... | ... | ... | ... |
8 | B | 2015 | 824 | 806 | 1002 |
9 | B | 2016 | 873 | 942 | 1013 |
10 rows × 5 columns
Achieve the same with stack and unstack:
(sensors.set_index(['Group', 'Property'])
        .stack()
        .unstack('Property')
        .rename_axis(['Group', 'Year'], axis='index')
        .rename_axis(None, axis='columns')
        .reset_index())
 | Group | Year | Flow | Pressure | Temperature |
---|---|---|---|---|---|
0 | A | 2012 | 819 | 928 | 1026 |
1 | A | 2013 | 806 | 873 | 1038 |
... | ... | ... | ... | ... | ... |
8 | B | 2015 | 824 | 806 | 1002 |
9 | B | 2016 | 873 | 942 | 1013 |
10 rows × 5 columns
That is, splitting one table into several smaller tables (normalization).
Read the movie_altered dataset:
movie = pd.read_csv('data/movie_altered.csv')
movie.head()
 | title | rating | ... | actor_fb_likes_2 | actor_fb_likes_3 |
---|---|---|---|---|---|
0 | Avatar | PG-13 | ... | 936.0 | 855.0 |
1 | Pirates of the Caribbean: At World's End | PG-13 | ... | 5000.0 | 1000.0 |
2 | Spectre | PG-13 | ... | 393.0 | 161.0 |
3 | The Dark Knight Rises | PG-13 | ... | 23000.0 | 23000.0 |
4 | Star Wars: Episode VII - The Force Awakens | NaN | ... | 12.0 | NaN |
5 rows × 12 columns
Insert a new column to identify each movie:
movie.insert(0, 'id', np.arange(len(movie)))
Use wide_to_long to put all the directors and actors into one column each, and all the Facebook likes into another:
stubnames = ['director', 'director_fb_likes', 'actor', 'actor_fb_likes']
movie_long = pd.wide_to_long(movie, stubnames=stubnames, i='id', j='num', sep='_').reset_index()
movie_long['num'] = movie_long['num'].astype(int)
movie_long.head(9)
 | id | num | ... | actor | actor_fb_likes |
---|---|---|---|---|---|
0 | 0 | 1 | ... | CCH Pounder | 1000.0 |
1 | 0 | 2 | ... | Joel David Moore | 936.0 |
... | ... | ... | ... | ... | ... |
7 | 2 | 2 | ... | Rory Kinnear | 393.0 |
8 | 2 | 3 | ... | Stephanie Sigman | 161.0 |
9 rows × 10 columns
movie.columns
Index(['id', 'title', 'rating', 'year', 'duration', 'director_1',
       'director_fb_likes_1', 'actor_1', 'actor_2', 'actor_3',
       'actor_fb_likes_1', 'actor_fb_likes_2', 'actor_fb_likes_3'],
      dtype='object')
movie_long.columns
Index(['id', 'num', 'year', 'duration', 'rating', 'title', 'director',
       'director_fb_likes', 'actor', 'actor_fb_likes'],
      dtype='object')
Decompose the data into several smaller tables:
movie_table = movie_long[['id', 'title', 'year', 'duration', 'rating']]
director_table = movie_long[['id', 'director', 'num', 'director_fb_likes']]
actor_table = movie_long[['id', 'actor', 'num', 'actor_fb_likes']]
Deduplicate and drop the missing values:
movie_table = movie_table.drop_duplicates().reset_index(drop=True)
director_table = director_table.dropna().reset_index(drop=True)
actor_table = actor_table.dropna().reset_index(drop=True)
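As a sanity check that the decomposition is lossless, the tables can be joined back together on the id key; a sketch (director_table only ever carries num == 1 here, so its num column is dropped before the join):
recombined = (movie_table
              .merge(director_table.drop(columns='num'), on='id', how='left')
              .merge(actor_table, on='id', how='left'))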
Original post: https://www.cnblogs.com/shiyushiyu/p/9800795.html