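This is a translation by SeanCheney, a prolific author on Jianshu. I adjusted some of the formatting and reorganized the table of contents so that it is easier for me to read and to look things up later.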
import pandas as pd
import numpy as np
Set the maximum number of displayed columns and rows
pd.set_option('display.max_columns', 5, 'display.max_rows', 5)
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
movie.head()
| movie_title | color | director_name | ... | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|
| Avatar | Color | James Cameron | ... | 1.78 | 33000 |
| Pirates of the Caribbean: At World's End | Color | Gore Verbinski | ... | 2.35 | 0 |
| Spectre | Color | Sam Mendes | ... | 2.35 | 85000 |
| The Dark Knight Rises | Color | Christopher Nolan | ... | 2.35 | 164000 |
| Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | ... | NaN | 0 |
5 rows × 27 columns
Check whether each movie runs longer than two hours
movie_2_hours = movie['duration'] > 120
movie_2_hours.head(10)
movie_title
Avatar True
Pirates of the Caribbean: At World's End True
...
Avengers: Age of Ultron True
Harry Potter and the Half-Blood Prince True
Name: duration, Length: 10, dtype: bool
How many movies run longer than two hours
movie_2_hours.sum()
1039
The proportion of movies that run longer than two hours
movie_2_hours.mean()
0.2113506916192026
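In fact, the duration column contains missing values, and a comparison against a missing value evaluates to False. To get the true proportion of movies longer than two hours, drop the missing values first
movie['duration'].dropna().gt(120).mean()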
0.21199755152009794
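The gap between the two proportions above comes from how NaN behaves in comparisons. Here is a minimal sketch on a made-up four-row Series (the values are assumptions for illustration, not taken from movie.csv):

```python
import numpy as np
import pandas as pd

# Toy durations in minutes (made-up values); one entry is missing.
duration = pd.Series([150.0, 90.0, np.nan, 130.0], name="duration")

# NaN > 120 evaluates to False, so the missing row counts as "not over two hours".
over_two_hours = duration > 120
share_all_rows = over_two_hours.mean()

# Dropping NaN first restricts the denominator to rows with a known duration.
share_known = duration.dropna().gt(120).mean()

print(over_two_hours.tolist())
print(share_all_rows, share_known)
```

With a missing value in the column, the first proportion divides by every row while the second divides only by rows that have a duration, which is why the second figure is slightly larger.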
Use describe() to print summary information about this boolean Series
movie_2_hours.describe()
count 4916
unique 2
top False
freq 3877
Name: duration, dtype: object
Compute the proportions of False and True values
movie_2_hours.value_counts(normalize=True)
False 0.788649
True 0.211351
Name: duration, dtype: float64
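In Python, and therefore in pandas, the bitwise operators (&, |, ~) have higher precedence than the comparison operators, so each comparison must be wrapped in parentheses before it is combined with others
criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == 'PG-13'
criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)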
criteria3.head()
movie_title
Avatar False
Pirates of the Caribbean: At World's End False
Spectre True
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
Name: title_year, dtype: bool
criteria_final = criteria1 & criteria2 & criteria3
criteria_final.head()
movie_title
Avatar False
Pirates of the Caribbean: At World's End False
Spectre False
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
dtype: bool
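To see why the parentheses around each comparison matter, here is a small sketch on a made-up three-row frame (the data and the name scores are assumptions for illustration). Because & binds more tightly than > and <, the unparenthesized form is parsed as a chained comparison and fails:

```python
import pandas as pd

# Made-up scores and years, only to illustrate operator precedence.
scores = pd.DataFrame({"imdb_score": [8.5, 7.0, 9.1],
                       "title_year": [1999, 2012, 2005]})

# Correct: each comparison is wrapped in parentheses before combining with &.
good = (scores["imdb_score"] > 8) & (scores["title_year"] < 2000)

# Without parentheses Python reads this as
#   scores["imdb_score"] > (8 & scores["title_year"]) < 2000
# which is a chained comparison and needs the truth value of a whole Series.
try:
    bad = scores["imdb_score"] > 8 & scores["title_year"] < 2000
    raised = False
except ValueError:
    raised = True

print(good.tolist())
print(raised)
```

Assigning each comparison to its own variable first, as the criteria above do, avoids the problem as well.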
Create the first set of boolean criteria
crit_a1 = movie.imdb_score > 8
crit_a2 = movie.content_rating == 'PG-13'
crit_a3 = (movie.title_year < 2000) | (movie.title_year > 2009)
final_crit_a = crit_a1 & crit_a2 & crit_a3
Create the second set of boolean criteria
crit_b1 = movie.imdb_score < 5
crit_b2 = movie.content_rating == 'R'
crit_b3 = (movie.title_year >= 2000) & (movie.title_year <= 2010)
final_crit_b = crit_b1 & crit_b2 & crit_b3
Combine the two sets of criteria
final_crit_all = final_crit_a | final_crit_b
final_crit_all.head()
movie_title
Avatar False
Pirates of the Caribbean: At World's End False
Spectre False
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
dtype: bool
Filter the data
movie[final_crit_all].head()
| movie_title | color | director_name | ... | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|
| The Dark Knight Rises | Color | Christopher Nolan | ... | 2.35 | 164000 |
| The Avengers | Color | Joss Whedon | ... | 1.85 | 123000 |
| Captain America: Civil War | Color | Anthony Russo | ... | 2.35 | 72000 |
| Guardians of the Galaxy | Color | James Gunn | ... | 2.35 | 96000 |
| Interstellar | Color | Christopher Nolan | ... | 2.35 | 349000 |
5 rows × 27 columns
Verify the filter by looking only at the columns used in the criteria
cols = ['imdb_score', 'content_rating', 'title_year']
movie_filtered = movie.loc[final_crit_all, cols]
movie_filtered.head(10)
| movie_title | imdb_score | content_rating | title_year |
|---|---|---|---|
| The Dark Knight Rises | 8.5 | PG-13 | 2012.0 |
| The Avengers | 8.1 | PG-13 | 2012.0 |
| ... | ... | ... | ... |
| Sex and the City 2 | 4.3 | R | 2010.0 |
| Rollerball | 3.0 | R | 2002.0 |
10 rows × 3 columns
college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')
In college2, STABBR is the row index, so the rows can be selected with loc
college2.loc['TX'].head()
| STABBR | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
|---|---|---|---|---|---|
| TX | Abilene Christian University | Abilene | ... | 40200 | 25985 |
| TX | Alvin Community College | Alvin | ... | 34500 | 6750 |
| TX | Amarillo College | Amarillo | ... | 31700 | 10950 |
| TX | Angelina College | Lufkin | ... | 26900 | PrivacySuppressed |
| TX | Angelo State University | San Angelo | ... | 37700 | 21319.5 |
5 rows × 26 columns
In college, use boolean indexing to select all schools in Texas
college[college['STABBR'] == 'TX'].head()
| | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
|---|---|---|---|---|---|
| 3610 | Abilene Christian University | Abilene | ... | 40200 | 25985 |
| 3611 | Alvin Community College | Alvin | ... | 34500 | 6750 |
| 3612 | Amarillo College | Amarillo | ... | 31700 | 10950 |
| 3613 | Angelina College | Lufkin | ... | 26900 | PrivacySuppressed |
| 3614 | Angelo State University | San Angelo | ... | 37700 | 21319.5 |
5 rows × 27 columns
Compare the speed of the two approaches
Method 1: boolean indexing
%timeit college[college['STABBR'] == 'TX']
937 μs ± 58.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2: selection by index label
%timeit college2.loc['TX']
520 μs ± 21.2 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Setting the index takes time as well
%timeit college2 = college.set_index('STABBR')
2.11 ms ± 185 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Select several states, first with boolean indexing and then with index labels
states = ['TX', 'CA', 'NY']
college[college['STABBR'].isin(states)]
| | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
|---|---|---|---|---|---|
| 192 | Academy of Art University | San Francisco | ... | 36000 | 35093 |
| 193 | ITT Technical Institute-Rancho Cordova | Rancho Cordova | ... | 38800 | 25827.5 |
| ... | ... | ... | ... | ... | ... |
| 7533 | Bay Area Medical Academy - San Jose Satellite ... | San Jose | ... | NaN | PrivacySuppressed |
| 7534 | Excel Learning Center-San Antonio South | San Antonio | ... | NaN | 12125 |
1704 rows × 27 columns
college2.loc[states].head()
| STABBR | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
|---|---|---|---|---|---|
| TX | Abilene Christian University | Abilene | ... | 40200 | 25985 |
| TX | Alvin Community College | Alvin | ... | 34500 | 6750 |
| TX | Amarillo College | Amarillo | ... | 31700 | 10950 |
| TX | Angelina College | Lufkin | ... | 26900 | PrivacySuppressed |
| TX | Angelo State University | San Angelo | ... | 37700 | 21319.5 |
5 rows × 26 columns
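The two selection styles above return the same schools. A minimal sketch on a made-up four-row frame (the school names and the variable names are assumptions for illustration) shows the correspondence, including the one visible difference: with a list of labels, loc returns rows grouped in the order of the list rather than in the original row order.

```python
import pandas as pd

# Made-up schools; only the column names mirror the college data.
schools = pd.DataFrame({
    "INSTNM": ["School A", "School B", "School C", "School D"],
    "STABBR": ["TX", "CA", "TX", "WA"],
})
by_state = schools.set_index("STABBR")

# One state: boolean mask on the column vs. label lookup on the index.
mask_tx = schools[schools["STABBR"] == "TX"]
label_tx = by_state.loc["TX"]

# Several states: isin on the column vs. a list of labels on the index.
states = ["TX", "CA"]
mask_multi = schools[schools["STABBR"].isin(states)]
label_multi = by_state.loc[states]

print(mask_tx["INSTNM"].tolist(), label_tx["INSTNM"].tolist())
print(mask_multi["INSTNM"].tolist(), label_multi["INSTNM"].tolist())
```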
Use the query method to make boolean indexing more readable
# Read the employee data and choose the departments and columns to select
employee = pd.read_csv('data/employee.csv')
depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)']
select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']
# Build the query string and run the query method
qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
emp_filtered = employee.query(qs)
emp_filtered[select_columns].head()
| | UNIQUE_ID | DEPARTMENT | GENDER | BASE_SALARY |
|---|---|---|---|---|
| 61 | 61 | Houston Fire Department (HFD) | Female | 96668.0 |
| 136 | 136 | Houston Police Department-HPD | Female | 81239.0 |
| 367 | 367 | Houston Police Department-HPD | Female | 86534.0 |
| 474 | 474 | Houston Police Department-HPD | Female | 91181.0 |
| 513 | 513 | Houston Police Department-HPD | Female | 81239.0 |
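For comparison, the same kind of filter can be written as an ordinary boolean mask. The sketch below uses a made-up four-row frame (department names and salaries are assumptions for illustration) and checks that query and the mask select the same rows:

```python
import pandas as pd

# Made-up staff records; only the column names mirror the employee data.
staff = pd.DataFrame({
    "DEPARTMENT": ["Police", "Fire", "Library", "Police"],
    "GENDER": ["Female", "Female", "Female", "Male"],
    "BASE_SALARY": [85000.0, 130000.0, 90000.0, 95000.0],
})
depts = ["Police", "Fire"]

# query: @depts refers to the Python variable defined above.
by_query = staff.query(
    "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
)

# The same conditions spelled out as a boolean mask.
mask = (
    staff["DEPARTMENT"].isin(depts)
    & (staff["GENDER"] == "Female")
    & staff["BASE_SALARY"].between(80000, 120000)
)
by_mask = staff[mask]

print(by_query.equals(by_mask))
```

The query string reads closer to plain English, while the mask version needs parentheses and repeated references to the frame.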
college = pd.read_csv('data/college.csv')
college2 = college.set_index('STABBR')
college2.index.is_monotonic_increasing
False
Sort college2, store the result in a new object, and check whether its index is ordered
college3 = college2.sort_index()
college3.index.is_monotonic_increasing
True
Use INSTNM as the row index and check whether the index is unique
college_unique = college.set_index('INSTNM')
college_unique.index.is_unique
True
Concatenate the CITY and STABBR columns to form the row index, then sort it
college.index = college['CITY'] + ', ' + college['STABBR']
college = college.sort_index()
college.head()
| | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
|---|---|---|---|---|---|
| ARTESIA, CA | Angeles Institute | ARTESIA | ... | NaN | 16850 |
| Aberdeen, SD | Presentation College | Aberdeen | ... | 35900 | 25000 |
| Aberdeen, SD | Northern State University | Aberdeen | ... | 33600 | 24847 |
| Aberdeen, WA | Grays Harbor College | Aberdeen | ... | 27000 | 11490 |
| Abilene, TX | Hardin-Simmons University | Abilene | ... | 38700 | 25864 |
5 rows × 27 columns
college.index.is_unique
False
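The two index properties checked above can be tried on a tiny frame. This is a sketch with a made-up three-row index (the labels are assumptions for illustration); it uses is_monotonic_increasing, the explicit name for the ordering check:

```python
import pandas as pd

# Made-up frame whose index is neither sorted nor unique.
idx_frame = pd.DataFrame({"value": [1, 2, 3]}, index=["TX", "CA", "TX"])

unsorted_is_ordered = idx_frame.index.is_monotonic_increasing
has_unique_labels = idx_frame.index.is_unique

# Sorting makes the index ordered; duplicate labels are still allowed.
sorted_frame = idx_frame.sort_index()
sorted_is_ordered = sorted_frame.index.is_monotonic_increasing

print(unsorted_is_ordered, has_unique_labels, sorted_is_ordered)
```

An index can be sorted without being unique, which is the situation of the "CITY, STABBR" index built above.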
Select all colleges in Miami, FL
Method 1: selection by index label
college.loc['Miami, FL'].head()
| | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
|---|---|---|---|---|---|
| Miami, FL | New Professions Technical Institute | Miami | ... | 18700 | 8682 |
| Miami, FL | Management Resources College | Miami | ... | PrivacySuppressed | 12182 |
| Miami, FL | Strayer University-Doral | Miami | ... | 49200 | 36173.5 |
| Miami, FL | Keiser University- Miami | Miami | ... | 29700 | 26063 |
| Miami, FL | George T Baker Aviation Technical College | Miami | ... | 38600 | PrivacySuppressed |
5 rows × 27 columns
Method 2: boolean indexing
crit1 = college['CITY'] == 'Miami'
crit2 = college['STABBR'] == 'FL'
college[crit1 & crit2]
| | INSTNM | CITY | ... | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
|---|---|---|---|---|---|
| Miami, FL | New Professions Technical Institute | Miami | ... | 18700 | 8682 |
| Miami, FL | Management Resources College | Miami | ... | PrivacySuppressed | 12182 |
| ... | ... | ... | ... | ... | ... |
| Miami, FL | Advanced Technical Centers | Miami | ... | PrivacySuppressed | PrivacySuppressed |
| Miami, FL | Lindsey Hopkins Technical College | Miami | ... | 29800 | PrivacySuppressed |
50 rows × 27 columns
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
c1 = movie['content_rating'] == 'G'
c2 = movie['imdb_score'] < 4
criteria = c1 & c2
bool_movie = movie[criteria]
bool_movie
| movie_title | color | director_name | ... | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|
| The True Story of Puss'N Boots | Color | Jérôme Deschamps | ... | NaN | 90 |
| Doogal | Color | Dave Borthwick | ... | 1.85 | 346 |
| ... | ... | ... | ... | ... | ... |
| Justin Bieber: Never Say Never | Color | Jon M. Chu | ... | 1.85 | 62000 |
| Sunday School Musical | Color | Rachel Goldenberg | ... | 1.85 | 777 |
6 rows × 27 columns
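Using a boolean Series with loc
Method 1
movie_loc = movie.loc[criteria]
Check that the DataFrame produced by loc and the one produced by plain boolean indexing are identical
movie_loc.equals(movie[criteria])
True
Method 2
movie_loc2 = movie.loc[criteria.values]
movie_loc2.equals(movie[criteria])
True
Using a boolean array with iloc
Because criteria is a Series that carries a row index, iloc cannot take it directly; pass its underlying ndarray instead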
movie_iloc = movie.iloc[criteria.values]
movie_iloc.equals(movie_loc)
True
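The difference between loc and iloc here can be reproduced on a tiny frame. This is a sketch with made-up titles and scores (assumptions for illustration); the exact exception type raised by iloc is not stated in the text above, so the code catches both types that pandas may use for this case:

```python
import pandas as pd

# Made-up films indexed by title.
films = pd.DataFrame({"imdb_score": [3.5, 8.0, 2.9]}, index=["A", "B", "C"])
criteria = films["imdb_score"] < 4   # boolean Series that carries the title index

via_loc = films.loc[criteria]            # loc accepts the boolean Series
via_iloc = films.iloc[criteria.values]   # iloc needs the plain boolean ndarray

# Passing the Series itself to iloc is rejected.
try:
    films.iloc[criteria]
    iloc_accepts_series = True
except (NotImplementedError, ValueError):
    iloc_accepts_series = False

print(via_loc.equals(via_iloc), iloc_accepts_series)
```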
Boolean indexing can also be used to select columns
criteria_col = movie.dtypes == np.int64
criteria_col.head()
color False
director_name False
num_critic_for_reviews False
duration False
director_facebook_likes False
dtype: bool
movie.loc[:, criteria_col].head()
| movie_title | num_voted_users | cast_total_facebook_likes | movie_facebook_likes |
|---|---|---|---|
| Avatar | 886204 | 4834 | 33000 |
| Pirates of the Caribbean: At World's End | 471220 | 48350 | 0 |
| Spectre | 275868 | 11700 | 85000 |
| The Dark Knight Rises | 1144337 | 106759 | 164000 |
| Star Wars: Episode VII - The Force Awakens | 8 | 143 | 0 |
movie.iloc[:, criteria_col.values].head()
| movie_title | num_voted_users | cast_total_facebook_likes | movie_facebook_likes |
|---|---|---|---|
| Avatar | 886204 | 4834 | 33000 |
| Pirates of the Caribbean: At World's End | 471220 | 48350 | 0 |
| Spectre | 275868 | 11700 | 85000 |
| The Dark Knight Rises | 1144337 | 106759 | 164000 |
| Star Wars: Episode VII - The Force Awakens | 8 | 143 | 0 |
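Selecting columns by comparing dtypes, as above, can also be done with select_dtypes. This is a sketch on a made-up frame (column names and values are assumptions for illustration), offered as an alternative rather than as the method used in the original text:

```python
import numpy as np
import pandas as pd

# Made-up frame with one text, one int64, and one float column.
mixed = pd.DataFrame({
    "title": ["A", "B"],
    "votes": np.array([10, 20], dtype=np.int64),
    "score": [7.5, 8.1],
})

# Boolean Series over the columns, then column selection with loc.
is_int = mixed.dtypes == np.int64
by_mask = mixed.loc[:, is_int]

# select_dtypes picks the same columns by dtype name.
by_dtype = mixed.select_dtypes(include="int64")

print(by_mask.columns.tolist(), by_dtype.columns.tolist())
```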
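mask() is the inverse boolean operation of where(): where keeps the values for which cond is True and replaces the rest, while mask replaces the values for which cond is True.
DataFrame.where(cond, other=nan, inplace=False, ...)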
Parameters:
cond : boolean NDFrame, array-like, or callable
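A minimal sketch of the relationship between where and mask, on a made-up three-value Series (the numbers are assumptions for illustration):

```python
import pandas as pd

# Made-up like counts.
likes = pd.Series([1000.0, 40000.0, 131.0])
cond = likes < 20000

kept = likes.where(cond)                  # keep values where cond is True, NaN elsewhere
same = likes.mask(~cond)                  # mask replaces where its condition is True
capped = likes.where(cond, other=20000)   # other supplies the replacement value

print(kept.tolist())
print(capped.tolist())
```

Calling mask with the negated condition gives the same result as where with the original condition.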
movie = pd.read_csv('data/movie.csv', index_col='movie_title')
fb_likes = movie['actor_1_facebook_likes'].dropna()
fb_likes.head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 40000.0
Spectre 11000.0
The Dark Knight Rises 27000.0
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
Use describe to get a feel for the data
fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)
count 4909
mean 6494
...
90% 18000
max 640000
Name: actor_1_facebook_likes, Length: 10, dtype: int64
Check the proportion of values below 20,000 likes
criteria_high = fb_likes < 20000
criteria_high.mean().round(2)
0.91
where returns a Series of the same size, but every value for which the condition is False is replaced with a missing value
fb_likes.where(criteria_high).head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End NaN
Spectre 11000.0
The Dark Knight Rises NaN
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
The second parameter, other, lets you control the replacement value
fb_likes.where(criteria_high, other=20000).head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 20000.0
Spectre 11000.0
The Dark Knight Rises 20000.0
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
Chain two where calls to set an upper and a lower bound on the values
criteria_low = fb_likes > 300
fb_likes_cap = fb_likes.where(criteria_high, other=20000).where(criteria_low, 300)
fb_likes_cap.head()
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 20000.0
Spectre 11000.0
The Dark Knight Rises 20000.0
Star Wars: Episode VII - The Force Awakens 300.0
Name: actor_1_facebook_likes, dtype: float64
The original Series and the modified Series have the same length
len(fb_likes), len(fb_likes_cap)
(4909, 4909)
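The chained where calls above bound the values between 300 and 20000. As a sketch on made-up numbers (assumptions for illustration), the clip method gives the same result in one call; it is shown here as an alternative, not as the approach the original text uses:

```python
import pandas as pd

# Made-up like counts: one above the upper bound, one below the lower bound.
likes = pd.Series([1000.0, 40000.0, 11000.0, 131.0])

# Two chained where calls: cap at 20000, then floor at 300.
chained = likes.where(likes < 20000, other=20000).where(likes > 300, 300)

# clip applies both bounds at once.
clipped = likes.clip(lower=300, upper=20000)

print(chained.tolist())
print(clipped.tolist())
```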
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'], 'ids2': ['a', 'n', 'c', 'n']})
print(df)
print(df < 2)
df.where(df<2,1000)
vals ids ids2
0 1 a a
1 2 b n
2 3 f c
3 4 n n
vals ids ids2
0 True True True
1 False True True
2 False True True
3 False True True
| | vals | ids | ids2 |
|---|---|---|---|
| 0 | 1 | a | a |
| 1 | 1000 | b | n |
| 2 | 1000 | f | c |
| 3 | 1000 | n | n |
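The following code produces the same values as df.where(df < 2, 1000), except that vals is converted to float because it passes through NaN.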
print(df[df < 2])
df[df < 2].fillna(1000)
vals ids ids2
0 1.0 a a
1 NaN b n
2 NaN f c
3 NaN n n
| | vals | ids | ids2 |
|---|---|---|---|
| 0 | 1.0 | a | a |
| 1 | 1000.0 | b | n |
| 2 | 1000.0 | f | c |
| 3 | 1000.0 | n | n |
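The two forms compared above can be checked side by side on a numeric-only frame. This is a sketch with a made-up single-column frame (the name nums is an assumption for illustration); it leaves out the string columns so that only the numeric behavior is shown:

```python
import pandas as pd

# Made-up numeric frame.
nums = pd.DataFrame({"vals": [1, 2, 3, 4]})

# Replace in one step.
direct = nums.where(nums < 2, 1000)

# Mask to NaN first, then fill; the NaN step turns the integers into floats.
via_nan = nums[nums < 2].fillna(1000)

print(direct["vals"].tolist(), direct["vals"].dtype)
print(via_nan["vals"].tolist(), via_nan["vals"].dtype)
```

The values agree, but the second form ends up with a float column, which matches the 1.0 and 1000.0 shown in the output above.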
Pandas Cookbook, Chapter 5: Boolean Indexing
Original article: https://www.cnblogs.com/shiyushiyu/p/9742808.html