Two kinds of missing data:
Types
None: an object; type(None) is NoneType
np.nan (NaN): a float; type(np.nan) is float
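A quick confirmation in the interpreter:
import numpy as np

type(None)    # <class 'NoneType'>
type(np.nan)  # <class 'float'>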
The difference between the two:
Arithmetic on an object-dtype array is slower than on a float-dtype array
Timing both:
import numpy as np
%timeit np.arange(1000,dtype=object).sum()
# Result:
63.4 μs ± 1.02 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.arange(1000,dtype=float).sum()
# Result:
6.45 μs ± 84.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
None and NaN in pandas
import numpy as np
from pandas import DataFrame

df = DataFrame(np.random.randint(0,100,size=(8,6)))
df.iloc[2,3] = np.nan
df.iloc[5,5] = None  # internally coerced to a float NaN
df.iloc[5,3] = np.nan
df.iloc[7,2] = np.nan
df
|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 81 | 60.0 | 92.0 | 99 | 1.0 |
| 1 | 62 | 67 | 98.0 | 41.0 | 43 | 80.0 |
| 2 | 82 | 77 | 6.0 | NaN | 42 | 64.0 |
| 3 | 63 | 30 | 63.0 | 22.0 | 99 | 24.0 |
| 4 | 97 | 60 | 68.0 | 85.0 | 24 | 35.0 |
| 5 | 44 | 59 | 50.0 | NaN | 29 | NaN |
| 6 | 65 | 0 | 4.0 | 74.0 | 35 | 83.0 |
| 7 | 17 | 78 | NaN | 33.0 | 61 | 6.0 |
Deleting the rows that contain null values
# Both Series and DataFrame support isnull()
df.isnull()  # flags which elements are null
|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | False | True | False | False |
| 3 | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False |
| 5 | False | False | False | True | False | True |
| 6 | False | False | False | False | False | False |
| 7 | False | False | True | False | False | False |
1. Filtering out the rows with nulls
# Find which rows contain nulls
df.isnull().all(axis=1)  # True only when every value in the row is null; all() is usually paired with notnull() instead
# Result:
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 False
# 7 False
# dtype: bool
df.isnull().any(axis=1)  # True when the row contains at least one null
# Result:
# 0 False
# 1 False
# 2 True
# 3 False
# 4 False
# 5 True
# 6 False
# 7 True
# dtype: bool
df.notnull().all(axis=1)  # True when the row contains no nulls
# Result:
# 0 True
# 1 True
# 2 False
# 3 True
# 4 True
# 5 False
# 6 True
# 7 False
# dtype: bool
# Filter out the rows that contain nulls
df.loc[df.notnull().all(axis=1)]
# isnull() --> any(): True means the corresponding row contains a null
# notnull() --> all(): False means the corresponding row contains a null
|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 81 | 60.0 | 92.0 | 99 | 1.0 |
| 1 | 62 | 67 | 98.0 | 41.0 | 43 | 80.0 |
| 3 | 63 | 30 | 63.0 | 22.0 | 99 | 24.0 |
| 4 | 97 | 60 | 68.0 | 85.0 | 24 | 35.0 |
| 6 | 65 | 0 | 4.0 | 74.0 | 35 | 83.0 |
2. Filtering the null rows directly with dropna
# In the drop family, axis=0 means rows and axis=1 means columns
df.dropna(axis=0)
|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 81 | 60.0 | 92.0 | 99 | 1.0 |
| 1 | 62 | 67 | 98.0 | 41.0 | 43 | 80.0 |
| 3 | 63 | 30 | 63.0 | 22.0 | 99 | 24.0 |
| 4 | 97 | 60 | 68.0 | 85.0 | 24 | 35.0 |
| 6 | 65 | 0 | 4.0 | 74.0 | 35 | 83.0 |
Filling in the null values
# Fill with an arbitrary value
df.fillna(value=666)  # replaces every null with value
# Recommended: fill each null with a neighboring value
df.fillna(method='ffill',axis=1)  # axis sets the fill direction; method='ffill' fills forward, 'bfill' fills backward
# If nulls remain (e.g. in the first column), chain the other direction
df.fillna(method='ffill',axis=1).fillna(method='bfill',axis=1)
|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 81.0 | 60.0 | 92.0 | 99.0 | 1.0 |
| 1 | 62.0 | 67.0 | 98.0 | 41.0 | 43.0 | 80.0 |
| 2 | 82.0 | 77.0 | 6.0 | 6.0 | 42.0 | 64.0 |
| 3 | 63.0 | 30.0 | 63.0 | 22.0 | 99.0 | 24.0 |
| 4 | 97.0 | 60.0 | 68.0 | 85.0 | 24.0 | 35.0 |
| 5 | 44.0 | 59.0 | 50.0 | 50.0 | 29.0 | 29.0 |
| 6 | 65.0 | 0.0 | 4.0 | 74.0 | 35.0 | 83.0 |
| 7 | 17.0 | 78.0 | 78.0 | 33.0 | 61.0 | 6.0 |
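Note: in pandas 2.1 and later the method= argument of fillna is deprecated; a sketch of the same chain using the dedicated methods:
new = df.ffill(axis=1).bfill(axis=1)  # fill forward along each row, then backward for anything left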
Requirement: clean the time-series data below, read from an Excel file — drop the empty columns none and none1, then handle the remaining nulls.
Implementation:
| time | none | 1 | 2 | 3 | 4 | none1 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|
| 2019/1/27 17:00 | NULL | -24.8 | -18.2 | -20.8 | -18.8 | NULL | NULL | NULL | NULL |
| 2019/1/27 17:01 | NULL | -23.5 | -18.8 | -20.5 | -19.8 | NULL | -15.2 | -14.5 | -16 |
| 2019/1/27 17:02 | NULL | -23.2 | -19.2 | NULL | NULL | NULL | -13 | NULL | -14 |
| 2019/1/27 17:03 | NULL | -22.8 | -19.2 | -20 | -20.5 | NULL | NULL | -12.2 | -9.8 |
| 2019/1/27 17:04 | NULL | -23.2 | -18.5 | -20 | -18.8 | NULL | -10.2 | -10.8 | -8.8 |
# Preprocessing: read the Excel file and drop the empty columns none and none1
import pandas as pd

df = pd.read_excel('./data/testData.xlsx')
df.drop(labels=['none','none1'],axis=1,inplace=True)
df
# Drop the rows containing nulls; if that discards too much data, fill instead
df.dropna(axis=0)
# Fill
new_df = df.fillna(axis=0,method='ffill').fillna(axis=0,method='bfill')
# Check that every null was filled
new_df.isnull().any(axis=0)
# Result:
# time False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 False
# 7 False
# dtype: bool
Preparing the data:
df = DataFrame(data=np.random.randint(0,100,size=(10,8)))
df.iloc[1] = [1,1,1,1,1,1,1,1]
df.iloc[3] = [1,1,1,1,1,1,1,1]
df.iloc[5] = [1,1,1,1,1,1,1,1]
df.iloc[7] = [1,1,1,1,1,1,1,1]
df
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 85 | 27 | 59 | 81 | 14 | 65 | 86 | 17 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 65 | 2 | 69 | 44 | 99 | 12 | 75 | 37 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 47 | 63 | 62 | 9 | 77 | 36 | 15 | 90 |
| 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 6 | 7 | 38 | 2 | 5 | 93 | 86 | 4 | 92 |
| 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 8 | 22 | 12 | 45 | 93 | 29 | 42 | 96 | 59 |
| 9 | 11 | 10 | 49 | 60 | 45 | 58 | 23 | 53 |
Dropping duplicate rows
df.drop_duplicates(keep='last')
# keep='first' keeps the first occurrence of each set of duplicates
# keep='last'  keeps the last occurrence
# keep=False   drops every duplicated row (the boolean False, not the string 'False')
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 85 | 27 | 59 | 81 | 14 | 65 | 86 | 17 |
| 2 | 65 | 2 | 69 | 44 | 99 | 12 | 75 | 37 |
| 4 | 47 | 63 | 62 | 9 | 77 | 36 | 15 | 90 |
| 6 | 7 | 38 | 2 | 5 | 93 | 86 | 4 | 92 |
| 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 8 | 22 | 12 | 45 | 93 | 29 | 42 | 96 | 59 |
| 9 | 11 | 10 | 49 | 60 | 45 | 58 | 23 | 53 |
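For comparison, a minimal sketch of the other keep options on the same df:
df.drop_duplicates(keep='first')  # keeps row 1, drops rows 3, 5, 7
df.drop_duplicates(keep=False)    # drops all four all-ones rows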
Requirement: build a data source of 1000 rows and 3 columns (A, B, C) with values in [0, 1), then clean out the outliers in column C — the values greater than twice the column's standard deviation.
Implementation:
# a 1000-row, 3-column (A, B, C) data source with values in [0, 1)
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()
|   | A | B | C |
|---|---|---|---|
| 0 | 0.483195 | 0.786152 | 0.169980 |
| 1 | 0.000227 | 0.593595 | 0.378660 |
| 2 | 0.859977 | 0.190864 | 0.885891 |
| 3 | 0.790337 | 0.898124 | 0.244129 |
| 4 | 0.212919 | 0.687666 | 0.827701 |
# Twice the standard deviation
std_twice = df['C'].std() * 2
# Outlier condition: C > std_twice; keep only the rows that are not outliers
df.loc[~(df['C'] > std_twice)]
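Equivalently, the offending rows can be removed with drop; a minimal sketch using the index of the outlier rows:
indexs = df.loc[df['C'] > std_twice].index  # index labels of the outlier rows
df.drop(labels=indexs, axis=0)              # drop those rows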
pd.concat: pandas' concatenation function, similar to np.concatenate
Parameters (a short sketch follows this list):
objs: the sequence of Series/DataFrame objects to concatenate
axis=0: the axis to concatenate along
keys: build a hierarchical index using the passed keys as the outermost level
join='outer' / 'inner': how the other axis is handled — outer keeps all columns (matching or not), inner keeps only the columns shared by every object
ignore_index=False: if True, discard the original index along the concatenation axis and renumber from 0
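A minimal sketch of keys and ignore_index (the tiny frame tmp is only illustrative):
import pandas as pd
from pandas import DataFrame

tmp = DataFrame({'A':[1,2]})
pd.concat((tmp,tmp), keys=['x','y'])     # adds an outermost index level with labels x and y
pd.concat((tmp,tmp), ignore_index=True)  # index renumbered 0..3 instead of 0,1,0,1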
Concatenation with matching columns
df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
'group':['Accounting','Product','Marketing'],
'hire_date':[1998,2017,2018]})
pd.concat((df1,df1),axis=0)
|   | employee | group | hire_date |
|---|---|---|---|
| 0 | Bobs | Accounting | 1998 |
| 1 | Linda | Product | 2017 |
| 2 | Bill | Marketing | 2018 |
| 0 | Bobs | Accounting | 1998 |
| 1 | Linda | Product | 2017 |
| 2 | Bill | Marketing | 2018 |
Concatenation with mismatched columns
df2 = df1.copy()
df2.columns = ['employee','groupps','hire_date']
"""
employee groupps hire_date
0 Bobs Accounting 1998
1 Linda Product 2017
2 Bill Marketing 2018
"""
pd.concat((df1,df2),axis=0)
|   | employee | group | groupps | hire_date |
|---|---|---|---|---|
| 0 | Bobs | Accounting | NaN | 1998 |
| 1 | Linda | Product | NaN | 2017 |
| 2 | Bill | Marketing | NaN | 2018 |
| 0 | Bobs | NaN | Accounting | 1998 |
| 1 | Linda | NaN | Product | 2017 |
| 2 | Bill | NaN | Marketing | 2018 |
join
inner: concatenates only the matching columns
outer: concatenates all columns
pd.concat((df1,df2),axis=0,join='inner')
|   | employee | hire_date |
|---|---|---|
| 0 | Bobs | 1998 |
| 1 | Linda | 2017 |
| 2 | Bill | 2018 |
| 0 | Bobs | 1998 |
| 1 | Linda | 2017 |
| 2 | Bill | 2018 |
Using the append function
append only concatenates vertically (along the row axis)
df1.append(df2)
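Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the equivalent concat call:
pd.concat((df1,df2))  # same result as df1.append(df2)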
Merging with pd.merge:
One-to-one merge
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
'group':['Accounting','Engineering','Engineering'],
})
df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
'hire_date':[2004,2008,2012],
})
pd.merge(df1,df2,on='employee')
df1:
|   | employee | group |
|---|---|---|
| 0 | Bob | Accounting |
| 1 | Jake | Engineering |
| 2 | Lisa | Engineering |
df2:
|   | employee | hire_date |
|---|---|---|
| 0 | Lisa | 2004 |
| 1 | Bob | 2008 |
| 2 | Jake | 2012 |
Merged result:
|   | employee | group | hire_date |
|---|---|---|---|
| 0 | Bob | Accounting | 2008 |
| 1 | Jake | Engineering | 2012 |
| 2 | Lisa | Engineering | 2004 |
Many-to-many merge
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
'group':['Accounting','Engineering','Engineering']})
df5 = DataFrame({'group':['Engineering','Engineering','HR'],
'supervisor':['Carly','Guido','Steve']
})
pd.merge(df1,df5,how='outer')
df1:
|   | employee | group |
|---|---|---|
| 0 | Bob | Accounting |
| 1 | Jake | Engineering |
| 2 | Lisa | Engineering |
df5:
|   | group | supervisor |
|---|---|---|
| 0 | Engineering | Carly |
| 1 | Engineering | Guido |
| 2 | HR | Steve |
Merged result:
|   | employee | group | supervisor |
|---|---|---|---|
| 0 | Bob | Accounting | NaN |
| 1 | Jake | Engineering | Carly |
| 2 | Jake | Engineering | Guido |
| 3 | Lisa | Engineering | Carly |
| 4 | Lisa | Engineering | Guido |
| 5 | NaN | HR | Steve |
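The how parameter controls which key values survive; a minimal sketch with the same df1 and df5:
pd.merge(df1,df5)              # how='inner' (the default): only groups present in both tables
pd.merge(df1,df5,how='left')   # keep every row of df1; unmatched supervisors become NaN
pd.merge(df1,df5,how='right')  # keep every row of df5; unmatched employees become NaN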
Normalizing the key
1. When columns conflict (several columns share the same name), use on= to specify which column serves as the key; suffixes renames the conflicting columns (a sketch follows the merged table)
df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
'group':['Accounting','Finance','Marketing']})
df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
'hire_date':[2003,2009,2012],
'group':['Accounting','sell','ceo']})
pd.merge(df1,df2,on='group')
df1:
|   | employee | group |
|---|---|---|
| 0 | Jack | Accounting |
| 1 | Summer | Finance |
| 2 | Steve | Marketing |
df2:
|   | employee | group | hire_date |
|---|---|---|---|
| 0 | Jack | Accounting | 2003 |
| 1 | Bob | sell | 2009 |
| 2 | Jake | ceo | 2012 |
Merged result:
|   | employee_x | group | employee_y | hire_date |
|---|---|---|---|---|
| 0 | Jack | Accounting | Jack | 2003 |
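By default the conflicting employee columns get the _x/_y suffixes; the suffixes parameter overrides them (the suffix strings here are just illustrative):
pd.merge(df1,df2,on='group',suffixes=('_left','_right'))  # yields employee_left / employee_right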
2. When the two tables have no shared column to join on, use left_on and right_on to specify which column on each side acts as the join key
df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
'group':['Accounting','Product','Marketing'],
'hire_date':[1998,2017,2018]})
df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
'hire_dates':[1998,2016,2007]})
pd.merge(df1,df5,left_on='employee',right_on='name',how='outer')
df1:
|   | employee | group | hire_date |
|---|---|---|---|
| 0 | Bobs | Accounting | 1998 |
| 1 | Linda | Product | 2017 |
| 2 | Bill | Marketing | 2018 |
df5:
|   | hire_dates | name |
|---|---|---|
| 0 | 1998 | Lisa |
| 1 | 2016 | Bobs |
| 2 | 2007 | Bill |
Merged result:
|   | employee | group | hire_date | hire_dates | name |
|---|---|---|---|---|---|
| 0 | Bobs | Accounting | 1998.0 | 2016.0 | Bobs |
| 1 | Linda | Product | 2017.0 | NaN | NaN |
| 2 | Bill | Marketing | 2018.0 | 2007.0 | Bill |
| 3 | NaN | NaN | NaN | 1998.0 | Lisa |
Original source: https://www.cnblogs.com/liubing8/p/12030693.html