
Pandas Basic Commands Cheat Sheet

Date: 2017-12-03 14:41:30


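This article is translated and adapted from Pandas Cheat Sheet - Python for Data Science. Using K-Lab as the working environment, it adds concrete examples that run each snippet from the cheat sheet.

Cheat sheet contents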


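  • [Abbreviations & library imports]
  • [Importing data]
  • [Exporting data]
  • [Creating test objects]
  • [Viewing and inspecting data]
  • [Selecting data]
  • [Cleaning data]
  • [Filtering (filter), sorting (sort), and grouping (groupby) data]
  • [Joining (join) and combining (combine) data]
  • [Statistics]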
 
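Abbreviations & library imports
 

df --- any pandas DataFrame object
s --- any pandas Series object
pandas and numpy are the most basic and most central libraries for data analysis in Python

In [2]:
import pandas as pd # import the pandas library under the alias pd
import numpy as np # import the numpy library under the alias np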
 
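Importing data
 
pd.read_csv(filename) # read data from a CSV file
pd.read_table(filename) # read data from a delimited text file (such as TSV)
pd.read_excel(filename) # read data from an Excel file
pd.read_sql(query, connection_object) # read data from a SQL table/database
pd.read_json(json_string) # read data from a JSON-formatted string, URL, or file
pd.read_html(url) # parse a URL, string, or file and return a list of the DataFrames found in its HTML tables
pd.read_clipboard() # read data from the system clipboard
pd.DataFrame(dict)  # build a DataFrame from a Python dict: the keys become the column names, the values the column contents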
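
The names filename, query, json_string and so on in the list above are placeholders, so the lines cannot be run as written. A minimal sketch of two of the readers with real inputs follows; the CSV text and the column names are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# A dict of equal-length lists: each key becomes a column name.
df_dict = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

print(df_csv)
print(df_dict)
```

Both calls should produce the same two-row table, which makes this a quick way to check how a reader interprets headers and dtypes.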
 
Exporting data
 
df.to_csv(filename) # write the DataFrame to a CSV file
df.to_excel(filename) # write the DataFrame to an Excel file
df.to_sql(table_name,connection_object) # write the DataFrame to a SQL table/database
df.to_json(filename) # write the DataFrame to a JSON file
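
As with the readers, filename, table_name and connection_object are placeholders. The sketch below shows one way to try the writers without touching the file system: to_csv returns text when no path is given, and an in-memory SQLite database serves as the connection. The table name "scores" and the sample data are assumptions made for this example.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)

# to_sql and read_sql accept a plain sqlite3 connection.
conn = sqlite3.connect(":memory:")
df.to_sql("scores", conn, index=False)  # "scores" is a hypothetical table name
round_trip = pd.read_sql("SELECT name, score FROM scores", conn)
conn.close()

print(csv_text)
print(round_trip)
```

Passing index=False keeps the row index out of the exported data, which is usually what you want when the index is just 0..n-1.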
 
Creating test objects
 
pd.DataFrame(np.random.rand(10,5)) # create a DataFrame of random floats with 10 rows and 5 columns
In [6]:
pd.DataFrame(np.random.rand(10,5))
Out[6]:
 0 1 2 3 4
0 0.178801 0.846355 0.705159 0.196188 0.874350
1 0.362044 0.390863 0.760347 0.555912 0.689457
2 0.201675 0.673297 0.180532 0.648759 0.483332
3 0.645076 0.932788 0.182940 0.722370 0.542127
4 0.578884 0.839314 0.734570 0.691949 0.538795
5 0.999395 0.383014 0.192030 0.315428 0.940216
6 0.980939 0.475735 0.674909 0.112695 0.961567
7 0.389256 0.855763 0.026823 0.876811 0.274633
8 0.108523 0.267471 0.988235 0.991163 0.271738
9 0.403084 0.935190 0.628058 0.296839 0.386862
In [2]:
pd.DataFrame(np.random.rand(10,5))
Out[2]:
 0 1 2 3 4
0 0.647736 0.372628 0.255864 0.853542 0.613267
1 0.064364 0.156340 0.575021 0.561911 0.479901
2 0.036473 0.876819 0.255325 0.393240 0.543039
3 0.357489 0.006578 0.093966 0.531294 0.029009
4 0.550582 0.504600 0.273546 0.011693 0.052523
5 0.721563 0.170689 0.702163 0.447883 0.905983
6 0.839726 0.935997 0.343133 0.356957 0.377116
7 0.931894 0.026684 0.719148 0.911425 0.676187
8 0.115619 0.114894 0.130696 0.321598 0.170082
9 0.194649 0.526141 0.965442 0.275433 0.880765
 
pd.Series(my_list) # create a Series from an iterable my_list
In [7]:
my_list = ['huang', 100, 'xiaolei',4,56]
pd.Series(my_list)
Out[7]:
0      huang
1        100
2    xiaolei
3          4
4         56
dtype: object
In [3]:
my_list = ['Kesci',100,'Welcome to Kesci']
pd.Series(my_list)
Out[3]:
0               Kesci
1                 100
2    Welcome to Kesci
dtype: object
 
df.index = pd.date_range('2017/1/1', periods=df.shape[0]) # add a date index
In [4]:
df = pd.DataFrame(np.random.rand(10,5))
df.index = pd.date_range('2017/1/1', periods=df.shape[0])
df
Out[4]:
 0 1 2 3 4
2017-01-01 0.248515 0.647889 0.111346 0.540434 0.159914
2017-01-02 0.445073 0.329843 0.823678 0.737438 0.707598
2017-01-03 0.526543 0.876826 0.717986 0.271920 0.719657
2017-01-04 0.471256 0.657647 0.973484 0.598997 0.249301
2017-01-05 0.958465 0.474331 0.004078 0.842343 0.819295
2017-01-06 0.271308 0.271988 0.434776 0.449652 0.369188
2017-01-07 0.989573 0.928428 0.452436 0.058590 0.732283
2017-01-08 0.435328 0.730214 0.909400 0.683413 0.186820
2017-01-09 0.897414 0.687525 0.122937 0.018102 0.440427
2017-01-10 0.743821 0.134602 0.210326 0.877157 0.815462
 
Viewing and inspecting data
 
df.head(n)  # view the first n rows of the DataFrame
In [9]:
df = pd.DataFrame(np.random.rand(10, 5))
df.head(5)
Out[9]:
 0 1 2 3 4
0 0.857171 0.900692 0.500228 0.636632 0.395819
1 0.332900 0.856592 0.645121 0.311064 0.836480
2 0.815698 0.667021 0.328536 0.924848 0.400043
3 0.693114 0.551914 0.696962 0.703079 0.645103
4 0.842381 0.466469 0.279249 0.740606 0.941279
In [5]:
df = pd.DataFrame(np.random.rand(10,5))
df.head(3)
Out[5]:
 0 1 2 3 4
0 0.705884 0.845813 0.770585 0.481049 0.381055
1 0.733309 0.542363 0.264334 0.254283 0.859442
2 0.497977 0.474898 0.806073 0.384412 0.242989
 
df.tail(n) # view the last n rows of the DataFrame
In [10]:
df = pd.DataFrame(np.random.rand(15,8))
df.tail(4)
Out[10]:
 0 1 2 3 4 5 6 7
11 0.785491 0.243000 0.991953 0.367337 0.512946 0.740280 0.897460 0.799860
12 0.602312 0.440157 0.985066 0.992641 0.550723 0.387046 0.047515 0.566604
13 0.726211 0.132540 0.302954 0.542220 0.029554 0.963806 0.436351 0.462788
14 0.516992 0.624268 0.423005 0.476461 0.627335 0.635427 0.173666 0.034728
In [6]:
df = pd.DataFrame(np.random.rand(10,5))
df.tail(3)
Out[6]:
 0 1 2 3 4
7 0.617289 0.009801 0.220155 0.992743 0.944472
8 0.261141 0.940925 0.063394 0.052104 0.517853
9 0.634541 0.897483 0.748453 0.805861 0.344938
 
df.shape # view the number of rows and columns of the DataFrame
In [11]:
df = pd.DataFrame(np.random.rand(14, 5))
df.shape
Out[11]:
(14, 5)
In [7]:
df = pd.DataFrame(np.random.rand(10,5))
df.shape
Out[7]:
(10, 5)
 
df.info() # view the DataFrame's index, dtypes, and memory usage
In [13]:
df = pd.DataFrame(np.random.rand(10, 4))
df.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
0    10 non-null float64
1    10 non-null float64
2    10 non-null float64
3    10 non-null float64
dtypes: float64(4)
memory usage: 400.0 bytes
In [8]:
df = pd.DataFrame(np.random.rand(10,5))
df.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
0    10 non-null float64
1    10 non-null float64
2    10 non-null float64
3    10 non-null float64
4    10 non-null float64
dtypes: float64(5)
memory usage: 480.0 bytes
 
df.describe() # summary statistics for the numeric columns
In [14]:
df.describe()
Out[14]:
 0 1 2 3
count 10.000000 10.000000 10.000000 10.000000
mean 0.459510 0.467315 0.616311 0.546682
std 0.401191 0.319752 0.304275 0.205285
min 0.017633 0.150638 0.068416 0.160698
25% 0.108201 0.183076 0.535336 0.419520
50% 0.409686 0.381424 0.729697 0.610982
75% 0.846220 0.751856 0.831845 0.688182
max 0.970186 0.959066 0.905394 0.779920
In [9]:
df.describe()
Out[9]:
 0 1 2 3 4
count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 0.410631 0.497585 0.506200 0.322960 0.603119
std 0.280330 0.322573 0.254780 0.260299 0.256370
min 0.043731 0.031742 0.070668 0.044822 0.143786
25% 0.240661 0.211625 0.416827 0.145298 0.422969
50% 0.346297 0.544697 0.479648 0.217359 0.635974
75% 0.493105 0.669044 0.557353 0.468119 0.782573
max 0.937583 0.945573 0.987328 0.883157 0.992891
 
s.value_counts(dropna=False) # count how often each unique value occurs (NaN included)
In [16]:
s = pd.Series([1,2,5,6,6,6,6,5,5,'huang'])
s.value_counts(dropna=False)
Out[16]:
6        4
5        3
huang    1
2        1
1        1
dtype: int64
In [10]:
s = pd.Series([1,2,3,3,4,np.nan,5,5,5,6,7])
s.value_counts(dropna=False)
Out[10]:
 5.0    3
 3.0    2
 7.0    1
 6.0    1
NaN     1
 4.0    1
 2.0    1
 1.0    1
dtype: int64
 
df.apply(pd.Series.value_counts) # count how often each unique value occurs in every column of the DataFrame
In [19]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
print(df)
df.apply(pd.Series.value_counts)
 
          a         b         c         d         e
0  0.743688  0.081938  0.693243  0.647515  0.835997
1  0.162604  0.421371  0.422371  0.930136  0.732234
2  0.842065  0.139927  0.675018  0.543914  0.017094
3  0.535794  0.078217  0.964779  0.607462  0.432429
4  0.560279  0.544811  0.304371  0.797165  0.505008
5  0.695691  0.696121  0.741812  0.502741  0.484697
6  0.775342  0.410536  0.275251  0.810911  0.081818
7  0.584267  0.917728  0.379231  0.097702  0.622885
8  0.754810  0.809628  0.102337  0.283509  0.615719
9  0.003056  0.536268  0.187236  0.181844  0.255499
Out[19]:
 a b c d e
0.003056 1.0 NaN NaN NaN NaN
0.017094 NaN NaN NaN NaN 1.0
0.078217 NaN 1.0 NaN NaN NaN
0.081818 NaN NaN NaN NaN 1.0
0.081938 NaN 1.0 NaN NaN NaN
0.097702 NaN NaN NaN 1.0 NaN
0.102337 NaN NaN 1.0 NaN NaN
0.139927 NaN 1.0 NaN NaN NaN
0.162604 1.0 NaN NaN NaN NaN
0.181844 NaN NaN NaN 1.0 NaN
0.187236 NaN NaN 1.0 NaN NaN
0.255499 NaN NaN NaN NaN 1.0
0.275251 NaN NaN 1.0 NaN NaN
0.283509 NaN NaN NaN 1.0 NaN
0.304371 NaN NaN 1.0 NaN NaN
0.379231 NaN NaN 1.0 NaN NaN
0.410536 NaN 1.0 NaN NaN NaN
0.421371 NaN 1.0 NaN NaN NaN
0.422371 NaN NaN 1.0 NaN NaN
0.432429 NaN NaN NaN NaN 1.0
0.484697 NaN NaN NaN NaN 1.0
0.502741 NaN NaN NaN 1.0 NaN
0.505008 NaN NaN NaN NaN 1.0
0.535794 1.0 NaN NaN NaN NaN
0.536268 NaN 1.0 NaN NaN NaN
0.543914 NaN NaN NaN 1.0 NaN
0.544811 NaN 1.0 NaN NaN NaN
0.560279 1.0 NaN NaN NaN NaN
0.584267 1.0 NaN NaN NaN NaN
0.607462 NaN NaN NaN 1.0 NaN
0.615719 NaN NaN NaN NaN 1.0
0.622885 NaN NaN NaN NaN 1.0
0.647515 NaN NaN NaN 1.0 NaN
0.675018 NaN NaN 1.0 NaN NaN
0.693243 NaN NaN 1.0 NaN NaN
0.695691 1.0 NaN NaN NaN NaN
0.696121 NaN 1.0 NaN NaN NaN
0.732234 NaN NaN NaN NaN 1.0
0.741812 NaN NaN 1.0 NaN NaN
0.743688 1.0 NaN NaN NaN NaN
0.754810 1.0 NaN NaN NaN NaN
0.775342 1.0 NaN NaN NaN NaN
0.797165 NaN NaN NaN 1.0 NaN
0.809628 NaN 1.0 NaN NaN NaN
0.810911 NaN NaN NaN 1.0 NaN
0.835997 NaN NaN NaN NaN 1.0
0.842065 1.0 NaN NaN NaN NaN
0.917728 NaN 1.0 NaN NaN NaN
0.930136 NaN NaN NaN 1.0 NaN
0.964779 NaN NaN 1.0 NaN NaN
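
With random floats almost every value is unique, so the table above is mostly NaN and 1.0. The pattern is easier to read on columns with a few repeated values. A small sketch, with made-up survey answers as the data:

```python
import pandas as pd

# Hypothetical survey answers; discrete values make the counts meaningful.
df = pd.DataFrame({
    "q1": ["yes", "no", "yes", "yes"],
    "q2": ["no", "no", "yes", "no"],
})

# One column of counts per original column, indexed by the distinct values.
counts = df.apply(pd.Series.value_counts)
print(counts)
```

Each row of the result is one distinct value and each cell is how often that value occurs in the corresponding column.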
 
Selecting data
 
df[col] # return the selected column as a Series
In [23]:
df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
df['c']
Out[23]:
0    0.238355
1    0.641129
2    0.716013
3    0.549903
4    0.997134
Name: c, dtype: float64
In [11]:
df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
df['C']
Out[11]:
0    0.720965
1    0.360155
2    0.474067
3    0.116206
4    0.774503
Name: C, dtype: float64
 
df[[col1, col2]] # return the selected columns as a new DataFrame
In [25]:
df = pd.DataFrame(np.random.rand(5, 4), columns=list('abcd'))
df[['a','d']]
Out[25]:
 a d
0 0.689811 0.446470
1 0.022796 0.101198
2 0.724498 0.555124
3 0.923610 0.952664
4 0.990061 0.891120
In [12]:
df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
df[['B','E']]
Out[12]:
 B E
0 0.205912 0.333909
1 0.475620 0.540206
2 0.144041 0.065117
3 0.636970 0.406317
4 0.451541 0.944245
 
s.iloc[0] # select by position
In [11]:
s = pd.Series(np.array(['huang','xiao','lei']))
print(s)
s.iloc[1]
 
0    huang
1     xiao
2      lei
dtype: object
Out[11]:
'xiao'
In [13]:
s = pd.Series(np.array(['I','Love','Data']))
s.iloc[0]
Out[13]:
'I'
 
s.loc['index_one'] # select by index label
In [10]:
s = pd.Series(np.array(['df','s','df']))
print(s)
s.loc[1]
 
0    df
1     s
2    df
dtype: object
Out[10]:
's'
In [14]:
s = pd.Series(np.array(['I','Love','Data']))
s.loc[1]
Out[14]:
'Love'
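
In the cells above the Series uses the default 0..n-1 index, so iloc and loc appear to do the same thing. The difference shows up once the labels are not positions. A minimal sketch, with labels chosen purely for illustration:

```python
import pandas as pd

# Hypothetical string labels, so labels and positions differ.
s = pd.Series([10, 20, 30], index=["x", "y", "z"])

by_label = s.loc["y"]      # label-based lookup
by_position = s.iloc[0]    # position-based lookup

# Label slices include the end label; position slices exclude the end position.
label_slice = s.loc["x":"y"]
position_slice = s.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))  # 20 10 2 1
```

The inclusive end of a label slice is also why an expression such as df.loc[1:3] returns three rows.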
 
df.iloc[0,:] # select the first row
In [24]:
df = pd.DataFrame(np.random.rand(5, 5),columns= list('abcde'))
print(df)
#df.iloc[1, :]
df.loc[1:3]
 
          a         b         c         d         e
0  0.293829  0.636855  0.383047  0.182288  0.991080
1  0.098706  0.984684  0.362848  0.865179  0.191418
2  0.238197  0.027557  0.847372  0.478444  0.286712
3  0.816694  0.886405  0.637459  0.917760  0.218578
4  0.962678  0.322024  0.489059  0.675897  0.024523
Out[24]:
 a b c d e
1 0.098706 0.984684 0.362848 0.865179 0.191418
2 0.238197 0.027557 0.847372 0.478444 0.286712
3 0.816694 0.886405 0.637459 0.917760 0.218578
In [15]:
df = pd.DataFrame(np.random.rand(5,5),columns=list('ABCDE'))
df.iloc[0,:]
Out[15]:
A    0.234156
B    0.513754
C    0.593067
D    0.856575
E    0.291528
Name: 0, dtype: float64
 
df.iloc[0,0] # select the first element of the first row
In [26]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
print(df)
df.iloc[1,3]
 
          a         s         d         f         g
0  0.819962  0.011747  0.969565  0.467551  0.281303
1  0.741277  0.645715  0.113062  0.495135  0.169768
2  0.862192  0.433940  0.726602  0.692266  0.796443
3  0.701999  0.222973  0.553875  0.253598  0.090833
4  0.354669  0.779308  0.282878  0.729156  0.972402
5  0.310698  0.253160  0.435239  0.465066  0.393626
6  0.449286  0.079748  0.778311  0.651505  0.659701
7  0.621606  0.883868  0.059535  0.015870  0.056286
8  0.762552  0.159625  0.716243  0.179370  0.161484
9  0.695830  0.388746  0.759827  0.325159  0.379626
Out[26]:
0.49513455869985046
In [16]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.iloc[0,0]
Out[16]:
0.91525996455410763
 
Cleaning data
 
df.columns = ['a','b'] # rename the DataFrame's columns
In [36]:
df = pd.DataFrame({'a':np.array([1,2,5,8,4,3]), 'b':np.array([9,3,7,5,3,4]), 'c':'htl'})
df.columns = ['q','e','r']
df
Out[36]:
 q e r
0 1 9 htl
1 2 3 htl
2 5 7 htl
3 8 5 htl
4 4 3 htl
5 3 4 htl
In [30]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
df.columns = ['a','b','c']
df
Out[30]:
 a b c
0 1.0 NaN foo
1 NaN 4.0 foo
2 2.0 NaN foo
3 3.0 5.0 foo
4 6.0 9.0 foo
5 NaN NaN foo
 
pd.isnull() # check for missing values; returns a Boolean (True/False) object of the same shape as the input
In [37]:
df = pd.DataFrame({'a':np.array([1,np.nan,2,3,6,np.nan]),
                  'b':np.array([np.nan,4,np.nan,5,9,np.nan]),
                   'c':'sdf'})
pd.isnull(df)
Out[37]:
 a b c
0 False True False
1 True False False
2 False True False
3 False False False
4 False False False
5 True True False
In [18]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
pd.isnull(df)
Out[18]:
 A B C
0 False True False
1 True False False
2 False True False
3 False False False
4 False False False
5 True True False
 
pd.notnull() # check for non-missing values; returns a Boolean (True/False) object of the same shape as the input
In [39]:
df = pd.DataFrame({
                'a':np.array([1,np.nan,2,3,4,np.nan]),
                'b':np.array([np.nan,4,np.nan,5,9,np.nan]),
                'c':'foo'
                    })
pd.notnull(df)
Out[39]:
 a b c
0 True False True
1 False True True
2 True False True
3 True True True
4 True True True
5 False False True
In [40]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
pd.notnull(df)
df.dropna()
Out[40]:
 A B C
3 3.0 5.0 foo
4 6.0 9.0 foo
 
df.dropna() # drop the rows of the DataFrame that contain missing values
In [20]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
df.dropna()
Out[20]:
 A B C
3 3.0 5.0 foo
4 6.0 9.0 foo
 
df.dropna(axis=1) # drop the columns of the DataFrame that contain missing values
In [45]:
df = pd.DataFrame({
        'a':np.array([1,np.nan,2,3,4,np.nan]),
        'b':np.array([np.nan,4,np.nan,5,9,np.nan]),
        'c':'foo'
                    })
print(df)
df.dropna(axis=1)
 
     a    b    c
0  1.0  NaN  foo
1  NaN  4.0  foo
2  2.0  NaN  foo
3  3.0  5.0  foo
4  4.0  9.0  foo
5  NaN  NaN  foo
Out[45]:
 c
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
In [21]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
df.dropna(axis=1)
Out[21]:
 C
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
 
df.dropna(axis=1,thresh=n) # drop the columns of df that have fewer than n non-missing values
In [73]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
print(df)
df.dropna(axis=1,thresh=3)
 
     A    B    C
0  1.0  NaN  foo
1  NaN  4.0  foo
2  2.0  NaN  foo
3  3.0  5.0  foo
4  6.0  9.0  foo
5  NaN  NaN  foo
Out[73]:
 A B C
0 1.0 NaN foo
1 NaN 4.0 foo
2 2.0 NaN foo
3 3.0 5.0 foo
4 6.0 9.0 foo
5 NaN NaN foo
In [22]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
test = df.dropna(axis=1,thresh=1)
test
Out[22]:
 A B C
0 1.0 NaN foo
1 NaN 4.0 foo
2 2.0 NaN foo
3 3.0 5.0 foo
4 6.0 9.0 foo
5 NaN NaN foo
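
Both thresh examples above return the DataFrame unchanged, because every column has at least 3 non-missing values. thresh is the minimum number of non-missing values a column (or row) needs in order to be kept. A sketch on the same data with a threshold that actually removes something:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 2, 3, 6, np.nan],        # 4 non-missing values
    "B": [np.nan, 4, np.nan, 5, 9, np.nan],   # 3 non-missing values
    "C": "foo",                               # 6 non-missing values
})

# Keep only the columns that have at least 4 non-missing values.
kept_cols = df.dropna(axis=1, thresh=4).columns.tolist()

# Without axis=1 the same rule applies per row: keep rows with >= 3 non-missing values.
kept_rows = df.dropna(thresh=3).index.tolist()

print(kept_cols)  # ['A', 'C']
print(kept_rows)  # [3, 4]
```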
 
df.fillna(x) # replace every missing value in the DataFrame with x
In [76]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
print(df)
df.fillna('huang')
 
     A    B    C
0  1.0  NaN  foo
1  NaN  4.0  foo
2  2.0  NaN  foo
3  3.0  5.0  foo
4  6.0  9.0  foo
5  NaN  NaN  foo
Out[76]:
 A B C
0 1 huang foo
1 huang 4 foo
2 2 huang foo
3 3 5 foo
4 6 9 foo
5 huang huang foo
In [23]:
df = pd.DataFrame({'A':np.array([1,np.nan,2,3,6,np.nan]),
                 'B':np.array([np.nan,4,np.nan,5,9,np.nan]),
                  'C':'foo'})
df.fillna('Test')
Out[23]:
 A B C
0 1 Test foo
1 Test 4 foo
2 2 Test foo
3 3 5 foo
4 6 9 foo
5 Test Test foo
 

s.fillna(s.mean()) # replace every missing value with the mean of the Series

In [82]:
s = pd.Series([1,3,4,np.nan,7,8,9])
a = s.fillna(s.mean())
print(a)
 
0    1.000000
1    3.000000
2    4.000000
3    5.333333
4    7.000000
5    8.000000
6    9.000000
dtype: float64
In [24]:
s = pd.Series([1,3,5,np.nan,7,9,9])
s.fillna(s.mean())
Out[24]:
0    1.000000
1    3.000000
2    5.000000
3    5.666667
4    7.000000
5    9.000000
6    9.000000
dtype: float64
 
s.astype(float) # convert the dtype of the Series to float
In [85]:
s = pd.Series([1,2,4,np.nan,5,6,6])
a = s.fillna(s.mean())
a.astype(int)
Out[85]:
0    1
1    2
2    4
3    4
4    5
5    6
6    6
dtype: int64
In [25]:
s = pd.Series([1,3,5,np.nan,7,9,9])
s.astype(float)
Out[25]:
0    1.0
1    3.0
2    5.0
3    NaN
4    7.0
5    9.0
6    9.0
dtype: float64
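
The first astype cell fills the missing value before casting to int, and that order matters: a float Series that still contains NaN cannot be cast to a plain integer dtype. The sketch below shows the failure and two possible ways around it; filling with 0 is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 7])

# A float Series containing NaN cannot be cast to plain int.
try:
    s.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Option 1: fill the gaps first, then cast.
filled = s.fillna(0).astype(int)

# Option 2: pandas' nullable integer dtype keeps the missing value.
nullable = s.astype("Int64")

print(cast_failed, filled.tolist(), nullable.dtype)
```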
 
s.replace(1,'one') # replace every 1 in the Series with 'one'
In [86]:
s = pd.Series([1,2,4,np.nan,5,6,7])
s.replace(1,'yi')
Out[86]:
0     yi
1      2
2      4
3    NaN
4      5
5      6
6      7
dtype: object
In [26]:
s = pd.Series([1,3,5,np.nan,7,9,9])
s.replace(1,'one')
Out[26]:
0    one
1      3
2      5
3    NaN
4      7
5      9
6      9
dtype: object
 
s.replace([1,3],['one','three']) # replace every 1 in the Series with 'one' and every 3 with 'three'
In [87]:
s = pd.Series([1,3,4,np.nan,7,3,5])
s.replace([1,4],['sd', 'dsf'])
Out[87]:
0     sd
1      3
2    dsf
3    NaN
4      7
5      3
6      5
dtype: object
In [27]:
s = pd.Series([1,3,5,np.nan,7,9,9])
s.replace([1,3],['one','three'])
Out[27]:
0      one
1    three
2        5
3      NaN
4        7
5        9
6        9
dtype: object
 
df.rename(columns=lambda x: x + 2) # rename all columns by applying a function to each label (here, adding 2)
In [20]:
df = pd.DataFrame(np.random.rand(4, 4))

df.rename(columns=lambda x:x+2 )
Out[20]:
 2 3 4 5
0 0.081634 0.064494 0.171152 0.568444
1 0.355771 0.934762 0.634321 0.505097
2 0.544467 0.824562 0.742992 0.937263
3 0.524025 0.620101 0.764900 0.211475
In [28]:
df = pd.DataFrame(np.random.rand(4,4))
df.rename(columns=lambda x: x+ 2)
Out[28]:
 2 3 4 5
0 0.753588 0.137984 0.022013 0.900072
1 0.947073 0.815182 0.769708 0.729688
2 0.334815 0.204315 0.707794 0.437704
3 0.467212 0.738360 0.853463 0.529946
 
df.rename(columns={'old_name': 'new_name'}) # rename the selected columns
In [24]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfp'))
df.rename(columns={'a':'huang', 'd':'xiao'})
Out[24]:
 huang s xiao f p
0 0.883222 0.073876 0.740827 0.035460 0.929947
1 0.161005 0.276637 0.095228 0.490336 0.433798
2 0.245889 0.763647 0.472240 0.718072 0.260942
3 0.933051 0.400177 0.494481 0.173994 0.800894
4 0.762221 0.170352 0.507960 0.383658 0.533412
5 0.665419 0.515597 0.538217 0.305045 0.072796
6 0.723260 0.661109 0.793995 0.391161 0.724623
7 0.829130 0.896624 0.732372 0.317762 0.745941
8 0.302628 0.320006 0.420980 0.400016 0.556747
9 0.574811 0.952172 0.573045 0.343735 0.930765
In [29]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.rename(columns={'A':'newA','C':'newC'})
Out[29]:
 newA B newC D E
0 0.169072 0.694563 0.069313 0.637560 0.475181
1 0.910271 0.800067 0.676448 0.934767 0.025608
2 0.825186 0.451545 0.135421 0.635303 0.419758
3 0.401979 0.510304 0.014901 0.209211 0.121889
4 0.579282 0.001947 0.036519 0.750415 0.453078
5 0.896213 0.557514 0.028147 0.527471 0.575772
6 0.443222 0.095459 0.319582 0.912069 0.781455
7 0.067923 0.590470 0.602999 0.507358 0.703022
8 0.301491 0.682629 0.283103 0.565754 0.089268
9 0.399671 0.925416 0.020578 0.278000 0.591522
 
df.set_index('column_one') # use the given column as the index
In [27]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
print(df)
df.set_index('a')
 
          a         s         d         f         g
0  0.483397  0.944772  0.678662  0.439009  0.588450
1  0.984601  0.110966  0.331303  0.578410  0.467633
2  0.001784  0.431582  0.593597  0.238572  0.429771
3  0.644358  0.102394  0.935862  0.863739  0.118716
4  0.514392  0.928633  0.750763  0.026851  0.049935
5  0.749309  0.961028  0.383087  0.052621  0.598980
6  0.963810  0.087193  0.569974  0.440941  0.384748
7  0.000576  0.538573  0.171773  0.802815  0.556191
8  0.731837  0.934994  0.998125  0.485058  0.745950
9  0.599032  0.462614  0.234398  0.833158  0.521382
Out[27]:
 s d f g
a
0.483397 0.944772 0.678662 0.439009 0.588450
0.984601 0.110966 0.331303 0.578410 0.467633
0.001784 0.431582 0.593597 0.238572 0.429771
0.644358 0.102394 0.935862 0.863739 0.118716
0.514392 0.928633 0.750763 0.026851 0.049935
0.749309 0.961028 0.383087 0.052621 0.598980
0.963810 0.087193 0.569974 0.440941 0.384748
0.000576 0.538573 0.171773 0.802815 0.556191
0.731837 0.934994 0.998125 0.485058 0.745950
0.599032 0.462614 0.234398 0.833158 0.521382
In [30]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.set_index('B')
Out[30]:
 A C D E
B
0.311742 0.972069 0.557977 0.114267 0.795128
0.931644 0.725425 0.082130 0.993764 0.136923
0.206382 0.980647 0.947041 0.038841 0.879139
0.157801 0.402233 0.249151 0.724130 0.108238
0.314238 0.341221 0.512180 0.218882 0.046379
0.029040 0.470619 0.666784 0.036655 0.823498
0.843928 0.779437 0.926912 0.189213 0.624111
0.282773 0.993681 0.048483 0.135934 0.576662
0.759600 0.235513 0.359139 0.488255 0.669043
0.088552 0.893269 0.277296 0.889523 0.398392
 
df.rename(index = lambda x: x+ 1) # relabel the whole index by applying a function to each label (here, adding 1)
In [29]:
df = pd.DataFrame(np.random.rand(10, 5))
df.rename(index = lambda x: x+1)
Out[29]:
 0 1 2 3 4
1 0.932421 0.478929 0.051820 0.721526 0.016739
2 0.359403 0.327488 0.503009 0.352523 0.169186
3 0.894238 0.268052 0.906756 0.726393 0.973686
4 0.188892 0.056018 0.156585 0.643488 0.321641
5 0.661594 0.043409 0.392303 0.469758 0.157635
6 0.582072 0.992046 0.060181 0.202060 0.119541
7 0.073971 0.157798 0.616039 0.516502 0.472920
8 0.885208 0.158675 0.211644 0.763249 0.762270
9 0.907770 0.455217 0.430548 0.473017 0.240695
10 0.043648 0.259251 0.365041 0.518889 0.765609
In [31]:
df = pd.DataFrame(np.random.rand(10,5))
df.rename(index = lambda x: x+ 1)
Out[31]:
 0 1 2 3 4
1 0.386542 0.031932 0.963200 0.790339 0.602533
2 0.053492 0.652174 0.889465 0.465296 0.843528
3 0.411836 0.460788 0.110352 0.083247 0.389855
4 0.336156 0.830522 0.560991 0.667896 0.233841
5 0.307933 0.995207 0.506680 0.957895 0.636461
6 0.724975 0.842118 0.123139 0.244357 0.803936
7 0.059176 0.117784 0.330192 0.418764 0.464144
8 0.104323 0.222367 0.930414 0.659232 0.562155
9 0.484089 0.024045 0.879834 0.492231 0.949636
10 0.201583 0.280658 0.356804 0.890706 0.236174
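
One detail that the cells above do not show: rename and set_index return a new DataFrame and leave the original untouched, so the result has to be assigned to a name if it is needed later. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# rename returns a new DataFrame; df itself is untouched.
renamed = df.rename(columns={"A": "newA"})

# set_index likewise returns a new DataFrame with column B as the index.
indexed = df.set_index("B")

print(list(df.columns))       # ['A', 'B']
print(list(renamed.columns))  # ['newA', 'B']
print(indexed.index.name)     # B
```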
 
Filtering (filter), sorting (sort), and grouping (groupby) data
 
df[df[col] > 0.5] # select the rows of df in which column col is greater than 0.5
In [33]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('asdfg'))
print(df)
df[df['a']>0.5]
 
          a         s         d         f         g
0  0.191880  0.437651  0.780847  0.836473  0.086490
1  0.997351  0.671057  0.212071  0.946415  0.768535
2  0.506504  0.800164  0.968510  0.513060  0.258659
3  0.791777  0.632927  0.624002  0.799357  0.270455
4  0.207246  0.152955  0.007859  0.257787  0.208638
5  0.620649  0.557626  0.393774  0.331476  0.855253
6  0.220170  0.358326  0.811410  0.667446  0.085703
7  0.554684  0.994837  0.054684  0.854683  0.749515
8  0.759856  0.771095  0.571663  0.189677  0.177212
9  0.887868  0.617078  0.487259  0.462189  0.673066
Out[33]:
 a s d f g
1 0.997351 0.671057 0.212071 0.946415 0.768535
2 0.506504 0.800164 0.968510 0.513060 0.258659
3 0.791777 0.632927 0.624002 0.799357 0.270455
5 0.620649 0.557626 0.393774 0.331476 0.855253
7 0.554684 0.994837 0.054684 0.854683 0.749515
8 0.759856 0.771095 0.571663 0.189677 0.177212
9 0.887868 0.617078 0.487259 0.462189 0.673066
In [32]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df[df['A'] > 0.5]
Out[32]:
 A B C D E
0 0.534886 0.863546 0.236718 0.326766 0.415460
2 0.953931 0.070198 0.483749 0.922528 0.295505
8 0.880175 0.056811 0.520499 0.533152 0.548145
 
df[(df[col] > 0.5) & (df[col] < 0.7)] # select the rows of df in which column col is greater than 0.5 and less than 0.7
In [34]:
df = pd.DataFrame(np.random.rand(10,6),columns= list('qwerty'))
df[(df['e'] > 0.5) &(df['t'] < 0.7) ]
Out[34]:
 q w e r t y
2 0.176275 0.358433 0.895002 0.739299 0.050452 0.114546
3 0.726330 0.591592 0.909450 0.120671 0.677124 0.837148
4 0.318870 0.805787 0.600435 0.629595 0.045091 0.891886
5 0.270306 0.143335 0.519607 0.118409 0.079835 0.071877
In [33]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df[(df['C'] > 0.5) & (df['D'] < 0.7)]
Out[33]:
 A B C D E
2 0.953112 0.174517 0.645300 0.308216 0.171177
6 0.853087 0.863079 0.701823 0.354019 0.311754
 
df.sort_values(col1) # sort df by column col1 in ascending order
In [35]:
df = pd.DataFrame(np.random.rand(10,6),columns=list('adsfgh'))
df.sort_values('a')
Out[35]:
 a d s f g h
8 0.012038 0.240554 0.900154 0.630489 0.971382 0.889947
3 0.174606 0.704540 0.284934 0.412725 0.261158 0.807697
9 0.324203 0.834741 0.624353 0.676012 0.580034 0.436738
1 0.386444 0.256227 0.924961 0.000652 0.589956 0.476489
5 0.479683 0.080173 0.333917 0.741830 0.219858 0.550681
6 0.546706 0.358566 0.875383 0.921672 0.004955 0.631361
4 0.581234 0.001990 0.737987 0.203702 0.231551 0.235576
7 0.762742 0.800615 0.945827 0.434820 0.755877 0.312649
2 0.888132 0.019374 0.555217 0.618628 0.396756 0.924784
0 0.904388 0.758854 0.450406 0.487383 0.666163 0.430539
In [34]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values('E')
Out[34]:
 A B C D E
3 0.024096 0.623842 0.775949 0.828343 0.317729
6 0.220055 0.381614 0.463676 0.762644 0.391758
4 0.589411 0.727439 0.064528 0.319521 0.413518
1 0.878490 0.229301 0.699506 0.726879 0.464106
8 0.438101 0.970649 0.050256 0.697440 0.499057
9 0.566100 0.558798 0.723253 0.254244 0.524486
7 0.613603 0.933109 0.677036 0.808160 0.544953
5 0.079326 0.711673 0.266434 0.910628 0.816783
2 0.132114 0.145395 0.908436 0.521271 0.889645
0 0.432677 0.216837 0.203532 0.093214 0.977671
 
df.sort_values(col2,ascending=False) # sort df by column col2 in descending order
In [36]:
df = pd.DataFrame(np.random.rand(10, 8),columns=list('qwertyui'))
df.sort_values('e', ascending=False)
Out[36]:
 q w e r t y u i
8 0.541191 0.443107 0.804432 0.475763 0.332738 0.169072 0.350597 0.234079
9 0.278131 0.672111 0.766488 0.555026 0.271935 0.453826 0.491817 0.986139
1 0.758781 0.041056 0.732308 0.974348 0.219851 0.211953 0.524819 0.300156
2 0.065457 0.556341 0.655507 0.205678 0.606155 0.945356 0.915438 0.642333
4 0.916662 0.179418 0.620904 0.689385 0.477483 0.262302 0.868513 0.002603
6 0.934955 0.970812 0.331655 0.507056 0.012076 0.643469 0.579360 0.416791
3 0.372486 0.775326 0.250734 0.021345 0.267355 0.059874 0.253597 0.244643
7 0.598279 0.031159 0.205364 0.715331 0.340993 0.918638 0.918882 0.971622
5 0.062437 0.923440 0.119125 0.755429 0.744593 0.421468 0.366993 0.103529
0 0.965093 0.630529 0.034310 0.500022 0.736686 0.484777 0.595759 0.281686
In [35]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values('A',ascending=False)
Out[35]:
 A B C D E
9 0.977172 0.930607 0.889285 0.475032 0.031715
0 0.864511 0.229990 0.678612 0.042491 0.148123
2 0.694747 0.580891 0.817524 0.392417 0.055003
6 0.684327 0.802028 0.862043 0.241838 0.800401
7 0.612324 0.099445 0.714120 0.215054 0.280343
8 0.441434 0.315553 0.564762 0.800143 0.330030
1 0.438734 0.161109 0.610750 0.647330 0.792404
4 0.365880 0.710768 0.344320 0.998757 0.979497
3 0.202511 0.769728 0.575057 0.511384 0.696753
5 0.029527 0.560114 0.224787 0.086291 0.318322
 
df.sort_values([col1,col2],ascending=[True,False]) # sort df by col1 in ascending order, then by col2 in descending order
In [37]:
df = pd.DataFrame(np.random.rand(5,6),columns=list('qwerty'))
df.sort_values(['q', 'w'],ascending=[True, False])
Out[37]:
 q w e r t y
3 0.039156 0.902539 0.544040 0.715766 0.476489 0.968014
4 0.369672 0.760559 0.339207 0.773287 0.112713 0.465799
2 0.446962 0.675626 0.805690 0.869418 0.553809 0.310547
0 0.898922 0.210659 0.024452 0.310047 0.492718 0.530260
1 0.981514 0.476470 0.435834 0.613164 0.071609 0.771960
In [36]:
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.sort_values(['A','E'],ascending=[True,False])
Out[36]:
 A B C D E
6 0.075863 0.696980 0.648945 0.336977 0.113122
2 0.199316 0.632063 0.787358 0.133175 0.060568
5 0.242081 0.818550 0.618439 0.215761 0.924459
7 0.261237 0.400725 0.659224 0.555746 0.132572
0 0.390540 0.358432 0.754028 0.194403 0.889624
8 0.410481 0.463811 0.343021 0.736340 0.291121
4 0.578705 0.544711 0.881707 0.396593 0.414465
3 0.600541 0.459247 0.591303 0.027464 0.496864
9 0.720029 0.419921 0.740225 0.904391 0.226958
1 0.777955 0.992290 0.144495 0.600207 0.647018
 
df.groupby(col) # group df by the values of one column
In [3]:
df = pd.DataFrame({
                'a':np.array(['huang','huang','huang','xiao','xiao','xiao']),
                'b':np.array(['lei','lei','lei','xiao','xiao','lei']),
                'c':np.array(['small','medium','large','small','large','medium']),
                'd':np.array([1,2,3,4,5,6])
                    })
df.groupby('a').count()
Out[3]:
 b c d
a
huang 3 3 3
xiao 3 3 3
In [38]:
df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
      'B':np.array(['one','one','two','two','three','three']),
     'C':np.array(['small','medium','large','large','small','small']),
     'D':np.array([1,2,2,3,3,5])})
print(df)

df.groupby('A').count()
 
     A      B       C  D
0  foo    one   small  1
1  foo    one  medium  2
2  foo    two   large  2
3  foo    two   large  3
4  bar  three   small  3
5  bar  three   small  5
Out[38]:
 B C D
A
bar 2 2 2
foo 4 4 4
 
df.groupby([col1,col2]) # group df by the columns col1 and col2
In [4]:
df = pd.DataFrame({
                    'a':np.array(['s','s','s','e','e','e']),
                    'b':np.array(['q','w','e','e','e','w']),
                    'c':np.array(['t','t','t','hu','hi','jk'])
                    })
print(df)
df.groupby(['a','b']).count()
 
   a  b   c
0  s  q   t
1  s  w   t
2  s  e   t
3  e  e  hu
4  e  e  hi
5  e  w  jk
Out[4]:
     c
a b
e e  2
  w  1
s e  1
  q  1
  w  1
In [39]:
df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
      'B':np.array(['one','one','two','two','three','three']),
     'C':np.array(['small','medium','large','large','small','small']),
     'D':np.array([1,2,2,3,3,5])})
print(df)
df.groupby(['B','C']).sum()
 
     A      B       C  D
0  foo    one   small  1
1  foo    one  medium  2
2  foo    two   large  2
3  foo    two   large  3
4  bar  three   small  3
5  bar  three   small  5
Out[39]:
              D
B     C
one   medium  2
      small   1
three small   8
two   large   5
 
df.groupby(col1)[col2].mean() # group df by col1 and return the mean of col2 for each group
In [10]:
df = pd.DataFrame({
        'a':np.array(['ho','ho','ho','e','e','e']),
        'b':np.array(['huang','huang','lei','lei','xiao','xiao']),
        'c':np.array([1,2,3,4,5,6])
    })
df.groupby('a')['c'].mean()
Out[10]:
a
e     5
ho    2
Name: c, dtype: int64
In [39]:
df = pd.DataFrame({'A':np.array(['foo','foo','foo','foo','bar','bar']),
      'B':np.array(['one','one','two','two','three','three']),
     'C':np.array(['small','medium','large','large','small','small']),
     'D':np.array([1,2,2,3,3,5])})
df.groupby('B')['D'].mean()
Out[39]:
B
one      1.5
three    4.0
two      2.5
Name: D, dtype: float64
 
df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean') # build a pivot table indexed by col1, aggregating the value columns col2 and col3 with the mean
In [11]:
df = pd.DataFrame({'A': np.array(['foo', 'foo', 'foo', 'foo', 'bar', 'bar']),
                   'B': np.array(['one', 'one', 'two', 'two', 'three', 'three']),
                   'C': np.array(['small', 'medium', 'large', 'large', 'small', 'small']),
                   'D': np.array([1, 2, 2, 3, 3, 5])})
print(df)
# Call the method on df without passing df again; the values default to the remaining column D
df.pivot_table(index=['A', 'B'],
               columns=['C'], aggfunc='sum')
 
     A      B       C  D
0  foo    one   small  1
1  foo    one  medium  2
2  foo    two   large  2
3  foo    two   large  3
4  bar  three   small  3
5  bar  three   small  5
Out[11]:
              D              
C         large medium small
A   B                       
bar three   NaN    NaN   8.0
foo one     NaN    2.0   1.0
    two     5.0    NaN   NaN
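
A pivot table of this kind can be read as a group-by followed by moving one key into the columns. A minimal sketch, reusing the same sample data and passing `values='D'` explicitly (an assumption made here to keep the column labels flat), that compares the two routes:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar'],
                   'B': ['one', 'one', 'two', 'two', 'three', 'three'],
                   'C': ['small', 'medium', 'large', 'large', 'small', 'small'],
                   'D': [1, 2, 2, 3, 3, 5]})

# Route 1: pivot_table with A and B as the row index and C spread over the columns
pivoted = df.pivot_table(values='D', index=['A', 'B'], columns='C', aggfunc='sum')

# Route 2: group by all three keys, sum D, then move level C into the columns
unstacked = df.groupby(['A', 'B', 'C'])['D'].sum().unstack('C')

print(pivoted)
print(unstacked)
```

Combinations that never occur in the data, such as (bar, three, large), show up as NaN in both results.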
 
df.groupby(col1).agg(np.mean) # group df by col1 and return the mean of every numeric column for each group
In [12]:
df = pd.DataFrame({'A': np.array(['foo', 'foo', 'foo', 'foo', 'bar', 'bar']),
                   'B': np.array(['one', 'one', 'two', 'two', 'three', 'three']),
                   'C': np.array(['small', 'medium', 'large', 'large', 'small', 'small']),
                   'D': np.array([1, 2, 2, 3, 3, 5])})
print(df)
# Select the numeric column D; the string columns B and C cannot be averaged
df.groupby('A')[['D']].agg(np.mean)
 
     A      B       C  D
0  foo    one   small  1
1  foo    one  medium  2
2  foo    two   large  2
3  foo    two   large  3
4  bar  three   small  3
5  bar  three   small  5
Out[12]:
     D
A     
bar  4
foo  2
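
`agg` also accepts a list of function names, which gives several statistics per group in one call. A minimal sketch on the same sample data (the choice of `mean`, `min` and `max` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar'],
                   'D': [1, 2, 2, 3, 3, 5]})

# One row per group in A, one column per requested statistic of D
summary = df.groupby('A')['D'].agg(['mean', 'min', 'max'])
print(summary)
```

Passing the names as strings rather than NumPy functions lets pandas use its own group-wise implementations.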
 
df.apply(np.mean) # compute the mean of every column of df
In [13]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('adsfg'))
df.apply(np.mean)
Out[13]:
a    0.539334
d    0.500330
s    0.508882
f    0.580603
g    0.523317
dtype: float64
In [42]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.apply(np.mean)
Out[42]:
A    0.388075
B    0.539564
C    0.607983
D    0.518634
E    0.482960
dtype: float64
 
df.apply(np.max,axis=1) # compute the maximum of every row of df
In [14]:
df = pd.DataFrame(np.random.rand(10, 6), columns=list('asdfrg'))
df.apply(np.max, axis=1)
Out[14]:
0    0.845378
1    0.998686
2    0.968602
3    0.843231
4    0.940353
5    0.908892
6    0.949700
7    0.663064
8    0.876051
9    0.975562
dtype: float64
In [43]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.apply(np.max,axis=1)
Out[43]:
0    0.904163
1    0.804519
2    0.924102
3    0.761781
4    0.952084
5    0.923679
6    0.796320
7    0.582907
8    0.761310
9    0.893564
dtype: float64
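
For common reductions such as the mean or the maximum, `apply` gives the same result as the built-in DataFrame methods. A minimal sketch, assuming a fixed random seed (not used in the original cells) so the run is repeatable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen here only for repeatability
df = pd.DataFrame(rng.random((10, 5)), columns=list('ABCDE'))

col_means_apply = df.apply(np.mean)        # column-wise mean through apply
col_means = df.mean()                      # same reduction, built in

row_max_apply = df.apply(np.max, axis=1)   # row-wise maximum through apply
row_max = df.max(axis=1)                   # same reduction, built in

print(np.allclose(col_means_apply, col_means))
print(np.allclose(row_max_apply, row_max))
```

`apply` is mainly useful when the function has no built-in counterpart; for standard statistics the direct methods are shorter.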
 
Joining (join) and combining (combine) data
 
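df1.append(df2) # append the rows of df2 to the end of df1; df1 and df2 should have the same columns (DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2]) there)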
In [44]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df1.append(df2)  # on pandas 2.0 and later, use pd.concat([df1, df2]) instead
Out[44]:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
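
Stacking rows can also be done with `pd.concat`, which works on both old and new pandas versions. A minimal sketch with two small hypothetical frames, using `ignore_index=True` to renumber the rows:

```python
import pandas as pd

# Two small frames with the same columns (shortened sample data)
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Stack df2 below df1; ignore_index=True rebuilds the index as 0..n-1
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```

Without `ignore_index=True` the original index labels are kept, so both frames here would contribute the labels 0 and 1.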
 
pd.concat([df1, df2],axis=1) # place the columns of df2 to the right of the columns of df1; rows are aligned on the index, so df1 and df2 should share the same index
In [45]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
pd.concat([df1, df2], axis=1)
Out[45]:
     A    B    C    D    A    B    C    D
0   A0   B0   C0   D0  NaN  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN  NaN
2   A2   B2   C2   D2  NaN  NaN  NaN  NaN
3   A3   B3   C3   D3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN   A4   B4   C4   D4
5  NaN  NaN  NaN  NaN   A5   B5   C5   D5
6  NaN  NaN  NaN  NaN   A6   B6   C6   D6
7  NaN  NaN  NaN  NaN   A7   B7   C7   D7
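
The NaN blocks above appear because the two frames have disjoint index labels (0 to 3 and 4 to 7), and `axis=1` aligns rows by index. A minimal sketch, with two hypothetical frames that share the same index, of the side-by-side result without gaps:

```python
import pandas as pd

# Both frames use the index labels 0 and 1, so their rows line up
left = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
right = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])

side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```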
 
df1.join(df2,on=col1,how='inner') # inner-join df1 with df2, matching column col1 of df1 against the index of df2
In [46]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'key': ['K0', 'K1', 'K0', 'K1']})

df2 = pd.DataFrame({'C': ['C0', 'C1'],
                    'D': ['D0', 'D1']},
                   index=['K0', 'K1'])

df1.join(df2, on='key')
Out[46]:
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K0  C0  D0
3  A3  B3  K1  C1  D1
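
In the cell above every key in df1 has a match in df2, so the default left join and an inner join give the same rows. A minimal sketch, with a hypothetical unmatched key `K2` added, of where the two differ and how `merge` expresses the same inner join:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'key': ['K0', 'K1', 'K2']})   # K2 has no match in df2
df2 = pd.DataFrame({'C': ['C0', 'C1']}, index=['K0', 'K1'])

left = df1.join(df2, on='key')                 # default how='left': K2 kept, C is NaN
inner = df1.join(df2, on='key', how='inner')   # K2 row dropped

# The same inner join written with merge
merged = df1.merge(df2, left_on='key', right_index=True, how='inner')

print(left)
print(inner)
```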
 

<div id='p10'>Statistics</div>

 
df.describe() # descriptive statistics for every column of df
In [4]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
df.describe()
Out[4]:
               a          b          c          d          e
count  10.000000  10.000000  10.000000  10.000000  10.000000
mean    0.401144   0.359406   0.603465   0.627617   0.408927
std     0.314415   0.276410   0.225576   0.338007   0.277260
min     0.052844   0.015361   0.255718   0.121600   0.082777
25%     0.148306   0.141934   0.498205   0.320862   0.198211
50%     0.328256   0.301379   0.575852   0.661513   0.332168
75%     0.603549   0.584706   0.665217   0.922541   0.581780
max     0.899552   0.838164   0.973688   0.986095   0.933372
In [47]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.describe()
Out[47]:
               A          B          C          D          E
count  10.000000  10.000000  10.000000  10.000000  10.000000
mean    0.398648   0.451699   0.443472   0.739478   0.412954
std     0.330605   0.221586   0.303084   0.308798   0.262148
min     0.004457   0.188689   0.079697   0.113562   0.052935
25%     0.088177   0.270355   0.205663   0.715005   0.205685
50%     0.315533   0.457229   0.332148   0.885872   0.400232
75%     0.749716   0.497208   0.737900   0.948651   0.634670
max     0.782956   0.825671   0.851065   0.962922   0.815447
 
df.mean() # mean of every column of df
In [6]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
df.mean()
Out[6]:
a    0.501247
b    0.596623
c    0.525627
d    0.503693
e    0.420740
dtype: float64
In [5]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.mean()
Out[5]:
A    0.554337
B    0.574231
C    0.438493
D    0.514337
E    0.532763
dtype: float64
 
df.corr() # pairwise correlation coefficients between the columns of df
In [7]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
df.corr()
Out[7]:
          a         b         c         d         e
a  1.000000 -0.314863  0.145670  0.569909 -0.089665
b -0.314863  1.000000  0.241693 -0.105917  0.510971
c  0.145670  0.241693  1.000000  0.073844 -0.070198
d  0.569909 -0.105917  0.073844  1.000000 -0.425560
e -0.089665  0.510971 -0.070198 -0.425560  1.000000
In [49]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.corr()
Out[49]:
          A         B         C         D         E
A  1.000000 -0.634931 -0.354824 -0.354131  0.170957
B -0.634931  1.000000  0.225222 -0.338124 -0.043300
C -0.354824  0.225222  1.000000  0.098285  0.297133
D -0.354131 -0.338124  0.098285  1.000000 -0.324209
E  0.170957 -0.043300  0.297133 -0.324209  1.000000
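
With random data the correlation matrix is hard to interpret. A minimal sketch, using a small hypothetical frame in which `y` rises with `x` and `z` falls with `x`, that makes the meaning of the coefficients easier to see:

```python
import pandas as pd

# y = 2 * x (perfect positive relation), z = 5 - x (perfect negative relation)
df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2, 4, 6, 8],
                   'z': [4, 3, 2, 1]})

corr = df.corr()  # Pearson correlation by default
print(corr)
```

The diagonal is always 1 because every column is perfectly correlated with itself, and the matrix is symmetric.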
 
df.count() # number of non-null values in every column of df
In [8]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
df.count()
Out[8]:
a    10
b    10
c    10
d    10
e    10
dtype: int64
In [50]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.count()
Out[50]:
A    10
B    10
C    10
D    10
E    10
dtype: int64
 
df.max() # maximum of every column of df
In [12]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
print(df)
print(df.max())
df.count()
 
          a         b         c         d         e
0  0.743688  0.081938  0.693243  0.647515  0.835997
1  0.162604  0.421371  0.422371  0.930136  0.732234
2  0.842065  0.139927  0.675018  0.543914  0.017094
3  0.535794  0.078217  0.964779  0.607462  0.432429
4  0.560279  0.544811  0.304371  0.797165  0.505008
5  0.695691  0.696121  0.741812  0.502741  0.484697
6  0.775342  0.410536  0.275251  0.810911  0.081818
7  0.584267  0.917728  0.379231  0.097702  0.622885
8  0.754810  0.809628  0.102337  0.283509  0.615719
9  0.003056  0.536268  0.187236  0.181844  0.255499
a    0.842065
b    0.917728
c    0.964779
d    0.930136
e    0.835997
dtype: float64
Out[12]:
a    10
b    10
c    10
d    10
e    10
dtype: int64
In [51]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.max()
Out[51]:
A    0.933848
B    0.730197
C    0.921751
D    0.715280
E    0.940010
dtype: float64
 
df.min() # minimum of every column of df
In [52]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.min()
Out[52]:
A    0.107516
B    0.001635
C    0.024502
D    0.092810
E    0.019898
dtype: float64
 
df.median() # median of every column of df
In [53]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.median()
Out[53]:
A    0.497591
B    0.359854
C    0.661607
D    0.342418
E    0.588468
dtype: float64
 
df.std() # standard deviation of every column of df
In [54]:
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.std()
Out[54]:
A    0.231075
B    0.286691
C    0.276511
D    0.304167
E    0.272570
dtype: float64
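
One detail worth knowing: pandas `std()` computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population version (`ddof=0`). A minimal sketch on a small hypothetical column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})

sample_std = df['A'].std()               # pandas default, ddof=1
population_std = df['A'].std(ddof=0)     # divide by n instead of n - 1
numpy_std = np.std(df['A'].to_numpy())   # NumPy default, ddof=0

print(sample_std, population_std, numpy_std)
```

The two conventions differ noticeably for small samples, so it helps to state which one is used when comparing results across libraries.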


Original article: http://www.cnblogs.com/heitaoq/p/7965964.html
