# 10 Minutes to pandas
An introductory pandas tutorial aimed at newcomers. For more advanced recipes, see the [pandas cookbook](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook).
By convention, pandas is usually imported as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
## Object Creation
Create a pandas Series from a Python list:
s = pd.Series([1,2,3,np.nan, 4,5])
s
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
dtype: float64
Create a DataFrame from a NumPy array, using a date range as the row index and 'A', 'B', 'C', 'D' as the column labels:
dates = pd.date_range('20160101', periods=6)
dates
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 |
Alternatively, a DataFrame can be created by passing a dict of objects:
df2 = pd.DataFrame({
    'A': pd.Timestamp('20160701'),
    'B': pd.Series(1, index=list(range(4)), dtype='float32'),
    'C': np.array([3] * 4, dtype='int32'),
    'D': pd.Categorical(['Test', 'Train', 'Test', 'Train']),
    'E': 1,
    'F': 'foo'
})
df2
| | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2016-07-01 | 1.0 | 3 | Test | 1 | foo |
| 1 | 2016-07-01 | 1.0 | 3 | Train | 1 | foo |
| 2 | 2016-07-01 | 1.0 | 3 | Test | 1 | foo |
| 3 | 2016-07-01 | 1.0 | 3 | Train | 1 | foo |
Each column of df2 has its own dtype, which can be inspected via the dtypes attribute:
df2.dtypes
A datetime64[ns]
B float32
C int32
D category
E int64
F object
dtype: object
## Viewing Data
View the top and bottom rows of the frame:
df.head(3)
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
df.tail(2)
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 |
Display the index, columns, and the underlying NumPy data:
df.index
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[-0.8083965 , -1.54897301, 1.01331067, 1.98153559],
[ 1.96654297, 0.46829396, 0.16844495, -1.47401779],
[-1.30845444, 0.62552152, -2.46554656, 1.75779664],
[-1.43058558, -0.73216048, -0.03483597, 0.21629514],
[-0.51974796, 0.3868237 , -2.77528915, -0.08889186],
[ 1.02791114, -0.31108897, 0.64672466, 0.77300274]])
Show a quick statistical summary of the data:
df.describe()
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
| mean | -0.178788 | -0.185264 | -0.574532 | 0.527620 |
| std | 1.372179 | 0.846927 | 1.629433 | 1.278357 |
| min | -1.430586 | -1.548973 | -2.775289 | -1.474018 |
| 25% | -1.183440 | -0.626893 | -1.857869 | -0.012595 |
| 50% | -0.664072 | 0.037867 | 0.066804 | 0.494649 |
| 75% | 0.640996 | 0.447926 | 0.527155 | 1.511598 |
| max | 1.966543 | 0.625522 | 1.013311 | 1.981536 |
Transpose the data:
df.T
| | 2016-01-01 00:00:00 | 2016-01-02 00:00:00 | 2016-01-03 00:00:00 | 2016-01-04 00:00:00 | 2016-01-05 00:00:00 | 2016-01-06 00:00:00 |
| --- | --- | --- | --- | --- | --- | --- |
| A | -0.808397 | 1.966543 | -1.308454 | -1.430586 | -0.519748 | 1.027911 |
| B | -1.548973 | 0.468294 | 0.625522 | -0.732160 | 0.386824 | -0.311089 |
| C | 1.013311 | 0.168445 | -2.465547 | -0.034836 | -2.775289 | 0.646725 |
| D | 1.981536 | -1.474018 | 1.757797 | 0.216295 | -0.088892 | 0.773003 |
Sort along an axis (here the columns, in descending order):
df.sort_index(axis=1, ascending=False)
| | D | C | B | A |
| --- | --- | --- | --- | --- |
| 2016-01-01 | 1.981536 | 1.013311 | -1.548973 | -0.808397 |
| 2016-01-02 | -1.474018 | 0.168445 | 0.468294 | 1.966543 |
| 2016-01-03 | 1.757797 | -2.465547 | 0.625522 | -1.308454 |
| 2016-01-04 | 0.216295 | -0.034836 | -0.732160 | -1.430586 |
| 2016-01-05 | -0.088892 | -2.775289 | 0.386824 | -0.519748 |
| 2016-01-06 | 0.773003 | 0.646725 | -0.311089 | 1.027911 |
Sort by the values in a column:
df.sort_values(by='C')
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 |
## Selection
### Getting
Selecting a single column yields a Series (equivalent to df.A):
df['A']
2016-01-01 -0.808397
2016-01-02 1.966543
2016-01-03 -1.308454
2016-01-04 -1.430586
2016-01-05 -0.519748
2016-01-06 1.027911
Freq: D, Name: A, dtype: float64
Rows can be selected by slicing. **Note: with a label-based slice, both endpoints are included**; plain integer slices follow the usual Python convention.
df[0:3]
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
df['20160102':'20160104']
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 |
Selection by label is done with the .loc and .at accessors (.iloc and .iat, covered below, are their positional counterparts; the older .ix accessor is deprecated):
df.loc['20160101']
A -0.808397
B -1.548973
C 1.013311
D 1.981536
Name: 2016-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label (all rows, columns A and B):
df.loc[:, ['A', 'B']]
| | A | B |
| --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 |
| 2016-01-02 | 1.966543 | 0.468294 |
| 2016-01-03 | -1.308454 | 0.625522 |
| 2016-01-04 | -1.430586 | -0.732160 |
| 2016-01-05 | -0.519748 | 0.386824 |
| 2016-01-06 | 1.027911 | -0.311089 |
Slicing by label (both endpoints included):
df.loc['20160102':'20160103', ['B', 'C']]
| | B | C |
| --- | --- | --- |
| 2016-01-02 | 0.468294 | 0.168445 |
| 2016-01-03 | 0.625522 | -2.465547 |
df.loc['20160103', ['A', 'B']]
A -1.308454
B 0.625522
Name: 2016-01-03 00:00:00, dtype: float64
Getting a scalar value; at is the fast label-based equivalent of loc for a single cell:
print(df.loc['20160101', 'A'])
print(df.at[dates[0], 'A'])
-0.808396502432
-0.808396502432
Selection by position uses integer coordinates to pick out slices or single values; these slices follow the Python/NumPy convention, i.e. the endpoint is excluded.
df.iloc[3]
A -1.430586
B -0.732160
C -0.034836
D 0.216295
Name: 2016-01-04 00:00:00, dtype: float64
df.iloc[3:5, 0:2]
| | A | B |
| --- | --- | --- |
| 2016-01-04 | -1.430586 | -0.732160 |
| 2016-01-05 | -0.519748 | 0.386824 |
df.iloc[[1,2,4],[0,2]]
| | A | C |
| --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.168445 |
| 2016-01-03 | -1.308454 | -2.465547 |
| 2016-01-05 | -0.519748 | -2.775289 |
df.iloc[1:3]
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 |
df.iloc[:,1:3]
| | B | C |
| --- | --- | --- |
| 2016-01-01 | -1.548973 | 1.013311 |
| 2016-01-02 | 0.468294 | 0.168445 |
| 2016-01-03 | 0.625522 | -2.465547 |
| 2016-01-04 | -0.732160 | -0.034836 |
| 2016-01-05 | 0.386824 | -2.775289 |
| 2016-01-06 | -0.311089 | 0.646725 |
df.iloc[1,1]
0.46829396335234058
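For fast access to a single scalar by position, iat is the positional counterpart of at and should return the same value as the iloc call above; a minimal equivalent:
df.iat[1, 1]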
### Boolean Indexing
df[df.A > 0]
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 |
df[df > 0]
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2016-01-01 | NaN | NaN | 1.013311 | 1.981536 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | NaN |
| 2016-01-03 | NaN | 0.625522 | NaN | 1.757797 |
| 2016-01-04 | NaN | NaN | NaN | 0.216295 |
| 2016-01-05 | NaN | 0.386824 | NaN | NaN |
| 2016-01-06 | 1.027911 | NaN | 0.646725 | 0.773003 |
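The same boolean mask can also be expressed with where, which keeps the frame's shape and fills non-matching cells with NaN; a minimal sketch equivalent to df[df > 0] above:
df.where(df > 0)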
Use the isin() method for filtering:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2
| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 | one |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | one |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 | two |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | three |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 | four |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | three |
df2[df2['E'].isin(['one', 'three'])]
| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 | one |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | one |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | three |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | three |
### Setting
Setting a new column automatically aligns the data by the index:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20160102', periods=6))
df['F'] = s1
df
| | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | -0.808397 | -1.548973 | 1.013311 | 1.981536 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 | 2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | 3.0 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 | 4.0 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | 5.0 |
Setting values by label:
df.at[dates[0], 'A'] = 0
df
| | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | -1.548973 | 1.013311 | 1.981536 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 | 2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | 3.0 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 | 4.0 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | 5.0 |
Setting values by position:
df.iat[0, 1] = 0
df
| | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 1.981536 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | -1.474018 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 1.757797 | 2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 0.216295 | 3.0 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | -0.088892 | 4.0 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 0.773003 | 5.0 |
Setting a column by assigning a NumPy array:
df.loc[:, 'D'] = np.array([5] * len(df))
df
| | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 5 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | 5 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 5 | 2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 5 | 3.0 |
| 2016-01-05 | -0.519748 | 0.386824 | -2.775289 | 5 | 4.0 |
| 2016-01-06 | 1.027911 | -0.311089 | 0.646725 | 5 | 5.0 |
A where operation with setting (negate every positive value):
df2 = df.copy()
df2[df2 > 0] = -df2
df2
| | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | -1.013311 | -5 | NaN |
| 2016-01-02 | -1.966543 | -0.468294 | -0.168445 | -5 | -1.0 |
| 2016-01-03 | -1.308454 | -0.625522 | -2.465547 | -5 | -2.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | -5 | -3.0 |
| 2016-01-05 | -0.519748 | -0.386824 | -2.775289 | -5 | -4.0 |
| 2016-01-06 | -1.027911 | -0.311089 | -0.646725 | -5 | -5.0 |
## Missing Data
pandas uses np.nan to represent missing data; by default it is excluded from computations.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
df1
| | A | B | C | D | F | E |
| --- | --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 5 | NaN | 1.0 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | 5 | 1.0 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 5 | 2.0 | NaN |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 5 | 3.0 | NaN |
Option 1: drop any rows that have missing data:
df1.dropna(how='any')
| | A | B | C | D | F | E |
| --- | --- | --- | --- | --- | --- | --- |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | 5 | 1.0 | 1.0 |
Option 2: fill missing values:
df1.fillna(value=5)
| | A | B | C | D | F | E |
| --- | --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 5 | 5.0 | 1.0 |
| 2016-01-02 | 1.966543 | 0.468294 | 0.168445 | 5 | 1.0 | 1.0 |
| 2016-01-03 | -1.308454 | 0.625522 | -2.465547 | 5 | 2.0 | 5.0 |
| 2016-01-04 | -1.430586 | -0.732160 | -0.034836 | 5 | 3.0 | 5.0 |
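Missing values can also be filled from neighbouring observations rather than a constant. A small sketch using forward fill via the method argument (newer pandas releases expose the same behaviour as df1.ffill()):
df1.fillna(method='ffill')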
Get a boolean mask of where values are missing:
df.isnull()
| | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | False | False | False | False | True |
| 2016-01-02 | False | False | False | False | False |
| 2016-01-03 | False | False | False | False | False |
| 2016-01-04 | False | False | False | False | False |
| 2016-01-05 | False | False | False | False | False |
| 2016-01-06 | False | False | False | False | False |
## Operations
Operations in general exclude missing data.
### Stats
df.mean()
A -0.044056
B 0.072898
C -0.574532
D 5.000000
F 3.000000
dtype: float64
df.mean(1)
2016-01-01 1.503328
2016-01-02 1.720656
2016-01-03 0.770304
2016-01-04 1.160484
2016-01-05 1.218357
2016-01-06 2.272709
Freq: D, dtype: float64
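These statistics skip missing values by default; passing skipna=False makes NaN propagate instead, so column F (which has a missing first value) would come out as NaN. A minimal sketch:
df.mean(skipna=False)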
When operating between objects with different dimensionality or different indexes, pandas aligns the data on the labels and automatically broadcasts along the other dimension:
s = pd.Series([1,3,5,np.nan, 6, 8], index=dates).shift(2)
s
2016-01-01 NaN
2016-01-02 NaN
2016-01-03 1.0
2016-01-04 3.0
2016-01-05 5.0
2016-01-06 NaN
Freq: D, dtype: float64
df.sub(s, axis='index')
| | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | NaN | NaN | NaN | NaN | NaN |
| 2016-01-02 | NaN | NaN | NaN | NaN | NaN |
| 2016-01-03 | -2.308454 | -0.374478 | -3.465547 | 4.0 | 1.0 |
| 2016-01-04 | -4.430586 | -3.732160 | -3.034836 | 2.0 | 0.0 |
| 2016-01-05 | -5.519748 | -4.613176 | -7.775289 | 0.0 | -1.0 |
| 2016-01-06 | NaN | NaN | NaN | NaN | NaN |
### Apply
Apply functions to the data:
df.apply(np.cumsum, axis=0)
| | A | B | C | D | F |
| --- | --- | --- | --- | --- | --- |
| 2016-01-01 | 0.000000 | 0.000000 | 1.013311 | 5 | NaN |
| 2016-01-02 | 1.966543 | 0.468294 | 1.181756 | 10 | 1.0 |
| 2016-01-03 | 0.658089 | 1.093815 | -1.283791 | 15 | 3.0 |
| 2016-01-04 | -0.772497 | 0.361655 | -1.318627 | 20 | 6.0 |
| 2016-01-05 | -1.292245 | 0.748479 | -4.093916 | 25 | 10.0 |
| 2016-01-06 | -0.264334 | 0.437390 | -3.447191 | 30 | 15.0 |
df.apply(lambda x: x.max() - x.min())
A 3.397129
B 1.357682
C 3.788600
D 0.000000
F 4.000000
dtype: float64
### Histogramming
s = pd.Series(np.random.randint(0,7,size=10))
s
0 1
1 5
2 6
3 5
4 6
5 4
6 0
7 3
8 6
9 5
dtype: int64
s.value_counts()
6 3
5 3
4 1
3 1
1 1
0 1
dtype: int64
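If relative frequencies are more useful than raw counts, value_counts also accepts a normalize flag; a small sketch:
s.value_counts(normalize=True)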
### String Methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'BAcd', np.nan, 'CBA', 'dog', 'CAT'])
s.str.lower()
0 a
1 b
2 c
3 aaba
4 bacd
5 NaN
6 cba
7 dog
8 cat
dtype: object
## Merge
### Concat
Concatenate pandas objects together along an axis with concat():
df = pd.DataFrame(np.random.randn(10,4))
df
| | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | -0.859307 | -0.723708 | -1.121663 | 1.438285 |
| 1 | -0.168126 | -0.343567 | 0.678940 | 0.394126 |
| 2 | -0.541090 | 1.908998 | -0.543378 | -0.109371 |
| 3 | -1.108110 | 0.332687 | -1.320752 | 1.022476 |
| 4 | 0.591171 | -1.259859 | 0.930266 | 0.688108 |
| 5 | -0.065470 | -0.957394 | 1.423691 | -0.295647 |
| 6 | 1.728151 | 0.162709 | 0.836916 | -0.573260 |
| 7 | -0.025487 | 0.307945 | -0.414787 | -0.045495 |
| 8 | -0.601439 | -0.167967 | -1.198304 | 0.242739 |
| 9 | 0.495473 | -0.348495 | 1.599757 | 0.184015 |
pieces = [df[:3], df[5:]]
pd.concat(pieces)
| | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | -0.859307 | -0.723708 | -1.121663 | 1.438285 |
| 1 | -0.168126 | -0.343567 | 0.678940 | 0.394126 |
| 2 | -0.541090 | 1.908998 | -0.543378 | -0.109371 |
| 5 | -0.065470 | -0.957394 | 1.423691 | -0.295647 |
| 6 | 1.728151 | 0.162709 | 0.836916 | -0.573260 |
| 7 | -0.025487 | 0.307945 | -0.414787 | -0.045495 |
| 8 | -0.601439 | -0.167967 | -1.198304 | 0.242739 |
| 9 | 0.495473 | -0.348495 | 1.599757 | 0.184015 |
### SQL-style Joins (merge)
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left
right
pd.merge(left, right, on='key')
| | key | lval | rval |
| --- | --- | --- | --- |
| 0 | foo | 1 | 4 |
| 1 | foo | 1 | 5 |
| 2 | foo | 2 | 4 |
| 3 | foo | 2 | 5 |
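Because both frames repeat the key 'foo', the merge above yields the full cross product of matching rows. With unique keys on each side it behaves like a plain one-to-one join; a minimal sketch using the same API:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
pd.merge(left, right, on='key')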
### Append
Append rows to a DataFrame:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | -0.535803 | -0.319896 | -0.313776 | -0.401106 |
| 1 | -0.231405 | 2.058233 | 0.771222 | 0.170204 |
| 2 | -1.699222 | -0.098205 | 0.465100 | 0.295165 |
| 3 | -0.273538 | -0.902247 | -0.328348 | 0.771312 |
| 4 | 0.080118 | 0.796800 | 0.564468 | 0.526290 |
| 5 | 0.485221 | 0.478245 | -0.943854 | -0.097568 |
| 6 | -0.440915 | 0.134749 | -0.840602 | -0.836712 |
| 7 | -0.283432 | -0.029233 | 1.725972 | -0.878117 |
s = df.iloc[3]
df.append(s, ignore_index=True)
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | -0.535803 | -0.319896 | -0.313776 | -0.401106 |
| 1 | -0.231405 | 2.058233 | 0.771222 | 0.170204 |
| 2 | -1.699222 | -0.098205 | 0.465100 | 0.295165 |
| 3 | -0.273538 | -0.902247 | -0.328348 | 0.771312 |
| 4 | 0.080118 | 0.796800 | 0.564468 | 0.526290 |
| 5 | 0.485221 | 0.478245 | -0.943854 | -0.097568 |
| 6 | -0.440915 | 0.134749 | -0.840602 | -0.836712 |
| 7 | -0.283432 | -0.029233 | 1.725972 | -0.878117 |
| 8 | -0.273538 | -0.902247 | -0.328348 | 0.771312 |
### Groupby
A group-by operation generally involves the following steps:
- splitting the data into groups based on some criteria
- applying a function to each group independently
- combining the results into a data structure

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
})
df
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | foo | one | 0.996471 | 0.659993 |
| 1 | bar | one | 0.990690 | -1.102114 |
| 2 | foo | two | -0.138965 | 0.236194 |
| 3 | bar | three | 0.033469 | 0.253152 |
| 4 | foo | two | -0.574320 | 0.081216 |
| 5 | bar | two | 1.992456 | 0.939238 |
| 6 | foo | one | -0.514013 | -1.610422 |
| 7 | foo | three | -0.640462 | -1.606399 |
Group by column A and apply sum to the resulting groups:
df.groupby('A').sum()
| A | C | D |
| --- | --- | --- |
| bar | 3.016615 | 0.090276 |
| foo | -0.871289 | -2.239418 |
Grouping by multiple columns forms a hierarchical index, to which we again apply sum:
df.groupby(['A', 'B']).sum()
| A | B | C | D |
| --- | --- | --- | --- |
| bar | one | 0.990690 | -1.102114 |
| | three | 0.033469 | 0.253152 |
| | two | 1.992456 | 0.939238 |
| foo | one | 0.482458 | -0.950429 |
| | three | -0.640462 | -1.606399 |
| | two | -0.713285 | 0.317410 |
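groupby is not limited to sum; with agg each column can be given its own reduction. A minimal sketch (the choice of aggregations here is just illustrative):
df.groupby('A').agg({'C': 'mean', 'D': 'sum'})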
## Reshaping
### Stack
tuples = list(zip(*[
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']
]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
| first | second | A | B |
| --- | --- | --- | --- |
| bar | one | -0.084595 | 1.495368 |
| | two | -0.801703 | -0.663997 |
| baz | one | -0.108681 | -0.986022 |
| | two | -0.524829 | 0.983664 |
stacked = df2.stack()
stacked
first second
bar one A -0.084595
B 1.495368
two A -0.801703
B -0.663997
baz one A -0.108681
B -0.986022
two A -0.524829
B 0.983664
dtype: float64
stacked.unstack()
| first | second | A | B |
| --- | --- | --- | --- |
| bar | one | -0.084595 | 1.495368 |
| | two | -0.801703 | -0.663997 |
| baz | one | -0.108681 | -0.986022 |
| | two | -0.524829 | 0.983664 |
stacked.unstack(1)
| first | | one | two |
| --- | --- | --- | --- |
| bar | A | -0.084595 | -0.801703 |
| | B | 1.495368 | -0.663997 |
| baz | A | -0.108681 | -0.524829 |
| | B | -0.986022 | 0.983664 |
stacked.unstack(0)
| second | | bar | baz |
| --- | --- | --- | --- |
| one | A | -0.084595 | -0.108681 |
| | B | 1.495368 | -0.986022 |
| two | A | -0.801703 | -0.524829 |
| | B | -0.663997 | 0.983664 |
### Pivot Tables
df = pd.DataFrame({
    'A': ['one', 'two', 'three', 'four'] * 3,
    'B': ['A', 'B', 'C'] * 4,
    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
    'D': np.random.randn(12),
    'E': np.random.randn(12)
})
df
| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| 0 | one | A | foo | 0.319799 | -1.264188 |
| 1 | two | B | foo | 0.929552 | -0.092799 |
| 2 | three | C | foo | -2.510099 | 0.979121 |
| 3 | four | A | bar | 1.727211 | 0.083378 |
| 4 | one | B | bar | 0.636672 | -0.167700 |
| 5 | two | C | bar | 0.337749 | 0.782511 |
| 6 | three | A | foo | 0.429180 | -2.415025 |
| 7 | four | B | foo | 0.334974 | -1.997174 |
| 8 | one | C | foo | 0.248257 | -1.003121 |
| 9 | two | A | bar | 0.465319 | 1.133168 |
| 10 | three | B | bar | 0.111670 | -0.730784 |
| 11 | four | C | bar | -1.903981 | -0.089501 |
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
| A | B | bar | foo |
| --- | --- | --- | --- |
| four | A | 1.727211 | NaN |
| | B | NaN | 0.334974 |
| | C | -1.903981 | NaN |
| one | A | NaN | 0.319799 |
| | B | 0.636672 | NaN |
| | C | NaN | 0.248257 |
| three | A | NaN | 0.429180 |
| | B | 0.111670 | NaN |
| | C | NaN | -2.510099 |
| two | A | 0.465319 | NaN |
| | B | NaN | 0.929552 |
| | C | 0.337749 | NaN |
## Time Series
pandas has simple, efficient functionality for frequency conversion and resampling of time-series data (here, turning 1-second samples into 10-second bins):
rng = pd.date_range('1/1/2016', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('10S').sum()
2016-01-01 00:00:00 2910
2016-01-01 00:00:10 2506
2016-01-01 00:00:20 2812
2016-01-01 00:00:30 2923
2016-01-01 00:00:40 2510
2016-01-01 00:00:50 2817
2016-01-01 00:01:00 2672
2016-01-01 00:01:10 2486
2016-01-01 00:01:20 3243
2016-01-01 00:01:30 2865
Freq: 10S, dtype: int64
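Any other reduction can be applied to the resampled bins in the same way, e.g. a per-minute average; a small sketch:
ts.resample('1Min').mean()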
Time zone representation; localize a naive series to UTC:
rng = pd.date_range('2/2/2016 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2016-02-02 -0.662500
2016-02-03 -0.762211
2016-02-04 0.954675
2016-02-05 -0.411404
2016-02-06 0.237898
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2016-02-02 00:00:00+00:00 -0.662500
2016-02-03 00:00:00+00:00 -0.762211
2016-02-04 00:00:00+00:00 0.954675
2016-02-05 00:00:00+00:00 -0.411404
2016-02-06 00:00:00+00:00 0.237898
Freq: D, dtype: float64
Convert to another time zone:
ts_utc.tz_convert('Asia/Shanghai')
2016-02-02 08:00:00+08:00 -0.662500
2016-02-03 08:00:00+08:00 -0.762211
2016-02-04 08:00:00+08:00 0.954675
2016-02-05 08:00:00+08:00 -0.411404
2016-02-06 08:00:00+08:00 0.237898
Freq: D, dtype: float64
Converting between timestamp and period representations:
rng = pd.date_range('1/1/2016', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2016-01-31   -2.143138
2016-02-29    1.683414
2016-03-31   -0.427250
2016-04-30   -0.900378
2016-05-31   -1.039857
Freq: M, dtype: float64
ps = ts.to_period()
ps
2016-01   -2.143138
2016-02    1.683414
2016-03   -0.427250
2016-04   -0.900378
2016-05   -1.039857
Freq: M, dtype: float64
ps.to_timestamp()
2016-01-01   -2.143138
2016-02-01    1.683414
2016-03-01   -0.427250
2016-04-01   -0.900378
2016-05-01   -1.039857
Freq: MS, dtype: float64
## Categoricals
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']
})
df['grade'] = df['raw_grade'].astype('category')
df['grade']
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
Rename the categories to more meaningful names:
df.grade.cat.categories = ['very good', 'good', 'bad']
df.grade
0 very good
1 good
2 good
3 very good
4 very good
5 bad
Name: grade, dtype: category
Categories (3, object): [very good, good, bad]
Reorder the categories and add the missing ones (set_categories returns a new Series):
df.grade = df.grade.cat.set_categories(['very bad', 'bad', 'medium', 'good', 'very good'])
df.grade
0 very good
1 good
2 good
3 very good
4 very good
5 bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
Sorting is per the category order, not lexical order:
df.sort_values(by='grade')
| | id | raw_grade | grade |
| --- | --- | --- | --- |
| 5 | 6 | e | bad |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 0 | 1 | a | very good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
Grouping by a categorical column also shows empty categories:
df.groupby('grade').size()
grade
very bad 0
bad 1
medium 0
good 2
very good 3
dtype: int64
## Plotting
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot(grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7ffa41fa9908>
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()
plt.figure()
df.plot(grid=True)
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x7ffa41de6e48>
<matplotlib.figure.Figure at 0x7ffa41f1e198>
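The matplotlib object reprs above are simply echoed by the notebook; with %matplotlib inline the figures render automatically. When running the same code as a plain script, show or save the figure explicitly; a minimal sketch (the file name is just an example):
plt.show()
plt.savefig('cumsum.png')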