Learning pandas in Ten Minutes (Hours)

Date: 2018-01-25 00:34:40

I. Introduction

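This article is translated from the official pandas website (link). Many people have already translated it, so why did I struggle through the official documentation myself instead of reading their versions?
After all, this is a learning process, and what you write yourself sticks in memory better than what others have written. So let's begin.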

1. What is pandas?

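pandas is a data analysis library built on NumPy (if you are not familiar with NumPy, see the NumPy articles on my blog). It provides fast, flexible, and expressive data structures.
Its two primary data structures are Series (one-dimensional) and DataFrame (two-dimensional), which are widely used in finance, statistics, social science, and many areas of engineering.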

2. What can pandas do?

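Easy handling of missing data (represented as NaN)
Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects
Automatic and explicit data alignment
Flexible group-by functionality for splitting and combining data sets
Easy conversion of differently indexed data in Python and NumPy data structures into DataFrame objects
Intelligent label-based slicing of large data sets
Intuitive merging and joining of data sets
Flexible reshaping and pivoting of data sets
Hierarchical labeling of axes
Robust I/O tools: loading data from CSV, Excel files, and databases, and saving and loading data in the ultrafast HDF5 format
Time-series-specific functionality: date range generation and frequency conversion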
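
Two of the features above, automatic alignment and missing-data handling, can be seen together in a few lines. This is only a small sketch with made-up data; the names left, right and total are illustrative and do not come from the original article.

```python
import numpy as np
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative data).
left = pd.Series([1.0, 2.0, 3.0], index=list("abc"))
right = pd.Series([10.0, 20.0, 30.0], index=list("bcd"))

# Addition aligns on labels, not on position. Labels present on only
# one side ('a' and 'd') produce NaN in the result.
total = left + right
print(total)

# The missing values can then be dropped or filled.
dropped = total.dropna()
filled = total.fillna(0.0)
print(dropped)
print(filled)
```

Only the labels b and c exist on both sides, so only those two positions carry a number after the addition.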

3. Importing the numpy and pandas libraries

    import pandas as pd
    import numpy as np

II. Object creation

1. Creating a Series with an index

    s = pd.Series([1,2,3,4],index=list('abcd'))
    out:
    a    1
    b    2
    c    3
    d    4
    dtype: int64
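
For comparison, here is a sketch of what happens when no index is passed. It assumes nothing beyond pandas and NumPy themselves; the variable name s_default is made up for this example.

```python
import numpy as np
import pandas as pd

# Without an explicit index, pandas assigns integer labels 0..n-1.
s_default = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s_default)

# A NaN among integers forces the whole Series to float64.
print(s_default.dtype)
print(list(s_default.index))
```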

2. Creating a DataFrame

Create one from a NumPy array, specifying a datetime index and labeled columns:

    dates = pd.date_range('20170123',periods=6)
    print(dates)
    df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('abcd'))
    print(df)

    out:

    DatetimeIndex(['2017-01-23', '2017-01-24', '2017-01-25', '2017-01-26',
                   '2017-01-27', '2017-01-28'],
                  dtype='datetime64[ns]', freq='D')

                   a         b         c         d
    2017-01-23 -1.081953  2.547690  0.428435 -2.513003
    2017-01-24 -1.123833 -2.080332  0.540281  1.100093
    2017-01-25  0.048541 -0.295839 -0.236631  0.107606
    2017-01-26 -0.890604  0.408112  0.765936 -0.829474
    2017-01-27 -0.845467  2.140932  0.046358 -0.557103
    2017-01-28  0.448769  0.584306 -1.892730 -2.223615

Create one by passing a dict of objects that can each be converted to something Series-like:

    df2 = pd.DataFrame({
        'A':1,
        'B':pd.Timestamp('20100123'),
        'C':pd.Series(1,index=list(range(4)),dtype='float32'),
        'D':np.array([3] * 4,dtype='int32'),
        'E':pd.Categorical(['test','train','test','train']),
        'F':'foobar'
    })
    print(df2)
    print('df2 dtypes:')
    print(df2.dtypes)

    out:
       A          B    C  D      E       F
    0  1 2010-01-23  1.0  3   test  foobar
    1  1 2010-01-23  1.0  3  train  foobar
    2  1 2010-01-23  1.0  3   test  foobar
    3  1 2010-01-23  1.0  3  train  foobar

    df2 dtypes:
    A             int64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object

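III. Viewing data

1. Viewing the top and bottom rows

The output below comes from a different DataFrame (with year, month, day, hour and season columns), not from the df created above.

    df.head(2)  # the default is 5 rows
         year  month   day  hour  season
    0  2010.0    5.0  29.0  17.0     1.0
    1  2014.0    2.0  15.0  15.0     4.0

    df.tail()  # last 5 rows by default; only the final two are shown here
             year  month  day  hour  season
    37788  2014.0    1.0  4.0   0.0     4.0
    37789  2014.0    4.0  3.0   8.0     1.0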
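
To see head and tail on the six-row frame from section II, here is a reproducible sketch. It assumes the random values are replaced by np.arange so that the result is the same on every run; top and bottom are illustrative names.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170123", periods=6)
# Deterministic values instead of np.random.randn, so the rows are predictable.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("abcd"))

top = df.head(2)    # first two rows
bottom = df.tail()  # last five rows, the default
print(top)
print(bottom)
```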

2. Displaying the index, columns, and underlying NumPy data

df.index shows the index
df.columns shows the column names
df.values returns the data as a numpy.ndarray
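
The three attributes can be inspected as in the sketch below, which again assumes a deterministic six-row frame. DataFrame.to_numpy() is included as an alternative to .values that newer pandas versions provide; it is not mentioned in the original article.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170123", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("abcd"))

print(type(df.index).__name__)  # the row labels, here a DatetimeIndex
print(list(df.columns))         # the column labels
print(type(df.values), df.values.shape)

# to_numpy() returns the same underlying data as .values.
arr = df.to_numpy()
```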

3. Displaying a quick statistical summary

    df.describe()

            a           b           c           d
    count   6.000000    6.000000    6.000000    6.000000
    mean    -0.574091   0.550811    -0.058059   -0.819249
    std     0.658465    1.683878    0.967726    1.374977
    min     -1.123833   -2.080332   -1.892730   -2.513003
    25%     -1.034116   -0.119852   -0.165884   -1.875080
    50%     -0.868035   0.496209    0.237396    -0.693288
    75%     -0.174961   1.751775    0.512319    -0.058571
    max     0.448769    2.547690    0.765936    1.100093
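
The result of describe() is itself a DataFrame, so single statistics can be read out by label. A minimal sketch with made-up numbers (the frame and the name summary are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
summary = df.describe()
print(summary)

# Rows are labeled by statistic, columns by the original column name.
print(summary.loc["mean", "a"])
print(summary.loc["50%", "b"])
```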

4. Transposing the data

    df.T
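
Transposing swaps the row and column labels. The sketch below assumes the same deterministic six-row frame as in the earlier examples; transposed is an illustrative name.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170123", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("abcd"))

transposed = df.T
print(df.shape, transposed.shape)  # rows and columns trade places
print(list(transposed.index))      # the former column names
```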

5. Sorting by axis

    df2.sort_index(axis=0,ascending=False)

       A          B    C  D      E       F
    3  1 2010-01-23  1.0  3  train  foobar
    2  1 2010-01-23  1.0  3   test  foobar
    1  1 2010-01-23  1.0  3  train  foobar
    0  1 2010-01-23  1.0  3   test  foobar

6. Sorting by values

    df2.sort_values(by='E')

       A          B    C  D      E       F
    0  1 2010-01-23  1.0  3   test  foobar
    2  1 2010-01-23  1.0  3   test  foobar
    1  1 2010-01-23  1.0  3  train  foobar
    3  1 2010-01-23  1.0  3  train  foobar
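
The difference between sorting by labels and sorting by values is easier to see on a frame whose labels are deliberately out of order. The data and variable names in this sketch are made up for illustration.

```python
import pandas as pd

# Row labels and column labels are both deliberately unsorted.
df = pd.DataFrame({"b": [1, 3, 2], "a": [9, 8, 7]}, index=[2, 0, 1])

by_row_label = df.sort_index(axis=0)  # order rows by their index labels
by_col_label = df.sort_index(axis=1)  # order columns by their names
by_value = df.sort_values(by="b")     # order rows by the values in column b

print(by_row_label)
print(by_col_label)
print(by_value)
```

All three calls return a new object, so df itself keeps its original order.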

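IV. Selecting data

1. Selecting a single column with ['column_name'], which yields a Series

    df['a']  # equivalent to df.a

2. Selecting rows by slicing with []

Here the day column is selected first and its first six rows are then sliced (again using the DataFrame with year, month, day, hour and season columns):

    df['day'][:6]

    0    29.0
    1    15.0
    2     6.0
    3     5.0
    4    25.0
    5    26.0
    Name: day, dtype: float64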
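
Slicing with [] can also be applied to the DataFrame itself to pick rows. The sketch below assumes the date-indexed frame from section II with deterministic values; first_three and by_label are illustrative names.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20170123", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("abcd"))

# An integer slice selects rows by position; the end is excluded.
first_three = df[0:3]

# A slice of date strings selects rows by label; both ends are included.
by_label = df["20170124":"20170126"]

print(first_three)
print(by_label)
```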

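3. Selection by label

The .loc attribute is the primary access method. The following are valid inputs:

A single label, e.g. 5 or 'a' (here 5 is interpreted as a label of the index, not as an integer position)
A list or array of labels, e.g. ['a','b','c']
A slice object with labels, e.g. 'b':'e' (note that, contrary to usual Python slices, both the start and the stop are included)
A boolean array
A callable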

    s1 = pd.Series(np.random.randn(6),index=list('abcdef'))

    out:
    a    1.715955
    b    0.307930
    c   -0.971638
    d   -0.594908
    e   -3.134987
    f    0.396613
    dtype: float64

    s1.loc['b':'e']

    out:
    b    0.307930
    c   -0.971638
    d   -0.594908
    e   -3.134987
    dtype: float64

    s1.loc['b']

    out:
    0.30792993178289157

.loc can also be used to set values:

    s1.loc['b'] = 0

    out:
    a    1.715955
    b    0.000000
    c   -0.971638
    d   -0.594908
    e   -3.134987
    f    0.396613
    dtype: float64
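
The remaining kinds of .loc input from the list above (an integer treated as a label, a list of labels, a boolean array, and a callable) can be sketched as follows. The Series and its integer labels are made up for illustration.

```python
import pandas as pd

# Integer labels that do not match the positions 0..3.
s = pd.Series([10, 20, 30, 40], index=[5, 6, 7, 8])

by_label = s.loc[5]                         # 5 is a label here, not a position
by_list = s.loc[[8, 5]]                     # list of labels, in the requested order
by_slice = s.loc[6:7]                       # label slice, both ends included
by_mask = s.loc[s > 15]                     # boolean array
by_callable = s.loc[lambda x: x % 20 == 0]  # callable that returns a boolean mask

print(by_label)
print(by_list)
print(by_slice)
print(by_mask)
print(by_callable)
```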

Using .loc on a DataFrame

    df1 = pd.DataFrame(np.random.randn(6,4),
                      index = list('abcdef'),
                      columns=list('ABCD'))
    out:
              A         B         C         D
    a  1.235823 -0.767938 -0.750474  0.342353
    b  0.506219  0.388180  0.400716  0.207014
    c -0.813548  0.509618  0.311099 -0.645569
    d -0.510755 -0.195760  1.162505 -2.125746
    e -0.559745 -0.937668  0.363403  0.554602
    f -1.512407  0.865061 -0.602054  0.207695

    df1.loc[['a','b','e'],:]
    out:
              A         B         C         D
    a  1.235823 -0.767938 -0.750474  0.342353
    b  0.506219  0.388180  0.400716  0.207014
    e -0.559745 -0.937668  0.363403  0.554602

Getting a row by its label (equivalent to df.xs('a'))

    df1.loc['a']

    out:
    A    1.235823
    B   -0.767938
    C   -0.750474
    D    0.342353
    Name: a, dtype: float64

Getting values with a boolean array

    df1.loc['a'] > 0

    out:
    A     True
    B    False
    C    False
    D     True
    Name: a, dtype: bool
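
A boolean Series like the one above is typically passed back into .loc as a selector. The following sketch uses a small made-up frame; mask and positive_cols are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1.0, -2.0, -3.0, 4.0], [5.0, 6.0, -7.0, -8.0]],
    index=list("ab"),
    columns=list("ABCD"),
)

# True for every column whose value in row 'a' is positive.
mask = df1.loc["a"] > 0

# Keep all rows, but only the columns where the mask is True.
positive_cols = df1.loc[:, mask]
print(positive_cols)
```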

Getting a single value explicitly with .loc['row label','column label']

    df1.loc['a','A']

    out:
    1.2358232787452161

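4. Selection by position

The .iloc attribute provides purely integer-based (positional) indexing. The semantics follow Python and NumPy slicing: the start bound is included and the end bound is excluded.
Using a non-integer indexer, even one that is a valid label, raises an error (an IndexError according to the official documentation).

The following are valid inputs for .iloc:

An integer, e.g. 7
A list or array of integers, e.g. [4,2,0]
A slice object with integers, e.g. 1:7
A boolean array
A callable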

    s2 = pd.Series(np.random.randn(5),index=list(range(0,10,2)))
    out:
    0   -1.051477
    2   -0.495461
    4    2.417686
    6    0.329432
    8    1.479104
    dtype: float64

    s2.iloc[:3]
    out:
    0   -1.051477
    2   -0.495461
    4    2.417686
    dtype: float64

    s2.iloc[3] = 0  # iloc can also be used to modify a single value
    out:
    0   -1.051477
    2   -0.495461
    4    2.417686
    6    0.000000
    8    1.479104
    dtype: float64

    s2.iloc[:3] = 0  # iloc can also assign to a whole slice
    out:
    0    0.000000
    2    0.000000
    4    0.000000
    6    0.000000
    8    1.479104
    dtype: float64
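
Because the labels of s2 are themselves integers, it is easy to confuse .iloc with .loc. The sketch below uses fixed values instead of random ones to contrast the two and to show the other .iloc input kinds listed above; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Labels 0, 2, 4, 6, 8 sit at positions 0, 1, 2, 3, 4.
s2 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=list(range(0, 10, 2)))

by_pos = s2.iloc[2]    # position 2, which carries label 4
by_label = s2.loc[2]   # label 2, which sits at position 1

picked = s2.iloc[[4, 2, 0]]                                       # list of positions
mask_pick = s2.iloc[np.array([True, False, True, False, False])]  # boolean array
call_pick = s2.iloc[lambda x: [0, 1]]                             # callable returning positions
sliced = s2.iloc[1:7]                                             # end is clipped to the length

print(by_pos, by_label)
print(picked)
print(mask_pick)
print(call_pick)
print(sliced)
```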

Using .iloc on a DataFrame

    df2 = pd.DataFrame(np.random.randn(6,4),
                      index=list(range(0,12,2)),
                      columns=list(range(0,8,2)))
    out:
               0         2         4         6
    0  -0.708809 -0.417166 -1.296387  0.620899
    2  -1.514339  1.145004  0.877585 -1.695285
    4   1.365427 -0.721800 -0.719877 -0.418820
    6   0.980937  0.230571 -0.783681 -0.985872
    8   1.031649 -1.232232  0.795309  1.294055
    10  0.618609 -1.370898  0.229622  0.817530

Selecting by integer slice

    df2.iloc[:3]
    out:
              0         2         4         6
    0 -0.708809 -0.417166 -1.296387  0.620899
    2 -1.514339  1.145004  0.877585 -1.695285
    4  1.365427 -0.721800 -0.719877 -0.418820

Selecting by lists of integers

    df2.iloc[[1,3,5],[1,3]]
    out:
               2         6
    2   1.145004 -1.695285
    6   0.230571 -0.985872
    10 -1.370898  0.817530

    df2.iloc[1:3,:]  # to slice columns instead: df2.iloc[:,1:3]
    out:
              0         2         4         6
    2 -1.514339  1.145004  0.877585 -1.695285
    4  1.365427 -0.721800 -0.719877 -0.418820

A single value can also be retrieved with .iloc[row position, column position]

    df2.iloc[0,1]
    out:
    -0.41716586227691288

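Getting a row by integer position (for this df2, equivalent to df2.xs(2), because the row at position 1 has the label 2)

    df2.iloc[1]
    out:
    0   -1.514339
    2    1.145004
    4    0.877585
    6   -1.695285
    Name: 2, dtype: float64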

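Out-of-range slice indexes are handled gracefully, just as in Python and NumPy (this was not the case before pandas v0.14.0). A slice that goes out of bounds can also result in an empty DataFrame being returned.

    df2.iloc[:3,:1000]
    out:
              0         2         4         6
    0 -0.708809 -0.417166 -1.296387  0.620899
    2 -1.514339  1.145004  0.877585 -1.695285
    4  1.365427 -0.721800 -0.719877 -0.418820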

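A single indexer that is out of range raises an IndexError (it is not handled gracefully the way a slice is). A list of indexers in which any element is out of range also raises an IndexError.

    df2.iloc[[1,2,8]]
    IndexError: positional indexers are out-of-bounds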
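
The three out-of-range cases described above can be compared side by side. This sketch assumes a frame of the same shape as df2 but with deterministic values; the flag names single_raised and list_raised are made up for the example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    np.arange(24).reshape(6, 4),
    index=list(range(0, 12, 2)),
    columns=list(range(0, 8, 2)),
)

# Slices are clipped to the available range instead of failing.
clipped = df2.iloc[:3, :1000]
# A slice entirely beyond the last row yields an empty DataFrame.
empty = df2.iloc[10:20]
print(clipped.shape, empty.shape)

# A single out-of-range position raises IndexError.
single_raised = False
try:
    df2.iloc[8]
except IndexError:
    single_raised = True

# So does a list in which any one position is out of range.
list_raised = False
try:
    df2.iloc[[1, 2, 8]]
except IndexError:
    list_raised = True

print(single_raised, list_raised)
```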

To be continued...

Original article: https://www.cnblogs.com/luhuan/p/8343654.html
