码迷,mamicode.com
首页 > 其他好文 > 详细

《Pandas CookBook》---- 第三章 数据分析入门

时间:2018-10-03 23:05:23      阅读:235      评论:0      收藏:0      [点我收藏+]

标签:trie   use   tip   black   dup   vertica   mock   cat   clu   

数据分析入门

简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧

import pandas as pd
import numpy as np

设定最大列数和最大行数

pd.set_option(‘max_columns‘,5 , ‘max_rows‘, 5)

基础分析

读取查看数据

college = pd.read_csv(‘data/college.csv‘)

基础信息

college.info()
<class ‘pandas.core.frame.DataFrame‘>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
INSTNM                7535 non-null object
CITY                  7535 non-null object
STABBR                7535 non-null object
HBCU                  7164 non-null float64
MENONLY               7164 non-null float64
WOMENONLY             7164 non-null float64
RELAFFIL              7535 non-null int64
SATVRMID              1185 non-null float64
SATMTMID              1196 non-null float64
DISTANCEONLY          7164 non-null float64
UGDS                  6874 non-null float64
UGDS_WHITE            6874 non-null float64
UGDS_BLACK            6874 non-null float64
UGDS_HISP             6874 non-null float64
UGDS_ASIAN            6874 non-null float64
UGDS_AIAN             6874 non-null float64
UGDS_NHPI             6874 non-null float64
UGDS_2MOR             6874 non-null float64
UGDS_NRA              6874 non-null float64
UGDS_UNKN             6874 non-null float64
PPTUG_EF              6853 non-null float64
CURROPER              7535 non-null int64
PCTPELL               6849 non-null float64
PCTFLOAN              6849 non-null float64
UG25ABV               6718 non-null float64
MD_EARN_WNE_P10       6413 non-null object
GRAD_DEBT_MDN_SUPP    7503 non-null object
dtypes: float64(20), int64(2), object(5)
memory usage: 1.6+ MB

统计数值列

college.describe(include=[np.number]).T
count mean ... 75% max
HBCU 7164.0 0.014238 ... 0.000000 1.0
MENONLY 7164.0 0.009213 ... 0.000000 1.0
... ... ... ... ... ...
PCTFLOAN 6849.0 0.522211 ... 0.745000 1.0
UG25ABV 6718.0 0.410021 ... 0.572275 1.0

22 rows × 8 columns

统计对象和类型列

college.describe(include=[np.object, pd.Categorical]).T
count unique top freq
INSTNM 7535 7535 Modern Welding School 1
CITY 7535 2514 New York 87
STABBR 7535 59 CA 773
MD_EARN_WNE_P10 6413 598 PrivacySuppressed 822
GRAD_DEBT_MDN_SUPP 7503 2038 PrivacySuppressed 1510

数据类型的转换

DataFrame.astype(dtype, copy=True, errors=‘raise‘, **kwargs)

  • dtype : data type, or dict of column name -> data type
    • Use a numpy.dtype or Python type to cast entire pandas object to the same type.
    • Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
    • 使用numpy的数据类型,或者直接用Python内置的类型。
    • 如果只传递一个dtype的话整个dataframe都会被转成该类型,可以传递一个{列名:类型}的字典,对dataframe中的列分别转换。
  • copy : bool, default True.
  • Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
  • 如果设置了copy = True 的话,不改变原有的dataframe

选取5列

different_cols = [‘RELAFFIL‘, ‘SATMTMID‘, ‘CURROPER‘, ‘INSTNM‘, ‘STABBR‘]
col2 = college.loc[:, different_cols]
col2.dtypes
RELAFFIL      int64
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

用memory_usage方法查看每列的内存消耗

original_mem = col2.memory_usage(deep=True)
original_mem
Index           80
RELAFFIL     60280
             ...  
INSTNM      660240
STABBR      444565
Length: 6, dtype: int64

检查数值列的独立个数

col2[‘RELAFFIL‘].unique()
array([0, 1])

RELAFFIL这列只包含0或1,因此没必要用64位,使用astype方法将其变为8位(1字节)整数

检查两个对象列的独立值的个数

col2.select_dtypes(include=[‘object‘]).nunique()
INSTNM    7535
STABBR      59
dtype: int64

STABBR列可以转变为“类型”(Categorical),独立值的个数小于总数的1%

类型转化

col3 = col2.astype({‘STABBR‘:‘category‘,‘RELAFFIL‘:np.int8})
col3.dtypes
RELAFFIL        int8
SATMTMID     float64
CURROPER       int64
INSTNM        object
STABBR      category
dtype: object
new_mem = col3.memory_usage(deep = True)
new_mem
Index           80
RELAFFIL      7535
             ...  
INSTNM      660699
STABBR       13576
Length: 6, dtype: int64

效果对比

new_mem / original_mem
Index       1.000000
RELAFFIL    0.125000
              ...   
INSTNM      1.000695
STABBR      0.030538
Length: 6, dtype: float64

通过和原始数据比较,RELAFFIL列变为了原来的八分之一,STABBR列只有原始大小的3%

排序与去重

排序

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind=‘quicksort‘, na_position=‘last‘)

  • by : str or list of str
    • Name or list of names to sort by.
  • ascending : bool or list of bool, default True
    • Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
  • inplace : bool, default False
    • if True, perform operation in-place
  • na_position : {‘first’, ‘last’}, default ‘last’
    • first puts NaNs at the beginning, last puts NaNs at the end

单列排序

选取出三列。按照title_year降序排列

movie = pd.read_csv(‘data/movie.csv‘)
movie2 = movie[[‘movie_title‘, ‘title_year‘, ‘imdb_score‘]]
movie2.sort_values(‘title_year‘, ascending=False).head()
movie_title title_year imdb_score
3884 The Veil 2016.0 4.7
2375 My Big Fat Greek Wedding 2 2016.0 6.1
2794 Miracles from Heaven 2016.0 6.8
92 Independence Day: Resurgence 2016.0 5.5
153 Kung Fu Panda 3 2016.0 7.2

多列排序

通过给ascending设置列表,可以同时对一列降序排列,一列升序排列

movie4 = movie[[‘movie_title‘, ‘title_year‘, ‘content_rating‘, ‘budget‘]]
movie4_sorted = movie4.sort_values([‘title_year‘, ‘content_rating‘, ‘budget‘], ascending=[False, False, True])
movie4_sorted.head()
movie_title title_year content_rating budget
4026 Compadres 2016.0 R 3000000.0
3884 The Veil 2016.0 R 4000000.0
3682 Fifty Shades of Black 2016.0 R 5000000.0
3685 The Perfect Match 2016.0 R 5000000.0
3396 The Neon Demon 2016.0 R 7000000.0

去重

DataFrame.drop_duplicates(subset=None, keep=‘first‘, inplace=False)

  • subset : column label or sequence of labels, optional
    • Only consider certain columns for identifying duplicates, by default use all of the columns
    • 默认整行的所有列数据都重复了,才会被认为重复。
    • subset 传递为列名的列表时,在判断重复时候,只要该几列值重复了,就认为该行是重复的。
  • keep : {‘first’, ‘last’, False}, default ‘first’
    • first : Drop duplicates except for the first occurrence.
    • last : Drop duplicates except for the last occurrence.
    • False : Drop all duplicates.
  • inplace : boolean, default False
    • Whether to drop duplicates in place or to return a copy
movie3 = movie2.sort_values([‘title_year‘,‘imdb_score‘], ascending=False)
movie3
movie_title title_year imdb_score
4312 Kickboxer: Vengeance 2016.0 9.1
4277 A Beginner‘s Guide to Snuff 2016.0 8.7
... ... ... ...
3246 The Bold and the Beautiful NaN 3.5
2119 The Bachelor NaN 2.9

4916 rows × 3 columns

用drop_duplicates去重,只保留每年的第一条数据

movie_top_year = movie3.drop_duplicates(subset=‘title_year‘,keep=‘first‘)
movie_top_year
movie_title title_year imdb_score
4312 Kickboxer: Vengeance 2016.0 9.1
3745 Running Forever 2015.0 8.6
... ... ... ...
4695 Intolerance: Love‘s Struggle Throughout the Ages 1916.0 8.0
2725 Towering Inferno NaN 9.5

92 rows × 3 columns

从最大中选择最小

读取movie.csv,选取‘movie_title‘, ‘imdb_score‘, ‘budget‘三列

 movie2 = movie[[‘movie_title‘, ‘imdb_score‘, ‘budget‘]]
 movie2.head()
movie_title imdb_score budget
0 Avatar 7.9 237000000.0
1 Pirates of the Caribbean: At World‘s End 7.1 300000000.0
2 Spectre 6.8 245000000.0
3 The Dark Knight Rises 8.5 250000000.0
4 Star Wars: Episode VII - The Force Awakens 7.1 NaN

nlargest与nsmallest

用nlargest方法,选出imdb_score分数最高的100个

movie2.nlargest(100, ‘imdb_score‘).head()
movie_title imdb_score budget
2725 Towering Inferno 9.5 NaN
1920 The Shawshank Redemption 9.3 25000000.0
3402 The Godfather 9.2 6000000.0
2779 Dekalog 9.1 NaN
4312 Kickboxer: Vengeance 9.1 17000000.0

用链式操作,nsmallest方法再从中挑出预算最小的五部

movie2.nlargest(100, ‘imdb_score‘).nsmallest(5, ‘budget‘)
movie_title imdb_score budget
4804 Butterfly Girl 8.7 180000.0
4801 Children of Heaven 8.5 180000.0
4706 12 Angry Men 8.9 350000.0
4550 A Separation 8.4 500000.0
4636 The Other Dream Team 8.4 500000.0

用sort_values方法

用sort_values方法,选取imdb_score最高的100个

movie2.sort_values(‘imdb_score‘, ascending=False).head(100).head()
movie_title imdb_score budget
2725 Towering Inferno 9.5 NaN
1920 The Shawshank Redemption 9.3 25000000.0
3402 The Godfather 9.2 6000000.0
2779 Dekalog 9.1 NaN
4312 Kickboxer: Vengeance 9.1 17000000.0

然后可以再.sort_values(‘budget‘).head(),选出预算最低的5个,结果如下

 movie2.nlargest(100, ‘imdb_score‘).tail()
movie_title imdb_score budget
4023 Oldboy 8.4 3000000.0
4163 To Kill a Mockingbird 8.4 2000000.0
4395 Reservoir Dogs 8.4 1200000.0
4550 A Separation 8.4 500000.0
4636 The Other Dream Team 8.4 500000.0

《Pandas CookBook》---- 第三章 数据分析入门

标签:trie   use   tip   black   dup   vertica   mock   cat   clu   

原文地址:https://www.cnblogs.com/shiyushiyu/p/9738861.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!