Tags: trie use tip black dup vertica mock cat clu
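This is a translation by SeanCheney of Jianshu. I adjusted some of the formatting and reorganized the table of contents to suit my own reading and to make things easier to look up later.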
import pandas as pd
import numpy as np
Set the maximum number of columns and rows to display
pd.set_option('display.max_columns', 5, 'display.max_rows', 5)
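If the display limits should apply only to part of a session, `pd.option_context` can set them temporarily. The sketch below is a minimal illustration on a made-up frame (`demo` and its column names are hypothetical, not part of the college data):

```python
import pandas as pd

# Hypothetical demo frame: 20 rows x 10 columns of integers
demo = pd.DataFrame({f"col{i}": range(20) for i in range(10)})

default_rows = pd.get_option("display.max_rows")

# Inside the with-block at most 5 rows and 5 columns are displayed
with pd.option_context("display.max_columns", 5, "display.max_rows", 5):
    inside_rows = pd.get_option("display.max_rows")
    truncated = repr(demo)  # the text form now contains "..." placeholders

# After the block the previous setting is restored automatically
restored_rows = pd.get_option("display.max_rows")
print(inside_rows, restored_rows == default_rows)
```

Unlike `pd.set_option`, this leaves the global setting untouched once the block exits.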
Read the data and inspect it
college = pd.read_csv('data/college.csv')
college.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
INSTNM 7535 non-null object
CITY 7535 non-null object
STABBR 7535 non-null object
HBCU 7164 non-null float64
MENONLY 7164 non-null float64
WOMENONLY 7164 non-null float64
RELAFFIL 7535 non-null int64
SATVRMID 1185 non-null float64
SATMTMID 1196 non-null float64
DISTANCEONLY 7164 non-null float64
UGDS 6874 non-null float64
UGDS_WHITE 6874 non-null float64
UGDS_BLACK 6874 non-null float64
UGDS_HISP 6874 non-null float64
UGDS_ASIAN 6874 non-null float64
UGDS_AIAN 6874 non-null float64
UGDS_NHPI 6874 non-null float64
UGDS_2MOR 6874 non-null float64
UGDS_NRA 6874 non-null float64
UGDS_UNKN 6874 non-null float64
PPTUG_EF 6853 non-null float64
CURROPER 7535 non-null int64
PCTPELL 6849 non-null float64
PCTFLOAN 6849 non-null float64
UG25ABV 6718 non-null float64
MD_EARN_WNE_P10 6413 non-null object
GRAD_DEBT_MDN_SUPP 7503 non-null object
dtypes: float64(20), int64(2), object(5)
memory usage: 1.6+ MB
college.describe(include=[np.number]).T
| | count | mean | ... | 75% | max |
---|---|---|---|---|---|
HBCU | 7164.0 | 0.014238 | ... | 0.000000 | 1.0 |
MENONLY | 7164.0 | 0.009213 | ... | 0.000000 | 1.0 |
... | ... | ... | ... | ... | ... |
PCTFLOAN | 6849.0 | 0.522211 | ... | 0.745000 | 1.0 |
UG25ABV | 6718.0 | 0.410021 | ... | 0.572275 | 1.0 |
22 rows × 8 columns
college.describe(include=[object, 'category']).T
| | count | unique | top | freq |
---|---|---|---|---|
INSTNM | 7535 | 7535 | Modern Welding School | 1 |
CITY | 7535 | 2514 | New York | 87 |
STABBR | 7535 | 59 | CA | 773 |
MD_EARN_WNE_P10 | 6413 | 598 | PrivacySuppressed | 822 |
GRAD_DEBT_MDN_SUPP | 7503 | 2038 | PrivacySuppressed | 1510 |
DataFrame.astype(dtype, copy=True, errors='raise', **kwargs)
different_cols = ['RELAFFIL', 'SATMTMID', 'CURROPER', 'INSTNM', 'STABBR']
col2 = college.loc[:, different_cols]
col2.dtypes
RELAFFIL int64
SATMTMID float64
CURROPER int64
INSTNM object
STABBR object
dtype: object
Use the memory_usage method to see how much memory each column consumes
original_mem = col2.memory_usage(deep=True)
original_mem
Index 80
RELAFFIL 60280
...
INSTNM 660240
STABBR 444565
Length: 6, dtype: int64
col2['RELAFFIL'].unique()
array([0, 1])
The RELAFFIL column contains only 0 and 1, so 64 bits are unnecessary; use the astype method to convert it to an 8-bit (1-byte) integer
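The effect of such a downcast can be checked on a small series first. This is a minimal sketch on invented data (`flag` is a hypothetical stand-in for RELAFFIL), and it verifies that the values fit in int8 before converting:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for RELAFFIL: a 0/1 flag stored as int64
flag = pd.Series([0, 1, 0, 0, 1, 1, 0, 1], dtype=np.int64, name="flag")

# int8 holds -128..127, so check the value range before downcasting
fits = (flag.min() >= np.iinfo(np.int8).min) and (flag.max() <= np.iinfo(np.int8).max)

small = flag.astype(np.int8)

# Compare the data buffers only (index excluded): 1 byte vs 8 bytes per value
ratio = small.memory_usage(index=False) / flag.memory_usage(index=False)
print(fits, ratio)  # True 0.125
```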
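col2.select_dtypes(include=['object']).nunique()
INSTNM 7535
STABBR 59
dtype: int64
The STABBR column can be converted to the category type (Categorical), because its number of unique values is less than 1% of the total row count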
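The "few unique values relative to row count" check can be sketched as follows. The data is invented (`codes` is a hypothetical stand-in for STABBR), and the 1% figure is only the rule of thumb used in these notes, not a pandas requirement:

```python
import pandas as pd

# Hypothetical stand-in for STABBR: 3 distinct codes repeated over 3000 rows
codes = pd.Series(["CA", "NY", "TX"] * 1000, name="state")

# Share of distinct values; a low share suggests category will pay off
unique_share = codes.nunique() / len(codes)

as_cat = codes.astype("category")

# deep=True counts the actual string storage, not just the pointers
before = codes.memory_usage(index=False, deep=True)
after = as_cat.memory_usage(index=False, deep=True)
print(unique_share, after < before)
```

A categorical column stores each distinct string once plus a small integer code per row, which is why the saving grows with the amount of repetition.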
col3 = col2.astype({'STABBR': 'category', 'RELAFFIL': np.int8})
col3.dtypes
RELAFFIL int8
SATMTMID float64
CURROPER int64
INSTNM object
STABBR category
dtype: object
new_mem = col3.memory_usage(deep=True)
new_mem
Index 80
RELAFFIL 7535
...
INSTNM 660699
STABBR 13576
Length: 6, dtype: int64
new_mem / original_mem
Index 1.000000
RELAFFIL 0.125000
...
INSTNM 1.000695
STABBR 0.030538
Length: 6, dtype: float64
Compared with the original data, the RELAFFIL column shrank to one eighth of its original size, and the STABBR column to only about 3%
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Select three columns and sort by title_year in descending order
movie = pd.read_csv('data/movie.csv')
movie2 = movie[['movie_title', 'title_year', 'imdb_score']]
movie2.sort_values('title_year', ascending=False).head()
| | movie_title | title_year | imdb_score |
---|---|---|---|
3884 | The Veil | 2016.0 | 4.7 |
2375 | My Big Fat Greek Wedding 2 | 2016.0 | 6.1 |
2794 | Miracles from Heaven | 2016.0 | 6.8 |
92 | Independence Day: Resurgence | 2016.0 | 5.5 |
153 | Kung Fu Panda 3 | 2016.0 | 7.2 |
Passing a list to ascending lets you sort some columns in descending order and others in ascending order at the same time
movie4 = movie[['movie_title', 'title_year', 'content_rating', 'budget']]
movie4_sorted = movie4.sort_values(['title_year', 'content_rating', 'budget'], ascending=[False, False, True])
movie4_sorted.head()
| | movie_title | title_year | content_rating | budget |
---|---|---|---|---|
4026 | Compadres | 2016.0 | R | 3000000.0 |
3884 | The Veil | 2016.0 | R | 4000000.0 |
3682 | Fifty Shades of Black | 2016.0 | R | 5000000.0 |
3685 | The Perfect Match | 2016.0 | R | 5000000.0 |
3396 | The Neon Demon | 2016.0 | R | 7000000.0 |
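The mixed-order sort above is easier to follow on a tiny frame. This is a minimal sketch with invented rows (`films` and its columns are hypothetical): the year is sorted descending and, within each year, the budget ascending.

```python
import pandas as pd

# Hypothetical four-film frame
films = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2016, 2015, 2016, 2015],
    "budget": [5.0, 1.0, 3.0, 2.0],
})

# One boolean per sort key: year descending, budget ascending
ordered = films.sort_values(["year", "budget"], ascending=[False, True])
print(ordered["title"].tolist())  # ['C', 'A', 'B', 'D']
```

The ascending list must have the same length as the list of sort keys.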
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
movie3 = movie2.sort_values(['title_year', 'imdb_score'], ascending=False)
movie3
| | movie_title | title_year | imdb_score |
---|---|---|---|
4312 | Kickboxer: Vengeance | 2016.0 | 9.1 |
4277 | A Beginner's Guide to Snuff | 2016.0 | 8.7 |
... | ... | ... | ... |
3246 | The Bold and the Beautiful | NaN | 3.5 |
2119 | The Bachelor | NaN | 2.9 |
4916 rows × 3 columns
Use drop_duplicates to remove duplicates, keeping only the first row for each year
movie_top_year = movie3.drop_duplicates(subset='title_year', keep='first')
movie_top_year
| | movie_title | title_year | imdb_score |
---|---|---|---|
4312 | Kickboxer: Vengeance | 2016.0 | 9.1 |
3745 | Running Forever | 2015.0 | 8.6 |
... | ... | ... | ... |
4695 | Intolerance: Love's Struggle Throughout the Ages | 1916.0 | 8.0 |
2725 | Towering Inferno | NaN | 9.5 |
92 rows × 3 columns
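The sort-then-deduplicate pattern above can be cross-checked on a small frame. This is a minimal sketch with invented data (`films` is hypothetical); the `groupby(...).idxmax()` comparison is added here only as a sanity check and is not part of the original recipe.

```python
import pandas as pd

# Hypothetical frame: two films from 2016, three from 2015
films = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "year": [2016, 2016, 2015, 2015, 2015],
    "score": [7.0, 9.0, 6.5, 8.0, 7.5],
})

# Sort so the best score of each year comes first, then keep one row per year
top = (films.sort_values(["year", "score"], ascending=False)
            .drop_duplicates(subset="year", keep="first"))
print(top["title"].tolist())  # ['B', 'D']

# Sanity check: index label of the highest score within each year
best_idx = films.groupby("year")["score"].idxmax()
print(sorted(best_idx.tolist()) == sorted(top.index.tolist()))  # True
```

One difference to keep in mind: drop_duplicates treats NaN as a value and keeps one NaN row, as the output above shows, whereas groupby drops NaN keys by default.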
Read movie.csv and select the three columns 'movie_title', 'imdb_score', and 'budget'
movie2 = movie[['movie_title', 'imdb_score', 'budget']]
movie2.head()
| | movie_title | imdb_score | budget |
---|---|---|---|
0 | Avatar | 7.9 | 237000000.0 |
1 | Pirates of the Caribbean: At World's End | 7.1 | 300000000.0 |
2 | Spectre | 6.8 | 245000000.0 |
3 | The Dark Knight Rises | 8.5 | 250000000.0 |
4 | Star Wars: Episode VII - The Force Awakens | 7.1 | NaN |
Use the nlargest method to select the 100 rows with the highest imdb_score
movie2.nlargest(100, 'imdb_score').head()
| | movie_title | imdb_score | budget |
---|---|---|---|
2725 | Towering Inferno | 9.5 | NaN |
1920 | The Shawshank Redemption | 9.3 | 25000000.0 |
3402 | The Godfather | 9.2 | 6000000.0 |
2779 | Dekalog | 9.1 | NaN |
4312 | Kickboxer: Vengeance | 9.1 | 17000000.0 |
Chain the nsmallest method to pick the five lowest-budget movies from those 100
movie2.nlargest(100, 'imdb_score').nsmallest(5, 'budget')
| | movie_title | imdb_score | budget |
---|---|---|---|
4804 | Butterfly Girl | 8.7 | 180000.0 |
4801 | Children of Heaven | 8.5 | 180000.0 |
4706 | 12 Angry Men | 8.9 | 350000.0 |
4550 | A Separation | 8.4 | 500000.0 |
4636 | The Other Dream Team | 8.4 | 500000.0 |
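The same chain on a handful of invented rows shows the order of operations: the cheapest film overall is excluded because it never makes the top-score cut. The frame `films` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical six-film frame
films = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "score": [9.0, 8.5, 8.8, 6.0, 9.2, 7.0],
    "budget": [50.0, 2.0, 10.0, 1.0, 30.0, 5.0],
})

# Step 1: the 4 best-rated films (E, A, C, B)
# Step 2: among those, the 2 with the smallest budget
cheap_and_good = films.nlargest(4, "score").nsmallest(2, "budget")
print(cheap_and_good["title"].tolist())  # ['B', 'C']
```

Film D has the lowest budget of all, but its score of 6.0 keeps it out of the first step.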
Use the sort_values method to select the 100 rows with the highest imdb_score
movie2.sort_values('imdb_score', ascending=False).head(100).head()
| | movie_title | imdb_score | budget |
---|---|---|---|
2725 | Towering Inferno | 9.5 | NaN |
1920 | The Shawshank Redemption | 9.3 | 25000000.0 |
3402 | The Godfather | 9.2 | 6000000.0 |
2779 | Dekalog | 9.1 | NaN |
4312 | Kickboxer: Vengeance | 9.1 | 17000000.0 |
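You can then chain .sort_values('budget').head() to pick the 5 lowest-budget movies. The result may differ from the nlargest/nsmallest chain, because many movies tie at an imdb_score of 8.4, as the tail of the nlargest result shows
movie2.nlargest(100, 'imdb_score').tail()
| | movie_title | imdb_score | budget |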
---|---|---|---|
4023 | Oldboy | 8.4 | 3000000.0 |
4163 | To Kill a Mockingbird | 8.4 | 2000000.0 |
4395 | Reservoir Dogs | 8.4 | 1200000.0 |
4550 | A Separation | 8.4 | 500000.0 |
4636 | The Other Dream Team | 8.4 | 500000.0 |
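Every row in the tail above has the same imdb_score of 8.4, so if the cutoff falls inside a group of ties, which rows make the cut depends on how ties are broken. A minimal sketch on invented data (`films` is hypothetical) shows the `keep` parameter of nlargest, which controls this:

```python
import pandas as pd

# Hypothetical frame with a three-way tie at 8.4
films = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "score": [9.0, 8.4, 8.4, 8.4, 7.0],
})

# Default keep='first': among tied rows, the earliest ones are taken
first_two = films.nlargest(2, "score")
print(first_two["title"].tolist())  # ['A', 'B']

# keep='all': every row tied with the last selected value is kept,
# so more than n rows can come back
with_ties = films.nlargest(2, "score", keep="all")
print(len(with_ties))  # 4
```

With keep='all', a later nsmallest step sees every tied film rather than an arbitrary subset of them.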
Pandas Cookbook, Chapter 3: Beginning Data Analysis
Original article: https://www.cnblogs.com/shiyushiyu/p/9738861.html