
Data Analysis Case Study

Date: 2019-11-14 18:10:44


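Population Analysis Case Study

Requirements:

  • Import the files and inspect the raw data
  • Merge the population data with the state abbreviation data
  • Drop the redundant abbreviation column from the merged data
  • Check which columns contain missing data
  • Find which state/region values have NaN in the state column, and deduplicate them
  • Fill in the correct state value for those state/region entries, so that the state column no longer contains any NaN
  • Merge in the state area data (areas)
  • The area (sq. mi) column turns out to have missing data; find which rows are affected
  • Drop the rows that contain missing data
  • Select the 2010 total-population data
  • Compute the population density of each state
  • Sort and find the five states with the highest population density, using df.sort_values()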
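import pandas as pd
from IPython.display import display  # available by default inside Jupyter

abb = pd.read_csv('./data/state-abbrevs.csv')
pop = pd.read_csv('./data/state-population.csv')
area = pd.read_csv('./data/state-areas.csv')
# Merge the population data with the state abbreviation data
display(abb.head(1), pop.head(1))
abb_pop = pd.merge(abb, pop, left_on='abbreviation', right_on='state/region', how='outer')
abb_pop.head()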
     state abbreviation
0  Alabama           AL

  state/region     ages  year  population
0           AL  under18  2012   1117489.0

     state abbreviation state/region     ages  year  population
0  Alabama           AL           AL  under18  2012   1117489.0
1  Alabama           AL           AL    total  2012   4817528.0
2  Alabama           AL           AL  under18  2010   1130966.0
3  Alabama           AL           AL    total  2010   4785570.0
4  Alabama           AL           AL  under18  2011   1125763.0
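The outer merge keeps rows from both tables even when a key has no partner, which is where the NaN values in the state column come from. As a minimal sketch on toy data (the PR population figure is a placeholder, and the `indicator=True` argument is an addition that the original code does not use), the unmatched keys can be made visible like this:

```python
import pandas as pd

# Toy stand-ins for the abbreviation and population tables.
abb = pd.DataFrame({"state": ["Alabama", "Alaska"],
                    "abbreviation": ["AL", "AK"]})
pop = pd.DataFrame({"state/region": ["AL", "AK", "PR"],
                    "ages": ["total", "total", "total"],
                    "year": [2010, 2010, 2010],
                    # The PR value is a placeholder, not real data.
                    "population": [4785570.0, 713868.0, 3000000.0]})

# indicator=True adds a "_merge" column saying which side each row came from.
merged = pd.merge(abb, pop, left_on="abbreviation", right_on="state/region",
                  how="outer", indicator=True)

# Rows tagged "right_only" exist in pop but have no matching abbreviation.
unmatched = merged.loc[merged["_merge"] == "right_only", "state/region"].tolist()
print(unmatched)
```

With `how='outer'`, the PR row survives the merge with state set to NaN; an inner merge would silently discard it.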
 
# Drop the redundant abbreviation column from the merged data
abb_pop.drop(labels='abbreviation', axis=1, inplace=True)
abb_pop.head()

     state state/region     ages  year  population
0  Alabama           AL  under18  2012   1117489.0
1  Alabama           AL    total  2012   4817528.0
2  Alabama           AL  under18  2010   1130966.0
3  Alabama           AL    total  2010   4785570.0
4  Alabama           AL  under18  2011   1125763.0
# Check which columns contain missing data
abb_pop.isnull().any(axis=0)

state             True
state/region      False
ages              False
year              False
population        True
dtype: bool
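The boolean summary says which columns have gaps but not how many. A small sketch on a toy frame (the values are illustrative) shows `isnull().any(axis=0)` next to `isnull().sum()`, which counts the missing entries per column:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of its three columns.
df = pd.DataFrame({"state": ["Alabama", np.nan, np.nan],
                   "state/region": ["AL", "PR", "USA"],
                   "population": [4785570.0, np.nan, 1.0]})

has_missing = df.isnull().any(axis=0)  # True for columns with at least one NaN
missing_counts = df.isnull().sum()     # number of NaN values per column

print(has_missing)
print(missing_counts)
```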
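# Find which state/region values have NaN in the state column, and deduplicate them
# 1. Detect null values in the state column
abb_pop['state'].isnull()
# 2. Use the boolean result of step 1 to index the state/region column
abb_pop['state/region'][abb_pop['state'].isnull()]
# 3. Deduplicate
abb_pop['state/region'][abb_pop['state'].isnull()].unique()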
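The three steps can also be written as a single `.loc` lookup, which selects rows and column in one call instead of chaining two indexing operations. This is a sketch on toy data, not the author's code:

```python
import numpy as np
import pandas as pd

# Toy frame: PR and USA rows have no state name.
df = pd.DataFrame({"state": ["Alabama", np.nan, np.nan, np.nan, np.nan],
                   "state/region": ["AL", "PR", "PR", "USA", "USA"]})

mask = df["state"].isnull()                     # step 1: rows with a missing state
codes = df.loc[mask, "state/region"].unique()   # steps 2 and 3 in one expression
print(codes)
```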

 

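# Fill in the correct state value for those state/region entries, so that the state column no longer contains any NaN
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
# Overwrite the null values with 'United States'
abb_pop.loc[indexs, 'state'] = 'United States'
pr_index = abb_pop['state'][abb_pop['state/region'] == 'PR'].index
abb_pop.loc[pr_index, 'state'] = 'Puerto Rico'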
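When several codes need a name, one alternative is to keep the code-to-name pairs in a dictionary and fill every gap at once with `map` and `fillna`. The sketch below uses toy data, and the `names` dictionary is an assumption introduced for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"state": ["Alabama", np.nan, np.nan],
                   "state/region": ["AL", "PR", "USA"]})

# Hypothetical lookup table for the codes that have no state name.
names = {"USA": "United States", "PR": "Puerto Rico"}

# map() translates codes to names (NaN where there is no entry);
# fillna() only touches rows whose state is currently missing.
df["state"] = df["state"].fillna(df["state/region"].map(names))
print(df)
```

Rows that already have a state name are left untouched, so the same line can be rerun safely.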

# Merge in the state area data (areas); with no key given, pandas joins on the shared state column
abb_pop_area = pd.merge(abb_pop, area, how='outer')
abb_pop_area.head()

     state state/region     ages    year  population  area (sq. mi)
0  Alabama           AL  under18  2012.0   1117489.0        52423.0
1  Alabama           AL    total  2012.0   4817528.0        52423.0
2  Alabama           AL  under18  2010.0   1130966.0        52423.0
3  Alabama           AL    total  2010.0   4785570.0        52423.0
4  Alabama           AL  under18  2011.0   1125763.0        52423.0

# The area (sq. mi) column turns out to have missing data; find which rows are affected
abb_pop_area['area (sq. mi)'].isnull()
a_index = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index

# Drop the rows that contain missing data
abb_pop_area.drop(labels=a_index, axis=0, inplace=True)
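Collecting the index and then dropping it is one way to remove these rows; `dropna` with a `subset` argument gives the same result in one call. A sketch on toy data, comparing the two approaches:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"state": ["Alabama", "United States"],
                   "area (sq. mi)": [52423.0, np.nan]})

# Approach used in the walkthrough: find the index labels, then drop them.
a_index = df.loc[df["area (sq. mi)"].isnull()].index
via_drop = df.drop(labels=a_index, axis=0)

# Alternative: drop rows whose area value is missing, ignoring other columns.
via_dropna = df.dropna(subset=["area (sq. mi)"])

print(via_drop.equals(via_dropna))
```

Restricting `subset` to the area column matters: a bare `dropna()` would also discard rows that are missing only a population value.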

# Select the 2010 total-population data
abb_pop_area.query('year == 2010 & ages == "total"')

                    state state/region   ages    year  population  area (sq. mi)
3                 Alabama           AL  total  2010.0   4785570.0        52423.0
91                 Alaska           AK  total  2010.0    713868.0       656425.0
101               Arizona           AZ  total  2010.0   6408790.0       114006.0
189              Arkansas           AR  total  2010.0   2922280.0        53182.0
197            California           CA  total  2010.0  37333601.0       163707.0
283              Colorado           CO  total  2010.0   5048196.0       104100.0
293           Connecticut           CT  total  2010.0   3579210.0         5544.0
379              Delaware           DE  total  2010.0    899711.0         1954.0
389  District of Columbia           DC  total  2010.0    605125.0           68.0
475               Florida           FL  total  2010.0  18846054.0        65758.0
485               Georgia           GA  total  2010.0   9713248.0        59441.0
570                Hawaii           HI  total  2010.0   1363731.0        10932.0

# Compute the population density of each state (midu = density)
abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area.head()

     state state/region     ages    year  population  area (sq. mi)       midu
0  Alabama           AL  under18  2012.0   1117489.0        52423.0  21.316769
1  Alabama           AL    total  2012.0   4817528.0        52423.0  91.897221
2  Alabama           AL  under18  2010.0   1130966.0        52423.0  21.573851
3  Alabama           AL    total  2010.0   4785570.0        52423.0  91.287603
4  Alabama           AL  under18  2011.0   1125763.0        52423.0  21.474601

# Sort and find the five states with the highest population density, using df.sort_values()
abb_pop_area.sort_values(by='midu', axis=0, ascending=False).head()

                    state state/region   ages    year  population  area (sq. mi)         midu
391  District of Columbia           DC  total  2013.0    646449.0           68.0  9506.602941
385  District of Columbia           DC  total  2012.0    633427.0           68.0  9315.102941
387  District of Columbia           DC  total  2011.0    619624.0           68.0  9112.117647
431  District of Columbia           DC  total  1990.0    605321.0           68.0  8901.779412
389  District of Columbia           DC  total  2010.0    605125.0           68.0  8898.897059
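The final sort runs over every row, so its top five entries are all the District of Columbia in different years rather than five distinct states. One possible way to get one row per state is to filter to the 2010 totals before sorting. The sketch below uses a handful of rows copied from the outputs in this walkthrough, so its ranking covers only these sample rows, not the full dataset:

```python
import pandas as pd

# A few rows taken from the outputs shown in this walkthrough.
rows = [
    ("Alabama", "total", 2010, 4785570.0, 52423.0),
    ("Alaska", "total", 2010, 713868.0, 656425.0),
    ("Connecticut", "total", 2010, 3579210.0, 5544.0),
    ("Delaware", "total", 2010, 899711.0, 1954.0),
    ("District of Columbia", "total", 2010, 605125.0, 68.0),
    ("District of Columbia", "total", 2013, 646449.0, 68.0),
    ("Hawaii", "total", 2010, 1363731.0, 10932.0),
]
df = pd.DataFrame(rows, columns=["state", "ages", "year",
                                 "population", "area (sq. mi)"])
df["midu"] = df["population"] / df["area (sq. mi)"]

# Filter first so each state appears once, then sort by density.
top5 = (df.query('year == 2010 and ages == "total"')
          .sort_values(by="midu", ascending=False)
          .head(5))
print(top5[["state", "midu"]])
```

Because the filter leaves a single row per state, `head(5)` now returns five different states.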

 


Original article: https://www.cnblogs.com/harryblog/p/11858968.html
