
Targeted Crawler for Chinese University Rankings

Date: 2019-11-05 21:43:57

Tags: mysql, raise, print, enc, targeted, instance, com, scale, exe

This crawler scrapes the 2019 ranking of universities from the Zuihaodaxue (最好大学网) website and stores the data in MySQL:

import requests
from bs4 import BeautifulSoup
import bs4
import pymysql

# Connect to the database and create the table.
# Host, user, password, and database name are the placeholder values from the post.
db = pymysql.connect(host="localhost", user="root", password="password",
                     database="universityrankings", charset="utf8mb4")
cursor = db.cursor()
cursor.execute("drop table if exists UNRANKING2019")
sql = """
create table UNRANKING2019
(
paiming INTEGER,                 -- rank
xuexiaomingchen VARCHAR(40),     -- university name
shengshi VARCHAR(40),            -- province or city
zongfen VARCHAR(40),             -- total score
shengyuanzhiliang VARCHAR(40),   -- quality of incoming students
peiyangjieguo VARCHAR(40),       -- training outcome
shehuishengyu VARCHAR(40),       -- social reputation
keyanguimo VARCHAR(40),          -- research scale
keyanzhiliang VARCHAR(40),       -- research quality
dingjianchengguo VARCHAR(40),    -- top results
dingjianrencai VARCHAR(40),      -- top talent
kejifuwu VARCHAR(40),            -- technology services
chengguozhuanhua VARCHAR(40),    -- commercialization of results
xueshengguojihua VARCHAR(40),    -- student internationalization
primary key(xuexiaomingchen)
);
"""
cursor.execute(sql)

def getHTMLText(url):
    """Fetch the page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivlist(ulist, html):
    """Parse the ranking table into ulist, then insert every row into MySQL."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:  # empty page or unexpected layout: nothing to store
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr.find_all("td")
            if len(tds) < 14:  # skip rows that lack the 14 expected columns
                continue
            # get_text() also handles cells with nested tags, where .string is None;
            # empty cells become None so that they are stored as NULL.
            ulist.append([td.get_text(strip=True) or None for td in tds[:14]])
    sql = """
        INSERT INTO universityrankings.UNRANKING2019 values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
    cursor.executemany(sql, ulist)
    db.commit()
    cursor.close()

def printUnivList(ulist, num):
    """Print the header and the first num rows, tab-separated."""
    tplt = "\t".join("{%d}" % i for i in range(14))
    print(tplt.format("Rank", "University", "Province/City", "Total score", "Student quality", "Training outcome", "Social reputation", "Research scale", "Research quality", "Top results", "Top talent", "Technology services", "Results commercialization", "Student internationalization"))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows exist
        print(tplt.format(*u))

def main():
    uinfo = []
    url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html"
    html = getHTMLText(url)
    fillUnivlist(uinfo, html)
    printUnivList(uinfo, 20)
    db.close()

if __name__ == "__main__":
    main()
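
The storage step in fillUnivlist can be tried without a MySQL server. The following is only a minimal sketch under stated assumptions: the standard-library sqlite3 module is swapped in for pymysql, the table `ranking`, the helper `save_rows`, and the two sample rows are hypothetical, and only three of the fourteen columns are kept. Note that sqlite3 uses `?` as its placeholder where pymysql uses `%s`.

```python
import sqlite3

def save_rows(conn, rows):
    """Insert all rows in one batch inside a single transaction."""
    # Used as a context manager, the connection commits on success
    # and rolls back if an exception is raised.
    with conn:
        conn.executemany(
            "INSERT INTO ranking (paiming, xuexiaomingchen, zongfen) VALUES (?, ?, ?)",
            rows,
        )

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranking ("
    "paiming INTEGER, xuexiaomingchen TEXT PRIMARY KEY, zongfen TEXT)"
)

# Placeholder rows, made up for illustration; not real ranking data.
rows = [(1, "University A", "0.0"), (2, "University B", "0.0")]
save_rows(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM ranking").fetchone()[0]
print(count)
```

Passing the values as parameters, rather than building the SQL string by hand, lets the driver take care of quoting, which matters for scraped text that may contain quotes or other special characters.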
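
Tab-separated output, as in printUnivList, tends to drift out of alignment when the university names are Chinese and differ in length. One possible alternative, shown here only as a sketch, is to pad the name column with the full-width space U+3000, which is as wide as one CJK character in most monospaced terminals. The helper `format_row`, the column widths, and the sample rows are assumptions for illustration and are not part of the original script.

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one CJK character

def format_row(rank, name, score, name_width=10):
    """Center three fields; the name column is padded with full-width spaces."""
    # {3} supplies the fill character and {4} the width of the name column.
    template = "{0:^6}{1:{3}^{4}}{2:^8}"
    return template.format(rank, name, score, FULL_WIDTH_SPACE, name_width)

# Placeholder rows, made up for illustration; not real ranking data.
rows = [("1", "甲大学", "0.0"), ("2", "乙丙丁戊大学", "0.0")]

print(format_row("排名", "学校名称", "总分"))
for row in rows:
    print(format_row(*row))
```

Every name field comes out exactly `name_width` characters long, so the score column starts at the same position on each line as long as the names contain only full-width characters.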

 

Original article: https://www.cnblogs.com/lsyb-python/p/11801576.html
