Scraping Chinese University Rankings

Date: 2020-01-28 12:34:12

We want to scrape the 2019 ranking of universities in mainland China, using the physics discipline as an example: http://www.zuihaodaxue.cn/BCSR/wulixue2019.html

The page is simple and not hard to scrape. We use Python's requests library and the bs4 package, which provides BeautifulSoup. The BeautifulSoup documentation is available at https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

First, define a get function that downloads the page:

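import requests

def get(url):
    """Download the page and return its text, or '' if the request fails."""
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"}
    try:
        resp = requests.get(url, headers=header, timeout=10)
        resp.raise_for_status()  # treat HTTP error status codes as failures
        resp.encoding = resp.apparent_encoding  # guess the encoding from the content
        return resp.text
    except requests.RequestException:
        print('Scraping error')
        return ''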

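Next, define a function that stores the rankings in a list. The page source shows that the ranking sits inside a tbody tag, each university occupies one pair of tr tags, and each piece of information about a university sits in a pair of td tags. So we only need to take each tr tag inside tbody, read the contents of its td tags, and append them to a list.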

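import bs4
from bs4 import BeautifulSoup

def fillUnivList(uList, html):
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:  # empty html or no table on the page
        return
    # len(tbody) is the number of child nodes; the count check leaves out
    # the last two child nodes of tbody.
    count = 1
    for tr in tbody.children:
        # tbody.children also yields whitespace strings, so keep only Tag nodes.
        if count < len(tbody) - 1 and isinstance(tr, bs4.element.Tag):
            td = tr('td')  # shorthand for tr.find_all('td')
            uList.append([td[0].string, td[3].string, td[6].string])
        count = count + 1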
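
To make the traversal concrete, here is a minimal sketch that runs the same tbody, tr, td walk on a small inline HTML string. The markup and the values in it are made-up placeholders, not data from the ranking page, and `extract_rows` is a hypothetical helper name. Instead of counting children, it skips any row with fewer than seven cells, which is an assumption about how non-data rows look. It also shows why the `isinstance` check matters: `tbody.children` yields the newlines between rows as string nodes.

```python
import bs4
from bs4 import BeautifulSoup

# Made-up sample markup that mimics the assumed layout (7 cells per row).
SAMPLE_HTML = """
<table><tbody>
<tr><td>1</td><td>a</td><td>b</td><td>University A</td><td>c</td><td>d</td><td>90.0</td></tr>
<tr><td>2</td><td>a</td><td>b</td><td>University B</td><td>c</td><td>d</td><td>85.5</td></tr>
</tbody></table>
"""

def extract_rows(html):
    """Return [cell 1, cell 4, cell 7] for every <tr> with at least 7 cells."""
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    rows = []
    if tbody is None:
        return rows
    for tr in tbody.children:
        # Skip the newline strings that sit between the <tr> tags.
        if not isinstance(tr, bs4.element.Tag):
            continue
        td = tr.find_all('td')
        if len(td) >= 7:
            rows.append([td[0].string, td[3].string, td[6].string])
    return rows

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Note that `.string` returns `None` when a cell contains more than one child node, so real pages may need `get_text()` for such cells.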

Only the contents of the first, fourth, and seventh cells (td tags) of each row are kept here. Finally, call both functions from main:

def main():
    url = 'http://www.zuihaodaxue.cn/BCSR/wulixue2019.html'
    html = get(url)
    uList = []
    fillUnivList(uList, html)
    print(uList)  # show the collected [cell 1, cell 4, cell 7] rows

if __name__ == '__main__':
    main()
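
To display the collected rows in a more readable way, one option is a small formatting helper. The sketch below is only an illustration: `format_rows`, its column widths, and the sample rows are hypothetical and not part of the original post. It also tolerates `None`, which `.string` returns for a cell that has more than one child node.

```python
def format_rows(rows, widths=(6, 24, 8)):
    """Format rows of three values as left-aligned, fixed-width text lines."""
    lines = []
    for row in rows:
        # .string can be None, so fall back to an empty string.
        cells = ['' if value is None else str(value) for value in row]
        line = ''.join(cell.ljust(width) for cell, width in zip(cells, widths))
        lines.append(line.rstrip())
    return lines

# Placeholder rows, not real ranking data.
sample = [['1', 'University A', '90.0'], ['2', None, '85.5']]
for line in format_rows(sample):
    print(line)
```

Chinese names are usually rendered with full-width glyphs, so alignment based on `ljust` is only approximate in a terminal.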

Original post: https://www.cnblogs.com/mambakb/p/12237523.html
