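1. Import the required libraries.
2. Identify the site to scrape: http://top.baidu.com/buzz?b=341&c=513&fr=topbuzz_b341_c513
3. Locate the content to scrape: the keyword elements (class list-title) and the search-index cells (td elements with class last).
4. Use for loops to append the needed content to empty lists, then use a DataFrame to print the top ten entries of the hot-search list.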
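import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://top.baidu.com/buzz?b=341&c=513&fr=topbuzz_b341_c513'

def f(s):
    """Fetch the page at s and return it as a BeautifulSoup object, or None if the request fails."""
    try:
        # Send a browser-like User-Agent with the request
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
        r = requests.get(s, timeout=30, headers=headers)
        r.raise_for_status()
        # Let requests guess the encoding from the response body
        r.encoding = r.apparent_encoding
        return BeautifulSoup(r.text, 'lxml')
    except requests.RequestException:
        return None

soup = f(url)
a = []  # keywords
b = []  # search indices
if soup is not None:
    for link1 in soup.find_all(class_='list-title'):
        a.append(link1.get_text())
    for link2 in soup.find_all('td', class_='last'):
        b.append(link2.get_text().strip())

# One row per list, then transpose so each list becomes a column
data = pd.DataFrame([a, b], index=["Keyword", "Search index"]).T
print("Top ten of the Baidu hot-search list:", "\n")
print(data.iloc[0:10])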
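The extraction step can also be tried without touching the network. The sketch below is only an illustration: SAMPLE_HTML is made-up markup that mirrors the class names used in the script above (the real page may be laid out differently), extract_rows is a hypothetical helper name, and the built-in html.parser is used in place of lxml so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, written only to mirror the class names used
# in the script above; the real page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr><td><a class="list-title">keyword one</a></td><td class="last"> 1200 </td></tr>
  <tr><td><a class="list-title">keyword two</a></td><td class="last"> 950 </td></tr>
  <tr><td><a class="list-title">keyword three</a></td><td class="last"> 700 </td></tr>
</table>
"""

def extract_rows(html, limit=10):
    """Return up to `limit` (keyword, search index) pairs found in the markup."""
    # Built-in parser, so lxml is not required for this sketch
    soup = BeautifulSoup(html, "html.parser")
    titles = [t.get_text().strip() for t in soup.find_all(class_="list-title")]
    counts = [c.get_text().strip() for c in soup.find_all("td", class_="last")]
    # zip stops at the shorter list, so the result never holds a half-filled row
    return list(zip(titles, counts))[:limit]

rows = extract_rows(SAMPLE_HTML)
for keyword, index in rows:
    print(keyword, index)
```

If a table is wanted, the list of pairs can be handed to pandas, for example pd.DataFrame(rows, columns=["Keyword", "Search index"]).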
Original post: https://www.cnblogs.com/lzq129/p/12504595.html