Tags: print, title, website, sel, sts, ges, sele, lis, web scraping
1. Pick a topic you are interested in; I chose Sohu News.
Website: http://news.sohu.com/
2. Scrape the relevant data from the web and print the results.
import requests
from bs4 import BeautifulSoup

url = 'http://news.sohu.com/'
res = requests.get(url)
res.encoding = 'UTF-8'
soup = BeautifulSoup(res.text, 'html.parser')
# Each .list16 block on the homepage holds a list of headline links;
# take the first <li> of each block and print its title and URL
for news in soup.select('.list16'):
    li = news.select('li')
    if len(li) > 0:
        title = li[0].text
        href = li[0].select('a')[0]['href']
        print(title, href)
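The selector logic above can be checked offline against a small HTML fragment, without hitting the live site. The markup below is a hypothetical stand-in for Sohu's .list16 blocks, not the site's actual HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the .list16 structure on news.sohu.com
html = """
<div class="list16">
  <ul>
    <li><a href="http://news.sohu.com/a1.shtml">Headline one</a></li>
    <li><a href="http://news.sohu.com/a2.shtml">Headline two</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
results = []
# Same loop as the scraper: first <li> per .list16 block
for news in soup.select('.list16'):
    li = news.select('li')
    if len(li) > 0:
        title = li[0].text
        href = li[0].select('a')[0]['href']
        results.append((title, href))
print(results)
```

Testing against a fixed snippet like this also guards against the site changing its markup: if Sohu renames the class, the loop silently yields nothing, which this kind of check makes visible.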
3. Analyze the text and generate a word cloud.
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Read the saved news text
text = open("D:\\cc.txt", 'r', encoding='utf-8').read()
print(text)
# Segment the Chinese text with jieba so WordCloud can count words
wordlist = jieba.cut(text, cut_all=True)
wl_split = "/".join(wordlist)
# Generate from the segmented string, not the raw text; a CJK-capable
# font (e.g. simhei.ttf on Windows) is needed or Chinese renders as boxes
mywc = WordCloud(font_path='simhei.ttf').generate(wl_split)
plt.imshow(mywc)
plt.axis("off")
plt.show()
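Under the hood, WordCloud's sizing is driven by a word-frequency tally of the segmented string. That counting step can be sketched with the standard library alone; the sketch below uses a hand-written pre-segmented string in place of jieba's output, so it runs without any third-party packages:

```python
import re
from collections import Counter

# Stand-in for "/".join(jieba.cut(...)) from the script above (hypothetical sample)
wl_split = "新闻/热点/新闻/搜狐/热点/新闻"

# WordCloud does essentially this: split the string into word tokens
# on non-word characters, then count occurrences of each token
tokens = [t for t in re.split(r"[^\w]+", wl_split) if t]
freq = Counter(tokens)
print(freq.most_common(2))  # the two most frequent words drive the largest glyphs
```

This also shows why joining with "/" works: any non-word separator is discarded during tokenization, so the choice of delimiter does not affect the counts.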
4. Results
Original post: http://www.cnblogs.com/hzl123/p/7772912.html