Tags: comparison, data, website, com, die, key, excel, coding, ram
Choose a topic that interests you
First, pick a website. I chose a mobile game news site to crawl; its URL is http://xin.ptbus.com/indiegame/news/
Scrape the relevant data from the web
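import requests
from bs4 import BeautifulSoup

url = 'http://xin.ptbus.com/indiegame/news/'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Each news entry is an <li> that contains an element with class "ecst".
for news in soup.select('li'):
    if len(news.select('.ecst')) > 0:
        title = news.select('.ecst')[0].text
        url = news.select('a')[0]['href']
        # Note: this reads the first <span> of the whole list page,
        # so the value is the same for every entry.
        source = soup.select('span')[0].text
        # Fetch the detail page and extract its intro text.
        resd = requests.get(url)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        intro = soupd.select('.gmIntro')
        pa = intro[0].text if intro else ''  # skip pages without an intro block
        print(title, url, source, pa)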
The data scraped from the site is shown in the figure below.
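The scraper above passes each href straight to requests.get, which only works when the link is absolute. The source does not say whether the site uses relative links, so the following is a minimal sketch under that assumption. The helper name absolute_link and the sample hrefs are hypothetical and are not taken from the site.

```python
from urllib.parse import urljoin

# List page URL from the article.
LIST_URL = 'http://xin.ptbus.com/indiegame/news/'


def absolute_link(href, base=LIST_URL):
    """Resolve an href against the list page URL.

    Absolute hrefs come back unchanged; relative ones are joined to base.
    """
    return urljoin(base, href.strip())


# Hypothetical hrefs for illustration only, not real article links.
samples = [
    'http://xin.ptbus.com/indiegame/news/1.html',  # already absolute
    '/indiegame/news/2.html',                      # root-relative
    '3.html',                                      # relative to the list page
]
resolved = [absolute_link(h) for h in samples]
for link in resolved:
    print(link)
```

In the scraping loop, this kind of helper could wrap the href before the detail page is requested, so the request does not fail on a relative path.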
Analyze the text and generate a word cloud
Turn the scraped data directly into a word cloud.
import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

url = 'http://xin.ptbus.com/indiegame/news/'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Collect every article's intro so the analysis covers all scraped
# articles, not just the last one visited.
texts = []
for news in soup.select('li'):
    if len(news.select('.ecst')) > 0:
        title = news.select('.ecst')[0].text
        url = news.select('a')[0]['href']
        source = soup.select('span')[0].text
        resd = requests.get(url)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        intro = soupd.select('.gmIntro')
        pa = intro[0].text if intro else ''
        print(title, url, source, pa)
        texts.append(pa)

# Segment the text with jieba and count words longer than one character.
words = jieba.lcut('\n'.join(texts))
counts = {}
for word in words:
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1

# Print the ten most frequent words (fewer if the text is short).
items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{:<5}{:>2}".format(word, count))

# wordcloud does not support Chinese by default, so font_path must point to a Chinese font.
# Building the cloud from the jieba counts keeps it consistent with the list above.
cy = WordCloud(font_path='msyh.ttc').generate_from_frequencies(counts)
plt.imshow(cy, interpolation='bilinear')
plt.axis("off")
plt.show()
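The result is shown below. Because this is a mobile game news site, the word "游戏" (game) appears very often, and "黎明危机" is a game that is about to be released, so it draws a lot of attention.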
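The frequency count behind this observation can also be written with collections.Counter from the standard library. The sketch below is an alternative to the dictionary loop, not the author's code. The function name top_words and the token list are hypothetical; the tokens stand in for the output of jieba.lcut.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Count tokens longer than one character (after stripping whitespace)
    and return the n most common as (word, count) pairs."""
    counts = Counter(t for t in tokens if len(t.strip()) > 1)
    return counts.most_common(n)


# Hypothetical, already-segmented tokens standing in for jieba.lcut output.
tokens = ['游戏', '的', '黎明危机', '游戏', '是', '手游', '游戏', '黎明危机', ',']
for word, count in top_words(tokens, n=3):
    print('{:<5}{:>2}'.format(word, count))
```

Counter.most_common returns at most n entries, so a short text with fewer than n distinct words does not raise an error.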
# Save the news links and titles from the list pages to Excel and SQLite.
import requests
from bs4 import BeautifulSoup
import pandas
import sqlite3


def onepage(pageurl):
    # Return one {url, title} record per news entry on a list page.
    res = requests.get(pageurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsls = []
    for news in soup.select('li'):
        if len(news.select('.ecst')) > 0:
            # Keep url and title together so each row is one article.
            newsls.append({
                'url': news.select('a')[0]['href'],
                'title': news.select('.ecst')[0].text,
            })
    return newsls


newstotal = []
dmurl = 'http://xin.ptbus.com/indiegame/news/'
newstotal.extend(onepage(dmurl))
# range(2, 3) covers only page 2; raise the upper bound to crawl more pages.
for i in range(2, 3):
    listurl = 'http://xin.ptbus.com/indiegame/news/{}.html'.format(i)
    newstotal.extend(onepage(listurl))

df = pandas.DataFrame(newstotal)
df.to_excel('news.xlsx')  # writing .xlsx requires an Excel engine such as openpyxl
with sqlite3.connect('dmnewsdb.sqlite') as db:
    df.to_sql('dmnewsdb8', con=db)
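If the script is run more than once, it may help to check what was actually stored and to avoid duplicate rows. The following is a minimal sketch using only the standard sqlite3 module, with an in-memory database instead of dmnewsdb.sqlite. The table name news, its schema, and the sample rows are assumptions for illustration, not part of the original script.

```python
import sqlite3

# Hypothetical rows in (url, title) form; not real scraped data.
rows = [
    ('http://xin.ptbus.com/indiegame/news/a.html', 'Sample title A'),
    ('http://xin.ptbus.com/indiegame/news/b.html', 'Sample title B'),
]

# ':memory:' keeps the demo self-contained; the article writes to a file instead.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE news (url TEXT PRIMARY KEY, title TEXT)')

# INSERT OR IGNORE skips URLs that are already stored,
# so running the insert twice does not duplicate rows.
insert_sql = 'INSERT OR IGNORE INTO news (url, title) VALUES (?, ?)'
db.executemany(insert_sql, rows)
db.executemany(insert_sql, rows)  # simulated second run
db.commit()

# Read the rows back to confirm what was stored.
stored = db.execute('SELECT url, title FROM news ORDER BY url').fetchall()
print(len(stored), 'rows stored')
db.close()
```

Using the URL as the primary key is one possible design choice; it treats each article link as unique, which seems reasonable for a news list but is not confirmed by the source.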
Original article: http://www.cnblogs.com/zhoujinpeng/p/7763501.html