
A Complete Course Project

Date: 2017-10-31 21:33:44

Tags: comparison, data, website, com, die, key, excel, coding, ram

Choose a topic that interests you

First, pick a website. I chose a mobile-game site to crawl; its URL is http://xin.ptbus.com/indiegame/news/


Scrape the relevant data from the web

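import requests
from bs4 import BeautifulSoup

# List page of the indie-game news section
url = 'http://xin.ptbus.com/indiegame/news/'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for news in soup.select('li'):
    # Only list items that contain an .ecst element are news entries
    if len(news.select('.ecst')) > 0:
        title = news.select('.ecst')[0].text
        url = news.select('a')[0]['href']
        source = soup.select('span')[0].text  # first <span> on the list page
        # Fetch the article page and read its introduction text
        resd = requests.get(url)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        pa = soupd.select('.gmIntro')[0].text
        # Print every news item, not only the last one
        print(title, url, source, pa)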
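The loop above passes each `href` straight to `requests.get`, which only works if the link is an absolute URL. As a minimal sketch, assuming some links on the list page may be relative (the original post does not say either way), they can be resolved against the list-page URL with the standard library's `urljoin`. The helper name `absolute_link` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

LIST_URL = 'http://xin.ptbus.com/indiegame/news/'


def absolute_link(href, base=LIST_URL):
    """Return href as an absolute URL, resolving relative links against base."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only
print(absolute_link('/indiegame/news/2.html'))
print(absolute_link('2.html'))
```

An `href` that is already absolute is returned unchanged, so the helper can be applied to every link without checking its form first.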

 

The data scraped from the site is shown in the figure below.

[Figure: screenshot of the scraped output]

Analyze the text and generate a word cloud

Build a word cloud directly from the scraped data.

import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt

url = 'http://xin.ptbus.com/indiegame/news/'
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for news in soup.select('li'):
    if len(news.select('.ecst')) > 0:
        title = news.select('.ecst')[0].text
        url = news.select('a')[0]['href']
        source = soup.select('span')[0].text
        resd = requests.get(url)
        resd.encoding = 'utf-8'
        soupd = BeautifulSoup(resd.text, 'html.parser')
        pa = soupd.select('.gmIntro')[0].text
        print(title, url, source, pa)

# Note: pa now holds the introduction text of the last news item scraped above
# Segment the text with jieba and count words longer than one character
words = jieba.lcut(pa)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the ten most frequent words (fewer if the text is short)
for word, count in items[:10]:
    print("{:<5}{:>2}".format(word, count))

# wordcloud does not support Chinese by default, so font_path must point to a Chinese font
cy = WordCloud(font_path='msyh.ttc').generate(pa)
plt.imshow(cy, interpolation='bilinear')
plt.axis("off")
plt.show()
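The word-counting step above can also be written with `collections.Counter`. The sketch below assumes the text has already been segmented (for example by `jieba.lcut`); the token list is made up for illustration and the function name `top_words` is hypothetical.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Count tokens longer than one character and return the n most common."""
    counts = Counter(t for t in tokens if len(t) > 1)
    return counts.most_common(n)


# Made-up tokens standing in for the output of jieba.lcut(pa)
sample = ['游戏', '的', '黎明危机', '游戏', '玩家', '游戏', '黎明危机']
for word, count in top_words(sample, n=3):
    print('{:<5}{:>2}'.format(word, count))
```

`most_common(n)` returns fewer entries when there are fewer than `n` distinct words, so short texts need no special handling.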

 

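The result is shown below. Since this is a mobile-game news site, the word 游戏 ("game") appears very frequently, and 黎明危机 is a game that is about to be released, so it draws a lot of attention.

[Figure: the generated word cloud]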

import requests
from bs4 import BeautifulSoup
import pandas
import sqlite3


def onepage(pageurl):
    """Return the link and title of every news item on one list page."""
    res = requests.get(pageurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsls = []
    for news in soup.select('li'):
        if len(news.select('.ecst')) > 0:
            # One dict per news item, so the DataFrame gets url and title columns
            newsls.append({
                'url': news.select('a')[0]['href'],
                'title': news.select('.ecst')[0].text,
            })
    return newsls


newstotal = []
dmurl = 'http://xin.ptbus.com/indiegame/news/'
newstotal.extend(onepage(dmurl))

# range(2, 3) covers only page 2; raise the upper bound to crawl more pages
for i in range(2, 3):
    listurl = 'http://xin.ptbus.com/indiegame/news/{}.html'.format(i)
    newstotal.extend(onepage(listurl))

# Save the results to an Excel file and to a SQLite table
df = pandas.DataFrame(newstotal)
df.to_excel('news.xlsx')

with sqlite3.connect('dmnewsdb.sqlite') as db:
    df.to_sql('dmnewsdb8', con=db)

Original article: http://www.cnblogs.com/zhoujinpeng/p/7763501.html
