Crawler Assignment: Scraping the PS4 Section of A9VG 电玩部落

Date: 2018-04-23 00:09:24


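1. Choose a topic or website that interests you. (No two students may pick the same one.)

2. Write a crawler in Python to scrape data on that topic from the web.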

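import requests
from bs4 import BeautifulSoup

def writeNewsDetail(content):
    # Append the article body to a9vg.txt; "with" closes the file even on error
    with open('a9vg.txt', 'a', encoding='utf-8') as f:
        f.write(content)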

def getNewsDetail(url):
    # Fetch one article page and return its body text and URL
    res2 = requests.get(url)
    res2.encoding = 'utf-8'
    soup2 = BeautifulSoup(res2.text, 'html.parser')
    news = {}
    news['content'] = soup2.select('.art-ctn')[0].text  # body text of the PS4 news article
    writeNewsDetail(news['content'])
    news['newsurl'] = url
    return news

def getListPage(pageUrl):
    # Collect every article linked from one list page
    res = requests.get(pageUrl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsList = []
    for news in soup.select('.tab-ctn dl'):
        if len(news.select('h3')) > 0:  # only entries that carry a headline
            a = news.select('a')[0].attrs['href']
            print(a)
            newsList.append(getNewsDetail(a))
    return newsList
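
The listing above defines the functions but does not show the loop that calls them. A minimal sketch of such a driver follows, assuming the list pages are numbered and their URLs can be built from a template. The template, the page range, and the name `crawl_pages` are hypothetical and not taken from the original post; the list-page function is passed in as an argument so the loop can be tried without touching the network.

```python
import time

# Hypothetical URL template; replace it with the real list-page address of the PS4 section.
LIST_URL_TEMPLATE = "http://example.com/ps4/list_{page}.html"

def crawl_pages(fetch_list_page, first_page=1, last_page=3, delay=1.0):
    """Call fetch_list_page on each numbered list page and merge the results."""
    all_news = []
    for page in range(first_page, last_page + 1):
        url = LIST_URL_TEMPLATE.format(page=page)
        try:
            all_news.extend(fetch_list_page(url))
        except Exception as exc:  # one bad page should not stop the whole crawl
            print("skipped", url, exc)
        if page < last_page:
            time.sleep(delay)  # pause between requests to go easy on the server
    return all_news
```

With the functions above in scope, `crawl_pages(getListPage)` would walk the first three pages and return the combined list of article dictionaries.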

3. Run text analysis on the scraped data and generate a word cloud.

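import jieba
from jieba import analyse

def cutword():
    # Read the scraped text back in
    with open('a9vg.txt', 'r', encoding='utf8') as f:
        text = f.read()
    # Extract keywords with jieba.analyse.extract_tags(); topK is set to 50 here (the default is 20)
    for key in analyse.extract_tags(text, 50, withWeight=False):
        print(key)
    jieba.add_word('奥丁')  # register "奥丁" (Odin) so it is kept as one token
    words_ls = jieba.cut(text)
    words_split = " ".join(words_ls)
    print(words_split)  # words_ls is a generator, so print the joined string instead
    return words_split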

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def wordspic():
    wordsp = cutword()
    Stopwords = ['programs', 'view', 'tudou', 'www', 'http', 'com', 'https', 'qq', 'page', '杀死', '渡鸦']
    wc = WordCloud()
    wc.stopwords = Stopwords
    wc.max_words = 200
    wc.background_color = 'white'
    # Pitfall: font_path must point to a font with Chinese glyphs, or the words render as small boxes
    wc.font_path = "simhei.ttf"  # SimHei
    my_wordcloud = wc.generate(wordsp)
    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()
    wc.to_file('ttt.png')  # save the image to a file
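
A word cloud sizes each word by how often it occurs, so it can help to look at the counts as plain numbers before drawing anything. The sketch below does that with the standard library only. The helper `top_words`, the sample string, and the choice to drop single-character tokens are illustrative assumptions, not part of the original post, and the result approximates rather than reproduces what WordCloud counts internally.

```python
from collections import Counter

def top_words(words_split, stopwords, n=10):
    """Count space-separated tokens, skipping stopwords and single characters."""
    stop = {w.lower() for w in stopwords}
    tokens = [
        w for w in words_split.split()
        if len(w) > 1 and w.lower() not in stop
    ]
    return Counter(tokens).most_common(n)

# Small hand-made sample standing in for the string returned by cutword()
sample = "奥丁 战神 的 奥丁 http 战神 奥丁 游戏"
print(top_words(sample, ["http", "www"], n=2))
# → [('奥丁', 3), ('战神', 2)]
```

If the counts look reasonable, `dict(top_words(...))` could be handed to WordCloud's `generate_from_frequencies` in place of calling `generate` on the raw text.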