
33. Douban Book Short Reviews

Date: 2019-05-12 00:59:31


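Exercise overview
Requirements:
This exercise uses Scrapy to crawl the short reviews of the books on the first 2 pages (50 books) of the Douban Books Top 250 list (https://book.douban.com/top250) and save them to an Excel file with these columns:
    Book title
    Reviewer ID
    Short review text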
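
As a rough sketch of what "the first 2 pages" means in terms of URLs: assuming the list shows 25 books per page (50 books over 2 pages) and is paged through the `start` query parameter, the list-page URLs could be generated like this. The helper name `top250_page_urls` is hypothetical.

```python
BASE_URL = 'https://book.douban.com/top250'


def top250_page_urls(pages=2, per_page=25):
    """Build list-page URLs.

    per_page=25 is an assumption derived from "50 books on 2 pages".
    """
    return ['{}?start={}'.format(BASE_URL, page * per_page)
            for page in range(pages)]


for url in top250_page_urls():
    print(url)
```

With the defaults this gives one URL with `start=0` and one with `start=25`.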
 
1. Create the crawler project
 
D:\USERDATA\python>scrapy startproject duanping
New Scrapy project duanping, using template directory c:\users\www1707\appdata\local\programs\python\python37\lib\site-packages\scrapy\templates\project, created in:
    D:\USERDATA\python\duanping

You can start your first spider with:
    cd duanping
    scrapy genspider example example.com

D:\USERDATA\python>

 

2. Create the spider file D:\USERDATA\python\duanping\duanping\spiders\duanping.py
 
import math
import re

import bs4
import scrapy

from ..items import DuanpingItem


class DuanpingItemSpider(scrapy.Spider):
    name = 'duanping'
    allowed_domains = ['book.douban.com']
    # First two list pages (25 books each); the original listed start=0 twice.
    start_urls = ['https://book.douban.com/top250?start=0',
                  'https://book.douban.com/top250?start=25']

    def parse(self, response):
        # Each cover on the list page is an <a class="nbg"> linking to the book detail page.
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        datas = bs.find_all('a', class_='nbg')
        for data in datas:
            book_url = data['href']
            yield scrapy.Request(book_url, callback=self.parse_book)

    def parse_book(self, response):
        # response.url replaces the original string-slicing of str(response).
        book_url = response.url
        print(book_url)
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # The link text looks like "全部 112943 条"; the middle token is the review count.
        count_link = bs.find('a', href=re.compile('^https://book.douban.com/subject/.*/comments/'))
        comments = int(count_link.text.split(' ')[1])
        pages = math.ceil(comments / 20) + 1
        # for i in range(1, pages):  # crawl every review page
        for i in range(1, 3):        # crawl only the first 2 review pages
            comments_url = '{}comments/hot?p={}'.format(book_url, i)
            print(comments_url)
            yield scrapy.Request(comments_url, callback=self.parse_comment)

    def parse_comment(self, response):
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        book_name = bs.find('a', href=re.compile('^https://book.douban.com/subject/')).text
        datas = bs.find_all('li', class_='comment-item')
        for data in datas:
            item = DuanpingItem()
            item['book_name'] = book_name
            # The first profile link wraps the avatar; the second one holds the user name.
            item['user_id'] = data.find_all('a', href=re.compile('^https://www.douban.com/people/'))[1].text
            item['comment'] = data.find('span', class_='short').text
            yield item

 

3. Edit the items file D:\USERDATA\python\duanping\duanping\items.py
 
import scrapy


class DuanpingItem(scrapy.Item):
    book_name = scrapy.Field()
    user_id = scrapy.Field()
    comment = scrapy.Field()

 

4. Edit the file D:\USERDATA\python\duanping\duanping\pipelines.py
 
import openpyxl


class DuanpingPipeline(object):
    def __init__(self):
        # Create the workbook and write the header row:
        # book title, reviewer ID, review text.
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        self.ws.append(['书名', '评论ID', '评论内容'])

    def process_item(self, item, spider):
        # Append one row per scraped review.
        line = [item['book_name'], item['user_id'], item['comment']]
        self.ws.append(line)
        return item

    def close_spider(self, spider):
        # Save the workbook once the crawl has finished.
        self.wb.save('./save.xlsx')
        self.wb.close()

 

5. Edit D:\USERDATA\python\duanping\duanping\settings.py
 
BOT_NAME = 'duanping'

SPIDER_MODULES = ['duanping.spiders']
NEWSPIDER_MODULE = 'duanping.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

# Enable the Excel pipeline defined in pipelines.py.
ITEM_PIPELINES = {
    'duanping.pipelines.DuanpingPipeline': 300,
}

# Also export the items to a CSV file; utf-8-sig keeps Chinese text readable in Excel.
FEED_URI = './save.csv'
FEED_FORMAT = 'csv'
FEED_EXPORT_ENCODING = 'utf-8-sig'
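
Newer Scrapy releases (2.1 and later, to the best of my knowledge) deprecate FEED_URI and FEED_FORMAT in favour of a single FEEDS dictionary. If such a version is installed, the CSV export could probably be written as below. This is a sketch for settings.py, not something the original post used.

```python
# Assumption: Scrapy 2.1+ where the FEEDS setting is available.
# Maps each output path to its export options.
FEEDS = {
    './save.csv': {
        'format': 'csv',
        'encoding': 'utf-8-sig',  # BOM so that Excel detects UTF-8
    },
}
```

The Excel pipeline in ITEM_PIPELINES is independent of this setting, so both save.xlsx and save.csv would still be written.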

 

6. Run the command: scrapy crawl duanping
 

Draft notes on the page structure follow.

 

1. Book list page
 
 
            <a class="nbg" href="https://book.douban.com/subject/1083428/"
 
 
2. Book detail page
 
 
 
Total reviews     <a href="https://book.douban.com/subject/1770782/comments/">全部 112943 条</a>
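
The review count can be pulled out of that link text and turned into a number of review pages. The sketch below assumes 20 reviews per page, matching the spider's `comments / 20`. The helper name `review_page_count` is hypothetical, and it looks for the digits with a regular expression instead of splitting on spaces.

```python
import math
import re


def review_page_count(link_text, per_page=20):
    """Return how many review pages are needed for the count in link_text."""
    match = re.search(r'\d+', link_text)
    if match is None:
        raise ValueError('no review count found in {!r}'.format(link_text))
    return math.ceil(int(match.group()) / per_page)


pages = review_page_count('全部 112943 条')
# range() excludes its upper bound, hence the + 1 when looping over pages.
page_numbers = range(1, pages + 1)
print(pages, page_numbers[0], page_numbers[-1])
```

The spider folds the `+ 1` into `pages` itself and then iterates over `range(1, pages)`, which covers the same page numbers.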
 
 
3. Book short-review page
 
 
 
Book title          <a href="https://book.douban.com/subject/1770782/">追风筝的人</a>
 
 
 
find_all    <li class="comment-item" data-cid="693413905">
 
    Reviewer ID [1]   <a title="九尾黑猫" href="https://www.douban.com/people/LadyInSatin/">
                          <a href="https://www.douban.com/people/LadyInSatin/">九尾黑猫</a>
 
    Review text      <span class="short">“为你,千千万万遍。”我想,小说描写了一种最为诚挚的情感,而且它让你相信有些东西依然存在。在这个没有人相信承诺的年代,让人再次看到承诺背后那些美丽复杂的情感。这是一本好看的书,它让你重新思考。</span>
 
    <li class="comment-item" data-cid="693413905">
            <div class="avatar">
                <a title="福根儿" href="https://www.douban.com/people/fugen/">
                    <img src="https://img3.doubanio.com/icon/u3825598-141.jpg">
                </a>
            </div>
        <div class="comment">
            <h3>
                <span class="comment-vote">
                    <span id="c-693413905" class="vote-count">4756</span>
                        <a href="javascript:;" id="btn-693413905" class="j a_show_login" data-cid="693413905">有用</a>
                </span>
                <span class="comment-info">
                    <a href="https://www.douban.com/people/fugen/">福根儿</a>
                        <span class="user-stars allstar50 rating" title="力荐"></span>
                    <span>2013-09-18</span>
                </span>
            </h3>
            <p class="comment-content">
            
                <span class="short">为你,千千万万遍。</span>
            </p>
        </div>
    </li>
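
To check the extraction rules against this markup without running the crawler, here is a small sketch. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The class name `CommentItemParser`, the function `extract_reviews`, and the trimmed `SAMPLE` string are all illustrative, not part of the project.

```python
from html.parser import HTMLParser

PEOPLE_PREFIX = 'https://www.douban.com/people/'

# Trimmed copy of the comment-item markup shown above.
SAMPLE = '''
<li class="comment-item" data-cid="693413905">
  <div class="avatar">
    <a title="福根儿" href="https://www.douban.com/people/fugen/">
      <img src="https://img3.doubanio.com/icon/u3825598-141.jpg">
    </a>
  </div>
  <div class="comment">
    <h3>
      <span class="comment-info">
        <a href="https://www.douban.com/people/fugen/">福根儿</a>
        <span>2013-09-18</span>
      </span>
    </h3>
    <p class="comment-content">
      <span class="short">为你,千千万万遍。</span>
    </p>
  </div>
</li>
'''


class CommentItemParser(HTMLParser):
    """Collect (user name, short review) pairs from comment-item markup."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished (user_id, comment) tuples
        self._people_links = 0  # profile links seen in the current <li>
        self._capture = None    # field that the next text node belongs to
        self._current = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'li' and 'comment-item' in classes:
            self._people_links = 0
            self._current = {}
        elif tag == 'a' and (attrs.get('href') or '').startswith(PEOPLE_PREFIX):
            self._people_links += 1
            # The first profile link wraps the avatar; the second holds the name.
            if self._people_links == 2:
                self._capture = 'user_id'
        elif tag == 'span' and 'short' in classes:
            self._capture = 'comment'

    def handle_data(self, data):
        if self._capture:
            self._current[self._capture] = self._current.get(self._capture, '') + data

    def handle_endtag(self, tag):
        if tag in ('a', 'span'):
            self._capture = None
        elif tag == 'li' and self._current:
            self.rows.append((self._current.get('user_id'),
                              self._current.get('comment')))
            self._current = {}


def extract_reviews(html):
    parser = CommentItemParser()
    parser.feed(html)
    parser.close()
    return parser.rows


print(extract_reviews(SAMPLE))
```

Taking the second profile link mirrors the `[1]` index used in the spider's `parse_comment`.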


Original article: https://www.cnblogs.com/www1707/p/10850700.html
