
Scraping the Douban Top 250 -- The Basics



After a long enough break you naturally forget things, so on a whim I scraped Douban for fun.

    

1. scrapy startproject novels  creates the novels project.

2. cd novels && scrapy genspider douban douban.com  generates the spider template (the resulting layout is sketched after this list).

3. On to the code.
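After steps 1 and 2 the scaffold should look roughly like this (the standard layout Scrapy generates, shown for orientation only):

novels/
    scrapy.cfg
    novels/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            douban.py        # created by genspider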

 

The spider (spiders/douban.py):
# -*- coding: utf-8 -*-
import scrapy
from novels.items import Douban250BookItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        # first page: send the start URL itself through parse_next
        yield scrapy.Request(response.url, callback=self.parse_next)

        # other pages (only the hrefs of the remaining 9 pages are collected here)
        for page in response.xpath('//div[@class="paginator"]/a'):
            url = page.xpath('@href').extract_first()
            yield scrapy.Request(url, callback=self.parse_next)

    def parse_next(self, response):
        for item in response.xpath('//tr[@class="item"]'):
            book = Douban250BookItem()
            book['name'] = item.xpath('td[2]/div[1]/a/@title').extract_first()
            # the info line reads "author / publisher / year / price"; keep the last part
            price_str = item.xpath('td[2]/p/text()').extract_first().split('/')[-1]
            book['price'] = price_str.strip()
            book['ratings'] = item.xpath('td[2]/div[2]/span[2]/text()').extract_first()
            yield book

Remember to set a USER_AGENT in the settings, otherwise the site blocks the crawl.

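A minimal sketch of the relevant settings.py lines (the UA string, pipeline priority, and delay are my own placeholder choices, not taken from the original screenshot):

# settings.py -- only the lines that matter for this crawl

# identify as a normal browser; the default Scrapy UA gets blocked, as noted above
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36'

# enable the MySQL pipeline defined in pipelines.py below
ITEM_PIPELINES = {
    'novels.pipelines.NovelsPipeline': 300,
}

# optional: slow down a little to be polite
DOWNLOAD_DELAY = 1
# depending on the site's robots.txt you may also need ROBOTSTXT_OBEY = False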

 

 

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NovelsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class Douban250BookItem(scrapy.Item):
    name    = scrapy.Field()
    price   = scrapy.Field()
    ratings = scrapy.Field()

  

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class NovelsPipeline(object):
    def process_item(self, item, spider):
        name    = item['name']
        price   = item['price']
        ratings = item['ratings']

        # note: this opens a new connection for every item, which is simple
        # but slow; fine for 250 rows
        connection = pymysql.connect(
            host='127.0.0.1',
            user='root',
            passwd='root',
            db='python_scrapy',
            charset='utf8mb4',  # pymysql expects 'utf8mb4', not 'utf-8'
            cursorclass=pymysql.cursors.DictCursor
        )
        try:
            with connection.cursor() as cursor:
                # insert one row per scraped book
                sql = """INSERT INTO `douban250` (name, price, ratings) VALUES (%s, %s, %s)"""
                cursor.execute(sql, (name, price, ratings))
                connection.commit()
        finally:
            connection.close()

        return item
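The pipeline assumes the python_scrapy database and a douban250 table already exist. A one-off sketch to create them with the same local credentials (column names match the item fields; the column sizes are arbitrary guesses):

# create_table.py -- run once before crawling
import pymysql

connection = pymysql.connect(host='127.0.0.1', user='root', passwd='root', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute("CREATE DATABASE IF NOT EXISTS python_scrapy DEFAULT CHARACTER SET utf8mb4")
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS python_scrapy.douban250 (
                id      INT AUTO_INCREMENT PRIMARY KEY,
                name    VARCHAR(255),
                price   VARCHAR(64),
                ratings VARCHAR(16)
            ) DEFAULT CHARSET = utf8mb4
        """)
    connection.commit()
finally:
    connection.close()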

That is the most basic Scrapy code.
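To actually run it, from the project directory (the standard Scrapy command, using the spider name defined above):

scrapy crawl douban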

The result:

(screenshot: the rows of the douban250 table after the crawl)

 

Exactly 250 rows.

The code above was just a refresher. I had not written a crawler in a long time; reviewing the old really does teach you something new!

 



Original post: https://www.cnblogs.com/wujf-myblog/p/12363124.html
