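Step 1: Pick a folder, open a terminal in it, and run the command scrapy startproject qidian
Step 2: Change into the inner spiders folder with cd qidian/qidian/spiders, then run scrapy genspider qidianyuedu qidian.com (the last argument is the domain)
Note: the spider name qidianyuedu must not be the same as the project name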
Step 3: In the project directory, create a launcher file named starts.py
    from scrapy import cmdline

    # Equivalent to typing "scrapy crawl qidianyuedu" in the terminal
    cmdline.execute(["scrapy", "crawl", "qidianyuedu"])
Step 4: Edit the settings file. The main changes are:
    # Add headers
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

    # robots.txt
    ROBOTSTXT_OBEY = False

    # Enable the pipeline
    ITEM_PIPELINES = {
        'qidian.pipelines.QidianPipeline': 300,
    }
Step 5: Define the item fields that match the data you want to crawl
    import scrapy


    class QidianItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        url = scrapy.Field()
        author = scrapy.Field()
        category = scrapy.Field()
        status = scrapy.Field()
        bref = scrapy.Field()
Step 6: Write the pipeline. This example saves the data to MySQL
    import pymysql


    class QidianPipeline(object):

        def __init__(self):
            # Connect once when the pipeline is created; replace the placeholders with real values.
            self.db = pymysql.connect(host="xx.xx.xx.xx",
                                      port=3306,
                                      user="root",
                                      password="xxx",
                                      db="xxx",
                                      charset="utf8mb4")
            self.cur = self.db.cursor()

        def process_item(self, item, spider):
            sql = """insert into qqyuedu(title,url,author,category,
                     status,bref)
                     VALUES (%s,%s,%s,%s,%s,%s)"""
            data = (item["title"], item["url"], item["author"], item["category"],
                    item["status"], item["bref"])
            try:
                self.cur.execute(sql, data)
            except pymysql.MySQLError:
                # Skip rows that fail to insert, but undo the failed statement.
                self.db.rollback()
            else:
                self.db.commit()
            return item

        def close_spider(self, spider):
            # Scrapy calls this hook when the spider finishes, which is more
            # predictable than relying on __del__.
            self.cur.close()
            self.db.close()
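The insert, then commit on success or roll back on failure, flow of process_item can be tried without a MySQL server. The following is a minimal sketch under stated assumptions: the standard-library sqlite3 module stands in for pymysql, plain dicts stand in for items, and the class name SqlitePipeline, the UNIQUE constraint on url, and the sample book are all made up for illustration. Note that sqlite3 uses ? placeholders where pymysql uses %s. The open_spider and close_spider methods are Scrapy's pipeline hooks for setup and teardown.

```python
import sqlite3

FIELDS = ("title", "url", "author", "category", "status", "bref")


class SqlitePipeline:
    """Hypothetical stand-in for QidianPipeline, backed by in-memory SQLite."""

    def open_spider(self, spider):
        self.db = sqlite3.connect(":memory:")
        # The UNIQUE constraint is an assumption, added so a duplicate insert fails.
        self.db.execute(
            "CREATE TABLE qqyuedu (title TEXT, url TEXT UNIQUE, author TEXT, "
            "category TEXT, status TEXT, bref TEXT)"
        )

    def process_item(self, item, spider):
        sql = ("INSERT INTO qqyuedu (title, url, author, category, status, bref) "
               "VALUES (?, ?, ?, ?, ?, ?)")
        data = tuple(item[name] for name in FIELDS)
        try:
            self.db.execute(sql, data)
        except sqlite3.Error:
            self.db.rollback()   # undo the failed statement, keep crawling
        else:
            self.db.commit()
        return item

    def close_spider(self, spider):
        self.db.close()


def run_demo():
    """Insert the same made-up book twice and return the stored row count."""
    pipeline = SqlitePipeline()
    pipeline.open_spider(spider=None)
    book = {"title": "Sample", "url": "https://example.com/1", "author": "A",
            "category": "fantasy", "status": "ongoing", "bref": "intro"}
    pipeline.process_item(book, spider=None)
    pipeline.process_item(book, spider=None)  # duplicate url: rolled back
    count = pipeline.db.execute("SELECT COUNT(*) FROM qqyuedu").fetchone()[0]
    pipeline.close_spider(spider=None)
    return count
```

Calling run_demo() processes the same item twice; the second insert violates the constraint and is rolled back instead of stopping the crawl.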
Step 7: Write the main spider program, that is, the file under spiders
Two patterns are summarized here:
1. Use start_urls and an offset to page through the results
    import scrapy

    from ..items import DoubanItem  # assumes DoubanItem is defined in the project's items.py


    class Douban250Spider(scrapy.Spider):
        name = 'douban250'
        offset = 0
        allowed_domains = ['movie.douban.com']
        start_urls = ['https://movie.douban.com/top250?start=0&filter=']

        def parse(self, response):
            li_list = response.css(".grid_view li")
            for li in li_list:
                # Create a fresh item per movie so yielded items do not share state.
                item = DoubanItem()
                item["name"] = li.css(".info")[0].xpath(".//span[@class=\"title\"][1]/text()")[0].extract()
                item["info"] = "".join("".join(li.css(".info .bd")[0].xpath("./p//text()").extract()).split())
                item["score"] = float(li.css(".info .star")[0].xpath("./span[@class=\"rating_num\"]/text()")[0].extract())
                item["access"] = li.css(".info .star")[0].xpath("./span[4]/text()")[0].extract()
                item["bref"] = li.css(".info .quote")[0].xpath("./span[@class=\"inq\"]/text()")[0].extract()
                yield item

            # With 25 entries per page the last page starts at offset 225,
            # so stop scheduling new pages after that.
            if self.offset < 225:
                self.offset += 25
                url = "https://movie.douban.com/top250?start=" + str(self.offset) + "&filter="
                yield scrapy.Request(url, callback=self.parse, dont_filter=True)
2. Override start_requests
    import scrapy

    from ..items import QidianItem  # the item class from step 5


    class QidianyueduSpider(scrapy.Spider):
        name = 'qidianyuedu'
        # The parent domain covers both www.qidian.com and book.qidian.com.
        allowed_domains = ['qidian.com']

        def start_requests(self):
            # get_page_num() is defined in the full listing at the end of this article.
            page_num = self.get_page_num()
            for i in range(1, page_num + 1):
                url = "https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=" + str(i)
                yield scrapy.Request(url, callback=self.parse,
                                     headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"})

        def parse(self, response):
            li_list = response.css(".book-img-text li")
            for li in li_list:
                item = QidianItem()
                item["title"] = li.css(".book-mid-info h4 a::text")[0].extract()
                item["url"] = "https:" + li.css(".book-mid-info h4 a::attr(href)")[0].extract()
                item["author"] = li.css(".book-mid-info .author a")[0].xpath("./text()")[0].extract()
                # Every link after the author link is a category label.
                category = ""
                a_list = li.css(".book-mid-info .author a")[1:]
                for a in a_list:
                    a_text = a.css("a::text")[0].extract()
                    category += a_text
                    category += " "
                item["category"] = category.strip()
                item["status"] = li.css(".book-mid-info .author span::text")[0].extract()
                yield item
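Both patterns come down to generating one URL per page, either from an offset or from a page number. As a rough sketch of that step in isolation, the helpers below build the same URLs with urllib.parse.urlencode instead of string concatenation. The function names and the default of 250 entries are assumptions made for illustration; the base URLs and query parameters are copied from the spiders above.

```python
from urllib.parse import urlencode


def douban_page_urls(total=250, page_size=25):
    """Offset-style pagination: one URL per offset 0, 25, ..., 225."""
    base = "https://movie.douban.com/top250"
    urls = []
    for offset in range(0, total, page_size):
        query = urlencode({"start": offset, "filter": ""})
        urls.append(base + "?" + query)
    return urls


def qidian_page_urls(page_num):
    """Page-number pagination: one URL per page 1..page_num."""
    base = "https://www.qidian.com/all"
    params = {"orderId": "", "style": 1, "pageSize": 20, "siteid": 1,
              "pubflag": 0, "hiddenField": 0}
    urls = []
    for page in range(1, page_num + 1):
        query = urlencode({**params, "page": page})
        urls.append(base + "?" + query)
    return urls
```

In a spider, each of these URLs would be wrapped in a scrapy.Request inside start_requests; generating them up front also makes it easy to check that the last page is not an empty one.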
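Step 8: Parse the data. While working out the selectors, scrapy shell xxxxx (where xxxxx is the URL of the site to crawl) is a useful helper: it opens an interactive prompt. First run view(response) to check that the downloaded page really is the target page, then extract the data with css/xpath selectors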
Note: when the text extracted from a page is long and you want to join the pieces together, a lot of whitespace characters get in the way. One fix:
    "".join("".join(li.css(".info .bd")[0].xpath("./p//text()").extract()).split())
Once every field extracts correctly in the shell, move the expressions into the spider code; see step 7 for the code
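The whitespace fix from the note in step 8 can be tried on plain strings, without a response object. This is a small sketch with made-up text fragments standing in for the list returned by extract(); the function names are hypothetical. The second helper is a variant that keeps a single space between words, which tends to suit text in space-separated languages better.

```python
def squash_whitespace(parts):
    """Join text fragments and drop every whitespace character."""
    return "".join("".join(parts).split())


def normalize_whitespace(parts):
    """Join text fragments and collapse each whitespace run to one space."""
    return " ".join("".join(parts).split())


# Made-up fragments, shaped like the output of .xpath("./p//text()").extract()
fragments = ["\n    Hope is a good thing,\n", "      maybe the best of things.  \n"]

print(squash_whitespace(fragments))     # Hopeisagoodthing,maybethebestofthings.
print(normalize_whitespace(fragments))  # Hope is a good thing, maybe the best of things.
```

The inner join glues the fragments together, split() with no arguments cuts on any run of whitespace (spaces, tabs, newlines), and the outer join decides what goes back between the pieces.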
Going one step further: an item whose fields are split across pages
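Sometimes a single page does not contain all of an item's fields and the rest must be extracted from a deeper page. In that case the item has to be passed along between requests: attach it to the request with Request(url, meta={"meta": item}, callback=self.parse_detail), unpack it in the new callback with item = response.meta["meta"], fill in the remaining fields there, and finally yield the item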
    import requests
    import scrapy
    from bs4 import BeautifulSoup

    from ..items import QidianItem  # the item class from step 5

    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}


    class QidianyueduSpider(scrapy.Spider):
        name = 'qidianyuedu'
        # The parent domain covers both www.qidian.com and book.qidian.com.
        allowed_domains = ['qidian.com']

        def get_page_num(self):
            # Fetch page 1 once with requests to read the total number of books.
            url = "https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1"
            res = requests.get(url, headers=HEADERS)
            html = res.content.decode("utf-8")
            soup = BeautifulSoup(html, "lxml")
            num = int(soup.select(".count-text span")[0].get_text())
            # 20 books per page; a partially filled last page still counts.
            if num % 20 == 0:
                page = num // 20
            else:
                page = (num // 20) + 1
            return page

        def start_requests(self):
            page_num = self.get_page_num()
            for i in range(1, page_num + 1):
                url = "https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=" + str(i)
                yield scrapy.Request(url, callback=self.parse, headers=HEADERS)

        def parse(self, response):
            li_list = response.css(".book-img-text li")
            for li in li_list:
                item = QidianItem()
                item["title"] = li.css(".book-mid-info h4 a::text")[0].extract()
                item["url"] = "https:" + li.css(".book-mid-info h4 a::attr(href)")[0].extract()
                item["author"] = li.css(".book-mid-info .author a")[0].xpath("./text()")[0].extract()
                category = ""
                a_list = li.css(".book-mid-info .author a")[1:]
                for a in a_list:
                    a_text = a.css("a::text")[0].extract()
                    category += a_text
                    category += " "
                item["category"] = category.strip()
                item["status"] = li.css(".book-mid-info .author span::text")[0].extract()
                # Hand the half-filled item to the detail page callback via meta.
                yield scrapy.Request(item["url"], meta={"meta": item},
                                     callback=self.parse_detail,
                                     headers=HEADERS)

        def parse_detail(self, response):
            # Unpack the item, add the field that only exists on the detail page.
            item = response.meta["meta"]
            item["bref"] = "".join("".join(response.css(".book-intro p")[0].xpath(".//text()").extract()).split())
            yield item
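Two small calculations in this spider are easy to check in isolation. The page count needs a ceiling division, because a partially filled last page is still a page; and the "https:" prefix added to each href suggests the links are protocol-relative (they start with //), which urllib.parse.urljoin can resolve against the page URL. The sketch below uses hypothetical function names and a made-up book path; 20 matches pageSize=20 in the list URL. Inside a Scrapy callback, response.urljoin(href) performs the same resolution relative to the current page.

```python
from urllib.parse import urljoin


def page_count(total_items, page_size=20):
    """Number of pages needed to show total_items, rounding up."""
    return (total_items + page_size - 1) // page_size


def absolute_url(href, page_url="https://www.qidian.com/all"):
    """Resolve a relative or protocol-relative href against the page URL."""
    return urljoin(page_url, href)


print(page_count(40))   # 2
print(page_count(41))   # 3
print(absolute_url("//book.qidian.com/info/123"))  # https://book.qidian.com/info/123
```

Unlike plain string concatenation, urljoin also leaves an href that is already absolute unchanged, so the same call works if the site's link format changes.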
Original article: https://www.cnblogs.com/waws1314/p/12444080.html