Scrapy is a great crawler framework: it makes scraping websites simple and fast, and it is especially well suited to sites without front-end/back-end separation, where the data is rendered directly into the HTML. This post records its basic usage by crawling the problem IDs and titles of HDU OJ (http://acm.hdu.edu.cn).
Install it with pip:
pip install scrapy
Create a project named myspider:
scrapy startproject myspider
Create a spider named hdu for the site acm.hdu.edu.cn:
scrapy genspider hdu acm.hdu.edu.cn
Running the command above creates an hdu.py under the spiders folder. Change its code to:
```python
import scrapy


class HduSpider(scrapy.Spider):
    # Spider name
    name = 'hdu'
    # Domains the spider is allowed to crawl
    allowed_domains = ['acm.hdu.edu.cn']
    # Page the spider starts from
    start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']

    # Crawling logic
    def parse(self, response):
        # The problem list is written inside the page's second <script> tag,
        # so first extract the text of every script into problem_list
        problem_list = response.xpath('//script/text()').extract()
        # Take the problem list (the second entry, index 1) and split it on semicolons
        problems = problem_list[1].split(";")
        # Print each entry to the console; nothing is handed to a pipeline yet
        for item in problems:
            print(item)
```
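Nothing else has to be wired up to try this: `scrapy crawl hdu` is the standard command for running a spider by its name from the project root, and at this point it should simply print the raw script entries to the console:

scrapy crawl hdu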
Create a matching item class for problems in items.py:
```python
import scrapy


class ProblemItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
```
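A Scrapy Item behaves like a dict: fields declared with scrapy.Field() are assigned and read with subscript syntax. A minimal standalone sketch of how ProblemItem gets used:

```python
from myspider.items import ProblemItem

item = ProblemItem()
item['id'] = '1000'
item['title'] = 'A + B Problem'
print(dict(item))  # -> {'id': '1000', 'title': 'A + B Problem'}
```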
Create a data pipeline in pipelines.py to save the data into an hdu.json file:
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json


class ItcastPipeline(object):
    def __init__(self):
        self.filename = open("teacher.json", "wb+")

    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontext.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()


class HduPipeline(object):
    full_json = ''

    def __init__(self):
        self.filename = open("hdu.json", "wb+")
        self.filename.write("[".encode("utf-8"))

    def process_item(self, item, spider):
        json_text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.full_json += json_text
        return item

    def close_spider(self, spider):
        self.filename.write(self.full_json.encode("utf-8"))
        self.filename.write("]".encode("utf-8"))
        self.filename.close()
```
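One caveat: HduPipeline buffers everything in full_json and leaves a trailing comma before the closing ], so hdu.json is not strictly valid JSON. If that matters, Scrapy's built-in JsonItemExporter writes a well-formed array; here is a minimal alternative sketch (HduExportPipeline is a name made up for illustration, not part of the project above):

```python
from scrapy.exporters import JsonItemExporter


class HduExportPipeline(object):
    def open_spider(self, spider):
        # the exporter wants a binary file handle
        self.file = open("hdu.json", "wb")
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()    # writes the opening [

    def process_item(self, item, spider):
        self.exporter.export_item(item)    # commas are handled for you
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()   # writes the closing ]
        self.file.close()
```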
Configure the pipeline in settings.py:
```python
ITEM_PIPELINES = {
    'myspider.pipelines.HduPipeline': 300
}
# Ignore the site's robots.txt gentlemen's agreement
ROBOTSTXT_OBEY = False
```
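As an aside, for a one-off dump like this the custom pipeline can be skipped entirely: Scrapy's built-in feed exports serialize whatever the spider yields straight to a file, e.g.

scrapy crawl hdu -o hdu.json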
Modify hdu.py so that items are handed over to the pipeline:
```python
# -*- coding: utf-8 -*-
import scrapy
import re
from myspider.items import ProblemItem


class HduSpider(scrapy.Spider):
    name = 'hdu'
    allowed_domains = ['acm.hdu.edu.cn']
    start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']

    def parse(self, response):
        hdu = ProblemItem()
        problem_list = response.xpath('//script/text()').extract()
        problems = problem_list[1].split(";")
        for item in problems:
            # print(item)
            # Grab everything between the parentheses of each entry
            p = re.compile(r'[(](.*)[)]', re.S)
            str1 = re.findall(p, item)[0]
            # print(str1)
            detail = str1.split(",")
            hdu['id'] = detail[1]
            hdu['title'] = detail[3]
            yield hdu
```
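To make the string surgery concrete, here is the same regex-and-split logic run on a standalone sample. The entry format is an assumption inferred from the output below (id at index 1, quoted title at index 3), not the literal HDU markup:

```python
import re

# hypothetical sample entry, shaped like the script data the spider splits on ";"
item = 'p(1,1000,0,"A + B Problem",0,0)'

p = re.compile(r'[(](.*)[)]', re.S)
str1 = re.findall(p, item)[0]  # everything inside the parentheses
detail = str1.split(",")
print(detail[1])  # 1000
print(detail[3])  # "A + B Problem" (the embedded quotes survive into hdu.json)
```

Splitting on commas also means a title that itself contains a comma would come out truncated; a stricter parser would be needed for those.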
Run the crawl, writing the log to all.log:
scrapy crawl hdu -s LOG_FILE=all.log
The problem titles of the first page show up in hdu.json:
```
{"id": "1000", "title": "\"A + B Problem\""}
{"id": "1001", "title": "\"Sum Problem\""}
{"id": "1002", "title": "\"A + B Problem II\""}
{"id": "1003", "title": "\"Max Sum\""}
{"id": "1004", "title": "\"Let the Balloon Rise\""}
{"id": "1005", "title": "\"Number Sequence\""}
...
{"id": "1099", "title": "\"Lottery \""}
```
Modify hdu.py again so that it can crawl the content of every valid page:
```python
# -*- coding: utf-8 -*-
import scrapy
import re
from myspider.items import ProblemItem


class HduSpider(scrapy.Spider):
    name = 'hdu'
    allowed_domains = ['acm.hdu.edu.cn']
    # download_delay = 1
    base_url = 'http://acm.hdu.edu.cn/listproblem.php?vol=%s'
    start_urls = ['http://acm.hdu.edu.cn/listproblem.php']

    # Spider entry point
    def parse(self, response):
        # First collect every valid page number from the pager links
        real_pages = response.xpath('//p[@class="footer_link"]/font/a/text()').extract()
        for page in real_pages:
            url = self.base_url % page
            yield scrapy.Request(url, callback=self.parse_problem)

    def parse_problem(self, response):
        # Pull the useful fields out of the script text
        hdu = ProblemItem()
        problem_list = response.xpath('//script/text()').extract()
        problems = problem_list[1].split(";")
        for item in problems:
            # HDU has blank placeholder entries (e.g. the empty string left after
            # the final semicolon); stop at the first one
            if item.isspace() or len(item) == 0:
                return
            p = re.compile(r'[(](.*)[)]', re.S)
            str1 = re.findall(p, item)
            detail = str1[0].split(",")
            hdu['id'] = detail[1]
            hdu['title'] = detail[3]
            yield hdu
```
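If the pager XPath ever needs debugging, scrapy shell lets you try selectors against the live page interactively:

```
scrapy shell 'http://acm.hdu.edu.cn/listproblem.php'
>>> response.xpath('//p[@class="footer_link"]/font/a/text()').extract()
```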
Run the crawl again, writing the log to all.log:
scrapy crawl hdu -s LOG_FILE=all.log
Now the spider picks up every problem title from every page. Note, though, that the scraped items are not in sequential order: Scrapy issues requests concurrently, so the order in which responses come back (among other factors) determines the order of the output:
[{"id": "4400", "title": "\"Mines\""},
{"id": "4401", "title": "\"Battery\""},
{"id": "4402", "title": "\"Magic Board\""},
{"id": "4403", "title": "\"A very hard Aoshu problem\""},
{"id": "4404", "title": "\"Worms\""},
{"id": "4405", "title": "\"Aeroplane chess\""},
{"id": "4406", "title": "\"GPA\""},
{"id": "4407", "title": "\"Sum\""},
...
{"id": "1099", "title": "\"Lottery \""},
]
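If ascending order matters, the easiest fix is to sort after the crawl. A minimal sketch, assuming hdu.json has first been made valid JSON (the trailing comma before ] left by HduPipeline removed):

```python
import json

with open("hdu.json", encoding="utf-8") as f:
    problems = json.load(f)

# ids were stored as strings, so compare them numerically
problems.sort(key=lambda x: int(x["id"]))
print(problems[0])  # -> {'id': '1000', 'title': '"A + B Problem"'}
```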
So far the results only go to a text file; loading them into a database will come later and is skipped in this write-up.
Original article (Chinese): https://www.cnblogs.com/axiangcoding/p/12096894.html