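- The Scrapy framework
Introduction: a large, full-featured crawling framework.
Installation:
- Windows:
Download the Twisted wheel that matches your Python version: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted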
pip3 install wheel
pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
pip3 install pywin32
pip3 install scrapy
- Linux:
pip3 install scrapy
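On either platform, a quick check from Python can confirm that the packages are importable after installation. This is a minimal sketch using only the standard library; the helper name `is_installed` is illustrative and not part of Scrapy.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # Twisted is the dependency installed by hand on Windows above.
    for name in ("twisted", "scrapy"):
        status = "OK" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If either line reports "missing", rerun the corresponding pip3 command with the same interpreter that runs this script.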
Usage:
Django (for comparison):
# Create a project
django-admin startproject mysite
cd mysite
# Create apps
python manage.py startapp app01
python manage.py startapp app02
# Start the project
python manage.py runserver
Scrapy:
# Create a project
scrapy startproject xdb
cd xdb
# Create spiders
scrapy genspider chouti chouti.com
scrapy genspider cnblogs cnblogs.com
# Start a spider
scrapy crawl chouti
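1. Create a project
scrapy startproject <project_name>
<project_name>/
    <project_name>/
        - spiders/          # spider files
            - chouti.py
            - cnblogs.py
            ....
        - items.py          # persistence (item definitions)
        - pipelines.py      # persistence (item pipelines)
        - middlewares.py    # middlewares
        - settings.py       # configuration (crawler)
    scrapy.cfg              # configuration (deployment)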
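To make the "persistence" role of pipelines.py concrete, here is a minimal sketch of an item pipeline that writes each scraped item to a JSON Lines file. The class name `JsonLinesPipeline` and the output file name `items.jl` are assumptions chosen for illustration; only the `open_spider`, `process_item`, and `close_spider` hooks are Scrapy's own pipeline interface.

```python
import json


class JsonLinesPipeline:
    """Write every scraped item to a JSON Lines file (one object per line)."""

    def __init__(self, path="items.jl"):
        self.path = path  # hypothetical output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields. The item is returned
        # so that any later pipeline can process it as well.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()
```

A pipeline only runs once it is registered in settings.py, for example with `ITEM_PIPELINES = {"xdb.pipelines.JsonLinesPipeline": 300}` for the xdb project created earlier; the number sets the order in which pipelines run, lower values first.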
2. Create spiders
cd <project_name>
scrapy genspider chouti chouti.com
scrapy genspider cnblogs cnblogs.com
3. Start a spider
scrapy crawl chouti
scrapy crawl chouti --nolog    # same crawl, with Scrapy's log output suppressed
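The crawl command can also be launched from a Python script, which is convenient when starting a spider from an IDE. This is a minimal sketch that shells out to the `scrapy` executable through the standard library; the helper names `build_crawl_command` and `run_crawl` are illustrative, and the script is assumed to run from the project directory (the one that contains scrapy.cfg).

```python
import subprocess


def build_crawl_command(spider_name, nolog=False):
    """Build the argument list for `scrapy crawl <spider_name>`."""
    cmd = ["scrapy", "crawl", spider_name]
    if nolog:
        # --nolog suppresses Scrapy's log output.
        cmd.append("--nolog")
    return cmd


def run_crawl(spider_name, nolog=False):
    """Run the crawl in a child process and return its exit code."""
    completed = subprocess.run(build_crawl_command(spider_name, nolog))
    return completed.returncode


# Example call (requires Scrapy and an existing project):
# run_crawl("chouti", nolog=True)
```

Running the crawl in a child process keeps each run isolated, so the script can start several spiders one after another.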
Summary:
- HTML parsing: XPath
- Issuing follow-up requests: yield a Request object
Original article: https://www.cnblogs.com/l-jie-n/p/10017560.html