### Web Scraping in Python: the requests Module
###### What is the requests module?
requests is a third-party Python module for making network requests; its main job is to simulate a browser issuing requests. It is powerful yet concise and efficient, and it dominates the web-scraping field.
###### Why use the requests module?
Because the urllib module is inconvenient in several ways: you must URL-encode parameters by hand, process POST parameters by hand, and cookie and proxy handling is tedious. The requests module URL-encodes and packs POST parameters automatically, and simplifies cookie and proxy operations. <!-- more -->
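For contrast, this is the manual encoding step that urllib requires and that requests performs for you when you pass a `params` dict (a minimal standard-library sketch; the keyword is just an example):

```python
from urllib.parse import urlencode

# build the query string by hand, as urllib requires
query = urlencode({'wd': '爬虫', 'ie': 'utf-8'})
url = 'https://www.baidu.com/s?' + query
print(url)  # non-ASCII values are percent-encoded automatically
```

With requests you would instead pass `params={'wd': '爬虫', 'ie': 'utf-8'}` and let the library build the same URL.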
###### How to use the requests module
Installation:
```shell
pip install requests
```
Usage flow:
specify the URL ----> issue the request with the requests module ----> extract the data from the response object ----> persist it
1. GET requests with requests:
```python
# a simple requests GET request, progressively adding:
# requests.get + headers
# requests.get + headers + params
# requests.get + headers + params + proxy
import requests

url = '...'
headers = {
    'User-Agent': '...'
}
params = {
    'key': 'value'
}
proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8899'
}
res = requests.get(url=url, headers=headers, params=params, proxies=proxies)
```
# Proxy anonymity levels:
Transparent proxy: the target server sees both the proxy IP and your real IP.
Anonymous proxy: the server can tell a proxy is in use but cannot see your real IP.
High-anonymity (elite) proxy: the server sees neither your real IP nor any sign that a proxy is in use.
# Anti-scraping mechanism vs. counter-strategy, case 1:
Anti-scraping mechanism: User-Agent (UA) detection
Counter-strategy: UA spoofing
2. The response data
```python
# getting the response body:
res.text       # the HTML text
res.content    # the raw bytes
res.json()     # JSON-decoded data

# response attributes:
res_code = res.status_code    # response status code (*)
res_headers = res.headers     # response headers
res_url = res.url             # the URL this response corresponds to
res_cookie = res.cookies      # the response cookies (*)
res_history = res.history     # the request history
```
3. Other commonly used features:
1. file upload with requests
2. session persistence: the Session object
3. timeouts: timeout; if no response arrives within the given number of seconds, an exception is raised
4. prepared requests: build Request objects that can be put in a queue for crawl scheduling

1. Uploading a file with requests
```python
files = {'file': open('filename', 'rb')}
res = requests.post(url=url, files=files)
```
2. Session persistence with a Session object
```python
from requests import Session

session = Session()
res = session.get(url=url, headers=headers)  # use the session so cookies persist across requests
```
3. Setting a timeout: if no response arrives within 5 seconds, an exception is raised
```python
res = requests.get(url=url, headers=headers, timeout=5)
```
4. Prepared Request: build a Request object that can be queued for crawl scheduling
```python
from requests import Request, Session

url = '....'
data = {
    'wd': 'spiderman'
}
headers = {
    'User-Agent': '...'
}
# 1. instantiate a Session object
session = Session()
# 2. build a Request object with the necessary arguments
req = Request('POST', url, data=data, headers=headers)
# for a GET request it would be: Request('GET', url, params=params, headers=headers)
# 3. convert the Request into a PreparedRequest via session.prepare_request
prepared = session.prepare_request(req)
# 4. send it with session.send
res = session.send(prepared)
```
# The XPath parsing library:
Regular expressions can be used for data extraction, but precise matching with a regex is difficult, and one mistake in the pattern corrupts the matched data.
A web page consists of three parts: HTML, CSS, and JavaScript. HTML tags form a hierarchy, the DOM tree, so when extracting target data you can locate a tag through the page's hierarchy and then read the tag's text or attributes.
# How xpath parsing works:
1. locate the node tags via the page's DOM tree
2. read the located tags' body text and attribute values
# Installing xpath and a first example:
1. install: pip install lxml
2. scraping the trending titles from qiushibaike with requests:
```python
import requests
from lxml import etree

url = 'https://www.qiushibaike.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
tree = etree.HTML(res.text)
title_lst = tree.xpath('//ul/li/div/a/text()')
for item in title_lst:
    print(item)
```
3. XPath usage steps:
```python
from lxml import etree

# parse an HTML string:
tree = etree.HTML(res.text)
# or parse a local HTML file (for reference; a full example follows below):
tree = etree.parse('page.html', etree.HTMLParser())
tag_or_attr = tree.xpath('some xpath expression')
```
**********************************************************
# Parsing a local file with xpath
```python
import requests
from lxml import etree

url = 'https://www.qiushibaike.com/'
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
# save the page locally, then parse the file
with open('qb.html', 'w', encoding='utf-8') as f:
    f.write(res.text)
tree = etree.parse('./qb.html', etree.HTMLParser())
title_lst = tree.xpath('//ul/li/div/a/text()')
for item in title_lst:
    print(item)
```
# XPath syntax:
1. Common rules:
    1. nodename: locate by node name
    2. //: select descendant nodes of the current node
    3. /: select direct children of the current node
    4. nodename[@attribute="..."]: locate tags by attribute
    5. @attributename: get an attribute value
    6. text(): get text content
2. Attribute matching, two cases: multi-attribute matching & matching one attribute with multiple values
    2.1 multi-attribute matching
        example: tree.xpath('//div[@class="item" and @name="test"]/text()')
    2.2 matching one attribute with multiple values
        example: tree.xpath('//div[contains(@class, "dc")]/text()')
3. Positional selection:
    3.1 index-based: indices start at 1
    3.2 the last() function
    3.3 the position() function
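To illustrate positional selection without any dependencies, the standard library's xml.etree.ElementTree supports a small XPath subset, including numeric indices and last() (position() needs lxml):

```python
import xml.etree.ElementTree as ET

# a tiny assumed document, just for illustration
root = ET.fromstring('<ul><li>a</li><li>b</li><li>c</li></ul>')

print(root.find('li[1]').text)       # indices start at 1: 'a'
print(root.find('li[last()]').text)  # the last li: 'c'
print([li.text for li in root.findall('li')])  # all li texts in document order
```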
BeautifulSoup is another parsing library.
BS depends on a parser to do the parsing; it supports html.parser, lxml, xml, html5lib, and others. The lxml parser is fast and fault-tolerant, so most BS use nowadays is with lxml.
# BeautifulSoup usage steps:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'lxml')
tag = soup.select("some CSS selector expression")  # returns a list
```
# CSS selectors:
1. locate tags by node name and hierarchy: tag selectors & hierarchy selectors
    soup.select('title')
    soup.select('div > ul > li')  # direct-child (single-level) selector
    soup.select('div li')         # descendant (multi-level) selector
2. locate tags by class attribute: the class selector
    soup.select('.panel')
3. locate tags by id attribute: the id selector
    soup.select('#item')
4. nested selection:
    ul_list = soup.select('ul')
    for ul in ul_list:
        print(ul.select('li'))
# Getting a node's text or attributes:
tag_obj.string: gets the direct text only; if the node has child tags alongside that text, this returns None
tag_obj.get_text(): gets all text of the node's descendants
tag_obj['attribute']: gets a node attribute
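The difference between .string and get_text() is easy to trip over; a minimal sketch (using the built-in html.parser so no lxml is needed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>hello<span>world</span></div>', 'html.parser')
div = soup.div

print(div.string)       # None: the div has a child tag next to its text
print(div.get_text())   # 'helloworld': all descendant text concatenated
print(div.span.string)  # 'world': the span has exactly one text child
```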
# Practice example:
```python
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>BeautifulSoup practice</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">first li tag</li>
            <li class="element">second li tag</li>
            <li class="element">third li tag</li>
        </ul>
        <ul class="list list-small">
            <li class="element">one</li>
            <li class="element">two</li>
        </ul>
        <li class="element">tests the descendant selector</li>
    </div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# 1. locate a node by name and get its text
h4 = soup.select('h4')  # tag selector
print(h4[0].get_text())
# 2. locate a node by class attribute
panel = soup.select('.panel-heading')
print(panel)
# 3. locate a node by id attribute
ul = soup.select('#list-1')
print(ul)
# 4. nested selection
ul_list = soup.select('ul')
for ul in ul_list:
    li = ul.select('li')
    print(li)
# 5. direct-child vs descendant selectors
li_list_single = soup.select(".panel-body > ul > li")
li_list_multi = soup.select(".panel-body li")
```
Task: scrape the result page of a Sogou search for a user-specified term
```python
import requests

# get the search keyword
word = input('enter a word you want to search:')
# custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
# specify the url
url = 'https://www.sogou.com/web'
# pack the GET request parameters
params = {
    'query': word,
    'ie': 'utf-8'
}
# issue the request (pass the headers so the UA is spoofed)
response = requests.get(url=url, params=params, headers=headers)
# get the response data
page_text = response.text
with open('./sougou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
```
User-Agent: identifies the carrier of a request. For a request issued by a browser, the User-Agent identifies that browser. A server can inspect this value to tell whether a request comes from a real browser or from a crawler.
Anti-scraping mechanism: some portals capture and inspect the User-Agent of incoming requests and refuse to serve data when the UA belongs to a crawler.
Counter-strategy: disguise the crawler's UA as that of a browser.
Task: a GET request with parameters
```python
# the original Baidu URL: https://www.baidu.com/s?wd=python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}
url = 'https://www.baidu.com/s'
data = {'wd': 'Scrapy'}
# no manual URL encoding needed; the params are encoded and appended automatically
res = requests.get(url=url, headers=headers, params=data)
print(res.url)
```
Example: scraping images from the xiaohua site
```python
import requests
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
# page URL pattern:
# http://www.xiaohuar.com/list-1-3.html
# http://www.xiaohuar.com/list-1-5.html
url = 'http://www.xueshengmai.com/list-1-%d.html'
for i in range(1, 5):  # pages 1 through 4; adjust the range as needed
    temp = (url % i)
    # fetch the page source
    response = requests.get(url=temp, headers=headers)
    tree = etree.HTML(response.text)
    pc = tree.xpath('//div[@class="item_t"]/div[1]/a/img/@src')     # image urls
    name = tree.xpath('//div[@class="item_t"]/div[1]/span/text()')  # names
    for x in pc:
        pc_url = 'http://www.xueshengmai.com' + x  # absolute image url
        res = requests.get(url=pc_url, headers=headers)
        img_data = res.content
        girl = pc_url.split('/')[-1]
        with open('%s' % girl, 'wb') as f:
            f.write(img_data)
```
Example: scraping videos from pearvideo
```python
import requests
import random
import re
from lxml import etree
from fake_useragent import UserAgent  # install with: pip install fake_useragent

url = "https://www.pearvideo.com/category_8"
# pick a random UA for the request headers
ua = UserAgent().random
headers = {
    "User-Agent": ua
}
# fetch the home page
page_text = requests.get(url=url, headers=headers).text
# parse the detail-page links of the listed videos
tree = etree.HTML(page_text)
li_list = tree.xpath("//div[@id='listvideoList']/ul/li")
# collect the detail urls
detail_urls = []  # type: list
for li in li_list:
    detail_url = "http://www.pearvideo.com/" + li.xpath("./div/a/@href")[0]  # xpath returns a list
    title = li.xpath("./div/a/div[@class='vervideo-title']/text()")[0]
    detail_urls.append(detail_url)
# issue a GET request for every detail url
for url in detail_urls:
    page_text = requests.get(url=url, headers=headers).text
    video_url = re.findall('srcUrl="(.*?)"', page_text, re.S)[0]
    # request the video itself
    data = requests.get(url=video_url, headers=headers).content
    fileName = str(random.randint(1, 10000)) + '.mp4'  # random video file name
    # save to disk
    with open(fileName, "wb") as fp:
        fp.write(data)
        print(fileName + " downloaded!")
```
Example: scraping the "candid shots" photo gallery
```python
import requests
from lxml import etree

url = 'http://jandan.net/ooxx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
tree = etree.HTML(res.text)
td_list = tree.xpath('//div[@class="row"]/div[2]')
for td in td_list:
    img_src = 'http:' + td.xpath('./p/img/@src')[0]  # the src is protocol-relative
    img_data = requests.get(url=img_src).content     # image bytes
    img_name = img_src.split('/')[-1]
    with open('%s' % img_name, 'wb') as fp:
        fp.write(img_data)
```
selenium is a web automation-testing framework. It lets you control a browser from code: open pages, click elements, simulate scrolling, and so on. It supports many browsers, such as Chrome and Firefox, including headless ones.
Purpose:
When scraping, you often meet dynamically loaded data. There are two common kinds: data loaded via ajax requests, and data rendered by JavaScript code. selenium can drive a real browser the way a person would and hand you the fully loaded page.
ajax:
    if the url follows a pattern and is not encrypted, build the url and request it directly
    if the url is encrypted and the pattern cannot be cracked -----> selenium
js-rendered dynamic data -----> selenium
Three required pieces: a browser, its driver, and the selenium framework.
Browser: Chrome (a standard stable release) is recommended.
```shell
pip install selenium
```
```python
# a quick test
from selenium import webdriver

browser = webdriver.Chrome('./chromedriver')  # put the driver next to the script
browser.get('https://www.baidu.com')          # include the scheme in the url
```
```python
# instantiate a browser object
from selenium import webdriver

browser = webdriver.Chrome('driverpath')
# issue a GET request
browser.get('https://www.baidu.com')
```
# Locating page elements:
find_element_by_id: by the element's id
find_element_by_name: by the element's name
find_element_by_xpath: by an xpath expression
find_element_by_class_name: by class value
find_element_by_css_selector: by CSS selector
# Interacting with nodes:
click(): click an element
send_keys(): type text into an element
clear(): clear an input
execute_script(js): run the given JavaScript code
# JS snippet: window.scrollTo(0, document.body.scrollHeight) scrolls to the bottom of the page
quit(): quit the browser
# Getting the page data:
browser.page_source ---> str
# frames
switch_to.frame('frameid')
Scraping Baidu Images with selenium
```python
import time
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib import request  # only needed for the urlretrieve alternative below

# 1. instantiate the browser object
browser = webdriver.Chrome('./chromedriver.exe')
# 2. request the page
browser.get('http://image.baidu.com/')
time.sleep(2)
# 3. type the keyword
input_tag = browser.find_element_by_id('kw')
input_tag.send_keys('腰子姐')
time.sleep(2)
# 4. click search
button = browser.find_element_by_class_name('s_search')
button.click()
time.sleep(2)
# 5. scroll down to trigger lazy loading
for i in range(3):
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(3)
text = browser.page_source
# 6. parse the rendered page
soup = BeautifulSoup(text, 'lxml')
li_list = soup.select('.imgpage ul li')
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
for li in li_list:
    href = li['data-objurl']
    # request.urlretrieve(href, '%s.jpg' % li_list.index(li))  # urllib alternative
    res = requests.get(url=href, headers=headers)
    # res.text: text data; res.json(): json data; res.content: raw bytes
    with open('%s.jpg' % li_list.index(li), 'wb') as f:
        f.write(res.content)
browser.quit()
```
Simulated QQ-zone login
```python
from selenium import webdriver
import time

# instantiate the browser object
# (make sure the chromedriver version matches your browser version)
browser = webdriver.Chrome(r'C:\Users\lenovo\Desktop\sp\chromedriver.exe')
# open the QQ-zone login page
browser.get('https://qzone.qq.com/')
time.sleep(1)
# switch into the login frame subpage
browser.switch_to.frame('login_frame')
# find the password-login option and click it
a_tag = browser.find_element_by_id('switcher_plogin')
a_tag.click()
time.sleep(1)
# find the account input and type the account
browser.find_element_by_id('u').clear()
user = browser.find_element_by_id('u')
user.send_keys('account')
time.sleep(1)
# find the password input and type the password
browser.find_element_by_id('p').clear()
pwd = browser.find_element_by_id('p')
pwd.send_keys('password')
time.sleep(1)
# find the login button and click it
button = browser.find_element_by_id('login_button')
button.click()
```
# 1. Before installing scrapy, install its dependencies, then scrapy itself:
(1). install lxml: pip install lxml
(2). install wheel: pip install wheel
(3). install twisted: pip install <path to the twisted wheel file>
(twisted must be downloaded and installed locally; download it from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and pick the build that matches your Python version)
(4). install pywin32: pip install pywin32
(make sure every step above succeeds without errors before continuing)
(5). install scrapy: pip install scrapy
(6). verify: type scrapy at the command line; output like "Scrapy 1.6.0 - no active project" means the install succeeded
2. Creating a project
1. manually create a directory, e.g. test
2. inside it, create the project: scrapy startproject spiderpro
3. enter the project directory: cd spiderpro
4. create a spider file: scrapy genspider <spider-name> <domain>
3. Project layout
```
spiderpro
    spiderpro            # project package
        __init__.py
        spiders          # spider directory
            __init__.py
            tests.py     # a spider file
        items.py         # data structures the scraped data is persisted as
        middlewares.py   # middleware definitions
        pipelines.py     # pipelines: persistence logic
        settings.py      # configuration
    venv                 # virtual-environment directory
    scrapy.cfg           # the scrapy project config file
```
Notes:
(1). spiders: contains the Spider implementations, one file per Spider
(2). items.py: defines the Item structure, i.e. which fields the scraped data is stored in
(3). pipelines.py: defines the Item Pipeline implementations
(4). settings.py: the project's global configuration
(5). middlewares.py: defines the middleware, both spider middleware and downloader middleware
(6). scrapy.cfg: the scrapy project config file, holding config paths, deployment info, etc.
4. Architecture and workflow
(1). Architecture:
Scrapy Engine: the engine; it handles communication, signals, and data transfer among the Spiders, Item Pipeline, Downloader, and Scheduler.
Scheduler: accepts requests sent by the engine, orders and enqueues them, and hands them back when the engine asks.
Downloader: downloads every Request the engine sends and returns the resulting Responses to the engine, which passes them to the Spiders.
Spiders: process all Responses, extract the data needed for the Item fields, and submit follow-up URLs to the engine, which sends them back into the Scheduler.
Item Pipeline: processes the Items produced by the Spiders: deduplication, persistence (databases, files; in short, saving the data).
Downloader Middlewares: a component you can extend to customize the download step.
Spider Middlewares: a component for extending and operating on the communication between the engine and the Spiders (the Responses entering the Spiders and the Requests leaving them).
(2). Workflow:
1. a spider sends requests to the engine, which forwards them to the scheduler for scheduling
2. the scheduler hands the next request back to the engine, which passes it to the downloader, traveling through the downloader middleware
3. the downloader issues the request to the server and returns the scraped response to the engine, which returns it to the spider
4. the spider's parse method processes the response, builds items, and returns them to the engine, which passes them to the pipeline
5. the pipeline persists the items
6. the loop repeats until the crawl terminates
5. Scraping qiushibaike with scrapy
```shell
# create the project
scrapy startproject qsbk
cd qsbk                                        # enter the project directory
scrapy genspider qsbk_hot www.qiushibaike.com  # create the spider: qsbk_hot is the spider name,
                                               # www.qiushibaike.com the crawl domain
```
```python
# items.py: define the fields the data is stored in
import scrapy

class QsbkItem(scrapy.Item):
    title = scrapy.Field()    # title
    lau = scrapy.Field()      # "funny" count
    comment = scrapy.Field()  # comment count
    auth = scrapy.Field()     # author
```
```python
# the spider file defines the parsing logic
class QsbkHotSpider(scrapy.Spider):
    name = 'qsbk_hot'
    # allowed_domains = ['www.qiushibaike.com']  # unused, may be commented out
    start_urls = ['http://www.qiushibaike.com/']

    # idea: each hot entry corresponds to one li tag on the page;
    # grab all li tags on a page first, then work on each one
    def parse(self, response):
        li_list = response.selector.xpath('//div[@class="recommend-article"]/ul/li')
        # loop over the li tags: instantiate an item, extract the fields, assign them
        for li in li_list:
            # instantiate the item object
            item = QsbkItem()
            # extract title, lau (funny count), comment (comment count), auth (author)
            title = ....
            lau = ....
            comment = ....
            auth = ....
            # store the values on the item's fields
            # yield the item; the framework routes it to the configured pipeline class
            yield item
```
```python
# pipelines.py: define a pipeline class that stores the data
import pymongo

class QsbkPipeline(object):
    # connect to MongoDB
    conn = pymongo.MongoClient("localhost", 27017)
    db = conn.qiubai
    table = db.qb_hot

    def process_item(self, item, spider):
        # insert the data into the database
        self.table.insert(dict(item))
        # return item so the next pipeline class (if any) receives it
        return item

    def close_spider(self, spider):
        # close the database connection
        self.conn.close()
```
```python
# settings.py: only the entries added or changed for this project, not the full file
# ignore the robots protocol
ROBOTSTXT_OBEY = False
# UA spoofing
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'
# register the pipeline class
ITEM_PIPELINES = {
    'qsbk.pipelines.QsbkPipeline': 300,
}
```
# Task: scrape the image src and name of every entry on the first page of the xiaohua site's university section, and store them in MongoDB via a pipeline
# items.py: define the fields for the parsed data
```python
import scrapy

class XiaohuaspiderItem(scrapy.Item):
    name = scrapy.Field()
    src = scrapy.Field()
```
# the spider defines the crawling behavior and the parsing logic
```python
import scrapy
from ..items import XiaohuaspiderItem

class HuaSpider(scrapy.Spider):
    name = 'hua'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        div_list = response.xpath('//div[@class="img"]')
        for div in div_list:
            item = XiaohuaspiderItem()
            name = ...  # xpath extraction
            src = ...   # xpath extraction
            # store the values on the item's fields
            item[...] = ...
            item[...] = ...
            yield item
```
# pipelines.py: persist the data
```python
import pymongo

class XiaohuaspiderPipeline(object):
    conn = pymongo.MongoClient('localhost', 27017)
    db = conn.xiaohua
    table = db.hua

    def process_item(self, item, spider):
        self.table.insert(dict(item))
        return item

    def close_spider(self, spider):
        self.conn.close()
```
# settings:
```python
# UA spoofing:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'
# ignore the robots protocol:
ROBOTSTXT_OBEY = False
# enable the pipeline class
ITEM_PIPELINES = {
    'xiaohuaspider.pipelines.XiaohuaspiderPipeline': 300,
}
```
# 1. decide which fields to scrape and define them in items.py
# 2. put the target url in start_urls and comment out allowed_domains
# 3. define the parsing rules in parse:
1). parse the response data
2). store it in a temporary container, the item object
3). yield item: submit the item to the pipeline
# 4. implement the database interaction in the pipeline
```python
import pymongo

class RtysPipeline(object):
    def process_item(self, item, spider):
        # note: connecting once per item works but is wasteful;
        # opening the connection once in open_spider would be better
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn.rtys
        table = db.ys
        table.insert_one(dict(item))
        return item
```
# run the project:
scrapy crawl <spider-name>
# In the spider, building on the earlier code, construct the urls of the remaining pages and issue new requests with scrapy.Request; the callback is still parse:
```python
    # class attributes of the spider:
    page = 1
    base_url = 'http://www.xiaohuar.com/list-1-%s.html'

    # at the end of parse():
    if self.page < 4:
        page_url = self.base_url % self.page
        self.page += 1
        yield scrapy.Request(url=page_url, callback=self.parse)
```
# (the other files need no changes)
# Task: scrape joke titles and detail-page links, then follow each link and scrape the joke content from the detail page
# items.py: define the fields to persist
```python
import scrapy

class JokeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
```
# the spider:
```python
# -*- coding: utf-8 -*-
import scrapy
from ..items import JokeItem

class XhSpider(scrapy.Spider):
    name = 'xh'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.jokeji.cn/list.htm']

    def parse(self, response):
        li_list = response.xpath('//div[@class="list_title"]/ul/li')
        for li in li_list:
            title = li.xpath('./b/a/text()').extract_first()
            link = 'http://www.jokeji.cn' + li.xpath('./b/a/@href').extract_first()
            # pass the title to the detail callback via meta
            yield scrapy.Request(url=link, callback=self.detail_parse, meta={"title": title})

    def detail_parse(self, response):
        joke_list = response.xpath('//span[@id="text110"]//text()').extract()
        title = response.meta["title"]
        content = ''
        for s in joke_list:
            content += s
        item = JokeItem()
        item["title"] = title
        item["content"] = content
        yield item
```
# pipelines.py: the concrete persistence logic
```python
import pymongo

class JokePipeline(object):
    conn = pymongo.MongoClient('localhost', 27017)
    db = conn.haha
    table = db.hahatable

    def process_item(self, item, spider):
        self.table.insert(dict(item))
        return item

    def close_spider(self, spider):
        self.conn.close()
```
# settings.py, as before:
UA spoofing (USER_AGENT)
robots protocol (ROBOTSTXT_OBEY)
pipeline registration (ITEM_PIPELINES)
# A POST request with scrapy.FormRequest (Baidu Translate's sug endpoint):
```python
import scrapy
import json

class FySpider(scrapy.Spider):
    name = 'fy'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    def start_requests(self):
        data = {
            'kw': 'boy'
        }
        # FormRequest issues a POST request carrying form data
        yield scrapy.FormRequest(url=self.start_urls[0], callback=self.parse, formdata=data)

    def parse(self, response):
        print(response.text)
        print(json.loads(response.text))
```
selenium can scrape dynamically loaded data.
scrapy by itself cannot: when data is loaded by an ajax request you can call the API endpoint directly, but when it is rendered by js you need to combine scrapy with selenium.
# the spider:
```python
import scrapy
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from ..items import WynewsItem

class NewsSpider(scrapy.Spider):
    name = 'news'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://news.163.com/domestic/']
    # a browser instance shared with the downloader middleware
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    bro = webdriver.Chrome(executable_path=r'C:\Users\Administrator\Desktop\news\wynews\wynews\spiders\chromedriver.exe', options=option)

    def parse(self, response):
        div_list = response.xpath('//div[contains(@class, "data_row")]')
        for div in div_list:
            link = div.xpath('./a/@href').extract_first()
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            yield scrapy.Request(url=link, callback=self.detail_parse, meta={"title": title})

    def detail_parse(self, response):
        content_list = response.xpath('//div[@id="endText"]/p//text()').extract()
        content = ''
        title = response.meta['title']
        for s in content_list:
            content += s
        item = WynewsItem()
        item["title"] = title
        item["content"] = content
        yield item
```
# the middleware: render the start url with selenium and hand scrapy the rendered page
```python
import time
from scrapy.http import HtmlResponse

class WynewsDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        bro = spider.bro
        if request.url in spider.start_urls:
            bro.get(request.url)
            time.sleep(3)
            js = 'window.scrollTo(0, document.body.scrollHeight)'
            bro.execute_script(js)
            time.sleep(3)
            response_selenium = bro.page_source
            return HtmlResponse(url=bro.current_url, body=response_selenium, encoding="utf-8", request=request)
        return response
```
# the pipeline:
```python
import pymongo

class WynewsPipeline(object):
    conn = pymongo.MongoClient('localhost', 27017)
    db = conn.wynews
    table = db.newsinfo

    def process_item(self, item, spider):
        self.table.insert(dict(item))
        return item
```
# MongoDB interaction in a reusable pipeline:
```python
import pymongo

# the pipeline class
class MongoPipeline(object):
    # __init__ initializes the object; __new__ is the constructor that allocates it
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db['news'].insert(dict(item))
        # a project may have several pipeline classes; return item so the next one can store it too
        return item

    def close_spider(self, spider):
        self.client.close()
```
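The from_crawler hook above reads two settings from the crawler; a matching settings.py fragment might look like this (the names MONGO_URI and MONGO_DB are your choice, as long as they match the get() calls):

```python
# settings.py (assumed values for a local MongoDB)
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DB = 'wynews'
```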
# MySQL interaction:
```python
import pymysql

class MysqlPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT')
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(self.host, self.user, self.password, self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        data = dict(item)
        # data.keys() --> all field names, e.g. 'title,content'
        keys = ','.join(data.keys())
        # ['%s'] * len(data) --> ['%s', '%s']; ','.join(...) --> '%s,%s'
        values = ','.join(['%s'] * len(data))
        # tablename must be replaced with your actual table name
        sql = 'insert into %s (%s) values (%s)' % (tablename, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
```
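The string-building in process_item above is worth tracing once by hand; this standalone sketch shows the SQL it produces for a two-field item (the table name 'joke' is just an example):

```python
# a sample item, already converted to a dict
data = {'title': 't1', 'content': 'c1'}

keys = ','.join(data.keys())           # 'title,content'
values = ','.join(['%s'] * len(data))  # '%s,%s'
sql = 'insert into %s (%s) values (%s)' % ('joke', keys, values)

print(sql)                   # insert into joke (title,content) values (%s,%s)
print(tuple(data.values()))  # ('t1', 'c1'): passed to cursor.execute for safe interpolation
```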
# A pipeline class for file downloads
# the spider:
```python
import scrapy
from ..items import XhxhItem

class XhSpider(scrapy.Spider):
    name = 'xh'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.521609.com/qingchunmeinv/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')
        for li in li_list:
            item = XhxhItem()
            link = li.xpath('./a[1]/img/@src').extract_first()
            item['img_link'] = 'http://www.521609.com' + link
            print(item)
            yield item
```
# items.py:
```python
import scrapy

class XhxhItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_link = scrapy.Field()
```
# pipelines.py:
```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class XhxhPipeline(object):
    def process_item(self, item, spider):
        return item

class ImgPipeLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        yield scrapy.Request(url=item['img_link'])

    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        return item
```
# settings.py:
```python
ITEM_PIPELINES = {
    'xhxh.pipelines.XhxhPipeline': 300,
    'xhxh.pipelines.ImgPipeLine': 301,
}
IMAGES_STORE = './mvs'  # directory where the downloaded images are saved
```
# Installation:
```shell
pip install virtualenvwrapper-win
```
# Common commands:
```shell
mkvirtualenv envname      # create a virtualenv and switch into it
workon envname            # switch to a virtualenv
pip list / pip show / pip freeze / pip freeze --all
rmvirtualenv envname      # delete a virtualenv
deactivate                # leave the current virtualenv
lsvirtualenv              # list all created virtualenvs
mkvirtualenv --python=C:\...\python.exe envname  # create with a specific interpreter
# reproducing a project environment
pip freeze > requirements.txt
pip install -r C:\...\requirements.txt
pip uninstall -r C:\...\requirements.txt
```
# The spider:
```python
import scrapy
from dl.items import DlItem

class PSpider(scrapy.Spider):
    name = 'p'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        tr_list = response.xpath('//*[@id="list"]/table/tbody/tr')
        for tr in tr_list:
            ip = tr.xpath('./td[1]/text()').extract_first()
            port = tr.xpath('./td[2]/text()').extract_first()
            typ = tr.xpath('./td[3]/text()').extract_first()
            protocal = tr.xpath('./td[4]/text()').extract_first()
            position = tr.xpath('./td[5]/text()').extract_first()
            item = DlItem()
            item['ip'] = ip
            item['port'] = port
            item['typ'] = typ
            item['protocal'] = protocal
            item['position'] = position
            print(item)
            yield item
```
# items.py:
```python
import scrapy

class DlItem(scrapy.Item):
    ip = scrapy.Field()
    port = scrapy.Field()
    typ = scrapy.Field()
    protocal = scrapy.Field()
    position = scrapy.Field()
```
# Creating and configuring the Django project:
1. Define the model:
```python
# proxy/models.py
from django.db import models

class Proxy(models.Model):
    ip = models.CharField(max_length=50)
    port = models.CharField(max_length=50)
    typ = models.CharField(max_length=50)
    protocal = models.CharField(max_length=50)
    position = models.CharField(max_length=50)
```
2. Embed Django in the scrapy project:
```python
import os
import sys

sys.path.append(os.path.dirname(os.path.abspath('.')))
os.environ['DJANGO_SETTINGS_MODULE'] = 'proxyscan.settings'
# initialize Django manually:
import django
django.setup()
```
3. Change the spider's item:
```python
import scrapy
from scrapy_djangoitem import DjangoItem
from proxy import models

class DlItem(DjangoItem):
    django_model = models.Proxy
```
4. The pipeline:
```python
class DlPipeline(object):
    def process_item(self, item, spider):
        print('opening the database for storage')
        item.save()  # a DjangoItem persists through the Django ORM
        print('closing the database')
        return item
```
5. Migrate the database and configure the admin site
```shell
python manage.py makemigrations
python manage.py migrate
```
```python
# admin.py
from proxy.models import Proxy
admin.site.register(Proxy)
```
# create a superuser:
python manage.py createsuperuser
# urls:
```python
from django.conf.urls import url
from django.contrib import admin
from proxy.views import index

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^index/', index),
]
```
# the view:
```python
from django.shortcuts import render
from proxy.models import Proxy

def index(requests):
    p = Proxy.objects.all()
    return render(requests, 'index.html', {"p": p})
```
# The front-end template:
```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
    <script src="https://cdn.bootcss.com/jquery/3.4.1/jquery.min.js"></script>
    <link href="https://cdn.bootcss.com/twitter-bootstrap/4.3.1/css/bootstrap.min.css" rel="stylesheet">
</head>
<body>
<div class="container">
    <div class="row">
        <div class="col-md-10 col-md-offset-2" style="margin:0 auto">
            <div class="panel panel-primary">
                <div class="panel-heading" style="margin-top:50px">
                    <h3 class="panel-title">Proxy IP Overview</h3>
                </div>
                <div class="panel-body">
                    <table class="table table-striped">
                        <thead>
                        <tr>
                            <th>IP</th>
                            <th>Port</th>
                            <th>Type</th>
                            <th>Protocol</th>
                            <th>Position</th>
                        </tr>
                        </thead>
                        <tbody>
                        {% for i in p %}
                        <tr>
                            <th>{{ i.ip }}</th>
                            <td>{{ i.port }}</td>
                            <td>{{ i.typ }}</td>
                            <td>{{ i.protocal }}</td>
                            <td>{{ i.position }}</td>
                        </tr>
                        {% endfor %}
                        </tbody>
                    </table>
                </div>
            </div>
        </div>
    </div>
</div>
</body>
</html>
```
Original (Chinese) source: https://www.cnblogs.com/djl-0628/p/14598002.html