Understanding How Web Crawlers Work

Posted: 2019-04-02 10:48:15

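1. Briefly explain how a crawler works

Crawler: an automated program that requests websites and extracts data from them.

Encyclopedia definition: a web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.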

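2. Understand the crawler development process

1). Briefly explain how a browser works;

Approach 1: the browser submits a request ---> downloads the page source ---> parses it into a rendered page

Approach 2: simulate a browser sending a request (to obtain the page source) ---> extract the useful data ---> store it in a database or file

A crawler follows approach 2.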

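Step 1. Send a request

Use an HTTP library to send a request to the target site, that is, send a Request.

A Request contains: request headers, a request body, and so on.

Limitation of the requests module: it cannot execute JavaScript or apply CSS, so it only sees the source the server returns, not the page as a browser would render it.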
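To make the parts of a Request concrete, here is a minimal sketch that builds two requests without sending them. It uses the standard library's urllib.request instead of the requests library used later in this post, purely so that it runs without any third-party install. The User-Agent string and the form field are made-up values.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URL taken from the post; the header and form values are illustrative only.
url = 'http://news.gzcc.cn/html/2019/tongzhigonggao_0321/11036.html'

# A GET request: method, URL, and request headers, with no body.
get_req = Request(url, headers={'User-Agent': 'demo-crawler/0.1'})

# A POST request additionally carries a request body (bytes).
body = urlencode({'keyword': 'news'}).encode('utf-8')
post_req = Request(url, data=body, headers={'User-Agent': 'demo-crawler/0.1'})

# Nothing is sent over the network; we only inspect what would be sent.
for req in (get_req, post_req):
    print(req.get_method(), req.full_url)
    print('  headers:', req.header_items())
    print('  body:', req.data)
```

Supplying a body is what turns the default GET into a POST here; the headers travel with both kinds of request.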

Step 2. Get the response content

If the server responds normally, you get a Response.

A Response may contain: HTML, JSON, images, video, and so on.

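Step 3. Parse the content

Parsing HTML: regular expressions (the re module), or third-party parsing libraries such as BeautifulSoup and pyquery.

Parsing JSON: the json module.

Parsing binary data: write it to a file opened in wb mode.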
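The three cases above can be sketched with the standard library alone. The response bodies below are hypothetical stand-ins for what a server might return, and the file name is arbitrary.

```python
import json
import os
import re
import tempfile

# Hypothetical response bodies, not taken from any real site.
html_text = '<html><head><title>Campus News</title></head><body></body></html>'
json_text = '{"title": "Campus News", "clicks": 202}'
image_bytes = b'\x89PNG\r\n\x1a\n' + b'\x00' * 8  # PNG signature plus padding, not a real image

# HTML: a regular expression is enough for one simple field.
match = re.search(r'<title>(.*?)</title>', html_text, re.S)
title = match.group(1) if match else None

# JSON: the json module turns the text into Python dicts and lists.
data = json.loads(json_text)

# Binary: open the file in 'wb' mode so the bytes are written unchanged.
path = os.path.join(tempfile.mkdtemp(), 'picture.png')
with open(path, 'wb') as f:
    f.write(image_bytes)

print(title)
print(data['clicks'])
print(os.path.getsize(path))
```

Regular expressions become fragile on nested or irregular markup, which is where a parsing library such as BeautifulSoup is usually the better fit.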

Step 4. Save the data

Databases (MySQL, MongoDB, Redis)

Files
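As a rough sketch of both storage options, the example below writes the same records to a database and to a file. SQLite (built into Python) stands in for MySQL here so that no server is needed; the table name, column names, and records are all invented for illustration.

```python
import csv
import os
import sqlite3
import tempfile

# Hypothetical records, shaped like what the parsing step might produce.
rows = [
    ('Notice A', '2019-03-21', 12),
    ('Notice B', '2019-03-29', 34),
]

# Database: SQLite is an assumption standing in for MySQL.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE news (title TEXT, published TEXT, clicks INTEGER)')
conn.executemany('INSERT INTO news VALUES (?, ?, ?)', rows)
conn.commit()
saved = conn.execute('SELECT title, clicks FROM news ORDER BY clicks DESC').fetchall()
conn.close()

# File: write the same records as CSV with a header row.
csv_path = os.path.join(tempfile.mkdtemp(), 'news.csv')
with open(csv_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'published', 'clicks'])
    writer.writerows(rows)

print(saved)
print(csv_path)
```

The parameterized INSERT (the ? placeholders) keeps scraped text from being interpreted as SQL, which matters when the data comes from arbitrary web pages.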

2). Use the requests library to fetch site data;

requests.get(url) fetches the HTML of a campus news page.

import requests

url = 'http://news.gzcc.cn/html/2019/tongzhigonggao_0321/11036.html'
response = requests.get(url)  # fetch the page HTML

response.encoding = 'utf-8'  # decode the body as UTF-8
print(response.text)

Result: the HTML source of the page is printed.
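Why set the encoding to utf-8 explicitly? A small offline sketch shows the failure it guards against: the same bytes decoded with the wrong codec do not raise an error, they just turn into garbage. The sample string is made up, and ISO-8859-1 is used as the wrong codec because, as far as I know, it is the fallback requests applies when a text response declares no charset.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text (made-up sample).
raw = '校园新闻'.encode('utf-8')

# Decoding with the wrong codec silently produces unreadable characters.
wrong = raw.decode('iso-8859-1')

# Decoding with the right codec recovers the original text.
right = raw.decode('utf-8')

print(repr(wrong))
print(right)
```

If a fetched page prints as a run of accented Latin characters instead of Chinese, a mismatched encoding is a likely cause.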

3). Get to know a web page

Write a simple HTML file containing several tags, classes, and ids:

<html>
<head>
<meta charset="utf-8">
</head>
<body>
<h1 id="title">Title</h1>
<p class="time">Published: 2019</p>
<a href="http://www.runoob.com">Click: Baidu it, and you'll know</a>
<br>
<label for="inputname" class="col-sm-2 control-label">Name</label>
<input type="text" class="form-control" id="inputname" value="huang">
</body>
</html>
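To see which tags, classes, and ids such a page actually contains, here is a small inventory sketch. It uses the standard library's html.parser rather than Beautiful Soup, and AttrCollector is a hypothetical helper written for this example. The html_sample string is a shortened version of the page above.

```python
from html.parser import HTMLParser

# A shortened version of the sample page above.
html_sample = '''
<html>
<head><meta charset="utf-8"></head>
<body>
<h1 id="title">Title</h1>
<p class="time">Published: 2019</p>
<a href="http://www.runoob.com">link</a>
<input type="text" class="form-control" id="inputname" value="huang">
</body>
</html>
'''

class AttrCollector(HTMLParser):
    """Record every tag name, class name, and id seen in the document."""

    def __init__(self):
        super().__init__()
        self.tags, self.classes, self.ids = [], [], []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            if name == 'class' and value:
                # A class attribute may hold several space-separated names.
                self.classes.extend(value.split())
            elif name == 'id' and value:
                self.ids.append(value)

collector = AttrCollector()
collector.feed(html_sample)
print(collector.tags)
print(collector.classes)
print(collector.ids)
```

Tags, classes, and ids are exactly the three hooks that CSS selectors use to locate elements.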

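4). Use Beautiful Soup to parse a page;

Use BeautifulSoup(html_sample, 'html.parser') to parse the HTML file above into a DOM tree.

Use select (CSS selectors) to locate data:

Find HTML elements with a given tag

Find HTML elements with a given class name

Find HTML elements with a given id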

import requests
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/2019/xibusudi_0329/11097.html'
news = requests.get(url)
news.encoding = 'utf-8'
newSoup = BeautifulSoup(news.text, 'html.parser')

# Find HTML elements with a given tag
newSpan = newSoup.select('span')
print('HTML elements with the span tag:')
print(newSpan)

# Find HTML elements with a given class name
newInfo = newSoup.select('.show-info')
print('HTML elements with class=show-info:')
print(newInfo)

# Find the HTML element with a given id and take its text
newContent = newSoup.select('#content')[0].text
print('Text of the HTML element with id=content:')
print(newContent)
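The example above depends on a live page. As an offline sketch, the same three kinds of selector can be tried on an inline string; html_sample is the name used in the text, and its content here is a shortened stand-in for the sample page from the previous section.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample page written earlier.
html_sample = '''
<html>
<head><meta charset="utf-8"></head>
<body>
<h1 id="title">Title</h1>
<p class="time">Published: 2019</p>
<a href="http://www.runoob.com">link</a>
</body>
</html>
'''

soup = BeautifulSoup(html_sample, 'html.parser')

# Tag selector: every <a> element.
links = soup.select('a')
# Class selector: elements whose class list contains "time".
times = soup.select('.time')
# Id selector: the element with id="title".
titles = soup.select('#title')

print(links[0]['href'])
print(times[0].text)
print(titles[0].text)
```

select always returns a list, even for an id, so indexing with [0] on a selector that matches nothing raises IndexError; checking the list length first is the safer habit.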

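3. Extract the title, publication time, publishing unit, author, click count, content, and other information from a campus news article

For example, url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'

The publication time must be a datetime, the click count must be numeric, and everything else must be a string.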

import requests
from bs4 import BeautifulSoup

# Fetch data from a specific page
url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'
res = requests.get(url)
print(type(res))
res.encoding = 'utf-8'
soup1 = BeautifulSoup(res.text, 'html.parser')

# Get the news title
print(soup1.select('title'))

# Get the publication time and publishing unit
print(soup1.select('.show-info'))

# Iterate over the list items that contain a news title
for news in soup1.select('li'):
    if len(news.select('.news-list-title')) > 0:
        t = news.select('.news-list-title')[0].text  # title
        a = news.select('a')[0]['href']              # link
        d = news.select('.news-list-info')[0].text   # info line
        print(t)
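The type requirements stated for this task (publication time as datetime, click count as a number) are not covered by the code above. A minimal sketch of the conversion follows. The info string is a hypothetical stand-in for the text of a .show-info element, the real page's labels and layout may differ, and parse_info is a helper invented for this example.

```python
import re
from datetime import datetime

# Hypothetical info line; the labels and values are assumptions, not the real page.
info = 'Published: 2019-03-20 10:30:00  Author: Zhang San  Source: News Office  Clicks: 202'

def parse_info(text):
    """Return (published, clicks): a datetime and an int, or None where missing."""
    time_match = re.search(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', text)
    clicks_match = re.search(r'Clicks:\s*(\d+)', text)
    published = (
        datetime.strptime(time_match.group(1), '%Y-%m-%d %H:%M:%S')
        if time_match else None
    )
    clicks = int(clicks_match.group(1)) if clicks_match else None
    return published, clicks

published, clicks = parse_info(info)
print(type(published).__name__, published)
print(type(clicks).__name__, clicks)
```

Returning None for a missing field, rather than raising, lets a crawler keep going when one article lacks a value; whether that is the right trade-off depends on how the data is used afterwards.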

Original article: https://www.cnblogs.com/Tily/p/10640934.html
