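1. Getting to know urllib
urllib is part of Python's standard library. It provides functions for tasks such as requesting data from a web server and handling cookies. In Python 2 the corresponding library was urllib2; unlike urllib2, Python 3's urllib is split into several submodules, including urllib.request, urllib.parse, and urllib.error. For usage details, see https://docs.python.org/3/library/urllib.html
The following example fetches a page and prints its raw contents: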
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
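Of the submodules mentioned above, urllib.parse is the one that works on URLs themselves rather than on network requests. The following is a minimal sketch of three of its functions; the relative link "page2.html" and the query parameter names are made up for illustration and do not come from the example site.

```python
from urllib.parse import urlparse, urljoin, urlencode

# Split the example URL into its components.
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.scheme)   # http
print(parts.netloc)   # pythonscraping.com
print(parts.path)     # /pages/page1.html

# Resolve a relative link against the page it was found on.
# "page2.html" is a hypothetical link used only for illustration.
next_url = urljoin("http://pythonscraping.com/pages/page1.html", "page2.html")
print(next_url)       # http://pythonscraping.com/pages/page2.html

# Build a query string from a dict (the parameter names are made up).
query = urlencode({"q": "web scraping", "page": 2})
print(query)          # q=web+scraping&page=2
```

None of these calls touch the network, so they are a convenient way to prepare or inspect URLs before handing them to urlopen.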
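2. Getting to know BeautifulSoup
The BeautifulSoup library parses HTML text and converts it into a BeautifulSoup object, whose tags can then be accessed as attributes.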
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(),"lxml")
print(bsObj.h1)
<h1>An Interesting Title</h1>
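The BeautifulSoup constructor needs to be told which parser to use. The table below lists the common parsers with their advantages and disadvantages:
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(html, 'html.parser') | Built into Python; reasonably fast | Less tolerant of malformed markup |
lxml HTML parser | BeautifulSoup(html, 'lxml') | Fast; tolerant of malformed markup | Must be installed; depends on a C library |
lxml XML parser | BeautifulSoup(html, ['lxml', 'xml']) | Fast; tolerant of malformed markup; supports XML | Depends on a C library |
html5lib parser | BeautifulSoup(html, 'html5lib') | Parses the way a browser does; best tolerance of malformed markup | Slow |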
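Because lxml and html5lib have to be installed separately, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that with the standard library only. The helper name pick_parser and its default preference order are assumptions made for illustration; they are not part of BeautifulSoup.

```python
from importlib.util import find_spec

def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first parser name in `preferred` whose package is installed,
    falling back to the built-in "html.parser"."""
    for name in preferred:
        # find_spec returns None when a top-level package cannot be found.
        if find_spec(name) is not None:
            return name
    return "html.parser"

parser = pick_parser()
print(parser)
# The result can then be passed on, e.g. BeautifulSoup(html, parser).
```

This works for these two parsers because their BeautifulSoup names happen to match their package names.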
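3. Making the scraper reliable
When requesting a page, the server may respond with 404 Page Not Found, or it may be temporarily down. In either case urlopen raises an exception and, if it is not handled, the program stops. The urllib.error module provides the exception classes for handling these cases.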
from urllib.request import urlopen
from urllib.error import URLError
try:
    html = urlopen("https://www.baid.com/")  # url is wrong
except URLError as e:
    print(e)
<urlopen error [Errno 111] Connection refused>
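The two failure cases described above map onto two exception classes: HTTPError for an error status such as 404, and URLError for a server that cannot be reached. HTTPError is a subclass of URLError, so the order of the checks matters. The following sketch shows one way to tell them apart; describe_error and fetch are hypothetical helper names, and the exceptions at the bottom are built by hand so the snippet runs without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def describe_error(exc):
    """Turn a urllib exception into a short human-readable message."""
    # HTTPError must be checked first because it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):
        return f"cannot reach server: {exc.reason}"
    return f"unexpected error: {exc!r}"

def fetch(url):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except URLError as e:  # also catches HTTPError
        print(describe_error(e))
        return None

# Exercise the helper without touching the network by building the
# exceptions by hand (the URL and messages are illustrative).
not_found = HTTPError("http://example.com/missing", 404, "Not Found", None, None)
refused = URLError("Connection refused")
print(describe_error(not_found))
print(describe_error(refused))
```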
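Once the connection succeeds, we use BeautifulSoup to process the HTML. A common failure here is that the site has been redesigned and a tag we expect is no longer found, so accessing a child of it raises an exception.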
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
try:
    bsObj = BeautifulSoup(html.read(), "lxml")
    li = bsObj.ul.li
    print(li)
except AttributeError as e:
    print(e)
'NoneType' object has no attribute 'li'
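The error message shows what went wrong: looking up the missing tag produced None, and the failure only appeared when a child of None was requested. An alternative to catching AttributeError is to check for None at each step of the chain. The sketch below illustrates the idea with a hypothetical helper named dig. It uses simple stand-in objects rather than real BeautifulSoup objects so that it runs without a network connection; it should work the same way on a real bsObj, but that is an assumption not verified here.

```python
from types import SimpleNamespace

def dig(node, *names):
    """Follow a chain of attributes, returning None at the first missing link."""
    for name in names:
        if node is None:
            return None
        node = getattr(node, name, None)
    return node

# Stand-ins for parsed pages (hypothetical, not real BeautifulSoup objects).
page_with_list = SimpleNamespace(ul=SimpleNamespace(li="<li>first item</li>"))
page_without_list = SimpleNamespace(ul=None)

print(dig(page_with_list, "ul", "li"))     # <li>first item</li>
print(dig(page_without_list, "ul", "li"))  # None
```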
4. A first scraper
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # HTTPError: the server returned an error status such as 404
        # URLError: the server could not be reached at all
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError:
        # raised when bsObj.body is None, so .h1 cannot be accessed
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found.")
else:
    print(title)
<h1>An Interesting Title</h1>