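A great deal of information on the internet is neither stored in a database nor exposed through an API; it lives only on web pages. One technique for extracting that data is web scraping.
In Python, scraping roughly works like this: load the page with the requests library, then extract the relevant information from it with the BeautifulSoup library.
Web pages are written in HyperText Markup Language (HTML). HTML is a markup language with its own syntax rules; the browser downloads a page and uses those rules to render the right content for the user. Reference lists of all HTML tags are available online.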
import requests

# Download the example page.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content
bytes (<class 'bytes'>)
b'<!DOCTYPE html>\n<html>\n <head>\n <title>A simple example page</title>\n </head>\n <body>\n <p>Here is some simple content for this page.</p>\n </body>\n</html>'
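Since response.content is a bytes object rather than a string, it can help to see how the two relate. The following is a minimal sketch, not part of the original walkthrough: it uses a hard-coded byte string as a stand-in for response.content so that it runs without a network connection, and it assumes the page is UTF-8 encoded.

```python
# Stand-in for response.content; hard-coded so the example runs offline.
content = b"<!DOCTYPE html>\n<html>\n <head>\n <title>A simple example page</title>\n </head>\n</html>"

# bytes must be decoded to obtain a str; UTF-8 is an assumption here.
text = content.decode("utf-8")

print(type(content).__name__)
print(type(text).__name__)
print("<title>" in text)
```

BeautifulSoup accepts either form, which is why the walkthrough can pass content to it directly.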
from bs4 import BeautifulSoup
# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')
# Inspecting content shows that the p tag is nested inside the body tag.
body = parser.body
p = body.p
# text is a property that returns the text inside a tag.
print(p.text)
# The title tag, in turn, is nested inside the head tag.
head = parser.head
title = head.title
title_text = title.text
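Attribute-style navigation has two behaviors worth knowing, sketched below under some assumptions: the HTML is a small inline stand-in for the example page (with a second paragraph added for illustration), and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the downloaded page; the second <p> is added for illustration.
html = """<html>
<head><title>A simple example page</title></head>
<body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>"""

parser = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first matching descendant.
first_p_text = parser.body.p.text

# A tag that does not exist in the document comes back as None.
missing_table = parser.body.table

print(first_p_text)
print(missing_table)
```

Because a missing tag yields None, chaining further attributes onto it raises an AttributeError, so a None check is a reasonable precaution on pages whose layout is not known in advance.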
parser = BeautifulSoup(content, 'html.parser')
# Get a list of all occurrences of the body tag in the document.
body = parser.find_all("body")
# Get the paragraph tags inside the first body element.
# A body can contain many p tags, so find_all returns a list.
p = body[0].find_all("p")
head = parser.find_all("head")
title = head[0].find_all("title")
title_text = title[0].text
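When only the first match is needed, BeautifulSoup's find method avoids the [0] indexing that find_all requires. A small sketch on inline HTML (the markup and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical inline page with two paragraphs.
html = (
    "<html><head><title>A simple example page</title></head>"
    "<body><p>One</p><p>Two</p></body></html>"
)
parser = BeautifulSoup(html, "html.parser")

# find_all returns a list of every match.
paragraph_texts = [p.text for p in parser.find_all("p")]

# find returns the first match directly (or None when nothing matches).
first_paragraph = parser.find("p")
title_text = parser.find("title").text

print(paragraph_texts)
print(first_paragraph.text)
print(title_text)
```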
<title>A simple example page</title>
<p id="first">
First paragraph.
<p id="second">
Second paragraph.
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')
# Pass in the id attribute to only get elements with a certain id.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)
First paragraph.
second_paragraph = parser.find_all("p", id="second")[0]
second_paragraph_text = second_paragraph.text
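The text of a tag keeps the surrounding whitespace from the HTML source, and looking up an id that is absent gives no match. The sketch below illustrates both points on an inline copy of the paragraphs shown above; the id "third" and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page with ids; closing tags are assumed.
html = """<body>
<p id="first">
First paragraph.
</p>
<p id="second">
Second paragraph.
</p>
</body>"""
parser = BeautifulSoup(html, "html.parser")

second = parser.find("p", id="second")

# .text keeps the newlines around the sentence; get_text(strip=True) removes them.
raw_text = second.text
second_text = second.get_text(strip=True)

# No element has this id, so find returns None.
missing = parser.find("p", id="third")

print(repr(raw_text))
print(second_text)
print(missing)
```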
<title>A simple example page</title>
<p class="inner-text">
First inner paragraph.
<p class="inner-text">
Second inner paragraph.
<p class="outer-text">
First outer paragraph.
<p class="outer-text">
Second outer paragraph.
# Get the website that contains classes.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')
# Get the first inner paragraph.
# Find all the paragraph tags with the class inner-text.
# Then take the first element in that list.
first_inner_paragraph = parser.find_all("p", class_="inner-text")[0]
print(first_inner_paragraph.text)
First inner paragraph.
second_inner_paragraph = parser.find_all("p", class_="inner-text")[1]
second_inner_paragraph_text = second_inner_paragraph.text
first_outer_paragraph = parser.find_all("p", class_="outer-text")[0]
first_outer_paragraph_text = first_outer_paragraph.text
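Rather than indexing each match one at a time, the class search can be looped over. The following sketch assumes a simplified inline version of the page (the div wrapper and the extra first-item class are assumptions made for illustration) and shows that class_ matches an element even when it carries several classes.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the page with classes.
html = """<body>
<div>
<p class="inner-text first-item">First inner paragraph.</p>
<p class="inner-text">Second inner paragraph.</p>
</div>
<p class="outer-text">First outer paragraph.</p>
</body>"""
parser = BeautifulSoup(html, "html.parser")

# class_ is used because class is a reserved word in Python.
inner_paragraphs = parser.find_all("p", class_="inner-text")
inner_texts = [p.get_text(strip=True) for p in inner_paragraphs]

# An element with several classes exposes them as a list.
first_classes = inner_paragraphs[0]["class"]

print(inner_texts)
print(first_classes)
```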
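Cascading Style Sheets (CSS) is a way to add styling to HTML pages. The pages shown so far are plain and unstyled: the paragraph text is black and all the same font size. Most web pages, however, use varied colors and fonts, and that styling comes from CSS. CSS uses selectors to pick out elements, by tag or by class or id, and so decide where to apply a style such as a color or a font size.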
color: red
color: red
color: red
color: red
<title>A simple example page</title>
<p class="inner-text first-item" id="first">
First paragraph.
<p class="inner-text">
Second paragraph.
<p class="outer-text first-item" id="second">
First outer paragraph.
<p class="outer-text">
Second outer paragraph.
# Get the website that contains classes and ids
response = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = response.content
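parser = BeautifulSoup(content, 'html.parser')
# Select every element with the outer-text class.
print(parser.select(".outer-text"))
[<p class="outer-text first-item" id="second">
First outer paragraph.
</p>, <p class="outer-text">
Second outer paragraph.
</p>]
# Select the element whose id is second.
print(parser.select("#second"))
[<p class="outer-text first-item" id="second">
First outer paragraph.
</p>]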
CSS selectors can also locate nested tags, as the earlier examples did, which lets us use CSS for complex scraping tasks:
div p
div .first-item
.first-item #first
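To make the three selectors above concrete, here is a sketch that runs them through BeautifulSoup's select method. The inline HTML is a hypothetical layout (the div wrapper in particular is an assumption), so the results describe this markup only, not necessarily the live example page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs inside a div, two outside it.
html = """<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
</body>"""
parser = BeautifulSoup(html, "html.parser")

# "div p": every p nested anywhere inside a div.
div_paragraphs = parser.select("div p")

# "div .first-item": elements with class first-item inside a div.
div_first_items = parser.select("div .first-item")

# ".first-item #first": an element with id first nested INSIDE an element
# with class first-item. In this markup the same p carries both, so nothing matches.
nested_first = parser.select(".first-item #first")

# Without the space, the selector requires both on the same element.
same_element = parser.select(".first-item#first")

print(len(div_paragraphs), len(div_first_items), len(nested_first), len(same_element))
```

The space in a selector means "descendant of", which is why the last two queries give different results on this markup.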
<meta charset="UTF-8">
<title>2014 Superbowl Team Stats</title>
<table class="stats_table nav_table" id="team_stats">
<tr id="teams">
<tr id="first-downs">
<td>First downs</td>
<tr id="total-yards">
<td>Total yards</td>
<tr id="turnovers">
<tr id="penalties">
<tr id="total-plays">
<td>Total Plays</td>
<tr id="time-of-possession">
<td>Time of Possession</td>
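This is an excerpt of the box score from the 2014 Super Bowl. It holds statistics for each team: how many yards each team gained, how many turnovers it committed, and so on. The page renders as a table: after the label column, the first column is the Seattle Seahawks and the second is the New England Patriots, and each row is a different statistic.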
# Get the super bowl box score data.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
content = response.content
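parser = BeautifulSoup(content, 'html.parser')
# The #total-plays row holds the total plays for both teams. Its first td is the
# label and the third td (index 2) is the New England Patriots.
patriots_total_plays_count = parser.select("#total-plays")[0].select("td")[2].text
# The #total-yards row holds the total yards for both teams. Its first td is the
# label and the second td (index 1) is the Seattle Seahawks.
seahawks_total_yards_count = parser.select("#total-yards")[0].select("td")[1].text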
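The same row-and-cell pattern can be generalized to collect every statistic at once. The sketch below is one possible approach, not the original author's code: it uses an inline table whose structure mirrors the excerpt above, but the numbers, the header cells, and the stats dictionary layout are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the box score; all values are placeholders, not real stats.
html = """<table class="stats_table nav_table" id="team_stats">
<tr id="teams"><td></td><td>Seahawks</td><td>Patriots</td></tr>
<tr id="total-yards"><td>Total yards</td><td>100</td><td>200</td></tr>
<tr id="total-plays"><td>Total Plays</td><td>10</td><td>20</td></tr>
</table>"""
parser = BeautifulSoup(html, "html.parser")

stats = {}
for row in parser.select("#team_stats tr"):
    cells = [td.get_text(strip=True) for td in row.select("td")]
    # Skip the header row and any row that is not label + two team cells.
    if row.get("id") == "teams" or len(cells) != 3:
        continue
    label, seahawks, patriots = cells
    stats[row["id"]] = {"label": label, "seahawks": seahawks, "patriots": patriots}

print(stats["total-plays"]["patriots"])
print(stats["total-yards"]["seahawks"])
```

Keying the dictionary by each row's id keeps lookups consistent with the "#total-plays" style selectors used earlier; the values stay as strings, so converting them to numbers is left to the caller.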