How to write the scraper:
A scraper in three steps
- Step 1: fetch the data with requests:
1. Import requests
2. Use requests.get to download the page source

    import requests

    r = requests.get('https://book.douban.com/subject/1084336/comments/').text
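A minimal sketch of this fetch step that can run offline: a throwaway local HTTP server stands in for the Douban URL, and the page content is invented for illustration. It also shows `raise_for_status()`, a standard requests call for failing loudly on HTTP errors rather than silently parsing an error page.

```python
# Sketch of the requests step against a local test server (the page body
# and server are made up so the example works without network access).
import http.server
import threading

import requests

PAGE = b"<html><body><p class='comment-content'>great book</p></body></html>"


class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):  # silence per-request logging
        pass


# Port 0 asks the OS for any free port.
server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
resp = requests.get(url)
resp.raise_for_status()  # raises for 4xx/5xx responses
html = resp.text         # the page source, like r in the step above
server.shutdown()
```

Against the real site, only the `requests.get(...)` line and the `raise_for_status()` check carry over; the server scaffolding is purely for the demo.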
- Step 2: parse the data with BeautifulSoup4:
1. Import bs4
2. Parse the page
3. Find the target data
4. Print it in a for loop

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(r, 'lxml')
    pattern = soup.find_all('p', 'comment-content')
    for item in pattern:
        print(item.string)
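The parsing step above can be tried on a hard-coded HTML fragment shaped like the Douban comments page; the comment texts below are invented for illustration, and the stdlib `html.parser` backend is used here only to avoid the extra lxml dependency.

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the <p class="comment-content"> items
# on the Douban comments page.
html = """
<div id="comments">
  <p class="comment-content">A moving story.</p>
  <p class="comment-content">Worth rereading.</p>
  <p class="other">not a comment</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The second positional argument of find_all is shorthand for class_:
# this matches <p> tags whose class contains "comment-content".
pattern = soup.find_all("p", "comment-content")
for item in pattern:
    print(item.string)
```

Note that `item.string` is only non-None when the tag has a single text child; comments wrapped in extra tags would need `item.get_text()` instead.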
- Step 3: save the data with pandas:
1. Import pandas
2. Create a list
3. Write it out with to_csv

    import pandas

    comments = []
    for item in pattern:
        comments.append(item.string)
    df = pandas.DataFrame(comments)
    df.to_csv('comments.csv')
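A sketch of this saving step with two invented comment strings; the CSV is written to an in-memory buffer so the example leaves no file behind, whereas `df.to_csv('comments.csv')` above writes a real file.

```python
import io

import pandas

# Invented sample data standing in for the scraped comments.
comments = ["A moving story.", "Worth rereading."]

# Naming the column gives the CSV a readable header instead of "0".
df = pandas.DataFrame(comments, columns=["comment"])

buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False drops the 0,1,... row numbers
csv_text = buf.getvalue()
print(csv_text)
```

Without `index=False`, pandas also writes the row index as a first column, which is rarely wanted in an exported comment list.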
The complete scraper

    import requests
    from bs4 import BeautifulSoup
    import pandas

    r = requests.get('https://book.douban.com/subject/1084336/comments/').text
    soup = BeautifulSoup(r, 'lxml')
    pattern = soup.find_all('p', 'comment-content')
    for item in pattern:
        print(item.string)

    comments = []
    for item in pattern:
        comments.append(item.string)
    df = pandas.DataFrame(comments)
    df.to_csv('comments.csv')
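The complete scraper can also be organized as one function per step, which makes the parse and save stages testable without a network connection. This is a sketch, not the original script: the function names are invented, the parser backend is swapped to the stdlib `html.parser`, and the demo runs on a hard-coded fragment with made-up comments.

```python
import io

import pandas
import requests
from bs4 import BeautifulSoup


def fetch_page(url):
    """Step 1: download the raw page source."""
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text


def parse_comments(html):
    """Step 2: extract the comment strings."""
    soup = BeautifulSoup(html, "html.parser")
    return [str(item.string) for item in soup.find_all("p", "comment-content")]


def save_comments(comments, out):
    """Step 3: write the comments to CSV (out may be a path or a buffer)."""
    pandas.DataFrame(comments, columns=["comment"]).to_csv(out, index=False)


# Exercise steps 2 and 3 on an invented fragment (no network needed):
sample = '<p class="comment-content">great</p><p class="comment-content">ok</p>'
comments = parse_comments(sample)
buf = io.StringIO()
save_comments(comments, buf)
print(comments)
```

For the real run, `save_comments(parse_comments(fetch_page(url)), 'comments.csv')` reproduces the one-shot script above.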
Output of the code: