Crawler: a small Python demo from work for scraping web-page data

Recently the company asked me to write a crawler to fill in data for a follow-up finance project. For work-confidentiality reasons I won't attach the url of the site being crawled; below is a summary of how the spider works.
Language: Python; tool: Jupyter.
(1) Use the requests module to fetch the url page.
import requests

url = "http://www.~~~~~~~~~~~~~~~~~~~~~~~~~~"
r = requests.get(url)
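Before parsing, it is worth checking that the request actually succeeded; a minimal sketch of the usual checks (not in the original):

print r.status_code      # 200 means the page came back OK
r.raise_for_status()     # raises an exception on 4xx/5xx responses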
(2) Parse the HTML page (a PDF page needs other tool modules; see the sketch after the code below). Use the BeautifulSoup module to turn the fetched page text into a soup object.
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly to avoid a bs4 warning
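For the PDF pages mentioned above, BeautifulSoup does not help; a minimal sketch using PyPDF2 instead (the library choice and the local file path are my assumptions, not from the original):

import PyPDF2

# Assumption: the PDF has already been downloaded to a local file.
with open(r"E:\report.pdf", "rb") as f:
    reader = PyPDF2.PdfFileReader(f)
    for pageNum in xrange(reader.getNumPages()):
        print reader.getPage(pageNum).extractText()   # plain text of one page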
(3) Use soup to find the hyperlink href attributes and save them to a file for later use;
with open(r"E:\juchao.txt", "wb") as code:
    for link in soup.find_all('a'):
        # link.get('href') returns None for <a> tags without an href,
        # which gets written out as the string "None"
        code.write(str(link.get('href')) + '\r\n')
print "Download Complete!"
(4) Read the saved href links back from the file written in the previous step into a list;
fd = open(r"E:\juchao.txt", "r")
mylist = []
for line in fd:
    mylist.append(line)   # each line still carries its trailing newline
fd.close()
(5) Build headers so the POST request masquerades as a browser (set the data parameter if necessary), then concatenate each id into the final url format (use the browser's dev tools and its Network panel to see what to send); a sketch of the actual POST call follows the code below.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Cookie': 'JSESSIONID=27AF575249A833C368677F9B5869A463',
    'Host': 'www.cninfo.com.cn',
    'Referer': 'http://www.~~~~~~~~~~~~~~~',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0',
    'Content-Length': '262',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
}

urlpath = 'http://www.cninfo.com.cn/information/brief/szmb'
myUrls = []
for submylist in mylist:
    # take the six characters before the trailing newline: the stock code
    urlId = submylist[-7:-1]
    url = urlpath + urlId + '.html'
    myUrls.append(url)
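The headers above are shaped for a POST (note the form-urlencoded Content-Type and the X-Requested-With field), but the original never shows the call itself. A sketch of how they would be used, with hypothetical form fields:

# The field names below are placeholders -- the real ones come from
# inspecting the site's network traffic, as described in step (5).
data = {'pageNum': '1', 'column': 'szse'}
resp = requests.post(urlpath, headers=headers, data=data)
print resp.status_code

Note that requests computes Content-Length itself, so that entry in the headers dict could safely be left out.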
(6) Each newly concatenated url is a final page we need. Fetch it with requests (mind the encoding), parse the HTML with soup, build a JSON string from the key/value elements on the page, and save it to a file.
import json

with open(r"E:\juchao_json.txt", "wb") as code:
    for k in xrange(len(myUrls)):
        r1 = requests.get(myUrls[k])
        r1.encoding = r1.apparent_encoding   # guess the real encoding from the content
        # print r1.encoding
        soup = BeautifulSoup(r1.text, "html.parser")

        jsonMap = {}
        jsonMapKey = []
        jsonMapValue = []
        for i in soup.select(".zx_data"):      # field names on the page
            jsonMapKey.append(i.text)
        for i in soup.select(".zx_data2"):     # field values; trim a site-specific 3-character suffix
            jsonMapValue.append(i.text[:-3])
        for j in xrange(len(jsonMapKey)):
            jsonMap[jsonMapKey[j]] = jsonMapValue[j]

        strJson = json.dumps(jsonMap, ensure_ascii=False)
        # print strJson
        code.write(strJson.encode('utf-8') + '\r\n')   # one JSON object per line
print 'Done!'
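Since each record is written as one JSON object per line, reading the results back is straightforward; a minimal sketch:

records = []
with open(r"E:\juchao_json.txt", "rb") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line.decode('utf-8')))
print len(records), 'records loaded'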
Original post: http://www.cnblogs.com/rongyux/p/5499332.html