
Extracting page information in Python with BeautifulSoup, regex, and lxml


The script below requests a Zhaopin (sou.zhaopin.com) search-results page for "python" jobs in Beijing and extracts the job-title links in three ways: with an lxml XPath query (active), with BeautifulSoup, and with a regular expression (the last two are commented out).

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup
from lxml import etree

# Add a browser-like User-Agent header so the request looks like a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
url = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&sm=0&p=1"
req_timeout = 5
req = Request(url=url, headers=headers)
f = urlopen(req, None, req_timeout)
s = f.read()
s = s.decode('utf-8')

ss = str(s)

# Extraction with lxml: grab both the href and the text of each job-title link
selector = etree.HTML(ss)
links = selector.xpath('//tr/td[@class="zwmc"]/div/a/@href|//tr/td[@class="zwmc"]/div/a/text()')
for link in links:
	print(link)
'''
# Extraction with BeautifulSoup: walk every <tr> and print the first link in each row
soup = BeautifulSoup(ss, 'html.parser')
aList = soup.find_all("tr")
for item in aList:
	aList1 = item.find_all("a")
	for item1 in aList1:
		print(item1.get('href'))
		print(item1.get_text())
		break
	#print(item)
	#print(item.get('href'))
	#print(item.get_text())
'''

# Extraction with a regular expression
'''
mm = re.findall('<div style="width: 224px;*width: 218px; _width:200px; float: left"><a style=\"font-weight: bold\" par=\"(.*)\" href=\"(.*)\" target=\"_blank\">(.*)</a>',ss)

print(mm)
'''
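For comparison with the XPath query, a BeautifulSoup version can also be narrowed to the same cells instead of walking every <tr> as the commented-out block above does. The sketch below is a minimal illustration, not from the original post; it assumes the same markup (a td with class="zwmc" containing a div with the job-title link) and reuses the ss string fetched earlier.

# BeautifulSoup narrowed to the job-title cells (illustrative sketch, assumes td class="zwmc" markup)
soup = BeautifulSoup(ss, 'html.parser')
for td in soup.find_all('td', class_='zwmc'):
	a = td.find('a')  # the first link in the cell carries the job title and its URL
	if a is not None:
		print(a.get('href'))
		print(a.get_text(strip=True))

Filtering on the zwmc class skips header and pagination rows that the broader <tr> loop would also match, so no break is needed.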

  


Original article: http://www.cnblogs.com/yongxinboy/p/7783954.html
