'''Get all links on the current page.'''
import re

import requests
from bs4 import BeautifulSoup
from lxml import etree

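url = 'http://www.ok226.com'
r = requests.get(url)
r.encoding = 'gb2312'  # the page is served in GB2312

# Using re (crude brute force!)
# Lookbehind/lookahead pairs capture whatever sits between href="..." or href='...'
matches = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", r.text)
for link in matches:
    print(link)
print()
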
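# Using BeautifulSoup4 (DOM tree)
soup = BeautifulSoup(r.text, 'lxml')
# href=True skips <a> tags that have no href, which would otherwise raise KeyError
for a in soup.find_all('a', href=True):
    link = a['href']
    print(link)
print()
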
# Using lxml.etree (XPath)
tree = etree.HTML(r.text)
# //@href selects every href attribute in the document, on any element
for link in tree.xpath("//@href"):
    print(link)
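
All three approaches above print the href values exactly as they appear in the page, so relative paths and duplicates come through untouched. As a rough sketch (not part of the original post), the same idea can be done with only the standard library, resolving each link against a base URL along the way. The names `LinkCollector` and `absolute_links`, the sample HTML, and the `example.com` base URL are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect every non-empty href value from start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def absolute_links(html, base_url):
    """Return absolute URLs in document order, without duplicates."""
    parser = LinkCollector()
    parser.feed(html)
    result = []
    for href in parser.links:
        full = urljoin(base_url, href)  # resolve relative paths against base_url
        if full not in result:
            result.append(full)
    return result


# Assumed sample input, used instead of a live request
sample = (
    '<a href="/a.html">A</a> <a href=\'b.html\'>B</a> '
    '<a href="/a.html">dup</a> <a>no href</a>'
)
for link in absolute_links(sample, "http://example.com/dir/"):
    print(link)
```

With a real page, `r.text` and `url` from the script above could be passed in place of the sample string and base URL.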
Original article: http://www.cnblogs.com/hhh5460/p/5044038.html