urllib: standard library (called urllib2 in Python 2)
Submodules: request, parse, error
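The notes only use request and error, so here is a minimal sketch of what the parse submodule is for: splitting a URL into parts and resolving a relative link. The relative link `page2.html` is a made-up example, not a page referenced in these notes.

```python
from urllib.parse import urlparse, urljoin

base = 'http://pythonscraping.com/pages/page1.html'

# Split the URL into its components
parts = urlparse(base)
print(parts.scheme)  # http
print(parts.netloc)  # pythonscraping.com
print(parts.path)    # /pages/page1.html

# Resolve a relative link (hypothetical) against the page it was found on
next_page = urljoin(base, 'page2.html')
print(next_page)     # http://pythonscraping.com/pages/page2.html
```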
request:
urlopen function: opens a remote object over the network and reads it
from urllib.request import urlopen

# html = urlopen('http://pythonscraping.com/pages/page1.html')
html2 = urlopen('http://baidu.com/robots.txt')
# Fetch the entire content of the page
print(html2.read())  # the output is a bytes object, not a str
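Since read() returns bytes, the result usually has to be decoded before it can be treated as text. A minimal sketch, assuming the server declares its charset in the Content-Type header and falling back to UTF-8 otherwise; the helper name `fetch_text` is made up for this example.

```python
from urllib.request import urlopen


def fetch_text(url, default_charset='utf-8'):
    """Download a URL and return its body as str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes
        # Use the charset from the Content-Type header when there is one
        charset = response.headers.get_content_charset() or default_charset
    # errors='replace' keeps going if a few bytes do not match the charset
    return raw.decode(charset, errors='replace')
```

For example, `print(fetch_text('http://baidu.com/robots.txt'))` would print readable text instead of a `b'...'` literal.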
beautifulsoup4 (bs4): not part of the standard library
sudo apt-get install python3-bs4
In [5]: from bs4 import BeautifulSoup
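# Improved version
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://news.ifeng.com/a/20171111/53165257_0.shtml')
# The page may not exist on the server, in which case urlopen raises HTTPError; handle that with try-except.
# If the server cannot be reached at all (the link is dead or the URL is mistyped), urlopen raises URLError rather than returning None, so catch that too.
# The tag we want may also be missing, in which case BeautifulSoup returns None for it.
bsobj = BeautifulSoup(html.read(), 'html.parser')  # naming 'html.parser' explicitly avoids a warning
print(bsobj.h1)  # get the h1 tag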
<h1 id="artical_topic" itemprop="headline">又一名女空乘从波音客机上掉下</h1>
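The "missing tag returns None" case can be tried without the network. A minimal sketch, using a made-up HTML string in place of a downloaded page:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = '<html><body><h1 id="title">Hello</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')

heading = soup.h1
print(heading.get_text())  # Hello

missing = soup.h2          # the page has no h2 tag
print(missing)             # None

# Going one level deeper on a missing tag means attribute access on None
try:
    soup.h2.span
    chained_ok = True
except AttributeError:
    chained_ok = False
print(chained_ok)          # False
```

This is why a lookup such as bsobj.body.h1 needs an AttributeError guard: if body is missing, the lookup of h1 fails before any None check can run.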
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # The page does not exist or the server cannot be reached
        return None
    try:
        bsobj = BeautifulSoup(html.read(), 'html.parser')
        title = bsobj.body.h1
    except AttributeError:
        # bsobj.body was None, so there is no h1 to look up
        return None
    return title

title = getTitle('http://news.ifeng.com/a/20171111/53165257_0.shtml')
if title is None:
    print('Title could not be found')
else:
    print(title)
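When debugging it can help to report which failure happened instead of returning None for all of them. A minimal sketch: the helper name `describe_fetch` and its `opener` parameter are made up for this example (the parameter only exists so the function can be tried without a network connection). HTTPError is a subclass of URLError, so it has to be caught first.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def describe_fetch(url, opener=urlopen):
    """Return a short status string instead of raising (hypothetical helper)."""
    try:
        with opener(url) as response:
            response.read()
        return 'ok'
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        return 'HTTP error {}'.format(e.code)
    except URLError as e:
        # No usable answer at all, e.g. DNS failure or refused connection
        return 'URL error: {}'.format(e.reason)
```

Calling `describe_fetch('http://news.ifeng.com/a/20171111/53165257_0.shtml')` uses the real urlopen and returns one of the three strings depending on what the server does.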
Original article: http://www.cnblogs.com/lybpy/p/7819643.html