9.3.4 BeautifulSoup4

Date: 2018-05-04 21:57:06

  BeautifulSoup is an excellent Python extension library for extracting the data you are interested in from HTML or XML files, and it lets you choose which parser to use.

  Install the module directly with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.
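  As a quick check that the installation works and that the parser really is selectable, the following is a minimal sketch. The helper name make_soup and its fallback policy are assumptions made for illustration, not part of BeautifulSoup itself. It tries the third-party lxml parser first and falls back to html.parser, which ships with Python.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper used only for this illustration.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser (for example lxml) is not installed.
        # html.parser is part of the standard library, so it is always available.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello world</p>")
print(soup.p.text)
```

  Passing the parser name explicitly, as the second argument, keeps the result the same from one machine to another, because different parsers may repair incomplete markup in different ways.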

  The following briefly demonstrates what BeautifulSoup4 can do. For more detailed and complete learning material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

 

>>> from bs4 import BeautifulSoup
>>>
>>> # Missing tags are added and completed automatically (the lxml parser must be installed)
>>> BeautifulSoup('hello world', 'lxml')
<html><body><p>hello world</p></body></html>
>>>
>>> # Define the contents of an HTML document
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>>
>>> # Parse the HTML document and display it in a neatly indented form
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>
>>> # Access a specific tag
>>> soup.title
<title>The Dormouse's story</title>
>>>
>>> # Tag name
>>> soup.title.name
'title'
>>>
>>> # Tag text
>>> soup.title.text
"The Dormouse's story"
>>>
>>> # The parent tag of the title tag
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.b
<b>The Dormouse's story</b>
>>>
>>> soup.b.name
'b'
>>> soup.b.text
"The Dormouse's story"
>>>
>>> # The whole BeautifulSoup object can be treated as a tag object
>>> soup.name
'[document]'
>>>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> # Tag attributes
>>> soup.p['class']
['title']
>>>
>>> soup.p.get('class')         # Another way to read a tag attribute
['title']
>>>
>>> soup.p.text
"The Dormouse's story"
>>>
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>>
>>> # View all attributes of the a tag
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
>>>
>>> # Find all a tags
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> # Find <a> and <b> tags at the same time
>>> soup.find_all(['a', 'b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> # Find tags whose href contains a given keyword
>>> soup.find_all(href=re.compile("elsie"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>>
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>>
>>> soup.find_all('a', id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> for link in soup.find_all('a'):
    print(link.text, ':', link.get('href'))

Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>>
>>> print(soup.get_text())           # Return all the text

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...

>>>
>>> # Modify a tag attribute
>>> soup.a['id'] = 'test_link1'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>>
>>> # Modify the tag text (replace_with returns the string that was replaced)
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>>
>>> soup.a.string
'test_Elsie'
>>>
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>
>>> # Iterate over child nodes
>>> for child in soup.body.children:
    print(child)



<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


>>>
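
  The blank lines printed by the last loop come from the newline text nodes that sit between the tags, because .children yields strings as well as tags. The script below is a minimal sketch of how the calls demonstrated in the session could be combined outside the interactive prompt. The helper names extract_links and tag_children are hypothetical and chosen only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""


def extract_links(soup):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


def tag_children(tag):
    """Return only the child tags, skipping text nodes such as newlines."""
    return [child for child in tag.children if isinstance(child, Tag)]


soup = BeautifulSoup(HTML_DOC, "html.parser")
links = extract_links(soup)
body_children = tag_children(soup.body)

for text, href in links:
    print(text, ":", href)
print([child.name for child in body_children])
```

  Filtering with isinstance(child, Tag) is one way to skip the text nodes. Whether to drop them depends on whether the text between tags matters for the task at hand.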

 


Original article: https://www.cnblogs.com/avention/p/8991818.html
