scrapy--BeautifulSoup

Posted: 2018-09-13 17:31:41

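Official BeautifulSoup documentation: https://beautifulsoup.readthedocs.io/zh_CN/latest/#id8

The full documentation is long, so these notes keep only the parts I actually use.

1. index.html (the sample document; the code below refers to its contents as html_doc)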

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
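
The later snippets use a variable named html_doc without showing where it comes from. As a minimal sketch, assuming the markup above is simply inlined as a Python string (reading it from index.html with open() would work just as well), the setup could look like this. The names title_text and link_count are illustrative only.

```python
from bs4 import BeautifulSoup

# The sample markup from index.html, inlined so the sketch is self-contained.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse with the parser bundled in the standard library.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string          # text inside <title>
link_count = len(soup.find_all("a"))    # number of <a> tags in the document
print(title_text, link_count)
```

Passing the parser name explicitly keeps the result the same across machines, since bs4 otherwise picks whichever parser happens to be installed.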

2. .prettify() -- output with standard indentation

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

3. Selecting tags and attributes

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
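
The lookups above can be combined to pull structured data out of the page. The following is a rough sketch, assuming a cut-down copy of the sample markup, that filters links by CSS class with find_all and collects the results into ordinary Python containers. The names links and names are illustrative only.

```python
from bs4 import BeautifulSoup

# A cut-down copy of the sample markup, inlined for a self-contained sketch.
html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# because "class" is a reserved word in Python.
links = {a["id"]: a.get("href") for a in soup.find_all("a", class_="sister")}

# get_text() on a single tag returns only that tag's text.
names = [a.get_text() for a in soup.find_all("a")]

print(links)
print(names)
```

Using a.get("href") rather than a["href"] means a link without an href yields None instead of raising KeyError.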

# Tag
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

# Name
tag.name
# 'b'
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

# Attributes
tag['class']
# ['boldest']  (class is multi-valued, so bs4 returns a list)
tag.attrs
# {'class': ['boldest']}
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
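
The difference between tag['class'] and tag.get('class') matters when an attribute may be absent. Here is a small sketch, assuming a hypothetical tag with two classes and an id, that shows the list returned for class alongside the safer access patterns. The variable names and the "untitled" default are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with two classes and an id.
soup = BeautifulSoup('<b class="boldest big" id="x1">Extremely bold</b>', "html.parser")
tag = soup.b

classes = tag["class"]                    # multi-valued attribute: a list of strings
tag_id = tag["id"]                        # single-valued attribute: a plain string
missing = tag.get("title")                # absent attribute: None instead of KeyError
fallback = tag.get("title", "untitled")   # absent attribute with an explicit default
has_class = tag.has_attr("class")         # membership check without reading the value

print(classes, tag_id, missing, fallback, has_class)
```

Index access suits attributes that must exist, while .get() and .has_attr() are the safer choice when scraping pages whose markup varies.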

Original post: https://www.cnblogs.com/eilinge/p/9641598.html
