
Python Crawler: Day 1


Definition:

A web crawler (also called a web spider or web robot, and, in the FOAF community, more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Other, less common names include ant, automatic indexer, emulator, and worm.

*************************************************************************

Requests

Requests is an HTTP library for Python released under the Apache2 License. It is a high-level wrapper over Python's built-in modules, which makes sending network requests from Python far more pleasant; with Requests you can easily perform essentially any operation a browser can.
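Requests is not part of the standard library, so it has to be installed first (a typical install command, assuming pip3 is available):

pip3 install requests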

1. GET requests

# 1. Without parameters

import requests

ret = requests.get('https://github.com/timeline.json')

print(ret.url)
print(ret.text)
# 2. With parameters

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)

print(ret.url)
print(ret.text)

This sends a GET request to https://github.com/timeline.json; both the request and the response details are wrapped in the ret object.
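A minimal sketch of what can be read back from the returned Response object (these attribute names are standard requests API; httpbin.org is only used here as a convenient echo service):

import requests

ret = requests.get("http://httpbin.org/get", params={'k': 'v'})

print(ret.status_code)   # HTTP status code, e.g. 200
print(ret.url)           # final URL, with the query string appended
print(ret.encoding)      # encoding used to decode ret.text
print(ret.headers)       # response headers (case-insensitive dict)
print(ret.text)          # body decoded as text
print(ret.content)       # raw body as bytes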

2. POST requests

# 1. Basic POST example

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post("http://httpbin.org/post", data=payload)

print(ret.text)
# 2. Sending request headers along with the data

import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

ret = requests.post(url, data=json.dumps(payload), headers=headers)

print(ret.text)
print(ret.cookies)

This sends a POST request to https://api.github.com/some/endpoint; again, the request and response details are wrapped in the ret object.
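Instead of serializing the body manually with json.dumps, Requests can also do it via the json parameter, which serializes the dict and sets the Content-Type header to application/json automatically. A small sketch against the same placeholder endpoint:

import requests

url = 'https://api.github.com/some/endpoint'

ret = requests.post(url, json={'some': 'data'})

print(ret.status_code)
print(ret.text)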

3. Other request methods

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)
 
# All of the methods above are built on top of this one
requests.request(method, url, **kwargs)
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': ('filename', fileobj)}``) for multipart encoding upload.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """

    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

The parameter documentation above is taken directly from the requests source code.

More documentation for the requests module: http://cn.python-requests.org/zh_CN/latest/

 

==================== Examples

Scraping Autohome news (no login required)

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.autohome.com.cn/news/')
response.encoding = 'gbk'   # the page is GBK-encoded

soup = BeautifulSoup(response.text, 'html.parser')
div_all = soup.find('div', attrs={'id': 'auto-channel-lazyload-article'})
li_l = div_all.find_all('li')
p = 0
for li in li_l:
    li_a = li.find('a')    # article link
    li_h = li.find('h3')   # article title
    p += 1
    if li_a:
        print('href ------->', li_a.get('href').strip('//'))
    else:
        print('href ------->', None)
    if li_h:
        print('title ------>', li_h.text)
    else:
        print('title ------>', None)
    print('---------------->')
    if p >= 5:
        break
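The example above hard-codes response.encoding = 'gbk'. As a hedged alternative, requests can guess the encoding from the page body via apparent_encoding (part of the Response API), which avoids hard-coding it; how accurate the guess is depends on the page:

import requests

response = requests.get('http://www.autohome.com.cn/news/')
response.encoding = response.apparent_encoding   # let requests guess the encoding instead of hard-coding 'gbk'
print(response.encoding)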

 

Chouti (dig.chouti.com): fetch hot-article titles and URLs without logging in.

Then log in and upvote a specific article.

import requests
from bs4 import BeautifulSoup

# Chouti hot list: the div with id 'content-list' at http://dig.chouti.com/
response = requests.get('http://dig.chouti.com/')
r1_dic = response.cookies.get_dict()

# Log in, carrying the cookies from the first request
r2 = requests.post(
        'http://dig.chouti.com/login',
        data={
            'phone': '86123123',
            'password': 'aaaa',
            'oneMonth': 1
        },
        cookies=r1_dic)
r2_dic = r2.cookies.get_dict()
all_dic = {}
all_dic.update(r1_dic)
all_dic.update(r2_dic)

# Logged in: upvote an article
r3 = requests.post('http://dig.chouti.com/link/vote?linksId=14720226', cookies=all_dic)
print(r3.text)

# Grab URLs and titles
soup = BeautifulSoup(response.text, 'html.parser')
div_all = soup.find('div', attrs={'id': 'content-list'})
div_li = div_all.find_all('div', attrs={'class': 'news-content'})
p = 0
for div in div_li:
    p += 1
    div_a = div.find('a')
    if div_a:
        url = div_a.get('href')
        title = div_a.text.strip()
        print('url -------->', url)
        print('title ------>', title)
        print('----------===============')
    if p >= 5:
        break

 

 

 

def param_method_url():
    # requests.request(method='get', url='http://127.0.0.1:8000/test/')
    # requests.request(method='post', url='http://127.0.0.1:8000/test/')
    pass


def param_param():
    # params can be:
    # - a dict
    # - a string
    # - bytes (ASCII only)

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params="k1=v1&k2=水电费&k3=v3&k3=vv3")

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

    # Error: a bytes value for params may only contain ASCII characters
    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
    pass


def param_data():
    # data can be:
    # - a dict
    # - a string
    # - bytes
    # - a file object

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1; k2=v2; k3=v3; k3=v4"
    #                  )

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1;k2=v2;k3=v3;k3=v4",
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'}
    #                  )

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data=open('data_file.py', mode='r', encoding='utf-8'),  # file contents: k1=v1;k2=v2;k3=v3;k3=v4
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'}
    #                  )
    pass


def param_json():
    # The dict passed via json= is serialized to a string with json.dumps(...)
    # and sent in the request body, with Content-Type set to 'application/json'.
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'})


def param_headers():
    # Send custom request headers to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )


def param_cookies():
    # Send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )
    # A CookieJar can also be used (the dict form is a wrapper around it)
    from http.cookiejar import CookieJar
    from http.cookiejar import Cookie

    obj = CookieJar()
    obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                          discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                          port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
                   )
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies=obj)


def param_files():
    # Upload a file
    # file_dict = {
    #     'f1': open('readme', 'rb')
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload a file with a custom filename
    # file_dict = {
    #     'f1': ('test.txt', open('readme', 'rb'))
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload string content as a file, with a custom filename
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload string content as a file, with filename, content type and extra headers
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    pass


def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

    # ret = requests.get('http://192.168.1.1',
    #                    auth=HTTPBasicAuth('admin', 'admin'))
    # ret.encoding = 'gbk'
    # print(ret.text)

    # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
    # print(ret)


def param_timeout():
    # ret = requests.get('http://google.com/', timeout=1)
    # print(ret)

    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)
    pass


def param_allow_redirects():
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)


def param_proxies():
    # proxies = {
    #     "http": "61.172.249.96:80",
    #     "https": "http://61.185.219.126:3128",
    # }

    # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

    # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
    # print(ret.headers)


    # from requests.auth import HTTPProxyAuth
    #
    # proxyDict = {
    #     'http': '77.75.105.165',
    #     'https': '77.75.105.165'
    # }
    # auth = HTTPProxyAuth('username', 'mypassword')
    #
    # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    # print(r.text)

    pass


def param_stream():
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    #     # Process the response here.
    #     for i in r.iter_content():
    #         print(i)


def requests_session():
    import requests

    session = requests.Session()

    # 1. Visit any page first to obtain the initial cookies
    i1 = session.get(url="http://dig.chouti.com/help/service")

    # 2. Log in, carrying the previous cookies; the server authorizes the 'gpsd' cookie
    i2 = session.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "8615131255089",
            'password': "xxxxxx",
            'oneMonth': ""
        }
    )

    # 3. Upvote; the session carries the authorized cookies automatically
    i3 = session.post(
        url="http://dig.chouti.com/link/vote?linksId=8589623",
    )
    print(i3.text)

 

All parameters accepted by requests (get, post)

"""
1. method  类型,方法
2. url    地址
3. params  get,传参数
4. data    post,传参数
5. json    post,传参数2
6. headers  头信息
7. cookies  客户端cookies
8. files   上传文件   
9. auth    验证
10. timeout  超时时间
11. allow_redirects  
12. proxies
13. stream
14. cert
================ session,保存请求相关信息(不推荐)===================
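Several of the later parameters are only listed by name above; a short sketch of passing a few of them together (the URL and the proxy address are placeholders, not real services, and the values are purely illustrative):

import requests

ret = requests.get(
    'http://127.0.0.1:8000/test/',             # placeholder test URL
    params={'k1': 'v1'},                       # query string
    headers={'User-Agent': 'my-crawler'},      # request headers
    timeout=(5, 10),                           # (connect timeout, read timeout) in seconds
    allow_redirects=True,                      # follow redirects
    proxies={'http': 'http://10.0.0.1:8888'},  # proxy per protocol (placeholder address)
    verify=True,                               # verify SSL certificates (irrelevant for plain http)
)
print(ret.status_code)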

# 8. files: file upload
requests.post(url='xx', files={'f1': open('readme', 'rb')})

# 9. auth: user authentication
from requests.auth import HTTPBasicAuth, HTTPDigestAuth
ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
print(ret.text)

# 13. stream: read the response lazily
from contextlib import closing
with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    # Process the response here.
    for i in r.iter_content():
        print(i)


 

BeautifulSoup

BeautifulSoup is a module that takes an HTML or XML string, parses it into a document tree, and then lets you quickly locate specific elements with the methods it provides, which makes searching HTML or XML documents much simpler.

Installation: pip3 install beautifulsoup4

from bs4 import BeautifulSoup

from bs4 import BeautifulSoup
 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story总共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the first a tag
tag1 = soup.find(name='a')
# Find all a tags
tag2 = soup.find_all(name='a')
# Find the tag with id="link2"
tag3 = soup.select('#link2')
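For orientation, a quick sketch of what these calls return: find gives a single Tag (or None), while find_all and select give list-like collections of Tags:

print(type(tag1), tag1.attrs)                   # a single Tag, here the first <a>
print(len(tag2), [t.get('id') for t in tag2])   # list-like collection of all <a> tags
print(tag3[0].text if tag3 else None)           # select also returns a list-like collection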

 

1. name: the tag's name

# tag = soup.find('a')
# name = tag.name    # get
# print(name)
# tag.name = 'span'  # set
# print(soup)

2. attrs: the tag's attributes

# tag = soup.find('a')
# attrs = tag.attrs          # get
# print(attrs)
# tag.attrs = {'ik': 123}    # set (replace all attributes)
# tag.attrs['id'] = 'iiiii'  # set a single attribute
# print(soup)

3. children: all direct child nodes

# body = soup.find('body')
# v = body.children

4. descendants: all descendant nodes (children, grandchildren, and so on)

# body = soup.find('body')
# v = body.descendants

5. clear: remove all of the tag's children (the tag itself is kept)

# tag = soup.find('body')
# tag.clear()
# print(soup)

6. decompose: remove the tag and everything inside it from the tree

# body = soup.find('body')
# body.decompose()
# print(soup)

7. extract: remove the tag (and its children) from the tree and return the removed tag

# body = soup.find('body')
# v = body.extract()
# print(soup)

8. decode: convert to a string (including the current tag); decode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)

9. encode: convert to bytes (including the current tag); encode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)

10. find: get the first matching tag

# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)

11. find_all: get all matching tags

# tags = soup.find_all('a')
# print(tags)

# tags = soup.find_all('a', limit=1)
# print(tags)

# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags)


# ####### list arguments #######
# v = soup.find_all(name=['a', 'div'])
# print(v)

# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)

# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))


# v = soup.find_all(id=['link1', 'link2'])
# print(v)

# v = soup.find_all(href=['link1', 'link2'])
# print(v)

# ####### regular expressions #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)

# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)

# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)

# ####### filtering with a function #######
# def func(tag):
#     return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)


# ## get: read a tag attribute
# tag = soup.find('a')
# v = tag.get('id')
# print(v)

 

12. has_attr: check whether the tag has a given attribute

# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)

13. get_text: get the text inside the tag

# tag = soup.find('a')
# v = tag.get_text()
# print(v)

14. index: find the position of a tag within another tag

# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v)

# tag = soup.find('body')
# for i, v in enumerate(tag):
#     print(i, v)

15. is_empty_element: whether the tag is an empty (void) or self-closing element,

    i.e. one of: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)

16. Related tags of the current tag

# soup.next
# soup.next_element
# soup.next_elements
# soup.next_sibling
# soup.next_siblings

#
# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings

#
# tag.parent
# tag.parents

17. Searching a tag's related tags

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)

# These take the same arguments as find_all

18. select, select_one: CSS selectors

soup.select("title")

soup.select("p:nth-of-type(3)")

soup.select("body a")

soup.select("html head title")

tag = soup.select("span,a")

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")

soup.select(".sister")

soup.select("[class~=sister]")

soup.select("#link1")

soup.select("a#link2")

soup.select('a[href]')

soup.select('a[href="http://example.com/elsie"]')

soup.select('a[href^="http://example.com/"]')

soup.select('a[href$="tillie"]')

soup.select('a[href*=".com/el"]')


from bs4.element import Tag

def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

from bs4.element import Tag
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
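select_one is named in the heading but not shown above; it behaves like select but returns only the first match (or None). A small sketch against the html_doc example:

first_sister = soup.select_one('a.sister')                 # first tag matching the CSS selector, or None
print(first_sister.get('href') if first_sister else None)

all_sisters = soup.select('a.sister')                      # all matches, as a list
print(len(all_sisters))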

19. Tag content

# tag = soup.find('span')
# print(tag.string)           # get
# tag.string = 'new content'  # set
# print(soup)

# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)

# tag = soup.find('body')
# v = tag.stripped_strings  # generator over the stripped text of all inner tags
# print(v)
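Note that stripped_strings is a generator, so printing it directly only shows the generator object; a quick sketch of materializing it:

body = soup.find('body')
print(list(body.stripped_strings))   # list of the stripped text fragments inside <body>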

20. append: append a tag inside the current tag

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)
#
# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)

21. insert: insert a tag at a given position inside the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)

22. insert_after, insert_before: insert after or before the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)

23. replace_with: replace the current tag with the given tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)

24. Creating relationships between tags

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)

25. wrap: wrap the current tag inside the given tag

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = '我是一个新来的'
#
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)

# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)

26. unwrap: remove the current tag but keep its contents

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)

 


Original post: http://www.cnblogs.com/onda/p/7676773.html
