
Python Web Scraping Notes, Part 2

Date: 2017-11-16 11:32:42


1. The requests library

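import json
import requests

from io import BytesIO

# List everything the requests module exposes (effectively its API)
# print(dir(requests))


url = 'http://www.baidu.com'
r = requests.get(url)
print(r.text)         # response body decoded as text
print(r.status_code)  # HTTP status code
print(r.encoding)     # encoding requests uses to decode r.text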

 

# Passing query parameters, e.g. http://aaa.com?pageId=1&type=content

params = {'k1': 'v1', 'k2': 'v2'}
r = requests.get('http://httpbin.org/get', params=params)
print(r.url)
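To see how requests turns the params dict into a query string without touching the network, the request can be prepared but not sent. This is a minimal sketch; it relies on requests.Request and its prepare() method, and the variable name prepared is only illustrative.

```python
import requests

# Build the request without sending it, so no network access is needed.
params = {'k1': 'v1', 'k2': 'v2'}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=params).prepare()

# The dict is URL-encoded and appended to the URL as a query string.
print(prepared.url)  # http://httpbin.org/get?k1=v1&k2=v2
```

The same URL is what r.url reports after an actual requests.get call with these params.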

 

# Binary data (requires Pillow: from PIL import Image)

# r = requests.get('http://i-2.shouji56.com/2015/2/11/23dab5c5-336d-4686-9713-ec44d21958e3.jpg')
# image = Image.open(BytesIO(r.content))
# image.save('meinv.jpg')


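# JSON handling

r = requests.get('https://github.com/timeline.json')
print(type(r.json()))  # r.json is a method; calling it parses the body and raises if the body is not valid JSON
print(r.text)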

 

# Raw response data
# Streaming the body to a file
r = requests.get('http://i-2.shouji56.com/2015/2/11/23dab5c5-336d-4686-9713-ec44d21958e3.jpg', stream=True)
with open('meinv2.jpg', 'wb+') as f:
    for chunk in r.iter_content(1024):
        f.write(chunk)
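The chunked-write loop can be factored into a small helper that works on any iterable of byte chunks. The sketch below uses only the standard library; save_chunks, fake_body, and the temporary file path are hypothetical names for illustration, and the generator stands in for r.iter_content(1024) so the example runs without a network connection.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for r.iter_content(1024): 2500 bytes split into 1024-byte pieces.
fake_body = b'x' * 2500
fake_chunks = (fake_body[i:i + 1024] for i in range(0, len(fake_body), 1024))

target = os.path.join(tempfile.mkdtemp(), 'demo.bin')
total = save_chunks(fake_chunks, target)
print(total)
```

With a real response, the call would be save_chunks(r.iter_content(1024), 'meinv2.jpg'), which keeps at most one chunk in memory at a time.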


# Submitting a form

form = {'username': 'user', 'password': 'pass'}
r = requests.post('http://httpbin.org/post', data=form)
print(r.text)
Result: the parameters are submitted as form data, so they appear in the form field of the response.

 


r = requests.post('http://httpbin.org/post', data=json.dumps(form))
print(r.text)
Result: the parameters are not submitted as a form, so they appear in the json field.
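The difference between the two POST calls comes down to how the request body is encoded. The following sketch prepares the requests without sending them and inspects the body and Content-Type header. The third variant, using the json= keyword, is not in the original notes; it is included as an assumption about what is often wanted when posting JSON.

```python
import json
import requests

form = {'username': 'user', 'password': 'pass'}
url = 'http://httpbin.org/post'

# Prepare (but do not send) three variants of the same POST.
as_form = requests.Request('POST', url, data=form).prepare()
as_text = requests.Request('POST', url, data=json.dumps(form)).prepare()
as_json = requests.Request('POST', url, json=form).prepare()

# A dict passed as data= is form-encoded.
print(as_form.headers.get('Content-Type'), as_form.body)
# A string passed as data= is sent as-is, with no Content-Type set by requests.
print(as_text.headers.get('Content-Type'), as_text.body)
# json= serializes the dict and sets the JSON Content-Type.
print(as_json.headers.get('Content-Type'))
```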

# Cookies

url = 'http://www.baidu.com'
r = requests.get(url)
cookies = r.cookies
# r.cookies is a RequestsCookieJar, which behaves like a dict
for k, v in cookies.get_dict().items():
    print(k, v)
Result: each cookie is a key-value pair.

 


cookies = {'c1': 'v1', 'c2': 'v2'}
r = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(r.text)

 

# Redirects and redirect history

r = requests.head('http://github.com', allow_redirects=True)
print(r.url)
print(r.status_code)
print(r.history)
Result: the request is redirected with a 301.

 

# # Proxies
#
# proxies = {'http': '...', 'https': '...'}
# r = requests.get('...', proxies=proxies)

    

2. The BeautifulSoup library

Example HTML (saved as test.html):

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

The parsing code is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('test.html'), 'html.parser')
# Pretty-print the HTML so its structure is easier to read
# print(soup.prettify())

# Tag
print(type(soup.title))
Result: a bs4 class (bs4.element.Tag).
print(soup.title.name)
print(soup.title)

 

# String

print(type(soup.title.string))
print(soup.title.string)
Result: only the text inside the tag is shown.

 

# Comment

print(type(soup.a.string))
print(soup.a.string)
Result: the text inside the comment is shown, so you sometimes need to check whether the retrieved content is a comment.
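One way to perform that check is to test the string against bs4's Comment class. The sketch below parses a shortened copy of the example HTML from a string rather than from test.html; the helper name visible_text is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>"""

soup = BeautifulSoup(html, 'html.parser')


def visible_text(tag):
    """Return the tag's single string, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


# The first link contains only a comment, so it yields None.
texts = [visible_text(a) for a in soup.find_all('a')]
print(texts)
```

Comment is a subclass of NavigableString, so checking for NavigableString alone would not tell the two apart.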

 

# Child nodes of body

for item in soup.body.contents:
    print(item.name)

Result: body has three items.

 

# CSS selectors

print(soup.select('.sister'))
Result: the class selector returns every element with the given class, as a list.

 

print(soup.select('#link1'))
Result: the ID selector selects the element whose id is link1.

 

print(soup.select('head > title'))

 



a_s = soup.select('a')
for a in a_s:
    print(a)
Result: the tag selector selects all a tags.
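In a scraper, the selected tags are usually reduced to their attributes or text. The sketch below combines a tag selector with a class selector and reads the id and href attributes; it parses a shortened copy of the example HTML from a string, and the variable names links and second are illustrative.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')

# select() returns a list of Tag objects; attributes are read like dict keys.
links = [(a['id'], a['href']) for a in soup.select('a.sister')]
print(links)

# select_one() returns the first match, or None if nothing matches.
second = soup.select_one('#link2')
print(second.get_text())
```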

 


 

To be continued. You are welcome to follow my WeChat public account, LHWorld.


Original article: http://www.cnblogs.com/LHWorldBlog/p/7842267.html
