Python Web Crawler (Part 1)

Date: 2016-03-31 23:23:36


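　　This post is mainly a record of my learning process, essentially a set of study notes.

　　It is based mainly on Cui Qingcai's Python crawler tutorial series (http://cuiqingcai.com/1052.html).

　　It covers some basics of crawling with Python and a worked example that scrapes Qiushibaike (糗事百科):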

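　　I. Basics

　　　　1. Crawler: a spider that sits on the web and fetches any resource it is looking for when it comes across it.

　　　　2. What happens when you browse a page: the user enters a web address -> a DNS server resolves the host name -> the server host is located -> a request is sent to the server -> the server processes the request -> the server sends the corresponding files back to the browser -> the browser parses them.

　　　　3. URL: Uniform Resource Locator (web address): a representation of where a resource is located on the Internet and how to access it; it is the standard address of a resource on the Internet. Every file on the Internet has a URL. (protocol + host name or IP address (sometimes with a port number) + path to the resource)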
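To make the three parts of a URL concrete, here is a small sketch that splits one apart with the standard library. The address is a made-up example, not one from the tutorial, and the import falls back to the Python 2 module name because the rest of these notes use Python 2.

```python
# Sketch: split a URL into protocol, host, port, and path.
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

# Hypothetical URL, used only for illustration.
example_url = "http://www.example.com:8080/hot/page/2?sort=new"

parts = urlparse(example_url)
scheme = parts.scheme      # protocol
host = parts.hostname      # host name (or IP address)
port = parts.port          # port number, None when the URL omits it
path = parts.path          # location of the resource on that host
query = parts.query        # parameters after the "?"
```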

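　　II. Using the urllib and urllib2 libraries (Python 2):

　　　　urlopen(url, data, timeout): data is the data to send when accessing the URL; timeout sets the timeout in seconds (it has a default value).

        response = urllib2.urlopen(URL)
        print response.read()   # prints the content of the fetched page
        print response          # prints a description of the response object

　　　　(My own understanding: the difference is like that between a pointer and the content it points to.)

　　　　A Request object can also be built first and then passed to urlopen:

        request = urllib2.Request(URL)
        response = urllib2.urlopen(request)

　　　　(A request is built, the server responds, and the user receives the data.)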
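The request/response flow above can be sketched as two small functions. This is only an illustration: `build_request` and `fetch` are names I made up, the address is a placeholder, and the 10-second timeout is an arbitrary choice. The imports fall back to the Python 2 module used in these notes.

```python
try:
    from urllib.request import Request, urlopen   # Python 3
except ImportError:
    from urllib2 import Request, urlopen          # Python 2


def build_request(url):
    """Create the request object; nothing is sent yet."""
    return Request(url)


def fetch(url, timeout=10):
    """Send the request and return the raw body as bytes."""
    response = urlopen(build_request(url), timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


# Hypothetical address, used only for illustration.
request = build_request("http://www.example.com/")
method = request.get_method()      # "GET", because no data was attached
full_url = request.get_full_url()
```

Calling `fetch(...)` is what actually touches the network; building the request on its own does not.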

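　　　　POST and GET:

　　　　GET: the parameters are part of the link itself, so they are visible in the URL. POST sends them in the request body, so they do not appear in the URL.

　　　　POST:

        values = {"name": "213.@qq.com", "pwd": "xxx"}
        data = urllib.urlencode(values)   # think of this as serializing the dict
        url = "URL"
        request = urllib2.Request(url, data)
        response = urllib2.urlopen(request)

　　　　GET:

        values = {"name": "213.@qq.com", "pwd": "xxx"}
        data = urllib.urlencode(values)   # think of this as serializing the dict
        url = "URL"
        gurl = url + "?" + data
        request = urllib2.Request(gurl)
        response = urllib2.urlopen(request)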
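The difference between the two methods can be checked without sending anything. The sketch below reuses the placeholder form values from above; the login address is an assumption of mine, and a list of pairs is used instead of a dict only to keep the parameter order fixed.

```python
try:
    from urllib.parse import urlencode            # Python 3
    from urllib.request import Request
except ImportError:
    from urllib import urlencode                  # Python 2
    from urllib2 import Request

# Placeholder form fields; the address is hypothetical.
fields = [("name", "213.@qq.com"), ("pwd", "xxx")]
base_url = "http://www.example.com/login"

# urlencode turns the pairs into "key=value&key=value" and escapes "@".
encoded = urlencode(fields)

# POST: the encoded data travels in the request body.
post_request = Request(base_url, encoded.encode("ascii"))

# GET: the encoded data is appended to the URL after "?".
get_request = Request(base_url + "?" + encoded)
```

Attaching data is what switches the request to POST; the URL of the POST request stays free of parameters.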

　　　　Setting headers: to imitate a browser, the request needs to identify itself with a User-Agent header.

        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        data = DATA
        request = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(request)
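A quick way to confirm that the header is really attached is to read it back from the Request object. This is a sketch with a placeholder address; the one thing to watch for is that the header name is stored capitalized as `User-agent`.

```python
try:
    from urllib.request import Request   # Python 3
except ImportError:
    from urllib2 import Request          # Python 2

user_agent = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
headers = {"User-Agent": user_agent}

# Hypothetical address, used only for illustration.
request = Request("http://www.example.com/", headers=headers)

# Request stores header names via str.capitalize(), i.e. "User-agent".
stored = request.get_header("User-agent")
```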

　　　　Proxy: switch to a different proxy at regular intervals:

        enable_proxy = True
        proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
        null_proxy_handler = urllib2.ProxyHandler({})
        if enable_proxy:
            opener = urllib2.build_opener(proxy_handler)
        else:
            opener = urllib2.build_opener(null_proxy_handler)
        # Make this opener the default used by urllib2.urlopen.
        urllib2.install_opener(opener)
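One possible way to rotate proxies is to cycle through a list and build a fresh opener each time. The proxy addresses and the helper names `next_opener` and `proxy_of` are hypothetical, and rotating on every call is a simplification of "every so often".

```python
import itertools

try:
    from urllib.request import ProxyHandler, build_opener   # Python 3
except ImportError:
    from urllib2 import ProxyHandler, build_opener          # Python 2

# Hypothetical proxy addresses, used only for illustration.
PROXIES = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Build an opener that routes HTTP traffic through the next proxy."""
    proxy_url = next(_proxy_cycle)
    return build_opener(ProxyHandler({"http": proxy_url}))


def proxy_of(opener):
    """Return the HTTP proxy an opener is configured with, or None."""
    for handler in opener.handlers:
        if isinstance(handler, ProxyHandler):
            return handler.proxies.get("http")
    return None
```

Using `opener.open(url)` directly keeps the choice of proxy local to one call, whereas `install_opener` changes the default for every later `urlopen`.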

　　Error handling:
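The notes leave this heading empty. As a sketch of my own, the helper below applies the same `hasattr` checks that the crawler example at the end of this post uses. `describe_error` is a made-up name, and the two exceptions are built by hand so that the snippet runs without network access.

```python
try:
    from urllib.error import HTTPError, URLError   # Python 3
except ImportError:
    from urllib2 import HTTPError, URLError        # Python 2


def describe_error(error):
    """Collect whatever details the exception carries."""
    details = []
    if hasattr(error, "code"):       # only HTTPError has a status code
        details.append("code: %s" % error.code)
    if hasattr(error, "reason"):
        details.append("reason: %s" % error.reason)
    return ", ".join(details)


# Hand-built exceptions; the URL is hypothetical.
no_route = URLError("connection refused")
not_found = HTTPError(
    "http://www.example.com/missing", 404, "Not Found", {}, None
)
```

HTTPError is a subclass of URLError, so a single `except URLError` clause catches both kinds.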

　　Cookie:

　　And a crawler example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 22 19:44:06 2016

@author: mz
"""

import re
import urllib2

page = 2
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # Earlier attempt at the pattern, kept for reference:
    #pattern = re.compile('<div class="author clearfix>.*?title="(.*?)">\n<h2>.*?"content">(.*?)\n<!--.*?<span class="stats-vote"<i class="number">(*?)</i>\s[\u4e00-\u9fa5][\u4e00-\u9fa5].*?tagert="_blank">\n<i class="number">(.*?)</i>\s[\u4e00-\u9fa5][\u4e00-\u9fa5]\n</a>',re.S)
    # Capture groups: author, post text, vote count, comment count.
    # re.S lets "." match newlines so the pattern can span several lines.
    pattern = re.compile('<div class="author clearfix">.*?title.*?>\n<h2>(.*?)</h2>.*?<div class="content">(.*?)<!--.*?-->.*?<span class="stats-vote"><i class="number">(.*?)</i>.*?<span class="dash">.*?<i class="number">(.*?)</i>.*?', re.S)
    items = re.findall(pattern, content)
    for item in items:
        print item[0], item[1], item[2], item[3]
    print "no"
except urllib2.URLError as e:
    # HTTPError carries a status code; a plain URLError only has a reason.
    if hasattr(e, 'code'):
        print e.code
    if hasattr(e, 'reason'):
        print e.reason
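To see what the regular expression extracts without hitting the live site, it can be run against a small piece of sample markup. The HTML below is something I wrote to fit the pattern; it is an assumption about the page layout, not a copy of the real Qiushibaike page, and `extract_items` is a made-up helper name. The pattern is the one from the script, minus the trailing `.*?`, which always matches nothing.

```python
import re

PATTERN = re.compile(
    r'<div class="author clearfix">.*?title.*?>\n<h2>(.*?)</h2>'
    r'.*?<div class="content">(.*?)<!--.*?-->'
    r'.*?<span class="stats-vote"><i class="number">(.*?)</i>'
    r'.*?<span class="dash">.*?<i class="number">(.*?)</i>',
    re.S,  # let "." match newlines so one match can span many lines
)

# Hypothetical markup for a single post, shaped to fit the pattern.
ENTRY_TEMPLATE = """<div class="author clearfix">
<a href="/users/{uid}" title="{author}">
<h2>{author}</h2>
</a>
</div>
<div class="content">
{text}
<!--{stamp}-->
</div>
<div class="stats">
<span class="stats-vote"><i class="number">{votes}</i> votes</span>
<span class="stats-comments">
<span class="dash">-</span>
<a href="/article/{uid}" target="_blank">
<i class="number">{comments}</i> comments
</a>
</span>
</div>
"""

sample_html = (
    ENTRY_TEMPLATE.format(uid=1, author="alice", text="First joke.",
                          stamp=1458650000, votes=120, comments=8)
    + ENTRY_TEMPLATE.format(uid=2, author="bob", text="Second joke.",
                            stamp=1458650100, votes=45, comments=3)
)


def extract_items(html):
    """Return (author, text, votes, comments) tuples found in the page."""
    return [
        (author.strip(), text.strip(), votes, comments)
        for author, text, votes, comments in PATTERN.findall(html)
    ]


items = extract_items(sample_html)
```

Because every `.*?` is lazy, each match stops at the nearest following marker, which is what keeps the fields of one post from bleeding into the next.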


Original article: http://www.cnblogs.com/muzhiwan/p/5343044.html
