getshops

Posted: 2015-07-16 07:12:31


#!/usr/bin/env python
# encoding=utf-8
import os
import re
import urllib2  # Python 2 only
# url = "http://www.dianping.com/search/category/1/20/g122"

def httpCrawler(url):
    # Pipeline: download the page, extract data, write it to disk.
    content = httpRequest(url)
    info = parseHtml(content)
    saveData(info)
    
def httpRequest(url):
    html = None
    resp = None  # set up front so the finally block is safe if urlopen raises
    try:
        req_header = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0',
            #'Accept': 'text/html;q=0.9,*/*;q=0.8',
            #'Accept-Language': 'en-US,en;q=0.5',
            #'Accept-Encoding': 'gzip',
            #'Host': 'j3.s2.dpfile.com',
            #'Connection': 'keep-alive',
            #'Referer': 'http://www.baidu.com'
        }
        req_timeout = 5  # seconds
        req = urllib2.Request(url, None, req_header)
        resp = urllib2.urlopen(req, None, req_timeout)
        html = resp.read()
        print html
    finally:
        if resp:
            resp.close()
    return html

def parseHtml(html):
    # Returns the text of the page's <title> element, or None if not found.
    content = None
    pattern = '<title>([^<]*?)</title>'
    temp = re.findall(pattern, html)
    if temp:
        content = temp[0]
    # Fields still to be extracted for each shop:
    #   province, city, adminDistrict, businessDistrict,
    #   businessName, address, averageConsumption
    return content

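def saveData(data):
    if data is None:
        return  # nothing was parsed, so there is nothing to write
    if not os.path.exists('./zhubao'):
        os.mkdir('./zhubao')
    # The with statement closes the file even if write() raises.
    with open('./zhubao/zhubao_shops.csv', 'wb') as f:
        f.write(data)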

if __name__ == '__main__':
    url = "http://www.dianping.com/search/category/1/20/g122"
    httpCrawler(url)
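
The field names listed in parseHtml (province, city, and so on) suggest that the CSV file is meant to hold one row per shop. The following is a minimal sketch of how such rows might be written with the standard csv module. It assumes Python 3 (unlike the Python 2 script above) and assumes each shop has already been parsed into a dict; the name save_shops and the FIELDS list are hypothetical and not part of the original script.

```python
import csv
import os

# Column order taken from the field list in parseHtml.
FIELDS = ["province", "city", "adminDistrict", "businessDistrict",
          "businessName", "address", "averageConsumption"]

def save_shops(shops, path):
    """Write a list of shop dicts to a UTF-8 CSV file with a header row."""
    directory = os.path.dirname(path)
    if directory and not os.path.exists(directory):
        os.makedirs(directory)
    # newline="" lets the csv module control line endings.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for shop in shops:
            # Fields missing from a shop are written as empty cells.
            writer.writerow({key: shop.get(key, "") for key in FIELDS})
```

A call such as save_shops(shops, "./zhubao/zhubao_shops.csv") would then replace the single write of the page title.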



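'''
Python 2.6 has no urllib.request.
Multithreading
gevent
Basic structure of a crawler system:
1. Network requests.
The simplest tools are urllib and urllib2, which cover basic downloading. For concurrent downloads, the next step is multithreading; for higher efficiency, non-blocking downloads can be done with Tornado or curl.
2. Extracting structured data.
Finding new links in a page requires parsing the page and deduplicating URLs. Both regular expressions and DOM parsing can handle the parsing; use whichever you know better.
Regular expressions tend to be faster, while DOM parsing is slower and somewhat more complex. If you only need URLs, regular expressions are enough; if you also need other structure or content from the page, DOM parsing is more convenient.
For URL deduplication, memcache or Redis works at small volumes; at large volumes you need a Bloom filter.
3. Data storage.
For a small crawl, any storage will do. For a large crawl that must stay easy to read back, the design needs care: whether to hash-shard across an RDBMS or write directly to HBase depends on your data volume and specific requirements.
'''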
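
The notes recommend a Bloom filter for deduplicating URLs at large volumes. The following is a minimal in-memory sketch of that idea, written for Python 3. The class name BloomFilter, the helper unseen, the default sizes, and the choice of SHA-256 with double hashing are all assumptions made for illustration; a production crawler would size the filter from the expected URL count and an acceptable false-positive rate.

```python
import hashlib

class BloomFilter(object):
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from two slices of one SHA-256
        # digest (double hashing: h1 + i * h2).
        digest = hashlib.sha256(item.encode("utf-8")).hexdigest()
        h1 = int(digest[:16], 16)
        h2 = int(digest[16:32], 16)
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def unseen(urls, seen):
    """Yield each URL that is not (probably) in `seen`, then record it."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url
```

Because a Bloom filter can report false positives, a small fraction of new URLs may be skipped as already seen. At small volumes an exact set, or a Redis set as the notes suggest, avoids that trade-off.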

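The notes also name multithreading as the first step beyond sequential downloads. The following is a minimal sketch of that step using multiprocessing.dummy, the thread-backed pool in the standard library. The name fetch_all and the workers default are hypothetical; fetch stands for any callable that takes a URL and returns the page body, such as the script's httpRequest.

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, backed by threads

def fetch_all(urls, fetch, workers=4):
    """Call fetch(url) for every URL on a pool of worker threads.

    Returns a list of (url, result) pairs in input order. A download
    that raises yields (url, None) instead of aborting the whole batch.
    """
    def safe_fetch(url):
        try:
            return url, fetch(url)
        except Exception:
            return url, None

    pool = Pool(workers)
    try:
        return pool.map(safe_fetch, urls)
    finally:
        pool.close()
        pool.join()
```

Threads suit this workload because each download spends most of its time waiting on the network rather than running Python code.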


Original source: http://www.cnblogs.com/x113/p/4650009.html
