Web Crawler Beginner Notes

Posted: 2017-07-29 16:37:40

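A web crawler is like a spider crawling across the network: whenever it comes across a resource, it fetches that resource according to the rules it was given.

After a crawler downloads a page's HTML, it parses and filters that HTML to extract resources such as images and text.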
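The parse-and-filter step can be sketched with the standard library's `html.parser` module. This is only a minimal illustration, not the method used later in these notes: the class name `ResourceExtractor`, the helper `extract_resources`, and the sample HTML string are all hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class ResourceExtractor(HTMLParser):
    """Collects image URLs and visible text fragments from HTML."""

    def __init__(self):
        super().__init__()
        self.images = []   # values of <img src="...">
        self.texts = []    # non-empty text fragments
        self._skip = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only fragments and script/style bodies
        if text and not self._skip:
            self.texts.append(text)


def extract_resources(html):
    """Return (image_urls, text_fragments) found in an HTML string."""
    parser = ResourceExtractor()
    parser.feed(html)
    parser.close()
    return parser.images, parser.texts


# Hypothetical sample page, used only for illustration
sample = '<html><body><h1>Hello</h1><img src="/logo.png"><p>Some text</p></body></html>'
images, texts = extract_resources(sample)
print(images)   # ['/logo.png']
print(texts)    # ['Hello', 'Some text']
```

A real parser is usually more robust than a regular expression for this kind of filtering, because it understands tags and attributes rather than matching raw text.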

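A URL consists of three parts:

  1. The first part is the protocol (scheme).

  2. The second part is the IP address (or host name) and port of the host that stores the resource.

  3. The third part is the specific location of the resource on that host, such as the directory and file name.

A crawler must have a target URL before it can fetch any data, so the URL is the foundation of everything a crawler retrieves.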
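The three parts described above can be inspected with `urllib.parse.urlsplit` from the standard library. The sketch below is illustrative only: the function name `describe_url` and the example address are assumptions, not something from the original notes.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the three parts described above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # part 1: the protocol
        "host": parts.hostname,     # part 2: the host ...
        "port": parts.port,         # ... and its port (None if not given)
        "path": parts.path,         # part 3: directory and file name
    }


# Hypothetical address, used only for illustration
info = describe_url("http://www.example.com:8080/docs/index.html")
print(info)
```

When the URL does not specify a port, `port` is `None` and the protocol's default port applies (80 for HTTP, 443 for HTTPS).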

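import re
import urllib.request
from collections import deque

queue = deque()    # URLs waiting to be crawled
visited = set()    # URLs that have already been queued

# Compile the link pattern once instead of on every iteration
linkre = re.compile(r'href="(.+?)"')

url = 'https://jecvay.com/'

queue.append(url)
visited.add(url)
cnt = 0

while queue:
    url = queue.popleft()    # first in, first out: breadth-first order

    print('Count: ' + str(cnt) + ' visiting <--- ' + url)
    cnt += 1

    try:
        urlop = urllib.request.urlopen(url, timeout=10)
    except Exception:
        continue    # skip URLs that cannot be opened

    # getheader() returns None when the header is missing
    if 'html' not in (urlop.getheader('Content-Type') or ''):
        continue

    try:
        data = urlop.read().decode('utf-8')
    except Exception:
        continue    # skip pages that cannot be read or are not valid UTF-8

    for x in linkre.findall(data):
        if 'http' in x and x not in visited:
            visited.add(x)    # mark on enqueue so each link is queued only once
            queue.append(x)
            print('add---> ' + x)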
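The crawler above keeps only links that contain `http`, so relative links such as `/about` are silently dropped. One possible refinement, sketched here under the assumption that relative links should be followed too, is to resolve every `href` against the page URL with `urllib.parse.urljoin`. The helper name `normalize_link` is hypothetical and not part of the original code.

```python
from urllib.parse import urljoin, urldefrag


def normalize_link(base_url, href):
    """Resolve href against base_url; return None for non-HTTP links."""
    absolute = urljoin(base_url, href)
    # Drop the "#fragment" so the same page is not queued twice
    absolute, _fragment = urldefrag(absolute)
    if absolute.startswith(("http://", "https://")):
        return absolute
    return None   # e.g. mailto: or javascript: links


print(normalize_link("https://jecvay.com/a/b.html", "../c.html#top"))
print(normalize_link("https://jecvay.com/a/b.html", "/about"))
print(normalize_link("https://jecvay.com/", "mailto:someone@example.com"))
```

Inside the loop, each `x` would be passed through `normalize_link(url, x)` and queued only when the result is not `None` and not already in `visited`.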

Original article: http://www.cnblogs.com/m2492565210/p/7251285.html