[00] First Attempt at a Python Web Crawler

Date: 2016-11-12 23:21:33

My blog post No. 00

First attempt at a Python web crawler:

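This Thursday's class covered regular expressions, so that evening I started feeling my way toward writing a web crawler. The crawler starts from a given page, collects every link on that page, then visits those links to collect new ones, repeating the process and saving every link it finds. It has no real practical use yet, but features such as searching for specific information could be built on top of it later.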
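The crawl loop described above can be sketched without any network access. This is only a minimal illustration: the `crawl` function, the `fetch_links` callback, and the `fake_web` dictionary are hypothetical names introduced here, and the in-memory dictionary stands in for real pages.

```python
def crawl(start, fetch_links, max_pages):
    """Collect links page by page, starting from `start`.

    `fetch_links` is a hypothetical callback that returns the links
    found on one page; in the real crawler this would download the
    page and run the regular expression over it.
    """
    links = [start]          # every link seen so far, in discovery order
    x = 0                    # index of the next link to visit
    while x < max_pages and x < len(links):
        for link in fetch_links(links[x]):
            if link not in links:        # keep only links not seen before
                links.append(link)
        x += 1
    return links


# Assumption: a tiny made-up "web" used instead of real HTTP requests
fake_web = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": [],
}

result = crawl("http://a.example/", lambda url: fake_web.get(url, []), max_pages=10)
print(result)
```

Because new links are appended to the end of the same list that is being walked, pages are visited in the order they were discovered, and the loop ends once the page limit is reached or no unvisited links remain.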

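This Python program uses the following modules: urllib, re, time

urllib: provides the urlopen function (in urllib.request) for opening links

re: compiles and applies the regular expression

time: used for timing [optional]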
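As a rough illustration of what each module contributes, the snippet below uses all three on a tiny input. It is a sketch under stated assumptions: a `data:` URL stands in for a real web page so that nothing is fetched over the network, and the `\S+` pattern is a deliberately simple placeholder, not the expression used by the crawler.

```python
from urllib.request import urlopen
import re
import time

# Assumption: a data: URL replaces a real page, so no network is needed
sample_url = "data:text/plain;charset=utf-8,see%20http://example.com/%20for%20details"

start = time.perf_counter()                                     # time: start the timer
page = urlopen(sample_url, timeout=10).read().decode("utf-8")   # urllib: open and read
found = re.findall(r"https?://\S+", page)                       # re: find link-like text
elapsed = time.perf_counter() - start                           # time: stop the timer

print(page)
print(found)
print("took {:.3f}s".format(elapsed))
```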

Here is my code:

from urllib.request import urlopen
import re
import time

# Compiled once up front; the expression is explained after the listing
patten = re.compile(r'https?://[^\\\'"\.].+?[^\\\'"](?:/|com|org|net|cn|cc|tv)')

links = [input("Enter the start URL: ")]    # all links collected so far
xn = int(input("Enter the maximum number of pages to crawl: "))
x = 0                                       # index of the link being crawled
connected = 0                               # number of pages fetched successfully
num = 0                                     # index of the first link not yet written out

with open("pc00_result.txt", "w", encoding="utf-8") as fo:   # text file that stores the results
    def log(text=""):
        # Print a line to the console and write the same line to the result file
        print(text)
        fo.write(text + "\n")

    start = time.perf_counter()             # start timing
    while x < xn and x < len(links):        # stop at the page limit or when no links are left
        log("No.{}".format(x))
        log("Connecting to {}".format(links[x]))
        try:
            temp = urlopen(links[x], timeout=10)        # open the link, 10-second timeout
            temp = temp.read().decode("utf-8")          # read the page and decode it as UTF-8
            log("Connected to {}".format(links[x]))
            log("Parsing {}".format(links[x]))
            temp0 = re.findall(patten, temp)            # match against the page content
            connected += 1                              # one more successful connection
            for link in temp0:
                if link not in links:                   # store only new links
                    links.append(link)
            for j in range(num, len(links)):
                log(links[j])                           # output the links gained this round
            num = len(links)
            log()
        except Exception:
            log("Failed to connect to or parse {}\n".format(links[x]))
        x += 1
    end = time.perf_counter()               # stop timing
    log("Crawl finished!")
    log("Crawled {} pages and collected {} links".format(connected, len(links) - 1))
    log("Elapsed time: {:.3f}s".format(end - start))

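The key to this crawler is the regular expression:

patten = re.compile(r'https?://[^\\\'"\.].+?[^\\\'"](?:/|com|org|net|cn|cc|tv)')

This statement compiles the regular expression into a regular expression object and stores it in the variable patten.

The core expression is: https?://[^\\\'"\.].+?[^\\\'"](?:/|com|org|net|cn|cc|tv)

It matches links that begin with http, optionally followed by s (https), then ://, and that end with /, com, org, net, cn, cc, or tv.

The [^\\\'"\.] in the middle means that http(s):// cannot be followed directly by any of the four characters \ ' " .

.+? matches one or more of any character (except a newline) non-greedily, so the match stops at the first ending that fits.

[^\\\'"] means that the character immediately before the com-style ending cannot be \ ' or "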
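To see how the expression behaves in practice, here is a small check that runs it over a few sample URLs. The sample strings are chosen for illustration only (the last one is this post's own blog address); `samples` and `results` are names introduced here.

```python
import re

# The same expression as in the crawler, with plain ASCII quotes
patten = re.compile(r'https?://[^\\\'"\.].+?[^\\\'"](?:/|com|org|net|cn|cc|tv)')

# Assumption: sample inputs picked to show typical matches
samples = [
    "http://www.example.com/page",
    "https://docs.python.org/3/",
    "http://example.io/about",
    "http://www.cnblogs.com/stevehawk/",
]

results = {s: patten.findall(s) for s in samples}
for s, found in results.items():
    print(s, "->", found)
# http://www.example.com/page -> ['http://www.example.com']
# https://docs.python.org/3/ -> ['https://docs.python.org']
# http://example.io/about -> ['http://example.io/']
# http://www.cnblogs.com/stevehawk/ -> ['http://www.cn']
```

Each match ends at the first allowed ending, so paths after the domain are dropped. The endings are also not tied to a preceding dot, which is why www.cnblogs.com is cut short at the "cn" inside "cnblogs".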

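This expression took me a lot of effort to write, and it still produces a certain share of wrong matches. I don't yet know of a way to fix this.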
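One possible direction, which the code above does not use, is to read links from the href attributes of a tags with the standard library's HTML parser instead of matching raw text. The sketch below is only an assumption about how that might look: `LinkCollector`, `extract_links`, and the sample HTML are hypothetical names and data introduced for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url     # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Keep only web links, and skip ones already collected
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Assumption: a made-up HTML fragment used as test input
sample = (
    '<a href="http://www.cnblogs.com/stevehawk/">blog</a> '
    '<a href="/p/6057593.html">post</a> '
    '<a href="mailto:someone@example.com">mail</a>'
)
extracted = extract_links(sample, "http://www.cnblogs.com/")
print(extracted)
```

This keeps the full path of each link and resolves relative links against the page address, at the cost of only seeing links that appear in a tags.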

That's all.


Original post: http://www.cnblogs.com/stevehawk/p/6057593.html
