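Our boss gave me a book, Spidering Hacks, and said that once I have learned what's in it I can go anywhere without fear! ---- So I took a look: an English book of 400-plus pages. I wanted to buy the paper edition, but it is too expensive and I can't afford it.
OK, I read the table of contents first, then the section headings, then how the book explains each heading; a section heading is just the main idea, after all! Hehe, here we go....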
Spidering Hacks study notes (1)

I: Overview:
chapter1: (basics, philosophies, considerations, issues)
chapter2: (spidering toolbox, modules galore (plentiful), prominent (standing out))
chapter3: (media files, Library of Congress (the US national library))
chapter4: (getting at information that takes more than plain scraping)
chapter5: (keep data current, mirror collections to hard disk, schedule spiders)
chapter6: (share your own data so it can be spidered)

II: Chapter 1

1: Hack 1: (traverse the Web)
Q1: What is the difference between spiders and scrapers?
Spiders are programs that grab entire pages, files, or sets of either, while scrapers grab very specific bits of information within these files.
Q2: Why spider?
(1) Gain automated access to resources
(2) Gather information and present it in an alternate format
(3) Aggregate otherwise disparate data sources
(4) Combine the functionalities of sites (e.g. pulling the results of many search engines together)
(5) Find and gather specific kinds of information
(6) Perform regular webmaster functions (take over part of a site administrator's duties)

2: Hack 2: Best Practices for You and Your Spider
(1) Be Liberal in What You Accept (there will be many formats, HTML, XML, ...; they need converting, and you need boundaries)
(2) Don't Limit Your Dataset
(3) Don't Reinvent the Wheel (don't tear everything down and start over; borrow from other people's spider scripts!)
Q3: Best practices for you and your spider (a few points to watch):
Choose the most structured format available
If you must scrape HTML, do so sparingly (The less HTML, the less fragile your spider will be!!! heh)
Don't go where you're not wanted
Choose a good identifier
Make information on your spider readily available
Don't demand unlimited site access or support
Go light on the bandwidth (know when to stop crawling, and keep an eye on bandwidth)
Take just enough, and don't take too often

3: Hack 3: Anatomy of an HTML Page
<html> <head> <title> Title </title> </head> <body> body </body> </html>
(a) Header Information with the H Tags: <H1>, <H2> give the heading levels
(b) List Information with Special HTML Tags: an ordered list is <ol> <li> </li> </ol>; you can grab everything between <ol> and </ol>, then parse each <li></li> element into an array
(c) Non-HTML Files: XML's parts are defined more rigidly than HTML. Using XML::RSS to Repurpose Everything. Perl XML modules (such as XML::Simple, XML::RSS, or XML::LibXML)

4: Hack 4: Registering Your Spider
(a) Naming your spider (pick a meaningful name!)
(b) A Web Page About Your Spider
(c) Places to Register Your Spider

5: Hack 5: Preempting Discovery
(a) Making Contact: tell people about your spider and how to contact you!!
(b) Making the Arguments for Your Spider: tell people what it does
(c) Making Your Spider Easy to Find and Learn About
(d) Considering Legal Issues

6: Hack 6: Keeping Your Spider Out of Sticky Situations
(a) Bad Spider, No Biscuit! (stresses not doing anything harmful)
(b) Violating Copyright (don't take other people's intellectual property, or you may end up breaking the law!)
(c) Aggregating Data
(d) Competitive Intelligence (competitors)
(e) Possible Consequences of Misbehaving Spiders (the book says the police will come knocking on your door!)
(f) Tracking Legal Issues (keep up with the legal side)

7: Hack 7: Finding the Patterns of Identifiers
(nothing much to say!)
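
To make the Hack 3 note about lists concrete, here is a minimal sketch of "grab everything between <ol> and </ol>, parse each <li> into an array". The book works in Perl; this is my own Python rendering using only the standard library's html.parser, and the names OrderedListParser, parse_ordered_list and SAMPLE_PAGE are made up for illustration. It assumes simple, non-nested lists.

```python
from html.parser import HTMLParser


class OrderedListParser(HTMLParser):
    """Collect the text of every <li> that sits inside an <ol>."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished list items, in document order
        self._ol_depth = 0     # greater than 0 while inside an <ol>
        self._current = None   # text pieces of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self._ol_depth += 1
        elif tag == "li" and self._ol_depth > 0:
            self._close_item()  # tolerate a missing </li>
            self._current = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._close_item()
        elif tag == "ol" and self._ol_depth > 0:
            self._close_item()
            self._ol_depth -= 1

    def handle_data(self, data):
        # Text only counts while an <li> inside an <ol> is open.
        if self._current is not None:
            self._current.append(data)

    def _close_item(self):
        if self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None


def parse_ordered_list(html):
    """Return the <ol> items of an HTML string as a list of strings."""
    parser = OrderedListParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page; the item texts are taken from the Hack 2 notes.
SAMPLE_PAGE = """
<html><head><title>Title</title></head>
<body>
<h1>Best practices</h1>
<ol>
  <li>Be Liberal in What You Accept</li>
  <li>Don't Limit Your Dataset</li>
  <li>Don't <b>Reinvent</b> the Wheel</li>
</ol>
<ul><li>not part of the ordered list</li></ul>
</body></html>
"""

if __name__ == "__main__":
    for item in parse_ordered_list(SAMPLE_PAGE):
        print(item)
```

Using a real parser instead of a regular expression is in the spirit of "the less HTML, the less fragile your spider will be": inline tags such as <b> and stray whitespace do not break the result.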
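
Several of the Hack 2 points (choose a good identifier, don't go where you're not wanted, go light on the bandwidth) can be sketched in a few lines. This is only my own illustration in Python, not code from the book: the spider name, the example.com URLs and the robots.txt content are all hypothetical, and the helper names load_rules, crawl and http_fetch are invented here. Only standard-library calls (urllib.robotparser, urllib.request) are used.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identifier: a meaningful name plus a page describing the spider.
USER_AGENT = "ExampleNotesSpider/0.1 (+http://example.com/spider.html)"

# Hypothetical robots.txt, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleNotesSpider
Crawl-delay: 2
Disallow: /private/

User-agent: *
Disallow: /
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def http_fetch(url):
    """Fetch one URL, announcing who we are through the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def crawl(rules, urls, fetch=http_fetch, sleep=time.sleep, default_delay=1.0):
    """Fetch only the allowed URLs, pausing after each request."""
    delay = rules.crawl_delay(USER_AGENT)
    if delay is None:
        delay = default_delay  # assumption: one second when the site names no delay
    pages = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue           # don't go where you're not wanted
        pages[url] = fetch(url)
        sleep(delay)           # go light on the bandwidth
    return pages


if __name__ == "__main__":
    rules = load_rules(SAMPLE_ROBOTS_TXT)
    for url in ("http://example.com/index.html",
                "http://example.com/private/a.html"):
        print(url, rules.can_fetch(USER_AGENT, url))
```

The fetch and sleep arguments are parameters so the loop can be tried out with stand-ins before it is pointed at a real site; against a real site you would first download that site's own robots.txt rather than use the inlined sample.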
Original post: http://www.cnblogs.com/datacatcher/p/3747605.html