Spidering Hacks Study Notes (1)

Date: 2014-05-26 07:57:21

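My boss gave me a copy of the book Spidering Hacks and said that once I had learned what is in it, I could go anywhere without fear! So I took a look: it is an English book of more than 400 pages. I wanted to buy the paper edition, but it is too expensive for me.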

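OK, my approach is to read the table of contents first, then the section headings, and then how the book explains each heading. A section heading is basically the main idea, after all! Heh, here we go....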

                    I: Overview
Chapter 1:
(basics, philosophies, considerations, issues)
 
Chapter 2:
(spidering toolbox, modules galore (plentiful), prominent (notable))
 
Chapter 3:
(media files, Library of Congress (the US national library))
 
Chapter 4:
(getting to information that is not as easy to reach as plain scraping)
 
Chapter 5:
(keeping data current, mirroring collections to hard disk, scheduling spiders)
 
Chapter 6:
(sharing your own data so that it can be spidered)
 
                   II: Chapter 1
1: Hack 1:
(Traverse the Web)
 
Q1: What is the difference between spiders and scrapers?
   Spiders are programs that grab entire pages, files, or sets of either, while scrapers grab very specific bits of information within these files.
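
To make the distinction concrete, here is a minimal sketch in Python (the book itself works in Perl; Python is only my choice for illustration). The sample page and the names `spider`, `TitleScraper`, and `scrape_title` are made up for this example, and the page is kept in a string so that nothing touches the network.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real spider would download this.
SAMPLE_PAGE = """<html>
<head><title>Example Page</title></head>
<body><p>Some body text.</p></body>
</html>"""


def spider(page: str) -> str:
    """Spider-style: keep the entire page exactly as received."""
    return page


class TitleScraper(HTMLParser):
    """Scraper-style: keep only the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Accumulate text only while we are inside <title>.
        if self.in_title:
            self.title += data


def scrape_title(page: str) -> str:
    """Return just the page title, one specific bit of the file."""
    parser = TitleScraper()
    parser.feed(page)
    parser.close()
    return parser.title.strip()
```

`spider(SAMPLE_PAGE)` hands back the whole document, while `scrape_title(SAMPLE_PAGE)` keeps only the title text and discards the rest.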
  
Q2: Why Spider?
(1) Gain automated access to resources
(2) Gather information and present it in an alternate format
(3) Aggregate otherwise disparate data sources
(4) Combine the functionalities of sites (e.g. pooling the resources of many search engines)
(5) Find and gather specific kinds of information
(6) Perform regular webmaster functions (taking over part of a site administrator's duties)
 
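2: Hack 2: Best Practices for You and Your Spider
(1) Be Liberal in What You Accept (formats vary a lot: HTML, XML, ...; conversion is needed, and you need boundaries)
(2) Don't Limit Your Dataset
(3) Don't Reinvent the Wheel (don't tear everything down and start over; borrow from other people's spider scripts!)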
 
Q3: Best practices for you and your spider (a few points to watch):
Choose the most structured format available
If you must scrape HTML, do so sparingly (The less HTML, the less fragile your spider will be!!! Heh)
Don't go where you're not wanted
Choose a good identifier
Make information on your spider readily available
Don't demand unlimited site access or support
Go light on the bandwidth (don't overdo the crawling; keep an eye on bandwidth)
Take just enough, and don't take too often
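
Three of these points ("Don't go where you're not wanted", "Choose a good identifier", "Go light on the bandwidth") can be sketched with Python's standard `urllib.robotparser`. This is only an illustration under my own assumptions, not the book's Perl code: the user-agent string, the robots.txt content, and the helper names `load_rules`, `may_fetch`, and `polite_delay` are all hypothetical, and the rules are inlined so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identifier: a name plus a URL that describes the spider.
USER_AGENT = "NotesSpider/0.1 (+http://example.com/spider.html)"

# Hypothetical robots.txt content; a real spider would download
# this file from the site before crawling it.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_fetch(rules: RobotFileParser, url: str) -> bool:
    """Don't go where you're not wanted: ask before each request."""
    return rules.can_fetch(USER_AGENT, url)


def polite_delay(rules: RobotFileParser, default: float = 1.0) -> float:
    """Go light on the bandwidth: seconds to wait between requests."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


rules = load_rules(ROBOTS_TXT)
```

A crawl loop would call `may_fetch` before every request, send `USER_AGENT` in its request headers, and sleep for `polite_delay(rules)` seconds between requests.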
 
3: Hack 3: Anatomy of an HTML Page
<html>
<head>
<title>
 Title
</title>
</head>
<body>
 body
</body>
</html>
 
(a) Header Information with the H Tags
<H1>, <H2>: the levels of the heading hierarchy
(b) List Information with Special HTML Tags
Ordered list: <ol> <li> </li> </ol>
You can grab everything between <ol> and </ol>, then parse each <li></li> element into an array.
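
The "grab everything between <ol> and </ol>" idea can be sketched as follows. This is a small Python illustration built on the standard `html.parser` module, not the book's Perl code; the class name `OrderedListParser`, the helper `list_items`, and the sample markup are my own, and nested lists are not handled.

```python
from html.parser import HTMLParser

# Hypothetical markup: one ordered list and one unordered list.
SAMPLE_LIST = """
<ul><li>ignored</li></ul>
<ol>
  <li>first</li>
  <li>second</li>
  <li>third</li>
</ol>
"""


class OrderedListParser(HTMLParser):
    """Collect the text of each <li> that sits inside an <ol>."""

    def __init__(self):
        super().__init__()
        self.in_ol = False
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.in_ol = True
        elif tag == "li" and self.in_ol:
            self.in_li = True
            self.items.append("")  # start a new array element

    def handle_endtag(self, tag):
        if tag == "ol":
            self.in_ol = False
            self.in_li = False
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Text counts only while we are inside an <li> of an <ol>.
        if self.in_li:
            self.items[-1] += data


def list_items(html: str) -> list:
    """Return the <ol> entries as a list of stripped strings."""
    parser = OrderedListParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]
```

With the sample above, `list_items(SAMPLE_LIST)` yields one array element per `<li>` of the ordered list and skips the `<ul>` entry.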
(c) Non-HTML Files
XML's parts are defined more rigidly than HTML's
Using XML::RSS to Repurpose Everything
Perl XML modules (such as XML::Simple, XML::RSS, or XML::LibXML)
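
Because XML is defined more rigidly, pulling data out of a feed needs no pattern guessing. The book does this with Perl modules such as XML::RSS; the sketch below swaps in Python's standard `xml.etree.ElementTree` purely for illustration. The feed content and the helper name `feed_items` are assumptions of mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch runs offline.
RSS_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def feed_items(rss_text: str) -> list:
    """Return (title, link) pairs for every <item> in the feed."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]
```

Each entry comes back as a `(title, link)` tuple, ready to be presented in another format.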
 
4: Hack 4: Registering Your Spider
 
(a) Naming Your Spider (pick a meaningful name!)
(b) A Web Page About Your Spider
(c) Places to Register Your Spider
 
5: Hack 5: Preempting Discovery
 
(a) Making Contact: tell people about your spider and how to contact you!!
(b) Making the Arguments for Your Spider: tell people what you are doing
(c) Making Your Spider Easy to Find and Learn About
(d) Considering Legal Issues
 
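6: Hack 6: Keeping Your Spider Out of Sticky Situations
(a) Bad Spider, No Biscuit! (stresses not doing anything harmful)
(b) Violating Copyright (says not to take other people's intellectual property; be careful not to break the law!)
(c) Aggregating Data
(d) Competitive Intelligence (competitors)
(e) Possible Consequences of Misbehaving Spiders (the book says the police may come knocking on your door!)
(f) Tracking Legal Issues (keep an eye on the legal side)
 
7: Hack 7: Finding the Patterns of Identifiers
(Nothing much to say about this one!)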


Original article: http://www.cnblogs.com/datacatcher/p/3747605.html
