Scrapy is one of the easiest-to-use crawler frameworks for Python. Requirement: Python 2.7.x. First, check whether pip is already installed:
# pip --version
If pip is not installed, install it:
# sudo apt-get install python-pip
Import the GPG key used to sign Scrapy packages into the APT keyring:
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
$ echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
$ sudo apt-get update && sudo apt-get install scrapy

Also install service_identity, which Twisted relies on for TLS hostname verification:

$ pip install service_identity --timeout 10000
If the service_identity install fails on its pyasn1 dependency, build pyasn1 from source:

$ wget https://pypi.python.org/packages/source/p/pyasn1/pyasn1-0.1.8.tar.gz#md5=7f6526f968986a789b1e5e372f0b7065
$ tar -zxvf pyasn1-0.1.8.tar.gz
$ cd pyasn1-0.1.8
$ sudo python setup.py install
# wget "https://pypi.python.org/packages/source/p/pip/pip-1.5.4.tar.gz#md5=834b2904f92d46aaa333267fb1c922bb" --no-check-certificate # tar -xzvf pip-1.5.4.tar.gz # cd pip-1.5.4 # python2.7 setup.py install
# pip install scrapy --timeout 10000
TODO: The download is painfully slow. I will finish this part once it completes.
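Once the download finally completes, a quick sanity check that the package is importable (check_scrapy.py is a hypothetical name for this minimal snippet, not part of the original post):

#!/usr/bin/python2.7
# check_scrapy.py -- minimal sanity check, not from the original post
import scrapy
print(scrapy.__version__)   # any version string here means the install worked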
Save the following spider as stackoverflow.py:

#!/usr/bin/python2.7
# -*- coding: UTF-8 -*-
# stackoverflow.py

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # Follow the link of every question on the votes-sorted listing page
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Extract the fields of interest from an individual question page
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
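If one of the CSS selectors comes back empty, Scrapy's interactive shell (standard Scrapy tooling) is a convenient way to test selectors against the live page before a full crawl; the session below is only a sketch:

$ scrapy shell 'http://stackoverflow.com/questions?sort=votes'
>>> # Try the listing-page selector; it should return a list of question URLs
>>> response.css('.question-summary h3 a::attr(href)').extract()[:3]

Then run the spider: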
$ scrapy runspider stackoverflow.py -o top-ques.json
Open top-ques.json and see what the spider collected!
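With the -o flag, Scrapy's feed exporter serializes every yielded item into top-ques.json as a JSON array. A minimal sketch for inspecting the results in Python (inspect_results.py is a hypothetical helper name; it assumes the crawl wrote at least one item):

#!/usr/bin/python2.7
# inspect_results.py -- illustrative helper, not part of the original post
from __future__ import print_function
import json

with open('top-ques.json') as f:
    questions = json.load(f)   # list of dicts, one per yielded item

# Print vote count and title of the first few scraped questions
for q in questions[:5]:
    print(q['votes'], q['title'])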
Enjoy it!