Tags: web crawler
A web crawler (also known as a web spider or web robot, and sometimes called a page chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, auto-indexer, emulator, and worm.
Modules: scrapy, requests
Environment: CentOS
****************** If you want an in-depth understanding of scrapy, head to the link below ******************
Recommended: http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
Method 1: using the scrapy module
1. Create the project: scrapy startproject <project directory name> (the generated layout is sketched below)
   e.g. scrapy startproject tututu    # tututu is the project directory name
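For reference, this command should generate a project skeleton roughly like the following (as described in the scrapy tutorial linked above; details vary slightly between versions):

tututu/
    scrapy.cfg          # deployment configuration
    tututu/             # the project's Python module
        __init__.py
        items.py        # item definitions
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/        # spider files go here
            __init__.py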
2. Create the spider file under <project directory name>/<project directory name>/spiders/
   e.g. vim pachong.py
3. Write the spider code
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dadada"    # the spider's name; the fixed format is name = 'spider name'
    start_urls = [
        "http://www.cnblogs.com/wangkongming/default.html?page=22",
        "http://www.cnblogs.com/wangkongming/default.html?page=21",
    ]    # the starting URLs; the fixed format is start_urls = []

    def parse(self, response):
        filename = response.url.split("/")[-2]    # response.url is the URL being crawled
        with open(filename, "wb") as f:
            f.write(response.body)    # response.body is the fetched page source

4. Run the spider: scrapy crawl dadada    # dadada here is the spider's name
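If you want the spider to walk the pagination by itself instead of listing every page in start_urls, the sketch below is one way to do it (assuming scrapy 1.x or newer; the XPath for the "next page" link is hypothetical and has to be adapted to the site's real markup):

import scrapy

class CnblogsSpider(scrapy.Spider):
    name = "dadada_follow"
    start_urls = ["http://www.cnblogs.com/wangkongming/default.html?page=1"]

    def parse(self, response):
        # name each file after its page number so pages don't overwrite each other
        filename = "page-" + response.url.split("=")[-1] + ".html"
        with open(filename, "wb") as f:
            f.write(response.body)
        # hypothetical selector: replace with the site's actual "next page" link
        next_href = response.xpath('//a[@id="nav_next_page"]/@href').extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)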
Method 2: using the requests module
# coding: utf-8
import requests
from bs4 import BeautifulSoup

url = "http://www.ivsky.com/"

def download_url(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    response = requests.get(url, headers=headers)    # request the page
    return response.text    # return the page source

def connet_image_url(html):
    soup = BeautifulSoup(html, 'lxml')    # parse the html
    body = soup.body    # the body of the page
    data_main = body.find('div', {"class": "ileft"})    # find the div tag in body whose class is "ileft"
    if data_main:
        images = data_main.find_all('img')    # find all img tags inside data_main
        with open('img_url', 'w') as f:
            for i, image in enumerate(images):    # iterate over images with an index
                image_url = image.get('src')    # get the value of the img tag's src attribute
                f.write(image_url + '\n')
    save_image()

def save_image():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    i = 0
    with open('img_url', 'r') as f:
        for line in f:
            line = line.strip()    # strip the trailing newline
            if line:
                i += 1
                response = requests.get(url=line, headers=headers)
                filename = str(i) + '.jpg'
                with open(filename, 'wb') as img_file:    # separate handle, so the url file stays open
                    img_file.write(response.content)    # write the image bytes to disk
                print('This is image number %s' % i)

connet_image_url(download_url(url))

These are a beginner's notes; if you spot any mistakes, please point them out in the comments below.
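One fragility worth noting: on many pages the src of an img tag is relative or protocol-relative (e.g. //img.ivsky.com/...), and requests.get will fail on such a value. A minimal hardening sketch, joining each src against the page URL first (the fetch_image helper below is my addition, not part of the original script):

from urllib.parse import urljoin
import requests

def fetch_image(page_url, src, filename, headers=None):
    absolute = urljoin(page_url, src)    # turn a relative/protocol-relative src into an absolute URL
    response = requests.get(absolute, headers=headers, timeout=10)
    response.raise_for_status()    # fail loudly on 4xx/5xx instead of saving an error page
    with open(filename, 'wb') as f:
        f.write(response.content)

# usage: fetch_image("http://www.ivsky.com/", "//img.ivsky.com/img/x.jpg", "1.jpg")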
Original post: http://dongxiaoyang.blog.51cto.com/12624314/1958459