1. Create the project
2. Create the spider
3. Run the spider
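The three steps above map onto the Scrapy command-line tool. Below is a minimal sketch that spells out the commands from Python so they can be scripted. The names `myproject`, `example`, and `example.com` in the usage note are hypothetical placeholders, and the helpers `scrapy_steps` and `run_steps` are made up for illustration, not part of Scrapy.

```python
import subprocess


def scrapy_steps(project, spider, domain):
    """Return (command, working_directory) pairs for the three steps."""
    return [
        # 1. Create the project: scrapy startproject <project>
        (["scrapy", "startproject", project], "."),
        # 2. Create the spider, from inside the project directory:
        #    scrapy genspider <spider> <domain>
        (["scrapy", "genspider", spider, domain], project),
        # 3. Run the spider, also from inside the project directory:
        #    scrapy crawl <spider>
        (["scrapy", "crawl", spider], project),
    ]


def run_steps(project, spider, domain):
    """Run each step in order and stop at the first failing command."""
    for command, cwd in scrapy_steps(project, spider, domain):
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_steps("myproject", "example", "example.com")` requires Scrapy to be installed and on the PATH. The spider name should differ from the project name, since `genspider` refuses to create a spider named after the project.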
======
Crawling tips
Configure settings.py
1. Disable ROBOTSTXT_OBEY so the spider does not follow robots.txt rules
ROBOTSTXT_OBEY = False
2. Set a download delay (in seconds)
DOWNLOAD_DELAY = 3
3. Set DEFAULT_REQUEST_HEADERS
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}
4. Set up a downloader middleware (to set the spider's headers and proxy)
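# project_dir/middlewares.py
class ProxyMiddleware(object):
    # Scrapy calls this hook with the request first and the spider second.
    def process_request(self, request, spider):
        print('*' * 100)  # visual marker that the middleware ran
        # setdefault only fills in the header if it is not already present
        request.headers.setdefault('User-Agent', 'put your browser user agent here')
        request.meta['proxy'] = 'put your proxy address here'  # e.g. https://127.0.0.1:8080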
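The middleware only takes effect once it is listed under DOWNLOADER_MIDDLEWARES in settings.py. Here is a minimal sketch, assuming the package is called `project_dir` as in the path comment above. The dotted path and class name are placeholders that must match your own package and class, and 543 is simply the priority used in the commented-out example of Scrapy's project template.

```python
# settings.py (fragment)

# Map the dotted path of each middleware class to a priority.
# For process_request, middlewares with lower numbers run earlier.
DOWNLOADER_MIDDLEWARES = {
    # Placeholder path: adjust the package and class name to your project.
    "project_dir.middlewares.ProxyMiddleware": 543,
}
```

Because the middleware uses `setdefault`, a User-Agent that is already present on the request, for example one supplied through DEFAULT_REQUEST_HEADERS, is left unchanged. Assign `request.headers['User-Agent']` directly if the middleware's value should always win.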