To be honest, learning Python on my own from zero was a bit of a struggle; maybe I'm just slow. Enough talk, on to the code.
1. Scrape the links of each category
from bs4 import BeautifulSoup
import requests

start_url = 'http://bj.58.com/sale.shtml'
url_host = 'http://bj.58.com'

def get_channel_urls(url):
    # Pull every category link out of the sub-menu on the second-hand goods index page
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    links = soup.select('ul.ym-submnu > li > b > a')
    for link in links:
        page_url = url_host + link.get('href')
        print(page_url)

get_channel_urls(start_url)
channel_list = '''
http://bj.58.com/shouji/
http://bj.58.com/danche/
http://bj.58.com/diandongche/
http://bj.58.com/fzixingche/
http://bj.58.com/sanlunche/
http://bj.58.com/peijianzhuangbei/
http://bj.58.com/diannao/
http://bj.58.com/bijiben/
http://bj.58.com/pbdn/
http://bj.58.com/diannaopeijian/
http://bj.58.com/zhoubianshebei/
http://bj.58.com/shuma/
http://bj.58.com/shumaxiangji/
http://bj.58.com/mpsanmpsi/
http://bj.58.com/youxiji/
http://bj.58.com/ershoukongtiao/
http://bj.58.com/dianshiji/
http://bj.58.com/xiyiji/
http://bj.58.com/bingxiang/
http://bj.58.com/jiadian/
http://bj.58.com/binggui/
http://bj.58.com/chuang/
http://bj.58.com/ershoujiaju/
http://bj.58.com/yingyou/
http://bj.58.com/yingeryongpin/
http://bj.58.com/muyingweiyang/
http://bj.58.com/muyingtongchuang/
http://bj.58.com/yunfuyongpin/
http://bj.58.com/fushi/
http://bj.58.com/nanzhuang/
http://bj.58.com/fsxiemao/
http://bj.58.com/xiangbao/
http://bj.58.com/meirong/
http://bj.58.com/yishu/
http://bj.58.com/shufahuihua/
http://bj.58.com/zhubaoshipin/
http://bj.58.com/yuqi/
http://bj.58.com/tushu/
http://bj.58.com/tushubook/
http://bj.58.com/wenti/
http://bj.58.com/yundongfushi/
http://bj.58.com/jianshenqixie/
http://bj.58.com/huju/
http://bj.58.com/qiulei/
http://bj.58.com/yueqi/
http://bj.58.com/bangongshebei/
http://bj.58.com/diannaohaocai/
http://bj.58.com/bangongjiaju/
http://bj.58.com/ershoushebei/
http://bj.58.com/chengren/
http://bj.58.com/nvyongpin/
http://bj.58.com/qinglvqingqu/
http://bj.58.com/qingquneiyi/
http://bj.58.com/chengren/
http://bj.58.com/xiaoyuan/
http://bj.58.com/ershouqiugou/
http://bj.58.com/tiaozao/
http://bj.58.com/tiaozao/
http://bj.58.com/tiaozao/
'''
After running the code above, delete the phone-number category from the scraped output and keep the rest for later use (stored here in the channel_list string above).
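As a minimal sketch of that clean-up step (my own addition; the 'shoujihao' path fragment is an assumption about which category to drop, it does not come from the output above), the unwanted link could also be filtered out in code:

# Hypothetical clean-up: drop the phone-number category before keeping the links.
channel_urls = [url for url in channel_list.split() if 'shoujihao' not in url]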
2. Create a new Python file and write the crawler that scrapes the links of every item and saves them into a MongoDB database. This step looks fine on its own, but when it runs in multiple processes together with the third crawler it keeps throwing errors. The main reason is that the first few links on each list page are Zhuanzhuan ads whose page structure differs from an ordinary post, so an if statement is needed to filter those abnormal links out.
from bs4 import BeautifulSoup
import requests
import pymongo
import time

client = pymongo.MongoClient('localhost', 27017)
wuba = client['wuba']
url_list3 = wuba['url_list3']      # item links scraped by spider 1
item_infor = wuba['item_infor']    # item details scraped by spider 2

# spider 1: collect the item links on one list page of a category
def get_links_from(channel, pages, who_shells=0):
    # e.g. http://bj.58.com/shouji/0/pn2/ (0 = personal seller, pn2 = page 2)
    list_view = '{}{}/pn{}/'.format(channel, str(who_shells), str(pages))
    wb_data = requests.get(list_view)
    time.sleep(1)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    if soup.find('td', 't'):
        links = soup.select('td.t a.t')
        for link in links:
            wor_url = 'http://jump.zhineng.58.com/jump'
            item_link = link.get('href').split('?')[0]
            if item_link == wor_url:
                # skip the Zhuanzhuan ad links at the top of the list
                continue
            else:
                url_list3.insert_one({'url': item_link})
                print(item_link)
    else:
        # past the last page: nothing to scrape
        pass
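A quick usage example (my own, with an arbitrary category and page number) to show how spider 1 is called:

# Hypothetical call: scrape page 1 of the mobile-phone category for personal sellers.
get_links_from('http://bj.58.com/shouji/', 1)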
3. Write the third crawler, which scrapes the detail pages. A try/except is used here because I later found that grabbing the price kept raising an error I could never resolve, so those cases are simply skipped.
# spider 2: scrape title, price and area from one detail page
def get_items_info(url):
    wb_data = requests.get(url)
    try:
        soup = BeautifulSoup(wb_data.text, 'lxml')
        title = soup.title.text
        price = soup.select('span.price_now > i')[0].text if soup.find_all('span', 'price_now') else None
        area = list(soup.select('div.palce_li > span > i')[0].stripped_strings) if soup.find_all('div', 'palce_li') else None
        # store the page url too, so the resume logic in step 5 can compare the two collections
        item_infor.insert_one({'title': title, 'price': price, 'area': area, 'url': url})
        print({'title': title, 'price': price, 'area': area})
    except IndexError:
        pass
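A quick smoke test (my own sketch): pull one link that spider 1 already stored back out of MongoDB and feed it to the detail-page crawler:

# Hypothetical smoke test: parse a single stored link, if any exists yet.
sample = url_list3.find_one()
if sample:
    get_items_info(sample['url'])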
4. Create a new main Python file; with a process pool and a main function, scrape the links of every item under every category and save them all to MongoDB.
# main file: channel_extract.py holds step 1's channel_list,
# page_parsing.py holds the two spiders from steps 2 and 3
from multiprocessing import Pool
from channel_extract import channel_list
from page_parsing import get_links_from

def get_all_links_from(channel):
    # walk through list pages 1-100 of one category
    for num in range(1, 101):
        get_links_from(channel, num)

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links_from, channel_list.split())
In addition, when running this main function from cmd, you can set up a monitoring function to keep an eye on how much data is in the database; the code is as follows:
import time
from page_parsing import url_list3

while True:
    # find().count() works with the pymongo used here; newer versions prefer count_documents({})
    print(url_list3.find().count())
    time.sleep(5)
This prints the number of records in the database every 5 seconds.
5. Once all the links have been crawled and stored in the database, create a second main function that pulls the links back out of the database and scrapes the content of every detail page.
from multiprocessing import Pool
from page_parsing import get_items_info, url_list3, item_infor

# all links collected by spider 1 vs. links whose details have already been scraped
db_urls = [item['url'] for item in url_list3.find()]
infor_url = [item['url'] for item in item_infor.find()]
x = set(db_urls)
y = set(infor_url)
no_url = x - y          # only the links that still need to be crawled

def get_all_items_info(db_urls):
    get_items_info(db_urls)

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_items_info, no_url)
Resumable crawling is used here: subtracting the URLs in one collection from the other means that if the crawl fails partway through, rerunning this function picks up right where it broke off.
With the functions above you can scrape the data. In practice I did not manage to grab all of it: errors came up partway through, and while debugging them I made too many requests from the same IP and got my access restricted. I wanted to try going through proxy IPs, but couldn't find any reliable ones.
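If a usable proxy were available, a minimal sketch (my own addition; the proxy address below is a placeholder, not a real server) of routing a request through it with requests would look like this:

# Hypothetical proxy usage; replace the placeholder address with a working proxy.
proxies = {'http': 'http://123.45.67.89:8888'}
wb_data = requests.get('http://bj.58.com/shouji/0/pn1/', proxies=proxies, timeout=10)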
Original post: http://www.cnblogs.com/gttpython/p/7517616.html