Tags: list mozilla port scraping window tag html urlencode read
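I have been job hunting lately, so I scraped all of the Python positions listed on Lagou (lagou.com) to give myself some direction. Lagou's data is fairly easy to scrape: the search endpoint returns JSON, which can be parsed directly. Without further ado, here is the code: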
# Python 2 script (uses urllib2, xrange and the print statement).
import json
import urllib
import urllib2
from openpyxl import load_workbook

# Raw string so the backslashes in the Windows path are kept literally.
# load_workbook opens an existing file, so this workbook must already exist.
filename = r'E:\excel\position_number_11_2.xlsx'
wb = load_workbook(filename=filename)
# Keyword arguments avoid depending on the positional order of create_sheet.
sheet = wb.create_sheet(title='position', index=0)
count = 1  # next worksheet row to write (openpyxl rows are 1-based)

# The headers and URL are the same for every page.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Referer': 'https://www.lagou.com/jobs/list_Python?px=default&city=%E5%85%A8%E5%9B%BD',
}
request_url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&needAddtionalResult=false'

# Fields written to columns 1-10, in this order.
fields = ['companySize', 'workYear', 'education', 'financeStage', 'city',
          'industryField', 'formatCreateTime', 'positionName',
          'companyFullName', 'salary']

for page in xrange(100):
    form_data = {
        'first': 'false',
        'pn': page,
        'kd': 'Python',
    }
    data = urllib.urlencode(form_data)
    # Passing data makes urllib2 send a POST request.
    request = urllib2.Request(request_url, headers=headers, data=data)
    try:
        html = urllib2.urlopen(request).read().decode('utf-8')
    except Exception:
        print 'No position data'
        break
    jsonobj = json.loads(html)
    positions = jsonobj['content']['positionResult']['result']
    for item in positions:
        if item:
            # One row per position, one column per field.
            for column, field in enumerate(fields, 1):
                sheet.cell(row=count, column=column).value = item[field]
            count += 1

# Save once, after all pages have been processed.
wb.save(filename)
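I wrote the code in a hurry, so it is not very tidy. In a couple of days I will post my code for Weibo and Douban as well. I hope the experts here on cnblogs will give me some pointers ^_^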
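For anyone reading this on Python 3, here is a rough sketch of the same two steps from the script above: building the POST request and flattening the JSON into rows. It is only an illustration under a few assumptions. The function names `build_request` and `extract_rows` are made up for this example, the sample response contains placeholder values rather than real Lagou data, and I have not checked whether the site still accepts this request. Nothing is sent over the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# URL and field names are copied from the script above.
REQUEST_URL = ("https://www.lagou.com/jobs/positionAjax.json"
               "?px=default&needAddtionalResult=false")
FIELDS = ["companySize", "workYear", "education", "financeStage", "city",
          "industryField", "formatCreateTime", "positionName",
          "companyFullName", "salary"]


def build_request(page, keyword="Python"):
    """Build (but do not send) the POST request for one result page."""
    form_data = {"first": "false", "pn": page, "kd": keyword}
    # In Python 3, urllib.request expects the POST body as bytes.
    body = urlencode(form_data).encode("utf-8")
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Referer": "https://www.lagou.com/jobs/list_Python"
                   "?px=default&city=%E5%85%A8%E5%9B%BD",
    }
    return Request(REQUEST_URL, data=body, headers=headers)


def extract_rows(payload):
    """Turn one decoded JSON response into a list of rows, one per position."""
    positions = payload["content"]["positionResult"]["result"]
    # Unlike item[field] in the script above, item.get() yields None for a
    # missing key instead of raising KeyError.
    return [[item.get(field) for field in FIELDS]
            for item in positions if item]


# Placeholder response (made-up values) with the same key path as above.
sample = json.loads("""
{"content": {"positionResult": {"result": [
    {"companySize": "size-placeholder", "city": "city-placeholder",
     "positionName": "Python developer", "salary": "salary-placeholder"}
]}}}
""")

rows = extract_rows(sample)
request = build_request(1)
```

Each entry in `rows` lines up with the ten spreadsheet columns, so it could be handed to openpyxl's `sheet.append(row)` instead of setting the cells one by one.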
Original article: http://www.cnblogs.com/viver-python/p/6032746.html