Lagou Project (1): Project Workflow + Data Extraction

Date: 2020-06-14 18:21:41


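Disclaimer:

1) This is for personal study only; if it infringes on anything, let me know and I will take it down right away!

2) I do not want to mislead anyone; if you spot a mistake, please point it out!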

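Goals:

1. Scrape Lagou's listings for programming-language jobs and collect: 1) salary, 2) city, 3) years of experience, 4) education requirement;

2. Save these four fields to `mysql`;

3. Visualize the four fields;

4. Finally, polish the resulting web page with `pyecharts+bootstrap`.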

Skills needed:

1. Python networking basics (`requests`, `xpath` syntax, etc.);

2. Basic `MySQL` + `pymysql` syntax;

3. `pyecharts` basics;

4. Bootstrap basics.

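Project workflow and logic:

Overall approach: first scrape one category of information and visualize it. Running through the whole pipeline once matters most; expand afterwards!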

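[Image]

1. Open the following location:

Refresh the page and find the request `url`:

[Image]

Analyze the request and its parameters:

[Image]

Because the `url` is a POST request, parameters have to be submitted with it; scroll down to see them:

[Image]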

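2. Getting past the anti-scraping measures

1. The steps above deal with Lagou's `ajax` request mechanism.

2. Handling the timestamp hidden in the cookies: use a session to keep the conversation alive, so the cookies are refreshed on every run.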

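import json
import time
import requests
from lxml import etree

# Fetch fresh cookies from the search page (headers is the dict defined in the main block)
# start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
def cookieRequest(start_url):
    r = requests.Session()
    r.get(url=start_url, headers=headers, timeout=3)
    return r.cookies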
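Since the cookies carry a timestamp, one option is to let a single `requests.Session` perform both the page visit and the Ajax POST, so the refreshed cookies are forwarded automatically. A minimal sketch, assuming the endpoint answers with JSON; `fetch_with_session` is a hypothetical helper name, not part of the original code:

```python
def fetch_with_session(start_url, post_url, headers, data, session=None):
    """Visit the search page first, then POST to the Ajax endpoint with the same session."""
    if session is None:
        import requests  # third-party; imported lazily so a stand-in session can be passed in
        session = requests.Session()
    # The GET lets the server set fresh cookies on the session
    session.get(start_url, headers=headers, timeout=3)
    # The POST reuses those cookies automatically; no cookies= argument is needed
    response = session.post(post_url, headers=headers, data=data, timeout=3)
    return response.json()
```

Called without `session`, the helper creates a real `requests.Session`; passing a stand-in object makes it possible to try the flow without touching the network.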

3. Building the pipeline

1. The main function:

if __name__ == '__main__':
    # Initial URL, used only to obtain cookies
    start_url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
    # Ajax endpoint that the simulated request is sent to
    post_url = "https://www.lagou.com/jobs/positionAjax.json?"
    # Request headers
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
        "accept": "application/json, text/javascript, */*; q=0.01",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "referer": "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=",
    }
    # Dynamic cookies
    cookies = cookieRequest(start_url)
    time.sleep(1)
    # Page number to request
    page = 1
    # Exception handling
    try:
        data = {
            "first": "true",
            "pn": page,
            "kd": "python",
        }
        textInformation(post_url, data, cookies)
        time.sleep(7)
        print('------------Page %s scraped successfully, moving on to the next page--------------' % page)
    except requests.exceptions.ConnectionError:
        print("Connection refused")
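The success message talks about moving on to the next page, while the snippet itself requests page 1 only. Below is a sketch of how the form data for several pages might be generated. `build_payload` and `page_payloads` are hypothetical names, and sending `first` as `"false"` after the first page is an assumption about the endpoint that this post does not confirm:

```python
def build_payload(page, keyword="python"):
    """Form data for one page of search results."""
    return {
        # Assumption: only the first page is flagged as "first"
        "first": "true" if page == 1 else "false",
        "pn": page,
        "kd": keyword,
    }


def page_payloads(last_page, keyword="python"):
    """Yield (page, payload) pairs for pages 1..last_page."""
    for page in range(1, last_page + 1):
        yield page, build_payload(page, keyword)


for page, data in page_payloads(3):
    print(page, data)
```

Each payload could then be handed to `textInformation(post_url, data, cookies)`, keeping the pause between requests that the main function already uses.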

2. The listing-page function

def textInformation(post_url, data, cookies):
    response = requests.post(post_url, headers=headers, data=data, cookies=cookies, timeout=3).text
    div1 = json.loads(response)
    # Job postings on this page
    position_data = div1["content"]["positionResult"]["result"]
    n = 1
    for result in position_data:
        infor = {
            "positionName": result["positionName"],

            "companyFullName": result["companyFullName"],
            "companySize": result["companySize"],
            "industryField": result["industryField"],
            "financeStage": result["financeStage"],

            "firstType": result["firstType"],
            "secondType": result["secondType"],
            "thirdType": result["thirdType"],

            "positionLables": result["positionLables"],

            "createTime": result["createTime"],

            "city": result["city"],
            "district": result["district"],
            "businessZones": result["businessZones"],

            "salary": result["salary"],
            "workYear": result["workYear"],
            "jobNature": result["jobNature"],
            "education": result["education"],

            "positionAdvantage": result["positionAdvantage"]
        }

        print(infor)
        time.sleep(5)
        print('----------Record %s written-------' % n)
        n += 1
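Of all these keys, the goals only call for salary, city, years of experience and education. The sketch below narrows one record down to those four and turns the salary into numbers for later charting. It assumes salaries look like `15k-30k`, which this post does not show; `parse_salary` and `target_fields` are hypothetical helpers and the sample record is made up:

```python
import re


def parse_salary(text):
    """Turn a range such as '15k-30k' into (15, 30); return None if it does not match."""
    match = re.fullmatch(r"(\d+)[kK]-(\d+)[kK]", text.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def target_fields(result):
    """Keep only the four fields named in the goals."""
    return {
        "salary": parse_salary(result["salary"]),
        "city": result["city"],
        "workYear": result["workYear"],
        "education": result["education"],
    }


# Made-up record, for illustration only
sample = {"salary": "15k-30k", "city": "Beijing", "workYear": "3-5 years", "education": "Bachelor"}
print(target_fields(sample))
```

Returning `None` for an unexpected salary string keeps odd records visible instead of silently guessing a value.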

3. Fetching each category's show_id separately (used by the detail page):

https://www.lagou.com/jobs/4254613.html?show=0977e2e185564709bebd04fe72a34c9f

show_id = []
def getShowId(post_url, headers, cookies):
    data = {
        "first": "true",
        "pn": 1,
        "kd": "python",
    }
    response = requests.post(post_url, headers=headers, data=data, cookies=cookies).text
    div1 = json.loads(response)
    # Job postings on this page
    position_data = div1["content"]["positionResult"]["result"]
    # show_id used by the detail pages
    position_show_id = div1['content']['showId']
    show_id.append(position_show_id)
    # return position_show_id
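To tie this to the detail-page URL format shown above, the `showId` can be paired with the ID of each posting. A sketch, assuming every entry of `result` exposes its ID under a `positionId` key (that key does not appear in this post); `detail_urls` is a hypothetical helper and the sample response is made up:

```python
def detail_urls(payload):
    """Build one detail-page URL per posting from a decoded positionAjax response."""
    content = payload["content"]
    show_id = content["showId"]
    urls = []
    for result in content["positionResult"]["result"]:
        # Assumption: the posting ID lives under "positionId"
        urls.append(
            "https://www.lagou.com/jobs/{}.html?show={}".format(result["positionId"], show_id)
        )
    return urls


# Made-up response fragment with the same nesting as the code above
sample = {
    "content": {
        "showId": "abc123",
        "positionResult": {"result": [{"positionId": 4254613}]},
    }
}
print(detail_urls(sample))
```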

4. Detail-page information

def detailinformation(detail_id, show_id):
    get_url = "https://www.lagou.com/jobs/{}.html?show={}".format(detail_id, show_id)
    # time.sleep(2)
    # Request the detail page
    response = requests.get(get_url, headers=headers, timeout=5).text
    # print(response)
    html = etree.HTML(response)
    div1 = html.xpath("//div[@class='job-detail']/p/text()")
    # Job description: strip non-breaking spaces
    position_list = [i.replace('\xa0', '') for i in div1]
    # print(position_list)
    return position_list
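The cleaning step only removes non-breaking spaces, so whitespace-only fragments would still end up in the list. A slightly stricter variant might look like the sketch below; `clean_fragments` is a hypothetical helper and the input is made up to resemble what `text()` could return:

```python
def clean_fragments(fragments):
    """Remove non-breaking spaces and surrounding whitespace, and drop empty pieces."""
    cleaned = []
    for text in fragments:
        text = text.replace("\xa0", "").strip()
        if text:
            cleaned.append(text)
    return cleaned


# Made-up fragments, for illustration only
raw = ["\xa0Job duties:", "  ", "1. Write crawlers\xa0", "\n"]
print(clean_fragments(raw))
```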

The complete code is on `GitHub`:

　　https://github.com/xbhog/studyProject

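4. Unresolved issues and things to improve

When detail pages are saved to MySQL, some records have no data, possibly because of network jitter or overly frequent requests.

Multithreading is not used
The scrapy framework is not used
Classes and class methods are not used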
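For the empty detail records mentioned first, one possible mitigation is to retry with a growing pause whenever a fetch fails or comes back empty. This is a sketch of that idea, not a confirmed fix for the cause; `fetch_with_retry` and its parameters are hypothetical:

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result or the attempts run out."""
    result = None
    for attempt in range(attempts):
        try:
            result = fetch()
        except Exception as exc:  # e.g. a timeout or a connection error
            print("attempt %d failed: %s" % (attempt + 1, exc))
            result = None
        if result:
            return result
        if attempt < attempts - 1:
            # Wait longer after each failure: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return result
```

It could wrap the detail request as `fetch_with_retry(lambda: detailinformation(detail_id, show_id))`. The `sleep` parameter exists only so the waiting can be swapped out when trying the helper offline.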

Next installment

Data storage (storage environment: Ubuntu)

MySQL storage
CSV storage
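Ahead of that installment, here is a minimal sketch of the CSV option using only the standard library. The column selection, `FIELDS` and `write_rows` are illustrative choices rather than the author's code, and the row is made up:

```python
import csv
import io

# Columns to keep; a subset of the keys collected by textInformation
FIELDS = ["positionName", "city", "salary", "workYear", "education"]


def write_rows(rows, handle):
    """Write dictionaries to an open text handle as CSV, ignoring keys outside FIELDS."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up row; an in-memory buffer stands in for a real file
buffer = io.StringIO()
write_rows(
    [{"positionName": "Python Developer", "city": "Beijing", "salary": "15k-30k",
      "workYear": "3-5 years", "education": "Bachelor", "district": "Haidian"}],
    buffer,
)
print(buffer.getvalue())
```

For a real file, the handle would typically come from `open("lagou.csv", "w", newline="", encoding="utf-8")`, where the file name is only an example.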


Original article: https://www.cnblogs.com/xbhog/p/13124722.html
