爬取信件信息（更新）

时间：2020-02-13 23:08:13 阅读：59 评论：0 收藏：0 [点我收藏+]

标签：gecko http orm pos l数据库 bottom parse format enc

 1 import requests
 2 from bs4 import BeautifulSoup
 3 from selenium import webdriver
 4 
 5 
 6 
 7 url=‘http://www.beijing.gov.cn/hudong/hdjl/com.web.search.mailList.flow‘
 8 headers={‘User-Agent‘:‘Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0‘}
 9 
10 resp = requests.post(url, headers=headers)
11 #print(resp)
12 resp.encoding = resp.apparent_encoding  # 将网页返回的字符集类型设置为 自动判断的字符集类型
13 html = resp.text
14 # 解析成为beautiful对象
15 soup = BeautifulSoup(html, ‘html.parser‘)
16 # print(soup)
17 result = soup.find(‘body‘).find(‘div‘, {‘class‘: ‘conouter‘}).find(‘div‘, {‘id‘: ‘mailul‘}).find_all(‘div‘, {
18         ‘class‘: ‘row clearfix my-2 list-group o-border-bottom2 p-3‘})
19 # print(result)
20 for info in result:
21      leixing = info.find(‘a‘).find(‘font‘).get_text()
22      biaoti = info.find(‘a‘).find(‘span‘).get_text()
23      shijian = info.find(‘div‘, {‘class‘: ‘o-font2 col-md-3 col-xs-12 text-muted‘}).get_text()
24      bumen = info.find(‘div‘, {‘class‘: ‘o-font2 col-md-5 col-xs-12 text-muted‘}).get_text()
25      str=(‘{} {} {} {}‘.format(leixing[2:4],biaoti,shijian[5:],bumen[6:]))
26      list=str.split()#分割空格
27      #print(list)
28 #以字典格式输出
29      dict={‘leixing‘:‘‘,‘biaoti‘:‘‘,‘shijian‘:‘‘,‘bumen‘:‘‘}
30      dict[‘leixing‘]=list[0]
31      dict[‘biaoti‘] = list[1]
32      dict[‘shijian‘] = list[2]
33      dict[‘bumen‘] = list[3]
34      print(dict)
35 page=soup.find(‘body‘).find_all(‘div‘,{‘id‘:‘pageDiv1‘})
36 print(page)

运行结果：

技术图片

现在只能爬一页，明天考虑翻页爬取，并提交到mysql数据库

爬取信件信息（更新）

标签：gecko http orm pos l数据库 bottom parse format enc

原文地址：https://www.cnblogs.com/qq1793033075/p/12305420.html

踩

(0)

评论一句话评论（0）

分享档案

更多>

2021年07月29日 (22)
2021年07月28日 (40)
2021年07月27日 (32)
2021年07月26日 (79)
2021年07月23日 (29)
2021年07月22日 (30)
2021年07月21日 (42)
2021年07月20日 (16)
2021年07月19日 (90)
2021年07月16日 (35)

周排行