Scraping a news list with the requests and BeautifulSoup4 libraries

Date: 2017-09-28 22:33:09


1. Use the requests and BeautifulSoup4 libraries to scrape the time, title, link, source, and full content of each item in the campus news list.

Requirements: (1) convert the time string to a datetime object; (2) wrap the code that fetches the full content in a function.
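Requirement (1) comes down to a call to datetime.strptime. As a minimal sketch, assuming the article's info line contains a timestamp in YYYY-MM-DD HH:MM:SS form, the timestamp can be located with a regular expression rather than by fixed character positions. The sample string and the name parse_time are hypothetical and not taken from the site.

```python
import re
from datetime import datetime

# Hypothetical sample of an article's info line; the real page text may differ.
SAMPLE_INFO = "Published: 2017-09-28 10:15:00  Author: someone  Source: news office"

# Matches a timestamp such as 2017-09-28 10:15:00
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def parse_time(info_text):
    """Return the first timestamp found in info_text as a datetime, or None."""
    match = TIMESTAMP.search(info_text)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")

published = parse_time(SAMPLE_INFO)
print(type(published).__name__, published)
```

Searching for the pattern keeps the conversion working if the label in front of the timestamp changes length, which a fixed slice would not tolerate.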

import requests
from bs4 import BeautifulSoup
from datetime import datetime

webs = "http://news.gzcc.cn/html/xiaoyuanxinwen/"
res = requests.get(webs)
res.encoding = 'utf-8'  # set the encoding so Chinese text is not garbled
soup = BeautifulSoup(res.text, "html.parser")  # html.parser selects the built-in parser

# Return the full text of the news article at url
def getdetail(url):
    resd = requests.get(url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    return soupd.select('.show-content')[0].text

# Return the publication time of the article at url as a datetime
def gettime(url):
    resd = requests.get(url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    tx1 = soupd.select('.show-info')[0].text
    tx2 = tx1[5:24]  # characters 5 to 23 of the info line hold the timestamp
    return datetime.strptime(tx2, '%Y-%m-%d %H:%M:%S')  # convert the string to a datetime

for news in soup.select('li'):
    # Only handle list items that contain a news title
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text
        url = news.select('a')[0]['href']  # href of the first <a> tag, i.e. the article link
        time = gettime(url)  # publication time, read from the article page
        # The second child node of the info block holds the source
        source = news.select('.news-list-info')[0].contents[1].text
        detail = getdetail(url)
        # Print the time, source, title, link, and full content
        print(time, source, title, '\n', url, '\n', detail)
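Both getdetail and gettime download the same article page, so each article is requested twice. One possible alternative, sketched here under assumptions, is to parse the page once and return both values. The function name parse_article and the sample HTML are hypothetical; only the .show-info and .show-content class names come from the listing above.

```python
import re
from datetime import datetime
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded article page; only the two class
# names are taken from the original listing.
SAMPLE_HTML = """
<div class="show-info">Published: 2017-09-28 10:15:00 Source: news office</div>
<div class="show-content">Full text of the article.</div>
"""

def parse_article(html):
    """Extract the publication time (datetime or None) and the body text."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.select_one(".show-info")
    content = soup.select_one(".show-content")
    # Look for a timestamp only if the info block exists
    match = None
    if info is not None:
        match = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", info.get_text())
    time = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S") if match else None
    detail = content.get_text(strip=True) if content is not None else ""
    return {"time": time, "detail": detail}

article = parse_article(SAMPLE_HTML)
print(article["time"], article["detail"])
```

In the loop, this would be called on the text of a single requests.get(url) response per article. Using select_one with a None check also avoids the IndexError that [0] raises when a page lacks the expected element.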


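2. Pick a topic you are interested in and do something similar, in preparation for the later exercise "scraping web data and analysing the text".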

import requests
from bs4 import BeautifulSoup

mt = "http://gz.meituan.com/shop/2380968"
res = requests.get(mt)
res.encoding = 'utf-8'  # set the encoding so Chinese text is not garbled
soup = BeautifulSoup(res.text, "html.parser")

for news in soup.select('li'):
    # Only handle list items that contain an element with class "title"
    if len(news.select('.title')) > 0:
        titles = news.select('.title')
        print(titles)  # prints the list of matching tags, markup included
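The loop above prints each match as a list of tag objects, markup included. If only the text is wanted, something along the following lines may help. This is a minimal sketch on a hypothetical HTML fragment, since the real shop page's structure is not shown here; the name extract_titles is also an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page layout may differ.
SAMPLE_HTML = """
<ul>
  <li><span class="title"> Set meal A </span></li>
  <li><span class="price">58</span></li>
  <li><span class="title">Set meal B</span></li>
</ul>
"""

def extract_titles(html):
    """Return the stripped text of every .title element nested inside an <li>."""
    soup = BeautifulSoup(html, "html.parser")
    # "li .title" is a descendant selector: any .title element under an <li>
    return [tag.get_text(strip=True) for tag in soup.select("li .title")]

print(extract_titles(SAMPLE_HTML))
```

A single descendant selector replaces the outer loop and the length check, and get_text(strip=True) drops the surrounding whitespace.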


Original article: http://www.cnblogs.com/zj2017/p/7608750.html
