Scraping Sina Hot Weibo Posts with Python (Zheng Xiao's personal blog)

Date: 2014-12-23 19:28:28


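This is a practice program I wrote while learning web scraping with Python. It is based on Python 3 and the BeautifulSoup library, and it scrapes the Sina Weibo "hot posts" page (hot.weibo.com). For each post it collects the poster's name, the post text, and the images it contains; when a post has several images, they are gathered into one list of image URLs.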

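The example uses one page of the hot jokes category as the scraping target. I could not find a reasonably precise publish-time field in the page, so the script does not collect one. The current idea is to read the time text that the page displays and then convert it case by case.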
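The time conversion idea could be sketched roughly as follows. This is only an illustration under assumptions: the function name `parse_weibo_time` and the four display formats it handles ("N分钟前", "今天 HH:MM", "M月D日 HH:MM", "YYYY-MM-DD HH:MM") are hypothetical and were not verified against the actual page, so the patterns would need to be adjusted to whatever text the page really shows.

```python
import re
from datetime import datetime, timedelta


def parse_weibo_time(text, now):
    """Convert a displayed time string to a datetime, or None if unrecognized.

    The formats handled here are assumptions made for illustration only.
    """
    text = text.strip()

    # "N分钟前" means "N minutes ago": subtract from the reference time
    m = re.fullmatch(r"(\d+)分钟前", text)
    if m:
        return now - timedelta(minutes=int(m.group(1)))

    # "今天 HH:MM" means "today HH:MM": reuse the reference date
    m = re.fullmatch(r"今天\s*(\d{1,2}):(\d{2})", text)
    if m:
        return now.replace(hour=int(m.group(1)), minute=int(m.group(2)),
                           second=0, microsecond=0)

    # "M月D日 HH:MM" has no year: assume the year of the reference time
    m = re.fullmatch(r"(\d{1,2})月(\d{1,2})日\s*(\d{1,2}):(\d{2})", text)
    if m:
        month, day, hour, minute = map(int, m.groups())
        return datetime(now.year, month, day, hour, minute)

    # Full date such as "2013-01-02 03:04"
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M")
    except ValueError:
        return None


if __name__ == "__main__":
    reference = datetime(2014, 12, 23, 19, 28, 28)
    for sample in ["5分钟前", "今天 08:05", "12月1日 07:30", "not a time"]:
        print(sample, "->", parse_weibo_time(sample, reference))
```

Passing the reference time in as `now` rather than calling `datetime.now()` inside the function keeps the conversion repeatable, which makes it easier to check each format separately.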

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request

# Spoofed request header so the request looks like it comes from a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'}
# Target URL; read the page source
fromurl = 'http://hot.weibo.com/?v=1899&page=2'
r = urllib.request.Request(url=fromurl, headers=headers)
response = urllib.request.urlopen(r)
page = response.read()
# Create the BeautifulSoup object (the parser is named explicitly to avoid the bs4 warning)
soup = BeautifulSoup(page, 'html.parser')
# Locate the main post nodes; each post on the page is one WB_detail div
tags = soup.find_all(name='div', attrs={'class': 'WB_detail'})
# Iterate over all post nodes
for tag in tags:
    # Find the poster's name inside the post node
    sender = tag.find(name='a', attrs={'class': 'WB_name S_func1'}).get_text()
    # Find the post text inside the post node
    text = tag.find(name='div', attrs={'class': 'WB_text'}).get_text()
    # Look for the post's images under this node
    thumbList = tag.find_all(name='img', attrs={'class': 'bigcursor'})
    img = []
    # If there are images, put all of their URLs into the img list
    if thumbList:
        for t in thumbList:
            img.append(t['src'])
    print(sender + text)
    print(img)
    print()
    print()
# Wait for Enter so the console window stays open
input()
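The download step of the script could also be wrapped in a small helper, sketched below. The names `build_request` and `fetch_page`, the 10-second timeout, and the choice to return `None` on failure are assumptions for illustration and are not part of the original program; only `urllib.request.Request` and `urllib.request.urlopen` come from the script itself.

```python
import urllib.error
import urllib.request

# Same browser-like User-Agent string as in the script above
USER_AGENT = ("Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36")


def build_request(url):
    """Build a GET request that carries the spoofed User-Agent header."""
    return urllib.request.Request(url=url, headers={"User-Agent": USER_AGENT})


def fetch_page(url, timeout=10):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError) as exc:
        # Network errors, HTTP errors and timeouts all end up here
        print("request failed:", exc)
        return None
```

With this helper, `page = fetch_page(fromurl)` would replace the three download lines of the script, and the parsing part would run only when `page` is not `None`.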

The program's output is shown in the figure:
[Figure: screenshot of the program output]


Original article: http://www.cnblogs.com/douyuehan/p/4180713.html
