Sina Weibo crawling notes (3): crawling a user's post list and repost list via the wap site, plus data cleaning

Posted: 2015-04-18 19:02:37

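With the wap login working, I can finally get on with crawling data. This time I need:

(1) A user's 1000 most recent posts: post id, repost count, and post time

(2) The repost list of a given post: who reposted it and when

(3) For a given user: following count, follower count, post count, and the average repost count over the most recent 100 posts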

=========================================

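Compared with simulating the login, crawling the data is much simpler. There are still a few pitfalls, though, which I am writing down as I go: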

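##Crawling the post list##

(1) The wap site looks as if it loads a fixed number of posts per page, but it does not! When crawling a page, always read the actual number of posts on that page first.

　　Sometimes a page shows 5 posts, sometimes 10...

(2) The post time and the "来自xxx" ("from xxx") source usually sit under the same tag, but that tag may contain further nested tags, so take care when extracting the text.

(3) I call time.sleep(2) after each page. So far, crawling 100 pages has not caused any problems (100 pages took about five and a half minutes). Proxies were not working for me again at this point: I tried more than ten and none worked, possibly because of campus network restrictions, so be careful with your own IP.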
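Points (1) and (3) together suggest stopping on the number of posts collected rather than on a fixed number of pages. Below is a minimal sketch of that idea, written for Python 3. It is not the code I actually ran: `collect_posts`, `fetch_page`, `limit`, `delay`, `max_pages` and `sleep` are all hypothetical names, and `fetch_page(page)` is assumed to return the list of posts parsed from one page.

```python
import time


def collect_posts(fetch_page, limit=1000, delay=2.0, max_pages=500,
                  sleep=time.sleep):
    """Collect up to `limit` posts, counting posts rather than pages.

    fetch_page(page) is a hypothetical callable returning the posts
    parsed from one page (5 of them, 10 of them, or none at all).
    """
    posts = []
    for page in range(1, max_pages + 1):
        page_posts = fetch_page(page)
        if not page_posts:
            # An empty page is taken to mean there are no more posts.
            break
        posts.extend(page_posts)
        if len(posts) >= limit:
            break
        # Pause between requests, as in point (3).
        sleep(delay)
    return posts[:limit]
```

Passing `sleep` as a parameter is only there so the loop can be tried out with a fake page source and no real waiting.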

The main part of the code looks like this:

    # Python 2 (urllib2). user_url and headers are defined elsewhere (not shown here).
    import re
    import time
    import urllib2

    from bs4 import BeautifulSoup

    tmp = open('test_6.txt', 'a')  # open the output file once, not once per post
    for page in range(1, 201):
        newUserUrl = user_url + '?page=%s' % page
        print newUserUrl
        req = urllib2.Request(newUserUrl, headers=headers)
        resp = urllib2.urlopen(req)
        soup = BeautifulSoup(resp.read())
        # re.compile matches part of the attribute: post divs have an id containing "M_"
        posts = soup.find_all('div', attrs={"id": re.compile("M_"), "class": 'c'})
        for post in posts:  # iterate over the posts actually present on this page
            tmp.write(str(post["id"]))
            tmp.write(' ')
            last_div = post.find_all('div')[-1]  # [-1] means the last div
            # repost count of the post, e.g. "转发[43]"
            tmp.write(last_div.find_all('a')[-3].string.encode('utf-8'))
            tmp.write(' ')
            # post time and source; .strings also yields text inside nested tags
            for string in last_div.find_all('span')[-1].strings:
                tmp.write(string.encode('utf-8') + ' ')
            tmp.write('\n')
        time.sleep(2)
    tmp.close()

A record obtained this way looks like:

M_CdF7juKD8 转发[43] 04月17日 10:28 来自微博 weibo.com

Cleaning the data:

Use regular expressions to extract the numbers and so on; see http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
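As a rough illustration of that cleaning step, here is a minimal Python 3 sketch that splits a record like the one above into fields and computes an average repost count. It assumes every line follows exactly the layout of the sample record; the names `RECORD_RE`, `parse_record` and `average_reposts` are made up for this example. Lines with any other layout (a different time format, say) simply come back as `None`.

```python
import re

# Layout assumed from the sample record:
# "M_CdF7juKD8 转发[43] 04月17日 10:28 来自微博 weibo.com"
RECORD_RE = re.compile(
    r"^(?P<id>M_\w+)\s+"
    r"转发\[(?P<reposts>\d+)\]\s+"
    r"(?P<month>\d{2})月(?P<day>\d{2})日\s+"
    r"(?P<hour>\d{2}):(?P<minute>\d{2})\s+"
    r"来自(?P<source>.*)$"
)


def parse_record(line):
    """Return a dict for one record line, or None if it does not match."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        return None
    return {
        "id": match.group("id"),
        "reposts": int(match.group("reposts")),
        # The record carries no year, so the time stays a "MM-DD HH:MM" string.
        "time": "%s-%s %s:%s" % (match.group("month"), match.group("day"),
                                 match.group("hour"), match.group("minute")),
        "source": match.group("source").strip(),
    }


def average_reposts(records):
    """Mean repost count over parsed records; 0.0 for an empty list."""
    if not records:
        return 0.0
    return sum(r["reposts"] for r in records) / float(len(records))
```

Running `parse_record` over each line of the output file and passing the first 100 non-`None` results to `average_reposts` would give the average repost count from requirement (3), assuming the file is ordered newest first.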

=========================================

##Crawling the repost list##

Original post: http://www.cnblogs.com/manqing/p/4437776.html