Scraping Baidu Tieba Data with Python

Date: 2017-07-25 12:34:12

Tags: html xpath xpage 广州 htm href clear http logs

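Outside of work, this humble coder does have a few hobbies. Some things, once you pick them up, you can never put down again; there is no way back to shore, and you are on a one-way road from then on. Keeping flowers, birds, fish, and insects is a hobby I have stuck with for decades.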

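I still need a job to pay for those hobbies, and during work hours I try to stay off external sites and avoid things unrelated to work. For a while, though, it was like an addiction: I kept wanting to check the trade posts, with their photos, for lovebirds (牡丹鹦鹉) and cockatiels (玄凤鹦鹉) on Baidu Tieba.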

So I wrote the following code:

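import requests
from lxml import etree

# Thread to scan; swap in one of the other thread URLs as needed.
# url = r"http://tieba.baidu.com/p/5197963751"
url = r"http://tieba.baidu.com/p/5195568368"
# url = r"http://tieba.baidu.com/p/5004763771"
keyword = "广州"  # Guangzhou

s = requests.session()


def findgz(pageindex):
    """Print every post on the given page whose text contains the keyword."""
    r = s.get("{1}?pn={0}".format(pageindex, url))
    # print(r.text.encode("utf-8"))
    htmlpage = etree.HTML(r.text)

    # The class value, double space included, is matched exactly.
    divlist = htmlpage.xpath(
        "//div[@class='d_post_content j_d_post_content  clearfix']")
    print("第{0}页".format(pageindex))  # "Page N"
    for x in divlist:
        texts = x.xpath('text()')
        # Print a matching post once, even if several lines contain the keyword.
        if any(keyword in y for y in texts):
            for z in texts:
                print(z.replace(' ', ''))
            print('\n')


r = s.get(url)
tmphtml = etree.HTML(r.text)
# The "尾页" (last page) link carries the final page number in its href.
# Fall back to a single page if the link is absent.
lastlinks = tmphtml.xpath("//a[text()='尾页']")
maxpageindex = lastlinks[0].get("href").split("=")[-1] if lastlinks else "1"
print("总共{0}页".format(maxpageindex))  # "N pages in total"

for x in range(1, int(maxpageindex) + 1):
    findgz(x)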
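
The script reads the page count by splitting the "尾页" link's href on "=" and taking the last piece, which only works when pn is the final query parameter. A minimal sketch of a more tolerant alternative, using only the standard library, might look like the following. The helper name last_page_from_href and the sample hrefs are made up for illustration; they are not part of the original script.

```python
from urllib.parse import urlsplit, parse_qs


def last_page_from_href(href, default=1):
    """Return the page number stored in the pn query parameter of href.

    Falls back to default when pn is missing or is not an integer.
    """
    query = parse_qs(urlsplit(href).query)
    values = query.get("pn", [])
    try:
        return int(values[0])
    except (IndexError, ValueError):
        return default


# Hypothetical hrefs in the shape the script expects from the "尾页" link.
print(last_page_from_href("/p/5195568368?pn=8"))
print(last_page_from_href("/p/5195568368"))
```

In the script, this could stand in for the .split("=")[-1] step, so that an extra query parameter after pn would not break the page count.
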

Running the script prints the following:

总共8页
第1页
百度昵称：aiiye1234
交易物品：白脸黄脸
物品价格：400-1000
联系方式：扣扣822616382
地理位置：广州
其它备注：开始学吃了
物品图片：
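
Each matching post is printed as "label：value" lines, so it is easy to turn one into a dictionary for further filtering. The sketch below assumes every field uses the full-width colon "：" seen in the output above. The function name parse_post is hypothetical, and the sample lines are copied from that output.

```python
def parse_post(lines):
    """Turn 'label：value' lines into a dict, skipping lines without a colon."""
    record = {}
    for line in lines:
        # Assumption: fields are separated by the full-width colon "：".
        label, sep, value = line.partition("：")
        if sep:
            record[label.strip()] = value.strip()
    return record


sample = [
    "百度昵称：aiiye1234",
    "交易物品：白脸黄脸",
    "物品价格：400-1000",
    "地理位置：广州",
    "物品图片：",
]
post = parse_post(sample)
print(post["地理位置"])
```

With posts in this form, the location check could be applied to the 地理位置 field alone rather than to the whole post text.
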

Original post: http://www.cnblogs.com/yicaifeitian/p/7233224.html
