Parsing with BeautifulSoup

Date: 2018-07-01 17:48:56

Extracting the main content

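import json

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
r = requests.get('http://seputu.com/', headers=headers)
# r.text is already a decoded str, so from_encoding would be ignored; it is omitted here
soup = BeautifulSoup(r.text, 'html.parser')
content = []
for mulu in soup.find_all(class_='mulu'):
    h2 = mulu.find('h2')
    if h2 is not None:
        h2_title = h2.string  # section title
        chapters = []  # renamed from "list" to avoid shadowing the built-in
        # Collect the URL and chapter title from every <a> tag inside the box
        for a in mulu.find(class_='box').find_all('a'):
            href = a.get('href')
            box_title = a.get('title')
            chapters.append({'href': href, 'box_title': box_title})
        content.append({'title': h2_title, 'content': chapters})
# Write UTF-8 and keep non-ASCII titles readable in the output file
with open('result.json', 'w', encoding='utf-8') as fp:
    json.dump(content, fp=fp, indent=4, ensure_ascii=False)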
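
The same extraction logic can be tried without a network request. The following is only a minimal sketch: the SAMPLE_HTML markup, the example.com URLs, and the extract_sections helper are hypothetical and merely imitate the structure the script above expects (a "mulu" block containing an h2 heading and a "box" of links). It also skips blocks that lack either element, which the original script does not do for a missing "box".

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout the script above relies on
SAMPLE_HTML = """
<div class="mulu">
  <h2>Volume One</h2>
  <div class="box">
    <a href="http://example.com/1.html" title="Chapter 1">Chapter 1</a>
    <a href="http://example.com/2.html" title="Chapter 2">Chapter 2</a>
  </div>
</div>
<div class="mulu">
  <div class="box">
    <a href="http://example.com/x.html" title="No heading">x</a>
  </div>
</div>
"""


def extract_sections(html):
    """Return one dict per "mulu" block that has both an h2 and a "box"."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for mulu in soup.find_all(class_="mulu"):
        h2 = mulu.find("h2")
        box = mulu.find(class_="box")
        if h2 is None or box is None:
            continue  # skip blocks without a heading or a link container
        chapters = [
            {"href": a.get("href"), "box_title": a.get("title")}
            for a in box.find_all("a")
        ]
        sections.append({"title": h2.string, "content": chapters})
    return sections


result = extract_sections(SAMPLE_HTML)
print(result)
```

Only the first block is kept here, because the second one has no h2 heading.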

Original post: https://www.cnblogs.com/wanglinjie/p/9250489.html
