
Parsing with BeautifulSoup

Posted: 2018-07-01 17:48:56


Extracting the main content

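import json

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
r = requests.get('http://seputu.com/', headers=headers)
# Pass the raw bytes (r.content): from_encoding is ignored when the markup is already a str.
soup = BeautifulSoup(r.content, 'html.parser', from_encoding='utf-8')
content = []
for mulu in soup.find_all(class_='mulu'):
    h2 = mulu.find('h2')
    if h2 is not None:
        h2_title = h2.string  # section title
        chapters = []  # renamed from "list" to avoid shadowing the built-in
        # Collect the URL and chapter title from every <a> tag inside the box.
        for a in mulu.find(class_='box').find_all('a'):
            href = a.get('href')
            box_title = a.get('title')
            chapters.append({'href': href, 'box_title': box_title})
        content.append({'title': h2_title, 'content': chapters})
# Write UTF-8 and keep non-ASCII titles readable instead of \uXXXX escapes.
with open('result.json', 'w', encoding='utf-8') as fp:
    json.dump(content, fp=fp, indent=4, ensure_ascii=False)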
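
The script above writes a list of sections to result.json, each holding a title and a list of chapter links. As a rough sketch of how that file might be consumed afterwards, the snippet below flattens the same nested shape into simple rows. The sample data, the example.com URLs, and the flatten helper are all made up for illustration; only the key names (title, content, href, box_title) come from the script.

```python
import json

# Hypothetical sample in the same shape the script writes to result.json:
# a list of sections, each with a title and a list of chapter links.
sample = [
    {
        "title": "Section A",
        "content": [
            {"href": "http://example.com/a/1.html", "box_title": "Chapter 1"},
            {"href": "http://example.com/a/2.html", "box_title": "Chapter 2"},
        ],
    },
    {
        "title": "Section B",
        "content": [
            {"href": "http://example.com/b/1.html", "box_title": "Chapter 1"},
        ],
    },
]


def flatten(sections):
    """Turn the nested section/chapter structure into (section, chapter, url) rows."""
    rows = []
    for section in sections:
        for chapter in section["content"]:
            rows.append((section["title"], chapter["box_title"], chapter["href"]))
    return rows


# Round-trip through JSON text, as json.dump / json.load would do with a file.
text = json.dumps(sample, indent=4, ensure_ascii=False)
rows = flatten(json.loads(text))
for row in rows:
    print(row)
```

With the real file, json.load on result.json opened with encoding='utf-8' would take the place of the in-memory round trip.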

 


Original article: https://www.cnblogs.com/wanglinjie/p/9250489.html
