
Parsing HTML with Python BeautifulSoup

Posted: 2018-08-05 11:53:39


* Signatures of BeautifulSoup's .find() and .findAll()

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
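The arguments in these signatures are easier to follow on a small document. Below is a minimal sketch, assuming bs4 is installed; the SAMPLE markup and all variable names are made up for illustration. It uses find_all, which bs4 provides as the newer spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE = """
<div id="text">
  <span class="green">Anna</span>
  <span class="red">Well, Prince</span>
  <p><span class="green">Vasili</span></p>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# tag + attributes: every <span class="green">, at any depth.
all_green = soup.find_all("span", {"class": "green"})

# find() returns only the first match (or None when nothing matches).
first_green = soup.find("span", {"class": "green"})

# limit: stop after the first two matching tags.
limited = soup.find_all("span", limit=2)

# recursive=False: look only at direct children of the <div>.
direct = soup.find("div").find_all("span", recursive=False)

# keywords: filter on an attribute by name.
by_keyword = soup.find_all(id="text")

green_names = [s.get_text() for s in all_green]
print(green_names)
```

In this sketch find() behaves like find_all() with limit=1, except that it returns the tag itself rather than a one-element list.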

  

* Get span.green

bsObj.findAll("span", {"class":"green"})

#!/usr/local/bin/python
# -*- coding: UTF-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getBsObj(url):
    # Fetch the page with a 3-second timeout; return None on a network error.
    try:
        html = urlopen(url, None, 3)
    except (HTTPError, URLError) as e:
        print(e)
        return None
    # Parse the response body with Python's built-in HTML parser.
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
    except AttributeError as e:
        print(e)
        return None
    return bsObj

bsObj = getBsObj("http://www.pythonscraping.com/pages/warandpeace.html")
if bsObj is not None:
    # Print the text of every <span class="green">.
    nameList = bsObj.findAll("span", {"class":"green"})
    for name in nameList:
        print(name.get_text())

  

* Get h1, h2, h3, h4, h5, h6

bsObj.findAll({"h1","h2","h3","h4","h5","h6"})

  

// JavaScript: build a string in which each comma-separated element is wrapped in quotes

function quote(s) {
    return "\"" + s.split(",").join("\",\"") + "\"";
}
var s = "h1,h2,h3,h4,h5,h6"
console.log(quote(s))
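In Python the quoting step is not strictly needed, because the comma-separated string can be split into a list and passed to the search directly. A minimal sketch, assuming bs4 is installed; the SAMPLE markup is made up, and the regular expression is an alternative shown for comparison, not part of the original post.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE = "<h1>Title</h1><p>intro</p><h2>Part</h2><h3>Chapter</h3><h7>nope</h7>"
soup = BeautifulSoup(SAMPLE, "html.parser")

# Split the tag names instead of quoting them by hand.
heading_tags = "h1,h2,h3,h4,h5,h6".split(",")
by_list = soup.find_all(heading_tags)

# Alternative: a compiled regular expression matched against tag names.
by_regex = soup.find_all(re.compile(r"^h[1-6]$"))

heading_texts = [h.get_text() for h in by_list]
print(heading_texts)
```

Both searches return the headings in document order and skip the non-standard h7 tag.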

  

* Get span.green, span.red

bsObj.findAll("span", {"class":{"green", "red"}})
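Matching several classes at once means "any of these", not "all of these". The sketch below, assuming bs4 is installed, shows this on made-up markup; it uses the class_ keyword with a list, which is one of several equivalent ways to express the same filter.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE = (
    '<span class="green">Anna</span>'
    '<span class="red">Well, Prince</span>'
    '<span class="blue">ignored</span>'
    '<span class="red bold">Heavens!</span>'
)
soup = BeautifulSoup(SAMPLE, "html.parser")

# A span matches if any one of its classes is "green" or "red".
matched = soup.find_all("span", class_=["green", "red"])

texts = [s.get_text() for s in matched]
print(texts)
```

The last span carries two classes and still matches, because the search is applied to each class value separately.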

* Count the text nodes on the page that are exactly "the prince" (text= matches the whole string, not a substring)

nameList = bsObj.findAll(text="the prince")
print(len(nameList))
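A plain string passed as the text filter matches only strings that are exactly equal to it, and the results are strings rather than tags. A minimal sketch, assuming bs4 4.4 or later (where the filter is also available under the name string); the SAMPLE markup is made up, and the regular expression variant is an addition for comparison.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE = (
    '<p><span class="green">the prince</span> spoke, and '
    '<span class="green">the prince</span> left with the princess.</p>'
)
soup = BeautifulSoup(SAMPLE, "html.parser")

# Exact match: only strings equal to "the prince".
exact = soup.find_all(string="the prince")

# Substring match: a compiled regex is searched inside every string,
# so "the princess" is counted as well.
containing = soup.find_all(string=re.compile("the prince"))

print(len(exact), len(containing))

# Each result is a string; .parent gives the enclosing tag.
parent_names = [s.parent.name for s in exact]
```

To count tags rather than strings, one option is to collect the .parent of each result.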

* Find #text (id="text")

allText = bsObj.find(id="text")
print(allText.get_text())

* Find div#text

allText = bsObj.find("div", {"id":"text"})
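find() returns None when nothing matches, so calling get_text() on the result can raise AttributeError. The sketch below, assuming bs4 is installed, guards against that; text_or_default is a hypothetical helper written for this example, not a bs4 function.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
soup = BeautifulSoup('<div id="text">War and Peace</div>', "html.parser")

found = soup.find("div", {"id": "text"})
missing = soup.find("div", {"id": "no-such-id"})  # no match, so None


def text_or_default(tag, default=""):
    """Return the tag's text, or a default when find() returned None."""
    return tag.get_text() if tag is not None else default


print(text_or_default(found))
print(text_or_default(missing, "(not found)"))
```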

* Find the first span.red that is a direct child of div#text (div#text > span.red)

red = bsObj.find("div", {"id":"text"}).find("span", {"class":"red"}, recursive=False)
print(red.get_text())
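Setting recursive to False restricts the search to direct children, which is closer to the CSS child combinator (>) than to :first-child. A minimal sketch on made-up markup, assuming bs4 is installed, shows the difference between the two search depths.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
SAMPLE = """
<div id="text">
  <p><span class="red">nested</span></p>
  <span class="green">green child</span>
  <span class="red">direct child</span>
</div>
"""
soup = BeautifulSoup(SAMPLE, "html.parser")
container = soup.find("div", {"id": "text"})

# Default search: descends into <p> and finds the nested span first.
any_depth = container.find("span", {"class": "red"})

# recursive=False: only direct children of the <div> are considered.
direct_only = container.find("span", {"class": "red"}, recursive=False)

print(any_depth.get_text(), "|", direct_only.get_text())
```

The direct-child result here is the third element child of the div, so the search does not require the span to be the first child.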

  

 


Original article: https://www.cnblogs.com/mingzhanghui/p/9424791.html
