
Python web scraping demo

Posted: 2018-09-15 12:21:57


# Extract text from an HTML string
from bs4 import BeautifulSoup

html_sample = """
<html>
<body>
<h1 id="title">Hello world</h1>
<a href="#www.baidu.com" class="link"> This is link1</a>
<a href="#link2" class="link"> This is link2</a>
</body>
</html>
"""

soup = BeautifulSoup(html_sample, "html.parser")
print(soup.text)

# Select by tag name; select() always returns a list
print(soup.select("h1")[0].text)
print(soup.select("a")[0].text)
print(soup.select("a")[1].text)

for alink in soup.select("a"):
    print(alink.text)

# Select by id (#title) and by class (.link)
print(soup.select("#title")[0].text)
print(soup.select(".link")[0].text)

# Read the href attribute of every link
alinks = soup.select("a")
for link in alinks:
    print(link["href"])
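
Because `select` always returns a list, indexing it with `[0]` raises an IndexError when nothing matches, and `link["href"]` raises a KeyError when the attribute is missing. A minimal sketch of the more forgiving variants, `select_one` and `Tag.get`, follows. The markup and variable names here are made up for illustration and are not from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <a> deliberately has no href attribute.
html = (
    '<h1 id="title">Hello world</h1>'
    '<a class="link">no href</a>'
    '<a class="link" href="#link2">link2</a>'
)
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when nothing matches.
title = soup.select_one("#title")
missing = soup.select_one("#nope")

# Tag.get returns None instead of raising when the attribute is absent.
hrefs = [a.get("href") for a in soup.select("a.link")]

print(title.text)
print(missing)
print(hrefs)
```

Checking for `None` before using the result is usually enough to keep a scraper from crashing on pages whose markup differs from what was expected.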

demo2:

import requests
import pandas
from bs4 import BeautifulSoup

# Download the page and parse it
res = requests.get("http://news.qq.com/")
soup = BeautifulSoup(res.text, "html.parser")

# Collect the title and URL of each news item
newsary = []
for news in soup.select(".Q-tpWrap .text"):
    link = news.select("a")[0]
    newsary.append({"title": link.text, "url": link["href"]})

# Save the results to an Excel file
# (to_excel needs an Excel writer such as openpyxl to be installed)
newsdf = pandas.DataFrame(newsary)
newsdf.to_excel("news.xlsx")
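
Demo 2 depends on the live markup of news.qq.com, which may have changed since the post was written, so the selector `.Q-tpWrap .text` may no longer match anything. The sketch below wraps the same extraction in a function that can be tried on a local HTML string first. The function name `extract_news`, the sample markup, and the example.com URLs are assumptions made up for illustration; they are not part of the original post.

```python
from bs4 import BeautifulSoup


def extract_news(html, selector=".Q-tpWrap .text"):
    """Return a list of {"title", "url"} dicts, one per matching block."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.select(selector):
        link = block.select_one("a")  # first <a> inside the block, or None
        if link is None or not link.get("href"):
            continue  # skip blocks without a usable link
        records.append({"title": link.text.strip(), "url": link["href"]})
    return records


# Hypothetical markup imitating the structure demo 2 expects.
sample_html = """
<div class="Q-tpWrap">
  <div class="text"><a href="http://example.com/a">First headline</a></div>
</div>
<div class="Q-tpWrap">
  <div class="text"><a href="http://example.com/b">Second headline</a></div>
</div>
<div class="Q-tpWrap">
  <div class="text">No link here</div>
</div>
"""

news = extract_news(sample_html)
for item in news:
    print(item["title"], item["url"])
```

With a live page, the same function could be called as `extract_news(requests.get(url).text)`, and the resulting list passed to `pandas.DataFrame` as in demo 2.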

Recommendation: Jupyter Notebook is a convenient way to practice these examples.


Original source: https://www.cnblogs.com/hujianglang/p/9650329.html
