Scraping Qiushibaike (Qiubai) with Python

Posted: 2017-01-19 23:59:17

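Goal: print the text posts from several pages of Qiubai in one go, so you can read as much as you like and save time.

Tools: Python 2.7, BeautifulSoup, requests (urllib2 was not used because it is more cumbersome).

Here is the full source first: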

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

page_number = 1
pages = int(raw_input("Enter the total number of pages you want to read:\n"))

while page_number <= pages:
    # Build the URL of the current page of the hot list
    myUrl = "http://www.qiubai.com/hot/page/" + str(page_number)
    print 'Page %d:' % page_number
    res = requests.get(myUrl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Every post body is an element with class "content"
    content_link = soup.select('.content')
    for clink in content_link:
        print clink.text
    page_number += 1
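The paging logic in the loop can be separated from the network request, which makes it easy to check offline. Below is a minimal sketch that assumes the same URL pattern as the script. `build_page_urls` and `BASE_URL` are hypothetical names introduced for illustration, not part of the original code. The snippet avoids Python 2-only syntax, so it should run on both Python 2.7 and Python 3.

```python
BASE_URL = "http://www.qiubai.com/hot/page/"  # URL prefix taken from the script


def build_page_urls(total_pages):
    """Return the hot-list URLs for pages 1..total_pages (hypothetical helper)."""
    if total_pages < 1:
        # Nothing to fetch for zero or negative page counts
        return []
    return [BASE_URL + str(n) for n in range(1, total_pages + 1)]


urls = build_page_urls(3)
for url in urls:
    print(url)
```

Each URL in the returned list could then be passed to requests.get in place of the manually incremented counter.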

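Key points:

1. requests

A package for retrieving network resources (URLs).

It addresses the shortcomings of urllib2 and lets you fetch network resources in the simplest possible way.

It can access network resources with the REST operations (POST, PUT, GET, DELETE).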

import requests

newsurl = 'http://qiubai.com/hot/page/1'  # Qiubai URL, page 1
res = requests.get(newsurl)
res.encoding = 'utf-8'  # the page content is UTF-8 encoded

# encode converts a unicode string into a byte string in another encoding
# decode converts a byte string in another encoding into a unicode string

print(res.text)
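The encode/decode relationship described in the comments can be checked without touching the network. Below is a small sketch. The sample string is made up for illustration, and the code is written to behave the same on Python 2.7 and Python 3 (where the unicode type is simply called str).

```python
# -*- coding: utf-8 -*-
text = u"糗事百科"  # a unicode string (illustrative sample, not fetched from the site)

utf8_bytes = text.encode("utf-8")  # unicode -> byte string in UTF-8
gbk_bytes = text.encode("gbk")     # unicode -> byte string in GBK

# decode reverses encode, provided the same encoding is named on both sides
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text

# The same text occupies a different number of bytes in each encoding
print(len(text))
print(len(utf8_bytes))
print(len(gbk_bytes))
```

Setting res.encoding plays the role of the decode step: it tells requests which encoding to use when turning the raw response bytes into res.text.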

2. BeautifulSoup

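from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'html.parser')  # 'html.parser' is the HTML parser; omitting it triggers a warning
# print soup.text  # prints all the text in the page

The content we want sits inside specific tags, so we use select to find it.

# html_sample stands for any string of HTML markup
soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select('h1')  # use select to find the elements with the h1 tag
print(header)               # select returns a list
print header[0]             # the first matching element
print header[0].text        # extract its inner text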

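2.1 Getting elements with a specific CSS attribute:

a. Use select to find all elements whose id is title (prefix the id with #)

alink = soup.select('#title')
print(alink)

b. Use select to find all elements whose class is link (prefix the class with a dot .)

soup = BeautifulSoup(html_sample, 'html.parser')
for link in soup.select('.link'):
    print link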

2.2 Getting the links inside all a tags

Use select to find every a tag and read its href attribute

alinks = soup.select('a')
for link in alinks:
    print link['href']  # raises KeyError if an a tag has no href
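The snippets in section 2 rely on an `html_sample` string that is never shown. Below is a self-contained sketch with a made-up `html_sample`. Its markup, ids, class names, and link targets are assumptions chosen only to exercise each selector. It is written with Python 3 print calls and assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
html_sample = """
<html><body>
<h1 id="title">Hello World</h1>
<a href="/page/1" class="link">This is link1</a>
<a href="/page/2" class="link">This is link2</a>
</body></html>
"""

soup = BeautifulSoup(html_sample, "html.parser")

headers = soup.select("h1")      # by tag name; select returns a list
title_text = headers[0].text     # inner text of the first match

by_id = soup.select("#title")    # by id (prefix with #)
by_class = soup.select(".link")  # by class (prefix with .)

# Collect the href attribute of every a tag
hrefs = [a["href"] for a in soup.select("a")]

print(title_text)
print(len(by_id), len(by_class))
print(hrefs)
```

The `.content` selector in the main script works the same way as `.link` here; only the class name differs.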

#the end

A simple crawler, written for practice. If you have good suggestions or comments, please leave a message. Thanks!


Original article: http://www.cnblogs.com/leanjay/p/6308964.html
