Scraping Information from the Web

Posted: 2017-05-20 21:52:06

I. The webbrowser module: opening a browser to a specific page

The webbrowser.open() function launches a new browser window opened to the URL you pass it.

#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])  # Get address from command line.
else:
    address = pyperclip.paste()       # Get address from clipboard.

webbrowser.open('https://www.google.com/maps/place/' + address)
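An address taken from the command line or clipboard can contain spaces, commas, or non-ASCII characters. Browsers usually tolerate these, but percent-encoding the address first is safer. Below is a minimal sketch using the standard library's urllib.parse.quote. The helper name build_map_url is hypothetical, and the URL prefix is assumed to be Google Maps' place path.

```python
from urllib.parse import quote

# Assumed URL prefix for a Google Maps place lookup.
MAPS_PREFIX = 'https://www.google.com/maps/place/'

def build_map_url(address):
    """Return a map URL with the address percent-encoded (hypothetical helper)."""
    # quote() escapes spaces as %20 and commas as %2C; safe='' also
    # escapes '/' so the address stays a single path segment.
    return MAPS_PREFIX + quote(address.strip(), safe='')

print(build_map_url('870 Valencia St, San Francisco'))
```

Passing the returned string to webbrowser.open() opens the map just as the plain concatenation does.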

II. The requests module: downloading files and web pages from the Internet

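To download a file and save it to disk:

① Call requests.get() to download the file.

② Call open() with 'wb' to create a new file in write-binary mode.

③ Loop over the Response object's iter_content() method.

④ Call write() on each iteration to write the chunk to the file.

⑤ Call close() to close the file.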

import requests

res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
res.raise_for_status()     # Stop the program if the download failed.
playFile = open('RomeoAndJuliet.txt', 'wb')
for chunk in res.iter_content(100000):
    # write() returns the number of bytes written; in the interactive
    # shell these show up as 100000 and then 78981.
    playFile.write(chunk)

playFile.close()
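The same write loop can be wrapped in a small helper that uses a with block, so the file is closed even if a write fails partway through. This is a minimal sketch: save_chunks and its parameters are hypothetical names, not part of requests. It accepts any iterable of bytes objects, such as the one returned by iter_content().

```python
def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return total bytes written."""
    total = 0
    # 'wb' opens the file in write-binary mode; the with block closes it
    # automatically, replacing the explicit close() call.
    with open(path, 'wb') as out_file:
        for chunk in chunks:
            # write() returns the number of bytes it wrote.
            total += out_file.write(chunk)
    return total
```

With a real download, the call would look like save_chunks(res.iter_content(100000), 'RomeoAndJuliet.txt').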

III. Beautiful Soup: parsing HTML, the format web pages are written in

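1. bs4.BeautifulSoup() returns a BeautifulSoup object.

2. soup.select() returns a list of Tag objects; a Tag is how Beautiful Soup represents an HTML element.

select() takes a CSS selector (plenty of selector examples are available online).

3. getText() returns the element's text, that is, the content between its opening and closing tags.

4. get() returns the value of an attribute.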
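The four calls above can be tried on a small inline document without downloading anything. This is a minimal sketch: the HTML snippet, its id and class names, and the example.com URL are all made up for illustration.

```python
import bs4

# Made-up HTML used only for this illustration.
html = '''
<html><body>
  <p id="author">Example Author</p>
  <a class="link" href="http://example.com/">Example site</a>
</body></html>
'''

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = bs4.BeautifulSoup(html, 'html.parser')

author_elems = soup.select('#author')   # CSS id selector; returns a list of Tags
link_elems = soup.select('a.link')      # <a> elements with class="link"

author_text = author_elems[0].getText()  # text between the tags
link_href = link_elems[0].get('href')    # value of the href attribute
missing = link_elems[0].get('title')     # attribute not present: None

print(author_text, link_href, missing)
```

Note that select() always returns a list, even for a single match, so the element has to be indexed before calling getText() or get().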

#! python3
# lucky.py - Open several Google search results.

import requests, sys, webbrowser, bs4

print('Googling...')   # Display text while downloading the Google page.
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('.r a')

# Open a browser tab for each result.
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
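Prefixing every href with 'http://google.com' only works when the links are site-relative. The standard library's urllib.parse.urljoin handles both relative and absolute links. Below is a minimal sketch: absolute_links is a hypothetical helper, and the sample hrefs are made up.

```python
from urllib.parse import urljoin

def absolute_links(base_url, hrefs, limit=5):
    """Resolve up to `limit` hrefs against base_url, skipping missing ones."""
    resolved = []
    for href in hrefs:
        if len(resolved) >= limit:
            break
        if not href:   # Tag.get('href') returns None if the attribute is absent
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        resolved.append(urljoin(base_url, href))
    return resolved

links = absolute_links('http://google.com',
                       ['/url?q=example', None, 'https://example.com/page'])
print(links)
```

Each resolved URL can then be passed to webbrowser.open() in the loop.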

IV. selenium: launching and controlling a web browser

(selenium can fill in forms and simulate mouse clicks in the browser it controls.)

1. Starting a selenium-controlled browser

>>> from selenium import webdriver
>>> browser = webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>
>>> browser.get('http://inventwithpython.com')

2. Finding elements on the page

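1. The find_element_* methods return a single WebElement object.

2. The find_elements_* methods return a list of WebElement objects.

3. click() simulates a mouse click on the element.

4. send_keys() types text into a form field; sending the ENTER key (or calling submit() on the element) then submits the form.

5. from selenium.webdriver.common.keys import Keys gives access to special keys to pass to send_keys().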

Original article: http://www.cnblogs.com/llw1121/p/6850579.html
