Zhihu crawler (based on Selenium)

Posted: 2018-03-30 21:53:01

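Today I'll write about a crawler for Zhihu. It uses Selenium to scrape the data.

Approach: open the site and go to the login screen --> choose QR-code login --> click "发现" (Discover) --> type the query into the search box and press Enter --> scroll to the very bottom of the page --> collect all the results and write them to a txt file.

Full code: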

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

url = 'https://www.zhihu.com/'
browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get(url)

# Click the login entry. The data-reactid value depends on the page markup
# at the time of writing and may need updating.
a = WebDriverWait(browser, 15, 0.6).until(EC.presence_of_element_located((By.XPATH, '//span[@data-reactid="94"]')))
ActionChains(browser).click(a).perform()

# Switch to QR-code login.
b = WebDriverWait(browser, 15, 0.7).until(EC.visibility_of_element_located((By.XPATH, '//button[@class="Button Button--plain"]')))
ActionChains(browser).click(b).perform()
# Pause for 15 seconds, presumably to leave time to scan the QR code.
time.sleep(15)

# After login, open the "发现" (Discover) page.
c = WebDriverWait(browser, 17, 0.7).until(EC.presence_of_element_located((By.LINK_TEXT, '发现')))
ActionChains(browser).click(c).perform()

# Click the search box, type the query and press Enter.
d = WebDriverWait(browser, 15, 0.5).until(EC.visibility_of_element_located((By.XPATH, '//input[@placeholder="搜索你感兴趣的内容..."]')))
ActionChains(browser).click(d).perform()
e = input('Enter the search term: ')
ActionChains(browser).send_keys(e).perform()
ActionChains(browser).send_keys(Keys.ENTER).perform()

# Scroll to the bottom of the page by pressing the Down key repeatedly.
for x in range(20000):
    ActionChains(browser).send_keys(Keys.DOWN).perform()

# Collect every title and its link. Only links that contain a highlighted
# title span are selected, so each title is paired with its own href.
link_list = browser.find_elements(By.XPATH, '//div[@class="List-item"]//a[span[@class="Highlight"]]')

info_dict = {}
for link in link_list:
    title = link.find_element(By.XPATH, './span[@class="Highlight"]').text
    info_dict[title] = link.get_attribute('href')

print(info_dict)

# Write the results to a text file, one "title---link" line per result.
with open('/home/xxxxxxxx/桌面/知乎爬虫/%s.txt' % e, 'w', encoding='utf-8') as text_document:
    for ka, va in info_dict.items():
        text_document.write('%s-------------------------%s\n' % (ka, va))

browser.quit()
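
The last step of the script writes one line per result, with the title and the link separated by a run of dashes. As a rough sketch, that step could be pulled into a pair of helpers so the file can also be read back later. The names `save_results`, `load_results` and `SEPARATOR` are made up for this illustration and are not part of the original script, and the demo data below is invented.

```python
import os
import tempfile

# Same separator the script writes between a title and its link.
SEPARATOR = '-------------------------'


def save_results(info, path):
    """Write one 'title<SEPARATOR>link' line per entry, encoded as UTF-8."""
    with open(path, 'w', encoding='utf-8') as text_document:
        for title, link in info.items():
            text_document.write('%s%s%s\n' % (title, SEPARATOR, link))


def load_results(path):
    """Read a file written by save_results back into a dict."""
    info = {}
    with open(path, encoding='utf-8') as text_document:
        for line in text_document:
            # Split on the first occurrence of the separator only.
            title, sep, link = line.rstrip('\n').partition(SEPARATOR)
            if sep:  # skip lines that do not contain the separator
                info[title] = link
    return info


# Small demo with invented data, written to a temporary directory.
demo = {
    '示例标题': 'https://www.zhihu.com/question/1',
    'Example title': 'https://www.zhihu.com/question/2',
}
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
save_results(demo, demo_path)
restored = load_results(demo_path)
```

Passing `encoding='utf-8'` explicitly keeps Chinese titles readable regardless of the system's default encoding.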

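Since I have not yet learned database storage, the results are saved to a text file. If the simulated scrolling is unclear, see this post: http://www.cnblogs.com/sniper-huohuohuo/p/8671895.html

The code is fairly short and easy to follow, so I won't explain it line by line. If anything is unclear, leave a comment and I will reply as soon as I can.

Thanks for reading.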
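
The script scrolls by sending the Down key 20000 times. A different technique, not the one used above, is to scroll with JavaScript and stop once the page height no longer grows. The sketch below assumes that results are loaded lazily as the page scrolls; `scroll_to_bottom`, `pause` and `max_rounds` are names invented for this example. It relies only on the WebDriver method `execute_script`.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll down until the document height stops growing.

    driver:     any object with Selenium's execute_script method,
                for example the `browser` from the script above.
    pause:      seconds to wait after each scroll so new items can load.
    max_rounds: upper bound on scrolls, in case the page never stops growing.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the document.
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        rounds += 1
        time.sleep(pause)
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            # Nothing new was loaded, so this is taken to be the real bottom.
            break
        last_height = new_height
    return rounds
```

With the script above it would be called as `scroll_to_bottom(browser)` in place of the `for x in range(20000)` loop. The `pause` value may need tuning for slow connections.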

------------by sniper-huohuo-------------

---------- To know shame is to find courage ---------

Original post: https://www.cnblogs.com/sniper-huohuohuo/p/8678050.html