Python crawler: Case 1: 360 Index

Date: 2016-05-13 01:46:02


pip install beautifulsoup4

pip install requests

pip install selenium

Download PhantomJS (a headless browser, used here to execute the page's JavaScript)

Install Firebug in Firefox

Create a directory named baidupc

cd baidupc

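Create a virtual environment

virtualenv macp

Activate the virtual environment

On Windows, go into macp/Scripts and run

activate

On Mac, go into macp/bin and run

source activate

The advantage of a virtual environment is isolation: you can experiment freely without affecting your existing environment.

Unzip phantomjs-2.1.1-macosx.zip and copy the phantomjs binary from its bin directory into baidupc/macp/bin

(Download the PhantomJS archive that matches your operating system; on Windows the virtual environment's directory should be baidupc\macp\Scripts)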

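Case 1: 360 Index

360 Index presents its data in a straightforward way: enter a keyword on the home page and look at the resulting URL to see how it works.

URL: http://index.so.com/#trend?q=欢乐颂

(The actual URL is http://index.so.com/#trend?q=%E6%AC%A2%E4%B9%90%E9%A2%82 ; the Chinese characters must be URL-encoded.)

Adding a date range gives three forms of URL: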

http://index.so.com/#trend?q=%E6%AC%A2%E4%B9%90%E9%A2%82&t=7

http://index.so.com/#trend?q=%E6%AC%A2%E4%B9%90%E9%A2%82&t=30

http://index.so.com/#trend?q=%E6%AC%A2%E4%B9%90%E9%A2%82&t=201603|201605
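
The URL patterns above can be reproduced in a few lines. This is only a sketch, written for Python 3 (where the encoding function lives in urllib.parse rather than Python 2's urllib.quote); the function name trend_url and the constant BASE are made up for illustration, and the period string is appended unencoded simply because that is how the URLs above appear.

```python
from urllib.parse import quote

BASE = "http://index.so.com/#trend?q="


def trend_url(keyword, period=None):
    """Build a trend URL; period is '7', '30', or a 'YYYYMM|YYYYMM' range."""
    # quote() percent-encodes the UTF-8 bytes of the keyword.
    url = BASE + quote(keyword)
    if period is not None:
        # Appended verbatim, matching the three URL forms listed above.
        url += "&t=" + period
    return url


if __name__ == "__main__":
    for period in (None, "7", "30", "201603|201605"):
        print(trend_url("欢乐颂", period))
```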

With this information, all that remains is to locate the relevant data nodes in the HTML, and then we know how to fetch the basic data.

#coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import urllib
from selenium import webdriver

class Qh():
    def pc(self, name, date='7'):
        # date defaults to 7 days; '30' means thirty days,
        # '201603|201605' is a custom range of months
        # urllib.quote() URL-encodes the Chinese keyword
        url_name = urllib.quote(name)
        url = 'http://index.so.com/#trend?q=' + url_name + '&t=' + date
        # webdriver.PhantomJS() launches the PhantomJS browser
        driver = webdriver.PhantomJS()
        driver.get(url)
        # search index, period-over-period change, year-over-year change
        # (all nationwide figures)
        sszs = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[1]').text
        sszshb = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[2]').text
        sszstb = driver.find_element_by_xpath('//*[@id="bd_overview"]/div[2]/table/tbody/tr/td[3]').text
        # quit() shuts the browser down (it must be called, with parentheses)
        driver.quit()
        return sszs + '|' + sszshb + '|' + sszstb


s = Qh()
print s.pc('欢乐颂')
print s.pc('欢乐颂', '30')
print s.pc('欢乐颂', '201603|201605')

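Result:

1,392,286|36.28%|>1000%
657,310|>1000%|>1000%
657,310|>1000%|>1000%

(One interesting point: on the website, the figures shown for 201603|201605 are the same as the 7-day figures, whereas the figures I scraped for 201603|201605 match the 30-day ones. Logically they should indeed match the 30-day data, but for some reason the web page shows values matching the 7-day data.)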
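
If the pipe-separated strings in the results above need to be processed further, they can be split back into fields. The following is a rough Python 3 sketch; the function name parse_overview and the dictionary keys are invented here. The two ratio fields are left as text because values such as ">1000%" are not plain numbers, and the index conversion assumes the page always shows a comma-grouped integer.

```python
def parse_overview(result):
    """Split 'index|period-over-period|year-over-year' into a dict."""
    index_text, chain_ratio, year_ratio = result.split("|")
    return {
        # "1,392,286" -> 1392286; raises ValueError if the text is not numeric.
        "index": int(index_text.replace(",", "")),
        # Kept as strings because values like ">1000%" are open-ended.
        "chain_ratio": chain_ratio,
        "year_ratio": year_ratio,
    }


if __name__ == "__main__":
    print(parse_overview("1,392,286|36.28%|>1000%"))
```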

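The above is the simplest version. Looking at the 360 Index page, you will notice that the data can be filtered by region, with nationwide as the default. The program above only scrapes the nationwide data; to get per-region data, we need to analyze the page further.

Open Firebug in Firefox, click '更改' (Change) on the page and select '浙江' (Zhejiang). You will see that this triggers an AJAX request. Under XHR in Firebug there is a GET request; copying its URL into the browser returns a piece of JSON, which is the data the AJAX call returns.

URL: http://index.so.com/index.php?a=overviewJson&q=欢乐颂&area=浙江

(The actual URL is http://index.so.com/index.php?a=overviewJson&q=%E6%AC%A2%E4%B9%90%E9%A2%82&area=%E6%B5%99%E6%B1%9F )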
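
Because this endpoint takes ordinary query parameters, the encoded URL can also be assembled with urlencode instead of quoting each value by hand. A minimal Python 3 sketch, assuming only the three parameters visible in the URL above (a, q, area); the function name overview_url is hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "http://index.so.com/index.php"


def overview_url(keyword, area="全国"):
    """Build the overviewJson URL for a keyword and a region."""
    # A list of pairs keeps the parameters in the same order as the URL above.
    params = [("a", "overviewJson"), ("q", keyword), ("area", area)]
    # urlencode() percent-encodes non-ASCII values as UTF-8.
    return ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(overview_url("欢乐颂", "浙江"))
```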

The returned JSON data:

{"status":0,"data":[{"query":"\u6b22\u4e50\u9882","data":{"week_year_ratio":">1000%","month_year_ratio":">1000%","week_chain_ratio":"31.52%","month_chain_ratio":">1000%","week_index":97521,"month_index":47646}}],"msg":false}

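Notice that it contains only the 7-day and 30-day values for the search index, its period-over-period change, and its year-over-year change.

Code: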

#coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import urllib
from selenium import webdriver

class Qh():
    def pc(self, name, dq='全国'):
        # dq is the region; it defaults to '全国' (nationwide)
        url_name = urllib.quote(name)
        dq_name = urllib.quote(dq)
        url = 'http://index.so.com/index.php?a=overviewJson&q=' + url_name + '&area=' + dq_name
        driver = webdriver.PhantomJS()
        driver.get(url)
        # the browser wraps the raw JSON response in a <pre> element
        json_text = driver.find_element_by_xpath('/html/body/pre').text
        driver.quit()
        return json_text

s = Qh()
print s.pc('欢乐颂')
print s.pc('欢乐颂', '浙江')

Result:

{"status":0,"data":[{"query":"\u6b22\u4e50\u9882","data":{"week_year_ratio":">1000%","month_year_ratio":">1000%","week_chain_ratio":"36.28%","month_chain_ratio":">1000%","week_index":1392286,"month_index":657310}}],"msg":false}
{"status":0,"data":[{"query":"\u6b22\u4e50\u9882","data":{"week_year_ratio":">1000%","month_year_ratio":">1000%","week_chain_ratio":"31.52%","month_chain_ratio":">1000%","week_index":97521,"month_index":47646}}],"msg":false}

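week_year_ratio: year-over-year change of the 7-day search index

month_year_ratio: year-over-year change of the 30-day search index

week_chain_ratio: period-over-period change of the 7-day search index

month_chain_ratio: period-over-period change of the 30-day search index

week_index: 7-day search index

month_index: 30-day search index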
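
The field meanings listed above can be applied directly when reading a response. The sketch below (Python 3, standard library only) uses the Zhejiang response shown earlier as sample input; the names SAMPLE, LABELS and label_overview are illustrative and not part of any API.

```python
import json

# The Zhejiang response from above, kept as a raw string so the \u escapes
# reach the JSON parser unchanged.
SAMPLE = r'{"status":0,"data":[{"query":"\u6b22\u4e50\u9882","data":{"week_year_ratio":">1000%","month_year_ratio":">1000%","week_chain_ratio":"31.52%","month_chain_ratio":">1000%","week_index":97521,"month_index":47646}}],"msg":false}'

LABELS = {
    "week_index": "7-day search index",
    "month_index": "30-day search index",
    "week_chain_ratio": "7-day period-over-period change",
    "month_chain_ratio": "30-day period-over-period change",
    "week_year_ratio": "7-day year-over-year change",
    "month_year_ratio": "30-day year-over-year change",
}


def label_overview(raw):
    """Return (keyword, {readable label: value}) for the first result."""
    entry = json.loads(raw)["data"][0]
    # Unknown keys fall back to their original name instead of failing.
    labelled = {LABELS.get(key, key): value for key, value in entry["data"].items()}
    return entry["query"], labelled


if __name__ == "__main__":
    keyword, values = label_overview(SAMPLE)
    print(keyword)
    for label, value in values.items():
        print(label, value)
```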

To crawl a large batch of keywords, you can put them in a configuration file and loop over it to fetch the data.
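
One possible shape for such a loop, as a Python 3 sketch: it assumes a plain text file with one keyword per line (blank lines and lines starting with # are skipped), and it takes the fetch function as a parameter so that any fetcher, such as the pc method above, can be plugged in. The file format and all names here are assumptions, not something the site or the original code prescribes.

```python
def parse_keywords(lines):
    """Return the keywords from an iterable of lines, skipping blanks and # comments."""
    keywords = []
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            keywords.append(text)
    return keywords


def load_keywords(path):
    """Read keywords from a UTF-8 text file, one per line."""
    with open(path, encoding="utf-8") as handle:
        return parse_keywords(handle)


def crawl_all(keywords, fetch):
    """Call fetch(keyword) for each keyword and collect results by keyword."""
    results = {}
    for keyword in keywords:
        try:
            results[keyword] = fetch(keyword)
        except Exception as exc:
            # Record the failure and keep going with the remaining keywords.
            results[keyword] = "ERROR: %s" % exc
    return results
```

With the class above, usage might look like crawl_all(load_keywords('keywords.txt'), Qh().pc), where keywords.txt is a hypothetical file name.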

If you want only specific values rather than the raw JSON, parse the JSON and then store the values in a database or a file.
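
As one example of the file option, the response can be flattened into a CSV row. This is a Python 3 sketch using only the standard library; the column order, the extra area column, and the names FIELDS, to_row and write_rows are choices made for this illustration. The sample input is the Zhejiang response shown earlier.

```python
import csv
import io
import json

SAMPLE = r'{"status":0,"data":[{"query":"\u6b22\u4e50\u9882","data":{"week_year_ratio":">1000%","month_year_ratio":">1000%","week_chain_ratio":"31.52%","month_chain_ratio":">1000%","week_index":97521,"month_index":47646}}],"msg":false}'

FIELDS = ["query", "area", "week_index", "month_index",
          "week_chain_ratio", "month_chain_ratio",
          "week_year_ratio", "month_year_ratio"]


def to_row(raw, area):
    """Flatten the first result of a response into one dict per keyword and region."""
    entry = json.loads(raw)["data"][0]
    row = {"query": entry["query"], "area": area}
    row.update(entry["data"])
    return row


def write_rows(rows, handle):
    """Write rows as CSV with a header; fields not in FIELDS are ignored."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # An in-memory buffer stands in for a real file opened with newline="".
    buffer = io.StringIO()
    write_rows([to_row(SAMPLE, "浙江")], buffer)
    print(buffer.getvalue())
```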

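As for the trend chart and the attention level (关注度), I do not yet know how to extract that data. For the demand map (需求图谱) and the audience profile (人群特征), I took a look and the data can be obtained from the HTML.

Going from analyzing the data to working code usually takes about a day, and polishing the code probably takes another two or three days. Testing is harder to estimate: the code may hide bugs you overlooked, and testing is, after all, the most time-consuming part.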


Original article: http://blog.csdn.net/u013055678/article/details/51347837
