A Python Script for Downloading Articles from 《物理》

Date: 2015-02-13 00:05:14

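I'm hopeless at physics, but I still enjoy reading about it now and then for my own edification. One good source is 《物理》 (Physics), whose articles can be downloaded for free. Downloading them one by one is tedious, though, and each file is saved under the UTF-8 encoding of the article title, so it has to be renamed afterwards. The download URL is not written directly into the page; it is generated when you click the download link, so tools such as DownThemAll and Xunlei (迅雷) are of no use. So I wrote a download script myself.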
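The post does not say exactly how the title ends up encoded in the saved file name. Assuming it is percent-encoded UTF-8 (a guess, not something confirmed by the site), a minimal sketch of turning such a name back into readable text could look like this. The helper name `decode_saved_name` and the sample file name are hypothetical.

```python
from urllib.parse import quote, unquote


def decode_saved_name(saved_name):
    """Turn a percent-encoded UTF-8 file name back into readable text."""
    return unquote(saved_name, encoding="utf-8")


# Hypothetical example: what a file titled 物理.pdf might be saved as,
# assuming the browser percent-encodes the UTF-8 bytes of the title.
encoded_name = quote("物理.pdf")
print(encoded_name)                     # %E7%89%A9%E7%90%86.pdf
print(decode_saved_name(encoded_name))  # 物理.pdf
```

The script in this post avoids the renaming step altogether by taking the file name from the article title on the page.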

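Judging from the page source, the download URL is generated from the file type (presumably always pdf) and the file id. The page sends these with POST, whereas I use GET; I'm not yet clear on the difference between the two, and I also plan to learn some jQuery.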
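As a rough illustration of the GET/POST difference, the sketch below builds both kinds of request with the standard library, without sending anything. With GET the parameters travel in the URL's query string; with POST they travel in the request body. The endpoint and the parameter names `attachType` and `id` are the ones the script in this post uses; the function names and the id value `12345` are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

DOWNLOAD_URL = "http://www.wuli.ac.cn/CN/article/downloadArticleFile.do"


def build_get_request(file_type, file_id):
    # GET: the parameters are appended to the URL as a query string.
    query = urlencode({"attachType": file_type, "id": file_id})
    return Request(DOWNLOAD_URL + "?" + query)


def build_post_request(file_type, file_id):
    # POST: the same parameters are sent in the request body instead.
    body = urlencode({"attachType": file_type, "id": file_id}).encode("ascii")
    return Request(DOWNLOAD_URL, data=body)


get_request = build_get_request("PDF", "12345")    # hypothetical id
post_request = build_post_request("PDF", "12345")  # hypothetical id

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Whether a server accepts both methods for the same endpoint depends on how it is written; here GET happened to work.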

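I originally hoped to download only the articles I'm interested in. Each article on the page has a checkbox, and ticking it highlights the article; honestly, I don't know what the site uses this for. Perhaps I could tick the articles I want and then download only those. Ticking a box changes the element's class from noselectrow to selectedrow. The relevant code is: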

function hightlightrowaction(rowid) {
    var thisrow = $("#"+rowid);
    if ($(thisrow).hasClass("selectedrow")) {
        $(thisrow).removeClass("selectedrow");
        $(thisrow).addClass("noselectrow");
    } else {
        $(thisrow).addClass("selectedrow");
        $(thisrow).removeClass("noselectrow");
    }
}

Sometimes, however, the class doesn't change after ticking the box. Something seems off, and I haven't figured it out yet.
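One thing to keep in mind: the function above only changes the class inside the browser's DOM, so a page fetched fresh by a script would presumably show every row as noselectrow. The sketch below therefore assumes the HTML was saved from the browser after ticking the boxes. It uses only the standard library to collect the ids of elements whose class is selectedrow; the sample markup, the row ids, and the names `SelectedRowCollector` and `find_selected_rows` are hypothetical.

```python
from html.parser import HTMLParser


class SelectedRowCollector(HTMLParser):
    """Collect the id of every element whose class list contains 'selectedrow'."""

    def __init__(self):
        super().__init__()
        self.selected_ids = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A valueless attribute is reported as None, hence the "or ''".
        classes = (attributes.get("class") or "").split()
        if "selectedrow" in classes:
            self.selected_ids.append(attributes.get("id"))


def find_selected_rows(html):
    collector = SelectedRowCollector()
    collector.feed(html)
    return collector.selected_ids


# Hypothetical markup; the real page structure may differ.
SAMPLE_HTML = """
<table>
  <tr id="row1" class="noselectrow"><td>Article A</td></tr>
  <tr id="row2" class="selectedrow"><td>Article B</td></tr>
</table>
"""

print(find_selected_rows(SAMPLE_HTML))  # ['row2']
```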

The Python script follows; it uses BeautifulSoup and requests. The regular expressions are pretty sloppy.

# -*- coding: utf-8 -*-
"""
This script automatically downloads files from 《物理》 (http://www.wuli.ac.cn/CN/volumn/home.shtml).
Example usage:

downloadFiles('f:\\物理\\', "http://www.wuli.ac.cn/CN/volumn/volumn_1696.shtml")
"""
import os
import re
import urllib.request

import requests
from bs4 import BeautifulSoup


def hasDownloadLink(tag):
    # the download anchor is the tag whose onclick handler calls showArticleFile
    return tag.has_attr('onclick') and tag['onclick'].startswith('showArticleFile')


def getFileTypeAndID(fileInfo):
    """
    :param fileInfo: value of the onclick attribute of the download anchor
    :return: file type (usually pdf) and file ID
    """
    m = re.match(r'[^,]*,\s*[\'\"](.*)[\'\"][^,]*,\s*([^\)]*).*', fileInfo)
    return m.groups()[0], m.groups()[1]


def getPublicationYearMonth(tag):
    """
    :param tag: tag whose text contains the volume and publication date
    :return: publication year and month in the form YYYY-MM
    """
    return re.match(r'.*(\d{4}-\d{2}).*', tag.get_text()).groups()[0]


def modifyFileName(fname):
    # get rid of characters that Windows does not allow in file names
    for inValidChar in r'\/:*?"<>|':
        fname = fname.replace(inValidChar, '')
    return fname


def writeLog(saveDirectory, errMsg):
    # write one failure message per line
    with open(saveDirectory + "download log.txt", 'w', encoding='utf-8') as fhandle:
        for msg in errMsg:
            fhandle.write(msg + "\n")


def downloadFiles(saveDirectory, url, onlyDownloadSeleted=False):
    """
    :param saveDirectory: directory to store the downloaded files
    :param url: url of the download page
    :param onlyDownloadSeleted: not implemented yet. Ideally, it should allow one to download only the articles of interest instead of all files.
    :return: None
    """
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    volumeAndDateTag = soup.find(class_="STYLE5")
    yearMonth = getPublicationYearMonth(volumeAndDateTag)
    year = yearMonth[:4]
    relativePath = year + "\\" + yearMonth + "\\"
    absolutePath = saveDirectory + relativePath
    if not os.path.exists(absolutePath):
        os.makedirs(absolutePath)
    articleMark = "selectedrow" if onlyDownloadSeleted else "noselectrow"
    articles = soup.find_all(class_=articleMark)
    errMsg = []
    for index, article in enumerate(articles, 1):
        print('Downloading the %d th file, %d left.' % (index, len(articles) - index))
        # the title of an article is contained in its first anchor
        title = article.find('a').get_text()
        title = modifyFileName(title)
        try:
            downloadAnchor = article.find(hasDownloadLink)
            fileInfo = downloadAnchor['onclick']
            fileType, fileID = getFileTypeAndID(fileInfo)
            fileName = title + '.' + fileType.lower()
            filePath = absolutePath + fileName
            param = {"attachType": fileType, "id": fileID}
            # skip files that have already been downloaded
            if not os.path.exists(filePath):
                articleFile = requests.get("http://www.wuli.ac.cn/CN/article/downloadArticleFile.do", params=param)
                with open(filePath, "wb") as fhandle:
                    fhandle.write(articleFile.content)
        except Exception:
            errMsg.append(title + " download failed")

    if len(errMsg) > 0:
        writeLog(absolutePath, errMsg)


if __name__ == "__main__":
    downloadFiles('f:\\物理\\', "http://www.wuli.ac.cn/CN/volumn/volumn_921.shtml")
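Since the regular expression in getFileTypeAndID is admittedly loose, here is one possible tighter version, offered only as a sketch. It assumes the onclick value is a showArticleFile call with three arguments, where the second is the quoted file type and the third is a numeric id; the post does not show the actual attribute, so the sample strings below are hypothetical, as are the names `ONCLICK_PATTERN` and `parse_onclick`.

```python
import re

# Assumed shape: showArticleFile(<anything>, '<type>', <numeric id>)
ONCLICK_PATTERN = re.compile(
    r"""showArticleFile\(      # function name and opening parenthesis
        [^,]*,\s*              # first argument, ignored
        ['"](?P<type>\w+)['"]  # quoted file type, e.g. PDF
        \s*,\s*
        ['"]?(?P<id>\d+)['"]?  # numeric id, quoted or not
        \s*\)""",
    re.VERBOSE,
)


def parse_onclick(onclick):
    """Return (file type, file id) or None if the text does not match."""
    match = ONCLICK_PATTERN.search(onclick)
    if match is None:
        return None
    return match.group("type"), match.group("id")


# Hypothetical onclick values; the real attribute on the site may differ.
print(parse_onclick("showArticleFile(this, 'PDF', 12345)"))  # ('PDF', '12345')
print(parse_onclick("alert('hello')"))                       # None
```

Returning None on a mismatch lets the caller log the failure explicitly instead of relying on an exception from calling groups() on a failed match.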
Original post: http://www.cnblogs.com/demoZ/p/4289324.html