
[R] Web scraping example

Posted: 2019-09-02 23:40:13


Scraping a Douban photo album

library(RCurl)
library(XML)



# Request headers sent with every page fetch so the requests look like a browser's
myHttpheader <- c("User-Agent"="Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.6) ",
                  "Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                  "Accept-Language"="en-us",
                  "Connection"="keep-alive",
                  "Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7")


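# Page offsets for the album, 18 photos per page.
# The start parameter is taken to be a zero-based offset, so the first page is 0;
# starting at 1 would skip the first photo and repeat one from the next page.
ye<-seq(0,630,18)
# Accumulates the URL of every individual photo page
info<-NULL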


# Collect the link to each photo page from every album page
for(i in ye){
  url<-paste("https://www.douban.com/photos/album/50903114/?start=",i,sep="")
  web<-getURL(url,httpheader=myHttpheader)
  doc<- htmlTreeParse(web,encoding="UTF-8", error=function(...){}, useInternalNodes = TRUE,trim=TRUE)
  node<-getNodeSet(doc, "//div[@class='photo_wrap']/a")
  # unlist() keeps info a character vector even when a page yields no matches
  info<-c(info,unlist(sapply(node,xmlGetAttr,"href")))
}


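# Counter used to name the downloaded files 1.jpg, 2.jpg, ...
x<-1
# showWarnings=FALSE avoids a warning when the directory already exists
dir.create("./image1/",showWarnings=FALSE)
for(urlweb in info){
  # Photo page -> link inside the 'photo-edit' block -> page holding the full-size image
  web1<-getURL(urlweb,httpheader=myHttpheader)
  doc1<- htmlTreeParse(web1,encoding="UTF-8", error=function(...){}, useInternalNodes = TRUE,trim=TRUE)
  node1<-getNodeSet(doc1, "//div[@class='photo-edit']/a")
  info1<-unlist(sapply(node1,xmlGetAttr,"href"))
  # Skip this photo if the expected link is missing
  if(length(info1)==0) next
  web2<-getURL(info1[1],httpheader=myHttpheader)
  doc2<- htmlTreeParse(web2,encoding="UTF-8", error=function(...){}, useInternalNodes = TRUE,trim=TRUE)
  node2<-getNodeSet(doc2, "//td[@id='pic-viewer']/a/img")
  info2<-unlist(sapply(node2,xmlGetAttr,"src"))
  # paste0() avoids the spaces that paste() would insert into the file name
  y<-paste0("./image1/",x,".jpg")
  tryCatch({
    # mode="wb" writes the image as binary; the counter advances only on success
    download.file(info2[1],y,mode="wb")
    x<-x+1},error=function(e){
      cat("ERROR:",conditionMessage(e),"\n")
      print("loser")})
}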
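
The two places where the script builds strings (the album page URLs and the image file names) can be checked without touching the network. The following is a minimal sketch, assuming the album pages take a zero-based start offset in steps of 18; build_page_urls() and image_path() are hypothetical helper names, not part of the original script.

```r
# Sketch only: build_page_urls() and image_path() are hypothetical helpers
# introduced for illustration; they are not part of the original script.

# Album pages are ASSUMED to be addressed by a zero-based `start` offset
# that advances by `per_page` photos each time.
build_page_urls <- function(album_url, last_offset, per_page = 18) {
  offsets <- seq(0, last_offset, by = per_page)
  paste0(album_url, "?start=", offsets)
}

# paste() separates its arguments with a space by default, so
# paste("./image1/", 1, ".jpg") yields "./image1/ 1 .jpg".
# file.path() combined with paste0() avoids the stray spaces.
image_path <- function(dir, index) {
  file.path(dir, paste0(index, ".jpg"))
}

urls <- build_page_urls("https://www.douban.com/photos/album/50903114/", 630)
print(length(urls))             # number of album pages to fetch
print(head(urls, 2))            # first two page URLs
print(image_path("image1", 1))  # target path for the first image
```

Keeping the string building in small functions like these makes it easy to inspect the generated URLs and paths before any request is sent.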


Original source: https://www.cnblogs.com/jessepeng/p/11450425.html
