
Tutorial: Writing a Douban Movie Crawler

Date: 2019-07-24 00:29:19

Tags: head, tutorial, learning, com, src, chrome, end, tostring, ring

import requests
from lxml import etree

# Send browser-like headers; the site may reject requests that lack them.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    "Referer": "https://movie.douban.com/",
}
url = "https://movie.douban.com/cinema/nowplaying/shijiazhuang/"
response = requests.get(url, headers=headers)
text = response.text

# Parse the page and take the first <ul class="lists"> element.
html = etree.HTML(text)
ul = html.xpath("//ul[@class='lists']")[0]
# print(etree.tostring(ul, encoding='utf-8').decode("utf-8"))
lis = ul.xpath("./li")
movies = []
for li in lis:
    # print(etree.tostring(li, encoding='utf-8').decode("utf-8"))
    # Movie details are stored in data-* attributes on each <li>.
    title = li.xpath("@data-title")[0]
    score = li.xpath("@data-score")[0]
    duration = li.xpath("@data-duration")[0]
    region = li.xpath("@data-region")[0]
    director = li.xpath("@data-director")[0]
    actors = li.xpath("@data-actors")[0]
    thumbnail = li.xpath(".//img/@src")[0]
    movie = {
        "title": title,
        "score": score,
        "duration": duration,
        "region": region,
        "director": director,
        "actors": actors,
        "thumbnail": thumbnail,
    }
    movies.append(movie)

print(movies)
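
The loop above indexes every XPath result with [0], which raises IndexError as soon as one <li> lacks an attribute. Below is a minimal sketch of a more defensive variant. The names first_or_default, extract_movie, and FIELDS are hypothetical helpers that do not appear in the original code, and falling back to an empty string is an assumption rather than something the original specifies.

```python
# Hypothetical helpers (not part of the original script).

# Attribute suffixes read from each <li>, e.g. "title" -> @data-title.
FIELDS = ("title", "score", "duration", "region", "director", "actors")


def first_or_default(results, default=""):
    """Return the first item of an XPath result list, or `default` if it is empty."""
    return results[0] if results else default


def extract_movie(li):
    """Build one movie dict from an element that exposes an .xpath() method."""
    movie = {name: first_or_default(li.xpath(f"@data-{name}")) for name in FIELDS}
    # The thumbnail comes from a nested <img>, not from a data-* attribute.
    movie["thumbnail"] = first_or_default(li.xpath(".//img/@src"))
    return movie
```

With these helpers, the loop body could probably be reduced to movies.append(extract_movie(li)), so that an item with missing attributes yields empty strings instead of stopping the whole crawl.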

The code above is provided for reference and study only!


Original article: https://www.cnblogs.com/secsafe/p/11235126.html
