Web Crawler Final Project

Posted: 2018-04-29 17:49:33


Scrape the recommended book list from Douban Books (book.douban.com), including each book's description and link:

  

from bs4 import BeautifulSoup
import requests
import jieba
import time
import datetime

r = requests.get('https://book.douban.com')
lyrics = ''
html = r.text

soup = BeautifulSoup(html, 'html.parser')

items = []
global_nav_items = soup.find('div', class_='global-nav-items')

for tag in global_nav_items.find_all('a'):
    items.append(tag.string)

print(items)
# Define a data storage class for one book
class Info(object):
    def __init__(self, title, img, link, author, year, publisher, abstract):
        self.title = title
        self.img = img
        self.link = link
        self.author = author
        self.year = year
        self.publisher = publisher
        self.abstract = abstract

new_book_html = soup.find('ul', class_='list-col list-col5 list-express slide-item')

book_info_list = []

for tag in new_book_html.find_all('li'):
    info_html = tag.find('div', class_='info')
    info_title = info_html.find('a')
    title = info_title.string.strip()
    cover = tag.find('div', class_='cover')
    img = cover.find('img')['src'].strip()
    href = info_title['href'].strip()
    author = info_html.find(class_='author').string.strip()
    year = info_html.find(class_='year').string.strip()
    publisher = info_html.find(class_='publisher').string.strip()
    abstract = info_html.find(class_='abstract').string.strip()
    book = Info(title, img, href, author, year, publisher, abstract)
    book_info_list.append(book)

print('%s new books recommended' % len(book_info_list))
for book in book_info_list:
    print('*' * 100)
    print(book.title)
    print(book.img)
    print(book.link)
    print(book.author)
    print(book.year)
    print(book.publisher)
    print(book.abstract)
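
The Info class above only stores fields. As a minimal sketch (not the author's code), the same container could be written with the standard-library `dataclasses` module, which generates `__init__` and `__repr__` from the field list, so a parameter name and the attribute it fills cannot drift apart. `BookInfo` and all sample values below are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class BookInfo:
    """Hypothetical dataclass variant of the Info class."""
    title: str
    img: str
    link: str
    author: str
    year: str
    publisher: str
    abstract: str


# Made-up sample values, used only to show how the class behaves.
sample = BookInfo(
    title="Sample Title",
    img="https://example.com/cover.jpg",
    link="https://example.com/book/1",
    author="Sample Author",
    year="2018-4",
    publisher="Sample Press",
    abstract="A short description.",
)

# Print every field without listing the attributes by hand.
for f in fields(sample):
    print(f"{f.name}: {getattr(sample, f.name)}")
```

Iterating over `fields()` replaces the seven separate `print` calls, and adding a field later only requires one new line in the class.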

  

Save the scraped data to a Markdown file:

def save():
    # Name the output file after today's date.
    today = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d')
    file_name = 'Douban ' + today + ' recommended books'
    # Write as UTF-8 so Chinese text does not depend on the platform's default encoding.
    with open(file_name + '.md', 'w', encoding='utf-8') as file:
        file.write('# ' + file_name)
        file.write('\n---')
        for num, book in enumerate(book_info_list, start=1):
            file.write('\n\n')
            file.write('## ' + str(num) + '. ' + book.title)
            file.write('\n')
            file.write('![' + book.title + ' cover img](' + book.img + ')')
            file.write('\n\n')
            file.write('Description\n')
            file.write('---\n')
            file.write(book.abstract)
            file.write('\n\n')
            file.write('Author:    ' + book.author + '\n\n')
            file.write('Published: ' + book.year + '\n\n')
            file.write('Publisher: ' + book.publisher + '\n\n')
            file.write('[More...](' + book.link + ')')

if __name__ == '__main__':
    save()
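
One way to make the export step easier to check is to separate building the Markdown text from writing it to disk. The following is a minimal sketch under that assumption, not the author's implementation: `render_markdown`, the dict keys, the English labels, and the sample record are all hypothetical. It also passes `encoding="utf-8"` explicitly, so Chinese titles do not depend on the platform's default encoding.

```python
import os
import tempfile


def render_markdown(heading, books):
    """Build the whole Markdown document as one string.

    `books` is a list of dicts keyed like the fields of the Info class.
    """
    lines = ["# " + heading, "---"]
    for num, book in enumerate(books, start=1):
        lines += [
            "",
            "## %d. %s" % (num, book["title"]),
            "![%s cover img](%s)" % (book["title"], book["img"]),
            "",
            "Description",
            "---",
            book["abstract"],
            "",
            "Author: " + book["author"],
            "",
            "Published: " + book["year"],
            "",
            "Publisher: " + book["publisher"],
            "",
            "[More...](%s)" % book["link"],
        ]
    return "\n".join(lines) + "\n"


# Made-up sample record; a real run would use the scraped book list.
sample_books = [{
    "title": "示例书名",
    "img": "https://example.com/cover.jpg",
    "link": "https://example.com/book/1",
    "author": "Sample Author",
    "year": "2018-4",
    "publisher": "Sample Press",
    "abstract": "A short description.",
}]
text = render_markdown("Douban recommended books", sample_books)

# Write to a temporary directory with an explicit encoding.
path = os.path.join(tempfile.mkdtemp(), "books.md")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

print(text)
```

Because `render_markdown` returns a plain string, the layout can be inspected or tested without touching the file system.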

  

Screenshots:

[screenshot]

[screenshot]

Word cloud screenshot:

[screenshot]

 

 

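Problems encountered:

1. The wordcloud package could not be installed on my computer, so I copied the code into an online word cloud generator to produce the word cloud.

2. The data could not be saved directly to a plain text file, so I stored it in a Markdown file instead.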
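
When wordcloud is not available, one low-dependency option is to compute the word frequencies directly and hand the resulting list to whichever tool draws the cloud. The sketch below is an assumption about how that could look, not what the author did: `tokenize`, `top_words`, and the sample abstracts are hypothetical, and jieba is used only if it happens to be installed.

```python
import re
from collections import Counter

try:
    import jieba  # third-party segmenter; optional in this sketch
except ImportError:
    jieba = None


def tokenize(text):
    """Split text into words, using jieba when it is installed."""
    if jieba is not None:
        return [w for w in jieba.lcut(text) if w.strip()]
    # Fallback (an assumption for this sketch): runs of word characters.
    return re.findall(r"\w+", text)


def top_words(tokens, n=10, min_len=2):
    """Return the n most frequent tokens with at least min_len characters."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)


# Made-up abstracts; a real run would use the scraped book descriptions.
abstracts = [
    "a novel about memory and time",
    "a history of time and space",
]
tokens = [w for text in abstracts for w in tokenize(text)]
for word, count in top_words(tokens, n=5):
    print(word, count)
```

The `min_len` filter drops single-character tokens, which tend to be particles or punctuation in segmented Chinese text.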

 


Original post: https://www.cnblogs.com/whr7116365/p/8971346.html
