Web Scraper: Downloading Podcast Transcripts

Date: 2019-07-16 08:52:43


Downloading podcast transcripts

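#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# get_transcript.py

"""
A program that automatically downloads transcripts from https://podcast.duolingo.com/spanish
"""

# r.encoding      the text encoding of a requests response
# r.status_code   the HTTP status code of a requests response
#     200 success
#     4xx client error -> 404 Not Found
#     5xx server error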

import requests
import re
import os

main = 'https://podcast.duolingo.com/spanish'  # main page
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
}

for i in range(1, 10):  # walk through the listing pages (at most 9)
    if i == 1:  # the first page is the main page itself
        page = main
    else:  # 'https://podcast.duolingo.com/spanish2' and so on
        page = main + str(i)
    r = requests.get(page, headers=headers)
    print('{page} with status code {status}.'.format(page=page, status=r.status_code))

    if r.status_code == 404:  # no more pages, so stop
        print('404 Page Not Found!')
        break

    # collect every episode link on the page; non-greedy so one match cannot swallow several links
    hrefs = re.findall(r'entry-title">\s*<a href="(.*?)" rel', r.text)

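    for h in hrefs:
        title = h[2:]  # drop the first two characters of the href
        episode = main[:-7] + title  # episode URL: main without the trailing 'spanish', plus the title
        filename = 'transcript/' + title + '.txt'
        if os.path.exists(filename):  # skip episodes that were already downloaded
            print(filename, 'already exists!')
            continue
        req = requests.get(episode, headers=headers)
        print('{episode} with status code {status}.'.format(episode=episode, status=req.status_code))
        os.makedirs('transcript', exist_ok=True)  # create the output directory if needed
        with open(filename, 'w', encoding='utf-8') as fp:  # explicit UTF-8 for the Spanish text
            # each match is a (bold text, rest of paragraph) pair
            for lines in re.findall(r'strong>(.*?)</strong>(.*?)</p>', req.text):
                for line in lines:
                    fp.write(line)
                fp.write('\n\n')
            print(filename, 'added!')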
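The URL building and the two regular expressions can be tried without touching the network. Below is a minimal sketch under a few assumptions: the helper names (page_url, episode_links, transcript_lines) are made up for this example, and the two HTML fragments are invented samples shaped to fit the patterns in the script, so the real page markup may well differ.

```python
import re

MAIN = 'https://podcast.duolingo.com/spanish'  # listing URL used by the script

# Hypothetical HTML fragments, invented for this sketch; the real markup may differ.
SAMPLE_LISTING = (
    '<h2 class="entry-title"> <a href="./episode-1-example" rel="bookmark">One</a></h2>'
    '<h2 class="entry-title"> <a href="./episode-2-example" rel="bookmark">Two</a></h2>'
)
SAMPLE_EPISODE = (
    '<p><strong>Speaker A:</strong> Hola.</p>'
    '<p><strong>Speaker B:</strong> Buenas tardes.</p>'
)


def page_url(i):
    """Page 1 is the main page; page n (n > 1) is the main URL with n appended."""
    return MAIN if i == 1 else MAIN + str(i)


def episode_links(html):
    """Return every href that follows an entry-title marker.

    The non-greedy (.*?) stops at the first '" rel', so two links that sit
    on the same line still produce two separate matches.
    """
    return re.findall(r'entry-title">\s*<a href="(.*?)" rel', html)


def transcript_lines(html):
    """Join each (bold text, rest of paragraph) pair into one transcript line."""
    pairs = re.findall(r'strong>(.*?)</strong>(.*?)</p>', html)
    return [bold + rest for bold, rest in pairs]


if __name__ == '__main__':
    print(page_url(2))
    print(episode_links(SAMPLE_LISTING))
    print(transcript_lines(SAMPLE_EPISODE))
```

Keeping the parsing in small functions like these makes it easy to check a pattern against a saved copy of a page before running the full download loop.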

Result:

[Screenshot of the program's output]
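One way to check the result after a run is to list what ended up in the transcript directory. This is only a sketch: saved_transcripts is a hypothetical helper, not part of the original script, and it assumes the script was run from the current directory so that the files live in transcript/.

```python
import os


def saved_transcripts(directory='transcript'):
    """Return (filename, size in bytes) pairs for the .txt files in directory, sorted by name."""
    if not os.path.isdir(directory):
        return []  # nothing has been downloaded yet
    return [
        (name, os.path.getsize(os.path.join(directory, name)))
        for name in sorted(os.listdir(directory))
        if name.endswith('.txt')
    ]


if __name__ == '__main__':
    for name, size in saved_transcripts():
        print('{name}: {size} bytes'.format(name=name, size=size))
```

A file of 0 bytes would suggest that the transcript pattern matched nothing on that episode page, which is worth checking by hand.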


Original article: https://www.cnblogs.com/noonjuan/p/11192582.html
