New Exercise

Date: 2018-04-12 13:41:26

Tags: tail format click inf split .text request api print

import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime

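def getClickCount(newsUrl):
    # News URLs end in "_<something>/<id>.html"; the ID is the part after the slash.
    newId = re.search(r'_(.*)\.html', newsUrl).group(1).split('/')[1]
    clickUrl = "http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80".format(newId)
    # The counter response ends with .html('<count>'); keep only the digits.
    return int(requests.get(clickUrl).text.split('.html')[-1].lstrip("('").rstrip("');"))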
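
The string handling in getClickCount can be checked without touching the network. The following is a minimal offline sketch, not the author's code: `extract_news_id`, `parse_click_count`, `SAMPLE_URL` and `SAMPLE_RESPONSE` are hypothetical names, and both sample strings are assumptions about what the site returns (a URL ending in `_<date>/<id>.html` and a run of jQuery-style `.html('N');` calls, which is what the split on '.html' implies). It swaps the lstrip/rstrip chain for a regular expression.

```python
import re

# Assumed examples; neither string was captured from the real server.
SAMPLE_URL = "http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0404/9183.html"
SAMPLE_RESPONSE = "$('#todaydowns').html('5');$('#hits').html('1348');"


def extract_news_id(news_url):
    """Return the ID segment that sits between the last '/' and '.html'."""
    match = re.search(r"_(.*)\.html", news_url)
    if match is None:
        raise ValueError("unexpected news URL: {}".format(news_url))
    return match.group(1).split("/")[1]


def parse_click_count(text):
    """Return the integer inside the last .html('N') call in text."""
    counts = re.findall(r"\.html\('(\d+)'\)", text)
    if not counts:
        raise ValueError("no click count found in response")
    return int(counts[-1])


print(extract_news_id(SAMPLE_URL))         # 9183
print(parse_click_count(SAMPLE_RESPONSE))  # 1348
```

Raising ValueError on an unexpected format makes a layout change on the site show up as a clear error instead of an AttributeError on None.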

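def getNewsDetail(newsUrl):
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    content = soupd.select('#content')[0].text  # article body (not used further here)
    info = soupd.select('.show-info')[0].text
    # Pull the timestamp and author with regexes; str.lstrip() strips a set of
    # characters rather than a prefix, so it is fragile for this job.
    d = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', info).group()
    dt = datetime.strptime(d, '%Y-%m-%d %H:%M:%S')
    au = re.search(r'作者[:：]\s*(\S+)', info).group(1)
    clickCount = getClickCount(newsUrl)
    print(clickCount, newsUrl, dt, au)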
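
Parsing the `.show-info` line is the most fragile step, so it helps to try it on a fixed string first. The sketch below is an offline illustration under stated assumptions: `parse_info` and `SAMPLE_INFO` are hypothetical, and the sample line (with made-up names) only approximates what the page is expected to contain. It also shows why `str.lstrip` is risky here: it removes any leading characters from the given set, not a literal prefix.

```python
import re
from datetime import datetime

# Hypothetical info line; the real .show-info text may differ.
SAMPLE_INFO = "发布时间:2018-04-01 11:57:00 作者:陈某 审核:李某"


def parse_info(info):
    """Return (publish_time, author) parsed from a .show-info string."""
    stamp = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", info)
    # Accept both the ASCII and the full-width colon after the label.
    author = re.search(r"作者[:：]\s*(\S+)", info)
    if stamp is None or author is None:
        raise ValueError("unexpected info format")
    published = datetime.strptime(stamp.group(), "%Y-%m-%d %H:%M:%S")
    return published, author.group(1)


published, author = parse_info(SAMPLE_INFO)
print(published, author)

# lstrip treats its argument as a character set, so a name that starts
# with one of those characters loses it as well.
print("作者:者行".lstrip("作者:"))  # prints 行
```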

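def getNewsList(pageUrl):
    res = requests.get(pageUrl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    for news in soup.select('li'):
        # Only <li> elements that contain a news title are news entries.
        if len(news.select('.news-list-title')) > 0:
            newsUrl = news.select('a')[0].attrs['href']
            getNewsDetail(newsUrl)
            break  # stops after the first news item on the page; remove to process every item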
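
pageUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen'
getNewsList(pageUrl)  # first list page
# Remaining list pages. Assumption: page i is served at <pageUrl>/<i>.html
# and 233 is the last page; adjust both if the site differs.
for i in range(2, 234):
    getNewsList('{}/{}.html'.format(pageUrl, i))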
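
When walking numbered list pages, note that a tuple such as (2, 233) yields just two values, whereas range(2, 234) yields every page number from 2 to 233. The sketch below generates the page URLs without making any requests. `page_urls` is a hypothetical helper, and the `<base>/<n>.html` pattern is an assumption about the site that should be checked in a browser before use.

```python
def page_urls(base_url, last_page):
    """Yield the first list page, then pages 2..last_page."""
    yield base_url
    for n in range(2, last_page + 1):
        # Assumed URL pattern for later pages; verify against the real site.
        yield "{}/{}.html".format(base_url, n)


urls = list(page_urls("http://news.gzcc.cn/html/xiaoyuanxinwen", 4))
for url in urls:
    print(url)

print(list((2, 233)))      # [2, 233]
print(len(range(2, 234)))  # 232
```

Keeping URL generation separate from fetching makes it easy to test the paging logic on a small page count before crawling the whole archive.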


Original source: https://www.cnblogs.com/lg916843/p/8806683.html
