Structuring and Saving Data

Date: 2017-10-19 19:53:49


1. Structuring:

Detail dictionary for a single news item: news
List of all news items on one list page: newsls.append(news)
List of all news items across all list pages: newstotal.extend(newsls)
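
As a minimal sketch of the difference between the two calls above, here is the same pattern with made-up news dictionaries instead of scraped ones (the titles and click counts are placeholders):

```python
# Hypothetical sample data standing in for scraped news items.
page1 = []
page1.append({'title': 'news A', 'click': 10})  # append adds one dict to the page list
page1.append({'title': 'news B', 'click': 25})

page2 = [{'title': 'news C', 'click': 7}]

newstotal = []
newstotal.extend(page1)  # extend adds each dict of the page list individually
newstotal.extend(page2)

print(len(newstotal))  # 3 flat entries, one per news item

nested = []
nested.append(page1)  # append would instead add the whole page list as one element
print(len(nested))    # 1
```

Keeping the overall list flat is what lets it be turned into a table with one row per news item later on.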

2. Convert to the pandas data structure DataFrame

3. Save the DataFrame to Excel

4. Save the DataFrame to a sqlite3 database
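
Steps 2 to 4 can be tried without touching the network. The sketch below assumes a small hand-written list of news dictionaries and an in-memory SQLite database; the table name `news`, the example.com URLs and the sample rows are illustrative and not from the original post. `to_excel` needs an Excel engine such as openpyxl to be installed, so that call is guarded.

```python
import os
import sqlite3
import tempfile

import pandas

# Hypothetical rows shaped like the dictionaries built for each news item.
ns = [
    {'title': 'news A', 'url': 'http://example.com/a.html', 'content': 'text A', 'click': 10},
    {'title': 'news B', 'url': 'http://example.com/b.html', 'content': 'text B', 'click': 25},
]

df = pandas.DataFrame(ns)  # one row per dict, one column per key

# Excel export requires an engine such as openpyxl; skip it if none is installed.
with tempfile.TemporaryDirectory() as tmpdir:
    try:
        df.to_excel(os.path.join(tmpdir, 'news.xlsx'), index=False)
    except ImportError:
        print('no Excel engine installed, skipping to_excel')

# ':memory:' keeps the database in RAM; if_exists='replace' makes reruns safe.
db = sqlite3.connect(':memory:')
df.to_sql('news', con=db, if_exists='replace', index=False)
back = pandas.read_sql_query('SELECT title, click FROM news ORDER BY click DESC', db)
db.close()

print(back)
```

Passing `if_exists='replace'` is a choice made here for repeatable runs; with the default, `'fail'`, a second run against an existing table raises an error.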

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import pandas
import sqlite3
 
 
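def getclick(url):  # given a single news URL, return its click count
    # the text between '_' and '.html' is split on '/'; the second piece is the news id
    newsid = re.search(r'_(.*)\.html', url).group(1).split('/')[1]
    clickurl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsid)
    # the last '.'-separated piece of the response is expected to look like html('<count>');
    click = int(requests.get(clickurl).text.split('.')[-1].lstrip("html('").rstrip("');"))
    return click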
 
def getdetail(url):  # given a single news URL, return a dictionary of news details
    resd = requests.get(url)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    news = {}
    news['title'] = soupd.select('.show-title')[0].text
    news['url'] = url
    info = soupd.select('.show-info')[0].text
    # Optional fields, left disabled as in the original post:
    # news['dt'] = datetime.strptime(info.lstrip('发布时间:')[0:19], '%Y-%m-%d %H:%M:%S')
    # news['source'] = re.search('来源:(.*)点击', info).group(1).strip()
    news['content'] = soupd.select('.show-content')[0].text.strip()
    news['click'] = getclick(url)
    return news
         
def onepage(pageurl):  # given a news list page URL, return a list of detail dictionaries for all news on that page
    res = requests.get(pageurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsls = []
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:  # skip <li> elements that are not news entries
            newsls.append(getdetail(news.select('a')[0]['href']))
    return newsls
 
ns = []
gzccurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/'
ns.extend(onepage(gzccurl))  # the first list page
res = requests.get(gzccurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

total = int(soup.select('.a1')[0].text.rstrip('条'))  # total number of news items
pages = (total + 9) // 10  # number of list pages at 10 items per page, rounded up

for i in range(2, 3):  # only page 2 here; use range(2, pages + 1) to crawl every list page
    listurl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    ns.extend(onepage(listurl))  # each subsequent list page

df = pandas.DataFrame(ns)  # convert to the pandas data structure DataFrame
print(df.head())
df.to_excel('gzccnews.xlsx')  # save the DataFrame to Excel
with sqlite3.connect('gzccnewsdbl.sqlite') as db:  # save the DataFrame to a sqlite3 database
    df.to_sql('gzccnewsdb1', con=db)
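
The string handling in getclick relies on lstrip and rstrip, which remove sets of characters rather than fixed prefixes or suffixes. A regex-based alternative can be sketched and tested offline as follows. The sample URL and response text are assumptions about what the site returns, inferred from the parsing code above, and both helper names are hypothetical.

```python
import re


def news_id_from_url(url):
    """Return the digits just before '.html' at the end of a news URL."""
    match = re.search(r'/(\d+)\.html$', url)
    if match is None:
        raise ValueError('no news id found in {!r}'.format(url))
    return match.group(1)


def clicks_from_response(text):
    """Return the number inside the last .html('<n>') call in the response text."""
    counts = re.findall(r"\.html\('(\d+)'\)", text)
    if not counts:
        raise ValueError('no click count found in {!r}'.format(text))
    return int(counts[-1])


# Assumed shapes, not captured from the live site.
sample_url = 'http://news.gzcc.cn/html/2017/xiaoyuanxinwen_1017/8338.html'
sample_text = "$('#todaydowns').html('5');$('#hits').html('2941');"

print(news_id_from_url(sample_url))       # 8338
print(clicks_from_response(sample_text))  # 2941
```

Raising ValueError on an unexpected format makes a change in the site's markup show up as a clear error instead of a wrong number.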


Original post: http://www.cnblogs.com/zhoujinpeng/p/7694044.html
