A simple Python scraper for the novel "Tian Long Ba Bu" (《天龙八部》), served locally as web pages

Date: 2017-09-27 22:31:16

Tags: www, technology, color, html, sqli, sql, strong, tab, workflow

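A note up front: this is my first time writing a crawler, and it barely even counts as one. My skills are limited, so this is mainly a learning record.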

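The main workflow is as follows:

Fetch the pages with Python's requests module

Extract the needed content (chapter titles and body text) with the re module (regular expressions)

Store the results in the database with the MySQLdb module

Serve the stored chapters with the web.py module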
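
The extraction step can be tried without touching the network. The following is a minimal sketch in Python 3 that applies the two patterns used in this post to a hand-written HTML fragment. The fragment and the names `SAMPLE_INDEX`, `SAMPLE_CHAPTER`, `parse_index` and `parse_chapter` are made up for illustration and are not taken from the real site. The `re.S` flag is an addition here so that a match can span line breaks; the script in this post does not use it.

```python
import re

# Hypothetical HTML, shaped like what the patterns in this post expect.
SAMPLE_INDEX = (
    '<ul>'
    '<li><a href="/tian/1.html">Chapter 1</a></li>'
    '<li><a href="/tian/2.html">Chapter 2</a></li>'
    '</ul>'
)
SAMPLE_CHAPTER = (
    '<div><p>First paragraph.</p><p>Second paragraph.</p>'
    '<script>var x;</script></div>'
)

LINK_RE = re.compile(r'<li><a href="(.*?)">(.*?)</a>')
BODY_RE = re.compile(r'<p>(.*?)</p><script', re.S)


def parse_index(html):
    """Return a list of (href, title) tuples, one per matching list item."""
    return LINK_RE.findall(html)


def parse_chapter(html):
    """Return the chapter body, or None when the page does not match."""
    match = BODY_RE.search(html)
    return match.group(1) if match else None


print(parse_index(SAMPLE_INDEX))
print(parse_chapter(SAMPLE_CHAPTER))
```

The lazy group runs from the first `<p>` to the `</p>` that directly precedes `<script`, so inner tags such as `</p><p>` stay in the captured text. The chapter body is therefore HTML rather than plain text.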

Below is a screenshot of the result. It implements simple paging through "previous page" and "next page" links:

[Image: screenshot of the result]

The directory structure is as follows:

D:\PROJECT\SPIDER
│  fiction_spider.py
│  webapp.py
│
└─template
        index.html

The code that scrapes the content and stores it in the database is as follows:

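# coding: utf-8
# fiction_spider.py
import re

import MySQLdb
import requests


def get_title():
    # Fetch the table of contents; returns a list of (href, chapter title) pairs.
    # The page is decoded as UTF-8, the charset used for the database and the template.
    html = requests.get('http://www.jinyongwang.com/tian/').content.decode('utf-8')
    rem = r'<li><a href="(.*?)">(.*?)</a>'
    return re.findall(rem, html)


def get_content(url):
    # Fetch one chapter and return everything between the first <p> and the
    # </p> that directly precedes a <script> tag (inner tags are kept).
    html = requests.get('http://www.jinyongwang.com/' + url).content.decode('utf-8')
    matchs_p = r'<p>(.*?)</p><script'
    data = re.findall(matchs_p, html)
    return data[0]  # raises IndexError if the page does not match the pattern


if __name__ == '__main__':
    a = MySQLdb.connect(host='10.1.*.*', port=3306, user='user', passwd='passwd', db='testdb', charset='utf8')
    # Let the driver quote the values. Formatting them into the SQL string
    # breaks as soon as a chapter contains a double quote.
    sqli = 'INSERT INTO `fiction` (`title`, `content`) VALUES (%s, %s)'
    for i in get_title():
        cur = a.cursor()
        print(i[1])
        print(i[0])
        cur.execute(sqli, (i[1], get_content(i[0])))
        cur.close()
        a.commit()
    a.close()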
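
The storage step can be rehearsed without a MySQL server. The sketch below uses Python 3's built-in sqlite3 module as a stand-in for MySQLdb, which is a substitution and not what the script in this post does. The table layout (an auto-incrementing `id` plus `title` and `content`) is an assumption inferred from the INSERT statement and from the lookup by `id` in webapp.py; the post never shows the real schema. The helper names `open_store`, `save_chapter` and `load_chapter` are hypothetical. Note that sqlite3 uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


def open_store():
    """Create an in-memory database with the assumed `fiction` table."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE fiction ('
        ' id INTEGER PRIMARY KEY AUTOINCREMENT,'
        ' title TEXT NOT NULL,'
        ' content TEXT NOT NULL)'
    )
    return conn


def save_chapter(conn, title, content):
    """Insert one chapter with bound parameters and return its new id."""
    cur = conn.execute(
        'INSERT INTO fiction (title, content) VALUES (?, ?)',
        (title, content),
    )
    conn.commit()
    return cur.lastrowid


def load_chapter(conn, chapter_id):
    """Return (title, content) for the given id, or None if it is missing."""
    return conn.execute(
        'SELECT title, content FROM fiction WHERE id = ?',
        (chapter_id,),
    ).fetchone()


conn = open_store()
# The content contains double quotes, which would break a hand-formatted INSERT.
first_id = save_chapter(conn, 'Chapter 1', '<p>He said "hello".</p>')
print(first_id, load_chapter(conn, first_id))
```

Because the values travel as bound parameters, quotes inside a chapter need no manual escaping.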

The code that serves the pages is as follows:

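# coding: utf-8
# webapp.py
import re

import web

# Send every path to Index; the part after the leading slash is passed to GET.
urls = ('/(.*)', 'Index')

db = web.database(dbn='mysql', host='10.1.*.*', port=3306, user='user', passwd='passwd', db='testdb', charset='utf8')

render = web.template.render('template')


class Index:
    def GET(self, html):
        # Accept only paths such as "12.html"; anything else (e.g. favicon.ico) is a 404.
        match = re.match(r'(\d+)\.html$', html)
        if match is None:
            raise web.notfound()
        page_id = int(match.group(1))
        print(page_id)
        # Pass the id as a bound variable instead of formatting it into the SQL.
        data = list(db.query("select * from fiction where id=$id", vars={'id': page_id}))
        if not data:
            raise web.notfound()
        return render.index(data[0], page_id)


if __name__ == '__main__':
    web.application(urls, globals()).run()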

The index.html template used to render each page is as follows:

$def with(data,s)
<meta charset="utf-8"/>
<title>$:data.title</title>
<h1>$:data.title</h1>
<div style="margin:0px auto;text-align:center;">
<a href="$:(int(s)-1).html">Previous page</a>
<a href="$:(int(s)+1).html">Next page</a>
</div>
$:data.content
<br>
<div style="margin:0px auto;text-align:center;">
<a href="$:(int(s)-1).html">Previous page</a>
<a href="$:(int(s)+1).html">Next page</a>
</div>
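
The template computes the neighbouring pages as `int(s)-1` and `int(s)+1`, so the first chapter links to `0.html` and the last chapter links past the end. One possible way to avoid those dead links is to compute the neighbours in Python and pass them to the template. The sketch below is Python 3. The names `neighbor_links`, `first_id` and `last_id` are hypothetical, and it assumes that chapter ids are consecutive integers, which the post does not state.

```python
def neighbor_links(page_id, first_id, last_id):
    """Return (previous_href, next_href); an entry is None when no such page exists."""
    prev_href = '%d.html' % (page_id - 1) if page_id > first_id else None
    next_href = '%d.html' % (page_id + 1) if page_id < last_id else None
    return prev_href, next_href


# Example with an assumed id range of 1 to 50.
for page in (1, 25, 50):
    print(page, neighbor_links(page, 1, 50))
```

In the template, each link could then be wrapped in a `$if` block so that it is only emitted when its value is not None.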

Original article: http://www.cnblogs.com/Detector/p/7604151.html
