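When writing a crawler, the most tedious part is writing the regular expressions: you have to try them one at a time and debug them over and over, which wastes a lot of time. So I wrote a web-based tool. You enter the URL to crawl and the regular expression, and the page shows the data that was matched.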
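Approach: it is actually quite simple. The URL and the regular expression are sent to the server, the server fetches the page and applies the expression, and the results are returned to the front end. I used bootcss (front end) + bottle (back end, processed in Python). The code is simple, but getting it to work was a bit involved. The parameter being passed is itself a URL, and the back end treats "/" as the marker for the end of a route parameter (/......./), so passing the value failed every time. In the end I thought of Base64-encoding the values before passing them. (Base64 is an encoding, not encryption; it is used here only to get the characters through the route.)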
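As a rough sketch of the encoding idea (written in Python 3, not the code used in this post): both values are Base64-encoded and joined with a comma, which never appears in Base64 output, so the server can split them apart again. The helper names `encode_path` and `decode_path` are hypothetical, and the URL-safe Base64 variant is an assumption made here so that the encoded text contains no "/" at all; the post itself uses standard Base64 via base64.js on the client and `base64.decodestring` on the server.

```python
import base64


def encode_path(url, reg):
    """Build the path segment: two URL-safe Base64 strings joined by a comma.

    Hypothetical helper; the post does this step in JavaScript with base64.js.
    """
    parts = [
        base64.urlsafe_b64encode(s.encode("utf-8")).decode("ascii")
        for s in (url, reg)
    ]
    return ",".join(parts)


def decode_path(webstr):
    """Reverse encode_path: split on the comma and decode each half."""
    b64_url, b64_reg = webstr.split(",")
    return tuple(
        base64.urlsafe_b64decode(p).decode("utf-8")
        for p in (b64_url, b64_reg)
    )


# Example values are made up for illustration.
path = encode_path("http://example.com/a/b?x=1", r'<a href="(.*?)">')
url, reg = decode_path(path)
```

The URL-safe alphabet replaces "+" and "/" with "-" and "_", so the encoded segment cannot be cut short by a slash regardless of how the route is matched.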
webRegx.py
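import urllib2
import re
import json


def getHtml(url):
    # Download the page and return its raw HTML
    html = urllib2.urlopen(url).read()
    return html


def getResult(url, reg):
    # Fetch the page, apply the regular expression, and return all matches
    html = getHtml(url)
    reg = re.compile(reg)
    results = reg.findall(html)
    if len(results) > 0:
        for result in results:
            print result
    else:
        print "no result"
    return json.dumps(results)

Note: the function must return JSON-structured data at the end, because the front end parses the response as JSON.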
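To see what the front end receives, here is a minimal Python 3 sketch of the matching step alone, with the network fetch left out. The function name `find_matches` and the sample HTML are made up for illustration; only `re.findall` and `json.dumps` are taken from the code above.

```python
import json
import re


def find_matches(html, pattern):
    """Return every regex match in html, serialized as a JSON array string."""
    results = re.findall(pattern, html)
    # json.dumps turns the Python list into text the browser can parse;
    # an empty list becomes "[]", so the caller always gets valid JSON.
    return json.dumps(results)


# Hypothetical input standing in for a downloaded page.
sample_html = '<a href="/one">1</a><a href="/two">2</a>'
print(find_matches(sample_html, r'href="(.*?)"'))  # prints ["/one", "/two"]
```

If the pattern has more than one capture group, `re.findall` returns tuples, which `json.dumps` writes as nested arrays, so each list item on the page would then be an array rather than a single string.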
main.py
from bottle import route, request, template, run, Bottle, static_file
from webRegx import getResult
import base64

app = Bottle()


@app.route('/')
def show():
    return template('templates/index')


# The wildcard filter #.*?# lets webstr contain any characters
@app.route('/jiexi/:webstr#.*?#', method='post')
def test(webstr):
    #return "hello{}!".format(name)
    #webstr = webstr.replace(',','?')
    base64_url, base64_reg = webstr.split(",")
    url = base64.decodestring(base64_url)  # decode the Base64 values
    reg = base64.decodestring(base64_reg)
    return getResult(url, reg)


@app.route('/templates/:filename')
def send_static(filename):
    return static_file(filename, root='./templates')


run(app, host='localhost', port=8080)
index.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <meta name="description" content="">
  <meta name="author" content="">
  <title>Sticky Footer Template for Bootstrap</title>
  <!-- Bootstrap core CSS -->
  <link rel="stylesheet" href="http://cdn.bootcss.com/bootstrap/3.2.0/css/bootstrap.min.css">
  <!-- Optional Bootstrap theme (usually not needed) -->
  <link rel="stylesheet" href="http://cdn.bootcss.com/bootstrap/3.2.0/css/bootstrap-theme.min.css">
  <!-- jQuery; must be loaded before bootstrap.min.js -->
  <script src="http://cdn.bootcss.com/jquery/1.11.1/jquery.min.js"></script>
  <script src="./templates/base64.js"></script>
  <!-- Bootstrap core JavaScript -->
  <script src="http://cdn.bootcss.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
  <!-- Custom styles for this template -->
  <style type="text/css">
    /* Sticky footer styles
    -------------------------------------------------- */
    html {
      position: relative;
      min-height: 100%;
    }
    body {
      /* Margin bottom by footer height */
      margin-bottom: 60px;
      font-family: 'microsoft yahei', 'Times New Roman', 宋体, Times, serif;
    }
    .footer {
      position: absolute;
      bottom: 0;
      width: 100%;
      /* Set the fixed height of the footer here */
      height: 60px;
      background-color: #f5f5f5;
    }
    /* Custom page CSS
    -------------------------------------------------- */
    /* Not required for template or sticky footer method. */
    .container {
      width: auto;
      max-width: 800px;
      padding: 0 15px;
    }
    .container .text-muted {
      margin: 20px 0;
    }
  </style>
</head>
<body>
  <!-- Begin page content -->
  <div class="container">
    <div class="page-header">
      <h1>Regex Matching</h1>
    </div>
    <div>
      <div class="input-group input-group-lg">
        <span class="input-group-addon">url</span>
        <input type="text" class="form-control" placeholder="Enter a URL" id="url" name="url">
      </div><br/>
      <div class="input-group input-group-lg">
        <span class="input-group-addon">reg</span>
        <input type="text" class="form-control" placeholder="Enter a regular expression" id="reg" name="reg">
        <span class="input-group-btn">
          <button class="btn btn-default" type="submit" onclick="HtmlRegx()" id="myButton">Search</button>
        </span>
      </div>
      <div class="modal fade" id="tip">
        <div class="modal-dialog">
          <div class="modal-content">
            <h3 class="modal-title">Notice</h3>
            <div class="modal-body"><p><h3>Loading...</h3></p></div>
          </div>
        </div>
      </div>
    </div>
    <br/>
    <div>
      <ul class="list-group" id="data-table">
      </ul>
    </div>
  </div>
  <div class="footer">
    <div class="container">
      <p class="text-muted">Place sticky footer content here.</p>
    </div>
  </div>
</body>
<script type="text/javascript">
  function HtmlRegx() {
    var url = document.getElementById("url").value; // URL to crawl
    var reg = document.getElementById("reg").value; // regular expression
    if (url == "" || reg == "") {
      alert("The URL or the regular expression is empty");
      return;
    }
    // Show the loading dialog only after the inputs pass validation,
    // otherwise it would stay open when the function returns early
    $('#tip').modal('show');
    var base64 = new Base64();
    var base64_url = base64.encode(url);
    var base64_reg = base64.encode(reg);
    //var posturl = "/jiexi/"+ url.split("?")+""+reg;
    var posturl = "/jiexi/" + base64_url + "," + base64_reg;
    postdata(posturl, reg);
  }

  function postdata(url, reg) {
    $.ajax({
      type: "POST",
      url: url,
      dataType: "json",
      success: function(data) {
        console.log(data[0]);
        /* $("#table").append('<tr><td>' + data.length + '</td></tr>')*/
        show(data);
      }
    });
  }

  function show(data) {
    $('#tip').modal('hide');
    // Append one list item per match
    for (var i = 0; i < data.length; i++) {
      $("#data-table").append('<li class="list-group-item">' + data[i] + '</li>');
    }
  }
</script>
</html>
The query is sent via AJAX.
Final result:
Original article: http://blog.csdn.net/iloster/article/details/40581317