Parsing Web Pages with Regular Expressions

Date: 2014-10-29 10:55:52

Tags: python, crawler, regular expressions

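When writing a crawler, the most tedious part is writing the regular expressions: each one has to be tried out and debugged over and over, which takes a lot of time. So I wrote a web-based tool: enter the URL to crawl and a regular expression, and the page shows the data that was matched.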

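Approach: the idea is simple. The URL and the regular expression are sent to the server, the server fetches the page and applies the expression, and the result is returned to the front end. I used Bootstrap (front end, loaded from the bootcss CDN) and bottle (back end, written in Python). The code is simple, but getting there was a bit convoluted. Because the parameter being passed is itself a URL, and the back end uses /......./ to decide where a parameter ends, passing the value failed every time. I eventually thought of Base64-encoding the values before passing them.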
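To make the encoding step concrete, here is a minimal sketch in Python 3 (the listings in this post are Python 2). The helper names `encode_segment` and `decode_segment` and the example URL are made up for illustration. It also swaps in the URL-safe Base64 alphabet instead of the standard one used in the post, on the assumption that keeping "/" and "+" out of the path is desirable, since standard Base64 output can contain both.

```python
import base64


def encode_segment(url, reg):
    """Pack a URL and a regex into one comma-separated path segment."""
    parts = []
    for text in (url, reg):
        # URL-safe Base64 uses "-" and "_" instead of "+" and "/".
        encoded = base64.urlsafe_b64encode(text.encode("utf-8"))
        parts.append(encoded.decode("ascii"))
    # "," is not part of the Base64 alphabet, so it is a safe separator.
    return ",".join(parts)


def decode_segment(segment):
    """Reverse encode_segment and return the (url, reg) pair."""
    base64_url, base64_reg = segment.split(",")
    url = base64.urlsafe_b64decode(base64_url).decode("utf-8")
    reg = base64.urlsafe_b64decode(base64_reg).decode("utf-8")
    return url, reg


segment = encode_segment("http://example.com/list?page=1", r'<a href="(.*?)">')
```

The encoded segment contains neither "/" nor "?", so it can travel as a single path parameter, and `decode_segment(segment)` gives back the original pair.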

webRegx.py

import urllib2
import re
import json

def getHtml(url):
    # Download the page and return its raw HTML
    html = urllib2.urlopen(url).read()
    return html

def getResult(url,reg):
    # Fetch the page, apply the regex, and return the matches as JSON
    html = getHtml(url)
    reg = re.compile(reg)
    results = reg.findall(html)
    if len(results)>0:
        for result in results:
            print result
    else:
        print "no result"
    return json.dumps(results)
Note: the function must return its result as JSON-formatted data.
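As a rough illustration of what that JSON looks like, here is a small Python 3 sketch that runs `re.findall` on a hard-coded HTML string instead of a downloaded page. The function name `find_matches` and the sample HTML are assumptions made for this example only.

```python
import json
import re


def find_matches(html, pattern):
    """Return every match of pattern in html as a JSON array string."""
    results = re.findall(pattern, html)
    return json.dumps(results)


sample_html = '<a href="/a">A</a><a href="/b">B</a>'

# One capture group: each match is a plain string.
one_group = find_matches(sample_html, r'<a href="(.*?)">')

# Two capture groups: each match is a tuple, which JSON encodes as an array.
two_groups = find_matches(sample_html, r'<a href="(.*?)">(.*?)</a>')
```

With one capture group the front end receives a flat array of strings. With several groups every entry is itself an array, so the page would need to handle nested data.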

main.py

from bottle import route,request,template,run,Bottle,static_file
from webRegx import getResult
import base64

app = Bottle()

@app.route('/')
def show():
    return template('templates/index')

@app.route('/jiexi/:webstr#.*?#',method='post')
def test(webstr):
    # webstr has the form "<base64 url>,<base64 regex>"
    base64_url,base64_reg = webstr.split(",")
    url=base64.decodestring(base64_url)  # decode the URL
    reg=base64.decodestring(base64_reg)  # decode the regex
    return getResult(url,reg)

@app.route('/templates/:filename')
def send_static(filename):
    return static_file(filename, root='./templates')

run(app, host='localhost', port=8080)
index.html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"  
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="description" content="">
    <meta name="author" content="">

    <title>Sticky Footer Template for Bootstrap</title>

   <!-- Latest Bootstrap core CSS file -->
    <link rel="stylesheet" href="http://cdn.bootcss.com/bootstrap/3.2.0/css/bootstrap.min.css">

<!-- Optional Bootstrap theme file (usually not needed) -->
    <link rel="stylesheet" href="http://cdn.bootcss.com/bootstrap/3.2.0/css/bootstrap-theme.min.css">

<!-- jQuery file. Must be loaded before bootstrap.min.js -->
    <script src="http://cdn.bootcss.com/jquery/1.11.1/jquery.min.js"></script>
    <script src="./templates/base64.js"></script>
<!-- Latest Bootstrap core JavaScript file -->
    <script src="http://cdn.bootcss.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
    <!-- Custom styles for this template -->
    <style type="text/css">
      /* Sticky footer styles
      -------------------------------------------------- */
      html {
        position: relative;
        min-height: 100%;
      }
      body {
        /* Margin bottom by footer height */
        margin-bottom: 60px;
        font-family: 'microsoft yahei', 'Times New Roman', 宋体, Times, serif;
      }
      .footer {
        position: absolute;
        bottom: 0;
        width: 100%;
        /* Set the fixed height of the footer here */
        height: 60px;
        background-color: #f5f5f5;
      }


      /* Custom page CSS
      -------------------------------------------------- */
      /* Not required for template or sticky footer method. */

      .container {
        width: auto;
        max-width: 800px;
        padding: 0 15px;
      }
      .container .text-muted {
        margin: 20px 0;
      }
    </style>

  </head>

  <body>

    <!-- Begin page content -->
    <div class="container">
      <div class="page-header">
        <h1>Regex Matching</h1>
      </div>

      <div>
          <div class="input-group input-group-lg">
            <span class="input-group-addon">url</span>
            <input type="text" class="form-control" placeholder="Enter a URL" id="url" name ="url">
          </div><br/>
          <div class="input-group input-group-lg">
            <span class="input-group-addon">reg</span>
            <input type="text" class="form-control" placeholder="Enter a regular expression" id="reg" name ="reg">
            <span class="input-group-btn">
              <button class="btn btn-default" type="submit"  onclick="HtmlRegx()" id="myButton">Search</button>
            </span>
          </div>
          <div class="modal fade" id="tip">
            <div class="modal-dialog">
             <div class="modal-content">
               <h3 class="modal-title">Notice</h3>
                <div class="modal-body"><h3>Loading...</h3></div>
              </div>
           </div>
          </div>
      </div>
      <br/>
      <div>
        <ul class="list-group" id="data-table">
        </ul>
      </div>

     </div>
    <div class="footer">
      <div class="container">
        <p class="text-muted">Place sticky footer content here.</p>
      </div>
    </div>

  </body>

<script type="text/javascript">
function HtmlRegx()
{
  var url = document.getElementById("url").value; // URL to fetch

  var reg = document.getElementById("reg").value; // regular expression
  if(url=="" || reg=="")
  {
    alert("The URL or the regular expression is empty");
    return;
  }
  // Show the loading dialog only after the input has been validated
  $('#tip').modal('show');
  var base64 = new Base64();
  var base64_url = base64.encode(url);
  var base64_reg = base64.encode(reg);

  var posturl = "/jiexi/"+base64_url+","+base64_reg; 

  postdata(posturl,reg);
}

function postdata(url,reg)
{
    $.ajax({
            type:"POST",
            url:url,
            dataType:"json",
            success:function(data)
              {
              console.log(data[0]);
              show(data);
               },
            error:function()
              {
              // Close the loading dialog if the request fails
              $('#tip').modal('hide');
              alert("Request failed");
              }
            });
 }

function show(data)
{
   $('#tip').modal('hide');
    $("#data-table").empty(); // clear the results of the previous search
    for(var i=0;i<data.length;i++)
    {    
     // .text() escapes the match, so captured HTML is displayed literally
     $("#data-table").append($('<li class="list-group-item">').text(String(data[i])));
   }
}
</script>
</html>
The query is sent via AJAX.
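For testing the back end without a browser, the same request can be put together from a script. The following Python 3 sketch builds (but does not send) the POST request. The function name `build_request` is hypothetical, and it assumes that base64.js produces standard Base64 of the UTF-8 text, which I have not verified.

```python
import base64
import urllib.request


def build_request(url, reg, server="http://localhost:8080"):
    """Build the POST request that the page sends for url and reg."""
    encoded = [
        base64.b64encode(text.encode("utf-8")).decode("ascii")
        for text in (url, reg)
    ]
    return urllib.request.Request(
        server + "/jiexi/" + ",".join(encoded),
        data=b"",        # empty body; both parameters travel in the path
        method="POST",
    )


request = build_request("http://example.com", r"<title>(.*?)</title>")

# To actually send it (requires main.py to be running):
# import json
# with urllib.request.urlopen(request) as response:
#     matches = json.loads(response.read().decode("utf-8"))
```

The host and port default to the values passed to `run()` in main.py.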

Final result (shown as a screenshot in the original post).

Original article: http://blog.csdn.net/iloster/article/details/40581317
