Tags: jsoup web crawler tool, webspider, web page scraping, parsing and writing Excel with jxl, Java HTML parsing tool
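1. Without further ado, here is the requirement:
URL to scrape: http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=
Reference: http://blog.csdn.net/lmj623565791/article/details/23272657
The client only wants one copy of identical records, so duplicates must be filtered out before the data is imported.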
2. With the requirement in mind, look at the page to scrape:
3. Analyze the page source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <!-- saved from url=(0074)http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page= --> <html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <link href="./全市停车场-上海停车网_files/base.css" rel="stylesheet" type="text/css"> <link href="./全市停车场-上海停车网_files/reset.css" rel="stylesheet" type="text/css"> <!--[if IE 6]> <script type="text/javascript" src="http://www.shparking.cn/Scripts/DD_belatedPNG.js" ></script> <script type="text/javascript"> DD_belatedPNG.fix('body,img,div,ul,li,dl,dd,dt,a,input,span');</script> <![endif]--> <script type="text/javascript" src="./全市停车场-上海停车网_files/jquery-1.7.1.min.js"></script> <script type="text/javascript" src="./全市停车场-上海停车网_files/index.js"></script> <link rel="stylesheet" href="http://www.shparking.cn/colorbox/colorbox.css"> <script type="text/javascript" src="./全市停车场-上海停车网_files/lrtk.js"></script> <script src="./全市停车场-上海停车网_files/jquery.colorbox.js"></script> <script> $(document).ready(function(){ $(".reg").colorbox({iframe:true, innerWidth:496, innerHeight:411}); $(".forgetpwd").colorbox({iframe:true, innerWidth:496, innerHeight:242}); $(".mylogin").colorbox({iframe:true, innerWidth:496, innerHeight:266}); }); </script> <title>全市停车场-上海停车网</title> <script> $(function(){ $("#btn-search").click(function(){ var key = $("#txt-key").val(); location.href='http://www.shparking.cn/index.php/welcome/municipal_parking?key=' + key; }); }); </script> </head> <body><div id="cboxOverlay" style="display: none;"></div><div id="colorbox" class="" style="padding-bottom: 32px; padding-right: 0px; display: none;"><div id="cboxWrapper"><div><div id="cboxTopLeft" style="float: left;"></div><div id="cboxTopCenter" style="float: left;"></div><div id="cboxTopRight" style="float: left;"></div></div><div style="clear: left;"><div id="cboxMiddleLeft" style="float: 
left;"></div><div id="cboxContent" style="float: left;"><div id="cboxLoadedContent" style="width: 0px; height: 0px; overflow: hidden; float: left;"></div><div id="cboxLoadingOverlay" style="float: left;"></div><div id="cboxLoadingGraphic" style="float: left;"></div><div id="cboxTitle" style="float: left;"></div><div id="cboxCurrent" style="float: left;"></div><div id="cboxNext" style="float: left;"></div><div id="cboxPrevious" style="float: left;"></div><div id="cboxSlideshow" style="float: left;"></div><div id="cboxClose" style="float: left;"></div></div><div id="cboxMiddleRight" style="float: left;"></div></div><div style="clear: left;"><div id="cboxBottomLeft" style="float: left;"></div><div id="cboxBottomCenter" style="float: left;"></div><div id="cboxBottomRight" style="float: left;"></div></div></div><div style="position: absolute; width: 9999px; visibility: hidden; display: none;"></div></div> <script> $(function(){ $('#username').focus(function(){ if($(this).val() == '请输入用户名'){ $(this).val(''); } }); $('#username').blur(function(){ if($(this).val() == ''){ $(this).val('请输入用户名'); } }); $.ajax({ type: "post", url: "http://www.shparking.cn/index.php/member/getUser", success: function(result) { var user; eval("user="+result); if(user != null){ if(user.usertype==2){ $("#member-type").html('<a>工作人员</a>'); } $('#logout-box').show(); }else{ $('#signinform-box').show(); } }, error: function() { $('#signinform-box').show(); } }); }); </script> <div class="top"> <div class="main"> <dl id="signinform-box" style=""> <dt><img src="./全市停车场-上海停车网_files/ren.png" width="18" height="19"></dt> <dd><a class="reg cboxElement" href="http://www.shparking.cn/index.php/member/reg">注册</a></dd> <span> <dd class="login"><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=#">登录</a></dd> <div class="loginInfo"> <form action="http://www.shparking.cn/index.php/member/signin" method="post" name="signinform" id="signinform"> <input type="hidden" name="act" 
value="signin"> <ul> <li> <input name="username" id="username" type="text"> </li> <li> <input name="password" id="password" type="password"> </li> <li class="submit"> <input name="" type="image" src="./全市停车场-上海停车网_files/denlu.jpg"> <a href="http://www.shparking.cn/index.php/member/forgetpassword" class="forgetpwd cboxElement">忘记密码</a> </li> </ul> </form> </div> </span> </dl> <dl id="logout-box" style="display:none"> <dt><img src="./全市停车场-上海停车网_files/ren.png" width="18" height="19"></dt> <dd id="member-type"><a href="http://www.shparking.cn/index.php/member/main">会员中心</a></dd> <dd><a href="http://www.shparking.cn/index.php/member/logout">退出</a></dd> </dl> </div> </div> <div class="menu"> <div class="main"> <div class="logo"><img src="./全市停车场-上海停车网_files/logo.png" width="168" height="55"></div> <div class="info"> <ul> <li class="index" style="cursor: pointer;"><a href="http://www.shparking.cn/">首页</a></li> <li class="" style="cursor: pointer;"><a href="http://www.shparking.cn/index.php/welcome/travel">出行宝典</a></li> <li class="" style="cursor: pointer;"><a href="http://www.shparking.cn/index.php/welcome/onlineservice">网上办事</a></li> <li class="" style="cursor: pointer;"><a href="http://www.shparking.cn/index.php/welcome/news">资讯中心</a></li> <li class="" style="cursor: pointer;"><a href="http://www.shparking.cn/index.php/welcome/policy">政策法规</a></li> <li class="" style="cursor: pointer;"><a href="http://www.shparking.cn/index.php/welcome/companylist">会员中心</a></li> <li class="" style="cursor: pointer;"><a href="http://www.shparking.cn/index.php/welcome/aboutass">关于协会</a></li> </ul> </div> </div> </div> <div class="main"> <div class="NlistL"> <dl class="aboutnews"> <dt>查询<span>SEARCH</span></dt> <dd><input id="txt-key" type="text" value=""><input id="btn-search" type="button" value="查询"></dd> </dl> <div class="ad"> <div style="float:left; width:240px; height:160px; overflow:hidden; background:url(http://www.shparking.cn/uppic/municipal_parking.png) no-repeat 0px 0px;"> 
<div style="float:left; width:100px; height:35px; margin:28px 0px 0px 23px; cursor:pointer" onclick="location.href='http://www.shparking.cn/index.php/welcome/municipal_parking'"></div> <div style="float:left; width:100px; height:35px; margin:28px 0px 0px 23px; cursor:pointer" onclick="location.href='http://www.shparking.cn/index.php/welcome/dl_parking'"></div> </div> </div> <div class="ad" style="height:5px; clear:both"> </div> <dl class="aboutnews"> <dt>相关新闻<span>NEW</span></dt> <dd><a href="http://www.shparking.cn/index.php/welcome/newsinfo/6/6/1771">上海停车信息(第140期...</a></dd> <dd><a href="http://www.shparking.cn/index.php/welcome/newsinfo/6/6/1616">关于开展“同心共筑中...</a></dd> <dd><a href="http://www.shparking.cn/index.php/welcome/newsinfo/6/6/1791">家门口的停车场到底该...</a></dd> <dd><a href="http://www.shparking.cn/index.php/welcome/newsinfo/6/6/1790">春运首日虹桥机场已难...</a></dd> <dd><a href="http://www.shparking.cn/index.php/welcome/newsinfo/6/6/1789">上海静安交警“先礼后...</a></dd> <dd><a href="http://www.shparking.cn/index.php/welcome/newsinfo/6/6/1788">废旧停车场改建成188间...</a></dd> </dl> </div> <div class="NlistR"> <div class="nav">当前状态:<a href="http://www.shparking.cn/">上海停车网</a> > 全市停车场库</div> <ul class="list"> <li style="background:none; background-color:#f1f1f1;"> <span style="float:left; width:320px; font-size:13px;" align="left">停车场名称</span> <span style="float:left; width:230px; font-size:13px;" align="left">地址</span> <span style="float:left; width:60px; font-size:13px;" align="center">泊位数</span> <span style="float:left; width:90px; font-size:13px;" align="center">价格</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海银河宾馆有限公司" align="left">上海银河宾馆有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="中山西路888号" align="left">中山西路888号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">88</span> <span style="float:left; width:90px; 
height:28px; font-size:12px; overflow:hidden" align="center">10</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海扬子江大酒店有限公司" align="left">上海扬子江大酒店有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="延安西路2099号" align="left">延安西路2099号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">46</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">20</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海北圣投资管理有限公司" align="left">上海北圣投资管理有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="黄杨路18号" align="left">黄杨路18号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">71</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">7</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海北圣投资管理有限公司" align="left">上海北圣投资管理有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="黄杨路18号" align="left">黄杨路18号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">71</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">4</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海东方明珠物产管理有限公司" align="left">上海东方明珠物产管理有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="东方路1866-1882号" align="left">东方路1866-1882号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">15</span> <span style="float:left; width:90px; height:28px; 
font-size:12px; overflow:hidden" align="center">5</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海百联汽车经营服务发展有限公司" align="left">上海百联汽车经营服务发展有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="共和新路3550号" align="left">共和新路3550号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">70</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">5</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海江海建设开发公司" align="left">上海江海建设开发公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="张杨路818号" align="left">张杨路818号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">149</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">9</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海江海建设开发公司" align="left">上海江海建设开发公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="张杨路818号" align="left">张杨路818号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">149</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">5</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海奥凯酒店经营管理有限公司" align="left">上海奥凯酒店经营管理有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="石龙路569号" align="left">石龙路569号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">31</span> <span style="float:left; width:90px; height:28px; font-size:12px; 
overflow:hidden" align="center">4</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海静工宏林投资发展有限公司" align="left">上海静工宏林投资发展有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="昌平路68号" align="left">昌平路68号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">0</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center"></span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海人民企业集团物业管理有限公司" align="left">上海人民企业集团物业管理有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="吴江路188号" align="left">吴江路188号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">0</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center"></span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海惠真医药有限公司" align="left">上海惠真医药有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="宝山区友谊路185号" align="left">宝山区友谊路185号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">105</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">3</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海惠真医药有限公司" align="left">上海惠真医药有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="宝山区友谊路185号" align="left">宝山区友谊路185号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">105</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" 
align="center">6</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海同乐坊物业管理有限公司" align="left">上海同乐坊物业管理有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="海防路537号9幢01室" align="left">海防路537号9幢01室</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">32</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">10</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海长翎管理咨询有限公司" align="left">上海长翎管理咨询有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="淮安路735号" align="left">淮安路735号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">21</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">6</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海香梅休闲健身俱乐部有限公司" align="left">上海香梅休闲健身俱乐部有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="浦东梅花路1号" align="left">浦东梅花路1号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">17</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">5</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="泊亿管理咨询(上海)有限公司虹口分公司" align="left">泊亿管理咨询(上海)有限公司虹口分公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="东宝兴路777号" align="left">东宝兴路777号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">80</span> <span style="float:left; width:90px; height:28px; font-size:12px; 
overflow:hidden" align="center">8</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海金沙江大酒店有限公司" align="left">上海金沙江大酒店有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="怒江路257号" align="left">怒江路257号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">60</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">4</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海维思鼎文化发展有限公司" align="left">上海维思鼎文化发展有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="兰溪路138号" align="left">兰溪路138号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">32</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">6</span> </li> <li> <span style="float:left; width:320px; height:28px; font-size:13px; overflow:hidden" title="上海磊明工贸有限公司" align="left">上海磊明工贸有限公司</span> <span style="float:left; width:230px; height:28px; font-size:12px; overflow:hidden" title="华联路130弄32号" align="left">华联路130弄32号</span> <span style="float:left; width:60px; height:28px; font-size:12px; overflow:hidden" align="center">22</span> <span style="float:left; width:90px; height:28px; font-size:12px; overflow:hidden" align="center">4</span> </li> </ul> <ul class="page"> <li>1</li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=20">2</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=40">3</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=60">4</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=80">5</a></li><li><a 
href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=100">6</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=120">7</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=140">8</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=160">9</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=180">10</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=200">11</a></li><li><a href="http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=3300">>|</a></li> </ul> </div> </div> <div class="clear"></div> <div class="footer"> <div class="main"> <div class="link"> <dl> <dt><img src="./全市停车场-上海停车网_files/linklogo.png" width="110" height="40"></dt> <dd><a href="http://www.jt.sh.cn/">上海市交通网</a></dd> <dd><a href="http://www.shygc.net/pageHome.do;jsessionid=60CFD91E9852E9DCC95C163CD759BB89?page=initList">上海市城市交通运输管理处</a></dd> <dd><a href="http://www.shjtaq.com/main/index.asp">上海交通安全信息网</a></dd> <dd><a href="http://www.shjpxh.org.cn/">上海市机动车驾驶员培训行业网站</a></dd> <dd><a href="http://www.chinaparking.org/">中国停车网</a></dd> </dl> <div class="clear"></div> </div> </div> <div class="line"></div> <div class="main"><div class="copyright"><a href="http://www.shparking.cn/index.php/welcome/view/8/14/40">关于我们</a><a href="http://www.shparking.cn/index.php/welcome/view/8/14/41">联系我们</a><a href="http://www.shparking.cn/index.php/welcome/view/8/14/42">网站地图</a> <script src="./全市停车场-上海停车网_files/stat.php" language="JavaScript"></script><script src="./全市停车场-上海停车网_files/core.php" charset="utf-8" type="text/javascript"></script><a href="http://www.cnzz.com/stat/website.php?web_id=5192959" target="_blank" title="站长统计">站长统计</a><br> <span>沪ICP备11029623号-2 © 2015 版权所有</span><span>主办单位:上海市停车服务业行业协会 承办单位:上海市中停车管理服务有限公司</span></div></div> </div> 
</body></html>
4. Analyze the key information involved in requesting the page:
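One thing the pagination links in the source above suggest is that `per_page` is a record offset rather than a page size: the links step through 20, 40, 60, ... and the "last page" link points to 3300, while each page lists 20 rows. A minimal sketch under that assumption builds the page URLs directly. The class name `PageUrlSketch` and the method `buildPageUrls` are illustrative and not part of the original code, and the values 20 and 3300 come from the saved page, so they may no longer match the live site.

```java
import java.util.ArrayList;
import java.util.List;

public class PageUrlSketch {

    // Base URL from the article; the record offset is appended to it.
    static final String BASE =
        "http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=";

    /**
     * Builds the page URLs for offsets 0, step, 2 * step, ..., lastOffset.
     * The first page uses an empty per_page value, as in the article.
     * Assumption: per_page is a record offset that grows by a fixed step.
     */
    static List<String> buildPageUrls(int lastOffset, int step) {
        if (step <= 0 || lastOffset < 0) {
            throw new IllegalArgumentException("step must be > 0 and lastOffset >= 0");
        }
        List<String> urls = new ArrayList<String>();
        urls.add(BASE); // first page: empty offset
        for (int offset = step; offset <= lastOffset; offset += step) {
            urls.add(BASE + offset);
        }
        return urls;
    }

    public static void main(String[] args) {
        // 3300 and 20 are read from the pagination links in the saved page source.
        List<String> urls = buildPageUrls(3300, 20);
        System.out.println("pages = " + urls.size());
        System.out.println("second page = " + urls.get(1));
    }
}
```

Generating the URLs up front would avoid following links page by page, but it relies on the offset pattern staying stable; following the links, as the utility class below does, needs no such assumption.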
5. Based on the business requirement, create the model class for the data to scrape:
package com.ilucky.util.webspider;

/**
 * Model for one scraped record.
 * @author IluckySi
 * @since 20150210
 */
public class DataModel {

    private String parkName;    // Parking lot name.
    private String parkAddress; // Parking lot address.
    private String parkCount;   // Number of parking spaces.
    private String parkPrice;   // Parking price.

    public String getParkName() {
        return parkName;
    }

    public void setParkName(String parkName) {
        this.parkName = parkName;
    }

    public String getParkAddress() {
        return parkAddress;
    }

    public void setParkAddress(String parkAddress) {
        this.parkAddress = parkAddress;
    }

    public String getParkCount() {
        return parkCount;
    }

    public void setParkCount(String parkCount) {
        this.parkCount = parkCount;
    }

    public String getParkPrice() {
        return parkPrice;
    }

    public void setParkPrice(String parkPrice) {
        this.parkPrice = parkPrice;
    }

    @Override
    public String toString() {
        return "parkName = " + parkName + ", parkAddress = " + parkAddress
                + ", parkCount = " + parkCount + ", parkPrice = " + parkPrice;
    }
}
6. The scraping utility class, written against the page source analyzed above:
package com.ilucky.util.webspider;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Scraping utility class. The most important part is identifying the features of the data to scrape.
 * URL to scrape: http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=.
 * @author IluckySi
 * @since 20150210
 */
public class WebSpiderUtil {

    private String url; // URL to scrape.

    private List<DataModel> dataModelList = new ArrayList<DataModel>(); // Scraped records.

    private Set<String> urlSet = new HashSet<String>(); // Internal URLs that have already been parsed.

    private Connection connection = null;

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public List<DataModel> getDataModelList() {
        return dataModelList;
    }

    public void setDataModelList(List<DataModel> dataModelList) {
        this.dataModelList = dataModelList;
    }

    public Set<String> getUrlSet() {
        return urlSet;
    }

    public void setUrlSet(Set<String> urlSet) {
        this.urlSet = urlSet;
    }

    /**
     * Parses the HTML behind the url with a third-party library (jsoup) and extracts the required data by its features.
     * It also extracts the internal urls by their features and goes on to parse the HTML behind them,
     * so all related pages are parsed recursively.
     */
    public void caputerData() {
        try {
            if (url != null && !url.equals("")) {
                System.out.println("正在抓取的页面url = " + url);
                connection = Jsoup.connect(url);
                Document document = connection.timeout(100000).get();

                // Collect the internal urls (pagination links) of the current page.
                Elements resultUrls = document.getElementsByClass("page");
                Elements links = null;
                for (Element resultUrl : resultUrls) {
                    links = resultUrl.getElementsByTag("a");
                }

                // Parse the HTML of the current url.
                Elements resultDatas = document.getElementsByClass("list");
                for (Element resultData : resultDatas) {
                    Elements uls = resultData.getElementsByTag("ul");
                    for (Element ul : uls) {
                        Elements lis = ul.getElementsByTag("li");
                        // Start at index 1: the first <li> is the table header row.
                        for (int i = 1; lis != null && i < lis.size(); i++) {
                            Element li = lis.get(i);
                            Elements spans = li.getElementsByTag("span");
                            if (spans != null && spans.size() == 4) {
                                DataModel dataModel = new DataModel();
                                dataModel.setParkName(spans.get(0).text());
                                dataModel.setParkAddress(spans.get(1).text());
                                dataModel.setParkCount(spans.get(2).text());
                                dataModel.setParkPrice(spans.get(3).text());
                                dataModelList.add(dataModel);
                            }
                        }
                    }
                }

                // The tricky part: pick the next url to scrape. A Set of already parsed urls prevents parsing a page twice.
                urlSet.add(url);
                boolean flag = false;
                for (int i = 0; links != null && i < links.size(); i++) {
                    String link = links.get(i).attr("href");
                    if (!urlSet.contains(link)) {
                        url = link;
                        flag = true;
                        break;
                    }
                }
                if (flag) {
                    caputerData();
                }
            }
        } catch (Exception e) {
            System.out.println("抓取数据出现异常: url = " + url + ", e = " + e);
        }
    }
}
7. Now the test class:
package com.ilucky.util;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.ilucky.util.jxl.JxlUtil;
import com.ilucky.util.webspider.DataModel;
import com.ilucky.util.webspider.WebSpiderUtil;

/**
 * URL to scrape: http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=.
 * Reference: http://blog.csdn.net/lmj623565791/article/details/23272657
 * Requirement: scrape the given url, n pages in total (so the internal links must be analyzed), extract the parking
 * information from each page (so the features of the parking information must be analyzed), and import it into Excel.
 * Problem analysis: the pages contain identical parking lot records (same name, address and number of spaces),
 * so the Excel file would contain identical records too.
 * The client only wants one copy of identical records, so duplicates must be filtered out before the import.
 * @author IluckySi
 * @since 20150210
 */
public class MainTest {

    public static void main(String[] args) {
        // Scrape the data.
        WebSpiderUtil wsu = new WebSpiderUtil();
        wsu.setUrl("http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=");
        long start = System.currentTimeMillis();
        wsu.caputerData();
        System.out.println("抓取数据共耗时:" + (System.currentTimeMillis() - start) / 1000 + "秒");

        // Filter out duplicate records.
        List<DataModel> all = wsu.getDataModelList();
        List<DataModel> del = new ArrayList<DataModel>();  // Discarded duplicates.
        List<DataModel> save = new ArrayList<DataModel>(); // Records to keep.
        System.out.println("抓取到的数据共:" + all.size());
        for (int i = 0; all != null && i < all.size(); i++) {
            boolean flag = false;
            DataModel dm = all.get(i);
            for (int j = 0; save != null && j < save.size(); j++) {
                DataModel dm2 = save.get(j);
                // Two records are duplicates when address, name and number of spaces all match.
                if (dm.getParkAddress().equals(dm2.getParkAddress())
                        && dm.getParkName().equals(dm2.getParkName())
                        && dm.getParkCount().equals(dm2.getParkCount())) {
                    del.add(dm); // Record the discarded duplicate, not the copy that is kept.
                    flag = true;
                    break;
                }
            }
            if (flag == false) {
                save.add(dm);
            }
        }

        // Write the filtered data to Excel.
        JxlUtil ju = new JxlUtil();
        ju.setPath("C:\\Users\\IluckySi\\Desktop\\model.xls");
        Map<String, List<List<String>>> listListMap = new HashMap<String, List<List<String>>>();
        List<List<String>> listList = new ArrayList<List<String>>();
        // Header row.
        List<String> list1 = new ArrayList<String>();
        list1.add("停车场名称");
        list1.add("地址");
        list1.add("泊位数");
        list1.add("价格");
        listList.add(list1);
        for (DataModel dm : save) {
            List<String> list = new ArrayList<String>();
            list.add(dm.getParkName());
            list.add(dm.getParkAddress());
            list.add(dm.getParkCount());
            list.add(dm.getParkPrice());
            listList.add(list);
        }
        System.out.println("存入到excel的数据共:" + save.size());
        listListMap.put("上海停车网", listList); // Sheet name mapped to its rows.
        ju.write(listListMap);
    }
}
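The filter in the test class compares each record against every record kept so far. As a minimal alternative sketch, assuming the same duplicate criterion (name, address and number of spaces), the records can be keyed in a `LinkedHashMap` so that the first occurrence wins and the original order is preserved. `Park` is a hypothetical stand-in for `DataModel`, and the names `DedupSketch` and `dedupe` are illustrative, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    /** Hypothetical stand-in for the article's DataModel (fields only). */
    static final class Park {
        final String name;
        final String address;
        final String count;
        final String price;

        Park(String name, String address, String count, String price) {
            this.name = name;
            this.address = address;
            this.count = count;
            this.price = price;
        }
    }

    /**
     * Keeps the first record for each (name, address, count) combination.
     * The price is deliberately not part of the key, matching the article's criterion.
     */
    static List<Park> dedupe(List<Park> all) {
        Map<List<String>, Park> firstSeen = new LinkedHashMap<List<String>, Park>();
        for (Park p : all) {
            // Lists compare by content, so equal field values give equal keys.
            List<String> key = Arrays.asList(p.name, p.address, p.count);
            firstSeen.putIfAbsent(key, p);
        }
        return new ArrayList<Park>(firstSeen.values());
    }

    public static void main(String[] args) {
        // Sample rows copied from the page source shown earlier.
        List<Park> all = Arrays.asList(
            new Park("上海北圣投资管理有限公司", "黄杨路18号", "71", "7"),
            new Park("上海北圣投资管理有限公司", "黄杨路18号", "71", "4"),
            new Park("上海银河宾馆有限公司", "中山西路888号", "88", "10"));
        for (Park p : dedupe(all)) {
            System.out.println(p.name + ", " + p.address + ", " + p.count + ", " + p.price);
        }
    }
}
```

Both approaches keep the first of several matching records; the map-based version simply avoids rescanning the kept list for every new record.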
<pre code_snippet_id="602483" snippet_file_name="blog_20150211_3_6245062" name="code" class="java">/** 控制台输出: 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page= 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=20 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=40 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=60 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=80 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=100 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=120 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=140 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=160 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=180 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=200 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=220 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=240 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=260 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=280 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=300 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=320 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=340 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=360 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=380 正在抓取的页面url = 
http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=400 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=420 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=440 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=460 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=480 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=500 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=520 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=540 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=560 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=580 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=600 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=620 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=640 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=660 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=680 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=700 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=720 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=740 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=760 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=780 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=800 正在抓取的页面url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=820 正在抓取的页面url 
= http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=840
Fetching page url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=860
Fetching page url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=880
... (one log line per page, with per_page increasing by 20 each time, from 900 through 3280) ...
Fetching page url = http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=3300
Total crawl time: 58 seconds
Records crawled: 3318
Records written to Excel: 2465
File written successfully
*/
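The gap between the 3318 crawled records and the 2465 records written to Excel comes from the client requirement that identical entries be kept only once. A minimal sketch of such a filter follows. It assumes each record is a List<String> and that two records are duplicates only when every cell matches; the class name RowDeduplicator and the sample values are hypothetical, and the filter used in the original crawler may compare on a different key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of the original crawler.
public class RowDeduplicator {

    /**
     * Returns the rows with exact duplicates removed, keeping the first
     * occurrence of each row and preserving the original order.
     */
    public static List<List<String>> dedupe(List<List<String>> rows) {
        // List.equals/hashCode compare element by element, so a set of lists
        // treats two rows with identical cells as the same entry.
        return new ArrayList<List<String>>(new LinkedHashSet<List<String>>(rows));
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<List<String>>();
        rows.add(Arrays.asList("Lot A", "1 Example Rd")); // made-up sample data
        rows.add(Arrays.asList("Lot B", "2 Example Rd"));
        rows.add(Arrays.asList("Lot A", "1 Example Rd")); // exact duplicate of the first row
        System.out.println(dedupe(rows).size()); // prints 2
    }
}
```

A LinkedHashSet is used rather than a HashSet so that the surviving rows stay in crawl order.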
package com.ilucky.util.jxl;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;
import jxl.read.biff.BiffException;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;

/**
 * Note: this utility only parses Excel 2003 (.xls) files. Parsing a newer
 * Excel format fails with:
 * jxl.read.biff.BiffException: Unable to recognize OLE stream
 * Workaround: save the file as an "Excel 97-2003 Workbook" and parse that.
 * jxl is a lightweight pure-Java library. Apache POI is also pure Java and
 * cross-platform, and is more powerful (it also handles .xlsx), at the cost
 * of a larger dependency.
 * @author IluckySi
 * @since 20141215
 */
public class JxlUtil {

    private String filePath;

    public String getPath() {
        return filePath;
    }

    public void setPath(String filePath) {
        this.filePath = filePath;
    }

    /**
     * Parses the Excel file.
     * @return map of sheet name to rows, each row being a list of cell contents
     */
    public Map<String, List<List<String>>> parse() {
        Map<String, List<List<String>>> listListMap = new HashMap<String, List<List<String>>>();
        File file = new File(filePath);
        if (!file.exists() || !file.getName().endsWith(".xls")) {
            // Stop here instead of falling through and opening a bad path.
            System.out.println("Invalid path to parse: " + filePath);
            return listListMap;
        }
        Workbook workBook = null;
        FileInputStream fis = null;
        try {
            fis = new FileInputStream(file);
            workBook = Workbook.getWorkbook(fis);
            Sheet[] sheetArray = workBook.getSheets();
            for (int i = 0; sheetArray != null && i < sheetArray.length; i++) {
                Sheet sheet = sheetArray[i];
                List<List<String>> listList = parseSheet(sheet);
                if (listList != null && listList.size() > 0) {
                    listListMap.put(sheet.getName(), listList);
                }
            }
        } catch (BiffException e) {
            System.out.println("Error while parsing file: " + e);
        } catch (IOException e) {
            System.out.println("Error while parsing file: " + e);
        } finally {
            try {
                if (workBook != null) {
                    workBook.close();
                }
                if (fis != null) {
                    fis.close();
                }
            } catch (Exception e) {
                System.out.println("Error while closing file stream: " + e);
            }
        }
        return listListMap;
    }

    /**
     * Parses one sheet. Watch out for merged cells: if A6-A12 are merged,
     * the library reports a value only for A6.
     * @param sheet the sheet to parse
     */
    private List<List<String>> parseSheet(Sheet sheet) {
        List<List<String>> listList = new ArrayList<List<String>>();
        int rowCount = sheet.getRows();
        // Start at row 1: row 0 is treated as the header and skipped.
        for (int i = 1; i < rowCount; i++) {
            List<String> list = new ArrayList<String>();
            Cell[] cellArray = sheet.getRow(i);
            for (int j = 0; cellArray != null && j < cellArray.length; j++) {
                list.add(cellArray[j].getContents());
            }
            listList.add(list);
        }
        return listList;
    }

    /**
     * Writes the data to an Excel file, one sheet per map entry.
     * Note: write method added on 20150211.
     * @param listListMap map of sheet name to rows
     * @return true if the file was written, false otherwise
     */
    public boolean write(Map<String, List<List<String>>> listListMap) {
        File file = new File(filePath);
        boolean result = false;
        WritableWorkbook workBook = null;
        FileOutputStream fos = null;
        try {
            fos = new FileOutputStream(file);
            workBook = Workbook.createWorkbook(fos);
            int sheetNo = 0;
            for (Entry<String, List<List<String>>> entry : listListMap.entrySet()) {
                String key = entry.getKey();
                List<List<String>> listList = entry.getValue();
                WritableSheet sheet = workBook.createSheet(key, sheetNo++);
                for (int i = 0; i < listList.size(); i++) {
                    List<String> list = listList.get(i);
                    for (int j = 0; j < list.size(); j++) {
                        // Label takes (column, row, text).
                        Label label = new Label(j, i, list.get(j));
                        sheet.addCell(label);
                    }
                }
            }
            workBook.write();
            result = true; // the original never set this, so write() always returned false
            System.out.println("File written successfully");
        } catch (Exception e) {
            System.out.println("Error while writing file: " + e);
        } finally {
            try {
                if (workBook != null) {
                    workBook.close();
                }
                if (fos != null) {
                    fos.close();
                }
            } catch (Exception e) {
                // WritableWorkbook.close() can throw WriteException as well as
                // IOException, so catching only IOException does not compile.
                System.out.println("Error while closing file stream: " + e);
            }
        }
        return result;
    }
}
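One detail of JxlUtil is easy to miss: write() starts writing at row 0, while parseSheet() starts reading at row 1. If the data passed to write() has no header row, reading the file back with parse() silently drops the first record. The sketch below shows one way to build the map with a header row in front. The class name SheetDataBuilder, the sheet name, and the column titles are hypothetical and are not taken from the original crawler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original post.
public class SheetDataBuilder {

    /** Builds a single-sheet map with the header as row 0 followed by the data rows. */
    public static Map<String, List<List<String>>> build(
            String sheetName, List<String> header, List<List<String>> rows) {
        List<List<String>> sheetRows = new ArrayList<List<String>>();
        sheetRows.add(header);  // row 0: skipped by parseSheet() when reading back
        sheetRows.addAll(rows); // rows 1..n: the actual records
        // LinkedHashMap keeps sheets in insertion order if more are added later.
        Map<String, List<List<String>>> data = new LinkedHashMap<String, List<List<String>>>();
        data.put(sheetName, sheetRows);
        return data;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<List<String>>();
        rows.add(Arrays.asList("Lot A", "1 Example Rd")); // made-up sample data
        rows.add(Arrays.asList("Lot B", "2 Example Rd"));
        Map<String, List<List<String>>> data =
                build("parking", Arrays.asList("Name", "Address"), rows);
        System.out.println(data.get("parking").size()); // prints 3
    }
}
```

The resulting map has the shape that JxlUtil.write() expects, so it can be passed to it directly after setPath() has been called with a .xls file name.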
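Note: my experience is limited, so the code is fairly rigid, with many hard-coded values. I will improve this utility later. Guidance from anyone experienced with web crawlers is very welcome. Many thanks!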
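One small step toward less hard-coded code is to derive the page URLs from parameters instead of fixing them in the crawl loop. The log shows per_page acting as a record offset that grows by 20 per page and ends at 3300. The sketch below generates that list. The class name PageUrlBuilder and its parameters are hypothetical, and starting at offset 0 is an assumption, since the first page is requested with an empty per_page value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original crawler.
public class PageUrlBuilder {

    /**
     * Returns baseUrl + offset for offset = 0, pageSize, 2 * pageSize, ...
     * up to and including lastOffset.
     */
    public static List<String> pageUrls(String baseUrl, int pageSize, int lastOffset) {
        if (pageSize <= 0) {
            // A non-positive step would never reach lastOffset.
            throw new IllegalArgumentException("pageSize must be positive: " + pageSize);
        }
        List<String> urls = new ArrayList<String>();
        for (int offset = 0; offset <= lastOffset; offset += pageSize) {
            urls.add(baseUrl + offset);
        }
        return urls;
    }

    public static void main(String[] args) {
        String base = "http://www.shparking.cn/index.php/welcome/municipal_parking?key=&per_page=";
        List<String> urls = pageUrls(base, 20, 3300);
        System.out.println(urls.size()); // prints 166
        System.out.println(urls.get(urls.size() - 1));
    }
}
```

With 20 records per page, 166 pages hold at most 3320 records, which is consistent with the 3318 records reported in the log. In practice the last offset would be read from the site's pagination links rather than passed in as a constant.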
How to crawl web page data with the jsoup web crawler tool and export it to Excel with jxl
Original article: http://blog.csdn.net/sidongxue2/article/details/43733189