
Python BeautifulSoup4: script nodes that cannot be found

Posted: 2019-11-05 21:44:49


While scraping station names from 12306, I found that BeautifulSoup could not locate the node containing station_version.

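The cause is that the script tags sit after the closing </html> tag. With the 'lxml' parser that part of the page was dropped, whereas html5lib, which parses the page the way a browser does, keeps it.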

  ...
<!-- 购物车 -->
<div style="display: none;" class="buy-cart"><div class="cart-hd"><span class="num">0</span>
</div>
<div class="cart-bd" style="display: none;"><div class="cart-bd-top"><h3><span id="hbTrainDate">候补购票需求列表</span>
<a id="hbClear" href="javascript:void(0)" shape="rect">[清空]</a>
</h3>
<a href="javascript:void(0)" class="close" shape="rect">×</a>
</div>
<div class="cart-bd-con"><ul class="cart-tlist"></ul>
</div>
<div class="cart-bd-ft"><p class="cart-ft-tips">1、候补订单需求中可包含2个相邻乘车日期,每个乘车日期可包含2个不同“车次+席别”的组合需求。</p>
<p class="cart-ft-tips">2、排位是指您的订单在待兑现订单中的位置。当前排位仅供参考,实际排位以支付成功后为准。</p>
<a id="hbSubmit" href="javascript:void(0)" class="btn72 fr" shape="rect">添加乘客</a>
</div>
</div>
</div>
</body>
</html>  # the soup produced by 'lxml' ends here
<script type="text/javascript" src="/otn/resources/js/framework/station_name.js?station_version=1.9115" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/js/framework/favorite_name.js" xml:space="preserve"></script>
<script type="text/javascript" src="/otn/resources/merged/queryLeftTicket_end_js.js?scriptVersion=1.9158" xml:space="preserve"></script>
  ...
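
If installing html5lib is not an option, one possible fallback is to skip tree building altogether and scan the raw HTML for script tags. The sketch below uses the standard library's event-based HTMLParser, which reports tags in document order and does not stop at </html>. The class name ScriptSrcCollector and the shortened SAMPLE string are made up for illustration; they are not taken from the 12306 page or from BeautifulSoup.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag, wherever it appears."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names and reports tags in document order,
        # including any that follow the closing </html>.
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical, shortened stand-in for the tail of the page shown above.
SAMPLE = """<html><body><div class="buy-cart"></div></body>
</html>
<script type="text/javascript" src="/otn/resources/js/framework/station_name.js?station_version=1.9115"></script>
<script type="text/javascript" src="/otn/resources/js/framework/favorite_name.js"></script>
"""

collector = ScriptSrcCollector()
collector.feed(SAMPLE)
collector.close()

# Keep only the script whose URL carries the station_version parameter.
station_srcs = [s for s in collector.sources if "station_version" in s]
print(station_srcs)
```

This only collects attribute values, so it is no substitute for a real tree when the rest of the page has to be navigated as well.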

 

>>> import re
>>> import requests
>>> from bs4 import BeautifulSoup as bs
>>> url = "https://kyfw.12306.cn/otn/leftTicket/init?linktypeid=dc&fs=%E4%B8%87%E5%B7%9E,WYW&ts=%E8%A5%BF%E5%AE%89,XAY&date=2019-11-05&flag=N,N,Y"
>>> response = requests.get(url, timeout=10)
>>> response.encoding = 'utf-8'
>>> # Parse the same text with both parsers; the names avoid shadowing the lxml and html5lib modules.
>>> soup_lxml = bs(response.text, 'lxml')
>>> soup_html5lib = bs(response.text, 'html5lib')
>>> response.close()
>>> soup_lxml.find_all(src=re.compile(".*station_version.*"))
[]
>>> soup_html5lib.find_all(src=re.compile(".*station_version.*"))
[<script src="/otn/resources/js/framework/station_name.js?station_version=1.9115" type="text/javascript" xml:space="preserve"></script>]
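
Once the script node has been found, the version number can be read out of its src attribute. The following is a minimal sketch using only the standard library; the helper name station_version is hypothetical, and the src string is copied from the html5lib result above.

```python
from urllib.parse import parse_qs, urlsplit


def station_version(src):
    """Return the station_version query parameter of a script URL, or None."""
    query = urlsplit(src).query
    values = parse_qs(query).get("station_version")
    return values[0] if values else None


# src value as returned by the html5lib soup above.
src = "/otn/resources/js/framework/station_name.js?station_version=1.9115"
version = station_version(src)
print(version)
```

With a real soup object the src would come from the found tag, for example tag["src"] on the first element of the find_all result.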

 


Original article: https://www.cnblogs.com/wawawawa-briefnote/p/11801636.html
