Learning Python web scraping: Beautiful Soup

Posted: 2016-07-12 01:26:11

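While building a backtesting system, I found that quarterly stock reports do not reflect the total share capital on each day very well. I then found a data source on Sina and decided to scrape it with Beautiful Soup.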

First, fetch the page that corresponds to the stock code:

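from urllib.request import urlopen
from bs4 import BeautifulSoup
# The first six characters of the ticker are the stock code used in the URL
code = yahooCode[0][:6]
html = urlopen("http://money.finance.sina.com.cn/corp/go.php/vCI_StockStructureHistory/stockid/"+str(code)+"/stocktype/TotalStock.phtml")
bsObj = BeautifulSoup(html, "lxml")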

I recommend the following form instead: with lxml, the pages for a few codes could not be parsed, possibly because those pages are too long.

bsObj = BeautifulSoup(html, "html.parser",from_encoding="gb18030")
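
As a minimal sketch of what from_encoding does, the snippet below parses a small hand-made page supplied as GB18030 bytes. The markup and the id "demo" are made up for illustration; only BeautifulSoup, "html.parser", and from_encoding come from the post.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded as GB18030 bytes (not a real Sina page)
raw = '<html><body><div id="demo">总股本</div></body></html>'.encode("gb18030")

# from_encoding tells Beautiful Soup how to decode the raw bytes
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")
text = soup.find(id="demo").string
print(text)
```

from_encoding only applies when the input is bytes, which is the case for the response returned by urlopen.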

To look up an element directly by its id, find(id=xxx) is enough.

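The result is structured like a DOM tree. You can search for child tags with find or findAll, as in the scraping code at the end of this post; the difference is that find returns the first match and findAll returns all matches.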
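
The difference can be sketched on a tiny hand-written table. The markup and the id "demo_table" are hypothetical. The sketch uses find_all, the current spelling of the same method as findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a table of centered cells
html = """
<table id="demo_table">
  <tr><td><div align="center">2016-06-30</div></td></tr>
  <tr><td><div align="center">100.00</div></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find(id="demo_table")                   # look up by id
first = table.find("div", {"align": "center"})       # first match only
every = table.find_all("div", {"align": "center"})   # list of all matches

print(first.string)
print([d.string for d in every])
```

find returns a single tag (or None when nothing matches), while find_all always returns a list, so only the latter can be indexed or looped over safely.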

Finally, the data is inserted into MySQL. I recommend REPLACE INTO, which is very convenient here.
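
To illustrate why REPLACE INTO is convenient, here is a rough sketch that uses Python's built-in sqlite3 as a stand-in for MySQL, since SQLite also accepts REPLACE INTO. The table layout and its primary key are assumptions, and sqlite3 uses ? placeholders where the post's MySQL driver uses %s.

```python
import sqlite3

# SQLite stands in for MySQL here; the table layout is an assumption
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE TotalShares ("
    "ticker TEXT, chgDate TEXT, totalShares REAL, "
    "PRIMARY KEY (ticker, chgDate))"
)

sql = "REPLACE INTO TotalShares (ticker, chgDate, totalShares) VALUES (?, ?, ?)"
# Two writes with the same key: the second overwrites the first
cur.execute(sql, ("DEMO", "2016-06-30", 100.0))
cur.execute(sql, ("DEMO", "2016-06-30", 120.0))
conn.commit()

rows = cur.execute(
    "SELECT ticker, chgDate, totalShares FROM TotalShares"
).fetchall()
print(rows)
```

Because the row is replaced rather than duplicated, the scraper can be rerun for the same stock without checking first whether a record already exists. This relies on the table having a primary key or unique index over the columns that identify a record.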

import traceback

# cur is an open MySQL cursor created elsewhere (not shown in this post)
table = bsObj.find(id="StockStructureHistoryTable")
subtable = table.findAll(width="100%")
# The centered divs alternate: change date, then total shares
divs = subtable[0].findAll("div", {"align": "center"})
index = 0  # 0 = date cell, 1 = total-shares cell
try:
    for div in divs:
        if index == 0:
            chgDate = div.string
        if index == 1:
            # Drop the two-character unit suffix before converting to float
            totalShares = float(div.string[:-2])
            insertSql = "REPLACE into TotalShares (ticker,chgDate,totalShares) values (%s,%s,%s)"
            cur.execute(insertSql, [yahooCode[0], chgDate, totalShares])
            print("insert :" + str([code, chgDate, totalShares]))
        index = (index + 1) % 2
except Exception:
    print("wrong code info:" + code + ":" + traceback.format_exc())
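
The alternating date/value loop above can also be written by pairing even and odd cells. The following is only a sketch on made-up strings. It assumes, as the [:-2] slice in the code suggests, that each value ends in a two-character unit such as 万股. The helper name pair_cells is hypothetical.

```python
def pair_cells(cells):
    """Turn [date, shares, date, shares, ...] into (date, float) tuples.

    Assumes every shares string ends in a two-character unit suffix,
    matching the [:-2] slice used in the post.
    """
    dates = cells[0::2]    # even positions: change dates
    values = cells[1::2]   # odd positions: total shares with unit
    return [(d, float(v[:-2])) for d, v in zip(dates, values)]

# Made-up sample values, not real data
sample = ["2016-06-30", "100.50万股", "2015-12-31", "90.25万股"]
pairs = pair_cells(sample)
print(pairs)
```

With real pages, cells would be built from the scraped divs, for example [d.string for d in divs].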

Original post: http://www.cnblogs.com/hyfwin/p/5662129.html
