
Scraping with BeautifulSoup

Date: 2015-02-16 00:21:42


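Just before the New Year holiday an annoying requirement landed on my desk: fetch all the merchant data from a certain review site and store it in a database. As it turns out, this is fairly simple.

First, copy the city data from the review site. You could of course write a script to scrape it, but I did not bother. The scripts I used to scrape the merchant data are shared below.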

City list (one record per line, in the form key|city ID|Chinese name):

alashan|57|阿拉善
anshan|58|鞍山
anqing|117|安庆
anhuisuzhou|121|宿州
anyang|164|安阳
aba|255|阿坝
anshun|261|安顺
ali|288|阿里
ankang|297|安康
akesudiqu|332|阿克苏地区
aletaidiqu|338|阿勒泰地区
macau|342|澳门
alaer|389|阿拉尔
australia|2318|澳大利亚其他
auckland|2384|奥克兰
orlando|2401|奥兰多
agra|2410|阿格拉
antwerp|2422|安特卫普
amsterdam|2428|阿姆斯特丹
antalya|2445|安塔丽亚
ankara|2446|安卡拉
athens|2455|雅典
edinburgh|2465|爱丁堡
alexandria|2473|亚历山大
aswan|2474|亚斯文
ethiopia|2496|埃塞俄比亚
alishan|2503|阿里山
beijing|2|北京
baoding|29|保定
baotou|47|包头
bayannaoer|56|巴彦淖尔
benxi|60|本溪
baishan|75|白山
baicheng|77|白城
bengbu|112|蚌埠
bozhou|124|亳州
binzhou|158|滨州
beihai|228|北海
baise|233|百色
bazhong|253|巴中
bijiediqu|264|毕节地区
baoshan|270|保山
baoji|291|宝鸡
baiyin|302|白银
boertala|330|博尔塔拉
bayinguoleng|331|巴音郭楞
beitun|346|北屯
baisha|390|白沙
baoting|391|保亭
bangkok|2342|曼谷
pattaya|2344|芭堤雅
pai|2349|拜县
bali|2351|巴厘岛
bandung|2352|万隆
boracay|2355|长滩岛
palawan|2357|巴拉望岛
bohol|2358|薄荷岛
busan|2370|釜山
hokkaido|2375|北海道
brisbane|2381|布里斯班
paris|2388|巴黎
boston|2403|波士顿
brussels|2420|布鲁塞尔
bruges|2421|布鲁日
berlin|2423|柏林
prague|2431|布拉格
brno|2433|布尔诺
porto|2435|波尔图
bern|2443|伯尔尼
barcelona|2449|巴塞罗纳
budapest|2457|布达佩斯
pisa|2462|比萨
pretoria|2480|比勒陀利亚
buenosaires|2485|布宜诺斯艾利斯
brunei|2492|文莱
chengdu|8|成都
chongqing|9|重庆
chengde|31|承德
cangzhou|32|沧州
changzhi|38|长治
chifeng|49|赤峰
chaoyang|68|朝阳
changchun|70|长春
changzhou|93|常州
chuzhou|119|滁州
chaohu|122|巢湖
chizhou|125|池州
changde|197|常德
chenzhou|200|郴州
chaozhou|221|潮州
chuxiongzhou|272|楚雄州
changdudiqu|284|昌都地区
changjizhou|329|昌吉州
changsha|344|长沙
changjiang|392|昌江
chengmai|393|澄迈县
chongzuo|394|崇左
cixi|421|慈溪
cangnan|911|苍南
changle|981|长乐
cambodia|2316|柬埔寨其他
chiangmai|2345|清迈
chiangrai|2348|清莱
boracay|2355|长滩岛
cebu|2356|宿雾
okinawa|2377|冲绳
canberra|2382|堪培拉
cairns|2383|凯恩斯
christchurch|2387|基督城
cannes|2391|戛纳
chicago|2400|芝加哥
cologne|2425|科隆
creteisland|2453|克里特
cambridge|2466|剑桥
cairo|2472|开罗
casablanca|2477|卡萨布兰卡
capetown|2478|开普敦
cancun|2482|坎坤
cuzco|2487|库斯科
costarica|2500|哥斯达黎加
dalian|19|大连
datong|36|大同
dandong|61|丹东
daqing|84|大庆
daxinganling|91|大兴安岭
dongying|147|东营
dezhou|156|德州
dongguan|219|东莞
deyang|241|德阳
dazhou|251|达州
dali|277|大理
dehong|278|德宏
diqing|281|迪庆
dingxi|309|定西
danzhou|358|儋州
dingan|395|定安县
dongfang|396|东方
tokyo|2372|东京
osaka|2374|大阪
dijon|2394|第戎
tahiti|2405|大溪地
delhi|2407|新德里
toronto|2413|多伦多
turin|2463|都灵
dublin|2470|都柏林
eerduosi|51|鄂尔多斯
ezhou|181|鄂州
enshizhou|188|恩施州
edinburgh|2465|爱丁堡
ethiopia|2496|埃塞俄比亚
fuzhou|14|福州
fushun|59|抚顺
fuxin|64|阜新
fuyang|120|阜阳
jiangxifuzhou|143|抚州
foshan|208|佛山
fangchenggang|229|防城港
fenghua|422|奉化
fuqing|433|福清
fuyangfy|869|富阳
fuding|1031|福鼎
philippines|2327|菲律宾其他
fiji|2328|斐济
france|2331|法国其他
busan|2370|釜山
fujisan|2376|富士山
frankfurt|2426|法兰克福
florence|2459|佛罗伦萨
fukuoka|2505|福冈
guangzhou|4|广州
ganzhou|140|赣州
guilin|226|桂林
guigang|231|贵港
guangxiyulin|232|玉林
guangyuan|243|广元
guangan|250|广安
ganzi|256|甘孜
guiyang|258|贵阳
gannanzhou|312|甘南
guoluo|318|果洛
guyuan|324|固原
guowai|343|国外其他
kaohsiung|2337|高雄
goldcoast|2380|黄金海岸
gothenburg|2437|哥德堡
geneva|2440|日内瓦
costarica|2500|哥斯达黎加
hangzhou|3|杭州
haikou|23|海口
handan|27|邯郸
hengshui|34|衡水
huhehaote|46|呼和浩特
hulunbeier|52|呼伦贝尔
huludao|69|葫芦岛
haerbin|79|哈尔滨
hegang|82|鹤岗
heihe|89|黑河
huaian|96|淮安
huzhou|103|湖州
hefei|110|合肥
huainan|113|淮南
huaibei|115|淮北
huangshan|118|黄山
heze|159|菏泽
hebi|165|鹤壁
huangshi|177|黄石
huanggang|185|黄冈
hengyang|194|衡阳
huaihua|202|怀化
huizhou|213|惠州
heyuan|216|河源
hezhou|234|贺州
hechi|235|河池
honghe|273|红河
hanzhong|295|汉中
haidong|314|海东
haibei|315|海北
huangnan|316|黄南
haixi|320|海西
hamidiqu|328|哈密地区
hetiandiqu|335|和田地区
hongkong|341|香港
hainanzhou|411|海南州
korea|2314|韩国其他
hualien|2336|花莲
hochiminh|2366|胡志明市
hanoi|2367|河内
haiphong|2368|海防市
hokkaido|2375|北海道
hakone|2378|箱根
goldcoast|2380|黄金海岸
queenstown|2385|皇后镇
wellington|2386|惠灵顿
hawaii|2404|夏威夷
hamburg|2424|汉堡
thehague|2430|海牙
jinan|22|济南
jincheng|39|晋城
jinzhong|41|晋中
jinzhou|62|锦州
jilin|71|吉林
jixi|81|鸡西
jiamusi|86|佳木斯
jiaxing|102|嘉兴
jinhua|105|金华
jingdezhen|135|景德镇
jiujiang|137|九江
jian|141|吉安
jiangxiyichun|142|宜春
jiangxifuzhou|143|抚州
jining|150|济宁
jiaozuo|167|焦作
jinmen|182|荆门
jingzhou|184|荆州
jiangmen|209|江门
jieyang|222|揭阳
jiayuguan|300|嘉峪关
jinchang|301|金昌
jiuquan|307|酒泉
jiyuan|397|济源
jingjian|853|靖江
jinjiang|1009|晋江
japan|2315|日本其他
cambodia|2316|柬埔寨其他
jakarta|2353|雅加达
kualalumpur|2359|吉隆坡
phnompenh|2364|金边
jeju|2371|济州岛
kyoto|2373|京都
christchurch|2387|基督城
sanfrancisco|2396|旧金山
jaipur|2411|斋浦尔
cambridge|2466|剑桥
killarney|2471|基拉尼
johannesburg|2479|约翰内斯堡
jordan|2495|约旦
jamaica|2501|牙买加
kaifeng|161|开封
kunming|267|昆明
kelamayi|326|克拉玛依
kezilesu|333|克孜勒苏
kashidiqu|334|喀什地区
kunshan|416|昆山
korea|2314|韩国其他
kaohsiung|2337|高雄
kohphiphi|2347|皮皮岛
kohsamet|2350|沙美岛
kualalumpur|2359|吉隆坡
kyoto|2373|京都
canberra|2382|堪培拉
cairns|2383|凯恩斯
kenting|2406|垦丁
cologne|2425|科隆
karlovyvary|2432|卡罗维瓦立
creteisland|2453|克里特
killarney|2471|基拉尼
cairo|2472|开罗
casablanca|2477|卡萨布兰卡
capetown|2478|开普敦
cancun|2482|坎坤
cuzco|2487|库斯科
kenya|2497|肯尼亚
langfang|33|廊坊
linfen|44|临汾
lvliang|45|吕梁
liaoyang|65|辽阳
liaoyuan|73|辽源
lianyuangang|95|连云港
lishui|109|丽水
liuan|123|六安
longyan|132|龙岩
laiwu|154|莱芜
linyi|155|临沂
liaocheng|157|聊城
luoyang|162|洛阳
luohe|170|漯河
loudi|203|娄底
liuzhou|225|柳州
luzhou|240|泸州
leshan|246|乐山
liangshan|257|凉山
liupanshui|259|六盘水
lijiang|279|丽江
linchang|282|临沧
lasa|283|拉萨
linzhi|289|林芝地区
lanzhou|299|兰州
longnan|310|陇南
linxiazhou|311|临夏州
laibin|398|来宾
ledong|399|乐东
lingao|400|临高县
lingshui|401|陵水
liyang|867|溧阳
linan|868|临安
yueqing|905|乐清
liuhai|1015|龙海
liuyang|1376|浏阳
langkawi|2361|兰卡威
lyon|2393|里昂
losangeles|2397|洛杉矶
lasvegas|2398|拉斯维加斯
rotterdam|2429|鹿特丹
lisbon|2434|里斯本
luzern|2442|卢塞恩
rome|2458|罗马
london|2464|伦敦
liverpool|2469|利物浦
luxor|2475|卢克索
riodejaneiro|2483|里约热内卢
lima|2488|利马
laos|2490|老挝
lebanon|2491|黎巴嫩
mudanjiang|88|牡丹江
maanshan|114|马鞍山
maoming|211|茂名
meizhou|214|梅州
mianyang|242|绵阳
meishan|248|眉山
macau|342|澳门
malaysia|2312|马来西亚其他
melbourne|2322|墨尔本
maldives|2324|马尔代夫
mauritius|2329|毛里求斯
unitedstates|2332|美国其他
bangkok|2342|曼谷
manila|2354|马尼拉
marseille|2392|马赛
miami|2402|迈阿密
mumbai|2409|孟买
montreal|2412|蒙特娄
munich|2427|慕尼黑
malmo|2438|马尔默
pamukkale|2448|棉花堡
madrid|2450|马德里
majorca|2452|马略卡岛
mykonos|2456|米科诺斯
manchester|2467|曼彻斯特
marrakech|2476|马拉喀什
mexicocity|2481|墨西哥城
machupicchu|2489|马丘比丘
myanmar|2493|缅甸
madagascar|2498|马达加斯加
nanjing|5|南京
ningbo|11|宁波
nantong|94|南通
nanping|131|南平
ningde|133|宁德
nanchang|134|南昌
nanyang|172|南阳
nanning|224|南宁
neijiang|245|内江
nanchong|247|南充
nujiang|280|怒江
naqu|287|那曲
ninghai|2308|宁海
newzealand|2319|新西兰其他
nepal|2333|尼泊尔
newtaipei|2340|新北
nice|2390|尼斯
newyork|2395|纽约
niagarafalls|2414|尼亚加拉瀑布
naples|2461|那不勒斯
oxford|2468|牛津
riyuetan|2504|南投
osaka|2374|大阪
okinawa|2377|冲绳
orlando|2401|奥兰多
ottawa|2416|渥太华
oxford|2468|牛津
panjin|66|盘锦
putian|127|莆田
pingxiang|136|萍乡
pingdingshan|163|平顶山
puyang|168|濮阳
panzhihua|239|攀枝花
puer|275|普洱
pingliang|306|平凉
pingyang|908|平阳
philippines|2327|菲律宾其他
phuket|2343|普吉岛
pattaya|2344|芭堤雅
kohphiphi|2347|皮皮岛
pai|2349|拜县
palawan|2357|巴拉望岛
penang|2360|槟城
phnompenh|2364|金边
paris|2388|巴黎
provence|2389|普罗旺斯
prague|2431|布拉格
porto|2435|波尔图
pamukkale|2448|棉花堡
pisa|2462|比萨
pretoria|2480|比勒陀利亚
palau|2502|帕劳
qingdao|21|青岛
qinghuangdao|26|秦皇岛
qiqihaer|80|齐齐哈尔
qitaihe|87|七台河
quzhou|106|衢州
quanzhou|129|泉州
qianjiang|190|潜江
qingyuan|218|清远
qinzhou|230|钦州
qianxinan|263|黔西南
qiandongnan|265|黔东南
qiannan|266|黔南
qujing|268|曲靖
qingyang|308|庆阳
qionghai|402|琼海
qiongzhong|403|琼中
chiangmai|2345|清迈
chiangrai|2348|清莱
queenstown|2385|皇后镇
rizhao|153|日照
rikazediqu|286|日喀则地区
ruian|904|瑞安
rongcheng|1161|荣成
japan|2315|日本其他
rotterdam|2429|鹿特丹
geneva|2440|日内瓦
rome|2458|罗马
riodejaneiro|2483|里约热内卢
riyuetan|2504|南投
shanghai|1|上海
suzhou|6|苏州
shenzhen|7|深圳
shenyang|18|沈阳
shijiazhuang|24|石家庄
shuozhou|40|朔州
siping|72|四平
songyuan|76|松原
shuangyashan|83|双鸭山
suihua|90|绥化
suqian|100|宿迁
shaoxing|104|绍兴
anhuisuzhou|121|宿州
sanming|128|三明
shangrao|144|上饶
sanmenxia|171|三门峡
shangqiu|173|商丘
shiyan|178|十堰
suizhou|187|随州
shaoyang|195|邵阳
shaoguan|205|韶关
shantou|207|汕头
shanwei|215|汕尾
suining|244|遂宁
shannan|285|山南
shangluo|298|商洛
shizuishan|322|石嘴山
shihezi|339|石河子
sanya|345|三亚
shennongjia|404|神农架林区
shishi|1008|石狮
sansha|2310|三沙
singapore|2311|新加坡
saipan|2326|塞班岛
srilanka|2330|斯里兰卡
seychelles|2334|塞舌尔
samui|2346|苏梅岛
kohsamet|2350|沙美岛
cebu|2356|宿雾
sabah|2362|沙巴
siemreap|2363|暹粒
sihanoukville|2365|西哈努克
seoul|2369|首尔
sydney|2379|悉尼
sanfrancisco|2396|旧金山
seattle|2399|西雅图
salzburg|2418|萨尔兹堡
stockholm|2436|斯德哥尔摩
zurich|2439|苏黎世
seville|2451|塞维利亚
santorini|2454|圣托里尼
saopaulo|2484|圣保罗
santiago|2486|圣地亚哥
tianjin|10|天津
tangshan|25|唐山
taiyuan|35|太原
tongliao|50|通辽
tieling|67|铁岭
tonghua|74|通化
taizhou|99|泰州
zhejiangtaizhou|108|台州
tongling|116|铜陵
taian|151|泰安
tianmen|191|天门
tongrendiqu|262|铜仁地区
tongchuan|290|铜川
tianshui|303|天水
tulufandiqu|327|吐鲁番地区
tachengdiqu|337|塔城地区
taiwan|340|台湾其他
tumushuke|405|图木舒克
tunchang|406|屯昌县
thailand|2313|泰国其他
taipei|2335|台北
tainan|2338|台南
taoyuan|2339|桃园
taichung|2341|台中
tokyo|2372|东京
tahiti|2405|大溪地
toronto|2413|多伦多
thehague|2430|海牙
turin|2463|都灵
tanzania|2499|坦桑尼亚
vietnam|2317|越南其他
varanasi|2408|瓦拉纳西
vancouver|2415|温哥华
vienna|2417|维也纳
venice|2460|威尼斯
wuxi|13|无锡
wuhan|16|武汉
wuhai|48|乌海
wulanchabu|55|乌兰察布
wenzhou|101|温州
wuhu|111|芜湖
weifang|149|潍坊
weihai|152|威海
wuzhou|227|梧州
wenshan|274|文山州
weinan|293|渭南
wuwei|304|武威
wuzhong|323|吴忠
wulumuqi|325|乌鲁木齐
wanning|407|万宁
wenchang|408|文昌
wujiaqu|409|五家渠
wuzhishan|410|五指山
wendeng|1163|文登
bandung|2352|万隆
wellington|2386|惠灵顿
varanasi|2408|瓦拉纳西
vancouver|2415|温哥华
vienna|2417|维也纳
venice|2460|威尼斯
brunei|2492|文莱
xiamen|15|厦门
xian|17|西安
xingtai|28|邢台
xinzhou|43|忻州
xingan|53|兴安盟
xilinguole|54|锡林郭勒
xuzhou|92|徐州
xuancheng|126|宣城
xinyu|138|新余
xinxiang|166|新乡
xuchang|169|许昌
xinyang|174|信阳
xiangyang|180|襄阳
xiaogan|183|孝感
xianning|186|咸宁
xiantao|189|仙桃
xiangtan|193|湘潭
xiangxi|204|湘西
xishuangbanna|276|西双版纳
xianyang|292|咸阳
xining|313|西宁
hongkong|341|香港
singapore|2311|新加坡
newzealand|2319|新西兰其他
newtaipei|2340|新北
sihanoukville|2365|西哈努克
hakone|2378|箱根
sydney|2379|悉尼
seattle|2399|西雅图
hawaii|2404|夏威夷
delhi|2407|新德里
yangzhou|12|扬州
yangquan|37|阳泉
yuncheng|42|运城
yingkou|63|营口
yanbian|78|延边
yichun|85|伊春
yancheng|97|盐城
yingtan|139|鹰潭
jiangxiyichun|142|宜春
yantai|148|烟台
yichang|179|宜昌
yueyang|196|岳阳
yiyang|199|益阳
yongzhou|201|永州
yangjiang|217|阳江
yunfu|223|云浮
guangxiyulin|232|玉林
yibin|249|宜宾
yaan|252|雅安
yuxi|269|玉溪
yanan|294|延安
yulin|296|榆林
yushu|319|玉树
yinchuan|321|银川
yili|336|伊犁
yiwu|385|义乌
yuyao|423|余姚
yongkang|893|永康
yueqing|905|乐清
vietnam|2317|越南其他
indonesia|2325|印度尼西亚其他
jakarta|2353|雅加达
innsbruck|2419|因斯布鲁克
interlaken|2441|因特拉肯
istanbul|2444|伊斯坦布尔
izmir|2447|伊兹密尔
athens|2455|雅典
alexandria|2473|亚历山大
aswan|2474|亚斯文
johannesburg|2479|约翰内斯堡
israel|2494|以色列
jordan|2495|约旦
jamaica|2501|牙买加
chongqing|9|重庆
zhangjiakou|30|张家口
zhengjiang|98|镇江
zhoushan|107|舟山
zhejiangtaizhou|108|台州
zhangzhou|130|漳州
zibo|145|淄博
zaozhuang|146|枣庄
zhengzhou|160|郑州
zhoukou|175|周口
zhumadian|176|驻马店
zhuzhou|192|株洲
zhangjiajie|198|张家界
zhuhai|206|珠海
zhanjiang|210|湛江
zhaoqing|212|肇庆
zhongshan|220|中山
zigong|238|自贡
ziyang|254|资阳
zunyi|260|遵义
zhaotong|271|昭通
zhangye|305|张掖
zhongwei|351|中卫
zhuji|883|诸暨
zhangqiu|1118|章丘
chicago|2400|芝加哥
jaipur|2411|斋浦尔
zurich|2439|苏黎世
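As an illustration of how the city list above can be consumed, here is a minimal sketch. It is written for Python 3, unlike the Python 2 scripts in this post, and the function name parse_cities and the sample records are chosen only for illustration. It splits each key|city ID|Chinese name record and keeps the first entry for each city ID, since the list repeats some cities under more than one initial.

```python
def parse_cities(lines):
    """Return {city_id: (key, name)} from 'key|city_id|name' records."""
    cities = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split("|")
        if len(parts) != 3:
            continue  # skip malformed records instead of failing
        key, city_id, name = parts
        # setdefault keeps the first occurrence of a repeated city ID
        cities.setdefault(int(city_id), (key, name))
    return cities


# Sample records copied from the list above, with one deliberate repeat.
sample = ["beijing|2|北京", "macau|342|澳门", "macau|342|澳门", ""]
cities = parse_cities(sample)
print(sorted(cities))
```

Deduplicating by city ID means each city is crawled once, even though the list file contains the same record in several places.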

Scraping the list pages:

# -*- coding: utf-8 -*- 
import codecs
import traceback
import urllib2
import re
from bs4 import BeautifulSoup
import sys
import MySQLdb
import string
import json
import time

URL_LIST = "http://www.dianping.com/search/category/%s/%s/p%s"  # list pages: city ID, category ID, page number
RUL_DETAIL = "http://www.dianping.com/shop/%s"  # detail page: business ID

f1 = open("f1.log", "a", 1)
f2 = open("f2.log", "a", 1)

reload(sys)
sys.setdefaultencoding("utf-8")
type = sys.getfilesystemencoding()

def deal(city_id, category_id, p):
    url = URL_LIST % (city_id, category_id, p)
    print url
    opener = urllib2.build_opener()
    # Send browser-like headers with every request.
    opener.addheaders = [
        ("User-agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"),
        ("Accept", "application/json, text/javascript"),
        ("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6"),
    ]
    urlopen = opener.open(url, timeout=100)
    rsp = urlopen.read()
    
    if "404" in rsp:
        return 404 
    print "=====================start=========================="
    soup = BeautifulSoup(rsp)
    soup = soup.find("div", { "id" : "shop-all-list" })
    # print soup
    row = soup.find_all("li")
    for so in row:
        # print so
        get_business(so)
    print ""

# The %s markers are MySQLdb placeholders: the values are passed to execute() separately,
# so the driver escapes them and a quote inside a shop name cannot break the statement.
INSERT_BUSINESS = "INSERT INTO tb_dianping_business_zx (businessID,NAME,Url,BranchName,Address,Regions,Categories,City,AvgRating,AvgPrice,ReviewCount,PhotoUrl,SPhotoUrl,HasCoupon,HasDeal,DealCount,Deals) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s);"
# Replace "ip" and the xxx values with the real connection settings.
db_interest = MySQLdb.connect(host="ip", port=3306, user="xxx", passwd="xxx", db="db_xxx", charset="utf8")
cur_interest = db_interest.cursor()

def save(business_id, Name, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals):
    params = (business_id, Name, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
    print "========================sql=========================="
    print INSERT_BUSINESS, params
    try:
        cur_interest.execute(INSERT_BUSINESS, params)
        db_interest.commit()
    except MySQLdb.IntegrityError:
        # The row already exists; undo and move on.
        db_interest.rollback()
        print "*********************** duplicate business_id: %s" % business_id
        
    print

def get_business(soup):
#     print soup
    
    business_id = get_business_id(soup)
    NAME = get_business_name(soup)
    Url = RUL_DETAIL % business_id
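    BranchName = ""
    if "(" in NAME:
        # The branch name is the text between the parentheses in the shop title.
        BranchName = NAME[NAME.find("(") + 1:NAME.find(")")]
    Address = get_Address(soup)
    Regions = get_Regions(soup)
    Categories = get_Categories(soup)
    City = "北京"  # placeholder only; the detail script below overwrites City from the page breadcrumb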
    AvgRating = get_AvgRating(soup)
    AvgPrice = get_AvgPrice(soup)
    ReviewCount = get_ReviewCount(soup)
    PhotoUrl = get_PhotoUrl(soup)
    SPhotoUrl = PhotoUrl;
    DealCount = get_DealCount(soup)
    HasCoupon = DealCount > 0 and 1 or 0
    HasDeal = HasCoupon
    Deals = get_Deals(soup)
    
    print business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, DealCount, Deals
    save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)

def get_business_id(soup):
    return soup.find("div", {"class":"tit"}).find("a")["href"].strip().replace("/shop/", "")
def get_business_name(soup):
    return soup.find("div", {"class":"tit"}).find("a")["title"].strip()
def get_Address(soup):
    if soup.find("span", {"class":"addr"}):
        return soup.find("span", {"class":"addr"}).get_text().strip()
    else:
        return ""
def get_Regions(soup):
    if soup.find("div", {"class":"tag-addr"}):
        return soup.find("div", {"class":"tag-addr"}).find_all("a")[0].find("span", {"class":"tag"}).get_text().strip()
    else:
        return ""
def get_Categories(soup):
    if soup.find("div", {"class":"tag-addr"}):
        return soup.find("div", {"class":"tag-addr"}).find_all("a")[1].find("span", {"class":"tag"}).get_text().strip()
    else:
        return ""
def get_AvgRating(soup):
    return soup.find("span", {"class":"sml-rank-stars"})["class"][1].strip().replace("sml-str", "")
def get_AvgPrice(soup):
    b = soup.find("a", {"class":"mean-price"}).find("b")
    if b:
        # Keep only digits and the decimal point, dropping any currency sign around the number.
        return re.sub(r"[^\d.]", "", b.get_text())
    return 0

def get_ReviewCount(soup):
    b = soup.find("a", {"class":"review-num"})
    if b:
        return soup.find("a", {"class":"review-num"}).find("b").get_text().strip()
    return 0
        
def get_PhotoUrl(soup):
    return soup.find("div", {"class":"pic"}).find("img")["data-src"].strip()
def get_DealCount(soup):
    soup = soup.find("div", {"class":"si-deal"})
    if soup :
        return len(soup.find_all("a", {"class":"J_dinfo"}))  # .count("a", {"class":"J_dinfo"})
    return 0
def get_Deals(soup):
    soup = soup.find("div", {"class":"si-deal"})
    if soup :
        rows = soup.find_all("a", {"class":"J_dinfo"})
        # Join the deal IDs into one comma-separated string.
        return ",".join(so["data-deal-id"] for so in rows)
    return ""

if __name__ == "__main__":
    cities = []
    # Category IDs to crawl. The second assignment overrides the full list,
    # presumably to resume an interrupted run.
    cas = [10, 20, 25, 30, 45, 50, 60, 70]
    cas = [30, 45, 50, 60, 70]
    ct = codecs.open("cities", "r", "utf-8")
    lines = ct.readlines()
    ct.close()
    for word in lines:
        # Each line is "key|city_id|name"; keep only the city_id field.
        word = word[word.find("|") + 1:]
        word = word[0:word.find("|")]
        cities.append(word.strip())
    
    for city in cities:
        for ca in cas:
            p = 0
            while p <= 50:
                try:
                    p = p + 1
                    print "deal(%s,%s,%s)" % (city, ca, p)
                    code = deal(city, ca, p)
#                     if 404==code:
#                         break
#                      2 25 12
                except Exception:
                    traceback.print_exc()
#                     print "*********************** duplicate business_id: %s" % sql
                print "Sleeping for 1 second ... "
                time.sleep(1)
            
#     f = codecs.open("li", "r", "utf-8")
#     soup = BeautifulSoup(f.read())
#     get_business(soup)
    
    # save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
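To make the extraction step in the list-page script easier to follow, here is a minimal sketch in Python 3 (the script itself is Python 2). It runs a few of the same BeautifulSoup lookups against a small inline HTML sample. The sample markup is hypothetical: it only mirrors the class names the script searches for and is not a copy of the real page. The helper name parse_item is also made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class names used by the script.
SAMPLE = """
<div id="shop-all-list"><ul>
<li>
  <div class="pic"><img data-src="http://example.com/a.jpg"></div>
  <div class="tit"><a href="/shop/123" title="Sample Shop(East Branch)">Sample Shop</a></div>
  <span class="sml-rank-stars sml-str40"></span>
  <a class="review-num"><b>57</b></a>
  <span class="addr">1 Example Road</span>
</li>
</ul></div>
"""


def parse_item(li):
    """Extract a few fields from one <li> of the shop list."""
    link = li.find("div", {"class": "tit"}).find("a")
    stars = li.find("span", {"class": "sml-rank-stars"})
    addr = li.find("span", {"class": "addr"})
    return {
        "id": link["href"].strip().replace("/shop/", ""),
        "name": link["title"].strip(),
        # "class" is a list of class names; the second one carries the rating.
        "rating": stars["class"][1].replace("sml-str", "") if stars else "",
        "address": addr.get_text().strip() if addr else "",
    }


soup = BeautifulSoup(SAMPLE, "html.parser")
shop_list = soup.find("div", {"id": "shop-all-list"})
items = [parse_item(li) for li in shop_list.find_all("li")]
print(items)
```

Checking each optional element before reading from it, as done for the rating and address here, keeps one incomplete list entry from aborting the whole page.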

Scraping the detail pages:

# -*- coding: utf-8 -*- 
import codecs
import traceback
import urllib2
import re
from bs4 import BeautifulSoup
import sys
import MySQLdb
import string
import json
import time

URL_LIST = "http://www.dianping.com/search/category/%s/%s/p%s"  # list pages: city ID, category ID, page number
RUL_DETAIL = "http://www.dianping.com/shop/%s"  # detail page: business ID

f1 = open("f1.log", "a", 1)
f2 = open("f2.log", "a", 1)

reload(sys)
sys.setdefaultencoding("utf-8")
type = sys.getfilesystemencoding()

def deal(businessID, url):
    print url
    opener = urllib2.build_opener()
    # Send browser-like headers with every request.
    opener.addheaders = [
        ("User-agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"),
        ("Accept", "application/json, text/javascript"),
        ("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6"),
    ]
    urlopen = opener.open(url, timeout=30)
    rsp = urlopen.read()
    print "=====================start=========================="
    soup = BeautifulSoup(rsp)
    breadcrumb = soup.find("div", {"class":"breadcrumb"})
    basic_info = soup.find("div", {"id":"basic-info"})
    sales = soup.find("div", {"id":"sales"})
    aside = soup.find("div", {"id":"aside"})
    if aside:
        # get_point() reads the coordinates from the <script> kept here.
        aside_script = aside.find("script")
    else:
        aside_script = ""
    print "----------------"
    # Re-parse only the fragments that the extractors below need.
    fragment = "%s%s%s%s" % (breadcrumb, basic_info, sales, aside_script)
    print fragment
    soup = BeautifulSoup(fragment)
#     print soup
    get_business(businessID, soup)

# The %s markers in UPDATE_BUSINESS are MySQLdb placeholders: the values are passed to
# execute() separately, so the driver escapes quotes inside addresses and deal names.
UPDATE_BUSINESS = "UPDATE tb_dianping_business_zx SET Address=%s,Regions=%s,Categories=%s,City=%s,lat=%s,lng=%s,Deals=%s where businessID=%s"
SELECT_BUSINESS = "SELECT businessID,url FROM tb_dianping_business_zx WHERE lat=0 and businessID > %d order by businessID asc LIMIT 100 "
# Replace the xxx values with the real connection settings.
db_interest = MySQLdb.connect(host="xxx", port=xxx, user="xxx", passwd="xxx", db="db_xxx", charset="utf8")
cur_interest = db_interest.cursor()

def save(Address, Regions, Categories, City, lat, lng, Deals, business_id):
    params = (Address, Regions, Categories, City, lat, lng, Deals, business_id)
    try:
        print UPDATE_BUSINESS, params
        cur_interest.execute(UPDATE_BUSINESS, params)
        db_interest.commit()
    except MySQLdb.IntegrityError:
        db_interest.rollback()
        print "*********************** duplicate business_id: %s" % business_id
    print

def fetchall(cur, sql):
    cur.execute(sql)
    return cur.fetchall()

def fetchone(cur, sql):
    cur.execute(sql)
    return cur.fetchone()

def get_business(business_id, soup):
    
    business_id = business_id
    cs = get_Regions_Categories(soup)
    City = cs[0]
    Regions = cs[1]
    Categories = cs[2]
    
#     print %s , %s , %s % (City, Regions, Categories)
    Address = get_Address(soup)
    point = get_point(soup)
    lat = point[0]
    lng = point[1]
    Deals = get_Deals(soup)
#     print Address, Regions, Categories, City, lat, lng, Deals, business_id
    save(Address, Regions, Categories, City, lat, lng, Deals, business_id)

def get_Regions_Categories(soup):
    rows = soup.find("div", {"class":"breadcrumb"}).find_all("a")
    City = ""
    RegionsCs = []
    CategoriesCs = []
    i = 0
    length = len(rows)
    for row in rows :
        if i == 0:
            City = row.get_text().strip()
        elif length % 2 == 0 and i < length / 2:
            RegionsCs.append(row.get_text().strip())
        elif length % 2 == 0 and i >= length / 2:
            CategoriesCs.append(row.get_text().strip())
        elif length % 2 == 1 and i < length / 2 + 1:
            RegionsCs.append(row.get_text().strip())
        else:
            CategoriesCs.append(row.get_text().strip())
        i = i + 1
    
    # Serialise both lists as JSON arrays; json.dumps also escapes quotes inside names.
    Regions = json.dumps(RegionsCs, ensure_ascii=False, separators=(",", ":"))
    Categories = json.dumps(CategoriesCs, ensure_ascii=False, separators=(",", ":"))
    
    return City, Regions, Categories
            
        
def get_Address(soup):
    # District name followed by the street address.
    return "%s %s" % (soup.find("div", {"class":"address"}).find("a").find("span").get_text().strip(), soup.find("div", {"itemprop":"street-address"}).find("span", {"class":"item"}).get_text().strip())
def get_point(soup):
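    # The script text contains a fragment of the form ({lng:<number>,lat:<number>...}).
    text = soup.find("script").get_text().strip()
    text = text[text.find("({lng:") + 6:]
    lng = text[:text.find(",lat:")]
    lat = text[text.find(",lat:") + 5:text.find("}")]
    
    # Store the coordinates as integers in millionths of a degree.
    la = int(float(lat) * 1000000)
    ln = int(float(lng) * 1000000)
    return la, ln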
    
def get_Deals(soup):
    soup = soup.find("div", {"id":"sales"})
    if soup:
        Deals = []
        rows = soup.find_all("div", {"class":"item"})
        for row in rows:
            if row.find("span", {"class":"price"}):
                deal = {}
                title = row.find("p", {"class":"title"})
                url = ""
                if title:
                    deal["name"] = title.get_text().strip()
                    url = row.find("a", {"class":"block-link"})["href"]
                else:
                    deal["name"] = row.get_text().strip()
                    url = row["href"]
                deal["url"] = url
                deal["id"] = url.replace("http://t.dianping.com/deal/", "")
                deal["h5_url"] = url
                Deals.append(deal)
        
        # Serialise the collected deals as a JSON array; json.dumps escapes quotes inside names.
        return json.dumps(Deals, ensure_ascii=False)
    return ""

if __name__ == "__main__":
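#     deal(11566327, "http://www.dianping.com/shop/11566327")
    maxId = 0
    SELECT_BUSINESS_NEXT = ""
    while True:
        try:
            # Fetch the next batch of shops that still have no coordinates.
            SELECT_BUSINESS_NEXT = SELECT_BUSINESS % maxId
            print SELECT_BUSINESS_NEXT
            rows = fetchall(cur_interest, SELECT_BUSINESS_NEXT)
        except Exception:
            traceback.print_exc()
            rows = []
        for row in rows:
            print row
            # Advance the cursor first, so a shop that fails to parse is skipped
            # in this run instead of being retried forever.
            maxId = row[0]
            try:
                deal(row[0], row[1])
            except Exception:
                traceback.print_exc()
        print "Sleeping for 5 seconds ... "
        time.sleep(5)

#     f = codecs.open("detail", "r", "utf-8")
#     soup = BeautifulSoup(f.read())
#     get_business(11566327, soup)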
    
    # save(business_id, NAME, Url, BranchName, Address, Regions, Categories, City, AvgRating, AvgPrice, ReviewCount, PhotoUrl, SPhotoUrl, HasCoupon, HasDeal, DealCount, Deals)
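The coordinate extraction and the JSON strings built by the detail script can also be sketched with the standard library alone. The following Python 3 example is an illustration, not the script's own code: the names parse_point and to_json_array are hypothetical, the regular expression assumes the ({lng:...,lat:...}) fragment that get_point() searches for, and it rounds to the nearest millionth of a degree where the script truncates with int().

```python
import json
import re

# Assumed shape of the fragment, based on the markers get_point() looks for.
POINT_RE = re.compile(r"\(\{lng:(-?[\d.]+),lat:(-?[\d.]+)")


def parse_point(script_text):
    """Return (lat, lng) in millionths of a degree, or None if not found."""
    match = POINT_RE.search(script_text)
    if match is None:
        return None
    lng = float(match.group(1))
    lat = float(match.group(2))
    return int(round(lat * 1000000)), int(round(lng * 1000000))


def to_json_array(values):
    """Serialise a list as a compact JSON array, escaping quotes safely."""
    return json.dumps(values, ensure_ascii=False, separators=(",", ":"))


# The script text below is a made-up example of the assumed format.
print(parse_point("map.init({lng:116.5,lat:39.25});"))
print(to_json_array(["北京", 'He said "hi"']))
```

Returning None when the fragment is missing lets the caller decide whether to skip the shop, rather than failing on a float conversion of unrelated text.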

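Note: keep the crawl rate as low as you reasonably can. Well, it is already the 16th, so I am off on my long holiday. Once you have all finished your overtime, head home and enjoy the New Year too.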
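One simple way to keep the crawl rate down is to pause for a random interval between requests. The following is a minimal Python 3 sketch of that idea. The generator name throttled and the default delays are assumptions chosen for illustration; the scripts above use a fixed time.sleep instead.

```python
import random
import time


def throttled(items, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield items one by one, pausing a random interval between them."""
    first = True
    for item in items:
        if not first:
            # Randomised pauses avoid a perfectly regular request pattern.
            sleep(random.uniform(min_delay, max_delay))
        first = False
        yield item


# Demo with a recording stand-in for time.sleep, so it finishes instantly.
pauses = []
pages = list(throttled(range(1, 4), sleep=pauses.append))
print(pages, len(pauses))
```

In a real crawl the loop body would fetch one page per item, and the default sleep argument would perform the actual waiting.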

Parsing library used: http://www.crummy.com/software/BeautifulSoup/bs4/doc/


Original article: http://www.cnblogs.com/super-d2/p/4293611.html
