码迷,mamicode.com
首页 > 编程语言 > 详细

python 爬虫爬取腾讯新闻科技类的企鹅智酷系列(1)

时间:2015-05-19 10:44:07      阅读:145      评论:0      收藏:0      [点我收藏+]

标签:python   爬虫   utf-8   urllib   

废话不多说,直接贴代码,主要采用BeautifulSoup写的

# -*- coding: utf-8 -*-

"""

Created on Mon May 18 19:12:06 2015


@author: Administrator

"""


import urllib

import os

from bs4 import BeautifulSoup

import sys

reload(sys)

sys.setdefaultencoding("utf-8")


i = 0

j = 0

list_a = []


def gettext(href):

global j,list_a

page = urllib.urlopen(href).read()

soup = BeautifulSoup(page,from_encoding="gb18030")

div = soup.find_all("div",class_="content")

p_text = div[0].find_all("p")

for p in p_text:

fp = file("%s.txt" % list_a[j],"a")

fp.write(‘ ‘)

fp.write(p.get_text())

fp.write(" \n")

j+=1


def gethref(url): #获得所有链接

global i,list_a

fp = file("AllTitle.txt","w+")

page = urllib.urlopen(url).read()

soup = BeautifulSoup(page,from_encoding="gb18030")

ul = soup.find_all("ul",class_="row1")

li = ul[0].find_all("li")

for lia in li:

list_a.append(("%s、" % (i+1))+lia.h3.get_text())

href = lia.a.get(‘href‘)

# 将标题简介和链接有规则的写入文件中

fp.write("%s、" % (i+1))

i+=1

fp.write("标题:")

fp.write(lia.h3.get_text())

fp.write("\n 简介:")

fp.write(lia.p.get_text())

fp.write("\n 链接:")

fp.write(lia.a.get("href"))

fp.write("\n")

gettext(href)


if "__main__"==__name__:

url ="http://re.qq.com/biznext/zkht.htm"

gethref(url)

print "All Is OK!"


python 爬虫爬取腾讯新闻科技类的企鹅智酷系列(1)

标签:python   爬虫   utf-8   urllib   

原文地址:http://blog.csdn.net/gamer_gyt/article/details/45833271

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!