
Python 2.7 urllib2 crawler

Posted: 2018-06-17 23:25:16


# -*- coding: utf-8 -*-

import urllib2
import cookielib
import random
import re
from bs4 import BeautifulSoup
import datetime

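# Today's date as YYYY-MM-DD, meant for filtering the parsed entries
dax = datetime.datetime.now().strftime('%Y-%m-%d')
print(dax)

# Target page (the URL is truncated in the original post)
url = 'http://ww=singlemessage&isappinstalled=0'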

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
request = urllib2.Request(url)
# Pool of User-Agent strings; one is picked at random for the request
headers = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
]

hds = random.choice(headers)
# print(hds)
request.add_header('User-Agent', hds)
#response = urllib2.urlopen("http://www.hn1m=singlemessage&isappinstalled=0")
response = urllib2.urlopen(request)
cont = response.read()
#print(cont)

# Parse the raw bytes as UTF-8 HTML
soup = BeautifulSoup(cont, 'html.parser', from_encoding='utf-8')
# print(soup)
# listyj = soup.find_all('dl')
# for listyjx in listyj:
#     print(listyjx.name, listyjx.attrs, listyjx.get_text())
#     # if dax in listyjx.get_text():
#     #     print(listyjx)
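
The listing above targets Python 2.7, where urllib2 and cookielib exist. Below is a minimal sketch of the same idea ported to Python 3, where those modules became urllib.request and http.cookiejar. The function names (build_opener_with_cookies, build_request, entries_for_date, crawl) are hypothetical helpers introduced here for illustration and are not part of the original post. The date filter assumes the commented-out loop was meant to keep only the `dl` entries whose text contains today's date, and it requires the third-party beautifulsoup4 package.

```python
import datetime
import http.cookiejar
import random
import urllib.request

# A shortened User-Agent pool taken from the listing above.
USER_AGENTS = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
]


def build_opener_with_cookies():
    """Return (opener, cookie_jar); the jar keeps cookies across requests."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    return opener, cookie_jar


def build_request(url, user_agents=USER_AGENTS):
    """Build a Request whose User-Agent is picked at random from the pool."""
    request = urllib.request.Request(url)
    request.add_header('User-Agent', random.choice(user_agents))
    return request


def entries_for_date(html, date_str):
    """Return the text of every <dl> element that mentions date_str.

    Assumption: this is what the commented-out loop was aiming for.
    bs4 is imported here so the helpers above work without it installed.
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
    return [dl.get_text() for dl in soup.find_all('dl')
            if date_str in dl.get_text()]


def crawl(url):
    """Fetch url (as bytes) and return the <dl> entries for today's date."""
    today = datetime.datetime.now().strftime('%Y-%m-%d')
    opener, _ = build_opener_with_cookies()
    html = opener.open(build_request(url), timeout=10).read()
    return entries_for_date(html, today)
```

Calling crawl() with the page URL returns a list of strings, one per matching entry. Keeping the request construction separate from the network call makes the header logic easy to check without touching the network.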


Original article: https://www.cnblogs.com/ruiy/p/9193940.html
