58.com housing listings: parsing the listing titles
from lxml import etree
import requests

url = 'https://haikou.58.com/chuzu/j2/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36'
}

parser = etree.HTMLParser(encoding='utf-8')
# Pass the headers so the request carries the mobile User-Agent
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text, parser=parser)

# One <li> per listing
lis = tree.xpath('//ul[@class="house-list"]/li')
for li_item in lis:
    # Note the leading "./": the path is evaluated relative to this <li>
    res = li_item.xpath('.//h2/a/text()')
    if res:  # skip items that have no title link
        print(res[0].strip())
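The comment about the leading `./` is easy to overlook, so here is a minimal offline sketch of why it matters. The markup in `SAMPLE_HTML` is a made-up, simplified stand-in for the real page, not the actual 58.com structure.

```python
from lxml import etree

# Hypothetical, simplified markup standing in for the real listing page.
SAMPLE_HTML = """
<ul class="house-list">
  <li><h2><a href="#"> Listing A </a></h2></li>
  <li><h2><a href="#"> Listing B </a></h2></li>
</ul>
"""

tree = etree.HTML(SAMPLE_HTML)
items = tree.xpath('//ul[@class="house-list"]/li')

# Relative path: evaluated from each <li>, so each item yields its own title.
relative = [[t.strip() for t in li.xpath('.//h2/a/text()')] for li in items]

# Absolute path: "//" restarts from the document root, so every <li>
# yields the titles of all listings.
absolute = [[t.strip() for t in li.xpath('//h2/a/text()')] for li in items]

print(relative)
print(absolute)
```

Without the dot, every iteration of the loop would print the first listing's title, because `res[0]` would always be the first match in the whole document.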
Scraping images from pic.netbian.com (彼岸图网)
from urllib.parse import urljoin

from lxml import etree
import requests

url = 'http://pic.netbian.com/4kfengjing'
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36'
}

parser = etree.HTMLParser(encoding='utf-8')
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text, parser=parser)

# src attribute of every thumbnail on the listing page
img_srcs = tree.xpath('//div[@class="slist"]//li/a/img/@src')
for count, src in enumerate(img_srcs):
    # urljoin avoids a doubled slash when src already starts with "/"
    full_url = urljoin('http://pic.netbian.com/', src)
    img_data = requests.get(url=full_url, headers=headers).content
    # 图片 means "image"; files are saved as 图片0.jpg, 图片1.jpg, ...
    with open('图片%s.jpg' % count, 'wb') as f:
        f.write(img_data)
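Building the full image URL is the step most likely to go wrong, since the `src` values may or may not start with a slash. A small sketch using the standard library's `urljoin`; the helper name `build_image_url` and the example paths are illustrative assumptions, not values taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = 'http://pic.netbian.com/'


def build_image_url(src):
    """Resolve a possibly relative src attribute against the site root."""
    return urljoin(BASE_URL, src)


# Hypothetical src values, with and without a leading slash.
with_slash = build_image_url('/uploads/demo/example.jpg')
without_slash = build_image_url('uploads/demo/example.jpg')

# Plain string formatting keeps both slashes.
naive = '%s%s' % (BASE_URL, '/uploads/demo/example.jpg')

print(with_slash)
print(without_slash)
print(naive)
```

Both `urljoin` calls resolve to the same URL, while the formatted string contains `//uploads`. Many servers tolerate the doubled slash, but that is not guaranteed.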
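Garbled text (mojibake):
1. Whole response: set the encoding before reading response.text
- response = requests.get(url=xxx, headers=xxx)
- response.encoding = 'utf-8'
2. A single string: re-encode only the garbled value
- xxx.encode('iso-8859-1').decode('gbk') (a general-purpose fix for garbled Chinese text; it applies when a GBK page was decoded as ISO-8859-1, which requests falls back to when the server declares no charset)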
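The per-string fix can be checked without any network access. The sketch below simulates the failure, assuming the page bytes are GBK and were decoded as ISO-8859-1, and then reverses it; the sample string is arbitrary.

```python
# Simulate a GBK page whose bytes were decoded with the wrong codec.
original = '风景'
raw_bytes = original.encode('gbk')        # what the server actually sent
garbled = raw_bytes.decode('iso-8859-1')  # what a wrong default decode yields

# The fix: recover the raw bytes, then decode them with the right codec.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(repaired)
```

The round trip is lossless because ISO-8859-1 maps every byte value to exactly one character. If the page is actually UTF-8 rather than GBK, decode with `'utf-8'` in the last step instead.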
XPath examples: scraping 58.com rental listings, parsing and downloading image data, and garbled-text issues
Original source: https://www.cnblogs.com/Jnhnsnow/p/11612292.html