Using the search syntax site:url to gather directories, subdomains, and related information.
The script below queries the Baidu search engine, scrapes result titles, and extracts each result's link and title; the target domain and the number of pages to fetch are entered interactively.
import requests
from bs4 import BeautifulSoup

def baidu_search(site, num):
    # Custom request headers (replace the Cookie value with your own Baidu session cookie)
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive',
        'Cookie': 'BAIDUID=84E2A458FA68404161B7BF58A7339FAD:FG=1; BIDUPSID=84E2A458FA68404161B7BF58A7339FAD; PSTM=1618368791; BDRCVFR[Fc9oatPmwxn]=mk3SLVN4HKm; delPer=0; BD_CK_SAM=1; PSINO=1; H_PS_PSSID=33739_33272_33849_33757_33855; BD_UPN=13314752; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; H_PS_645EC=0664L5ZbBq1xK%2B%2F4kA9RHWttl5Q9WhRaOwWzieNeTSuLs%2FjwP7gaRsoltETDVb0hmwqU; BA_HECTOR=04a5242hak0h8k20l31g7d0mt0r; BDSVRTM=0; WWW_ST=1618379595274',
        'Host': 'www.baidu.com',
        # This field appears to be Baidu-search related; update it together with the URL later
        'is_pbs': 'site%3Ahbkjxy.cn',
        # 'is_referer': 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=monline_3_dg&wd=site%3Ahbkjxy.cn&oq=seit%253Ahbkjxy.cn&rsv_pq=d52768e4000b1a9c&rsv_t=1074mj%2FMIzoFyt8h%2FJsMRtcR0Q9XZQnMp8vomVo3EsyZBm9DZ1ZlRCO6WHVwa9YHPoTV&rqlang=cn&rsv_dl=tb&rsv_enter=1&rsv_sug3=4&rsv_sug2=0&rsv_btype=t&inputT=774&rsv_sug4=1560&bs=seit%3Ahbkjxy.cn',
        'is_xhr': '1',
        # 'Referer': 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=monline_3_dg&wd=site%3Ahbkjxy.cn&oq=site%253Ahbkjxy.cn&rsv_pq=d81947b400077895&rsv_t=0664L5ZbBq1xK%2B%2F4kA9RHWttl5Q9WhRaOwWzieNeTSuLs%2FjwP7gaRsoltETDVb0hmwqU&rqlang=cn&rsv_dl=tb&rsv_enter=0&rsv_btype=t',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # First attempt, kept for reference: a single hard-coded request URL for page one
    # url = 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=site%3Awww.hbkjxy.cn'
    # html = requests.get(url, headers=headers)
    #
    # # BeautifulSoup needs an explicit parser, otherwise it raises a warning
    # # job_bt1 --> all h3 tags on the page
    # soup1 = BeautifulSoup(html.text, 'html.parser')
    # job_bt1 = soup1.findAll('h3')
    # Loop over the pages and print each page's results (nested loops handle multiple pages)
    for j in range(0, num):
        # Baidu's pn parameter is a result offset: page j+1 starts at j*10
        # (the original used num * 10, which fetched the same offset on every pass)
        n = j * 10
        # Build the request URL from the user-supplied domain
        url = 'https://www.baidu.com/s?wd=site%3A' + site + '&pn=' + str(n) + '&oq=site%3A' + site + '&tn=monline_3_dg&ie=utf-8&rsv_pq=bc740e0d0006022d&rsv_t=b97fYg6KXBKZegSvnBVu6qTVJ4tugqrU2YfvoV%2Bm1BuDyUONhuRlSva1I8KCgJc2pWic&topic_pn='
        html = requests.get(url, headers=headers)
        # BeautifulSoup needs an explicit parser, otherwise it raises a warning
        # job_bt1 --> all h3 tags on the page
        soup1 = BeautifulSoup(html.text, 'html.parser')
        job_bt1 = soup1.findAll('h3')
        for i in job_bt1:
            # Pull the link out of each h3 tag
            link = i.a.get('href')
            # Follow the link to obtain the real address (the href scraped from the
            # results page is a Baidu redirect URL, not the target page's own URL)
            a = requests.get(link)
            a.encoding = 'utf8'
            # Same idea as job_bt1: grab the title of the requested page
            soup2 = BeautifulSoup(a.text, 'html.parser')
            job_bt2 = soup2.title.text
            print(job_bt2, a.url, 'page', j + 1)
if __name__ == '__main__':
    # Read user input
    try:
        num = int(input('Pages: '))
    except Exception:
        print('error: please enter the number of pages to search')
        exit()
    try:
        host = str(input('Domain: '))
    except Exception:
        print('error: please enter a domain')
        exit()
    baidu_search(host, num)
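The loop above pages through results by bumping Baidu's pn offset by 10 per page and percent-encoding the site: query. A minimal sketch of just that URL construction (the helper name is hypothetical, and the extra session/tracking parameters from the original URL are dropped for clarity):

```python
from urllib.parse import quote

def build_baidu_url(site, page):
    # Baidu's pn parameter is a result offset: page 0 -> pn=0, page 1 -> pn=10, ...
    # quote() percent-encodes the ':' in the site: operator as %3A
    query = quote('site:' + site)
    return 'https://www.baidu.com/s?wd=' + query + '&pn=' + str(page * 10)

print(build_baidu_url('hbkjxy.cn', 0))
# -> https://www.baidu.com/s?wd=site%3Ahbkjxy.cn&pn=0
```

Building the query with urllib.parse.quote avoids hard-coding the %3A escape by hand, which is how the original script's hard-coded URL drifted out of sync with the user-supplied domain.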
Source: https://www.cnblogs.com/Frieza/p/14661282.html