
Scraping GitHub projects

Date: 2018-07-04 16:47:30


import requests
from bs4 import BeautifulSoup

url = 'https://github.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Referer': 'https://github.com/',
    'Upgrade-Insecure-Requests': '1',  # the 1 here must be a string, not a number
    'Host': 'github.com',
    'Connection': 'keep-alive',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    # 'br' is left out: requests only decodes Brotli when a brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
res1 = requests.get(url, headers=headers)
# Check the response
print(res1.status_code)
print(res1.reason)
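# Parse the login page to get the dynamic token
soup = BeautifulSoup(res1.text, 'lxml')
tag_input = soup.find(name='input', attrs={'name': 'authenticity_token'})
if tag_input is None:
    raise RuntimeError('authenticity_token input not found on the login page')
authenticity_token = tag_input.get('value')
data = {'commit': 'Sign in',  # plain space: requests URL-encodes form values itself
        'utf8': '✓',
        'authenticity_token': authenticity_token,
        'login': '295345t54341@qq.com',  # replace with your own account
        'password': '234523456345'}  # replace with your own password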

cookies = res1.cookies.get_dict()
# The URL here is https://github.com/session, not https://github.com/login
res2 = requests.post(url='https://github.com/session', headers=headers, cookies=cookies, data=data)
print(authenticity_token)
print(res2.status_code)
print(res2.reason)
# Merge in the cookies returned after login
cookies.update(res2.cookies.get_dict())
res3 = requests.get(url='https://github.com/settings/repositories',
                    cookies=cookies,
                    headers=headers
                    )

print(res3.url)
print(res3.status_code)
print(res3.reason)

soup3 = BeautifulSoup(res3.text, 'lxml')
project = soup3.find(name='div', attrs={'class': 'listgroup'})
print(project)
# find() returns None when the login failed or the page layout changed
project_list = project.find_all(name='a', attrs={'class': 'mr-1'}) if project else []
for i in project_list:
    project_name = i.text
    project_ = i.get('href')
    project_href = 'https://github.com/' + project_.split('/', maxsplit=1)[1]
    print('Project name: %s , project link: %s' % (project_name, project_href), '\n')

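# Notes on scraping GitHub:
# 1. Every request after login must carry the cookies obtained after logging in.
# 2. The token has to be read from the login page. It is dynamic, so extract it
#    with bs4 or a regular expression.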
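
The two notes above can be sketched in a slightly different way. This is only an illustration, not the original author's code: `extract_token` and `login` are hypothetical helper names, `SAMPLE_LOGIN_HTML` is made-up markup standing in for the real login page, and the regular expression assumes the `name` attribute appears directly before `value`. GitHub's actual login flow may differ or may have changed since the post was written.

```python
import re

# Hypothetical sample of the login form markup, used only for illustration.
SAMPLE_LOGIN_HTML = '''
<form action="/session" method="post">
  <input type="hidden" name="utf8" value="&#x2713;">
  <input type="hidden" name="authenticity_token" value="abc123==">
  <input type="text" name="login">
</form>
'''

# Assumes name="authenticity_token" is immediately followed by value="...".
TOKEN_RE = re.compile(r'name="authenticity_token"\s+value="([^"]+)"')


def extract_token(html):
    """Return the authenticity_token value found in html, or None."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


def login(session, username, password):
    """Log in through a single requests.Session so that every later
    request made with that session reuses the logged-in cookies."""
    login_page = session.get('https://github.com/login')
    token = extract_token(login_page.text)
    if token is None:
        raise RuntimeError('authenticity_token not found on the login page')
    payload = {
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': token,
        'login': username,
        'password': password,
    }
    return session.post('https://github.com/session', data=payload)


# Offline check of the regex-based extraction on the sample markup.
print(extract_token(SAMPLE_LOGIN_HTML))

# Usage against the live site (not run here, needs the requests package):
#   import requests
#   with requests.Session() as s:
#       login(s, 'your-login', 'your-password')
#       page = s.get('https://github.com/settings/repositories')
```

A `requests.Session` stores cookies from every response it sees, including intermediate redirect responses, so the manual `cookies.update(...)` bookkeeping is not needed when all requests go through the same session.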

 


Original article: https://www.cnblogs.com/luobiao-114/p/9263876.html
