Python crawler ----> Python projects on GitHub

Posted: 2017-12-19 15:19:47

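　　In this post we learn how to use BeautifulSoup and pymysql by crawling some of the most-starred Python projects on GitHub. I always thought the mountain was the water's story, the cloud was the wind's story, and you were my story, yet I never knew whether I was yours.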

A Python crawler for GitHub

Requirement: crawl the high-quality Python-related projects on GitHub. The code below is a test case and only fetches a small amount of data.

1. A crawler that implements the basic functionality

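This example covers three things: batch inserts with pymysql, parsing HTML with BeautifulSoup, and fetching data with a GET request from the requests library. For more on pymysql, see the blog post: Python framework ----> using pymysql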
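The GET request boils down to a search URL with five query parameters. As a small offline sketch (standard library only, no network access), the following shows roughly what URL results from the parameters used in the crawler's get_response_data function. The helper name build_search_url is hypothetical and not part of the original code, and requests may encode some values slightly differently.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://github.com/search'


def build_search_url(page):
    """Return the search URL for one results page (hypothetical helper)."""
    # Same parameters as the crawler: sort by stars, descending, repositories only.
    params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': page}
    return SEARCH_URL + '?' + urlencode(params)


print(build_search_url(1))
```

Pasting the printed URL into a browser is a quick way to check which page the crawler would request.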

import requests
import pymysql.cursors
from bs4 import BeautifulSoup

def get_effect_data(data):
    """Parse one search-results page into (owner, name, language, stars, updated) tuples."""
    results = list()
    soup = BeautifulSoup(data, 'html.parser')
    # Each search result sits in a div with the class repo-list-item
    # (selectors match GitHub's markup when this post was written).
    projects = soup.find_all('div', class_='repo-list-item')
    for project in projects:
        # The repository link has the form /<owner>/<name>.
        writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
        project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
        project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
        update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()

        owner, name = writer_project.split('/')[1:3]
        result = (owner, name, project_language, project_starts, update_desc)
        results.append(result)
    return results


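def get_response_data(page):
    """Fetch one page of GitHub search results for "python", sorted by stars."""
    request_url = 'https://github.com/search'
    params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': page}
    # Pass the query string by keyword and set a timeout so a stalled request cannot hang the crawl.
    resp = requests.get(request_url, params=params, timeout=10)
    return resp.text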


def insert_datas(data):
    """Insert all crawled rows into project_info in a single batch."""
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='root',
                                 db='test',
                                 charset='utf8mb4',
                                 cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            sql = 'insert into project_info(project_writer, project_name, project_language, project_starts, update_desc) VALUES (%s, %s, %s, %s, %s)'
            # executemany takes the whole list of tuples and inserts them in one call.
            cursor.executemany(sql, data)
        connection.commit()
    finally:
        # Always close the connection, whether or not the insert succeeded.
        connection.close()


if __name__ == '__main__':
    total_page = 2  # total number of pages to crawl
    datas = list()
    for page in range(total_page):
        res_data = get_response_data(page + 1)
        data = get_effect_data(res_data)
        datas += data
    insert_datas(datas)
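To try the batch-insert step without a running MySQL server, here is a minimal sketch that uses the standard-library sqlite3 module as a stand-in for pymysql. It is not the setup used in this post. The column names come from the insert statement above, but the column types and the id column are assumptions, since the original table definition is not shown. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# Column names follow the crawler's INSERT statement; the types and the id column are assumed.
CREATE_SQL = """
CREATE TABLE project_info (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    project_writer TEXT,
    project_name TEXT,
    project_language TEXT,
    project_starts TEXT,
    update_desc TEXT
)
"""
INSERT_SQL = (
    "INSERT INTO project_info "
    "(project_writer, project_name, project_language, project_starts, update_desc) "
    "VALUES (?, ?, ?, ?, ?)"
)


def insert_rows(connection, rows):
    """Insert all rows with one executemany call and return the table's row count."""
    with connection:  # commits on success, rolls back if an exception is raised
        connection.executemany(INSERT_SQL, rows)
    return connection.execute("SELECT COUNT(*) FROM project_info").fetchone()[0]


# Two rows in the same tuple layout that get_effect_data produces.
rows = [
    ("pallets", "flask", "Python", "31.1k", "Updated Nov 15, 2017"),
    ("django", "django", "Python", "29.8k", "Updated Nov 22, 2017"),
]

connection = sqlite3.connect(":memory:")
connection.execute(CREATE_SQL)
print(insert_rows(connection, rows))
connection.close()
```

The shape is the same as in insert_datas: build a list of tuples first, then hand the whole list to executemany instead of executing one insert per row.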

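After the run, the database contains rows like the following (a row id followed by the five inserted columns: project_writer, project_name, project_language, project_starts, update_desc):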

11	tensorflow	tensorflow	C++	78.7k	Updated Nov 22, 2017
12	robbyrussell	oh-my-zsh	Shell	62.2k	Updated Nov 21, 2017
13	vinta	awesome-python	Python	41.4k	Updated Nov 20, 2017
14	jakubroztocil	httpie	Python	32.7k	Updated Nov 18, 2017
15	nvbn	thefuck	Python	32.2k	Updated Nov 17, 2017
16	pallets	flask	Python	31.1k	Updated Nov 15, 2017
17	django	django	Python	29.8k	Updated Nov 22, 2017
18	requests	requests	Python	28.7k	Updated Nov 21, 2017
19	blueimp	jQuery-File-Upload	JavaScript	27.9k	Updated Nov 20, 2017
20	ansible	ansible	Python	26.8k	Updated Nov 22, 2017
21	justjavac	free-programming-books-zh_CN	JavaScript	24.7k	Updated Nov 16, 2017
22	scrapy	scrapy	Python	24k	Updated Nov 22, 2017
23	scikit-learn	scikit-learn	Python	23.1k	Updated Nov 22, 2017
24	fchollet	keras	Python	22k	Updated Nov 21, 2017
25	donnemartin	system-design-primer	Python	21k	Updated Nov 20, 2017
26	certbot	certbot	Python	20.1k	Updated Nov 20, 2017
27	aymericdamien	TensorFlow-Examples	Jupyter Notebook	18.1k	Updated Nov 8, 2017
28	tornadoweb	tornado	Python	14.6k	Updated Nov 17, 2017
29	python	cpython	Python	14.4k	Updated Nov 22, 2017
30	reddit	reddit	Python	14.2k	Updated Oct 17, 2017
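The star counts and update dates in these rows are stored as display strings such as 78.7k and Updated Nov 22, 2017, which are awkward to sort or filter on. As a possible follow-up step, the sketch below converts them to an integer and a date. The helper names parse_star_count and parse_update_date are hypothetical, and the code assumes only the formats visible in the sample rows above.

```python
from datetime import date
from decimal import Decimal

# Month abbreviations as they appear in the sample rows, mapped to month numbers.
_MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}


def parse_star_count(text):
    """Convert a string such as '78.7k' or '950' into an integer."""
    cleaned = text.strip().lower()
    if cleaned.endswith("k"):
        # Decimal avoids floating-point rounding surprises when scaling by 1000.
        return int(Decimal(cleaned[:-1]) * 1000)
    return int(cleaned)


def parse_update_date(text):
    """Convert a string such as 'Updated Nov 22, 2017' into a datetime.date."""
    month, day, year = text.replace("Updated", "").replace(",", "").split()
    return date(int(year), _MONTHS[month], int(day))


print(parse_star_count("78.7k"), parse_update_date("Updated Nov 22, 2017"))
```

These conversions could be applied to each tuple before it is passed to insert_datas, provided the table columns are changed to matching numeric and date types.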

Original post: http://www.cnblogs.com/huhx/p/usepythongithubspider.html
