Preface

Any important decision should be based on data, and IT projects and software development are no exception. Unless you look closely at the data that describes how a project evolves, you cannot assess its health or suggest reasonable improvements. Meaningful data for this kind of analysis can be mined from Git repositories and from the code hosting platforms (such as GitHub and GitLab) where the project lives. However, getting data out of Git and GitHub is not a simple task. This article introduces some open source Git/GitHub analysis tools for your reference.

GitHub API

First up is GitHub's official API, which is the best way to get the details of a GitHub repository. The API is easy to use: you can call it with curl, or wrap it in any programming language, to retrieve all the information about a repository. (Other public Git hosting platforms and self-hosted GitLab offer similar APIs.) The annoying part is that GitHub rate-limits the API: the number of requests per hour is capped (60 for unauthenticated requests, 5,000 for authenticated ones), so the API is not a good fit if you want to analyze large projects or run a global analysis across many of them. Dashboards that focus on a single project or contributor are usually not affected.
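As a rough illustration of working within the rate limit, the following Python sketch builds a request for a repository and reads the rate-limit counters that GitHub returns in its response headers. The helper names (`build_request`, `rate_limit_status`, `fetch_repo`) and the `User-Agent` value are assumptions made for this example, not part of any official client.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_request(path, token=None):
    """Build a GitHub REST API request; `token` is an optional access token."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "repo-stats-sketch",  # hypothetical client name
    }
    if token:
        # Authenticated requests get the higher hourly quota.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}{path}", headers=headers)


def rate_limit_status(headers):
    """Extract the rate-limit counters sent along with an API response."""
    return {
        "limit": int(headers["X-RateLimit-Limit"]),
        "remaining": int(headers["X-RateLimit-Remaining"]),
        "reset_epoch": int(headers["X-RateLimit-Reset"]),
    }


def fetch_repo(owner, repo, token=None):
    """Fetch repository metadata plus rate-limit status (network call)."""
    request = build_request(f"/repos/{owner}/{repo}", token)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp), rate_limit_status(resp.headers)
```

Checking `remaining` before each batch of calls lets a script pause until `reset_epoch` instead of failing halfway through an analysis.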

Through the GitHub API you get essentially all the information you see when you browse a project's repository on GitHub, but only limited information about the repository's internal Git data (for example, which lines of code were changed in the last day). For the full picture, you need to clone the repository and run git commands.
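For the "lines changed in the last day" case, a minimal sketch is to run `git log --numstat` in a local clone and add up the per-file counts. The function names and the default `since` value below are illustrative choices, and the clone is assumed to exist already at `repo_dir`.

```python
import subprocess


def parse_numstat(output):
    """Sum added/deleted lines per file from `git log --numstat --format=` output."""
    totals = {}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank separator lines between commits
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of line counts
        prev_added, prev_deleted = totals.get(path, (0, 0))
        totals[path] = (prev_added + int(added), prev_deleted + int(deleted))
    return totals


def lines_changed_since(repo_dir, since="1 day ago"):
    """Run git in a local clone and return {path: (added, deleted)}."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", f"--since={since}", "--numstat", "--format="],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_numstat(result.stdout)
```

Keeping the parser separate from the `git` call makes it easy to test against saved output without needing a repository.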

GHCrawler

GHCrawler is a robust GitHub API crawler developed by Microsoft that traverses, searches, and tracks GitHub entities and messages. It is especially useful if you want to analyze the activity of an organization or project. GHCrawler is also subject to the GitHub API rate limits, but it optimizes the use of API tokens through token pools and rotation. It can be driven from the command line or through a web interface (ghcrawler-dashboard).
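The idea of a token pool with rotation can be sketched in a few lines of Python. This is not GHCrawler's actual implementation; the `TokenPool` class and its methods are hypothetical and only show one plausible policy: always hand out the token with the most remaining quota, and treat a token as fresh again once its reset time has passed.

```python
import time


class TokenPool:
    """Hypothetical pool that hands out the token with the most remaining quota."""

    def __init__(self, tokens, now=time.time):
        # remaining=None means "not used yet", which is treated as a full quota.
        self._state = {t: {"remaining": None, "reset": 0.0} for t in tokens}
        self._now = now

    def update(self, token, remaining, reset_epoch):
        """Record the remaining quota and reset time reported for a token."""
        self._state[token] = {"remaining": remaining, "reset": float(reset_epoch)}

    def _available(self, state):
        # Unused tokens and tokens whose window has reset count as unlimited.
        if state["remaining"] is None or self._now() >= state["reset"]:
            return float("inf")
        return state["remaining"]

    def acquire(self):
        """Return the best token, or None if every token is exhausted."""
        token, state = max(
            self._state.items(), key=lambda item: self._available(item[1])
        )
        return token if self._available(state) > 0 else None
```

A caller would invoke `update` after each API response using the rate-limit values it received, and sleep or back off when `acquire` returns `None`.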

Official project repository: github.com/Microsoft/g…

GH Archive

GH Archive is an open source project that records the public GitHub timeline, archives it, and makes it easily accessible for further analysis. It captures all public GitHub events and stores them in a set of JSON files that can be downloaded and processed offline as needed.

In addition, GH Archive is available as a public dataset on Google BigQuery. The dataset is updated automatically every hour, and you can run arbitrary SQL-like queries against the entire dataset in seconds.
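For offline processing, a small Python sketch can download one hourly archive and count events by type. It assumes the hourly files are gzipped, newline-delimited JSON served under `https://data.gharchive.org/` with a `YYYY-MM-DD-H` name, as described on the project website; the function names and the `User-Agent` value are made up for this example.

```python
import gzip
import io
import json
import urllib.request
from collections import Counter


def archive_url(date, hour):
    """URL of one hourly archive, e.g. archive_url("2015-01-01", 15)."""
    return f"https://data.gharchive.org/{date}-{hour}.json.gz"


def count_event_types(gz_bytes):
    """Count events by type in a gzipped file of newline-delimited JSON."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                counts[json.loads(line)["type"]] += 1
    return counts


def download_and_count(date, hour):
    """Download one hourly archive (network call) and count its event types."""
    request = urllib.request.Request(
        archive_url(date, hour),
        headers={"User-Agent": "gharchive-sketch"},  # hypothetical client name
    )
    with urllib.request.urlopen(request) as resp:
        return count_event_types(resp.read())
```

Reading the whole file into memory is acceptable for a single hour; for larger time ranges the BigQuery dataset is likely the more practical route.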

Official project website: www.gharchive.org

GHTorrent

Similar to GH Archive, the GHTorrent project monitors GitHub's public event timeline. For each event, it exhaustively retrieves the event's contents and dependencies. The resulting JSON is stored in a MongoDB database, and its structure is also extracted into a MySQL database.

The difference between the two is that GH Archive aims to provide a more detailed collection of events, retrieved at hourly frequency, whereas GHTorrent provides event data in a more structured form that makes it easier to get information about all events in a given month.

Official project repository: github.com/ghtorrent

Kibble

Apache Kibble is a set of tools for collecting, summarizing, and visualizing activities in software projects. The Kibble architecture consists of a central Kibble server and a set of scanning applications dedicated to handling specific types of resources (a Git repo, a mailing list, a JIRA instance, etc.) and pushing compiled data objects to the Kibble server.

From this data, you can build a custom dashboard containing widgets that display project data (language breakdown, top contributors, code evolution, and so on). In this sense, Kibble is more of a tool for building a web presentation of project data.

Official project website: kibble.apache.org/

CHAOSS

CHAOSS is a Linux Foundation project dedicated to creating analytics and metric definitions that help assess the health of open source communities. The CHAOSS project includes several tools for mining and calculating these metrics:

Augur is a Python library, Flask Web application, and REST server that provides metrics about the health and sustainability of open source software development projects. The goal is rapid prototyping of new indicators of interest to the CHAOSS community.

Cregit focuses on generating views that visualize where code changes come from.

**GrimoireLab** is Bitergia's most mature and ambitious tool to date. It aims to provide an open source platform for:

1. Automatic and incremental data gathering from almost any tool (data source) related to open source development (source control, issue tracking systems, forums, etc.).

2. Automatic data enrichment to clean up and extend the data collected above (merging duplicate identities, adding information about contributor affiliations, computing delays, geographic data, etc.).

3. Data visualization, with filtering and search by time range, project, repository, contributor, etc.

GrimoireLab uses Kibana to provide all of these great visualizations on top of the collected data.
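To give a feel for the identity-merging part of the enrichment step, here is a deliberately simple Python sketch that groups contributor identities sharing a normalized name or email address. It is an assumption-laden illustration, not GrimoireLab's algorithm: the `merge_identities` function is hypothetical, and matching on names alone can wrongly merge different people who share a name.

```python
from collections import defaultdict


def merge_identities(identities):
    """Group (name, email) pairs that share a normalized name or email."""
    parent = {}

    def find(key):
        # Union-find lookup with path halving.
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    def union(a, b):
        parent[find(a)] = find(b)

    for name, email in identities:
        # Link each identity's normalized name key to its normalized email key.
        union(("name", name.strip().lower()), ("email", email.strip().lower()))

    groups = defaultdict(set)
    for name, email in identities:
        groups[find(("email", email.strip().lower()))].add((name, email))
    return sorted(sorted(group) for group in groups.values())
```

Each returned group is a list of the raw identities believed to belong to one person, which could then be mapped to a single contributor record before metrics are computed.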

Official CHAOSS project website: chaoss.community/

Sourced

Sourced (styled source{d}) describes itself as a data platform for the software development lifecycle. Compared with the previous tools, it focuses more on a project's code than on community collaboration. Sourced projects use a generic AST so that code base details can be queried in a language-independent manner.

Several interesting data analysis tools can be found in the Sourced GitHub organization, including:

**go-git:** a highly extensible Git implementation library written in pure Go.

**Hercules:** a tool written in Go for analyzing the entire commit history of a repository.

**gitbase:** a SQL database interface to Git repositories, written in Go. For example, to count commits per committer, broken down by year and month, you can use the following SQL statement:

    -- Commits reachable from HEAD, counted per committer, year, month, and repository.
    SELECT YEAR,
           MONTH,
           repo_id,
           committer_email,
           COUNT(*) AS num_commits
    FROM
      (SELECT YEAR(committer_when) AS YEAR,
              MONTH(committer_when) AS MONTH,
              repository_id AS repo_id,
              committer_email
       FROM ref_commits
       NATURAL JOIN commits
       WHERE ref_name = 'HEAD') AS t
    GROUP BY committer_email,
             YEAR,
             MONTH,
             repo_id;

Official project website: sourced.tech/

GitHub organization: github.com/src-d

Hubble

Hubble visualizes collaboration, usage, and health data for GitHub Enterprise. It aims to help large companies understand how their internal organizations, projects, and contributors are distributed and how they collaborate.

Hubble Enterprise consists of two components. The updater component is a Python script that queries relevant data from a GitHub Enterprise appliance on a daily basis and stores the results in a Git repository. The docs component is a web application that visualizes the collected data and is hosted on GitHub Pages.

Official project repository: github.com/Autodesk/hu…

onefetch

Finally, onefetch deserves a mention: it is a nifty command-line tool, written in the emerging Rust language, that displays summary information about a Git project and supports more than 50 programming languages.

Conclusion

In this article, we listed some tools and projects for mining data from GitHub and Git. In addition to the open source software mentioned above, there are also some very good commercial tools, such as Snoot and Waydev.

Author: Bug search qi CCsearchit

Link: www.jianshu.com/p/47fe845d5… Source: Jianshu
