This article was originally published by Yanglbme on the public account “Doocs Open Source Community”. Unauthorized republication is prohibited.

The data source

Basic data from: www.gharchive.org

Statistical methods

Get the GitHub PushEvent of 2019, and identify the organization of the GitHub user by analyzing the email address in the submission record.

Specific methods refer to: www.freecodecamp.org/news/the-to…

Analysis tools

  • Google Big Query
  • Data Studio

The SQL statement

Since Google Big Query only gets a free 1TB of data processing per month, to make the most of it, we limit our data queries to a certain date range (20190301-20191001) to make sure that data processing is close to, but not more than, 1TB.

Data from this date range provides a rough indication of GitHub organizations’ open source contributions throughout 2019.

SELECT *
FROM `githubarchive.month.2019*` a
WHERE _TABLE_SUFFIX BETWEEN '0301' AND '1001'
Copy the code

The complete SQL statement is prepared as follows:

#standardSQL WITH period AS ( SELECT * FROM `githubarchive.month.2019*` a WHERE _TABLE_SUFFIX BETWEEN '0301' AND '1001' ), repo_stars AS ( SELECT repo.id, COUNT(DISTINCT actor.login) stars, APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value repo_name FROM period WHERE type='WatchEvent' GROUP BY 1 HAVING stars>20 ), pushers_guess_emails_and_top_projects AS ( SELECT *, REGEXP_EXTRACT(email, r'@(.*)') domain FROM ( SELECT actor.id , APPROX_TOP_COUNT(actor.login,1)[OFFSET(0)].value login , APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email'),1)[OFFSET(0)].value email , COUNT(*) c , ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name,stars))) repos FROM period a JOIN repo_stars b ON a.repo.id=b.id WHERE type='PushEvent' GROUP BY 1 HAVING c>3 ) ) SELECT * FROM ( SELECT domain , githubers , (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to , ARRAY( SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, '$.repo_name') repo_name , CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64) stars , COUNT(*) githubers_from_domain FROM UNNEST(repos) repo GROUP BY 1, 2 HAVING githubers_from_domain>1 ORDER BY stars DESC LIMIT 3 ) top , (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64)) FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) sum_stars_projects_contributed_to FROM ( SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos FROM pushers_guess_emails_and_top_projects #WHERE domain  IN UNNEST(SPLIT('google.com|microsoft.com|amazon.com', '|')) WHERE domain NOT IN UNNEST(SPLIT('gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com| web.de|iki.fi|foxmail.com|yandex.ru', '|')) # email hosters GROUP BY 1 HAVING githubers > 30 ) WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo GROUP BY repo))>4 # second filter email hosters ) ORDER BY githubers DESCCopy the code

As you can see from the figure below, this query statistics will process 918.4GB of data.

The statistical results

Click Run, and after 17.8 seconds, we can see the query result.

Top level organization comparison

  • Microsoft and Google lead the way in open source contribution, followed by redhat, Intel and amazon.
  • Microsoft and Google each have over 1,000 employees (Githubers) repos_contributed_to multiple GitHub repositories;
  • For Microsoft, the Top3 repositories for 2019 are Terminal, vscode and TypeScript, while for Google, flutter, tensorflow and kubernetes.

Rounding out the top 10 are Pivotal, Facebook, Apache, SAP, and Shopify.

Comparison of domestic large factories

Ali employees have the highest contribution to open source in domestic big factories, ranking 12th. The top3 warehouses are flutter-go, nacos and sqlflow, and all projects have received more than 90,000 stars.

Baidu and Tencent ranked 21st and 23rd respectively.

The overview

The top 38 contributors to open source are as follows:

If you have any ideas, welcome to interact with me in the comment area.


Welcome to follow my wechat public account “Doocs Open Source Community” and push original technical articles as soon as possible.