GitHub releases data on 2.8 million open source repositories through Google BigQuery

The GitHub Octocat figurine. (Image Credit: GitHub Shop)

GitHub today announced that it’s releasing activity data for 2.8 million open source code repositories and making it available for people to analyze with the Google BigQuery cloud-based data warehousing tool.

The data set is free to explore. (BigQuery lets you process up to one terabyte of data per month free of charge.)

This new 3TB data set includes information on “more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions,” Arfon Smith, program manager for open source data at GitHub, wrote in a blog post.
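To give a sense of what querying the data looks like, here is a minimal sketch using BigQuery's Python client to run a regular-expression search over file contents. The table and column names (`sample_contents`, `sample_repo_name`, `sample_path`, `content`) are assumptions based on the public data set's layout, and the pattern is just an illustrative example:

```python
# Sketch: regex search over file contents in the GitHub BigQuery data set,
# using the google-cloud-bigquery Python client.
# Table/column names are assumptions about the public data set's layout.
from google.cloud import bigquery

client = bigquery.Client()  # requires a Google Cloud project with BigQuery enabled

QUERY = r"""
SELECT sample_repo_name, sample_path
FROM `bigquery-public-data.github_repos.sample_contents`
WHERE REGEXP_CONTAINS(content, r'TODO\(')
LIMIT 10
"""

# Run the query and print the matching repositories and file paths.
for row in client.query(QUERY):
    print(row.sample_repo_name, row.sample_path)
```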

To get people started, Smith has put together some starter queries. Felipe Hoffa, a Google developer advocate who focuses on BigQuery, has put together some tips for working with the data sets in a Medium post.
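Because the free tier covers one terabyte of processing per month, it can be worth estimating how much data a query will scan before running it. Below is a hedged sketch using the same Python client; the `commits` table name is an assumption about the public data set, and BigQuery's dry-run mode reports bytes scanned without executing the query:

```python
# Sketch: estimate how many bytes a query would scan before running it,
# useful for staying inside BigQuery's 1 TB/month free tier.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT COUNT(*) AS commit_count
FROM `bigquery-public-data.github_repos.commits`
"""

# dry_run=True validates the query and reports bytes processed without running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(QUERY, job_config=job_config)

print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB")
```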


The data set could be useful to anyone who wants to get a sense of trends in open source software on GitHub, and querying it is simpler than tinkering with the GitHub application programming interface (API). GitHub, with more than 15 million users, isn't the only place where open source software lives on the Internet (GitLab is another), but it is a very popular one, perhaps the most popular.

Today’s move effectively amounts to an expansion of the GitHub Archive, which was first introduced by Google web performance engineer Ilya Grigorik in 2012.

GitHub will update the data set every week, a spokesperson told VentureBeat in an email.