GitGuardian raises $12 million to find sensitive data hidden in online code

GitGuardian, a cybersecurity platform that helps companies detect sensitive data hidden in public and private code repositories, has raised $12 million in a series A round of funding led by London-based Balderton Capital, with participation from GitHub cofounder Scott Chacon and Docker cofounder Solomon Hykes.

Founded in Paris in 2017, GitGuardian scans all public GitHub activity in real time to identify private data such as database login credentials, API keys, and cryptographic keys. The company works with more than 200 API providers, spanning payment systems, cloud services, messaging apps, crypto wallets, and more, so that any private information that does leak into the public domain is swiftly identified and the affected company is notified. The French startup said it has sent more than 400,000 alerts since its inception.

Secret sauce

The type of private data GitGuardian is looking to protect is what is known in the industry as “secrets” and includes anything that can be used by unauthorized third parties to access a system (e.g. a cloud or database) — including passwords and API tokens.

Behind the scenes, GitGuardian links GitHub-registered developers with their companies and scans the content of 2.5 million code commits each day in an effort to find usernames and passwords, database connection strings, keys, SSL certificates, and more. The company said it uses “sophisticated pattern matching” and machine learning techniques, with its algorithm constantly learning through a developer “feedback loop.” In effect, GitGuardian’s clients help improve the technology by telling it whether an alert was valid or not.
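To make the idea of pattern matching over commits concrete, here is a minimal sketch in Python. It is not GitGuardian's method: the pattern set, the `scan` function, and the entropy threshold are all assumptions chosen for illustration. It flags strings that look like AWS access key IDs, private key headers, or quoted values assigned to secret-sounding names, and it skips low-entropy values (such as placeholder passwords) to cut down on false positives.

```python
import math
import re
from collections import Counter

# Illustrative patterns only (assumptions), not GitGuardian's actual detectors.
PATTERNS = {
    # AWS access key IDs commonly start with "AKIA" followed by 16 characters.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    # A quoted value of 8+ characters assigned to a secret-sounding name.
    "generic_assignment": re.compile(
        r"(?i)(?:password|passwd|secret|api_key|token)\s*[:=]\s*['\"]([^'\"]{8,})['\"]"
    ),
}


def shannon_entropy(text):
    """Return the Shannon entropy of text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def scan(text, min_entropy=3.0):
    """Return (line_number, pattern_name) pairs for suspected secrets."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                # Generic matches are noisy, so drop low-entropy values.
                if (
                    name == "generic_assignment"
                    and shannon_entropy(match.group(1)) < min_entropy
                ):
                    continue
                findings.append((line_no, name))
    return findings
```

A production system would go further, for example by checking candidate keys against the issuing provider and by folding developer feedback on each alert back into the detectors, as the company describes.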

Although monitoring public GitHub repositories is a major facet of GitGuardian’s offering, the company also works to identify sensitive information that is inadvertently disseminated through internal systems, including private code repositories and messaging apps. Even companies that are careful to keep their code under lock and key can come unstuck if too many people inside the organization have access to it: the more people with access to “secrets,” the more avenues there are for data to become compromised. This is commonly referred to as “secret sprawl.”

“Secrets that are made too widely accessible in an organization [are] a huge issue for security professionals,” GitGuardian cofounder and CEO Jérémy Thomas told VentureBeat. “In the case of source code, if there are secrets in it, it takes only one developer account to be compromised for all the secrets they had access to to be compromised as well.”

Above: GitGuardian dashboard

Breaches

Back in 2017, Uber announced a major data breach, dating from 2016, that exposed the personal data of millions of riders and drivers. The company later confessed it wasn’t using multifactor authentication on its GitHub account, meaning anyone who obtained the login credentials could access its private repositories unhindered. It was through the GitHub repository that the intruders found access keys for Uber’s AWS data store, where its user data was kept.

In a Federal Trade Commission (FTC) filing from 2018, Uber revealed how the intruders managed to gain access to the private GitHub repository in the first place. Uber had granted its engineers access to the private repositories via their own personal GitHub accounts, which had weak security. The filing noted:

Uber granted its engineers access to Uber’s GitHub repositories through engineers’ individual GitHub accounts, which engineers generally accessed through personal email addresses. Uber did not have a policy prohibiting engineers from reusing credentials, and did not require engineers to enable multi-factor authentication when accessing Uber’s GitHub repositories. The intruders who committed the 2016 breach said that they accessed Uber’s GitHub page using passwords that were previously exposed in other large data breaches, whereupon they discovered the AWS access key they used to access and download files from Uber’s Amazon S3 Datastore.

As a result, the intruders accessed 16 files that contained unencrypted personal data, including nearly 26 million names and email addresses, 22 million names and mobile phone numbers, and 607,000 names and driver’s license numbers.

Poor password hygiene aside, Uber’s AWS access key should probably not have been anywhere near a GitHub repository — private or otherwise — in the first place. This kind of breach highlights what’s at stake for companies. Compromising customer data and losing trust is a major issue, but poor security can also lead to regulatory and legal tussles.

“Hardcoding secrets in source code or other private site[s] that are not specifically meant for secret storage breaks various compliance rules and industry standards and best practices,” Thomas noted.
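As a minimal sketch of the alternative to hardcoding, the Python snippet below reads a secret from the process environment at runtime, so the value never appears in the repository. The helper name `load_secret` and the variable name `AWS_SECRET_ACCESS_KEY` used in the comment are assumptions for illustration; many teams use a dedicated secrets manager instead.

```python
import os


def load_secret(name, environ=os.environ):
    """Fetch a secret from the environment instead of from source code.

    `environ` defaults to the real process environment; passing a plain
    dict makes the function easy to exercise in tests.
    """
    value = environ.get(name)
    if not value:
        # Fail loudly rather than falling back to a hardcoded default.
        raise RuntimeError(f"Missing required secret: {name}")
    return value


# Hypothetical usage: the key is injected by the deployment environment.
# aws_secret = load_secret("AWS_SECRET_ACCESS_KEY")
```

Failing fast when the variable is missing avoids the temptation to commit a fallback value "just for testing," which is one common way secrets end up in a repository.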

Uber, which initially covered up its gargantuan leak, was widely viewed as having violated numerous data security and breach reporting laws, and it eventually settled the case by paying $148 million. This is the type of scenario GitGuardian said it can help avert, as it claims it can detect a secret and alert the developer and security team within four seconds of it leaking into a code repository.

“Currently, every company with software development activities is concerned about secrets spreading within the organisation, and in the worst case, to the public space,” Thomas said. “As a company with so much sensitive information at hand, we have built a culture of unconditional secrecy at our core.”

GitGuardian said it has already helped more than 100 Fortune 500 companies, as well as government organizations and thousands of individual developers. And with another $12 million in the bank, it plans to expand its customer base in the U.S., where 75% of its current clients are based.

Some 40 million developers use GitHub, and with more than 100 million repositories, the Microsoft-owned code collaboration platform is fertile ground for any company looking to train algorithms. A few months back, Swiss startup DeepCode raised $4 million for a system that learns from GitHub project data to give developers automated code reviews. GitGuardian is adopting a similar philosophy in terms of how it’s using GitHub to train algorithms at scale so companies can further automate their cybersecurity setup.

“Rather than encumber technology organisations with limiting compliance procedures, GitGuardian allows the modern enterprise to develop code quickly and how it wants to, but with automated visibility and protection over how data, credentials, and other sensitive information is used, moved, and shared,” said Balderton Capital partner Suranga Chandratillake.
