The Internet Archive has saved more than 200 terabytes (TB) of government websites and associated data since last fall as part of an ongoing initiative to preserve online information during transitions between administrations.
Back in December, the organization revealed it was capturing a vast swathe of data from federal government websites in an exercise it estimated could total more than 100TB. We now know that the total exceeds 200TB: more than 100TB of websites and 100TB of data from federal FTP file servers, encompassing in excess of 40 million PDFs and 70 million HTML pages.
The Internet Archive has been documenting the web’s evolution for two decades, letting anyone revisit the Apple homepage in 1998 or VentureBeat in 2006 by entering their desired URL into the Wayback Machine. But the organization is also preserving all manner of data, including old MS-DOS video games, political TV ads, and, increasingly, digital data from the government.
First introduced as George W. Bush’s time in office was ending in 2008, the End of Term Web Archive is a collaboration between the Internet Archive and a number of libraries and universities, including the Library of Congress, University of North Texas, George Washington University, Stanford University, and California Digital Library. Its aim is to serve as a permanent record of government communications during presidential transitions. The need is real: by some estimates, 83 percent of PDF documents on .gov domains vanished during President Obama’s first term in the White House.
While the End of Term Web Archive wasn’t built with Donald Trump in mind, its existence has become all the more important for those concerned about how data is and will be treated under the Trump administration. Indeed, shortly after Trump’s election victory in November, environmentalists, academics, and climate scientists worked frantically to preserve U.S. government climate data in a Canadian archive due to Trump’s track record of dismissing climate change.
The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.
— Donald J. Trump (@realDonaldTrump) November 6, 2012
The Internet Archive also revealed it was building a replica database in Canada in response to concerns about Trump, and it later went on to launch The Trump Files, an online repository of everything Donald Trump has said on video.
Now, with 200TB worth of data gathered between fall 2016 and this spring, the Internet Archive has made every web page accessible through the Wayback Machine and said that it plans to add the database to the main End of Term Web Archive soon.