All of us have experienced clicking on a link and receiving an error or a 404 notice. Web pages are notoriously fragile documents, and many of the web resources we take for granted are at risk of disappearing. In one case study, archivists preserving the hashtags related to the Charlie Hebdo attacks in Paris found that just a few months later, between 7 and 10% of the tweets had been deleted. The average life span of a webpage is between 44 and 100 days. And even if you think we won’t lose much in the long run by failing to preserve every website of interest, the problem of “link rot” is a big deal: roughly half of all URLs cited in Supreme Court opinions are now dead.
Web archiving overcomes these problems of obsolescence through thoughtful planning and curation of organizationally and historically valuable web content. Most web archiving today takes place through third-party services such as WebRecorder or Archive-It. To archive a website, you supply the URL of the site and give the web archiving service instructions about what you want to capture and how many links below the main page you want the service to “crawl.” Web archiving is not simply saving a page as a PDF or taking a screenshot. Because most modern websites are dynamic, embed media, offer interactive features, and change rapidly, adequately capturing a website so that a user can interact with the archival file requires creating a WARC file. WARC (Web ARChive) is the standardized file format that web archiving services use to store captured web content. Implementing web archiving addresses several critical archives and records management use cases.
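To make the WARC idea a bit more concrete, here is a minimal sketch using the open-source warcio Python library (maintained by the Webrecorder project). This is not how Archive-It works internally; it simply shows that a WARC is an ordinary file into which complete HTTP responses can be written and later replayed. The URL and output filename here are placeholders.

```python
# Minimal sketch: capture a single page into a WARC file with warcio.
# (Illustrative only; a service like Archive-It runs full crawls and
# handles embedded media, crawl depth, and public replay for you.)
from warcio.capture_http import capture_http
import requests  # per warcio's docs, import requests *after* capture_http

# Every HTTP request/response made inside this block is recorded
# as WARC records in example.warc.gz.
with capture_http('example.warc.gz'):
    requests.get('https://www.uc.edu/')
```

A crawling service effectively does the same thing at scale, following links down to the depth you specify and packaging everything it fetches into WARC files that replay tools can later render.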
A web archiving subscription service such as Archive-It offers both a preservation tool and a collection development tool in one: the archivist can use the service to “crawl” a website and create WARC files, and can then present those resources to the public for research through the Internet Archive’s Archive-It website, which is currently used by over 60 ARL research libraries.
At the Archives and Rare Books Library, we have started using Archive-It to preserve important university websites. We’re just getting started, but so far we are prioritizing the websites of the Board of Trustees, President, and Provost. All of these sites host minutes, reports, and other documents that are important to retain for the university archives. We are also capturing copies of “endangered” websites on the uc.edu domain: sites that may be going offline in the near future but that have important university history embedded in them (you can see an example here).
Down the road, we’ll be expanding last year’s pilot project collecting websites from student organizations, in order to fill some of the gaps in our holdings that document the student experience. You can see our pilot project collection here.
Do you have any ideas about important university websites that ought to be crawled? If so, contact Eira Tansey, ARB’s Digital Archivist, by email at eira.tansey@uc.edu or by phone at 513-556-1958.