A Publication of the Public Library Association Public Libraries Online

A Needle in a Haystack: Writing Digitally about Proper Digital Preservation

Published on January 14, 2016

A few weeks ago, while researching my article regarding whether digital content is being properly preserved, I came across an article about knowledge preservation by Claire McInerney, a professor in the Rutgers University Library Program, an online-based master’s program. When I referred back to the article this week, this is what I found:

Courtesy of Troy Lambert

With a little searching, I was able to find a link to the same study on the American Society for Information Science and Technology[1] website. Still, it raised two questions: what happened to the other article? And how did I have to structure my search to find it?

A simple Google search for “knowledge management” would not work. The article didn’t rank high enough in Google’s results to appear on the first page, and most users (including me) don’t look past page two, so I needed more information for my search string. Since I had looked at the article recently, I knew which university and program the professor was affiliated with, and I remembered her last name. A search for “knowledge management Rutgers Library McInerney” brought me to the information I was looking for, but the result was still on page two. This example highlights one of the many problems with proper preservation of digital content.
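To make that narrowing concrete, here is a minimal sketch in Python. It is not how Google ranks pages. It assumes a tiny hypothetical corpus and a plain rule that every search term must appear somewhere in the record, and the names Article, matches, and search are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    text: str


def matches(article: Article, query: str) -> bool:
    """Return True if every query term appears in the title or text."""
    haystack = f"{article.title} {article.text}".lower()
    return all(term in haystack for term in query.lower().split())


def search(articles: list, query: str) -> list:
    """Return the articles that contain all of the query terms."""
    return [article for article in articles if matches(article, query)]


# Hypothetical mini-corpus; only the first entry is modeled on the article
# discussed above, the rest are made up to play the part of the haystack.
CORPUS = [
    Article("Knowledge Management: A Practice Still Defining Itself",
            "Claire McInerney, Rutgers library program"),
    Article("Knowledge management in small firms",
            "A survey of business practices"),
    Article("Ten knowledge management tools",
            "Software roundup for managers"),
    Article("Digital preservation basics",
            "How archives keep files readable"),
]

broad = search(CORPUS, "knowledge management")
narrow = search(CORPUS, "knowledge management Rutgers Library McInerney")
print(len(broad), "matches for the broad query")
print(len(narrow), "matches for the narrow query")
```

With this corpus, the broad query matches three articles and the narrow one matches only the first. The catch is the same as in the real search: the narrow query works only if the searcher already remembers the university and the author's name.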

On November 20, 2015, Meredith Broussard of The Atlantic stated similar concerns in an article titled “The Irony of Writing Online About Digital Preservation”: “The Internet Archive will allow you to find a needle in a haystack, but only if you know approximately where the needle is.”[2]

Imagine me trying to find the same Rutgers article a year from now. I’m likely to have read hundreds of other articles by then, and I probably won’t remember the university the professor was from, and certainly not her name. I’d just have a vague notion of an article about knowledge management I would like to reference, and maybe a loose timeframe of when I read it, which has no real-world relation to the date the article was published.

Not to mention that the haystack is constantly growing.[3] Any number of articles like this one, raising similar concerns over the preservation of knowledge, will be created and archived somewhere, maybe. That’s where the irony comes in. We are writing in a digital medium about the difficulty of preserving digital data, and our thoughts themselves are challenging to preserve.

It’s a vicious circle, and an ongoing problem, one that libraries are ill-equipped to solve; however, these concerns have many sources and possible solutions.

Content Management

“The challenges of maintaining digital archives are as much social and institutional as technological,” said a National Science Foundation and Library of Congress study[4] from 2003. “Even the most ideal technological solutions will require management and support from institutions that in time go through changes in direction, purpose, management, and funding.”

Each website is hosted on some kind of platform designed to manage how the content looks to an end user, and many have unique themes. These vary from Drupal (used by Time magazine) to WordPress (where the content on my website is hosted), and dozens of others, some custom created for large media organizations. Media outlets that also create print materials have yet another Content Management System (CMS) for print content. All of this should be easy to preserve, right?

Not as easy as you think. Large archival organizations like LexisNexis or EBSCO scoop up digital feeds, bundle the information in a database, and license those packages to libraries, whose users can then search them by title, author, keyword, or where and when an item was posted, depending on what the feed is able to gather. But comparing EBSCO searches with searches in Google shows a stark difference in the quantity of articles indexed, revealing one of many data gaps.

Gone are the days of print material being converted to microfiche, but the change brings a hazard of its own: organizations that switch CMSs or run several at once, with decades of information to preserve in different formats (e.g., The New York Times), face huge and costly challenges.

User Expectations

User expectations have changed as well: researchers expect nearly instant results and unlimited access to information. But putting and keeping it all on the web just isn’t practical, and experiments searching for specific articles show just how challenging that is.


Such searches also raise the question of how necessary such preservation is. Unless a user is looking for a specific quote by a specific person, the proliferation of material on any subject means similar information will turn up in almost any search. In the example above, I was trying to find one specific article mostly to prove it could be done. Otherwise, I could have used other sources on knowledge management that the same search turned up, which contained nearly the same information.

Social Media Interactions

Not yet included in library-based searches are Tweets, Facebook posts and comments, and other online interactions authors have with their audience. These are also a source of relevant quotes and information, but social feeds are difficult for libraries to capture, archive, and preserve, let alone make useful. The Library of Congress has made an effort with Twitter, but has no idea (yet) how it will make the huge amount of data it has collected available to the public.[5]

The primary reason is cost, a constant issue with both preservation and public access. It’s not just about the hundreds of terabytes of storage, a number that grows daily, but about having servers fast enough to handle even the simplest search. Searching one term in a small portion of the tweets gathered, say from 2006 to 2011, would take twenty-four hours using the library’s current technology.
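A back-of-the-envelope sketch in Python shows why a single search can take that long. The archive sizes below are hypothetical (the figures above give only "hundreds of terabytes" for the whole collection and no size for the 2006 to 2011 portion), and the function names are invented for illustration. The sketch assumes the search has to read every byte once, with no index to shortcut the scan.

```python
SECONDS_PER_HOUR = 3600
BYTES_PER_TB = 10**12  # decimal terabyte
BYTES_PER_MB = 10**6   # decimal megabyte


def implied_throughput_mb_per_s(size_tb: float, hours: float) -> float:
    """Read rate (MB/s) implied by scanning size_tb terabytes in the given hours."""
    return size_tb * BYTES_PER_TB / (hours * SECONDS_PER_HOUR) / BYTES_PER_MB


def scan_hours(size_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to read size_tb terabytes once at the given rate."""
    seconds = size_tb * BYTES_PER_TB / (throughput_mb_per_s * BYTES_PER_MB)
    return seconds / SECONDS_PER_HOUR


# Hypothetical sizes for the searched portion; none of these is a reported figure.
for size_tb in (10, 50, 100):
    rate = implied_throughput_mb_per_s(size_tb, hours=24)
    print(f"{size_tb} TB in 24 hours implies about {rate:.0f} MB/s")
```

Whatever the true size, the relationship is linear: doubling the data doubles the time for a full scan, so a collection that grows daily gets slower to search unless the hardware or the indexing improves with it.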

There are also privacy issues, even though technically each Tweet published or Facebook update posted is already in a certain portion of the public domain, depending on the user’s privacy settings. However, this is a different method of acquisition than anything libraries have done previously, and a system has to be in place to remove deleted Tweets and posts in order to comply with the same user agreements that make them public information.

Data Gaps


Even news sites struggle with shrinking budgets, CMS migrations, and changes in IT staff. A Newspaper Research Journal article reveals major data gaps.[6] “Not one publication has a complete archive of their website,” the article states. “Most can go back no further than 2008.”

So when you look for this article in a few months, how easy will it be to find? Even if you save it to your Twitter or Facebook feed, will the link still work? For how long? Fast forward a year. Two. Will our concerns even be the same? If you can find the article, will it be relevant? How quickly will it be lost in the haystack of other articles about digital preservation?

I don’t know how this site is being archived, or when Public Libraries Online will switch content management systems. I can save this article on my computer, or even in the cloud, but while that protects my access, at least for now, it doesn’t preserve it anywhere else. It’s likely the article I write today on digital preservation will not be preserved beyond a couple of years, whether it is of scholarly interest or not.
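One way to check whether a copy exists somewhere else is to ask the Internet Archive. The sketch below, in Python, builds a query for the Wayback Machine's availability endpoint and reads the closest snapshot out of the reply. The helper names are invented for illustration, and the response layout (an "archived_snapshots" object holding a "closest" entry) is an assumption based on the format that API has documented, so it should be verified against the current documentation before being relied on.

```python
import json
import urllib.parse
import urllib.request

# Wayback Machine availability endpoint (assumed to be unchanged).
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABILITY + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response):
    """Return the archived copy's URL from a parsed reply, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Query the endpoint over the network and return the snapshot URL, if any."""
    request_url = build_availability_url(page_url, timestamp)
    with urllib.request.urlopen(request_url, timeout=10) as reply:
        return closest_snapshot(json.load(reply))
```

Calling lookup with a page address returns the address of an archived copy, or None if the archive has nothing. Even then, the point from earlier still holds: the lookup only helps someone who already knows the address of the page they want.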

But with a little searching, maybe someone will find this needle in the haystack of information. At least, if they have some idea of where it might have been in the first place…


[1] McInerney, Claire. “Knowledge Management – A Practice Still Defining Itself.” American Society for Information Science and Technology 28, no. 3 (February/March 2002).

[2] Broussard, Meredith. “The Irony of Writing Online About Digital Preservation.” The Atlantic, November 20, 2015. http://theatln.tc/1Qyguv2.

[3] Fridman, Alan. “3 Ways Big Data Has Changed the Digital Age.” Inc.com, July 19, 2015. http://bit.ly/1fgNYho.

[4] Hedstrom, Margaret. “It’s About Time: Research Challenges in Digital Archiving and Long-Term Preservation.” Report presented to the Workshop on Research Challenges in Digital Archiving and Long-Term Preservation, Washington, DC, April 12-13, 2002.

[5] LaFrance, Adrienne. “Library of Congress has archive of tweets, but no plan for its public display.” The Washington Post, January 13, 2013. http://wapo.st/1mTDBUJ.

[6] Hansen, Kathleen A., and Nora Paul. “Newspaper archives reveal major gaps in digital age.” Newspaper Research Journal 36, No. 3 (2015): 290–298. DOI: 10.1177/0739532915600745.
