I once had an uncle who, in his eccentric way, taught me the glories of reading widely and across many disciplines. Every year he would visit us from Florida and each time he came, he was off on another tangent, usually a scientific pursuit of some kind, and now and then a venture into linguistics. One of his more memorable visits found him arriving with a set of slides he had made from books on Egyptian hieroglyphics, and we went through them one at a time as he explained what he had learned about Egyptian culture by mastering these symbols.

Hieroglyphics Meet the Machine

I think about that every time I ponder the fate of digital data, and this Reuters story, which mentions hieroglyphics, triggered the memory. For as Andreas Rauber (Vienna University of Technology) points out, hieroglyphics (or, for that matter, stone inscriptions or medieval manuscripts) have a shelf life of millennia, and have proven it. In my own wandering way, I was for a time focused on medieval manuscripts and recall wonderful moments at the British Museum and the Háskóli Íslands (Reykjavík) studying documents that recorded the stories, as well as the daily business and musings, of societies a thousand years old.

There’s a fascination in working out the conventions of a medieval scribe, but in the modern world, we have an equally thorny task in sorting out digital formatting. We live by data in our time. Some 100 gigabytes of data (the article points out this is equivalent to about 23 tons of books) have been created for every individual on the planet, and Adam Farquhar (British Library) notes that this amounts to about a trillion CDs’ worth of data spread across the globe. Farquhar and Rauber are worried about our continuing access to these data. Says Farquhar:

“Einstein’s notebooks you can take down off the shelf and read them today. Roll forward 50 years and most of Stephen Hawking’s notes will likely only be stored digitally and we might not be able to access them all.”

International Data Corporation says the size of the digital universe will expand more than tenfold, from 161 billion gigabytes (161 exabytes) in 2007 to 1800 exabytes in 2011. The figure is predicted to double every eighteen months. Yet amidst the data avalanche, we live in a world where CDs and DVDs have an exceedingly short life expectancy and many data backups may simply fail when the time comes to access them. We’re learning how to spread the backup into the ‘cloud’ of networked computers, but haven’t solved the many security issues that involves.
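
For anyone who wants to check the arithmetic behind those growth figures, here is a minimal sketch in Python. It uses only the numbers quoted above (161 and 1800 exabytes, 2007 to 2011, an eighteen-month doubling time); the variable names are my own, and I am assuming simple exponential growth, which the forecast itself may not.

```python
import math

START_EB = 161         # exabytes, the starting figure quoted above
END_EB = 1800          # exabytes, the projected figure quoted above
YEARS = 2011 - 2007    # span quoted in the text

# Overall growth between the two endpoints.
growth_factor = END_EB / START_EB

# Doubling time implied by those endpoints, assuming steady
# exponential growth: t = span * ln(2) / ln(growth factor).
implied_doubling_months = YEARS * 12 * math.log(2) / math.log(growth_factor)

# Growth over the same span if the total doubled every eighteen months.
factor_at_18_months = 2 ** (YEARS * 12 / 18)

print(f"growth factor between endpoints: {growth_factor:.1f}x")
print(f"implied doubling time: {implied_doubling_months:.1f} months")
print(f"growth at an 18-month doubling time: {factor_at_18_months:.1f}x")
```

The endpoints work out to roughly an elevenfold increase, which is faster than an eighteen-month doubling time would deliver over four years, so the two statements are probably best read as describing different periods.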

Preserving Changing Formats

Even more telling is the fact that we continue to change data formats seemingly at whim. Change that improves things is always welcome, but our digital rush to the future often seems to move to its own whimsical music. Trying to upgrade the software I use for Centauri Dreams, I find that the new version’s coders have re-written key aspects of the programming environment, so that all the formatting I put into the header design would simply fail if I made the ‘upgrade’ without digging into the code myself and making a set of manual fixes.

Well, I’ll do all this because new features are available (and be aware of this in the next week or so if the site suddenly starts acting strangely), but there is such a thing as backwards compatibility. And this is the tiniest of problems compared to what decades, not to mention centuries, of ongoing change could do to data stored cavalierly because access is assumed.

What Rauber and Farquhar intend to do about all this is to preserve a ‘digital genome’ deep in the Swiss Alps in a secret bunker, where the needed information to read all our formatting is made available to future generations. The sealed box is buried somewhere near the town of Gstaad, in a data facility known as the Swiss Fort Knox in the Bernese Oberland, an installation that consists of two underground data centers that are 10 kilometers apart but connected by high-speed networking. The facility is in a former military nuclear bunker designed to offer physical protection against all manner of environmental disasters.

Going Deep in Switzerland

Behind the burying of this ‘digital genome’ is the Planets project, which links European libraries, archives and research institutions in an attempt to preserve both software and hardware assets as older versions of each are superseded. Inside the box are at-risk digital formats, from JPEG photographs to messages in Java, films in .MOV format and documents in PDF. Here’s how the project describes the contents of the Planets TimeCapsule:

Each object is stored in its original format and a new format more suitable for long-term preservation such as PDF/A, TIFF, JPEG2000 and MPEG4. The objects are stored on media that range from paper, microfilm and floppy discs to CDs, DVDs and flash-drives and HDDs.

Inside the box are the original and new objects, storage media, and some reading devices. It also includes conversion tools that were used to migrate the objects as well as software to open and view/use these objects and supporting software all the way down to an operating system; descriptions of the file formats, of the file systems and encodings used on the storage media; and description of all these objects and their relationship to supporting technology and recognised standards.
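
To give a feel for what even the simplest "description of a file format" buys a future reader, here is a small sketch of my own, not anything taken from the Planets materials. It recognizes three of the formats mentioned above by the well-known signature bytes at the start of each file. The names `SIGNATURES` and `identify_format` are hypothetical, and real preservation tools rely on far richer format registries than this.

```python
# Leading "magic bytes" for a few formats named in the capsule description.
# These signatures are widely documented; the table itself is illustrative.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\xff\xd8\xff": "JPEG",
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
}


def identify_format(data: bytes) -> str:
    """Return a format name if the data starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


if __name__ == "__main__":
    print(identify_format(b"%PDF-1.4\n"))          # PDF
    print(identify_format(b"\xff\xd8\xff\xe0"))    # JPEG
    print(identify_format(b"plain text"))          # unknown
```

Without a table like this (and the much longer documents that explain what follows the first few bytes), a file is just a run of numbers, which is the gap the capsule is meant to close.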

And yes, an online version is to be provided, while replicas will be available to libraries, archives, science museums and other interested parties. The need will surely grow for it to be augmented on a regular basis as our formats continue to sprout changes and extensions. Anyone who has been using computers for long enough will be aware of how acute this issue can be. I think of the hundreds of articles, reviews and columns I wrote in the 1980s using the now-defunct XyWrite program, which began with straight ASCII data formats but gradually changed so that its files were unreadable without the program. I would have to locate some very old software if I needed to get into the digital versions of all this work today.

For more about the Planets project, you can download a brochure here. I notice that it’s in the baleful PDF format, all but universal for the dissemination of scientific papers, yet balky and bloated and forever at the mercy of Adobe. Every digital ‘upgrade’ has its advantages, or at least most of them do, and getting the most out of our data is in the hands of good coders, but as our bits and bytes head inexorably into the future, we’d better hope that data preservation eventually becomes built into every operating system we run on our machines, and that the future web of interconnections keeps the knowledge of past formats always at hand.