The internet archaeologists digging up the digital dark age

The earliest known version of the very first website is out there, somewhere, on an outdated disk drive. "Maybe someone is using it as a paperweight," says Dan Noyes, the web manager for the communication group at the European Organisation for Nuclear Research, known as Cern, in Switzerland. "We do know there's a disk drive from 1990 that was sent to a conference in Santa Clara and went missing. Ideally we'd like to get that. We want the earliest iterations we can get."

Noyes is one of the team trying to restore the site with which Tim Berners-Lee kicked off the web revolution (iti.ms/10ZgLiE). It went online in 1991. The project was launched this week to commemorate the 20th anniversary of Cern’s releasing the technology free to all.

Without that coveted prehistoric disk drive, Noyes and his colleagues have used Berners-Lee’s original computer to rebuild a version of the site as it existed in 1992, which is the earliest version they can find. “The first website was about the web project itself, and it was trying to encourage other developers to go and create a website,” says Noyes. “So it was self-referencing. It was out there by itself for quite a while.”

Restoring the history of the web is more difficult than it sounds. The world wide web might contain the biggest store of historical information, but it has a terrible memory when it comes to its own development.

The web generally likes to express itself in the present tense, dressed in up-to-the-minute design and forgetful of other iterations along the way. It is no country for old websites. Geocities web pages and early Gif image files don't quite have the cachet of stately buildings or old leather-bound books – if they are preserved at all.

“I think we’re at a time now where my kids can’t even understand the concept of the web being invented,” says Noyes. “They don’t understand what that even means. They ask what kind of web pages I surfed as a kid. They can’t understand that it’s new . . . It’s not like books, where you can just find earlier editions. Digital stuff fades away all the time.”

But there is increasing interest in the distant digital past. As well as projects such as that under way at Cern, there are more whimsical sites, such as internetarchaeology.org, that collect “artefacts of early internet culture”, the cheerfully tacky web pages, Gifs, Midi jingles and animations that cluttered the desktops and minds of people in the 1990s.

The collection includes science-fiction fan communities, UFO-obsessed conspiracy sites and – be warned – some early porn sites. What’s clear is that the web artefacts that have survived have done so largely by accident.

Peter Flynn, the webmaster at University College Cork and manager of its digital publishing unit, was responsible for Ireland's first web server – "We were the bottom link on Tim [Berners-Lee]'s webpage" – and he understands the fragility of digital data.

The earliest version of the site he helped to create is contained on a broken web server on the floor of his office. “I’m sending it to a data-recovery specialist,” he says, explaining that it’s recoverable.

“The problem is that early web users who put up web pages overwrote them with updates. In many cases it’s the same file overwritten thousands of times but now nothing like the original.

“In general, the idea that you should hang on to your electronic stuff is something most ordinary people don’t consider. They might keep it on their hard disk and copy it to another machine and another machine, but then if something blows they don’t have it.”

The fragility of online content finds some web philosophers worrying about a “digital dark age”.

People of a more catastrophic mindset fear disasters that could destroy digital knowledge in one fell swoop, but even optimists acknowledge that digital information falls through the cracks with each dead website and each new update.

The internet is constantly recontextualising itself. Whole archives can disappear overnight – soon after the closure of the Sunday Tribune, a newspaper I once wrote for, its large online archive of articles vanished – and personally significant sites can evaporate without warning.

Memory problem
The activist, computer scientist and digital librarian Brewster Kahle has been trying to solve the internet's memory problem with the Internet Archive (archive.org).

That organisation is working, in collaboration with the US Library of Congress and the Smithsonian Institution, to digitise all culture as well as “to prevent the internet – a new medium with major historical significance – and other ‘born-digital’ materials from disappearing into the past.”

An offshoot of that site, the Wayback Machine, allows people to access archived websites, many of which no longer exist. "It's an out-of-print web-pages service," says Kahle.

“We try to take a snapshot of every web page on every website every two months. The total collection is now more than 300 billion pages. We thought it would be like a research-library collection, but it’s more than that . . . It’s used by about 600,000 people a day.”

His site is addressing a real practical problem as well as fulfilling some emotional needs. I have just spent a nostalgic evening using it to explore websites and blogs I loved and contributed to in the 1990s.

“The average life of a web page is 100 days before it is deleted or changed,” he says. “If the average life of a web page is 100 days, then the best of the web is not on the web. It’s already off. It’s already gone.”

And this, he believes, should not be simply shrugged off. “The internet is our civic space. It’s a mirror of our institutions and who we are.

“As we spend more of our lives staring into these damn screens we put more of our stuff into it, but it’s run on servers owned by corporations, and they come and go. At least their projects do.

"We've archived Geocities, Yahoo Video, Google Video, now all gone . . . How do we deal with that? We're pretty good at dealing with that with books and records. Books go out of print, but our bookshelf still has the book. Online things just evaporate . . . I think the Wayback Machine is the only real sense that there is a history [online]."

At the moment it’s simply too costly for companies or individuals to save earlier incarnations of their websites, and governmental organisations don’t seem to be interested in doing so. “Maybe as web storage gets cheaper [that process] might be automated,” says Noyes.

“But I don’t think people will do it if left to themselves. We’re too lazy. We won’t bother to record everything we do.”

“If you want to be sure of keeping something,” says Flynn, “print it out on paper.”