The fastest spiders in the world

In its first week it had 300,000 requests a day. Within three weeks, that had jumped to two million, with no advertising except word of mouth. Now hovering at 30 million hits a day, it has been onwards and upwards ever since for AltaVista, the search engine which resides in Digital Equipment Corporation's Western Research Laboratory in Palo Alto, California.

Behind a door labelled "Technoland", the blinking, humming world of 400 high performance computers is laid out in long rows. Lining the walls on storage hooks are dozens of snaky grey and black cables and leads. Two racks of 28 high bandwidth T1 phone lines - Internet links able to carry 1.544 megabits of data per second - serve this room alone.

A huge battery run generator unit will keep the room operational during a power failure, as happened in the region's massive 1989 earthquake. A refrigerator sized unit houses the multiple gigabyte racks of AltaVista's indexing memory. Another small server is home to Scooter, the specially designed Internet "spider" which clambers through 10 million Web pages a day, enabling every single word to be indexed.

But the "real" AltaVista consists of three floor level units, labelled Gotcha 1, 2 and 3, each barely the size of a hi fi receiver. "I'd love to show you a marble and walnut computer with three people guarding it," apologises Brian Reid, director of Digital's Network Systems Laboratory. "But this is it."

Each of the three units handles a third of the 30 million hits zinging in daily - an average of 500,000 queries per hour, peaking at 720,000 in the early evening, when Californians arrive home from work and crank up their browsers.

"One unit could handle the load, but we use three for backup; three instead of two because it's obviously very important," explains Reid, a solidly built, grey haired technowhiz with a laconic drawl.

The Western Research Lab is one of the Digital labs in Palo Alto. The city, 35 miles south of San Francisco, has always had the reputation of being a research hothouse. In the 1950s, two young men named Hewlett and Packard invented a technology cliche here - the multi million dollar electronics firm born in a garage. It's also home to Xerox's famous PARC (the Palo Alto Research Center), in which virtually every element of modern computing, from graphical user interfaces to networking and laser printing, was invented in the 1960s and 1970s.

Palo Alto can lay claim to being the most wired city in the world, sitting in a region with the highest number of Internet connections anywhere. Always looking ahead, the city itself is busy laying high speed fibre optic cable under its streets which will eventually run into businesses and homes.

Digital's labs, with a $1.1 billion annual budget, are part of that tradition: they hire intensely creative engineers, fund them, and see what happens. Palo Alto is one place on earth where engineers are an admired elite - not least because they pull down the salaries that keep maitre d's and luxury car salesmen properly ingratiating.

One of those engineers, Paul Flaherty, was in Florence in May 1995 to hear his wife present a computing paper when he started thinking about creating "a decent demo" on the (then young) Web - to show off Digital's new Alpha servers. These are muscular, 64 bit computers - well able to handle large memory applications (most computers are either 8, 16 or 32 bit machines). While using one of the then excruciatingly slow search engines on the Web, he realised he had his project.

Working with Louis Monier, now the technical director of AltaVista, he swiftly fleshed out the idea, created the spider and wrote the operating software. By August they had an internal demo. By December 15th that year it was on the Web.

Flaherty says the most frequent query is for common names (which has led to a company cocktail party game: "You throw your name in and your status depends on how many hits come back").

If the Web crumbled tomorrow, the entire text could be rebuilt from AltaVista's page index, which contains all the words for some 31 million Web pages in its memory. In order to classify all that text, AltaVista takes a "vocabulary based approach", says Flaherty. Web pages are actually stored as a sequence of numbers based on words. The index automatically assigns a number to each unique word, rather than letter, on the Web. If it recognises a word, it gives it its preassigned number. If it's a new word, it creates a new number.

So the 40 gigabyte index is manageable because it sorts text by word rather than by letter, which compresses the space a given page occupies in the index's memory. As a result, AltaVista can respond to the average search request in 0.7 seconds. It's also designed to accommodate human fallibility, recognising a certain number of misspelled search terms.
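For readers who want to see the word numbering idea in miniature, here is a rough Python sketch. It is an illustration only, not AltaVista's actual software: the `Vocabulary` class, its method names and the sample sentences are all invented for this example, and a real index would also handle punctuation, word positions and disk storage.

```python
from typing import Dict, List


class Vocabulary:
    """Hypothetical word-to-number table: one integer per unique word."""

    def __init__(self) -> None:
        self.word_to_id: Dict[str, int] = {}
        self.id_to_word: List[str] = []

    def encode_word(self, word: str) -> int:
        # Known word: reuse the number it was given earlier.
        if word in self.word_to_id:
            return self.word_to_id[word]
        # New word: create the next unused number for it.
        new_id = len(self.id_to_word)
        self.word_to_id[word] = new_id
        self.id_to_word.append(word)
        return new_id

    def encode_page(self, text: str) -> List[int]:
        # Store a page as a sequence of word numbers, not letters.
        return [self.encode_word(w) for w in text.lower().split()]

    def decode_page(self, ids: List[int]) -> str:
        # Rebuild the text of a page from its stored numbers.
        return " ".join(self.id_to_word[i] for i in ids)


vocab = Vocabulary()
page_a = vocab.encode_page("the fastest spiders in the world")
page_b = vocab.encode_page("the world of spiders")

print(page_a)                     # [0, 1, 2, 3, 0, 4]
print(page_b)                     # [0, 4, 5, 2]
print(vocab.decode_page(page_a))  # the fastest spiders in the world
```

In this toy version the second page reuses the numbers already assigned to "the", "world" and "spiders", and only "of" gets a new one. Repeated words cost one number each rather than a string of letters, and the original text can still be rebuilt from the stored numbers.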

In the near future, the engine will also be able to conduct multimedia searches to locate sound and image files (a rival search engine, Lycos, can already do this). AltaVista's blistering response times quickly attracted plenty of attention on the sluggish Web, and Digital - seemingly to its initial bemusement - suddenly found itself with a marketing dream: 30 million people a day voluntarily coming to see its corporate logo.

Within months, Digital had rebranded all its Internet/intranet products with the AltaVista name. Score one for the Western Research Lab.

Karlin Lillington

Karlin Lillington, a contributor to The Irish Times, writes about technology