Search sucks. No matter how clever your search engine is, how many page-ranking formulae you apply or how many super-speedy processors you throw at it, the current way of searching the internet doesn't work very well.
Searching by keyword misses out on a boatload of stuff, for one simple reason: many of the documents you might find useful do not contain the keyword. Consider this: you want to find documents on Iraqi politics. You're a leader writer, perhaps, and one sherry too far gone. You turn to Google, and what do you search for? "Iraqi politics"? "Saddam Hussein"? "Abd al-Rahman Arif"? Well, yes, all of these - and each will turn up useful documents, but none gives the whole picture. You want the search to return not just the keyword hits, but documents on the same topic that don't necessarily mention the keyword.
What you need is a fast-developing technique called latent semantic indexing (LSI), which solves this problem by letting the search engine give you back documents that are related to the keyword but don't necessarily contain it.
Here's how it works: Take all the documents you want to index, and remove all the definite and indefinite articles, the conjunctions, the prepositions and so on. Then take off the case endings and plurals, until all you have left are proper nouns and the word stems. Now imagine you are making a table of all of the words you have along one axis, and all of the documents along another.
Each document's column gets an x for every word it contains, and each word's row gets an x for every document it appears in. Here's where it gets a bit trippy. You get your computer to make one of these tables - or "matrices" - but this time treating every word as an axis of its own, so that each document becomes a point in that word-space.
This means a matrix in thousands of dimensions, and no amount of sherry will help you picture it in your head. Your computer, however, will have no problem. Next, you use a mathematical technique called singular value decomposition to "squish" those dimensions down to a more usable number - from 10,000, say, to 100. What you end up with is a space in which documents with similar content start to clump together.
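For the curious, here is a minimal sketch of that table-building and squishing in Python, using NumPy. The four-document corpus, the stop-word list and the chop-off-the-s "stemming" are all invented for illustration; a real index would hold thousands of documents and words.

```python
import numpy as np

# A toy corpus - entirely made up for this sketch.
docs = [
    "the politics of iraq and its leaders",
    "saddam hussein led iraq for decades",
    "political leaders in the middle east",
    "recipes for cooking with sherry",
]

# Articles, conjunctions, prepositions and the like get thrown away.
stop_words = {"a", "an", "the", "of", "and", "its", "for", "in", "with"}

def terms(text):
    # Crudely chop a trailing "s" as a stand-in for real stemming.
    return [w.rstrip("s") for w in text.split() if w not in stop_words]

vocab = sorted({w for d in docs for w in terms(d)})

# The table: one row per word, one column per document, counts instead of x's.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in terms(d):
        A[vocab.index(w), j] += 1

# "Squish" the word dimensions down with singular value decomposition,
# keeping only the k strongest ones.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
```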
By applying a little more arithmetical jiggery-pokery - multiplying the squished-down pieces back together - you can end up with a two-axis matrix again, but this time with numbers instead of crosses. The higher the number, the more relevant that word is to the document, even if it doesn't appear in it.
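Continuing the toy sketch above, that step is a single matrix multiplication, giving a reduced-rank version of the original table:

```python
# Multiply the squished pieces back together: a two-axis matrix again,
# but with numbers instead of crosses.
A_k = U_k @ np.diag(s_k) @ Vt_k

# A word can now carry a non-zero score for a document that never
# contained it, pulled in by the documents it clumps with.
print(vocab)
print(np.round(A_k, 2))
```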
Your search engine can then find not only the documents that contain the keyword, but also the other documents clumped around them - ones similar enough to the keyword hits to be relevant.
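Here is what that retrieval looks like on the toy index above: a one-word query is treated as a tiny document, folded into the squished space, and every indexed document is ranked by cosine similarity, so documents that never mention the keyword can still score well. The fold_in helper is this sketch's own invention, not a fixed part of any particular search engine.

```python
def fold_in(word_counts):
    # Project a column of raw word counts into the k-dimensional space.
    return np.diag(1.0 / s_k) @ U_k.T @ word_counts

query = np.zeros(len(vocab))
query[vocab.index("politic")] = 1      # the keyword "politics", stemmed

q_vec = fold_in(query)
doc_vecs = Vt_k.T                      # one k-dimensional vector per document
scores = doc_vecs @ q_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-12
)
for score, d in sorted(zip(scores, docs), reverse=True):
    print(f"{score:+.2f}  {d}")
```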
Curiously enough, all this actually works. A team from the National Institute for Technology and Liberal Education (NITLE) in the US has been building on the idea. It wasn't theirs originally, but desktop hardware has only just caught up with the processing power needed to do the "matrix squishing" on usefully large document collections.
The lead developer, Maciej Ceglowski, is full of ideas on how the technology can be used: "It's good on large collections of text, written in a formal style: libraries of academic research, for example," he says. Biology, he points out, can benefit massively from LSI. The SVD algorithm doesn't actually require text at all. There is no language understanding, just a count of word frequency. If you take mass spectrographs of complex molecules, and treat each molecule as a document, and each peak on the spectrograph as a word, you can build searchable indexes in just the same way you can with text.
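To see how little the technique cares about language, here is the same table built from entirely made-up spectrograph data - each molecule a "document", each peak position a "word". From here on the pipeline is identical.

```python
import numpy as np

# Invented peak lists: molecule name -> positions of peaks on its spectrograph.
molecules = {
    "molecule_a": [77, 105, 182],
    "molecule_b": [77, 105, 183, 212],
    "molecule_c": [44, 58, 91],
}
peaks = sorted({p for ps in molecules.values() for p in ps})

# Same table as before: one row per "word" (peak), one column per "document".
M = np.zeros((len(peaks), len(molecules)))
for j, ps in enumerate(molecules.values()):
    for p in ps:
        M[peaks.index(p), j] = 1

# From here on, nothing changes: decompose, squish, compare columns.
```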
This could be revolutionary for medical science. Paste an entire text document into the search box and an LSI system will give you back a list of similar documents: a sort of "More Like This" search. While this is useful for text, it becomes far more powerful when applied to proteins. From a database of thousands of molecules, you can use LSI to find close matches and clusters - places in the matrix where the molecules sit together - and you might even find similarities you didn't know about.
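One last scrap of the toy sketch shows the "More Like This" step for text: paste in a whole new document (the one below is invented), fold it into the same space with the fold_in helper from earlier, and rank the indexed documents by similarity.

```python
new_doc = "a history of political leaders and politics in iraq"

counts = np.zeros(len(vocab))
for w in terms(new_doc):
    if w in vocab:                     # words the index has never seen are skipped
        counts[vocab.index(w)] += 1

new_vec = fold_in(counts)
sims = doc_vecs @ new_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(new_vec) + 1e-12
)
for sim, d in sorted(zip(sims, docs), reverse=True):
    print(f"{sim:+.2f}  {d}")
```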
One other possible use of LSI is as an automated essay marker. Given a big enough body of knowledge about a certain topic, an LSI program could take your homework, mark it and even suggest areas you haven't covered.
It's the perfect tool for the journalist, too: given live data from the wire services and newspapers around the world, our sherry-sozzled speculator could not only have all the relevant information at his fingertips, but a machine telling him what he's missed out. Perfect for that after-lunch writer's block. - (Guardian Service)