Thursday, October 23, 2008

Things They Should Invent: Google Corpus

A lot of the things I dislike about Google (localizing search results, overdoing it with the "Did you mean…?", including synonyms, making assumptions about what I want) occur when I'm trying to use it as a corpus instead of as a search engine. These functions are useful for people trying to find information for real-life purposes; it's only for terminological, phraseological, and linguistic research that they're unhelpful (unless there's another area where they're also unhelpful that I can't think of right now).

So why not make another Google just for our obscure langling needs? They already have everything they need - the Google index is probably the largest corpus in the world, and Google is, obviously, the best search engine in the world. Just take away the localization and other unhelpful functions. Perhaps add a few more precision operators (so you can search for two words near each other, or use a wildcard that represents any preposition). Perhaps make it possible to compare the number of results for multiple searches side by side (Googlefight has this functionality in its own unique way). Integrate as many publications and academic databases as possible (if you're stuck on copyright issues, you wouldn't have to make the texts themselves accessible through Google Corpus, just show the applicable snippets in the results). Then you'll have the best possible tool for us language freaks. You can improve the quality of translations everywhere and make life easier for linguistic researchers (and anyone else who needs a corpus of naturally occurring language), and it will take practically no effort. You could just remove the localization function, call it a Beta version, and put it up in Google Labs today!
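To make the wished-for operators a bit more concrete, here is a toy sketch in Python of a "near" search, a preposition wildcard, and side-by-side counts. Everything in it is an assumption for illustration: the four-sentence corpus, the short preposition list, and the function names `near` and `count_prepositions_after` are all made up, and none of this reflects how Google works or any operator it offers.

```python
import re

# Hypothetical and deliberately incomplete preposition list, for illustration only.
PREPOSITIONS = {"in", "on", "at", "by", "for", "with", "to", "of", "from", "than"}

# Made-up miniature corpus standing in for a real index.
CORPUS = [
    "She is different from her sister.",
    "This one is different from the rest.",
    "It is different to what I expected.",
    "The corpus is a large body of text.",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def near(text, first, second, max_gap=3):
    """Return True if the two words occur within max_gap tokens of each other."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= max_gap
               for i in first_positions for j in second_positions)


def count_prepositions_after(sentences, word):
    """Preposition wildcard: count each preposition that directly follows word."""
    counts = {}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for current, following in zip(tokens, tokens[1:]):
            if current == word and following in PREPOSITIONS:
                counts[following] = counts.get(following, 0) + 1
    return counts


if __name__ == "__main__":
    # Side-by-side comparison of "different from" vs. "different to".
    for prep, n in sorted(count_prepositions_after(CORPUS, "different").items()):
        print(f"different {prep}: {n}")
    print(near(CORPUS[3], "large", "text"))
```

The counts only mean something because nothing is localized, corrected, or expanded into synonyms along the way, which is exactly what a corpus search needs.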
