Automatic Thesaurus

Last week, I landed on another PhD worthy research project.

Given a very large corpus of sentences, such as a digitized version of the Library of Congress, or a less noisy version of the Internet, how can you automatically generate a Thesaurus?

At first I thought the problem should be fairly easy, but the more I thought about it the more difficult and daunting the task became. For example, as a first approach, we might assume that textual substitution would be a good proxy for identifying synonymous terms. That is if a couple terms really are synonymous, then they ought to be substitutable for each other in a sentence. With a large enough set of sentences, we should be able to identify such situations, and thereby bootstrap the building of the thesaurus. But there’s a small problem, provided by my good friend EvB:

The sky is blue.
The ocean is blue.

But sky is not the same as ocean. Sure they are similar. A poet could compose a nice metaphor of fish swimming through their sky above the bottom feeders. But this metaphorical relationship isn’t one that would necessarily make it into a human compile thesaurus. So, textual substitution can easily lead us astray.

Continuing with EvB’s particularly good example we can also identify another problem. Suppose that we incorporate a bit of natural language understanding, enough to pull out parts of speech. Then, the system would easily identify the equation of sky with blue, or ocean with blue. But neither of these statements is true either. Usually people take the example to mean not that sky and blue are the same thing, but that the sky belongs to the set of objects that have a property called color, the value of which is blue. So this understanding depends on what the definition of ‘is’ is (obviously not a simple affair). We also would like to avoid drawing a relationship between any of the pronouns and the rest of the language.

Next lets look at how people tend to write. Any good library is gonna be full of metaphor, simile, pun, allusion, word play, sound play, and other such highly nuanced expression. All of these things will trump any reasonably simple attempt at drawing a link between synonymous words. Political propaganda and polemic, will probably be particularly bad at equating terms that should probably be kept logically distinct. Furthermore, at least when I write, I’m reminded of other things during the process, things that are associated, but not necessarily synonymous. These remindings are an important part of the essay writing process, but will certainly throw noise into the digital library.

But if it’s so hard to make a mechanical system for identifying synonyms, then how do humans do it? Here I have a hypothesis: that similar words stimulate similar patterns in the brain. Thus when a human tries to think up synonyms it’s really the same as playing word association with a filter. First, the word stimulates the brain, bring up certain associations. These associations will be based on ‘brain distance’, a measure of the similarity of brain activity for certain words and thoughts. But some associations will be radically different from the synonyms that we’re looking for. For example, antonyms and non-sequiturs often come up in word association games. So a filter is applied to weed these out, and what’s left is passed through a dictionary/meaning check. Anything passing this process will be reported as a synonymous term.

So, in order to really generate a thesaurus, we do need AI (or at least an underlying cognitive model). When I first thought of the thesaurus problem, I was hoping that it was paired down enough, small and simple enough that it would be doable without all this complexity. We might have to reduce the problem further, make it looser. Say, build an association dictionary, rather than a thesaurus. An association dictionary might be possible, because it forgoes the understanding of meaning and similarity, it doesn’t have to question or measure why two words should be associated, only record that they are used similarly.

So, if you can automate the building of a thesaurus, you should get a PhD in Linguistics.