Google based thesaurus
I was thinking today about language and grammar recognition by machines, as used for auto-translating, document rewriting, etc. I need a thesaurus of phrases for rewriting documents. I'm trying to work out how a bot could compile one, and it occurs to me that Google has a huge database of English text from which to derive rules.
Suppose you were to search for ''A banana is a'' (with the double quotes), taking only the sentences which begin with that phrase. Google returns results containing:
a banana is a banana
a banana is a fruit
a banana is a tropical herbaceous plant
a banana is a good source of water
a banana is a tropical fruit
a banana is a phallic symbol
a banana is a monoecious plant
a banana is a healthy snack
If a bot trims out the rest of the sentences then this can be used to create relationships between nouns.
banana --> fruit
banana --> phallic symbol
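A minimal sketch of the trimming step, assuming we already have the result snippets as strings (the function name and the sample snippets are my own, not from a real search API):

```python
import re

def extract_is_a(snippets, noun):
    """Extract 'Y' from snippets of the form '<noun> is a Y',
    trimming the rest of the sentence at the first punctuation mark."""
    pattern = re.compile(
        r'\b' + re.escape(noun) + r'\s+is\s+an?\s+(.+?)(?:[.,;!?]|$)',
        re.IGNORECASE)
    relations = set()
    for s in snippets:
        m = pattern.search(s)
        if m:
            obj = m.group(1).strip().lower()
            if obj != noun:  # skip tautologies like 'a banana is a banana'
                relations.add(obj)
    return relations

snippets = [
    "A banana is a fruit.",
    "A banana is a banana.",
    "Everyone knows a banana is a phallic symbol, right?",
]
print(extract_is_a(snippets, "banana"))
```

A real bot would feed this the snippets from each results page and accumulate the noun --> noun pairs into a database.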
If this is done for other nouns we might get:
apple --> fruit
apple --> computer
So having done this for a bunch of words, we can make a list of things 'which are' fruit. So far we've got 'apple' and 'banana', and we can do some text substitutions: 'she was eating a banana' can be substituted with 'she was eating a fruit'. If we substitute 'she was eating a phallic symbol' it's still grammatically correct (and sounds kinda sexy) but we've lost the meaning of the original phrase, which is no good if we're rewriting documents that humans will read. So how's a computer going to know which is the better substitution?
It's a tough one. My best answer at the moment is to have the bot see what humans use more often, i.e., Google both terms and see which comes up more often.
'she was eating a fruit' => 95
'she was eating a phallic symbol' => 0
'she was eating a snack' => 164
'she was eating a monoecious plant' => 0
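Picking the winner is then just a matter of taking the candidate phrase with the highest hit count. A sketch, with the counts above hard-coded (a real bot would pull them from a search API):

```python
def best_substitution(counts):
    """Given a dict of candidate phrase -> hit count,
    return the phrase with the highest count."""
    return max(counts, key=counts.get)

# Hit counts from the searches above (a real bot would query for these)
phrase_counts = {
    "she was eating a fruit": 95,
    "she was eating a phallic symbol": 0,
    "she was eating a snack": 164,
    "she was eating a monoecious plant": 0,
}
print(best_substitution(phrase_counts))  # -> she was eating a snack
```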
Now we have a score for each substitution. To make a general case for each word (so as not to have to search every time, and because many phrases will not exist at all) we could search nouns against verbs for proximity, and the number of Google matches will be the score for how appropriate they are to each other.
''eat * banana'' OR ''banana * eat'' => 142,000
''eat * snack'' OR ''snack * eat'' => 269,000
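Building those wildcard queries is mechanical. A tiny hypothetical helper (Google's * wildcard matches one or more intervening words, so this catches 'eat a banana', 'banana to eat', etc.):

```python
def proximity_query(verb, noun):
    """Build a Google-style wildcard query pairing a verb and a
    noun in either order."""
    return f'"{verb} * {noun}" OR "{noun} * {verb}"'

print(proximity_query("eat", "banana"))
# -> "eat * banana" OR "banana * eat"
```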
We can also test verb substitutions from a regular thesaurus this way. For example Roget's lists nosh, chow and masticate as alternatives to 'eat'.
'eat * banana' => 124,000
'nosh * banana' => 13
'chow * banana' => 303
'masticate * banana' => 4
So 'chow' is the most likely substitute for 'eat' out of these three (personally I prefer masticate) but it's not a very common switch. If chow and eat had similar scores (say within 66% of each other) then that would likely be a better substitution.
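The 66% rule above could be sketched like this (the function name and threshold parameter are mine; the counts are the ones from the searches above):

```python
def acceptable_verb_swaps(base_verb, counts, threshold=0.66):
    """Return verbs whose proximity score is within `threshold`
    of the base verb's score. `counts` maps verb -> hit count."""
    base = counts[base_verb]
    return [v for v, c in counts.items()
            if v != base_verb and c >= threshold * base]

counts = {"eat": 124000, "nosh": 13, "chow": 303, "masticate": 4}
print(acceptable_verb_swaps("eat", counts))  # -> [] (none close enough)
```

With these numbers nothing passes, which matches the conclusion: 'chow' wins among the three but isn't common enough to be a safe swap.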
Ultimately I'd like to be able to make a bot rewrite text into infinite permutations, retaining the original English (human) meaning as well as some of its nuance.
I'm sure it's possible; I'm not sure how. Try 'He attended MIT to study' (remove the object from the sentence). Googling for 'He * MIT to study' gives:
he was at MIT to study
he was accepted at MIT to study
he came to MIT to study
he entered MIT to study
As well as a bunch of bad substitutions, most of which can be filtered out by context. 'He returned to MIT to study' would be harder for a machine to spot as a bad substitution because it changes the meaning.
Thinking.... thinking.... thinking....
Any thoughts or ideas, email me!
Created 2006-02-10 19:46:40 by 216 and filed under hacking