Google-based thesaurus

I was thinking today about language and grammar recognition by machines, as used for auto-translating, document rewriting, etc. I need a thesaurus of phrases for rewriting documents. I'm trying to work out how a bot could compile one, and it occurs to me that Google has a huge database of English text from which to derive rules.

Suppose you were to search for "A banana is a" (with the double quotes). Taking only the sentences which begin with that phrase, Google returns results containing:

a banana is a banana
a banana is a fruit
a banana is a tropical herbaceous plant
a banana is a good source of water
a banana is a tropical fruit
a banana is a phallic symbol
a banana is a monoecious plant
a banana is a healthy snack

If a bot trims off the rest of each sentence, what's left can be used to create relationships between nouns.

banana --> fruit
banana --> phallic symbol
...etc
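Here's a rough sketch of that trimming step in Python. It assumes the bot already has the result sentences as plain strings (how it gets them out of Google is left open), and the function name extract_is_a is made up for illustration. Skipping 'a banana is a banana' is my own assumption, since a word pointing at itself isn't much use in a thesaurus.

```python
import re
from collections import defaultdict


def extract_is_a(noun, sentences):
    """Collect 'noun --> category' pairs from sentences that begin 'a <noun> is a ...'."""
    pattern = re.compile(rf"^an? {re.escape(noun)} is an? (.+)$", re.IGNORECASE)
    relations = defaultdict(list)
    for sentence in sentences:
        match = pattern.match(sentence.strip().rstrip("."))
        if not match:
            continue  # the sentence does not begin with the search phrase
        category = match.group(1).strip().lower()
        # Assumption: drop tautologies such as "a banana is a banana".
        if category != noun.lower() and category not in relations[noun]:
            relations[noun].append(category)
    return dict(relations)


# The result sentences quoted above, used as sample input.
results = [
    "a banana is a banana",
    "a banana is a fruit",
    "a banana is a tropical herbaceous plant",
    "a banana is a good source of water",
    "a banana is a tropical fruit",
    "a banana is a phallic symbol",
    "a banana is a monoecious plant",
    "a banana is a healthy snack",
]

banana_relations = extract_is_a("banana", results)
for category in banana_relations["banana"]:
    print(f"banana --> {category}")
```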

If this is done for other nouns we might get:

apple --> fruit
apple --> computer

So having done this for a bunch of words we can make a list of things 'which are' fruit. So far we've got 'apple' and 'banana', and we can do some text substitutions. 'she was eating a banana' can be substituted with 'she was eating a fruit'. If we substitute 'she was eating a phallic symbol' it's still grammatically correct (and sounds kinda sexy) but we've lost the meaning of the original phrase, which is no good if we're rewriting a document that humans will read. So how's a computer going to know which is the better substitution?

It's a tough one. My best answer at the moment is to have the bot see what humans use more often, i.e. Google each candidate phrase and see which comes up more often.

'she was eating a fruit' => 95
'she was eating a phallic symbol' => 0
'she was eating a snack' => 164
'she was eating a monoecious plant' => 0
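The ranking itself is simple once the counts are in hand. In this sketch hit_count is a stand-in for whatever function actually asks Google how many results a quoted phrase has. Here it just looks the phrase up in a dictionary holding the four counts above.

```python
def rank_substitutions(phrases, hit_count):
    """Return (phrase, hits) pairs, most frequently used phrase first."""
    scored = [(phrase, hit_count(phrase)) for phrase in phrases]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# The counts quoted above; a real bot would fetch these from a search engine.
observed = {
    "she was eating a fruit": 95,
    "she was eating a phallic symbol": 0,
    "she was eating a snack": 164,
    "she was eating a monoecious plant": 0,
}


def hit_count(phrase):
    # Hypothetical stand-in: phrases we have no count for score zero.
    return observed.get(phrase, 0)


ranked = rank_substitutions(list(observed), hit_count)
best_phrase, best_hits = ranked[0]
print(best_phrase, best_hits)
```

Phrases that score zero can simply be thrown away, which gets rid of the phallic symbol and the monoecious plant in one go.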

Now we have a score for each substitution. To make a general case for each word (so as not to have to search each time, and because many phrases will not exist at all) we could search nouns against verbs for proximity, and the number of Google matches will be the score for how appropriate they are to each other.

"eat * banana" OR "banana * eat" => 142,000
"eat * snack" OR "snack * eat" => 269,000
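The point of the general case is that the bot searches once per verb and noun pair and keeps the answer. A sketch of that lookup table, assuming the same kind of hypothetical hit_count function as before (faked here with the two counts above):

```python
def proximity_query(verb, noun):
    """Build the wildcard query that matches the two words in either order."""
    return f'"{verb} * {noun}" OR "{noun} * {verb}"'


def build_affinity_table(verbs, nouns, hit_count):
    """Score every verb/noun pair once so later rewrites need no new searches."""
    return {
        (verb, noun): hit_count(proximity_query(verb, noun))
        for verb in verbs
        for noun in nouns
    }


# Stand-in for a real search: only the two counts quoted above are known.
known_counts = {
    proximity_query("eat", "banana"): 142_000,
    proximity_query("eat", "snack"): 269_000,
}


def hit_count(query):
    return known_counts.get(query, 0)


table = build_affinity_table(["eat"], ["banana", "snack"], hit_count)
print(table[("eat", "banana")], table[("eat", "snack")])
```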

We can also test verb substitutions from a regular thesaurus this way. For example Roget's lists nosh, chow and masticate as alternatives to 'eat'.

'eat * banana' => 124,000
'nosh * banana' => 13
'chow * banana' => 303
'masticate * banana' => 4

So 'chow' is the most likely substitute for 'eat' out of these three (personally I prefer masticate) but it's not a very common switch. If chow and eat had similar scores (say within 66% of each other) then that would likely be a better substitution.
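As a sketch of that test: 'within 66% of each other' is read here as 'the smaller count is at least 66% of the larger one', which is only one possible interpretation. The function names are made up for illustration and the counts are the ones above.

```python
def similarity(hits_a, hits_b):
    """Ratio of the smaller count to the larger one, between 0.0 and 1.0."""
    larger = max(hits_a, hits_b)
    return min(hits_a, hits_b) / larger if larger else 0.0


def pick_substitute(original_hits, candidate_hits, threshold=0.66):
    """Return the most common candidate and whether it is close enough to the original."""
    best = max(candidate_hits, key=candidate_hits.get)
    close_enough = similarity(original_hits, candidate_hits[best]) >= threshold
    return best, close_enough


# Counts for '<verb> * banana' as quoted above.
eat_hits = 124_000
thesaurus_hits = {"nosh": 13, "chow": 303, "masticate": 4}

best, close_enough = pick_substitute(eat_hits, thesaurus_hits)
print(best, close_enough)
```

With these numbers 'chow' wins but falls far short of the threshold, so the bot would leave 'eat' alone.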

Ultimately I'd like to be able to make a bot rewrite text into infinite permutations, retaining the original English (human) meaning as well as some of its nuance.

I'm sure it's possible; I'm just not sure how. Try 'He attended MIT to study' (remove the verb from the sentence). Googling for 'He * MIT to study' gives:

he was at MIT to study
he was accepted at MIT to study
he came to MIT to study
he entered MIT to study

There's also a bunch of bad substitutions, most of which can be filtered out by context. 'he returned to MIT to study' would be harder for a machine to spot as a bad substitution because it changes the meaning.
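Pulling the fillers out of results like these is the easy part. A sketch, assuming the results are already plain strings and using made-up helper names. It turns the wildcard phrase into a regular expression and keeps whatever landed in the gap. It does nothing about the hard part, which is spotting fillers like 'returned to' that change the meaning.

```python
import re


def wildcard_to_regex(pattern):
    """Convert a phrase such as 'He * MIT to study' into a regex with one group per '*'."""
    parts = [re.escape(part.strip()) for part in pattern.split("*")]
    return re.compile("^" + r"\s+(.+?)\s+".join(parts) + "$", re.IGNORECASE)


def extract_fillers(pattern, results):
    """Return the text that filled the wildcard in each matching result."""
    regex = wildcard_to_regex(pattern)
    fillers = []
    for result in results:
        match = regex.match(result.strip())
        if match:
            fillers.append(match.group(1))
    return fillers


# The results quoted above.
results = [
    "he was at MIT to study",
    "he was accepted at MIT to study",
    "he came to MIT to study",
    "he entered MIT to study",
]

fillers = extract_fillers("He * MIT to study", results)
print(fillers)
```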

Thinking.... thinking.... thinking....

Any thoughts or ideas, email me!

Created 2006-02-10 19:46:40 by 216 and filed under hacking

Developing Nations Licence.
dumbbuthappy at gmail dot com
