Google-based thesaurus

I was thinking today about language and grammar recognition by machines, as used for auto-translating, document rewriting, etc. I need a thesaurus of phrases for rewriting documents. I'm trying to work out how a bot could compile one, and it occurs to me that Google has a huge database of English text from which to derive rules.

Suppose you were to search for ''A banana is a'' (with the double quotes). Taking only the sentences which begin with that phrase, Google returns results containing:

a banana is a banana
a banana is a fruit
a banana is a tropical herbaceous plant
a banana is a good source of water
a banana is a tropical fruit
a banana is a phallic symbol
a banana is a monoecious plant
a banana is a healthy snack

If a bot trims out the rest of the sentences then this can be used to create relationships between nouns.

banana --> fruit
banana --> phallic symbol
...etc
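As a rough sketch of that trimming step, here's how a bot might pull the pairs out in Python. It assumes the search results have already been fetched and cut down to sentences that start with the phrase (the SNIPPETS list is just the results above, and the names PATTERN and extract_relations are made up for illustration). It doesn't talk to Google at all.

```python
import re

# Hypothetical input: result snippets already trimmed to sentences that
# start with the search phrase (copied from the list above).
SNIPPETS = [
    "a banana is a banana",
    "a banana is a fruit",
    "a banana is a tropical herbaceous plant",
    "a banana is a good source of water",
    "a banana is a tropical fruit",
    "a banana is a phallic symbol",
    "a banana is a monoecious plant",
    "a banana is a healthy snack",
]

# Matches "a <noun> is a <description>", stopping at the first punctuation
# mark so the rest of a longer sentence is trimmed off.
PATTERN = re.compile(r"^an?\s+(\w+)\s+is\s+an?\s+([^.,;:!?]+)", re.IGNORECASE)


def extract_relations(snippets):
    """Return (noun, description) pairs, skipping 'X is an X' tautologies."""
    relations = []
    for text in snippets:
        match = PATTERN.match(text.strip())
        if not match:
            continue
        noun = match.group(1).lower()
        description = match.group(2).strip().lower()
        if description != noun:
            relations.append((noun, description))
    return relations


for noun, description in extract_relations(SNIPPETS):
    print(f"{noun} --> {description}")
```

This only handles single-word nouns and the 'a/an X is a/an Y' shape; anything fancier would need a proper parser.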

If this is done for other nouns we might get:

apple --> fruit
apple --> computer

So having done this for a bunch of words we can make a list of things 'which are' fruit. So far we've got 'apple' and 'banana', and we can do some text substitutions. 'she was eating a banana' can be substituted with 'she was eating a fruit'. If we substitute 'she was eating a phallic symbol' it's still grammatically correct (and sounds kinda sexy) but we've lost the meaning of the original phrase, which is no good if we're rewriting documents that humans will read. So how's a computer going to know which is the better substitution?

It's a tough one. My best answer at the moment is to have the bot see what humans use more often, i.e. Google each candidate phrase and see which returns more results.

'she was eating a fruit' => 95
'she was eating a phallic symbol' => 0
'she was eating a snack' => 164
'she was eating a monoecious plant' => 0
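Here's a minimal sketch of that ranking in Python. The hit counts are hard-coded from the searches above; a real bot would need some way of fetching them, which I'm not showing. The function name rank_substitutions and the rule of dropping zero-hit phrases are my own assumptions.

```python
# Hit counts copied from the searches above. A real bot would have to
# fetch these from a search backend; here they are simply hard-coded.
HIT_COUNTS = {
    "she was eating a fruit": 95,
    "she was eating a phallic symbol": 0,
    "she was eating a snack": 164,
    "she was eating a monoecious plant": 0,
}


def rank_substitutions(template, candidates, hit_counts):
    """Score each candidate by how often the rewritten phrase was seen.

    Candidates whose phrase has zero hits are dropped (an assumption);
    the rest are returned best first.
    """
    scored = []
    for candidate in candidates:
        phrase = template.format(candidate)
        count = hit_counts.get(phrase, 0)
        if count > 0:
            scored.append((candidate, count))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_substitutions(
    "she was eating a {}",
    ["fruit", "phallic symbol", "snack", "monoecious plant"],
    HIT_COUNTS,
)
print(ranked)  # [('snack', 164), ('fruit', 95)]
```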

Now we have a score for each substitution. To make a general case for each word (so as not to have to search each time, and because many phrases will not exist at all) we could search nouns against verbs for proximity, and the number of Google matches would be the score for how appropriate they are to each other.

''eat * banana'' OR ''banana * eat'' => 142,000
''eat * snack'' OR ''snack * eat'' => 269,000

We can also test verb substitutions from a regular thesaurus this way. For example Roget's lists nosh, chow and masticate as alternatives to 'eat'.

'eat * banana' => 124,000
'nosh * banana' => 13
'chow * banana' => 303
'masticate * banana' => 4

So 'chow' is the most likely substitute for 'eat' out of these three (personally I prefer masticate) but it's not a very common switch. If chow and eat had similar scores (say within 66% of each other) then that would likely be a better substitution.
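A quick Python sketch of that check, using the counts above. I'm reading 'within 66% of each other' as 'the smaller count is at least 66% of the larger one', which is only one possible interpretation, and the names similarity and best_substitute are invented for the example.

```python
# Counts copied from the '<verb> * banana' searches above.
VERB_COUNTS = {"eat": 124_000, "nosh": 13, "chow": 303, "masticate": 4}


def similarity(count_a, count_b):
    """Ratio of the smaller count to the larger one, from 0.0 to 1.0."""
    larger = max(count_a, count_b)
    if larger == 0:
        return 0.0
    return min(count_a, count_b) / larger


def best_substitute(original, counts, threshold=0.66):
    """Pick the alternative whose count is closest to the original's.

    Returns (verb, similarity, passes_threshold), or None if there are no
    alternatives. The 0.66 threshold is an assumed reading of 'within 66%'.
    """
    alternatives = [verb for verb in counts if verb != original]
    if not alternatives:
        return None
    verb = max(alternatives, key=lambda v: similarity(counts[original], counts[v]))
    score = similarity(counts[original], counts[verb])
    return verb, score, score >= threshold


verb, score, passes = best_substitute("eat", VERB_COUNTS)
print(verb, passes)  # chow False
```

With these numbers 'chow' comes out on top but falls a long way short of the threshold, which matches the gut feeling that it isn't a common switch.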

Ultimately I'd like to be able to make a bot rewrite text into infinite permutations, retaining the original English (human) meaning as well as some of its nuance.

I'm sure it's possible, I'm just not sure how. Try 'He attended MIT to study' (with the object removed from the sentence). Googling for 'He * MIT to study' gives:

he was at MIT to study
he was accepted at MIT to study
he came to MIT to study
he entered MIT to study

There are also a bunch of bad substitutions, most of which can be filtered out by context. 'he returned to MIT to study' would be harder for a machine to spot as a bad substitution because it changes the meaning.
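For what it's worth, pulling the fill-ins out of results like these is the easy part. Here's a small Python sketch that turns the wildcard phrase into a regular expression and collects whatever lands in the gap. It assumes the result sentences are already in hand, that the phrase has one * somewhere in the middle, and the function names are hypothetical. It does nothing about spotting substitutions that change the meaning.

```python
import re
from collections import Counter


def wildcard_to_regex(pattern):
    """Turn 'he * MIT to study' into a regex where * captures one or more words."""
    parts = [re.escape(part.strip()) for part in pattern.split("*")]
    body = r"\s+(.+?)\s+".join(parts)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def extract_fillers(pattern, texts):
    """Count the phrases that fill the wildcard gap across the given texts."""
    regex = wildcard_to_regex(pattern)
    fillers = Counter()
    for text in texts:
        match = regex.search(text)
        if match:
            fillers[match.group(1).lower()] += 1
    return fillers


# Result sentences copied from the list above.
RESULTS = [
    "he was at MIT to study",
    "he was accepted at MIT to study",
    "he came to MIT to study",
    "he entered MIT to study",
]

for filler in extract_fillers("he * MIT to study", RESULTS):
    print(filler)
```

Counting the fillers rather than just listing them means the same frequency trick used for nouns and verbs could be applied to these too.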

Thinking.... thinking.... thinking....

Any thoughts or ideas, email me!

Created 2006-02-10 19:46:40 by 216 and filed under hacking

Developing Nations Licence.
dumbbuthappy at gmail dot com