Letter frequencies with Google

joelisjoel | geek | Thursday, June 14th, 2007

The Machine That Ate the Planet 

I’ll be starting work at Google next week, and from what I can gather they are basically building a computer the size of a planet to store and process all human knowledge.  Actually, with the addition of video to searches, it might be closer to say that the computer will store all of human experience, but hey, one step at a time, kids…

I’ve always been surprised that the best way Google can pay for its ginormous database is by selling ads.  There’s nothing wrong with ads; it just seems to me that there’s a progression from an index of keywords on web pages to a semantic representation of information.  If we could alter the representation of a page suitably, would we transition from a machine which is able to find pages to a machine which knows things about the world?

 

What is ‘Knowing’? 

Of course, this raises all kinds of spooky questions about what it means to know things.  If I can type a question into Google’s search bar and find a page that gives me the answer without too much trouble, then this might be as good, operationally, as a true AI.

For example, if I type the question ‘What is the diameter of the earth?’ I receive the following response as the first result:

What is the diameter of the earth?

The answer to the geography question - what is the diameter of the earth?

Could an AI do any better?

Still, it is tempting for me to believe that Google might be able to do better if it represented information with some kind of semantic network, i.e. by breaking pages down into paragraphs and sentences.  Semantic networks are sort of an old idea in AI, and are discredited now as far as I know.  But they make possible several interesting linguistic tricks, like the generation of nonsense sentences.

Sentence #123901210312

We tend to represent our thoughts online in a written language like English.  The information density of English is pretty low - something like 1 bit per ASCII character.  So we know there is a lot of redundancy in English sentences.  But there is another kind of redundancy: the same thought can be repeated over and over.  For example, a search for the sentence:

“Albert Einstiein is a smart guy”

produces two hits.  The sentence:

“George Bush is an asshole.”

produces 14,500 pages.  We might be able to draw inferences about beliefs based on how frequently certain sentences are used.
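The hit counts above came from typing exact sentences into the search box.  As a rough offline stand-in, here is a minimal sketch that counts how many pages in a small, made-up list contain each sentence.  The function names, the sample pages and the naive sentence splitting are all assumptions for illustration, and this is not how Google counts anything.

```python
import re
from collections import Counter

def normalize(sentence: str) -> str:
    """Lowercase and drop punctuation so near-identical sentences match."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return " ".join(words)

def sentence_counts(pages: list[str]) -> Counter:
    """Count how many pages contain each normalized sentence."""
    counts = Counter()
    for page in pages:
        # Naive split on '.', '!' and '?'; real text needs a proper splitter.
        sentences = {normalize(s) for s in re.split(r"[.!?]+", page)}
        sentences.discard("")  # ignore empty fragments left by the split
        counts.update(sentences)
    return counts

# Made-up sample "pages", purely for illustration.
pages = [
    "Cats are great. I like tea.",
    "I like tea! Cats are great.",
    "Cats are GREAT?",
]
counts = sentence_counts(pages)
print(counts.most_common(2))
```

Each page counts at most once per sentence, which mirrors the idea of "pages that contain this sentence" rather than total occurrences.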

There are a couple of trillion web pages out there, say 10 to the 13th power to be generous.  If every web page could be broken into, say, 100 sentences on average, then every sentence on the web could be given a number from 1 to 1e15.  That fits in a measly 50-bit number (2 to the 50th power is about 1.1e15), which is enough to label every sentence on the internet.
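The bit count in an estimate like this is just a base-2 logarithm, rounded up.  A minimal sketch, where the page counts and the 100-sentences-per-page figure are assumed round numbers rather than measurements:

```python
def bits_needed(count: int) -> int:
    """Smallest number of bits that gives `count` items distinct IDs."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # IDs run from 0 to count - 1, so we need enough bits for count - 1.
    return (count - 1).bit_length()

# Assumed round numbers, not measurements.
sentences_per_page = 100
for pages in (10**10, 10**12, 10**13):
    total = pages * sentences_per_page
    print(f"{pages:.0e} pages -> {total:.0e} sentences -> {bits_needed(total)} bits")
```

For 1e15 sentences this works out to 50 bits, so any identifier of 50 bits or more is big enough.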

 

Letters Make Words

You could also choose to build a semantic network around the notion of sentences in pages, or words in sentences.  To do some simple visualization experiments, though, it’s fun to ask Google about simple character frequencies.

As of this morning, Google has about 8 trillion pages in the index that contain the word “a”.  You can actually do this query for all 26 single-letter “words” and you get a frequency diagram that looks like this.

 

 

[Figure: frequencies of one-letter words]

 

The word “a” is the most common single-letter word, followed by “e”, “i”, “o” and, curiously, “s”.
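The counts in the chart came from Google queries done by hand.  For anyone who wants to play with the same idea offline, here is a minimal sketch that counts standalone one-letter tokens in a local text sample.  The function name and the sample sentence are made up for illustration, and the tokenizer is crude (it treats the “I” in “I’ll” as a word of its own).

```python
import re
import string
from collections import Counter

def single_letter_word_counts(text: str) -> dict[str, int]:
    """Count standalone one-letter tokens (a-z), case-insensitively."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if len(t) == 1)
    # Include every letter, so a chart of this has exactly 26 bars.
    return {letter: counts[letter] for letter in string.ascii_lowercase}

# Made-up sample text, purely for illustration.
sample = "I saw a cat and a dog; plan B is vitamin E."
counts = single_letter_word_counts(sample)

# Crude text bar chart of the three most frequent one-letter words.
for letter, n in sorted(counts.items(), key=lambda kv: -kv[1])[:3]:
    print(letter, "#" * n)
```

Returning all 26 letters, including the ones with a zero count, keeps the x-axis fixed no matter what text goes in.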

In true Martin Gardner fashion, you could do the same thing with two letters next, and you get a little frequency chart like this:

 

[Figure: frequencies of two-letter words]

The bright spots in the image correspond to words like “be”, “is”, “of”, etc., so this figure does say something about the English language.
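A chart like this is a 26-by-26 grid, with one axis for the first letter and the other for the second.  Here is a minimal sketch that builds such a grid from a local text sample instead of Google hit counts.  The function name, the row/column layout and the sample sentence are assumptions for illustration; the original image may have been laid out differently.

```python
import re
import string
from collections import Counter

LETTERS = string.ascii_lowercase

def two_letter_grid(text: str) -> list[list[int]]:
    """Return a 26x26 grid: grid[i][j] counts the word LETTERS[i] + LETTERS[j]."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if len(t) == 2)
    # Rows are the first letter, columns are the second letter.
    return [[counts[a + b] for b in LETTERS] for a in LETTERS]

# Made-up sample text, purely for illustration.
sample = "To be or not to be, that is the question of the day."
grid = two_letter_grid(sample)

row, col = LETTERS.index("b"), LETTERS.index("e")
print("be:", grid[row][col])
```

Feeding the grid to any heat-map plotting routine would turn the large counts into the bright spots.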

 

[Figure: frequencies of three-letter words]

Building a frequency table like this isn’t that enlightening, and it gets progressively harder as the number of letters increases.  But it does provide some early clues about how we might build or visualize a semantic map.  Next time I’ll try to extend this idea with words.

 


1 Comment »

  1. Terrific bar graph of one-letter words! I like how the letter index goes all the way to 30, as if there are slots available for letters that only appear in leap years or something.

    Comment by Craig — June 14, 2007 @ 11:07 am
