Thoughts without words
If letters make words, then words make thoughts. One way to try to represent relationships between words is to look at how often they occur together on web pages. We can get an estimate of this by looking at page counts in search engines.
Pages of English
First we need an idea of how many pages contain English sentences. A search for the term “a*” produces about 15 trillion results, but many of these are pages that do not contain English sentences. A search for the word “the” returns about 4.8 trillion, which is probably a better estimate of the number of pages containing English. We may have missed quite a few, but we will be close.
Is George Bush a Maniac?
Here’s a simple example using the words “George Bush” and “Maniac”.
- A Google search for “George Bush the” returns 165 million pages. Dividing by the 4.8 trillion English pages gives a probability of about 3.4e-5.
- “maniac the” returns 2.75 million pages, or a probability of about 5.7e-7.
- If the two terms were independent, the joint probability would be about 2e-11, which should produce about 94 pages.
- Instead we get 545k pages, roughly 5.8e3 times as many combined pages as independence would predict.
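The arithmetic in the bullets above can be sketched in a few lines of Python. This is only an illustration: the function name `likelihood_ratio` and its parameter names are made up here, and the page counts are the ones quoted above rather than live search results.

```python
def likelihood_ratio(count_a, count_b, count_joint, total_pages):
    """Compare an observed joint page count with the count expected
    if the two terms appeared on pages independently.

    Returns (expected_joint_pages, ratio).
    """
    p_a = count_a / total_pages          # probability a page contains term A
    p_b = count_b / total_pages          # probability a page contains term B
    expected_joint = p_a * p_b * total_pages
    return expected_joint, count_joint / expected_joint


# Assumption carried over from the text: pages matching "the"
# approximate the number of English pages.
TOTAL_ENGLISH_PAGES = 4.8e12

expected, ratio = likelihood_ratio(
    count_a=165e6,       # "George Bush the"
    count_b=2.75e6,      # "maniac the"
    count_joint=545e3,   # pages containing both terms
    total_pages=TOTAL_ENGLISH_PAGES,
)
print(f"expected joint pages: {expected:.1f}")
print(f"likelihood ratio: {ratio:.0f}")
```

Keeping the total page count as a parameter makes it easy to see how much the ratio depends on that one rough estimate.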
Something strange happens when I try to create a counter-example. If I look for the relationship between ‘George Bush’ and a randomly chosen number like ‘193443’, I get the following:
- Probability of a page containing ‘193443 the’: 3.1e-9 (15k pages)
- Expected number of joint pages: 0.52
- Actual number of pages: 52 pages
- Likelihood ratio: 101 times more than chance
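Both examples reduce to the same formula. As a sketch, write N for the estimated number of English pages (4.8 trillion above), n_A and n_B for the page counts of each term, and n_AB for the count of pages containing both. These symbols are introduced here for illustration only. The expected joint count E under independence and the likelihood ratio R are then:

```latex
\[
E = N \cdot \frac{n_A}{N} \cdot \frac{n_B}{N} = \frac{n_A \, n_B}{N},
\qquad
R = \frac{n_{AB}}{E} = \frac{N \, n_{AB}}{n_A \, n_B}
\]
\[
E = \frac{(1.65 \times 10^{8})(1.5 \times 10^{4})}{4.8 \times 10^{12}} \approx 0.516,
\qquad
R = \frac{52}{0.516} \approx 101
\]
```

The second line plugs in the counts for ‘George Bush’ and ‘193443’ quoted above.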
Although it is counterintuitive, the counts suggest that there is a relationship between ‘George Bush’ and this particular number, even though the average person would not expect one.
Consider the following likelihood ratios:
- cat-dog: 3561
- airplane-sasquatch: 10313
- copper-existentialism: 2898
- preponderance-evidence: 12887
- invisible-brick: 2360
What is going on here? Clearly the likelihood ratio is high for common phrases, but it is also high for pairings that seem rather unlikely, such as ‘invisible brick’. It’s actually really tough to come up with word pairs that have low likelihood ratios.
What is going on here?