Frequencies of English Words
Something strange happened last time, when we looked at the probability that two words occur together on the same page. Even distantly related concepts appeared to have joint probabilities hundreds of times higher than chance.
To figure out what is going on, I’ve compiled a list of how many pages Google returns for a single-word search, for about 50,000 words. Here are some significant entries:
and 5140000000
the 4790000000
reserved 4300000000
copyright 4200000000
home 3880000000
or 3700000000
not 3660000000
s 3520000000
are 3400000000
an 3280000000
that 3090000000
search 3090000000
page 3040000000
…
projectile 5450000
pert 5450000
partake 5450000
linguist 5450000
devolution 5450000
Reba 5450000
Grimaldi 5450000
sleepwear 5440000
…
mechanizer 585
fraternizer 575
acclaimer 557
entrammel 552
baulker 543
rumourmonger 520
homoeotherm 502
enfranchize 478
harmoniousnesses 340
humourer 279
vialful 278
arithmetise 250
non-sympathiser 115
shakeably 46
Amusingly, this web page will increase the number of occurrences of the word “shakeably” on the web by about 2%.
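The 2% figure is just one extra page on top of the 46 already counted. A quick check in Python, assuming (hypothetically) that this page gets indexed and counted exactly once; the helper name `percent_increase` is made up for illustration:

```python
# Page count for "shakeably", taken from the list above.
shakeably_pages = 46


def percent_increase(old_count: int, added: int = 1) -> float:
    """Percentage growth when `added` new pages join `old_count` existing ones."""
    return 100.0 * added / old_count


# Assumption: this page is counted once, so exactly one page is added.
increase = percent_increase(shakeably_pages)
print(f"{increase:.1f}%")  # prints 2.2%
```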
Yesterday’s calculations were based on pages containing the word ‘the’, though ‘and’ looks like it would have been a better choice, since it finds more pages in English.
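Here is a rough sketch of how the choice of baseline enters the calculation. It assumes that "times higher than chance" means the ratio P(a and b) / (P(a) P(b)), with each probability estimated as a page count divided by the number of English pages N. That formula and the function name `lift` are my assumptions for illustration, not something spelled out above. Under that assumption the ratio works out to n_ab * N / (n_a * n_b), so changing the baseline rescales every ratio by the same factor:

```python
# Page counts for the two candidate baselines, taken from the list above.
THE_PAGES = 4_790_000_000
AND_PAGES = 5_140_000_000


def lift(n_ab: int, n_a: int, n_b: int, n_total: int) -> float:
    """Ratio of the observed joint probability to the product of the marginals.

    Assumed estimator: P(x) = (pages containing x) / n_total, where n_total
    stands in for the number of English pages.
    """
    p_ab = n_ab / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return p_ab / (p_a * p_b)


# The lift is proportional to n_total, so switching the baseline from 'the'
# to 'and' multiplies every ratio by the same constant.
rescale = AND_PAGES / THE_PAGES
print(f"{rescale:.3f}")  # prints 1.073
```

So, if the ratios were computed this way, moving from ‘the’ to ‘and’ would raise each of them by roughly 7%.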