This page contains information about, and download links for, the dictionaries in use in the word search function. To download them on a PC, right-click on the links and choose “Save As”. On a Mac, click-hold on the link, then choose “Save as . . .” or “Save this link as . . . .”
The descriptions of the individual dictionaries are followed by a section on the overlaps and relationships between dictionaries, where these are known.
The two NI2 word lists are known as “web2” and “web2a”, and both have been widely available for a long time. They are collectively known as the “Air Force” lists. The first contains all single words from NI2; the second, all compound forms. You may be interested in information we received from Doug McIlroy about the origin of both these lists and the Unix dictionary (see below). Not every word in NI2 is here; not every word here is in NI2. The correspondence is very close for web2, but not so good for web2a: there are many NI2 compound entries missing from web2a. In addition, there are many inflected forms missing from these lists. However, for NPL purposes these two lists are invaluable since they cover the largest current reference very well.
These lists are mixed case.
The official Scrabble player's dictionary, known as OSPD, is widely available on the internet. There is also a list targeted at Scrabble players known as the Enable list, which has been explicitly placed in the public domain. The Enable list is much more comprehensive than OSPD; you can read the authors' comments if you are interested. We have left their comments exactly as found, so a couple of references (e.g. to SIGWORDS.LST) do not make sense in this context. Note that OSPD contains no words over 8 letters; one reason that the Enable list is so much larger is that it includes longer words (which are of course much less likely to be used in Scrabble).
Grady Ward's Moby project released its various word lists to the public domain in the late 1990s. Two of those lists are available here: the single and compound word lists. These are the largest lists, but they should be used with caution; many words in them appear to be spurious. As a last resort, however, these lists can be invaluable, particularly the compound word list.
The DICT development group provides access to a searchable dictionary that includes definitions. They also have a page where one can submit new definitions to what they refer to as the Free Internet Lexicon and Encyclopedia. On that page, they provide links to download several lists. The list described here as “FILE main list” was obtained from that page, using the link to a list of 100,000 words that were successfully looked up in their dictionary. They have also done analysis on words that failed to be found and came up with a short list of words to add. This list is referred to as the “FILE to-do list”. That work was done by Kevin Atkinson, and you may wish to read his account of the analysis.
This is the dictionary distributed with most versions of Unix. You may be interested in information we received from Doug McIlroy about the origin of both this list and the NI2 lists (see above). It is lower case and contains no compound forms.
This list was created by Ross Beresford, who maintains a crossword page, but he no longer maintains the list. It is mixed case. It includes letters with accents, such as 'À'; these will not match the equivalent ordinary letters, so I have included another version with those special characters converted to their plain ASCII equivalents.
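As an illustration of the kind of conversion described above, here is a minimal Python sketch that replaces accented letters with their unaccented base letters. It is not the method actually used to produce the converted list, which is not documented here; the function name strip_accents is hypothetical, and the approach (Unicode decomposition followed by removal of combining marks) is an assumption.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Return word with accented letters replaced by their base letters."""
    # NFKD decomposition splits a letter such as 'À' into 'A' plus a
    # separate combining grave accent.
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks and keep everything else unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("À la carte"))  # A la carte
print(strip_accents("café"))        # cafe
```

Letters that have no Unicode decomposition, such as 'ø' or 'ß', pass through this sketch unchanged and would need an explicit replacement table.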
For more information, see the readme file for the UK Advanced Cryptics Dictionary. Note that the list is not in the public domain and that this readme file must always be displayed when using the wordlist.
I obtained this via a link listed on the rec.puzzles.crosswords word list reference page. It is listed as “Dictionary I obtained personally from Roger King”; we have no more information about it. It contains no compound forms, and is lower case.
This list was in four parts, labelled words1.zip, words2.zip, word3.zip and words4.zip; in this form it is widely available on the net. There is a readme file for this list. It was apparently created by a company called Public Brand Software and put in its current form by Evan Antworth, about whom the readme file provides a little more information. It contains no compound forms, and is lower case.
The only documentation on this list describes it as “Dictionary from Center for Research in Lexicography”. It contains no compound forms, and is lower case.
This was originally a pronouncing dictionary, and can be found on the net (CMU Pronouncing) with phonetic information included in the file. That file also includes an informative text header explaining the list's origin and purpose. This list was originally upper case; I have converted it to lower case. It contained many duplicates, showing multiple pronunciations: these have been removed. It does not contain compound forms.
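The clean-up described above (lower-casing and removing the duplicates caused by multiple pronunciations) might look roughly like the following Python sketch. It assumes the commonly distributed CMU file layout, in which comment lines start with ";;;" and alternative pronunciations are marked with a numeric suffix such as "READ(1)"; that layout and the function name are assumptions, not a record of how the list on this page was actually processed.

```python
import re


def words_from_pronouncing_lines(lines):
    """Extract a sorted, de-duplicated, lower-case word list."""
    words = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and header comments (assumed to start with ';;;').
        if not line or line.startswith(";;;"):
            continue
        # The headword is the first whitespace-separated field.
        headword = line.split()[0]
        # Remove an alternative-pronunciation marker such as '(1)'.
        headword = re.sub(r"\(\d+\)$", "", headword)
        words.add(headword.lower())
    return sorted(words)


sample = [
    ";;; header comment",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
    "WORD  W ER1 D",
]
print(words_from_pronouncing_lines(sample))  # ['read', 'word']
```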
I believe this list was originally the word entry for a 1913 edition of Roget's Thesaurus. We have no other information about the list. It is lower case and has no compound forms.
We obtained this list from Orchy, but have received no response to emails requesting more information about the list and permission to use it. It is mixed case and has hyphenated compound forms.
We understand this list corresponds to Merriam-Webster's 9th Collegiate dictionary; 9C, in NPL parlance. It contains compound forms from which non-alphabetic characters (such as spaces and hyphens) have been removed, e.g. “byandlarge” is an entry. It is lower case.
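To compare entries from other lists against this one, it can help to reduce compound forms to the same shape. The sketch below is one hypothetical way to do that in Python: it lower-cases an entry and keeps only its alphabetic characters. The function name squash_compound is invented for this example, and we do not know how the 9C list itself was actually prepared.

```python
def squash_compound(entry: str) -> str:
    """Lower-case an entry and drop every non-alphabetic character."""
    # str.isalpha() is true for letters only, so spaces, hyphens,
    # apostrophes and digits are all removed.
    return "".join(ch for ch in entry.lower() if ch.isalpha())


print(squash_compound("by and large"))     # byandlarge
print(squash_compound("jack-o'-lantern"))  # jackolantern
```

Note that isalpha() also accepts accented letters, so those are kept rather than removed.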
There are several dictionaries available for which I have no information. Two are just called “unabridged”; I have simply numbered them 1 and 2 here. Number 1 is often seen with the file name Unabr.dict, or Unabr.dict.Z in compressed form. Number 2 is found with the file names unabrd.dic, or unabrd.dic.Z. There is also a small “pocket” dictionary, and another list that I have called wlist1.txt, but which was originally called w130794.Z. I have called the former “Pocket dictionary” and the latter “Anonymous word list”. I have heard that w130794.Z was associated with an organization called the Online Book Initiative, but have not received a response to my inquiry. Finally, there is a dictionary usually seen as words.english.Z, which I have called “2nd anonymous word list” and named engwords.txt. They are all lower case except for the 2nd unabridged and the 2nd anonymous list, which are both mixed case.
The pocket dictionary was in upper case when I obtained it; I have converted it to lower case.
There are many other lists of words available on the internet. I have generally placed here only large lists that will be of use to logophiles.
Finally, you may wish to download one of the consolidated lists consisting of the superset of all the above lists.
The differences are explained in an email from Mr. Brown.
Other than listed above; web2 and web2a are not compared here, for example. You can see some overlap statistics on each pair of dictionaries (except the Orchy list and the 9C list, which I added since I did this analysis) if you are interested.
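For readers who want to produce overlap figures of their own, the following Python sketch counts the words shared by each pair of lists. It is only an illustration: the function name, the toy data, and the decision to ignore case are all assumptions, and it is not the procedure that produced the statistics referred to above.

```python
from itertools import combinations


def pairwise_overlap(word_lists):
    """Map each pair of list names to the number of words they share."""
    # Fold case so that mixed-case and lower-case lists can be compared
    # (an assumption made for this sketch).
    sets = {name: {w.lower() for w in words} for name, words in word_lists.items()}
    return {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }


# Toy data, invented for the example.
lists = {
    "alpha": ["cat", "dog", "Emu"],
    "beta": ["dog", "emu", "fox"],
    "gamma": ["fox", "gnu"],
}
for pair, shared in pairwise_overlap(lists).items():
    print(pair, shared)
```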
Thanks are due to /dev/joe for supplying this analysis, as well as some of the other information on this page; and to Edward Spires and Dr. J.D. Collins, who added further comments. I have changed the filenames to correspond to the file names used on this page.
First off, unabr1 omits lots of common words. It appears to be intended as an add-on to the file unixdict (also available at ftp://sable.ox.ac.uk/pub/wordlists/dictionaries/ which is where I am familiar with unabr1). These lists are mutually exclusive.
The combination of unabr1 and unixdict is very similar to web2. Before comparing them, you should turn all the words in web2 to lower case, because the unixdict and unabr1 files are like this. Only 410 words from unabr1 are not in web2. (If you include unixdict, you find quite a few more – unixdict looks like the traditional /usr/dict/words dictionary, intended to be used with a Unix spell-checker, and includes words like “1st” and “tektronix” and “unix”.)
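The comparison just described can be sketched in a few lines of Python using sets. This is an illustration under stated assumptions, not the script used for the analysis: the function name compare_lists is hypothetical, and the word lists in the example are tiny invented samples rather than the real files.

```python
def compare_lists(web2, unabr1, unixdict):
    """Compare web2 (lower-cased) with the combination of unabr1 and unixdict."""
    # web2 is mixed case, so fold it to lower case before comparing.
    web2_lower = {w.lower() for w in web2}
    unabr1_set = set(unabr1)
    unixdict_set = set(unixdict)
    combined = unabr1_set | unixdict_set
    return {
        # True when unabr1 and unixdict share no words at all.
        "disjoint": unabr1_set.isdisjoint(unixdict_set),
        "unabr1_not_in_web2": sorted(unabr1_set - web2_lower),
        "web2_not_in_combined": sorted(web2_lower - combined),
    }


# Invented sample data, for illustration only.
result = compare_lists(
    web2=["Aaron", "apple", "bandolier", "zebra"],
    unabr1=["aaron", "cosmonaut", "zebra"],
    unixdict=["apple", "unix"],
)
print(result["disjoint"])              # True
print(result["unabr1_not_in_web2"])    # ['cosmonaut']
print(result["web2_not_in_combined"])  # ['bandolier']
```

With the real files, each argument would be the list of lines read from the corresponding word list, with trailing newlines stripped.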
Only 21 words from web2 are not in the combination of unabr1 and unixdict. The list of words in unabr1 but not in web2 (410 words) looks like a mixture of several types of words. Some are inflected forms which were probably just omitted from the web2 list (e.g. dividers), or even base forms omitted from web2 (has?!). A great number, maybe as many as half these words, are modern formations which are probably not in NI2 because they didn't exist or were not yet in common enough usage (e.g. cosmonaut, desegregation, flamethrower). It is possible that unabr1 was prepared from a late printing of NI2 including the addenda (it would have had to have been a very late NI2 to have picked up cosmonaut).
There are only 21 words in web2 not present in unabr1 or unixdict, so I'll just include them all here; some appear to be typos (diagrammitically); others are variant spellings (bandolier, cockatiel, escallop, etc.).
Looking at the statistics on the dictionaries, some things are quickly apparent: