Word Lists and Regular Expressions

What are Word Lists?

Word Lists are compilations of words from a wide variety of sources including some dictionaries. The word lists are commonly in the public domain.

What sets the lists apart is what they are used for. For example, the official Scrabble™ player's dictionary, known as OSPD, includes words that are playable under the “official rules” of Scrabble. The words are 8 letters or fewer in length, with no proper names and no abbreviations.

Each list has a history, a source, and some bias. We've tried to credit all the sources, as well as any biases we've observed. On our page of word lists you will also find compilations of all the lists into a single list.

What do we use word lists for?

The most common use of any word list in electronic form is to hand a LOT of known valid words to an electronic solving or constructing tool. For example, TEA (formerly The Electronic Alveary), from Crossword Man in the U.K., can consume these lists and search them with a sort of “regular expression” for lots of letter patterns. Maybe you want to know a lot of words that are a specific length and have 'd' in a particular position. TEA can find them. Or you want all the anagrams of “blacksmith” for something. TEA is a great choice. And TEA can handle several lists at once, so if you have a word list of a particular sort, you can look there for matching words.

A Regular Expression Primer

Originally by Daz, with edits by Qoz

If you are new to Regular Expressions, they can be quite daunting. If you are familiar with the basics, they can still be quite daunting! They are an extremely powerful tool and a remarkably concise way to interpret text. They would certainly not be considered “light reading,” but they can be read.

Caveat: When you get into more sophisticated features, regular expression systems vary. You'll want to confirm the syntax for the system you are using.

A regular expression is a symbolic representation of the most general search query one can make to a word finder. Any letter just means itself. A . (a period) means any single letter (or other character). A * after any expression means 0 or more consecutive occurrences of that expression (or more accurately, of anything that MATCHES that expression). Regular expressions generally match on lines of text in a file; since word lists have one word per line, your regular expressions will match on words.

Usually, regular expressions are case-SENSITIVE, so if you ask for a j you won't match J. Most regular expression systems, as well as the NPL word finder, provide an option to override case sensitivity.
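As a rough illustration of the case-sensitivity point, here is a minimal sketch using Python's standard re module. The sample words are made up for the example; a real word list would be read from a file.

```python
import re

# Hypothetical sample entries, not taken from any particular word list.
words = ["jazz", "Jazz", "JAR", "ajar"]

# Case-sensitive by default: a lowercase j does not match J.
sensitive = [w for w in words if re.search(r"j", w)]

# The re.IGNORECASE flag overrides case sensitivity.
insensitive = [w for w in words if re.search(r"j", w, re.IGNORECASE)]

print(sensitive)
print(insensitive)
```

Other systems expose the same option differently (a checkbox, a command-line switch), so check the one you are using.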

Regular expressions can be built up of simpler regular expressions; some examples follow.

  • Combine . and + to refer to arbitrary consecutive text with .+ (here the period matches any single character, and the + means 1 or more occurrences thereof; the occurrences need not be the same character)
  • Combine . and * to refer to optional arbitrary consecutive text with .* (here the period matches any single character, and the * means 0 or more occurrences thereof)
  • The expression [xyz...w] (where x,y,z,...,w are any letters) means any one of these letters
  • For a run of consecutive letters, use a -. [a-e] is equivalent to [abcde] or [edcba]
  • ^ at the left of the whole expression means the beginning of a word; $ at the right means the end of the word; to find all words composed of just the five vowels, use the regular expression ^[aeiou]*$
  • A ^ at the left inside square brackets — like [^...] — means any character except the characters in brackets; [^aeiou] means any character except a,e,i,o,u; [^a-e] means any character except a,b,c,d,e
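The building blocks in the list above can be tried out with a short Python sketch. The helper name find and the sample words are assumptions made for this example only.

```python
import re

# Hypothetical sample word list; a real one would be read from a file.
words = ["aa", "eau", "bead", "cab", "queue", "rhythm", "dad"]

def find(pattern, word_list):
    """Return the words in which the pattern matches somewhere."""
    return [w for w in word_list if re.search(pattern, w)]

# Words made up of nothing but the five vowels.
only_vowels = find(r"^[aeiou]*$", words)

# Words made up of nothing but the letters a through e.
only_a_to_e = find(r"^[a-e]+$", words)

# Words that start with b, end with d, and have at least one character between.
b_something_d = find(r"^b.+d$", words)

# Words containing none of the letters a through e.
nothing_a_to_e = find(r"^[^a-e]+$", words)

print(only_vowels, only_a_to_e, b_something_d, nothing_a_to_e)
```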

Combining some of these techniques, we can do more complex things:

To find all words that have 3 of these vowels in a row somewhere in the word, use:

[aeiou][aeiou][aeiou]

Since we didn't use ^ or $ here, this will find the letters anywhere in a line.

To find all words that use none of a,e,i,o,u, use:

^[^aeiou]*$
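Both of these searches can be sketched in Python as follows. The sample words are assumptions for illustration; note that y is not in the bracket, so it counts as a non-vowel here.

```python
import re

# Hypothetical sample word list.
words = ["beautiful", "queue", "rhythm", "cat", "nth", "onomatopoeia"]

# Unanchored: three vowels in a row may appear anywhere in the word.
three_vowels = [w for w in words if re.search(r"[aeiou][aeiou][aeiou]", w)]

# Anchored at both ends: every character must be something other than a, e, i, o, u.
no_vowels = [w for w in words if re.search(r"^[^aeiou]*$", w)]

print(three_vowels)
print(no_vowels)
```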

To search for “regex1 OR regex2” just use (regex1)|(regex2). The vertical line character | is the character for OR, and the parentheses ( ) serve to group the two halves of the regular expression so they are not combined in a way you did not intend.
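The effect of grouping is easiest to see by comparing two patterns side by side. This is a small Python sketch with made-up sample strings; it assumes a system where plain ( ) are the grouping characters.

```python
import re

# Hypothetical sample entries.
words = ["under", "baked", "ed", "un", "cat"]

# Without grouping, ^ attaches only to "un" and $ only to "ed":
# this reads as "starts with un" OR "ends with ed".
ungrouped = [w for w in words if re.search(r"^un|ed$", w)]

# With grouping, the anchors apply to the whole alternation:
# the entire word must be "un" or "ed".
grouped = [w for w in words if re.search(r"^(un|ed)$", w)]

print(ungrouped)
print(grouped)
```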

There are a whole bunch of other things one can ask about. In these examples, regex can be any complete regular expression (excepting ^ and $, in most cases, as that would produce something that will find no matches).

  • To ask for all words that contain a consecutively repeated expression, use: (regex)+
  • To ask for all words that consist of a repeated expression, use: ^(regex)+$
  • To ask for all words that consist of an expression repeated exactly 3 times, use: ^(regex){3}$
  • To ask for all words that consist of an expression repeated between 2 and 5 times, use: ^(regex){2,5}$
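Here is a short Python sketch of the repetition forms in the list above, using "a non-vowel followed by a vowel" as the repeated expression. The sample words and variable names are assumptions for this example. Note that a repeated expression may match different text each time; only the pattern is repeated.

```python
import re

# Hypothetical sample word list.
words = ["go", "mama", "banana", "tomato", "cat", "avocado"]

# The expression to repeat: one non-vowel followed by one vowel.
pair = "[^aeiou][aeiou]"

# The whole word is one or more such pairs.
one_or_more = [w for w in words if re.search("^(" + pair + ")+$", w)]

# The whole word is exactly three pairs.
exactly_three = [w for w in words if re.search("^(" + pair + "){3}$", w)]

# The whole word is between two and five pairs.
two_to_five = [w for w in words if re.search("^(" + pair + "){2,5}$", w)]

print(one_or_more, exactly_three, two_to_five)
```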

Subexpressions are a very powerful concept. \( and \) indicate subexpressions. \n re-uses the nth subexpression.

  • To ask for all words containing an expression repeated twice, possibly with intervening letters, use: \(regex\).*\1
  • To ask for all words containing an expression repeated thrice anywhere in the word, use: \(regex\).*\1.*\1
  • To ask for two two-letter strings that occur alternately in the word as _A_B_A_B_, anywhere in the word, you'd use \(..\).*\(..\).*\1.*\2

Explanation: \( acts as a left bracket and \) as a right one (note the backslashes). The subexpression \(something\) remembers the text it matched, and the part of the regular expression to its right can refer to that text as \n, where n is the number of the subexpression, counting from the left. (All but the last example involved only one subexpression, so only \1 was used in those.) The notation is very dense and unforgiving, but it's really not complicated. Note: Some regular expression systems work with ( and ) for subexpressions, not \( and \).
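Python's re module is one of the systems that uses plain ( and ) for subexpressions (there, \( matches a literal parenthesis instead), while \1 and \2 work as described. A minimal sketch with made-up sample words:

```python
import re

# Hypothetical sample word list.
words = ["murmur", "banana", "cat", "couscous", "hotshots", "level"]

# Some two-letter string occurs twice, possibly with letters in between.
# This is the Python spelling of \(..\).*\1
twice = [w for w in words if re.search(r"(..).*\1", w)]

# Two two-letter strings alternate as _A_B_A_B_.
# This is the Python spelling of \(..\).*\(..\).*\1.*\2
alternating = [w for w in words if re.search(r"(..).*(..).*\1.*\2", w)]

print(twice)
print(alternating)
```

A back-reference matches the same text the subexpression matched, not just anything that fits the subexpression's pattern, which is why "level" is not returned by the first search.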

If you want to match a character which is a regular expression character, surround it with [], as in [(] to match an opening parenthesis.
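A small Python sketch of the bracket trick, with made-up sample strings. The last line uses re.escape, which is specific to Python's re module and is shown only as one alternative; other systems may offer nothing equivalent.

```python
import re

# Hypothetical sample strings.
lines = ["f(x)", "fx", "a.b", "axb"]

# [(] matches a literal opening parenthesis.
with_paren = [s for s in lines if re.search(r"[(]", s)]

# A bare period matches any character, so a.b also matches "axb".
any_char = [s for s in lines if re.search(r"a.b", s)]

# [.] matches only a literal period.
literal_dot = [s for s in lines if re.search(r"a[.]b", s)]

# Python-specific alternative: re.escape quotes every special character.
escaped = [s for s in lines if re.search(re.escape("a.b"), s)]

print(with_paren, any_char, literal_dot, escaped)
```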

For further reading, you may want to check out an online regular expression tutorial. You also might find a regular expression summary sheet handy.

Finding NPL Base Types

Originally by Lucifer, with edits by Qoz

Once you're familiar with regular expressions, you'll probably want to use them to find some bases for puzzles. The following examples should help you get started.

Transposals

Suppose you want to transpose “aeginrst”. Put [^ganister] into the first search string, and specify “don't match”; then put \(.\).*\1 into the second search string, and specify “don't match” there, too. Then choose a word length of 8. If you use NI2 as your dictionary, this query will retrieve three words: astringe, ganister and gantries, all of which are transposals of aeginrst.

This works because the first string, along with the “don't match”, says: “find all words that don't contain any letters other than those in aeginrst”. So only words that contain some or all of those letters will be returned. The second query says “don't match anything that has a repeated letter anywhere”. Since aeginrst has no repeated letters, that's a requirement. Finally, the 8-letter length requirement forces every one of the 8 letters to be used, so that every returned word must be a transposal.

This won't work for transposals containing spaces, or if any letters are repeated. You can get an approximation to the list by omitting the second search string, though, and this is sometimes good enough.
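Outside the word finder, the same three filters can be sketched in Python. The sample list below is made up (it includes the three words mentioned above plus some decoys) and is not NI2. The second half of the sketch swaps in a different technique, comparing sorted letters, which is not a regular expression but does cope with repeated letters.

```python
import re

# Hypothetical sample word list, not NI2.
words = ["astringe", "ganister", "gantries", "strainer",
         "staring", "triangle", "retsina"]

target = "aeginrst"

# Regex approach: length 8, no letter outside the target, no repeated letter.
by_regex = [
    w for w in words
    if len(w) == len(target)
    and not re.search(r"[^ganister]", w)   # first "don't match"
    and not re.search(r"(.).*\1", w)       # second "don't match"
]

# Alternative (not a regex): two words are transposals of each other
# exactly when their sorted letters are equal.
by_sorting = [w for w in words if sorted(w) == sorted(target)]

print(by_regex)
print(by_sorting)
```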

Letter Banks

What banks down to “lens”? Put [^lens] into the first search string and choose “don't match”; the result (using NI2) will be a list of 47 words, all of which contain only l's, e's, n's and/or s's. However, the list includes words like “eel” and “sense” which don't contain all four of the letters. The best that can be done here is to put .... (four periods) into the second search string. This will force the resulting words to be at least four letters long, which eliminates “eel” but not “sense”.
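If you have the word list as a file and a scripting language at hand, the remaining false positives can be removed after the regular expression pass. The following Python sketch uses a made-up sample list, and the set comparison in the second step is an addition that goes beyond what the two search strings can express.

```python
import re

# Hypothetical sample word list, not NI2.
words = ["eel", "sense", "lens", "lenses", "senseless", "less", "lessen"]

# Step 1, as in the text: no letter outside l, e, n, s,
# and at least four characters long.
by_regex = [
    w for w in words
    if not re.search(r"[^lens]", w) and re.search(r"....", w)
]

# Step 2 (not a regex): keep only words that use all four letters.
exact = [w for w in by_regex if set(w) == set("lens")]

print(by_regex)
print(exact)
```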

Consonantcies

Suppose that you suspect part of a consonantcy's solution is “biochemistry”. The consonants are bchmstr; you could search for bchmstr and specify “consonants only” but the resulting list won't tell you which words contain those consonants. Instead, put the string [aeiouy]*, which matches any number of vowels, before, between, and after the consonants, and enter the result as the search string:

[aeiouy]*b[aeiouy]*c[aeiouy]*h[aeiouy]*m[aeiouy]*s[aeiouy]*t[aeiouy]*r[aeiouy]*

This works, but in addition to “beachmaster” (which is the one we wanted) it returns “psychobiochemistry”. This can be avoided by adding ^ at the start of the pattern and $ at the end, like so:

^[aeiouy]*b[aeiouy]*c[aeiouy]*h[aeiouy]*m[aeiouy]*s[aeiouy]*t[aeiouy]*r[aeiouy]*$
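Typing these long patterns by hand is error-prone, so it can help to build them from the consonant string. Below is a minimal Python sketch; the function name consonantcy_pattern and the sample words are assumptions for this example.

```python
import re

VOWELS = "[aeiouy]*"  # any number of vowels, as in the text

def consonantcy_pattern(consonants, anchored=True):
    """Interleave the vowel pattern before, between, and after the consonants."""
    body = VOWELS + "".join(c + VOWELS for c in consonants)
    return ("^" + body + "$") if anchored else body

# Hypothetical sample word list.
words = ["biochemistry", "beachmaster", "psychobiochemistry", "chemistry"]

loose = consonantcy_pattern("bchmstr", anchored=False)
strict = consonantcy_pattern("bchmstr")

# Unanchored: the consonant run may sit inside a longer word.
loose_hits = [w for w in words if re.search(loose, w)]

# Anchored: the whole word must consist of exactly these consonants plus vowels.
strict_hits = [w for w in words if re.search(strict, w)]

print(loose_hits)
print(strict_hits)
```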

Cryptograms

You can find pattern words by using the back-referencing feature of regular expressions. Let's say a word in a crypt is KQFPWQP. Each letter that appears twice can be made into a back-reference:

^.\(.\).\(.\).\1\2$

As with letter banks, this isn't perfect: it returns alveole, which is good, but it also returns potoroo, which is not. There is no easy way to improve this; you just have to hope there aren't too many false positives in the list. For strongly patterned words this is not much of a problem; for weakly patterned words the list is likely to be very long anyway.

You can filter the result a little, though, by specifying in this case that no letter should repeat three times. You can do this by putting \(.\).*\1.*\1 in the second search string, and specifying “don't match”; this reduces the matches from 69 to 49, and eliminates potoroo.
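The two-step search can be sketched in Python, where subexpressions are written with plain ( and ). The sample words are made up for the example. The last part swaps in a different technique, comparing letter patterns directly, which is not a regular expression but rules out false positives exactly; the helper name letter_pattern is an assumption.

```python
import re

# Hypothetical sample word list.
words = ["alveole", "potoroo", "cabbage", "example"]

# Step 1: pattern for KQFPWQP (2nd letter = 6th, 4th letter = 7th).
regex_hits = [w for w in words if re.search(r"^.(.).(.).\1\2$", w)]

# Step 2: "don't match" any word in which some letter appears three times.
filtered = [w for w in regex_hits if not re.search(r"(.).*\1.*\1", w)]

def letter_pattern(word):
    """Number each letter by order of first appearance, e.g. 'abca' -> (0, 1, 2, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in word)

# Alternative (not a regex): keep words whose letter pattern equals the crypt's.
exact = [w for w in words if letter_pattern(w) == letter_pattern("KQFPWQP")]

print(regex_hits)
print(filtered)
print(exact)
```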