I got into making a Scrabble game because I thought it’d be relatively easy to enumerate the possible moves on a board and have an algorithm find the best move. But it turns out there are a lot of possible moves on a typical board! You can play a large number of words in different places and orientations!

As I mentioned in the last post, Scrabble is not a solved problem. Even Quackle, the best OSS AI, does not dominate humans the way Deep Blue or AlphaGo do, and it often loses in competitions.

While looking at how people implement Scrabble, I found two interesting ideas! Today we’re going to be talking about ideas from the relatively old The World’s Fastest Scrabble Program and the newer A Faster Scrabble Move Generation Algorithm (“the GADDAG paper”).

Cross-Checks

The first idea (which comes out of the first paper) is to consider the constraints we add when playing parallel to a word, and how they can whittle down our options. For example, if we wanted to play parallel below TODDLER, one of our options would be AXE:

E
T O D D L E R
A X E

How do we enumerate the possible words under TODDLER? Well, the words we can play are constrained by the perpendicular words we can form: here we build ETA, OX, and DE.

If the word we want to play in parallel doesn’t make a valid 2-letter word perpendicularly at each step, we can’t play it. This is why the 107 two-letter words in Scrabble are so important: they are key to playing parallel words.

As players make moves and lay tiles on the board, this set of possible cross-words changes very little per turn, so we can store it in a hash map and re-use it each turn. Paper 1 calls these constraints cross-checks.

Let’s look at the cross checks for TODDLER:

E
T O D D L E R
= = = = = = =
A H A A I F E
H M E E A T
  F O O
  W
  N
  X

Each letter below the equals signs is a letter that can be added to the letter above it to make a 2-letter word (or, in the T column, the 3-letter word ETA or ETH, since E already sits above T). (This list is off the top of my head, not exhaustive.)

So now we can build the words that are possible here by looking for valid words among the cross-checks, which is a much smaller set of possibilities. Instead of formulating a word with our tiles, placing it parallel to another word, and then checking whether the cross-words are valid, we formulate only the words that are valid plays in that position in the first place.
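To make this concrete, here’s a minimal sketch of computing cross-check sets. The is_word helper is a hypothetical stand-in for a real dictionary lookup, and this is my illustration rather than the paper’s exact algorithm:

use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for a real dictionary query.
fn is_word(w: &str) -> bool {
    ["ETA", "ETH", "OX", "ON", "OF", "DA", "DE", "DO", "LA", "LI", "EF", "ET", "RE"]
        .contains(&w)
}

// For each column, `prefix` holds the letters already sitting above the empty
// square (e.g. "ET" for the first column under TODDLER, "O" for the second).
// The cross-check set is every letter that completes a valid perpendicular word.
fn cross_checks(prefixes: &[&str]) -> HashMap<usize, HashSet<char>> {
    prefixes
        .iter()
        .enumerate()
        .map(|(col, prefix)| {
            let letters: HashSet<char> = ('A'..='Z')
                .filter(|&c| is_word(&format!("{}{}", prefix, c)))
                .collect();
            (col, letters)
        })
        .collect()
}

fn main() {
    let checks = cross_checks(&["ET", "O", "D", "D", "L", "E", "R"]);
    println!("Letters playable under the first D: {:?}", checks[&2]);
}

Each square’s set only needs updating when a new tile lands next to it, which is why re-using the sets from turn to turn is cheap.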

This cross-check data structure is important, but paper #2 has an even more interesting insight.

Dictionary Data Structures

Scrabble is all about building new words and extending existing words on the board. The idea of cross-checks lets us store the 2-letter words we can build when writing a word in parallel, but it’s only a special case. What data structure should we use to store the whole dictionary, not just the 2-letter words?

If we use a Prefix Trie (possibly optimized as a Finite State Automaton) we can easily answer two questions:

  1. Is this a valid word?
  2. What can I append to this sequence to make a valid word? For example, “give me all the words that start with NINE”. Optionally we can store an extra “reverse dictionary” to allow us to go the other direction and ask for eg “words that end with PLAIN”.
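As a quick illustration of question 2, here’s roughly what a prefix query looks like with the fst crate from the last post, run against a plain sorted word set (a sketch; the function is mine, not part of fst):

use fst::automaton::{Automaton, Str};
use fst::{IntoStreamer, Set, Streamer};

// All words starting with `prefix`, e.g. "NINE" -> NINE, NINEFOLD, NINETY, ...
fn words_starting_with(dict: &Set<Vec<u8>>, prefix: &str) -> Vec<String> {
    let matcher = Str::new(prefix).starts_with();
    let mut stream = dict.search(matcher).into_stream();
    let mut words = Vec::new();
    while let Some(key) = stream.next() {
        words.push(String::from_utf8_lossy(key).into_owned());
    }
    words
}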

But what about searching in both directions at once? For example, building EXPLAINING out of just PLAIN. That’s not possible with a DAG: we can’t go forward in one dictionary and then backward in the other. For example, if we chose to append and then prepend, PLAINING isn’t a word we can search for prefixes of in the reverse dict.

A better DAG

One way to solve this problem is proposed in paper #2.

It’s a DAG, but with extra variations of each word stored in it to accommodate substring matches.

The additional variations are: every prefix of the word (eg for BOAT that’s B, BO, BOA, BOAT), reversed, then a separator marker, then the rest of the word. Let’s look at the example word in the paper, CARES.

Here are its prefixes and how they are changed into the DAG version:

C      => C+ARES 
CA     => AC+RES 
CAR    => RAC+ES
CARE   => ERAC+S
CARES  => SERAC+

Note: We use + for the separator but the paper uses a diamond ⬦.

What’s the point of this? Let’s say RE is on the board, and we want to know what we can play off that. We can reverse RE to ER, and start searching our GADDAG. We’ll find ERAC+S. This tells us we can prepend CA and append S to RE to get CARES.

This is why the author named the data structure a GAD+DAG: it stores reversed parts of words in a DAG.

But what about the extra space?

Yeah, it’s storing a lot more entries. If a word has N letters in it, it gets stored N times over. The authors claim the average word length in the Scrabble word list is 5, but that may have been an older list? The National word list 2018 I have is closer to 10. With its ~200k words, that works out to roughly 2 million entries.

The argument the paper makes is that move generation is the bottleneck for Scrabble AI, and this significantly speeds up move generation so it’s worth the tradeoff.

With modern computers it’s also not that large a burden. My current GADDAG storing the NWL 18 list of 200k words weighs 6MB (and gzips down to 4MB), likely fitting in CPU cache these days. For a web app, that may be on the upper end of acceptable bundle size, but it’s still the bandwidth equivalent of scrolling social media for a couple minutes. I may keep it on the server only, or regenerate it on the client.

Implementing a GADDAG

Here I’ll show an implementation using just the previously mentioned fst crate. I like this approach because fst is lightweight (it has no dependencies) and is pretty optimized; it’s built to work on much larger sets than the 2 million keys we’re about to throw at it. Specifically, what I like is that it implements the hard part for us: combining keys that share prefixes and suffixes into a minimal automaton.

In the last post I showed how to use the CLI command to make a regular DAG:

$ fst set --sorted nwl18.txt nwl18.fst

This takes ~0.2 seconds on my computer. In his post explaining how fst works, the author explains that the most time-consuming part of building these data structures is sorting the input, so in this situation with a pre-sorted dictionary it is quite fast.

We can query it like this:


use fst::Set;

fn main() {
    // Embed the pre-built FST at compile time and wrap it in a Set.
    let dict = Set::new(include_bytes!("nwl18.fst").as_ref()).unwrap();
    let words = ["DOG", "CAT", "MAGNIFICIENCE", "ETHICAL CONSUMPTION UNDER CAPITALISM"];
    for word in words.iter() {
        println!("The dictionary contains {:?}: {:?}", word, dict.contains(word));
    }
}

When we build a GADDAG, however, we have to insert these variations of words, and they can’t really be pre-sorted.

So we sort them ourselves, a quick step that takes ~3 seconds for the 200k-word NWL.

use std::collections::BTreeSet;

const MAX_WORD_LENGTH: usize = 15; // The board is 15 squares wide
const SEP: u8 = b'+'; // Our separator marker

// FST stores bytes, so we return our entries as byte arrays
pub fn build_entries(input: impl IntoIterator<Item = String>) -> BTreeSet<Vec<u8>> {
    let mut entries: BTreeSet<Vec<u8>> = BTreeSet::new();
    let mut new_word: Vec<u8> = Vec::with_capacity(MAX_WORD_LENGTH + 1);

    // We're going to re-use these instead of allocating for every entry
    let mut before_sep: Vec<u8> = Vec::with_capacity(MAX_WORD_LENGTH);
    let mut after_sep: Vec<u8> = Vec::with_capacity(MAX_WORD_LENGTH);

    for word in input.into_iter() {
        // Insert the fully reversed variant, without the separator
        let whole_word_rev: Vec<u8> = word.bytes().rev().collect();
        entries.insert(whole_word_rev);

        // Clear our intermediate state from the last input word
        before_sep.clear();
        after_sep.clear();

        // Move the last letter across the separator: for CARES that gives
        // CARE / S, so the first entry produced below is ERAC+S
        before_sep.extend(word.as_bytes());
        after_sep.push(before_sep.pop().unwrap());

        while !before_sep.is_empty() {
            new_word.clear();
            // before_sep holds the prefix in forward order; entries store it
            // reversed, so we .rev() both halves. Popping from before_sep
            // then moves the letter adjacent to the separator.
            new_word.extend(before_sep.iter().rev());
            new_word.push(SEP);
            new_word.extend(after_sep.iter().rev());
            after_sep.push(before_sep.pop().unwrap());

            entries.insert(new_word.clone());
        }
    }
    entries
}

The code is a little more complicated than it needs to be because I tried to re-use the vectors for each word, moving a letter from one side of the separator to the other at each step. I’m not sure the optimization helped very much; I just had fun writing it. Overall I used a generic BTreeSet to get a sorted set.
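To turn those entries into an actual fst file, something like the following should do it (a sketch using fst’s SetBuilder; the file names are placeholders and error handling is minimal):

use fst::SetBuilder;
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter};

fn main() -> io::Result<()> {
    let words: Vec<String> = BufReader::new(File::open("nwl18.txt")?)
        .lines()
        .collect::<Result<_, _>>()?;

    // BTreeSet iterates in sorted order, which is exactly what fst requires.
    let entries = build_entries(words);

    let writer = BufWriter::new(File::create("nwl18.gaddag.fst")?);
    let mut builder = SetBuilder::new(writer).expect("could not create the builder");
    for entry in &entries {
        builder.insert(entry).expect("entries are pre-sorted, so inserts succeed");
    }
    builder.finish().expect("could not finalize the fst");
    Ok(())
}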

Using our GADDAG

There are 4 operations we might want to do:

  1. Exact check for validity
  2. Prefix search (“starts with AX”)
  3. Substring search (“has AX in it somewhere”)
  4. Suffix search (“ends with AX”)

Exact check

Since we store the full reversed word without a separator, we can reverse our search word and check if it is in our dict.

Example: a search for “AXOLOTL” would look for “LTOLOXA” and would find it.
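In code that’s a one-liner (a sketch, assuming the Set we loaded earlier):

use fst::Set;

// A word is in the dictionary iff its reversal is stored as a whole entry.
fn contains_word(dict: &Set<Vec<u8>>, word: &str) -> bool {
    let reversed: String = word.chars().rev().collect();
    dict.contains(reversed)
}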

Prefix search

We can still do prefix searches by reversing our search prefix and adding the separator to it:

Example: words that start with “AX” -> search items in the GADDAG that start with “XA+”

XA+OLOTL  -> AXOLOTL
XA+IOM    -> AXIOM
XA+ING    -> AXING

Substring search

This is where our GADDAG shines: we reverse our input substring and search for items that start with that. Then everything up to the separator is the (reversed) prefix, and everything after the separator is the suffix.

Example: looking for “AX” somewhere would search for “XA” and find these items:

XA+OLOTL  -> AXOLOTL
XA+IOM    -> AXIOM
XA+ING    -> AXING
XAF+ES    -> FAXES
XAM+IMUM  -> MAXIMUM
XALER     -> RELAX
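Here’s a sketch of both halves: streaming the matches out of the set, and rebuilding a word from an entry. The no-separator branch handles whole reversed words like XALER, using the trick described in the next section:

use fst::automaton::{Automaton, Str};
use fst::{IntoStreamer, Set, Streamer};

// All GADDAG entries containing `substring`, e.g. "AX" -> XA+OLOTL, XAF+ES, ...
fn substring_matches(gaddag: &Set<Vec<u8>>, substring: &str) -> Vec<String> {
    let reversed: String = substring.chars().rev().collect();
    let matcher = Str::new(&reversed).starts_with();
    let mut stream = gaddag.search(matcher).into_stream();
    let mut words = Vec::new();
    while let Some(key) = stream.next() {
        words.push(entry_to_word(&String::from_utf8_lossy(key)));
    }
    words
}

// The part before the separator is the reversed prefix; an entry with no
// separator at all is a whole word stored reversed.
fn entry_to_word(entry: &str) -> String {
    match entry.split_once('+') {
        Some((rev_prefix, suffix)) => rev_prefix.chars().rev().chain(suffix.chars()).collect(),
        None => entry.chars().rev().collect(),
    }
}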

Suffix search

Suffix search was a little tricky for me to implement at first, but this optimization helps:

We can choose not to put a separator at the end of full words, for example storing “SERAC” (instead of “SERAC+”) for CARES. This a) saves us space by not storing the separator, and b) tells us when something is a whole word by the absence of a separator at the end.

This way, our suffix search becomes the equivalent of the regex (reversed_suffix)[^+]*: something that starts with our input reversed and then never contains the separator.

Example: looking for words that end in “AX” would search for items starting with “XA” and not having separators:

XALER     -> RELAX
XAF       -> FAX
XAR       -> RAX

A small note about implementing suffix search with fst

Skip this unless you want all the details:

I started with this matcher:

let matcher = Str::new(&search_val).starts_with().intersection(Str::new(SEP_STR).complement()); 

But this doesn’t work, because the second automaton, which implements the “then never has a separator” constraint, uses Str::new(SEP_STR).complement(). This matches anything that is not exactly the separator string, so it matches basically every entry. Instead I needed Subsequence::new(SEP_STR).complement(), which matches entries that do not have our separator anywhere in them.

let matcher = Str::new(&search_val).starts_with().intersection(Subsequence::new(SEP_STR).complement());
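Putting it together, a suffix search helper might look like this (again my sketch, not canonical fst usage):

use fst::automaton::{Automaton, Str, Subsequence};
use fst::{IntoStreamer, Set, Streamer};

// All words ending in `suffix`, e.g. "AX" -> RELAX, FAX, RAX, ...
fn suffix_matches(gaddag: &Set<Vec<u8>>, suffix: &str) -> Vec<String> {
    let reversed: String = suffix.chars().rev().collect();
    let matcher = Str::new(&reversed)
        .starts_with()
        .intersection(Subsequence::new("+").complement());
    let mut stream = gaddag.search(matcher).into_stream();
    let mut words = Vec::new();
    while let Some(key) = stream.next() {
        // No separator anywhere in the entry, so it is simply the word reversed.
        words.push(String::from_utf8_lossy(key).chars().rev().collect());
    }
    words
}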

Throughout my time with fst I was looking for an .and_then() helper method that would let me chain two search Automata, like String("Hello").and_then(NameAutomata), but I couldn’t find one or figure out how to build it.

Conclusion

There we have it, the data structure building blocks for a Scrabble AI.

We explained cross checks, the GADDAG, and how to implement and search one. My code for the GADDAG is at https://github.com/amedeedaboville/fst-gaddag. It’s still a bit raw but I’m going to be using it and iterating on it in the near future.

It was fun to use fst for a productive use case. The DFA minimization algorithm was some kind of magic to me in school, and transducers sound really cool.

If you’re up for more Scrabble AI tools, I recommend the page How Quackle Plays Scrabble, which details the other algorithms Quackle uses.

Have a good day!
