The Math Behind Wordle: Information Theory and Optimal Play
Explore the fascinating mathematics behind Wordle, from information theory to entropy calculations. Understand why certain words are mathematically superior.
Dwayne K. Richardson is a Wordle enthusiast and puzzle analyst who has been playing Wordle since January 2022. With a current streak of 340+ days, Dwayne combines statistical analysis with practical gameplay experience to help players improve their Wordle skills. He is the author of all blog posts on Wordle Analyzer.
Wordle Is a Math Problem Disguised as a Word Game
Every time you make a guess in Wordle, you are running a calculation. You might not realize it — most people do not think about information theory while typing five-letter words over morning coffee — but the game is fundamentally about minimizing uncertainty. Each guess takes a pool of possible answers and splits it into smaller groups based on the color pattern. The better your guess, the more evenly it splits the pool, and the faster you converge on the answer. This is not a metaphor. It is literally what is happening, and the math that describes it is both elegant and surprisingly practical.
I have been fascinated by the math behind Wordle since I first read a paper from MIT researchers who formally analyzed optimal play. You do not need to understand the math to play well, but understanding the principles changed how I think about each guess — not because I am computing entropy at the keyboard, but because the framework gives me better intuitions about what makes a guess good or bad. This article explains those principles without requiring a background in mathematics.
Wordle as an Information Theory Problem
Information theory, developed by Claude Shannon in the 1940s, quantifies uncertainty. The core unit is the bit — one bit cuts possibilities roughly in half. Eight bits cut them by a factor of 256. In Wordle, you start with about 2,309 possible answers. How many bits do you need to identify one specific word? The base-2 logarithm of 2,309 is roughly 11.2 bits. So if every guess gave maximum information, you could in principle always solve Wordle in 3 guesses: each guess produces one of 243 possible color patterns (about 7.9 bits of feedback), so two guesses' feedback can distinguish up to 243 × 243 = 59,049 outcomes, far more than 2,309, leaving the third guess to be the answer itself.
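The arithmetic is easy to verify. A quick sketch (the 2,309 figure is the commonly cited size of the original answer list):

```python
import math

answers = 2309                         # commonly cited answer-pool size
bits_to_identify = math.log2(answers)  # bits needed to single out one word
patterns = 3 ** 5                      # 5 tiles, each green/yellow/gray
bits_per_guess = math.log2(patterns)   # ceiling on information per guess

print(f"{bits_to_identify:.2f} bits to identify the answer")  # 11.17
print(f"{bits_per_guess:.2f} bits max per guess")             # 7.92
```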
In practice, guesses do not give maximum information because color patterns are not equally likely. A guess like CRANE will produce diverse patterns — some all-gray, some with greens and yellows. A guess like XYLYL (a real word, a chemical group) will almost always produce all-gray because X and Y rarely appear. Both give you some information, but CRANE gives substantially more on average. The difference is not subtle — it is the difference between having 78 candidate words remaining and 430.
The fundamental insight: Wordle is not a word game. It is a search problem where you are trying to locate one specific item in a set of 2,309 using as few queries as possible. The words are the search space, the colors are the feedback, and entropy measures how much each query narrows the search. Understanding this transforms how you think about every guess.
What Entropy Means in Wordle Context
Entropy measures how much a guess reduces uncertainty on average. It is calculated by looking at all possible color patterns a guess can produce, determining what fraction of answers produce each pattern, and computing expected information gain. A guess splitting answers into 100 equally-sized groups has higher entropy than one where 90% of answers produce the same pattern and the remaining 10% are scattered across 99 others. The first resolves uncertainty reliably. The second usually tells you little, with a rare chance of being very informative.
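The even-split claim can be checked in a few lines. A minimal sketch computing Shannon entropy from pattern bucket sizes (the bucket counts below are illustrative, not real Wordle data):

```python
from math import log2

def entropy_bits(bucket_sizes):
    """Shannon entropy of a guess: H = -sum(p * log2(p)), where p is the
    fraction of candidate answers that fall into each color-pattern bucket."""
    n = sum(bucket_sizes)
    return -sum((c / n) * log2(c / n) for c in bucket_sizes if c > 0)

even = [10] * 100            # 1,000 answers split into 100 equal groups
skewed = [901] + [1] * 99    # ~90% share one pattern, 99 singletons

print(round(entropy_bits(even), 2))    # 6.64
print(round(entropy_bits(skewed), 2))  # 1.12
```

Maximum entropy for a fixed number of buckets comes from a perfectly even split, which is exactly why guesses that scatter the pool across many similar-sized groups are strong.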
The highest-entropy openers are SALET and SLATE, both producing roughly 5.8 bits of entropy. In information-theoretic terms, the first guess cuts the pool of possible answers by a factor of about 55 (2 to the power of 5.8): starting from 2,309, you are down to roughly 42. A poor opener like OUIJA produces only about 4.2 bits, a reduction factor of roughly 18, leaving about 128. (These entropy-based figures behave like geometric averages; the arithmetic average of remaining candidates is higher — roughly 70 for SALET and 210 for OUIJA — because real splits are uneven, a distinction covered in the section on expected remaining words.) Same number of guesses used, wildly different positions on the board.
How Many Bits Each Guess Provides
Not all guesses are created equal. The table below shows the approximate entropy for different categories of openers, along with the practical impact on remaining candidate count. The gap between the best and worst openers is staggering — nearly 2.2 bits, which translates to roughly 4.6x more remaining words after a poor opener compared to an optimal one.
| Opener Category | Examples | Entropy (bits) | Avg Remaining | Pool Reduction |
|---|---|---|---|---|
| Optimal | SALET, SLATE | ~5.8 | ~70 | 97% |
| Good | CRANE, TRACE, RAISE | ~5.5-5.7 | ~78-86 | 96-97% |
| Vowel-heavy | ADIEU, AUDIO | ~4.8-5.0 | ~119-135 | 94-95% |
| Mediocre | QUICK, JUMBO | ~4.0-4.5 | ~200-300 | 87-91% |
| Poor | OUIJA, XYLYL | ~3.5-4.2 | ~210-430 | 81-91% |
Second guesses typically provide 3 to 5 bits. By guess three, you are often in the 1 to 3 bit range because remaining uncertainty is small and it is harder to split a small pool evenly. Identifying the answer requires about 11.2 bits of information in total, so that is the theoretical minimum you must extract to solve. Optimal play extracts roughly 16.3 bits over four guesses, so there is substantial slack — which is why most strategic players solve in 4 on average even though they are not playing optimally.
Above: SLATE producing a typical pattern — 2 greens, 1 yellow, 2 grays. This particular pattern provides roughly 5.6 bits of information, cutting the candidate pool from 2,309 down to a manageable set. The math works even if you never calculate a single bit.
Why Some Words Are Worth More as Guesses
A guess's value depends on two things: which letters it tests and where they are positioned. Testing common letters is more valuable because you are more likely to get non-gray feedback, and non-gray feedback splits the possibility space. Position matters too. Testing E at the end is more informative than E at the beginning — E is much more common at the end of five-letter words.
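The positional claim is easy to test against any word list. A sketch using a tiny illustrative sample (the real calculation would run over the full 2,309-word answer list):

```python
from collections import Counter

def positional_counts(words, letter):
    """How often `letter` appears at each of the five positions."""
    counts = Counter()
    for word in words:
        for i, ch in enumerate(word):
            if ch == letter:
                counts[i] += 1
    return [counts[i] for i in range(5)]

# Tiny illustrative sample, not the real answer list
sample = ["slate", "crane", "eagle", "early", "theme", "email", "stone"]
print(positional_counts(sample, "e"))  # [3, 0, 1, 0, 5]
```

Even in this toy sample, E shows up far more often in the final position than anywhere else, which is the pattern the full answer list exhibits.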
SALET (S-A-L-E-T) and SLATE (S-L-A-T-E) test the same five letters, but SALET places E at position 4 and T at position 5, while SLATE places E at position 5 and T at position 4. The positional frequency differences give SALET a tiny edge in entropy — about 0.01 bits. In practice, this difference is invisible. You would need millions of games for it to show up in stats. But the math is clear: SALET is technically the optimal first guess by expected information gain.
The Calculation: Expected Remaining Words
For each possible color pattern a guess can produce, count how many answers would produce that pattern. Multiply by the probability of that pattern (count divided by total answers). Sum across all patterns. This gives you the expected pool size after one guess. Lower is better. SALET: roughly 70.8 expected remaining words. CRANE: about 78.4. ADIEU: about 119. OUIJA: about 213.
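That procedure translates directly into code. A sketch over a toy candidate pool (the duplicate-letter handling follows standard Wordle scoring: greens first, then yellows drawn from the leftover letters):

```python
from collections import Counter

def feedback(guess, answer):
    """Wordle color pattern: 2 = green, 1 = yellow, 0 = gray."""
    pattern = [0] * 5
    leftover = Counter()                  # unmatched answer letters
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = 2
        else:
            leftover[a] += 1
    for i, g in enumerate(guess):
        if pattern[i] == 0 and leftover[g] > 0:
            pattern[i] = 1                # yellow: right letter, wrong spot
            leftover[g] -= 1
    return tuple(pattern)

def expected_remaining(guess, candidates):
    """Sum over patterns of P(pattern) * bucket size = sum(size^2) / total."""
    buckets = Counter(feedback(guess, ans) for ans in candidates)
    n = len(candidates)
    return sum(c * c for c in buckets.values()) / n

pool = ["crane", "slate", "stale", "trace", "adieu"]   # toy pool
for guess in pool:
    print(guess, expected_remaining(guess, pool))
```

Run over the full answer list, this is the calculation that yields the figures quoted above; on a toy pool the absolute numbers are meaningless, but the ranking logic is identical.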
These numbers are not linear with entropy because entropy and expected pool size measure different things. Entropy measures how evenly the pool splits. Expected pool size measures average remaining candidates regardless of distribution. They are correlated but not identical. A guess that always leaves exactly 70 candidates has the same expected pool size as one that usually leaves 10 but occasionally leaves 400 — but the first is far more reliable, and entropy captures that difference.
Practical takeaway: You do not need to calculate entropy to use these principles. Just ask two questions before every guess: "Which new letters am I testing?" and "How many possible answers does this help me distinguish?" If the answer to the first is "none," you are wasting a guess. If the answer to the second is "one," you had better have enough guesses left for the alternatives.
Why CRANE Beats XYLYL on Average
XYLYL contains X and Y, two of the least common letters in the answer pool. The most likely outcome is all five gray — eliminating maybe 30 to 40 percent of answers. Not useless, but not great. CRANE contains five of the most common letters. The most likely outcome is a mix of grays, yellows, and maybe a green. The diverse pattern variety means CRANE splits the answer pool into more, smaller groups — exactly what you want.
Expected remaining words after XYLYL: roughly 430. After CRANE: 78. Same guess slot, five times the elimination power. That is the difference between testing common letters and uncommon ones, and it compounds over the course of a game. A player who opens with CRANE and follows up intelligently will typically solve in 3-4 guesses. A player who opens with XYLYL will typically need 5-6.
The Optimal First Guess Depends on Your Metric
There is no single "best" opener — it depends on what you are optimizing for. Different mathematical objectives produce different optimal words, and understanding the distinction helps you choose the approach that matches your goals.
| Optimization Goal | Best Opener | Why | Best For |
|---|---|---|---|
| Minimum average remaining words | SALET | Lowest expected pool size (~70.8) | Players optimizing average guess count |
| Minimum worst-case remaining words | SLATE | Fewest words in the worst pattern | Players who hate bad surprises |
| Minimax (solve in fewest worst-case guesses) | SALET | Guarantees solve in at most 5 guesses | Streak players who want safety bounds |
| Maximum entropy (information gain) | SALET | Highest expected bits (~5.8) | Theory enthusiasts |
The differences are marginal. Any top-10 opener is within a few percentage points of optimal. The math matters more for understanding why some words work better than for choosing between SALET and SLATE. If you are using a top-5 opener, you are already extracting nearly all the available information from your first guess.
Why the Best Guess Is Not Always a Possible Answer
The answer pool (roughly 2,309 words) is a subset of valid guesses (roughly 12,000+). Words like TARSE, LARES, or AESIR are valid guesses that will never be answers. Sometimes a non-answer word splits remaining candidates more evenly than any answer word, because it can use letter combinations absent from the answer pool. This matters most in the late game with few candidates.
In normal mode, you can probe freely with non-answer words. In Hard Mode, they must still include all green and yellow letters, limiting their usefulness. The ability to use non-answer probe words is one of the most underappreciated advantages of normal mode — it gives you access to roughly 10,000 additional test words that can be strategically valuable when the remaining candidates share many letters.
How AI Solvers Work: Minimax vs Expected Value
Two main approaches, optimizing for different goals. Understanding these approaches does not just explain how bots play — it reveals the fundamental strategic tension in Wordle between optimizing your average performance and protecting your worst case.
Expected Value (Entropy Maximization)
Maximizes expected information gain — in practice, this means minimizing the average remaining words after each guess, since the two metrics are closely correlated. Over many games, this minimizes average guesses to solve. SALET is the optimal first guess under this metric. Think of it as the "best on average" approach — it accepts occasional bad days in exchange for consistently strong average performance. Best for players who care about their guess distribution chart.
Minimax (Worst-Case Optimization)
Minimizes the worst-case outcome — asking "what is the biggest pool I could face after this guess." This guarantees solving any answer within a fixed number of guesses (5 for optimal play). It sacrifices average-case performance to ensure you never need more than 5 guesses. Best for streak players who cannot afford a single loss.
Neither is "better" — they optimize for different goals. Streak players should prefer minimax (bounds worst case). Average-seekers should prefer expected value. In my play, I use simplified expected value for the first two guesses and switch to minimax thinking for guesses 3 through 6. Not optimal, but practical — and it reflects the reality that early guesses benefit from maximizing information while late guesses benefit from guaranteeing solutions.
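The difference between the two objectives comes down to how a guess's pattern buckets are scored. A minimal sketch (the bucket sizes are made up for illustration):

```python
def ev_score(bucket_sizes):
    """Expected-value objective: average remaining pool size (lower is better)."""
    n = sum(bucket_sizes)
    return sum(c * c for c in bucket_sizes) / n

def minimax_score(bucket_sizes):
    """Minimax objective: worst-case remaining pool size (lower is better)."""
    return max(bucket_sizes)

steady = [70] * 4              # always leaves exactly 70 candidates
spiky = [1] * 180 + [100]      # usually leaves 1, occasionally 100

print(ev_score(steady), minimax_score(steady))          # 70.0 70
print(round(ev_score(spiky), 2), minimax_score(spiky))  # 36.36 100
```

An expected-value solver prefers `spiky` (about 36 remaining on average versus 70); a minimax solver prefers `steady` (never worse than 70 versus a possible 100). That is the whole strategic tension in one comparison.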
Fun fact: An optimal minimax solver can guarantee solving every Wordle puzzle in at most 5 guesses. Not 6 — 5. That means with perfect play, you never need the sixth row. The gap between this theoretical bound and most players' actual performance (3.7-4.2 average) shows how much room there is between "good enough" and "optimal."
My Simplified Approach: Using Math Without a Calculator
I do not compute entropy at the keyboard. But I do use the principles the math reveals. Before typing any guess, I ask two questions: which untested letters am I testing? And how many possible answers does this help me distinguish? If the answer to the first is "none," I am wasting a guess. If the answer to the second is "one," I had better have enough guesses left for the alternatives.
No spreadsheets, no entropy calculations. Two questions that encapsulate the core insight of information theory: a good guess reduces uncertainty, and the best guess reduces it the most evenly. The math is elegant and worth understanding. But the daily puzzle is played by humans, not algorithms. Use the math to choose a good opener and understand why some guesses feel productive. Then close the spreadsheet and play the game.
✅ Key Takeaways
- Wordle is fundamentally a search problem — you are locating one word in 2,309 using as few queries as possible
- Each guess provides roughly 3.5 to 5.8 bits of information depending on letter choice and positioning
- SALET is the optimal opener by expected information gain, producing ~5.8 bits and leaving ~70 remaining words
- Common letters in common positions produce more entropy because they split the answer pool more evenly
- The gap between optimal and poor openers is massive: SALET leaves ~70 words vs OUIJA's ~210
- Expected value optimization minimizes average guesses; minimax optimization minimizes worst-case guesses
- You do not need to calculate entropy — just test new common letters and maximize the candidates you can distinguish