Your xkcd passwords are pwned

Passwords are incredibly hard to "get right." In fact, there's a pretty solid argument to be made that they can never be right (at least when used as a sole authN factor.) Yet we are inundated with "experts" telling us fantastic stories about how secure the right password policy can be. The biggest problem here is these policies aren't modeling real world attackers (and they certainly don't represent real world attacks.)

Some might question, If this is such a big problem, why isn't it a bigger deal? Well, that's complicated...

People generally can't care about things they don't know. The average person's exposure to password creation and policies is tied directly to the varied websites and services they use on the web every day. When I type "new password requirements" into Google, I get the following about it:

Must be at least 8 characters (12+ recommended)

Must contain one uppercase

Must container one lowercase

Must contain one number

Honestly, this sounds pretty common. A good number of sites call it quits right there. If you're really lucky, they might even require at least one special character, and maybe check you haven't used the password before. In fact, there are many tenured professionals who are so familiar with these rules, they can recite them in their sleep; and they'll often repeat them as sound advice to users. But why? What is it that these rules are actually protecting against? And why are these the wrong requirements to choose, even though they are the most popular?

Well, now we need to understand what makes a password "strong".

Password strength

Ideally, the strength of a password should be the approximate measure of how difficult it would be for an attacker to recover said password. Except it gets a little more complicated than that. How do we determine something is even difficult?

The industry has made a few important assumptions about these attacks. Assuming an attacker doesn't know anything about the target (and has no way to prioritize attacks), they would be forced to attempt a brute force starting from the lowest character positions and iterating through the highest positions. This is essentially "walking the key space". Statistically speaking, the probability of recovering a password this way only requires walking 50% of a key space before it is likely to be found.

So if our character set is just using digits (from 0-9), it would take longer to walk a 6 digit combination than a 3 digit combination. (As character sets change this calculation can become more complex. Especially disparate character sets per position of the secret.) This unit of time can be used as a deterrent for attackers, and is often used as a metric to directly equate password strength.

Now we have something to measure. The longer it takes to crawl a key space, the stronger the password. (At least, this is the common theory.)

But this is still only part of the problem. If an attacker knows something about how the passwords were created, they would be able to reduce the recovery time. To make sure we maximize recovery time, we need to make sure the selection process introduces randomness. If our selections are truly random, it removes the predictability factor and forces an attacker to walk more of the key space to recover our secrets. Cryptography tries to measure this randomness by what is called entropy.

Using these tools, we increase entropy to drive up recovery time, and this increases the strength of our password.

There's this whole other conversation to be had on whether or not we can even properly measure entropy, but that's outside the scope of this article.

How do we improve standards?

I think people have understood for a long time that NIST's original Special Publication 800-63B had not been a successful approach to a password policy. Passwords made with this policy often have a limited amount of time it will take an attacker to brute force the keyspace, and the difficulty it presents for most people to remember is pretty terrible. But what do we replace it with?

It's also worth noting that NIST has revised their standards. As of 2017, SP 800-63B now has more sensible recommendations for passwords and authN (which were again updated in 2019.) We now have some recommendations like: * no complexity requirements * no password expiration period * no password hints * no truncation of secrets * tagging SMS OTPs as "RESTRICTED" * and some other good things...

If you haven't read through their docs, I would highly recommend reading the pdf from NIST's website.

Let's take a look at the now ultra-famous xkcd recommendation:

password strength

We'll examine the benefits and drawbacks of this approach.

The good

This xkcd comic suggests what is essentially diceware over the traditional patterns. Diceware isn't a new concept, but it's definitely not as popular for creating passwords. The idea here is to create memorable secrets, chosen at random, with high levels on entropy compared to traditional passwords.

I think this is definitely a better direction, and I like that people are thinking about this and challenging the standards which have come before them. (I also think it's important not to stop there.)

The bad

Despite the gains, there are a few important things going wrong here.

how we choose words matters

It's nice to see a change in philosophy, but if a person is picking these words in their head they are relying on their cognitive facilities to fill in the blanks. This creates a surface area of predictability an attacker can leverage, and that's going to weaken the password. So suggesting diceware is great, but it should also come with a recommendation of how that pattern should be selected (and that should not simply be "thinking of the words.")

Technically speaking, selecting words at random is part of the diceware spec, but there's so much misunderstanding around it that it's worth taking the time to explain in detail.
measuring entropy is hard

We have an established set of rules for measuring entropy, but it can be easy to apply this incorrectly. For example, passwords are often measured in bits of entropy, but there's a strong argument to be made that bits are the wrong metric to determine password strength. We could easily be using n-grams or shingles (entire words) to constitute our key space, and this affects recovery times and resilience. In our example xkcd comic, 44 bits of entropy is estimated to take 550 years to brute force. There are quite a number of assumptions being made there, (assuming max key space walks for every attack, assuming static crack rates of 1000 H/s, assuming straight brute force of bits, etc.) and none of them model a real-world attack. By filtering on the base components of the passwords, we can skip irrelevant combinations of bits and reduce the key space for a successful attack. Realistically, we need a better way to measure password strength.

I would recommend anyone looking to measure password strength more effectively should research Dropbox's zxcvbn.
bad guys don't have to use brute force (they probably know more than you think)

A majority of the material and resources I see floating around seem to suggest that attackers are going to waste their time just straight brute-forcing your passwords. The reality is, tools like Hashcat don't even have a brute force mode anymore. You could emulate one if you really needed to (in Hashcat you would step through a large binary mask and disable markov chains), but there are often way more effective attacks to use instead. Using publicly available information, and well established patterns of human behaviour, an attacker can reduce the key space of an attack even further to a reasonable subset. The result is a significantly smaller recovery time than blue teams are expecting.
xkcd got diceware wrong

Back in the mid-90s, diceware's creator (Arnold Reinhold) originally claimed a minimum of 5 words was necessary to reasonably protect the average user. By the time xkcd's comic was released in 2014, he raised this minimum to 6 words. xkcd's comic is obviously below this requirement.

xkcd seems to suggest that the entropy is tied to the number of characters in the secret. However, previously set standards (eg., read Bruce Schneier's Applied Cryptography, which references the work of Claude Elmwood Shannon) tie the entropy of diceware to the size of the source list, which is defined as a minimum of log₂(6^5) words, and each word selected adds ~12.9 bits of entropy. To put this in perspective, the 51.6 bits of a 4 word diceware password is thought to be about the same strength as an 8 character password made up of random ASCII characters.

Although the concept is fair, this comic's implementation is flawed for achieving its goal.

The ugly

Cracking xkcd passwords is easier than you think. Let me tell you a story...

Password security is one of those things I spend a lot of time thinking about. Perhaps too much time, to be honest. I've been pretty fortunate that my role at OkCupid allows me significant lateral freedom for research, and the ability to implement strong security controls as the direct result of this research. As a good example, OkCupid stopped aging out employee passwords by policy a couple of years before NIST revised SP 800-63B. Instead, regular password audits are performed as a mitigating control. Employee password hashes for critical core resources are collected and thrown into a password cracking rig. Any passwords which are recovered are forced to be changed.

Since this has been going on for a few years, some of the more tenured employees have developed stronger password hygiene (which is exactly the goal of our program.) This is great against some of the more common attacks, but I started wondering what kind of advanced attacks we could perform against our more resilient employees. The above xkcd comic immediately came to mind, and I wondered what a reasonable attack would look like to recover someone choosing a 4-word xkcd-style diceware.

These particular passwords were being stored in salted SHA-512. (For legacy reasons, that was the best we could make them, but it's still better than your enterprise Active Directory's NTLM hashes.)

Don't ever let anyone tell you SHA-anything is "enough". If you can go bcrypt, scrypt, or argon2, you should.

The first thing I tried to do was model my targets. In doing so, I made the following assumptions:

Someone reading that comic would just pick 4 "random" words from their head and make that their password. (I qualify random here because, despite our best efforts, humans are a poor source of randomness. In fact, we're highly deterministic.)
Although qualitative, I also felt that people I've met who have created xkcd-like passwords tend not to use spaces within the password (they simply adjoin all the words.)
I also wanted to assume all passwords were going to be in lowercase.
OkCupid is a US-based company, and everyone in the office speaks English, so I assumed an English lexicon for this attack.
But there are a lot of words in the English language, and that would make this attack entirely unreasonable if I tried all of them. Since I previously determined that people were going to be selecting these words from their heads, I assumed we could narrow down our candidate pool quite a bit. After all, the active vocabulary of an adult in the US is estimated somewhere around ~40k words. My next assumption here is that the words someone were likely to pick would probably be between 4 and 7 characters long. (This was a little bit of a guess, but even the xkcd comic fit this template, so I decided to at least try it.)

Using awk, I grabbed a quick list of these candidates from Webster's dictionary. Then, I used the combinator utility from hashcat-utils to create a new list of all the combinations of this list with itself. I assumed people would not use the same word twice, so I filtered that out.

The resulting dictionary wasn't terrible, but when I fired up Hashcat, the estimate for the job was going to be a few months. That probably would have been fine to run but I was a little anxious and wanted to see if I could at least prioritize the key space a little before I ran such a massive job. So I decided to take one more guess. Could I also filter out words which started with less frequently used letters? (like U, X, Y, Z, etc.)

I ended up filtering only for words starting with a through r. The final dictionary used 2458 words.

Now, when I ran this through Hashcat, I had a job which estimated it was going to finish in 9 days. That seemed a lot more reasonable. I figured if it failed I could take a stab at re-priotizing some of the leftover key space, and worst-case I would run the giant job for a few months.

If you haven't already guessed, I got a password in less than that. After 6 days, I cracked the password for a senior systems administrator who held highly sensitive privileges to the entire infrastructure. (This was definitely an epic moment for me.)

Of course I let him know to change his password (he immediately started selecting much longer dicewares), but I also asked him how he selected the compromised password. Interestingly, he didn't select it from his head; he actually used a popular xkcd password generator. If you read the article paired with the generator, it explains how the author selected a source dictionary of 1949 words (not even the 2048 recommended by xkcd, because it's "close enough" - which is mathematically untrue in this context.) This only provides 1949^4 max combinations, which is actually a smaller key space than the job I ran at a little under 2458^4. In theory, I could have grabbed the source for this generator (available in the web page's source code) and just walked through that entire key space in less time.

If we are assuming that you can easily crack an xkcd password in 6 days, you would have to change your passwords at least 2 times a week as a reasonable mitigating control (but of course that comes with its own security problems!)

Conclusion

I'd like to be clear. Passwords suck. We shouldn't be using them. But in today's world, there are still places where we need to use them.

This wouldn't be much of an article on passwords if I didn't mention password managers. Everyone should be using password managers. A ton of them are free. If you don't already have one, feel free to go check out Bitwarden. This is going to make your life exponentially easier, and you can reasonably use strong passwords (randomly generated by your password manager) because you won't ever need to remember them.

While password managers are great (and I can't say enough how much they are), they don't fix the problem everywhere. For example, the password manager itself needs to be protected by a master password (but this is still infinitely better remembering one password rather than all the passwords within the manager.) So you are still going to need a small handful of passwords, and this is where it becomes important to have something memorable, yet strong.

Don't stop using passphrases or diceware. The point here was never to suggest they are more flawed than passwords, but keep realistic expectations. A weak passphrase is still going to be weak. When people talk about password strength stretching over years, always question how they modeled their work to form that conclusion. Maybe it's correct, but often it's not.

Also, if you're going to use diceware, make sure you do it right:

Use a minimum of 6 base words.
Use a decently sized pool of candidates for selection (Diceware's recommendation of 6^5 seems like a good bar.)
Make sure your selections are chosen at random. (Do not pick them yourself.)
Go ahead an use spaces in your passwords. Use all the spaces! It really does make a difference.

And the next time someone tries to tell you an xkcd password is going to take "centuries" to crack... just send them here.