Tuesday, December 1, 2015

Optimality of frequency analysis on deterministically-encrypted database columns

This fall at CCS '15, Naveed, Kamara, and Wright published a paper evaluating some passive attacks on encrypted databases (EDBs). (There are posts about their paper on Kamara's blog and on the Bristol Cryptography Blog.) Their paper is important because it illustrates the power of simple attacks on columns encrypted with property-preserving encryption: all the attacker needs is a snapshot of the encrypted database and some relevant, publicly-available auxiliary data. In this post, I briefly explain an observation that Kenny Paterson and I made about the optimality of frequency analysis.

About the attacks

Two of the attacks Naveed, Kamara, and Wright evaluated, frequency analysis and ℓp-optimization, operate on deterministically-encrypted columns. (Of course, the appeal of deterministic encryption is that it naturally allows equality testing on encrypted data.) Both of these attacks are passive—they do not rely on seeing any queries to the EDB, and they require only a copy of it at rest. They do, however, require an auxiliary data set. In essence, an auxiliary data set is one that should have the same distribution as the target data (i.e., the data underlying the encrypted columns). Here's the main idea of the two attacks:

  • frequency analysis decrypts the column by matching the most frequent ciphertext with the most frequent plaintext from the auxiliary data, and so on down the frequency ranks.
  • ℓp-optimization decrypts the column by matching the frequencies of the ciphertexts with the frequencies of the auxiliary plaintexts in a way that minimizes the ℓp-distance of their histograms. (See Naveed et al.'s paper for details; a sketch of both attacks appears below.)

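To make these attacks concrete, here is a minimal Python sketch of both. It assumes the encrypted column and the auxiliary data are given as plain Python lists over value sets of the same size; the function names, and the casting of ℓp-optimization as a linear assignment problem, are my own, and the paper's actual experiments are more careful about ties and mismatched value sets.

from collections import Counter

import numpy as np
from scipy.optimize import linear_sum_assignment


def frequency_analysis(ciphertexts, aux_plaintexts):
    # Match the i-th most frequent ciphertext with the i-th most frequent
    # auxiliary plaintext.
    ct_ranked = [c for c, _ in Counter(ciphertexts).most_common()]
    pt_ranked = [m for m, _ in Counter(aux_plaintexts).most_common()]
    return dict(zip(ct_ranked, pt_ranked))


def lp_optimization(ciphertexts, aux_plaintexts, p=2):
    # Match ciphertexts with plaintexts so that the lp-distance between the
    # two frequency histograms is minimized, cast as an assignment problem
    # with cost[i, j] = |freq(ciphertext i) - freq(plaintext j)|^p.
    cts, pts = sorted(set(ciphertexts)), sorted(set(aux_plaintexts))
    ct_counts, pt_counts = Counter(ciphertexts), Counter(aux_plaintexts)
    ct_freq = np.array([ct_counts[c] / len(ciphertexts) for c in cts])
    pt_freq = np.array([pt_counts[m] / len(aux_plaintexts) for m in pts])
    cost = np.abs(ct_freq[:, None] - pt_freq[None, :]) ** p
    rows, cols = linear_sum_assignment(cost)
    return {cts[i]: pts[j] for i, j in zip(rows, cols)}
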
Attacking a deterministically-encrypted database column using this type of auxiliary data is like attacking a monoalphabetic substitution cipher given plaintext letter frequencies. For example, suppose you intercept the following ciphertext of an English message encrypted with a keyword cipher.

XKSREJJKQBKLPQKCSDHYECPQQPNVKNHYVDQBKSQDILNKUDJAQBPDJYDUDYSEHOQKQBEQPJYPERBKTSOISOQVKNGTKNBDOKVJDILNKUPIPJQEJYEQQBPOEIPQDIPOBENPEAPJPNEHNPOLKJODCDHDQXTKNEHHBSIEJDQXKSNLENQDRSHENYSQXCPDJAQKEDYQBKOPQKVBKIVPQBDJGVPREJCPIKOQSOPTSH

The simplest attack is well-known: look up English letter frequencies and you can probably crack the cipher by matching the most frequent letter in the ciphertext with the most frequent letter in English text (E). A slightly more powerful attack looks at which pairs of letters occur next to each other most often and uses bigram statistics (the most common English bigram is TH) to crack the cipher. Whether frequency analysis is an optimal attack on a keyword cipher is beside the point, since there is other information to exploit.
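
As a quick illustration, here is a hypothetical first pass at the ciphertext above in Python: tally the letter counts and line them up with a standard English frequency ordering. This only suggests candidate substitutions; refining them takes word patterns or the bigram statistics just mentioned.

from collections import Counter

ciphertext = (
    "XKSREJJKQBKLPQKCSDHYECPQQPNVKNHYVDQBKSQD"
    "ILNKUDJAQBPDJYDUDYSEHOQKQBEQPJYPERBKTSOI"
    "SOQVKNGTKNBDOKVJDILNKUPIPJQEJYEQQBPOEIPQ"
    "DIPOBENPEAPJPNEHNPOLKJODCDHDQXTKNEHHBSIE"
    "JDQXKSNLENQDRSHENYSQXCPDJAQKEDYQBKOPQKVB"
    "KIVPQBDJGVPREJCPIKOQSOPTSH"
)
english_by_frequency = "ETAOINSHRDLCUMWFGYPBVKJXQZ"  # one common ordering

# Print each ciphertext letter with its count and its rank-matched guess.
for rank, (letter, count) in enumerate(Counter(ciphertext).most_common()):
    print(f"{letter} x{count} -> guess {english_by_frequency[rank]}")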

In the EDB context, though, the letters (i.e., individual data values) are not ordered! The ciphertext would look more like this, with one "letter" per record:

Attribute 1
0x523e
0x0852
0xa8ef
0x0852
0x772a
0x523e

Even if you knew the alphabet (say, EU country names), higher-order frequency statistics (like bigram statistics) just can't apply, since there is no concept of order or adjacency among the records.

Our observation

So, given a copy of an EDB column and an auxiliary data set that is correlated with it, what is the best we can do in terms of attacking a deterministically-encrypted column? What should we use—frequency analysis, ℓp-optimization, or something else? We propose applying maximum likelihood estimation (MLE) to answer this question. MLE is a statistical technique for finding the value of a distribution's parameter that maximizes the likelihood¹ of observing some particular sample data under that distribution.

Why does it apply here? Well, we have samples of data that should have the distribution of the auxiliary data, but they've been relabelled (i.e., encrypted). The attacker wants to find the mapping between plaintext values (from the auxiliary data) and the relabelled values (from the encrypted target data). In other words, the distribution's parameter that we want MLE to find is a permutation.

Which permutation makes the target data most likely? The cool thing—and, with hindsight, the unsurprising thing—is that frequency analysis is exactly the optimal technique² that arises from maximizing the likelihood function!
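
Here is a sketch of why, with notation of my own choosing (see the note below for the precise statement). Assume each record is drawn independently from the auxiliary distribution, ciphertext value y appears c_y times in the column, plaintext value m has probability p_m, and the unknown parameter is the permutation \pi mapping plaintexts to ciphertexts. Then, up to a multinomial coefficient that does not depend on \pi,

    L(\pi) = \prod_{m} p_m^{c_{\pi(m)}}, \qquad \log L(\pi) = \sum_{m} c_{\pi(m)} \log p_m.

By the rearrangement inequality, this sum is maximized by pairing the largest counts with the largest probabilities, i.e., by matching frequency ranks. That is exactly frequency analysis.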

You can read the details in the short note we posted on ePrint.


¹ What's the difference between probability and likelihood? Informally, probability matters when we're given a certain model and want to know about samples from it. Likelihood, on the other hand, matters when we're given the data and want to know about the model (or distribution) from which it arises. Here's a simple example to illustrate the difference. Given a fair coin, we could ask (and compute) the probability of getting 2 heads when tossing it 10 times. On the other hand, suppose we have some coin of questionable origin. We toss it 10 times and observe 2 heads. Then, we could ask (and estimate, using MLE) how biased the coin is.
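For the coin of questionable origin, writing p for its unknown heads-probability, the likelihood of our observation is L(p) = \binom{10}{2} p^2 (1-p)^8, which is maximized at \hat{p} = 2/10 = 0.2; the maximum likelihood estimate thus suggests the coin is biased.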

² MLE may be optimal, but it is not infallible. If the sample size is very small (i.e., if there are few entries in the column of target data), then MLE won't perform well.
