Daniel Beer Atom | RSS | About

Speeding up Adler-32 on repeated subsequences

31 Jan 2019

The Adler-32 checksum is used in the DEFLATE format as an integrity check on decompressed data. It’s fast to calculate, but it can be even faster if you know that parts of your data consist of repeated subsequences.

The Adler-32 checksum is defined in terms of the state of two 16-bit accumulators, updated for each byte xn as follows:

$$ \begin{array}{rcll} a_0 & = & 1 \\ b_0 & = & 0 \\ \\ a_{n+1} & = & a_n + x_n & \mod M \\ b_{n+1} & = & b_n + a_{n+1} & \mod M \\ & = & b_n + a_n + x_n & \mod M \\ \end{array} $$

All arithmetic is done modulo M = 65521. This can be implemented at a low fixed cost per input byte. If your input consists of a repeated subsequence, then it can be implemented in time proportional only to the length of the subsequence itself.

The checksum of an n-byte sequence can be expressed in closed form:

$$ \begin{array}{rcll} a_{n} & = & a_0 + \displaystyle\sum_{i=0}^{n-1} x_i & \mod M \\ \\ b_{n} & = & b_0 + {n}{a_0} + \displaystyle\sum_{i=0}^{n-1} (n-i) x_i & \mod M \\ \end{array} $$

If you have k repetitions of an m-byte subsequence, we can write the value of akm as:

$$ \begin{array}{rcll} a_{km} & = & a_0 + \displaystyle\sum_{j=0}^{k-1} \displaystyle\sum_{i=0}^{m-1} x_{jm+i} & \mod M \\ \\ & = & a_0 + k \displaystyle\sum_{i=0}^{m-1} x_i & \mod M \\ \end{array} $$

bkm can be written as:

$$ \begin{array}{rcll} b_{km} & = & a_0 + km{a_0} + \displaystyle\sum_{j=0}^{k-1} \displaystyle\sum_{i=0}^{m-1} (km - jm - i) x_{jm+i} & \mod M \\ \\ & = & a_0 + km{a_0} + \displaystyle\sum_{i=0}^{m-1} \left[ \frac{k(k+1)}{2}m - ki \right] x_i & \mod M \\ \\ & = & a_0 + km{a_0} + \displaystyle\frac{k(k+1)}{2}m \displaystyle\left[\sum_{i=0}^{m-1} x_i \right] - k \displaystyle\sum_{i=0}^{m-1} i x_i & \mod M \\ \end{array} $$

Both formulas involve only summations with m terms. We can then implement a fast C function for updating an Adler-32 checksum with a repeated subsequence:

/* a, b: pointers to current Adler-32 accumulator state.
 * x, m: subsequence data and length
 * k:    number of repetitions of subsequence
 *
 * The effect of this function is the same as updating (a, b) with
 * each byte of the subsequence, and repeating that procedure k
 * times over.
 */

#define M 65521

static void adler32_repeat(uint16_t *a, uint16_t *b,
                           const uint8_t *x, size_t m,
                           size_t k)
{
        unsigned long sum_x = 0;
        unsigned long sum_ix = 0;

        for (size_t i = 0; i < m; i++) {
                sum_x += x[i];
                sum_ix += i * x[i];
        }

        *b = ((*b) + k * m * (*a)
                   + m * k * (k+1) / 2 * sum_x
                   - k * sum_ix) % M;
        *a = ((*a) + k * sum_x) % M;
}