Coordinated Cache Refill with Redis and Node

After launching a new Node.js app we built as part of a recent redesign for a high-volume site, we noticed performance problems recurring every five minutes:

Response time spikes when cache entries expire.

The spikes on the left graph and the troughs on the right graph are the problem. We were using Redis to cache common shared operations, and while Redis itself performed very well, our application would rush to recompute cached values when they expired every five minutes.

Our initial caching implementation simply saved a computation result in Redis with a TTL of five minutes. When the entry expired, every request would try to recompute the value and save the new result until one of them succeeded. This caused a rush of database queries in the interim, increasing wait times and degrading the user experience.
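For illustration, that first version looked roughly like this (a minimal sketch, not our actual code, assuming an ioredis client):

import Redis from "ioredis";

const redis = new Redis();

// Naive read-through cache: on a miss, recompute and store with a 5-minute TTL.
// When the TTL lapses, every concurrent request misses at once, and all of them
// run func() against the database until one write repopulates the key.
export async function naiveCache<T>(key: string, func: () => Promise<T>): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) {
    return JSON.parse(hit);
  }
  const value = await func();
  await redis.set(key, JSON.stringify(value), "PX", 300000);
  return value;
}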

This is a common problem with heavily utilized caches, often called a cache stampede, and many good solutions exist (locking, request coalescing, probabilistic early expiration, and so on). However, we needed to fix the problem ASAP and didn't have time for a literature review or an open-source search. The quickest approach was to roll our own.

To solve the problem, we needed a way to coordinate the refreshing of the cache across our web tier. Instead of having every request recompute the value until the new result was saved, we wanted to have a single Node process recompute the value and let all other requests continue to use the cached value.

Our approach involved letting each process “draw straws” for the responsibility of recomputing the value well before the item should expire. The process with the longest straw gets the responsibility, and all other processes continue to use the existing cache value.

At a high level, the process works as follows:

  1. Each cached call has a min age, max age, and grace period.
  2. Every cached call is saved with a TTL set to the max age.
  3. Each time the value is fetched from the cache, the client checks its age. If the item's age is less than the min age, return it as-is.
  4. If the item is older than the min age, pick a random identifier and write it to Redis along with a new min age. This effectively pushes the min age back by grace milliseconds, so new requests continue to use the cached value and return at step 3.
  5. Wait a short period (25 ms) to let any other clients racing between steps 3 and 4 do the same.
  6. Refetch the item. If the saved identifier is not the one this process just generated, we did not draw the longest straw, so just return the cached value.

For most of our call sites, we use a min age of four minutes, a max age of five minutes, and a grace period of 20 seconds.

The first request to run after the min age has passed pushes the min age back immediately, before doing any real work. If that write takes 5 ms to reach Redis, any other call arriving in that 5 ms window will do the same, but once the first write completes, all new requests continue to use the cached value as though the min age had not yet passed.

Each process that runs in that 5 ms window writes its own randomly chosen identifier to Redis. Once the dust settles, the process whose identifier landed last is the one that recomputes the value. The rest simply reuse the cached value and assume it will be updated shortly.

Since the grace period is less than the delta between the min age and the max age, there can be multiple attempts to refill the cache before it expires should anything go wrong, such as a Node process getting killed mid-refill. With our settings, the 60-second window between the min age and the max age leaves room for up to three 20-second grace periods, and thus up to three refill attempts.

With this change, our response time spikes disappeared completely, and our database now sits mostly idle at all times.

For anyone interested, here’s the TypeScript implementation we put together:

import { getRedisConnection } from "@Graphql/endpoint";

const SEPARATOR = "\u0000";
const KEY_PREFIX = "cache2";


export async function cache<T>(
  key: string,
  func: () => Promise<T>,
  opts: CacheOpts = {
    minAgeMs: 230000,
    maxAgeMs: 300000,
    graceMs: 20000
  }
): Promise<T> {
  const cacheKey = `${KEY_PREFIX}:${key}`;
  let effectivePayload = await getCacheEntry(cacheKey);
  if (effectivePayload) {
    const now = new Date();
    // If the minimum age has expired
    if (now > effectivePayload.nextCheckTime) {
      const ticket = drawTicket();
      const currentAge = msBetween(effectivePayload.createdAt, now);
      const timeUntilEntryIsInvalid = Math.max(opts.maxAgeMs - currentAge, 1);

      // Advance minimum age by grace period so other processes don't retry.
      const nextCheckTime = addMsToDate(now, opts.graceMs);

      // Put current value back in cache with a new min age and our ticket.
      await setCacheEntry(
        cacheKey,
        {
          ticket: ticket,
          createdAt: effectivePayload.createdAt,
          nextCheckTime,
          jsonValue: effectivePayload.jsonValue
        },
        timeUntilEntryIsInvalid
      );

      // Wait for others to do the same.
      await sleep(25);

      // Re-fetch the entry, falling back to our copy if it was evicted.
      effectivePayload = (await getCacheEntry(cacheKey)) || effectivePayload;

      // If our ticket matches what's in Redis, then it's our job to recompute the value.
      if (effectivePayload.ticket === ticket) {
        effectivePayload = await fillCache();
      }
    }
  } else {
    effectivePayload = await fillCache();
  }

  return JSON.parse(effectivePayload.jsonValue);

  async function fillCache(): Promise<Payload> {
    let value = await func();
    if (value === undefined) {
      // JSON.stringify drops undefined; coerce to null so the value round-trips.
      value = null as any;
    }
    const now = new Date();
    const payload: Payload = {
      createdAt: now,
      nextCheckTime: addMsToDate(now, opts.minAgeMs),
      ticket: -1,
      jsonValue: JSON.stringify(value)
    };
    await setCacheEntry(cacheKey, payload, opts.maxAgeMs);
    return payload;
  }
}

interface CacheOpts {
  minAgeMs: number;
  maxAgeMs: number;
  graceMs: number;
}

function drawTicket() {
  // A random integer serves as this process's "straw".
  return Math.floor(Math.random() * 2000000000);
}

export interface Payload {
  ticket: number;
  createdAt: Date;
  nextCheckTime: Date;
  jsonValue: string;
}

export function payloadToString(opts: Payload) {
  return [
    opts.ticket.toString(),
    opts.createdAt.valueOf(),
    opts.nextCheckTime.valueOf(),
    opts.jsonValue
  ].join(SEPARATOR);
}

export function stringToPayload(cacheString: string): Payload | null {
  const entries = cacheString.split(SEPARATOR);
  if (entries.length !== 4) {
    return null;
  }

  const [winner, createdAt, nextCheckTime, jsonValue] = entries;
  return {
    ticket: parseInt(winner, 10),
    createdAt: new Date(parseInt(createdAt, 10)),
    nextCheckTime: new Date(parseInt(nextCheckTime, 10)),
    jsonValue: jsonValue
  };
}

interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlMs: number): Promise<void>;
}

export const REDIS_CACHE: CacheStore = {
  async get(key: string): Promise<string | null> {
    const redis = getRedisConnection();
    const cacheString: string | null = await redis.get(key);
    return cacheString;
  },

  async set(key: string, value: string, ttlMs: number): Promise<void> {
    const redis = getRedisConnection();
    await redis.set(key, value, "PX", ttlMs);
  }
};

interface LocalCache extends CacheStore {
  map: Map<string, {
    expiration: Date,
    value: string
  }>;

  reset(): void;
  expirationTime(key: string): Date | null;
}

export const LOCAL_CACHE: LocalCache = {
  map: new Map(),

  reset(this: LocalCache) {
    this.map = new Map();
  },

  expirationTime(this: LocalCache, key: string): Date | null {
    const entry = this.map.get(key);
    return entry ? entry.expiration : null;
  },

  async get(this: LocalCache, key: string): Promise<string | null> {
    await sleep(Math.random() * 15);
    const cacheValue = this.map.get(key);
    if (cacheValue && new Date() < cacheValue.expiration) {
      return cacheValue.value;
    } else {
      return null;
    }
  },

  async set(this: LocalCache, key: string, value: string, ttlMs: number): Promise<void> {
    await sleep(Math.random() * 15);

    this.map.set(key, {
      value,
      expiration: addMsToDate(new Date(), ttlMs)
    });
  }
};

export let CACHE_STORE: CacheStore = REDIS_CACHE;
export function setCacheStore(store: CacheStore) {
  CACHE_STORE = store;
}

export async function getCacheEntry(key: string): Promise<Payload | null> {
  const cacheString: string | null = await CACHE_STORE.get(key);
  return cacheString === null || cacheString === undefined
    ? null
    : stringToPayload(cacheString);
}

export async function setCacheEntry(
  key: string,
  payload: Payload,
  ttlMs: number
): Promise<void> {
  await CACHE_STORE.set(key, payloadToString(payload), ttlMs);
}

export function addMsToDate(date: Date, ms: number): Date {
  return new Date(date.valueOf() + ms);
}

export function msBetween(date1: Date, date2: Date): number {
  return date2.valueOf() - date1.valueOf();
}

export function sleep(ms: number) {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}
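
To round things out, a call site might look like the following sketch; getTopStories, Story, and db are hypothetical stand-ins, and the options mirror the four-minute/five-minute/20-second settings described above. The LOCAL_CACHE store exists so tests can run without a Redis instance.

// Hypothetical types for illustration; the real app queries our database layer.
interface Story { id: number; title: string; score: number; }
declare const db: { query(sql: string): Promise<Story[]> };

// Each call site wraps its expensive computation; the query runs at most
// once per refill window, no matter how many requests arrive.
async function getTopStories(): Promise<Story[]> {
  return cache(
    "top-stories",
    () => db.query("SELECT * FROM stories ORDER BY score DESC LIMIT 50"),
    { minAgeMs: 240000, maxAgeMs: 300000, graceMs: 20000 }
  );
}

// In tests, the in-memory store can stand in for Redis:
setCacheStore(LOCAL_CACHE);
LOCAL_CACHE.reset();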