Update: 2019-06-22

Based on my suggestions, out-of-memory caching has since been implemented in the “official” memoise package here. The memoise package now supports caching to local files and to AWS S3.

Original Post

Memoisation is a technique for caching the results of functions based on their inputs. For example, the following function calculates values of the Fibonacci sequence in R.

fib <- function(n) {
  if (n < 2) return(1)
  fib(n - 2) + fib(n - 1)
}

This is an inefficient way of calculating values of the Fibonacci sequence. However, it is a useful example for understanding memoisation. The following code uses Hadley Wickham's memoise package.

library(memoise)

# fib() generates the nth element of the Fibonacci sequence
fib <- function(n) {
  if (n < 2) return(1)
  fib(n - 2) + fib(n - 1)
}

# Memoize fib
mem_fib <- memoise(fib)

mem_fib(30) # Initial run caches the value

In the above example, the memoise() function generates a memoised version of fib() that automatically caches results. If the function is called again with the same arguments, it returns the cached result rather than recomputing it. Memoisation can significantly speed up analyses in which long-running functions are called repeatedly.
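For example, a repeat call with the same argument returns almost immediately, and memoise also provides forget() to clear a memoised function's cache. A quick illustration using the in-memory cache defined above:

system.time(mem_fib(31)) # Not yet cached: computes and stores the value
system.time(mem_fib(31)) # Cached: returns almost instantly

forget(mem_fib) # Clears the cache for the memoised function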

What if you are running similar analyses within a cluster environment? The ability to cache results in a centralized datastore could increase the speed of analysis across all machines. Alternatively, perhaps you work on different computers at work and at home. Forgetting to save/load intermediate files may require long-running functions to be run again. Further, managing and retaining intermediate files can be cumbersome and annoying. Again, caching the results of the memoised function in a central location (e.g., cloud-based storage) can speed up analytical pipelines across machines.

Recently I’ve put some work into developing new types of out-of-memory caches for the memoise package, available here. This forked version can be used to cache items locally or remotely in a variety of environments, including a local file cache, Google Datastore, and Amazon S3 (the latter two are shown in the Usage section below).

There are a few caveats to consider when using this version of memoise. The external cache options add latency every time a cached item is retrieved. That trade-off is worthwhile in cluster environments, where syncing files across instances/nodes can be difficult. When working across home and work machines, however, caching to a locally synced folder is usually the better choice.
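For the home/work case, a file-based cache can simply point at a folder that is synced between machines (e.g., via Dropbox). The sketch below assumes a cache_filesystem() constructor, the interface that was later merged into the official memoise package; the exact name in the fork may differ, and the cache directory is hypothetical:

# Define a file-based cache in a synced folder (hypothetical path)
fs <- cache_filesystem("~/Dropbox/.rcache")

# Memoize fib against the file cache
mem_fib <- memoise(fib, cache = fs)

mem_fib(30) # Initial run caches the value to the synced folder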

Installation

devtools::install_github("danielecook/memoise")

Usage

Google Datastore

library(memoise)

# fib() generates the nth element of the Fibonacci sequence
fib <- function(n) {
  if (n < 2) return(1)
  fib(n - 2) + fib(n - 1)
}

# Define a cache
ds <- cache_datastore(project = "your-project-name", cache_name = "rcache2")

# Memoize fib
mem_fib <- memoise(fib, cache = ds)

mem_fib(30) # Initial run caches the value

Amazon S3

library(memoise)

# fib() generates the nth element of the Fibonacci sequence
fib <- function(n) {
  if (n < 2) return(1)
  fib(n - 2) + fib(n - 1)
}

# Set up credentials
Sys.setenv("AWS_ACCESS_KEY_ID" = "<access key>",
           "AWS_SECRET_ACCESS_KEY" = "<access secret>")

# Define a cache
# Your bucket name must be unique among all S3 users - so use something like 'rcache-<initials>'
aws_s3 <- cache_s3("<unique bucket name>")

# Memoize fib
mem_fib <- memoise(fib, cache = aws_s3)

mem_fib(30) # Initial run caches the value
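
Because the cached results live in the S3 bucket rather than in local memory, they persist across R sessions and machines. As a rough sketch, a fresh session using the same bucket and credentials should be able to reuse them:

# In a new R session (or on a different machine):
library(memoise)

fib <- function(n) {
  if (n < 2) return(1)
  fib(n - 2) + fib(n - 1)
}

Sys.setenv("AWS_ACCESS_KEY_ID" = "<access key>",
           "AWS_SECRET_ACCESS_KEY" = "<access secret>")

aws_s3 <- cache_s3("<unique bucket name>") # Same bucket as above
mem_fib <- memoise(fib, cache = aws_s3)

mem_fib(30) # Returns the value cached by the earlier run instead of recomputing it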