Launching tasks with mirai

We’ve updated this guide to use the mirai package instead of future, as we believe the following benefits are compelling in the context of Shiny apps:

Faster startup times and much less per-task overhead, meaning you can boost performance by making even shorter tasks async.
More linear scaling, meaning you get the same relative benefits whether running 2 or 200 cores.
Event-driven promises with mirai, versus promises with future, which poll on a timer every 100 ms. The lower latency and faster response times can improve the user experience.

The previous guide using future is available here. Using future continues to be supported within the Shiny ecosystem, and Henrik Bengtsson’s excellent work on the futureverse deserves credit for pushing the boundaries of parallelism in R farther than many thought possible.

The mirai package provides a lightweight way to launch R tasks that don’t block the current R session.

The promises package provides the API for working with the results of async tasks, but it totally abdicates responsibility for actually launching/creating async tasks. The idea is that any number of different packages could be capable of launching async tasks, using whatever techniques they want, but all of them would either return promise objects or objects that can be converted to promise objects, as is the case for mirai.

This document will give an introduction to the parts of mirai that are most relevant to promises. For more information, please consult the documentation and vignettes that come with mirai.

How mirai works

The main API that mirai provides couldn’t be simpler. You call mirai() and pass it the code that you want executed asynchronously:

m <- mirai({
  # expensive operations go here...
  df <- download_lots_of_data()
  fit_model(df)
})

The object that’s returned is a mirai, which for all intents and purposes is a promise object¹, which will eventually resolve to the return value of the code block (i.e. the last expression) or an error if the code does not complete executing successfully. The important thing is that no matter how long the expensive operation takes, these lines will execute almost instantly, while the operation continues in the background.
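As a minimal sketch of what "eventually resolve" means, the snippet below launches a short task, checks whether it is still running, and then waits for the value. It also shows a task that fails. The sleep and the arithmetic are placeholders for real work, and blocking with m[] is only meant for interactive exploration; in a Shiny app you would normally hand the mirai to promises rather than wait on it.

```r
library(mirai)

m <- mirai({
  Sys.sleep(0.5)  # stand-in for an expensive operation
  1 + 1
})

unresolved(m)  # typically TRUE right after launch, as the task is still running

result <- m[]  # waits for the mirai to resolve, then returns its value
result

# A task that fails resolves to an error value instead of stopping the session
bad <- mirai(stop("something went wrong"))
failed <- is_error_value(bad[])
failed
```

Here result holds the value of the last expression in the code block, and failed is TRUE because the second task raised an error.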

But we know that R is single-threaded, so how does mirai accomplish this? The answer: by utilizing another R process. mirai delegates the execution of the expensive operation to a totally different R process, so that the original R process can move on.

Choosing a launch method

In mirai, the daemons() function is used to set and launch background R processes (daemons).

These background processes will be used/recycled for the life of the originating R process. If a mirai is launched while all the background R processes are busy executing, then the new mirai is queued until one of the background processes frees up.

To launch n processes locally, you just need to call daemons(n), supplying the value of n.

You need to determine n yourself, and typically this should be at most one less than the number of processor cores on your machine, to leave one for the main R process. The reason we don’t automatically detect this for you is that you may also be running other tasks on your machine, and you should take this into account when supplying a value for n.
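One possible way to pick n is sketched below. The cap of 4 and the fallback of 2 cores are arbitrary assumptions for illustration, not recommendations from mirai; adjust them to the other workloads on your machine.

```r
library(mirai)

# detectCores() can return NA on some platforms, so fall back to an assumed value
cores <- parallel::detectCores()
if (is.na(cores)) cores <- 2L

# Leave one core for the main R process; cap at 4 (arbitrary, for illustration)
n <- max(1L, min(4L, cores - 1L))
daemons(n)

# Tasks now run on one of the persistent daemons rather than a fresh process
m <- mirai(Sys.getpid())
worker_pid <- m[]

# Reset when finished, which shuts the daemons down
daemons(0)
```

In a Shiny app, the daemons(n) call would typically go at the top level of the app script so that it runs once per R process, not once per user session.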

daemons() has further arguments url and remote for setting and launching remote daemons over the network for distributed computing. To learn more, see the mirai::daemons() reference docs as well as the daemons sections of the mirai reference vignette.

If you don’t set daemons() in a session, then each mirai() call will launch a new local R process solely for the purpose of performing that evaluation. Whilst this may be desirable in certain circumstances, it rarely is for Shiny, because nothing then limits the total number of processes spawned at any one time. If a Shiny app has many simultaneous users, this could lead to an excessive number of processes being created, overwhelming the system.

Caveats and limitations

The abstractions that mirai presents are simple and consistent, although it may take some time to get used to them. Please read this entire section carefully before proceeding.

Globals: Providing input to mirai code chunks

Most mirai code chunks will need to reference data from the original process, e.g. data to be fitted, URLs to be requested, file paths to read from.

As evaluation happens in another process, these won’t be available to the code chunk by default. These objects will need to be passed to the ... argument of your mirai() call. These are then serialized and sent to the other process along with the code to be executed.

These objects include any functions which are defined in your session and not in a package.

For example:

download_data <- function(url) {
  file <- tempfile()
  download.file(url, file, "libcurl")
  file
}

url <- "http://example.com/data.csv"

m <- mirai(
  {
    file <- download_data(url)
    read.csv(file)
  },
  download_data = download_data,
  url = url
)

If there are many variables to pass through, mirai does offer a convenience feature to pass an environment instead of individual ... pairs. The above call would then look like this instead:

m <- mirai(
  {
    file <- download_data(url)
    read.csv(file)
  },
  environment()
)

This passes the calling environment, which includes both the download_data function as well as url.

Care should be taken when using this feature as it will also pass anything else that happens to be in the same environment. It is safer to use when mirai() is called inside another function, as environment() will then consist only of variables passed as arguments to that function or created locally within it.
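To illustrate the safer pattern, here is a small sketch in which mirai() is called from inside a function. The helper scale_async() and its arguments are hypothetical names made up for this example.

```r
library(mirai)

# Hypothetical helper: multiply a vector by a factor in a background process
scale_async <- function(x, factor) {
  # Inside this function, environment() contains only `x` and `factor`,
  # so nothing else from the session is serialized and sent across
  mirai(x * factor, environment())
}

m <- scale_async(1:3, 10)
scaled <- m[]
scaled
```

Because the function's environment is small and well defined, it is easy to see exactly what gets copied to the other process.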

Package loading

Besides variables, package functions need to be declared with the full namespace so that they can be found in the other process. For example, use dplyr::mutate() instead of just mutate(), even if the dplyr package is loaded in your main session, as the other process will not have any packages loaded by default.

Alternatively, make a call to load the package inside your mirai code chunk, for example by adding library(dplyr). Sometimes this may be the most convenient option, especially for infix operators. For example, the magrittr pipe %>% requires a call to library(magrittr) to load the package beforehand.
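Both approaches are sketched below. To keep the example runnable without extra installs, it uses the tools package that ships with R in place of dplyr; the file name is a made-up value.

```r
library(mirai)

path <- "data.csv"  # made-up file name, used only as a string

# Option 1: fully qualified call, no library() needed in the other process
m1 <- mirai(tools::file_ext(path), path = path)

# Option 2: attach the package inside the code chunk, then call unqualified
m2 <- mirai(
  {
    library(tools)
    file_ext(path)
  },
  path = path
)

ext1 <- m1[]
ext2 <- m2[]
```

Both calls return the same result; which one reads better usually depends on how many functions from the package the chunk uses.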

Custom Data Types

Certain objects are implemented at a low level, not using one of R’s native vector types, and represented in R by an external pointer. An example of this is an Arrow table. It is not possible to serialize these to and from R’s native rds format. Instead they provide their own serialization and deserialization methods.

mirai offers a seamless solution for working with these data types, integrating those custom serialization and deserialization methods with R’s native serialization so that you don’t need to manually handle each instance of these objects when moving them across processes.

This does require a one-off configuration step when you set up daemons, and you may read more about this in the mirai serialization vignette.

Native resources

Mirai code blocks cannot use resources such as database connections and network sockets that were created in the parent process. Even if it seems to work with a simple test, you are asking for crashes or worse by sharing these kinds of resources across processes.

Instead, make sure you create, use, and destroy such resources entirely within the scope of the mirai code block.
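The same pattern applies to any kind of connection. The sketch below uses a plain file connection from base R as a stand-in for a database connection or socket: only the path (ordinary data) is passed to the mirai, and the connection itself is opened, used, and closed inside the code block.

```r
library(mirai)

# Set up a small file in the parent process; only its path is shared
path <- tempfile(fileext = ".txt")
writeLines(c("a", "b", "c"), path)

m <- mirai(
  {
    con <- file(path, open = "r")  # resource created in the other process
    lines <- readLines(con)        # ... used there
    close(con)                     # ... and destroyed there
    lines
  },
  path = path
)

contents <- m[]
contents
```

For a database, the equivalent would be to pass connection details such as a host name or file path, and to connect and disconnect within the code block.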

Mutation

Reference class objects (including R6 objects, S7 objects, and data.table objects) and environments are among the few “native” R object types that are mutable, that is, can be modified in-place. Unless they contain native resources (see previous section), there’s nothing wrong with using mutable objects from within mirai code blocks, even objects created in the parent process. However, note that any changes you make to these objects will not be visible from the parent process; the mirai code is operating on a copy of the object, not the original.
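A short sketch of this copy behavior, using an environment as the mutable object. The counter name is made up for the example.

```r
library(mirai)

# A mutable object created in the parent process
counter <- new.env(parent = emptyenv())
counter$n <- 0

m <- mirai(
  {
    counter$n <- counter$n + 1  # modifies the copy in the other process
    counter$n
  },
  counter = counter
)

child_value <- m[]         # 1, the value seen in the other process
parent_value <- counter$n  # 0, the parent's object is untouched
```

If the parent needs the updated state, return it from the code block and apply it in the parent once the mirai resolves.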

Returning values

Mirai code blocks return a value—they’d be a lot less useful if they couldn’t! Like everywhere else in R, the return value is determined by the last expression in the code block, unless return() is explicitly called earlier.

The return value will always be copied back into the parent process. This matters for two reasons.

First, if the return value is very large, the copying process can take some time — and because the data must essentially be serialized to and deserialized from rds format, it can take a surprising amount of time. In the case of mirai blocks that execute fairly quickly but return huge amounts of data, you may be better off not using async techniques at all.
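Where the parent only needs a reduced form of the data, one way to limit the copying cost is to do the reduction inside the code block. This is a sketch of that idea, not a technique prescribed by mirai; the matrix of random numbers stands in for a large intermediate result.

```r
library(mirai)

m <- mirai({
  big <- matrix(runif(1e6), ncol = 10)  # large intermediate, stays in the other process
  colMeans(big)                         # only 10 numbers are copied back
})

means <- m[]
length(means)
```

Only the column means cross the process boundary here, so the cost of serializing the full matrix is avoided.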

Second, objects that refer to native resources are unlikely to work in this direction either; just as you can’t use the parent’s database connections in the child process, you also cannot have the child process return a database connection for the parent to use.

Next: Using promises with Shiny