12  Current and future directions

This chapter is for notes about possible in-progress and future changes to R: there is no commitment to release such changes, let alone to a timescale.

12.1 Long vectors

Vectors in R 2.x.y were limited to a length of 2^31 - 1 elements (about 2 billion), as the length is stored in the SEXPREC as a C int, and that type is used extensively to record lengths and element numbers, including in packages.

Note that longer vectors are effectively impossible on 32-bit platforms because of their address limit, so this section applies only to 64-bit platforms. The internals are unchanged on a 32-bit build of R.

A single object with 2^31 or more elements will take up at least 8GB of memory if integer or logical and 16GB if numeric or character, so routine use of such objects is still some way off.
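The arithmetic behind these figures can be checked with a short C sketch. The function name payload_bytes is hypothetical, and the element sizes assume a typical 64-bit build: 4 bytes for an integer or logical element, 8 bytes for a double or for a pointer to a CHARSXP. The vector header is ignored, so the results are lower bounds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes occupied by the data part of a vector of
   n_elements elements of elt_size bytes each (header not counted). */
uint64_t payload_bytes(uint64_t n_elements, uint64_t elt_size)
{
    /* Guard against wrap-around in the 64-bit product. */
    assert(elt_size == 0 || n_elements <= UINT64_MAX / elt_size);
    return n_elements * elt_size;
}
```

With n_elements equal to 2^31 this gives 2^33 bytes (8GB) for 4-byte elements and 2^34 bytes (16GB) for 8-byte elements.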

There is now some support for long vectors. This applies to raw, logical, integer, numeric and character vectors, and lists and expression vectors. (Elements of character vectors (CHARSXPs) remain limited to 2^31 - 1 bytes.) Some considerations:

  • This has been implemented by recording the length (and true length) as -1 and recording the actual length as a 64-bit field at the beginning of the header. Because a fair amount of code in R uses a signed type for the length, the ‘long length’ is recorded using the signed C99 type ptrdiff_t, which is typedef-ed to R_xlen_t.
  • These can in theory have 63-bit lengths, but note that current 64-bit OSes do not even theoretically offer 64-bit address spaces and there is currently a 52-bit limit (which exceeds the theoretical limit of current OSes and ensures that such lengths can be stored exactly in doubles).
  • The serialization format has been changed to accommodate longer lengths, but vectors of lengths up to 2^31-1 are stored in the same way as before. Longer vectors have their length field set to -1 and followed by two 32-bit fields giving the upper and lower 32-bits of the actual length. There is currently a sanity check which limits lengths to 2^48 on unserialization.
  • The type R_xlen_t is made available to packages in C header Rinternals.h: this should be fine in C code since C99 is required. People do try to use R internals in C++, but C++98 compilers are not required to support these types.
  • Indexing can be done via the use of doubles. The internal indexing code used to work with positive integer indices (and negative, logical and matrix indices were all converted to positive integers): it now works with either INTSXP or REALSXP indices.
  • The R function length returns a double value if the length exceeds 2^31-1. Code calling as.integer(length(x)) before passing to .C/.Fortran should check for an NA result.
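The length encoding described for the serialization format can be sketched in C as follows. This is an illustration only and not R's own serialization code: the names length_fields, encode_length and decode_length are hypothetical, and the 2^48 check is read here as an inclusive upper bound on the decoded length.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DECODED_LENGTH ((uint64_t) 1 << 48)  /* assumed sanity limit */

typedef struct {
    int32_t  head;   /* the length itself, or -1 for a long vector */
    uint32_t upper;  /* upper 32 bits, meaningful only when head == -1 */
    uint32_t lower;  /* lower 32 bits, meaningful only when head == -1 */
} length_fields;

/* Lengths up to 2^31-1 go in the single field; longer ones are split. */
length_fields encode_length(int64_t len)
{
    length_fields f = { 0, 0, 0 };
    assert(len >= 0);
    if (len <= INT32_MAX) {
        f.head = (int32_t) len;
    } else {
        f.head = -1;
        f.upper = (uint32_t) ((uint64_t) len >> 32);
        f.lower = (uint32_t) ((uint64_t) len & 0xFFFFFFFFu);
    }
    return f;
}

/* Returns the decoded length, or -1 if the fields fail the sanity check. */
int64_t decode_length(length_fields f)
{
    if (f.head >= 0) return f.head;
    if (f.head != -1) return -1;
    uint64_t len = ((uint64_t) f.upper << 32) | f.lower;
    if (len > MAX_DECODED_LENGTH) return -1;
    return (int64_t) len;
}
```

A length of 2^31 is encoded as the fields -1, 0 and 0x80000000, whereas a length of 100 occupies only the first field, exactly as before the format change.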

12.2 64-bit types

There is also some desire to be able to store larger integers in R, although the possibility of storing these as double is often overlooked (and e.g. file pointers as returned by seek are already stored as double).
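Storing large integers as double works because a double has a 53-bit significand, so every integer of magnitude up to 2^53 is represented exactly. The sketch below tests this by a round trip; it assumes IEEE 754 binary64 doubles with the default rounding mode, and the function name exact_in_double is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the 64-bit integer v survives conversion to double and back. */
bool exact_in_double(int64_t v)
{
    double d = (double) v;
    /* 2^63 lies outside the int64_t range, so converting it back would be
       undefined behaviour: reject it before the cast. */
    if (d >= 9223372036854775808.0 || d < -9223372036854775808.0)
        return false;
    return (int64_t) d == v;
}
```

Under these assumptions 2^53 itself is exact but 2^53 + 1 is not, which is why lengths and file positions well below that bound can safely be carried in doubles.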

Different routes have been proposed:

  • Add a new type to R and use that for lengths and indices—most likely this would be a 64-bit signed type, say longint. R’s usual implicit coercion rules would ensure that supplying an integer vector for indexing or length<- would work.
  • A more radical alternative is to change the existing integer type to be 64-bit on 64-bit platforms (which was the approach taken by S-PLUS for DEC/Compaq Alpha systems), or even on all platforms.
  • Allow either integer or double values for lengths and indices, and return double only when necessary.

The third has the advantages of minimal disruption to existing code and not increasing memory requirements. In the first and third scenarios, both R’s own code and user code would have to be adapted for lengths that were not of type integer, and in the third, code branches for long vectors would be tested rarely.

Most users of the .C and .Fortran interfaces use as.integer for lengths and element numbers, but a few omit it in the knowledge that the values are already of type integer. It may be reasonable to assume that such calls are never intended to be used with long vectors.

The remaining interfaces will need to cope with the changed VECTOR_SEXPREC types. It seems likely that in most cases lengths are accessed by the length and LENGTH functions (1). The current approach is to keep these returning 32-bit lengths and introduce ‘long’ versions xlength and XLENGTH which return R_xlen_t values.

(1) but LENGTH is a macro under some internal uses.
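Code that receives a ‘long’ length but must hand a C int to older code has to check the range before narrowing. The following is a minimal sketch of such a check in plain C: xlen_t is a stand-in typedef for R_xlen_t (it does not come from Rinternals.h), narrow_length is a hypothetical helper, and a 64-bit ptrdiff_t is assumed.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

typedef ptrdiff_t xlen_t;  /* stand-in for R_xlen_t, for illustration only */

/* Store len in *out and return true if it fits in a C int;
   otherwise leave *out untouched and return false. */
bool narrow_length(xlen_t len, int *out)
{
    if (len < 0 || len > INT_MAX)
        return false;
    *out = (int) len;
    return true;
}
```

A caller would signal an error when narrow_length returns false rather than pass a truncated length on, which is the C-level counterpart of checking as.integer(length(x)) for NA.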

See also https://homepage.cs.uiowa.edu/~luke/talks/useR10.pdf.

12.3 Large matrices

Matrices are stored as vectors and so were also limited to 2^31-1 elements. Now that longer vectors are allowed on 64-bit platforms, matrices with more elements are supported provided that each of the dimensions is no more than 2^31-1. However, not all applications can be supported.

The main problem is linear algebra done by Fortran code compiled with 32-bit INTEGER. Although not guaranteed, it seems that all the compilers currently used with R on a 64-bit platform allow matrices each of whose dimensions is less than 2^31 but with 2^31 or more elements and index them correctly, and a substantial part of the support software (such as BLAS and LAPACK) also works.

There are exceptions: for example some complex LAPACK auxiliary routines do use a single INTEGER index and hence overflow silently and segfault or give incorrect results. One example seen was svd() on a complex matrix.
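The difference between the two indexing styles can be illustrated in C. The sketch below is not taken from BLAS or LAPACK and its function names are hypothetical: it computes a column-major element offset in 64-bit arithmetic, and tests whether a matrix is small enough for code that keeps a single 32-bit index over all elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Zero-based offset of element (i, j) in a column-major matrix with nrow
   rows, widened to 64 bits before multiplying so it cannot overflow. */
int64_t element_offset(int32_t i, int32_t j, int32_t nrow)
{
    assert(i >= 0 && j >= 0 && i < nrow);
    return (int64_t) i + (int64_t) j * (int64_t) nrow;
}

/* True if the element count of an nrow x ncol matrix fits in a 32-bit
   signed integer, so that a single 32-bit index can reach every element. */
bool single_index_is_safe(int32_t nrow, int32_t ncol)
{
    assert(nrow >= 0 && ncol >= 0);
    return (int64_t) nrow * (int64_t) ncol <= (int64_t) INT32_MAX;
}
```

For a 50000 x 50000 matrix both dimensions are far below 2^31, yet the first element of the last column sits at offset 2499950000, beyond the range of a 32-bit signed index.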

Since this is implementation-dependent, it is possible that optimized BLAS and LAPACK may have further restrictions: a segfault has been reported from svd() using ATLAS on x86_64 Linux.

For matrix algebra on large matrices one almost certainly wants a machine with a lot of RAM (100s of gigabytes), many cores and a multi-threaded BLAS.
