# 12 Current and future directions

This chapter is for notes about possible in-progress and future changes to R: there is no commitment to release such changes, let alone to a timescale.

## 12.1 Long vectors

Vectors in R 2.x.y were limited to a length of 2^31 - 1 elements (about 2 billion), as the length is stored in the `SEXPREC` as a C `int`, and that type is used extensively to record lengths and element numbers, including in packages.

Note that longer vectors are effectively impossible under 32-bit platforms because of their address limit, so this section applies only on 64-bit platforms. The internals are unchanged on a 32-bit build of R.

A single object with 2^31 or more elements will take up at least 8GB of memory if integer or logical and 16GB if numeric or character, so routine use of such objects is still some way off.
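The memory figures above follow from simple arithmetic, which the sketch below reproduces. It assumes 4-byte elements for integer and logical vectors and 8-byte elements for numeric vectors and for the pointers that make up a character vector on a 64-bit platform; the function names are illustrative and not part of R.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for the data of a vector, ignoring the header.
   Assumes the product fits in 64 bits. */
uint64_t payload_bytes(uint64_t n_elements, size_t element_size)
{
    assert(element_size > 0);
    return n_elements * (uint64_t) element_size;
}

/* Convert a byte count to binary gigabytes (2^30 bytes). */
double bytes_to_gib(uint64_t bytes)
{
    return (double) bytes / (1024.0 * 1024.0 * 1024.0);
}
```

With 2^31 elements, `payload_bytes(2147483648ULL, 4)` gives 8589934592 bytes (8GB) and `payload_bytes(2147483648ULL, 8)` gives 17179869184 bytes (16GB).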

There is now some support for long vectors. This applies to raw, logical, integer, numeric and character vectors, and lists and expression vectors. (Elements of character vectors (`CHARSXP`s) remain limited to 2^31 - 1 bytes.) Some considerations:

- This has been implemented by recording the length (and true length) as `-1` and recording the actual length as a 64-bit field at the beginning of the header. Because a fair amount of code in R uses a signed type for the length, the ‘long length’ is recorded using the signed C99 type `ptrdiff_t`, which is typedef-ed to `R_xlen_t`.
- These can in theory have 63-bit lengths, but note that current 64-bit OSes do not even theoretically offer 64-bit address spaces and there is currently a 52-bit limit (which exceeds the theoretical limit of current OSes and ensures that such lengths can be stored exactly in doubles).
- The serialization format has been changed to accommodate longer lengths, but vectors of lengths up to 2^31-1 are stored in the same way as before. Longer vectors have their length field set to `-1`, followed by two 32-bit fields giving the upper and lower 32 bits of the actual length. There is currently a sanity check which limits lengths to 2^48 on unserialization.
- The type `R_xlen_t` is made available to packages in C header `Rinternals.h`: this should be fine in C code since C99 is required. People do try to use R internals in C++, but C++98 compilers are not required to support these types.
- Indexing can be done via the use of doubles. The internal indexing code used to work with positive integer indices (and negative, logical and matrix indices were all converted to positive integers): it now works with either `INTSXP` or `REALSXP` indices.
- The R function `length` returns a double value if the length exceeds 2^31-1. Code calling `as.integer(length(x))` before passing to `.C`/`.Fortran` should check for an `NA` result.
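The length encoding described for the serialization format can be sketched in plain C as follows. This is a toy model under stated assumptions, not R's serialization code: the `length_fields` struct and both function names are hypothetical, and treating exactly 2^48 as the largest accepted length is an assumption about where the sanity check draws the line.

```c
#include <assert.h>
#include <stdint.h>

#define SHORT_LENGTH_MAX  INT64_C(2147483647)   /* 2^31 - 1 */
#define UNSERIALIZE_LIMIT (INT64_C(1) << 48)    /* sanity limit (boundary assumed) */

/* Hypothetical record holding the length fields of one serialized vector. */
typedef struct {
    int32_t  length;  /* ordinary length, or -1 to flag a long vector */
    uint32_t upper;   /* upper 32 bits, meaningful only when length == -1 */
    uint32_t lower;   /* lower 32 bits, meaningful only when length == -1 */
} length_fields;

/* Short lengths are stored as before; longer ones as -1 plus two fields. */
length_fields encode_length(int64_t len)
{
    assert(len >= 0);
    length_fields f = { 0, 0, 0 };
    if (len <= SHORT_LENGTH_MAX) {
        f.length = (int32_t) len;
    } else {
        f.length = -1;
        f.upper = (uint32_t) ((uint64_t) len >> 32);
        f.lower = (uint32_t) ((uint64_t) len & UINT64_C(0xFFFFFFFF));
    }
    return f;
}

/* Returns 0 and stores the length in *out on success, nonzero otherwise. */
int decode_length(length_fields f, int64_t *out)
{
    if (f.length >= 0) {
        *out = f.length;
        return 0;
    }
    if (f.length != -1)
        return 1;                       /* not a valid length marker */
    uint64_t len = ((uint64_t) f.upper << 32) | f.lower;
    if (len > (uint64_t) UNSERIALIZE_LIMIT)
        return 1;                       /* fails the sanity check */
    *out = (int64_t) len;
    return 0;
}
```

In this model a length of 2^32 + 5 is written as the three fields `-1`, `1`, `5`, while a length of 100 is written as the single field `100`, so data with short vectors keeps its old layout.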

## 12.2 64-bit types

There is also some desire to be able to store larger integers in R, although the possibility of storing these as `double` is often overlooked (and e.g. file pointers as returned by `seek` are already stored as `double`).
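Storing large integers as `double` works because an IEEE 754 double represents every integer of magnitude up to 2^53 exactly. The following sketch checks this by round-tripping a 64-bit integer through a `double`; the function names are illustrative only, and the code assumes the platform's `double` is IEEE 754 binary64.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if v survives conversion to double and back unchanged. */
int fits_exactly_in_double(int64_t v)
{
    double d = (double) v;
    /* 2^63 is outside the range of int64_t, so converting it back
       would be undefined behaviour; such values were rounded anyway. */
    if (d >= 9223372036854775808.0)
        return 0;
    return (int64_t) d == v;
}

/* Store an integer as a double, insisting that no precision is lost. */
double store_as_double(int64_t v)
{
    assert(fits_exactly_in_double(v));
    return (double) v;
}
```

A file offset such as 4294967301 (2^32 + 5) is stored exactly, as is 2^53 itself, whereas 2^53 + 1 is rounded and so fails the check.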

Different routes have been proposed:

- Add a new type to R and use that for lengths and indices; most likely this would be a 64-bit signed type, say `longint`. R’s usual implicit coercion rules would ensure that supplying an `integer` vector for indexing or `length<-` would work.
- A more radical alternative is to change the existing `integer` type to be 64-bit on 64-bit platforms (which was the approach taken by S-PLUS for DEC/Compaq Alpha systems). Or even on all platforms.
- Allow either `integer` or `double` values for lengths and indices, and return `double` only when necessary.

The third has the advantages of minimal disruption to existing code and not increasing memory requirements. In the first and third scenarios both R’s own code and user code would have to be adapted for lengths that were not of type `integer`, and in the third, code branches for long vectors would be tested rarely.

Most users of the `.C` and `.Fortran` interfaces use `as.integer` for lengths and element numbers, but a few omit these in the knowledge that these were of type `integer`. It may be reasonable to assume that these are never intended to be used with long vectors.

The remaining interfaces will need to cope with the changed `VECTOR_SEXPREC` types. It seems likely that in most cases lengths are accessed by the `length` and `LENGTH` functions.^{1} The current approach is to keep these returning 32-bit lengths and introduce ‘long’ versions `xlength` and `XLENGTH` which return `R_xlen_t` values.

^{1} but `LENGTH` is a macro under some internal uses.

See also https://homepage.cs.uiowa.edu/~luke/talks/useR10.pdf.

## 12.3 Large matrices

Matrices are stored as vectors and so were also limited to 2^31-1 elements. Now that longer vectors are allowed on 64-bit platforms, matrices with more elements are supported provided that each of the dimensions is no more than 2^31-1. However, not all applications can be supported.

The main problem is linear algebra done by Fortran code compiled with 32-bit `INTEGER`. Although not guaranteed, it seems that all the compilers currently used with R on a 64-bit platform allow matrices each of whose dimensions is less than 2^31 but with 2^31 or more elements and index them correctly, and a substantial part of the support software (such as BLAS and LAPACK) also works.

There are exceptions: for example some complex LAPACK auxiliary routines do use a single `INTEGER` index and hence overflow silently and segfault or give incorrect results. One example seen was `svd()` on a complex matrix.

Since this is implementation-dependent, it is possible that optimized BLAS and LAPACK may have further restrictions: a segfault has been reported from `svd()` using ATLAS on `x86_64` Linux.
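To illustrate why a single 32-bit index is the weak point, the sketch below computes the column-major offset of a matrix element in 64-bit arithmetic and tests whether it would still fit in a 32-bit signed integer. It is a plain C illustration with zero-based indices and hypothetical function names, not code from R, BLAS or LAPACK.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-based column-major offset of element (i, j), computed in 64 bits
   so that the product j * nrow cannot overflow. */
int64_t linear_index(int32_t i, int32_t j, int32_t nrow)
{
    assert(nrow > 0 && i >= 0 && i < nrow && j >= 0);
    return (int64_t) i + (int64_t) j * (int64_t) nrow;
}

/* Nonzero if the offset could be held in a 32-bit signed integer. */
int index_fits_int32(int64_t idx)
{
    return idx >= 0 && idx <= INT32_MAX;
}
```

For a 50000 by 50000 matrix both dimensions are well below 2^31, yet the last element has offset 2499999999, which exceeds 2^31 - 1, so code that forms this offset in a single 32-bit variable cannot address it.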

For matrix algebra on large matrices one almost certainly wants a machine with a lot of RAM (100s of gigabytes), many cores and a multi-threaded BLAS.
