10  Parser

The parser converts the textual representation of R code into an internal form, which may then be passed to the R evaluator to carry out the specified instructions. The internal form is itself an R object and can be saved and otherwise manipulated within the R system.

10.1 The parsing process

10.1.1 Modes of parsing

Parsing in R occurs in three different variants:

  • The read-eval-print loop
  • Parsing of text files
  • Parsing of character strings

The read-eval-print loop forms the basic command line interface to R. Textual input is read until a complete R expression is available. Expressions may be split over several input lines. The primary prompt (by default >) indicates that the parser is ready for a new expression, and a continuation prompt (by default +) indicates that the parser expects the remainder of an incomplete expression. The expression is converted to internal form during input; the parsed expression is then passed to the evaluator and the result is printed (unless specifically made invisible). If the parser finds itself in a state which is incompatible with the language syntax, a “Syntax Error” is flagged and the parser resets itself and resumes input at the beginning of the next input line.

Text files can be parsed using the parse function. In particular, this is done during execution of the source function, which allows commands to be stored in an external file and executed as if they had been typed at the keyboard. Note, though, that the entire file is parsed and syntax checked before any evaluation takes place.

Character strings, or vectors thereof, can be parsed using the text= argument to parse. The strings are treated exactly as if they were the lines of an input file.
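
As a minimal sketch of the string mode, the following parses a character vector with parse(text=) and only afterwards evaluates the resulting expressions. The variable names and the scratch environment env are illustrative and not part of the text above.

```r
# Parse a character vector as if each element were a line of a file.
exprs <- parse(text = c("x <- 1", "y <- x + 1"))

class(exprs)    # "expression": a vector of parsed, unevaluated expressions
length(exprs)   # 2, one per complete top-level expression

# Nothing has been evaluated yet; do so in a scratch environment.
env <- new.env()
for (i in seq_along(exprs)) eval(exprs[[i]], envir = env)
y_value <- get("y", envir = env)   # 2

# A syntax error anywhere in the input is reported before any evaluation.
res <- tryCatch(parse(text = c("z <- 1", "z <- (2")),
                error = function(e) "syntax error")
```

Because parsing and evaluation are separate steps, the first line of the faulty input is never executed.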

10.1.2 Internal representation

Parsed expressions are stored in an R object containing the parse tree. A fuller description of such objects can be found in Language objects and Expression objects. Briefly, every elementary R expression is stored in function call form, as a list with the first element containing the function name and the remainder containing the arguments, which may in turn be further R expressions. The list elements can be named, corresponding to tagged matching of formal and actual arguments. Note that all R syntax elements are treated in this way, e.g. the assignment x <- 1 is encoded as "<-"(x, 1).
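
The call form can be inspected directly from R. The sketch below uses quote to obtain a parsed expression without evaluating it; the names e and f_call are illustrative only.

```r
e <- quote(x <- 1)                 # capture the expression, do not evaluate it

as.list(e)                         # three elements: `<-`, x and 1
identical(e, quote(`<-`(x, 1)))    # TRUE: both spellings parse to the same call

# Tagged arguments become names on the list elements.
f_call <- quote(f(a = 1, 2))
names(f_call)                      # "" "a" ""

# Being an ordinary R object, the call can be modified before evaluation.
e[[3]] <- 42
eval(e)                            # assigns 42 to x in the current environment
```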

10.1.3 Deparsing

Any R object can be converted to an R expression using deparse. This is frequently used in connection with output of results, e.g. for labeling plots. Notice that only objects of mode "expression" can be expected to be unchanged by reparsing the output of deparsing. For instance, the numeric vector 1:5 will deparse as "c(1, 2, 3, 4, 5)", which will reparse as a call to the function c. As far as possible, evaluating the deparsed and reparsed expression gives the same result as evaluating the original, but there are a couple of awkward exceptions, mostly involving expressions that weren’t generated from a textual representation in the first place.
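
A small sketch of both uses mentioned above: producing a label from an unevaluated argument, and a deparse/reparse round trip. The helper label_of is hypothetical, and the exact text produced by deparse for a given object may differ between R versions.

```r
# Hypothetical helper: use the text of the unevaluated argument as a label.
label_of <- function(x) deparse(substitute(x))
label_of(a + b)                    # "a + b"

# Round trip for a language object.
e <- quote(mean(x, na.rm = TRUE))
txt <- deparse(e)                  # "mean(x, na.rm = TRUE)"
back <- parse(text = txt, keep.source = FALSE)[[1]]
identical(back, e)                 # TRUE

# For a plain vector the reparsed text is a call, not the vector itself,
# but evaluating that call gives the original value back.
v <- c(1.5, 2, 3)
reparsed <- parse(text = deparse(v), keep.source = FALSE)[[1]]
is.call(reparsed)                  # TRUE
identical(eval(reparsed), v)       # TRUE
```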

10.2 Comments

Comments in R are ignored by the parser. Any text from a # character to the end of the line is taken to be a comment, unless the # character is inside a quoted string. For example,

> x <- 1  # This is a comment...
> y <- "  #... but this is not."

10.3 Tokens

Tokens are the elementary building blocks of a programming language. They are recognised during lexical analysis which (conceptually, at least) takes place prior to the syntactic analysis performed by the parser itself.

10.3.1 Constants

There are five types of constants: integer, logical, numeric, complex and string.

In addition, there is the special constant NULL. Also worth mentioning are the numeric constants Inf and NaN, the logical NA, and the typed missing values NA_character_, NA_integer_, NA_real_ and NA_complex_; for the latter, see NA handling.

NULL is used to indicate the empty object. NA is used for absent (“Not Available”) data values. Inf denotes infinity and NaN is not-a-number in the IEEE floating point calculus (for instance, the results of the operations 1/0 and 0/0, respectively).

Logical constants are either TRUE, FALSE or NA.

Numeric constants follow a similar syntax to that of the C language. They consist of an integer part of zero or more digits, optionally followed by . and a fractional part of zero or more digits, optionally followed by an exponent part consisting of an E or an e, an optional sign and a string of one or more digits. Either the integer part or the fractional part can be empty, but not both at once.

Valid numeric constants: 1 10 0.1 .2 1e-7 1.2e+7

Numeric constants can also be hexadecimal, starting with 0x or 0X followed by zero or more digits, a-f or A-F. Hexadecimal floating point constants are supported using C99 syntax, e.g. 0x1.1p1.

There is now a separate class of integer constants. They are created by using the qualifier L at the end of the number. For example, 123L gives an integer value rather than a numeric value. The suffix L can be used to qualify any non-complex number with the intent of creating an integer. So it can be used with numbers given by hexadecimal or scientific notation. However, if the value is not a valid integer, a warning is emitted and the numeric value created. The following shows examples of valid integer constants, values which will generate a warning and give numeric constants and syntax errors.

Valid integer constants:  1L, 0x10L, 1000000L, 1e6L
Valid numeric constants:  1.1L, 1e-3L, 0x1.1p-2
Syntax error:  12iL 0x1.1

A warning is emitted for decimal values that contain an unnecessary decimal point, e.g. 1.L. It is an error to have a decimal point in a hexadecimal constant without the binary exponent.

Note also that a preceding sign (+ or -) is treated as a unary operator, not as part of the constant.

Up-to-date information on the currently accepted formats can be found by ?NumericConstants.
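
The rules above can be checked interactively. The following is a sketch rather than an exhaustive test; the variable v is illustrative, and the warning for 1.1L is suppressed only to keep the output quiet.

```r
typeof(1)        # "double": a plain numeric constant
typeof(1L)       # "integer"
typeof(0x10L)    # "integer"; the value is 16
typeof(1e6L)     # "integer"; the value is 1000000
0x1.1p1          # 2.125, i.e. (1 + 1/16) * 2

# 1.1 is not a whole number, so the parser warns and yields a double.
v <- suppressWarnings(eval(parse(text = "1.1L")))
typeof(v)        # "double"

# A leading sign is a unary operator applied to the constant.
is.call(quote(-1))    # TRUE: a call to `-` with one argument
is.call(quote(1))     # FALSE: just the constant
```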

Complex constants have the form of a decimal numeric constant followed by i. Notice that only purely imaginary numbers are actual constants; other complex numbers are parsed as unary or binary operations on numeric and imaginary numbers.

Valid complex constants: 2i 4.1i 1e-2i
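
As a brief illustration (the names z and e are arbitrary), a purely imaginary literal is a constant, whereas a general complex number is parsed as an operation:

```r
z <- 2i
is.complex(z)                      # TRUE: 2i is a complex constant

e <- quote(1 + 2i)
is.call(e)                         # TRUE: a call to `+`, not a constant
identical(e[[1]], as.name("+"))    # TRUE
eval(e)                            # 1+2i, computed by the evaluator
```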

String constants are delimited by a pair of single (') or double (") quotes and can contain all other printable characters. Quotes and other special characters within strings are specified using escape sequences:

\'                        single quote
\"                        double quote
\n                        newline (aka 'line feed', LF)
\r                        carriage return (CR)
\t                        tab character
\b                        backspace
\a                        bell
\f                        form feed
\v                        vertical tab
\\                        backslash itself
\nnn                      character with given octal code; sequences of one, two or three digits in the range 0 ... 7 are accepted
\xnn                      character with given hex code; sequences of one or two hex digits (with entries 0 ... 9 A ... F a ... f)
\unnnn \u{nnnn}           Unicode character with given hex code; sequences of up to four hex digits (where multibyte locales are supported, otherwise an error). The character needs to be valid in the current locale.
\Unnnnnnnn \U{nnnnnnnn}   Unicode character with given hex code; sequences of up to eight hex digits (where multibyte locales are supported, otherwise an error)

A single quote may also be embedded directly in a double-quote delimited string and vice versa.

A NUL (\0) is not allowed in a character string, so using \0 in a string constant terminates the constant (usually with a warning): further characters up to the closing quote are scanned but ignored.
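
A few of these escapes in action, as a minimal sketch restricted to ASCII so that it does not depend on the locale:

```r
nchar("a\tb")                                  # 3: a, tab, b
nchar("\\")                                    # 1: a single backslash
identical("\101", "A")                         # TRUE: octal 101 is 65
identical("\x41", "A")                         # TRUE: hex 41 is 65

# Quotes: escape them, or use the other kind of delimiter.
identical('He said "hi"', "He said \"hi\"")    # TRUE
identical("it's", 'it\'s')                     # TRUE

cat("col1\tcol2\n")                            # prints the two words separated by a tab
```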

10.3.2 Identifiers

Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit or an underscore, or with a period followed by a digit.

The definition of a letter depends on the current locale: the precise set of characters allowed is given by the C expression (isalnum(c) || c == '.' || c == '_') and will include accented letters in many Western European locales.

Notice that identifiers starting with a period are not by default listed by the ls function and that ... and ..1, ..2, etc. are special.

Notice also that objects can have names that are not identifiers. These are generally accessed via get and assign, although they can also be represented by text strings in some limited circumstances when there is no ambiguity (e.g. "x" <- 1). As get and assign are not restricted to names that are identifiers they do not recognise subscripting operators or replacement functions. The following pairs are not equivalent

x$a<-1 assign("x$a",1)
x[[1]] get("x[[1]]")
names(x)<-nm assign("names(x)",nm)
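
The first of these pairs can be tried out as follows. This is a sketch using a scratch environment env (an illustrative name); the last step relies on backquotes, which let the parser read an arbitrary name as a symbol.

```r
env <- new.env()
assign("x", list(a = 0), envir = env)

# This does not modify component a of x; it creates a second variable
# whose name is literally x$a.
assign("x$a", 1, envir = env)

ls(env)                          # "x"   "x$a"
get("x", envir = env)$a          # still 0
get("x$a", envir = env)          # 1

# With backquotes the odd name can also be used in ordinary code.
eval(quote(`x$a` + 1), env)      # 2
```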

10.3.3 Reserved words

The following identifiers have a special meaning and cannot be used for object names

if else repeat while function for in next break
TRUE FALSE NULL Inf NaN
NA NA_integer_ NA_real_ NA_complex_ NA_character_
... ..1 ..2 etc.

10.3.4 Special operators

R allows user-defined infix operators. These have the form of a string of characters delimited by the % character. The string can contain any printable character except %. The escape sequences for strings do not apply here.

Note that the following operators are predefined:

%% %*% %/% %in% %o% %x% %||%
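
Defining such an operator amounts to assigning a function of two arguments to a name of this form. In the sketch below the operator %+% is a made-up example, not something R predefines.

```r
# Hypothetical user-defined infix operator that pastes two strings.
`%+%` <- function(a, b) paste(a, b)
"hello" %+% "world"                    # "hello world"

# The parsed form is an ordinary call to the function named %+%.
as.character(quote(a %+% b)[[1]])      # "%+%"

# Some of the predefined operators of the same shape.
5 %/% 2                                # 2
5 %% 2                                 # 1
3 %in% 1:5                             # TRUE
```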

10.3.5 Separators

Although not strictly tokens, stretches of whitespace characters (spaces, tabs and form feeds; on Windows and in UTF-8 locales also other Unicode whitespace characters, such as U+A0, non-breaking space, and U+3000, ideographic space) serve to delimit tokens in case of ambiguity (compare x<-5 and x < -5).

Newlines have a function which is a combination of token separator and expression terminator. If an expression can terminate at the end of the line the parser will assume it does so, otherwise the newline is treated as whitespace. Semicolons (;) may be used to separate elementary expressions on the same line.

Special rules apply to the else keyword: inside a compound expression, a newline before else is discarded, whereas at the outermost level, the newline terminates the if construction and a subsequent else causes a syntax error. This somewhat anomalous behaviour occurs because R should be usable in interactive mode and then it must decide whether the input expression is complete, incomplete, or invalid as soon as the user presses RET.
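
The separator rules can be observed by parsing small pieces of text. The following is a sketch; head_of is a hypothetical helper that returns the name of the outermost function in the first parsed expression.

```r
# Hypothetical helper: name of the outermost call in the first expression.
head_of <- function(txt) {
  as.character(parse(text = txt, keep.source = FALSE)[[1]][[1]])
}
head_of("x<-5")      # "<-": an assignment
head_of("x < -5")    # "<":  a comparison with -5

# A newline ends the expression only if it is complete at that point.
length(parse(text = c("1 +", "2")))       # 1 expression: 1 + 2
length(parse(text = c("1", "+ 2")))       # 2 expressions: 1 and +2
length(parse(text = "a <- 1; b <- 2"))    # 2 expressions on one line

# else on a new line: accepted inside braces, rejected at top level.
inside <- parse(text = c("{", "if (FALSE) 1", "else 2", "}"))
eval(inside[[1]])                         # 2
outside <- tryCatch(parse(text = c("if (FALSE) 1", "else 2")),
                    error = function(e) conditionMessage(e))
```

Here outside holds the text of the syntax error rather than a parsed expression.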

The comma (,) is used to separate function arguments and multiple indices.

10.3.6 Operator tokens

R uses the following operator tokens

+ - * / %% %/% ^     arithmetic
> >= < <= == !=      relational
! & |                logical
~                    model formulae
-> <-                assignment
$                    list indexing
:                    sequence

(Several of the operators have a different meaning inside model formulae.)

10.3.7 Grouping

Ordinary parentheses—( and )—are used for explicit grouping within expressions and to delimit the argument lists for function definitions and function calls.

Braces—{ and }—delimit blocks of expressions in function definitions, conditional expressions, and iterative constructs.

10.3.8 Indexing tokens

Indexing of arrays and vectors is performed using the single and double brackets, [] and [[]]. Also, indexing tagged lists may be done using the $ operator.

10.4 Expressions

An R program consists of a sequence of R expressions. An expression can be a simple expression consisting of only a constant or an identifier, or it can be a compound expression constructed from other parts (which may themselves be expressions).

The following sections detail the various syntactical constructs that are available.

10.4.1 Function calls

A function call takes the form of a function reference followed by a comma-separated list of arguments within a set of parentheses.

function_reference ( arg1, arg2, ...... , argn )

The function reference can be either

  • an identifier (the name of the function)
  • a text string (ditto, but handy if the function has a name which is not a valid identifier)
  • an expression (which should evaluate to a function object)

Each argument can be tagged (tag=expr), or just be a simple expression. It can also be empty or it can be one of the special tokens ..., ..2, etc.

A tag can be an identifier or a text string.

Examples:

f(x)
g(tag = value, , 5)
"odd name"("strange tag" = 5, y)
(function(x) x^2)(5)

10.4.2 Infix and prefix operators

The order of precedence (highest first) of the operators is

::
$ @
^
- +                (unary)
:
%xyz% |>
* /
+ -                (binary)
> >= < <= == !=
!
& &&
| ||
~                  (unary and binary)
-> ->>
<- <<-
=                  (as assignment)

Note that : precedes binary +/-, but not ^. Hence, 1:3-1 is 0 1 2, but 1:2^3 is 1:8.

The exponentiation operator ^ and the left assignment operators <-, = and <<- group right to left; all other operators group left to right. That is, 2 ^ 2 ^ 3 is 2 ^ 8, not 4 ^ 3, whereas 1 - 1 - 1 is -1, not 1.

Notice that the operators %% and %/% for integer remainder and divide have higher precedence than multiply and divide.

Although it is not strictly an operator, it also needs mentioning that the = sign is used for tagging arguments in function calls and for assigning default values in function definitions.

The $ sign is in some sense an operator, but does not allow arbitrary right hand sides and is discussed under Index constructions. It has higher precedence than any of the other operators.

The parsed form of a unary or binary operation is completely equivalent to a function call with the operator as the function name and the operands as the function arguments.

Parentheses are recorded as equivalent to a unary operator, with name "(", even in cases where the parentheses could be inferred from operator precedence (e.g., a * (b + c)).

Notice that the assignment symbols are operators just like the arithmetic, relational, and logical ones. As far as the parser is concerned, any expression is allowed on the target side of an assignment: 2 + 2 <- 5 is syntactically valid, although the evaluator will object. Similar comments apply to the model formula operator.
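
Several of the statements in this section can be verified directly. The sketch below uses illustrative names (grp, p, res) and checks precedence, grouping, the recording of parentheses, and the difference between a parse-time and an evaluation-time error.

```r
# Operators are calls with the operator as the function name.
identical(quote(a + b), quote(`+`(a, b)))   # TRUE
identical(quote(-a), quote(`-`(a)))         # TRUE

# Precedence and grouping.
1:3-1        # 0 1 2    : binds tighter than binary -
1:2^3        # 1 ... 8  ^ binds tighter than :
2^2^3        # 256      ^ groups right to left
1 - 1 - 1    # -1       - groups left to right
-2^2         # -4       ^ binds tighter than unary -

# Parentheses are kept in the parse tree as a call to "(".
grp <- quote(a * (b + c))
as.character(grp[[3]][[1]])                 # "("

# The parser accepts this assignment; only the evaluator rejects it.
p <- parse(text = "2 + 2 <- 5")
res <- tryCatch(eval(p[[1]]), error = function(e) "evaluation error")
```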

10.4.3 Index constructions

R has three indexing constructs, two of which are syntactically similar although with somewhat different semantics:

object [ arg1, ...... , argn ]
object [[ arg1, ...... , argn ]]

The object can formally be any valid expression, but it is understood to denote or evaluate to a subsettable object. The arguments generally evaluate to numerical or character indices, but other kinds of arguments are possible (notably drop = FALSE).

Internally, these index constructs are stored as function calls with function names "[" and "[[", respectively.

The third index construction is

object $ tag

Here, object is as above, whereas tag is an identifier or a text string. Internally, it is stored as a function call with name "$".
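
Since the three constructs are stored as calls, the corresponding functions can also be called by name. A short sketch, with x and m as illustrative objects:

```r
# The outermost function of each construct.
as.character(quote(x[1, 2])[[1]])    # "["
as.character(quote(x[[1]])[[1]])     # "[["
as.character(quote(x$a)[[1]])        # "$"

# Calling the indexing functions explicitly.
x <- list(a = 10, b = 20)
`[[`(x, 2)                           # 20
`$`(x, a)                            # 10
`[`(x, "b")                          # a list with the single component b

# Arguments need not be indices: drop = FALSE keeps the matrix shape.
m <- matrix(1:6, nrow = 2)
dim(m[1, , drop = FALSE])            # 1 3
```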

10.4.4 Compound expressions

A compound expression is of the form

{ expr1 ; expr2 ; ...... ; exprn }

The semicolons may be replaced by newlines. Internally, this is stored as a function call with "{" as the function name and the expressions as arguments.

10.4.5 Flow control elements

R contains the following control structures as special syntactic constructs

if ( cond ) expr
if ( cond ) expr1 else expr2
while ( cond ) expr
repeat expr
for ( var in list ) expr

The expressions in these constructs will typically be compound expressions.

Within the loop constructs (while, repeat, for), one may use break (to terminate the loop) and next (to skip to the next iteration).
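
A minimal sketch of break and next inside loops, and of if used for its value. The variable names are illustrative; the last two lines peek at how the parser records these constructs.

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next     # skip even numbers
  if (i > 7) break          # stop at the first odd number above 7
  total <- total + i        # accumulates 1 + 3 + 5 + 7
}
total                       # 16

n <- 0
repeat {
  n <- n + 1
  if (n >= 3) break         # repeat has no condition of its own
}
n                           # 3

# if is an expression, so its value can be assigned.
size <- if (total > 10) "large" else "small"
size                        # "large"

# The parsed constructs are calls named after the keyword.
as.character(quote(while (cond) expr)[[1]])        # "while"
length(quote(if (cond) expr1 else expr2))          # 4: "if" plus three arguments
```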

Internally, the constructs are stored as function calls:

"if"(cond, expr)
"if"(cond, expr1, expr2)
"while"(cond, expr)
"repeat"(expr)
"for"(var, list, expr)
"break"()
"next"()

10.4.6 Function definitions

A function definition is of the form

function ( arglist ) body

The function body is an expression, often a compound expression. The arglist is a comma-separated list of items each of which can be an identifier, or of the form identifier = default, or the special token .... The default can be any valid expression.

Notice that function arguments, unlike list tags etc., cannot have “strange names” given as text strings.

Internally, a function definition is stored as a function call with function name function and two arguments, the arglist and the body. The arglist is stored as a tagged pairlist where the tags are the argument names and the values are the default expressions.
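
The stored form of a definition can be examined before it is evaluated into a function object. In this sketch def, arglist and f are illustrative names.

```r
def <- quote(function(x, y = 2) x + y)   # the unevaluated definition

as.character(def[[1]])     # "function"
arglist <- def[[2]]
is.pairlist(arglist)       # TRUE
names(arglist)             # "x" "y": the tags are the argument names
arglist[["y"]]             # 2: the default expression for y
deparse(def[[3]])          # "x + y": the body

# Evaluating the call creates the actual function.
f <- eval(def)
f(1)                       # 3
f(1, 10)                   # 11
```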

10.5 Directives

The parser currently only supports one directive, #line. This is similar to the C-preprocessor directive of the same name. The syntax is

#line nn [ "filename" ]

where nn is an integer line number, and the optional filename (in required double quotes) names the source file.

Unlike the C directive, #line must appear as the first five characters on a line. As in C, nn and "filename" entries may be separated from it by whitespace. And unlike C, any following text on the line will be treated as a comment and ignored.

This directive tells the parser that the following line should be assumed to be line nn of file filename. (If the filename is not given, it is assumed to be the same as for the previous directive.) This is not typically used by users, but may be used by preprocessors so that diagnostic messages refer to the original file.
