This distribution implements the variational Gaussian process (VGP), as
described in Titsias (2009) and Hensman (2013). The VGP is an
inducing point-based approximation of an exact GP posterior.
Ultimately, this Distribution class represents a marginal distribution over function values at a
collection of index_points. It is parameterized by
a kernel function,
a mean function,
the (scalar) observation noise variance of the normal likelihood,
a set of index points,
a set of inducing index points, and
the parameters of the (full-rank, Gaussian) variational posterior distribution over function values at the inducing points, conditional on some observations.
tfd_variational_gaussian_process(
  kernel,
  index_points,
  inducing_index_points,
  variational_inducing_observations_loc,
  variational_inducing_observations_scale,
  mean_fn = NULL,
  observation_noise_variance = 0,
  predictive_noise_variance = 0,
  jitter = 1e-06,
  validate_args = FALSE,
  allow_nan_stats = FALSE,
  name = "VariationalGaussianProcess"
)
function that acts on index points to produce a (batch
of) vector(s) of mean values at those index points. Takes a
Logical, default FALSE. When TRUE distribution parameters are checked for validity despite possibly degrading runtime performance. When FALSE invalid inputs may silently render incorrect outputs. Default value: FALSE.
Logical, default TRUE. When TRUE, statistics (e.g., mean, mode, variance) use the value NaN to indicate the result is undefined. When FALSE, an exception is raised if one or more of the statistic's batch members are undefined.
name prefixed to Ops created by this class.
a distribution instance.
A VGP is "trained" by selecting any kernel parameters, the locations of the
inducing index points, and the variational parameters. Titsias (2009) and
Hensman (2013) describe a variational lower bound on the marginal log
likelihood of observed data, which this class offers through the
variational_loss method (this is the negative lower bound, for convenience
when passing it to a TF Optimizer as the quantity to minimize).
Training may be done in minibatches.
Titsias (2009) describes a closed form for the optimal variational
parameters, in the case of sufficiently small observational data (i.e.,
small enough to fit in memory but big enough to warrant approximating the GP
posterior). A method to compute these optimal parameters in terms of the full
observational data set is provided as a staticmethod,
optimal_variational_posterior. It returns a
MultivariateNormalLinearOperator instance with optimal location and scale parameters.
Notation
We will in general be concerned with three collections of index points, and it'll be good to give them names:
x[1], ..., x[N]: observation index points -- locations of our observed data.
z[1], ..., z[M]: inducing index points -- locations of the
"summarizing" inducing points.
t[1], ..., t[P]: predictive index points -- locations where we are
making posterior predictions based on observations and the variational
parameters.
To lighten notation, we'll use
X, Z, T to denote the above collections.
Similarly, we'll denote by
f(X) the collection of function values at each of the
x[i], and by
Y the collection of (noisy) observed data at each x[i].
We'll denote kernel matrices generated from pairs of index points as
K_tz, etc., e.g.,
K_tz =
  | k(t[1], z[1])  k(t[1], z[2])  ...  k(t[1], z[M]) |
  | k(t[2], z[1])  k(t[2], z[2])  ...  k(t[2], z[M]) |
  |      ...            ...                 ...      |
  | k(t[P], z[1])  k(t[P], z[2])  ...  k(t[P], z[M]) |
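As a concrete illustration of this notation, the sketch below builds K_tz with NumPy rather than with the library itself. The kernel choice (exponentiated quadratic), the helper name eq_kernel, and the point locations are assumptions made only for the example.

```python
import numpy as np

def eq_kernel(a, b, amplitude=1.0, length_scale=1.0):
    """Exponentiated quadratic kernel matrix between the rows of a and b.

    Hypothetical helper for illustration; not part of the library API.
    """
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return amplitude ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)

T = np.linspace(-1.0, 1.0, 5)[:, None]  # P = 5 predictive index points
Z = np.linspace(-1.0, 1.0, 3)[:, None]  # M = 3 inducing index points

# Rows follow the first subscript, columns the second:
# K_tz[i, j] holds k(t[i+1], z[j+1]) in the 1-based notation of the text.
K_tz = eq_kernel(T, Z)  # shape (P, M)
K_zt = eq_kernel(Z, T)  # shape (M, P), the transpose of K_tz
```

The subscript order fixes the shape: K_tz has one row per predictive point and one column per inducing point.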
A Gaussian process is an indexed collection of random variables, any finite
collection of which are jointly Gaussian. Typically, the index set is some
finite-dimensional, real vector space, and indeed we make this assumption in
what follows. The GP may then be thought of as a distribution over functions
on the index set. Samples from the GP are functions on the whole index set;
these can't be represented in finite compute memory, so one typically works
with the marginals at a finite collection of index points. The properties of
the GP are entirely determined by its mean function
m and covariance function
k. The generative process, assuming a mean-zero normal likelihood with standard deviation sigma, is
f ~ GP(m, k)
Y[i] | f(x[i]) ~ Normal(f(x[i]), sigma), i = 1, ..., N
In finite terms (i.e., marginalizing out all but a finite number of the f(x[i])), we can write
f(X) ~ MVN(loc=m(X), cov=K_xx)
Y[i] | f(x[i]) ~ Normal(f(x[i]), sigma), i = 1, ..., N
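The finite-dimensional generative process can be simulated directly. The following is a minimal NumPy sketch, assuming a zero mean function, an exponentiated quadratic kernel, and arbitrary values for N and sigma; none of these choices come from the library.

```python
import numpy as np

def eq_kernel(a, b, length_scale=1.0):
    """Exponentiated quadratic kernel (hypothetical helper for illustration)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

rng = np.random.default_rng(0)
N = 20
sigma = 0.1  # standard deviation of the normal likelihood (assumed value)
X = np.linspace(-2.0, 2.0, N)[:, None]

# A small diagonal jitter keeps the covariance numerically positive definite.
K_xx = eq_kernel(X, X) + 1e-6 * np.eye(N)

# f(X) ~ MVN(loc=m(X), cov=K_xx), with m(X) = 0 assumed here.
f_X = rng.multivariate_normal(mean=np.zeros(N), cov=K_xx)

# Y[i] | f(x[i]) ~ Normal(f(x[i]), sigma), independently for each i.
Y = f_X + sigma * rng.normal(size=N)
```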
Posterior inference is possible in analytical closed form but becomes intractable as data sizes get large. See Rasmussen (2006) for details.
The VGP is an inducing point-based approximation of an exact GP posterior, where two approximating assumptions have been made:
function values at non-inducing points are mutually independent conditioned on function values at the inducing points,
the (expensive) posterior over function values at inducing points conditional on observations is replaced with an arbitrary (learnable) full-rank Gaussian distribution,
q(f(Z)) = MVN(loc=m, scale=S),
where m and S are parameters to be chosen by optimizing an evidence
lower bound (ELBO).
The posterior predictive distribution becomes
q(f(T)) = integral df(Z) p(f(T) | f(Z)) q(f(Z))
        = MVN(loc = A @ m, scale = B^(1/2))
where
A = K_tz @ K_zz^-1
B = K_tt - A @ (K_zz - S @ S^T) @ A^T
The approximate posterior predictive distribution
q(f(T)) is what the
VariationalGaussianProcess class represents.
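The predictive location A @ m and covariance B can be computed with a few lines of linear algebra. This is a NumPy sketch under stated assumptions (zero prior mean, an exponentiated quadratic kernel, made-up variational parameters m and S), not the library's implementation.

```python
import numpy as np

def eq_kernel(a, b, length_scale=1.0):
    """Exponentiated quadratic kernel (hypothetical helper for illustration)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

rng = np.random.default_rng(0)
Z = np.linspace(-1.0, 1.0, 4)[:, None]  # M = 4 inducing index points
T = np.linspace(-1.5, 1.5, 6)[:, None]  # P = 6 predictive index points
jitter = 1e-6

K_zz = eq_kernel(Z, Z) + jitter * np.eye(4)
K_tz = eq_kernel(T, Z)
K_tt = eq_kernel(T, T)

# Made-up variational parameters: location m and lower-triangular scale S.
m = rng.normal(size=4)
S = np.tril(0.3 * rng.normal(size=(4, 4))) + 0.5 * np.eye(4)

# A = K_tz @ K_zz^-1, computed with a solve instead of an explicit inverse
# (valid because K_zz is symmetric).
A = np.linalg.solve(K_zz, K_tz.T).T
B = K_tt - A @ (K_zz - S @ S.T) @ A.T

pred_loc = A @ m  # location of q(f(T))
pred_cov = B      # covariance of q(f(T))
```

When S @ S^T equals K_zz, B reduces to K_tt, the prior covariance; a tighter variational posterior shrinks the predictive covariance accordingly.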
Model selection in this framework entails choosing the kernel parameters, inducing point locations, and variational parameters. We do this by optimizing a variational lower bound on the marginal log likelihood of observed data. The lower bound takes the following form (see Titsias (2009) and Hensman (2013) for details on the derivation):
L(Z, m, S, Y) =
    MVN(loc = K_xz @ K_zz^-1 @ m, scale_diag = sigma).log_prob(Y)
    - (Tr(K_xx - K_xz @ K_zz^-1 @ K_zx)
       + Tr(S @ S^T @ K_zz^-1 @ K_zx @ K_xz @ K_zz^-1)) / (2 * sigma^2)
    - KL(q(f(Z)) || p(f(Z)))
where in the final KL term,
p(f(Z)) is the GP prior on inducing point
function values. This variational lower bound can be computed on minibatches
of the full data set
(X, Y). A method to compute the negative variational
lower bound is implemented as the variational_loss method.
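To make the three terms of the bound concrete, here is a NumPy sketch that evaluates it on the full data set. It assumes a zero prior mean, an exponentiated quadratic kernel, and K_xz as the N-by-M cross-kernel matrix; the function name lower_bound and all data are hypothetical, and this is not how the library computes its loss internally.

```python
import numpy as np

def eq_kernel(a, b):
    """Unit-amplitude, unit-length-scale exponentiated quadratic kernel."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)

def lower_bound(X, Y, Z, m, S, sigma, jitter=1e-6):
    """Evaluates L(Z, m, S, Y) for a zero-mean GP prior (an assumption)."""
    M = Z.shape[0]
    K_zz = eq_kernel(Z, Z) + jitter * np.eye(M)
    K_xz = eq_kernel(X, Z)
    K_xx = eq_kernel(X, X)
    A = np.linalg.solve(K_zz, K_xz.T).T  # K_xz @ K_zz^-1, shape (N, M)

    # Log density of Y under independent normals centered at A @ m.
    resid = Y - A @ m
    log_lik = np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                     - 0.5 * resid ** 2 / sigma ** 2)

    # Tr(K_xx - K_xz @ K_zz^-1 @ K_zx)
    trace_prior = np.trace(K_xx) - np.sum(A * K_xz)
    # Tr(S @ S^T @ K_zz^-1 @ K_zx @ K_xz @ K_zz^-1) = ||A @ S||_F^2
    trace_var = np.sum((A @ S) ** 2)

    # KL(q(f(Z)) || p(f(Z))) between MVN(m, S S^T) and MVN(0, K_zz).
    cov_q = S @ S.T
    kl = 0.5 * (np.trace(np.linalg.solve(K_zz, cov_q))
                + m @ np.linalg.solve(K_zz, m)
                - M
                + np.linalg.slogdet(K_zz)[1]
                - np.linalg.slogdet(cov_q)[1])

    return log_lik - (trace_prior + trace_var) / (2.0 * sigma ** 2) - kl

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(30, 1))
Y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=30)
Z = np.linspace(-2.0, 2.0, 6)[:, None]
m = np.zeros(6)
S = 0.5 * np.eye(6)
sigma = 0.1

elbo = lower_bound(X, Y, Z, m, S, sigma)
variational_loss = -elbo  # the quantity an optimizer would minimize
```

For any choice of Z, m, and S, the value returned should not exceed the exact log marginal likelihood of Y under the GP prior with the same kernel and noise level, which gives a useful sanity check.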
Optimal variational parameters
As described in Titsias (2009), a closed form optimum for the variational
location and scale parameters,
m and S, can be computed when the
observational data are not prohibitively voluminous. The
optimal_variational_posterior function computes the optimal variational
posterior distribution over inducing point function values in terms of the GP
parameters (mean and kernel functions), inducing point locations, observation
index points, and observations. Note that the inducing index point locations
must still be optimized even when these parameters are known functions of the
inducing index points. The optimal parameters are computed as follows:
C = (K_zz + sigma^-2 K_zx @ K_xz)^-1
optimal Gaussian covariance: K_zz @ C @ K_zz
optimal Gaussian location: sigma^-2 K_zz @ C @ K_zx @ Y
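The closed-form optimum can be sketched in NumPy as follows, using C = (K_zz + sigma^-2 K_zx @ K_xz)^-1, the form given in Titsias (2009). The function name optimal_variational_parameters, the kernel, the zero prior mean, and the data are assumptions for the example; the library's optimal_variational_posterior is not called here.

```python
import numpy as np

def eq_kernel(a, b):
    """Unit-amplitude, unit-length-scale exponentiated quadratic kernel."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)

def optimal_variational_parameters(K_zz, K_zx, Y, sigma):
    """Returns (location, covariance) of the optimal q(f(Z)).

    Hypothetical helper; assumes a zero prior mean function.
    """
    C = np.linalg.inv(K_zz + K_zx @ K_zx.T / sigma ** 2)
    cov = K_zz @ C @ K_zz
    loc = K_zz @ C @ K_zx @ Y / sigma ** 2
    return loc, cov

rng = np.random.default_rng(0)
X = np.linspace(-2.0, 2.0, 25)[:, None]
Y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=25)
Z = np.linspace(-2.0, 2.0, 5)[:, None]
sigma = 0.1

K_zz = eq_kernel(Z, Z) + 1e-6 * np.eye(5)
K_zx = eq_kernel(Z, X)

opt_loc, opt_cov = optimal_variational_parameters(K_zz, K_zx, Y, sigma)

# A lower-triangular scale S with S @ S^T = opt_cov, symmetrized first
# to remove floating-point asymmetry.
opt_scale = np.linalg.cholesky(0.5 * (opt_cov + opt_cov.T))
```

A quick consistency check: if the inducing points coincide with the observation points (Z = X), these parameters reduce to the exact GP posterior mean K (K + sigma^2 I)^-1 Y and covariance K - K (K + sigma^2 I)^-1 K at X.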