Gaussian Processes

Gaussian Processes #

AstroStat Academy, 2026

The content presented in this notebook is the original work of the authors, unless specified otherwise. Any publicly available material incorporated is properly credited to its respective sources. All references to published papers, datasets, and software tools are duly acknowledged. The original content of this notebook is licensed under the GNU General Public License v3.0 (GNU GPLv3).

This notebook presents Gaussian Processes, a fitting method that has countless applications in Statistics and ML.

Notably, it has 2 properties we will exploit:

it can be used to fit a black-box function
it naturally provides uncertainties on the model

Applications:

Gaussian Processes is very powerful on its own, being a generic fitting method for any regression task.

However ..

It shines when incorporated in Bayesian Optimization, a technique we will use for hyperparameter optimization.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Code block for Colab compatibility
import sys, os

REPO_URL = "https://github.com/AstroStat-Academy/astrostat-school-7.git"
REPO_NAME = REPO_URL.split("/")[-1].replace(".git", "")

LESSON_FOLDER = "Gaussian Pocesses"

try:
    import google.colab

    # Clone repo if missing
    if not os.path.exists(REPO_NAME):
        !git clone {REPO_URL}

    # Import path to /src
    sys.path.insert(0, f"{REPO_NAME}/src")

    # Mirror data/ locally if needed
    src_data = f"{REPO_NAME}/{LESSON}/data"
    if not os.path.exists("data") and os.path.exists(src_data):
        !cp -r "{src_data}" .
    
except ImportError:
    sys.path.insert(0, "../src")

# Fancy plot style (import after matplotlib; you can comment it out without consequences)
import plot_style

Preamble: $n$ function points are equal to 1 point in $\mathbb{R}^{n}$#

Let’s assume we have a unknown function $-$ the one we want to model $-$ $f(x)$, where $x$ is 1D.

IMPORTANT: We do not know the functional form of $f$ !

We only know that, for every point $x_i$, we can generate a corresponding $y_i \equiv f(x_i)$

# Unknown function of extra-terrestrial origin:
def f(x):
    return np.exp(-0.5 * x) * np.cos(2 * x)
'''NOTE: You have to pretend we do _not_ know the mathematical form.
         Just use this `f(x)` as a black-box function.'''

xx = np.linspace(0, 5, 50)
X = np.array([0.7, 1.1, 1.7, 1.9, 2.8, 3.5, 3.9, 4.3])

plt.figure()
plt.axhline(y=0, lw=0.5, c='grey')
plt.plot(xx, f(xx), c='C0', lw=0.5, label="We don't know this ..")
plt.scatter(X, f(X), c='C0', s=65, lw=0.5, label=".. but we can have these")
plt.xlabel('x')
plt.ylabel('f(x)')

for i in range(2):
    plt.plot([X[i], X[i]], [0, f(X[i])], ls='--', c='grey', lw=1)  # Vertical segment to curve
    plt.plot([0, X[i]], [f(X[i]), f(X[i])], ls='--', c='grey', lw=1)  # Horizontal segment to y-axis
    plt.text(0, f(X[i])+0.1, str("$f(x_%s)$" % (i+1)), c='C0', fontsize=12, verticalalignment='center')

plt.legend()
plt.show()

../_images/72790467eb500aa806e272f30fb4b8f188a9fd30b1e7cfef13cdee8512214c29.png

We can see the array composed of the $n$ values:

\[[f(x_1), f(x_2),\dots, f(x_n)] = [0.16, -0.38, ...]\]

as 1 point in $\mathbb{R}^{n}$ (but we can show only the first 2 dimensions)

# Define grid
fx1x1 = np.linspace(-0.0, 0.5, 50)
fx2x2 = np.linspace(-0.5, 0.0, 50)

fX1, fX2 = f(X[0]), f(X[1])

from mpl_toolkits.mplot3d import Axes3D

def plot_point_in_Rn(fX1, fX2, fx1x1, fx2x2, rv=None):
    
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    
    # Plot the point of interest
    ax.scatter(fX1, fX2, 0, color='C3', s=50, label="[$f(x1)$, $f(x2)$]")
    
    # Segmented lines to x- and y-axis
    ax.plot([fX1, fX1], [np.min(fx1x1), fX2],   [0, 0], ls='--', c='grey', lw=1)  # Horizontal line to x-axis
    ax.plot([np.min(fx1x1), fX1],   [fX2, fX2], [0, 0], ls='--', c='grey', lw=1)  # Horizontal line to y-axis
    
    ax.text(fX1, np.min(fx1x1), 0, r"$f(x_1)$", fontsize=12, color='C0')
    ax.text(np.min(fx1x1), fX2, 0, r"$f(x_2)$", fontsize=12, color='C0')
    
    ax.set_xlabel('$f_1$')
    ax.set_ylabel('$f_2$')
    ax.set_xlim([np.max(fx1x1), np.min(fx1x1)]) #  Flip for better visualization
    ax.set_ylim([np.max(fx2x2), np.min(fx2x2)]) #  Flip for better visualization
    ax.set_zlim([0, 1]) # Set view above plane

    if rv:
        # Extract the 2D marginal mean and covariance
        rv_2D = multivariate_normal(mean=mu[:2], cov=sigma[:2, :2])

        # Compute probability density function on a grid
        fX1X1, fX2X2 = np.meshgrid(fx1x1, fx2x2)        
        Z = rv_2D.pdf(np.dstack((fX1X1, fX2X2)))

        # Plot the Multivariate Normal Distribution
        ax.plot_wireframe(fX1X1, fX2X2, Z, color='C2', linewidth=0.5, label='Multinormal')
        
        ax.set_zlim([0, np.max(Z)])
        
    plt.legend()
    plt.show()

plot_point_in_Rn(fX1, fX2, fx1x1, fx2x2, rv=None)

../_images/4badec620e8e2d54e445e36a3c4c3482d151ba8890be4ae87937eacf6ba1884b.png

NOTE:

$f_1$: space of all the possible values that the unknown function may have at point $x_1$
$f_2$: space of all the possible values that the unknown function may have at point $x_2$
$f(x_1)$: actual value that the unknown function has at point $x_1$
$f(x_2)$: actual value that the unknown function has at point $x_2$

Let’s now assume that f = $[f(x_1), f(x_2),\dots, f(x_n)]$ is a point that follows a multivariate Gaussian (Normal) distribution in $\mathbb{R}^{n}$.

In other words, we assume that f is sampled from a multivariate Normal

Let’s throw an arbitrary multivariate Normal in the image:

from scipy.stats import multivariate_normal
rng = np.random.default_rng(seed=42)

n_dim = len(X)
# the number of dimensions equals the number of datapoints

# Define Multivariate Normal Distribution
mu = f(X) + rng.uniform(0.05, 0.1, size=n_dim)  # Random means vector (conveniently near the true values)
cov = 0.05*rng.random((n_dim, n_dim))  # Random covariance
sigma = np.dot(cov, cov.T)  # Ensure it's positive semi-definite
#
rv = multivariate_normal(mean=mu, cov=sigma)

plot_point_in_Rn(fX1, fX2, fx1x1, fx2x2, rv=rv)

../_images/eba57d7e375376ad6ca96da34ffb261bf78330ae1f54732e0b6ef704593aa8b6.png

We can pretend that the red dot (f) is sampled at random from this distribution.

NOTE:

A set of variables is multivariate (= jointly) Normal $\leftrightarrow$ every linear combination of them is normally distributed

$\rightarrow$ the projections (“marginalizations”) of the multivariate are also Normal (in 1D):

def plot_marginals(fx1x1, fx2x2, rv, mu, f=None, X=None):
    # Extract the 2D marginal mean and covariance
    rv_2D = multivariate_normal(mean=mu[:2], cov=sigma[:2, :2])

    # Compute probability density function on a grid
    fX1X1, fX2X2 = np.meshgrid(fx1x1, fx2x2)        
    Z = rv_2D.pdf(np.dstack((fX1X1, fX2X2)))

    # Generate marginal distributions
    marginal_f1 = multivariate_normal(mu[0], sigma[0][0])
    marginal_f2 = multivariate_normal(mu[1], sigma[1][1])
    marg_f1_vals = marginal_f1.pdf(fx1x1)
    marg_f2_vals = marginal_f2.pdf(fx2x2)
    
    # Create figure;
    if not f:
        n_rows = 2
        n_cols = 2
        figsize = (4, 4)
        gridspec_kw={'height_ratios': [1, 3], 'width_ratios': [3, 1], 'hspace': 0, 'wspace': 0}
    else:
        n_rows = 2
        n_cols = 4
        figsize = (10, 4)
        gridspec_kw={'height_ratios': [1, 3], 'width_ratios': [3, 1, 1, 6], 'hspace': 0, 'wspace': 0}

    fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize, gridspec_kw=gridspec_kw)
    
    axes[0, 1].axis('off')
    ax_marg_x = axes[0, 0]
    ax_joint  = axes[1, 0]
    ax_marg_y = axes[1, 1]

    if f:
        axes[0, 2].axis('off')
        axes[0, 3].axis('off')
        axes[1, 2].axis('off')
        ax_f = fig.add_subplot(axes[1, 3])

    # Contour plot in central panel
    contour = ax_joint.contour(fX1X1, fX2X2, Z, levels=10, cmap='Greens', zorder=-1)
    
    # Plot the point of interest
    ax_joint.scatter(fX1, fX2, color='C3', s=50, label="[$f(x1)$, $f(x2)$]", zorder=1)
    ax_joint.plot([fX1, fX1], [np.min(fx1x1), fX2], ls='--', c='grey', lw=1)
    ax_joint.plot([np.min(fx2x2), fX1], [fX2, fX2], ls='--', c='grey', lw=1)
    
    ax_joint.set_xlim([np.min(fx1x1), np.max(fx1x1)])
    ax_joint.set_ylim([np.min(fx2x2), np.max(fx2x2)])
    ax_joint.set_xlabel('$f_1$')
    ax_joint.set_ylabel('$f_2$')
    
    # Marginal distributions
    ax_marg_x.plot(fx1x1, marg_f1_vals, color='C2')
    ax_marg_x.set_xlim([np.min(fx1x1), np.max(fx1x1)])
    ax_marg_x.set_xticklabels([])
    ax_marg_x.set_yticklabels([])
    
    ax_marg_y.plot(marg_f2_vals, fx2x2, color='C2')
    ax_marg_y.set_ylim([np.min(fx2x2), np.max(fx2x2)])
    ax_marg_y.set_xticklabels([])
    ax_marg_y.set_yticklabels([])

    if f:
        xx = np.linspace(0, 5, 50)
        ax_f.plot(xx, f(xx), c='C0', lw=0.5, label="We don't know this ..")
        
        ax_f.axhline(y=0, lw=0.5, c='grey')
        ax_f.scatter(X, f(X), c='C0', s=65, lw=0.5, label="Data")
        ax_f.set_xlabel('x')
        ax_f.set_ylabel('f(x)')

        # Plot sampled points
        fX_samples = rv.rvs(size=10, random_state=rng)
        ax_joint.scatter(fX_samples[:, 0], fX_samples[:, 1], color='C1', s=30, zorder=1)

        for fX_sample in fX_samples:
            ax_f.scatter(X, fX_sample, c='C4', s=30, lw=0.5, label="Data", alpha=0.5)
                
    plt.show()
    
plot_marginals(fx1x1, fx2x2, rv, mu, f=None, X=None)

../_images/7889ee2c8f583801dceee50bf978325aa59228fa7628c48e899cc8955648b108.png

Gaussian Processes#

A Gaussian Process (GP) is a distribution over functions.

We assign probabilities to entire functions $\rightarrow$ any function sampled from a GP is a possible “realization” of a stochastic process.

Definition#

A function f(x) is distributed according to a GP if:

for any set of input points $[x_1, \dots, x_n]$, the corresponding set of function values [ f(x$_1$), …, f(x$_n$) ] follows a multivariate Gaussian (Normal) distribution.

$\rightarrow$ Yes, just like we visualized above!

Simliarly, we can sample other points from the same distribution, and plot them:

plot_marginals(fx1x1, fx2x2, rv, mu, f=f, X=X)

../_images/97bba3addf4ccd4e4a7359c418415af7bab7e5fd0e956614766ec068f5db1777.png

However, we see that the sampled points do not follow the data.

$\rightarrow$ Solving GP means to find the correct Gaussian that describes the data.

If we do, we can predict the function at any $x_*$!

Intuition#

But … how can we find the whole Gaussian by using only 1 datapoint (the red one)?

Key idea: Function values at different points are not $i.i.d.$ but instead exhibit covariance.

$\rightarrow$ When sampling from a GP, we effectively draw an entire set of function values [$f(x_1), \dots, f(x_n)$], not just independent values at each point $x$.

In mathematical terms#

If a function is sampled from a GP, we write:

\[ f \sim \mathcal{GP}(m(x), k(x, x')), \]

where:

$m(x)$ is the mean function $-$ gives the expected value of $f(x)$ at any $x$
$k(x, x')$ is the “covariance” (or “kernel”) $-$ tells how much the function values at $x$ and $x'$ are expected to vary together

Equivalently $-$ we say that f = $[f(x_1), f(x_2),\dots, f(x_n)]$ is sampled from a multivariate Normal $\mathcal{N}$:

\[\begin{split} \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_n) \end{bmatrix}, \begin{bmatrix} k(x_i, x_j) \end{bmatrix}_{i,j} \right). \end{split}\]

with mean $m(x)$ and covariance $k(x, x')$.

Or, in a more compact form:

\[ \mathbf{f} \sim \mathcal{N} \left(\mathbf{m}, K\right). \]

This will be our model.

Problem Statement#

Objective: Find the model’s best m and K that allow to approximate the unknown function $f(x)$

Our observables are a set of inputs: $$ X = \{ x_1, x_2, \dots, x_n \} ~~~,~~~ x_i \in \mathbb{R}^{n} $$

and corresponding outputs: $$ \mathbf{y} = [y_1, y_2, \dots, y_n]^T ~~~,~~~ y_i \in \mathbb{R} $$

Assumption 1: The observed $y_i$ are not error-less $-$ instead, they are noisy realizations of $f(x)$:

\[ y_i = f(x_i) + \epsilon_i ~~~,~~~ \epsilon_i \sim \mathcal{N}(0, \sigma^2_n) \tag{1}\]

“f” $\rightarrow$ true function values (unknown)
“y” $\rightarrow$ observed function values

Assumption 2: $\epsilon_i$ are independent Gaussian noise terms with variance $\sigma^2_n$

(Yes, a lot of different Gaussians are involved!)

Solving GP#

Inference of test points

Now, suppose we wish to predict $f_*$ = $f(x_*)$ at new test point $x_*$ $\rightarrow$ we have to extend f to a new array:

\[\begin{split}\begin{bmatrix} \mathbf{f} \\ f_* \end{bmatrix}\end{split}\]

Since $-$ by definition $-$ any finite collection of function values in a GP follows a multivariate Normal:

(1)#\[\begin{equation} \tag{2} \begin{bmatrix} \mathbf{f} \\ f_* \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} \mathbf{m} \\ m_* \end{bmatrix}, \begin{bmatrix} K & k_* \\ k_*^T & k_{**} \end{bmatrix} \right) \end{equation}\]

where:

$m_*$ is the mean vector evaluated at $x_*$,
$K$ is the $n \times n$ covariance matrix on $X$,
$k_* = [k(x_∗,x_1), \dots, k(x_∗,x_n)]^T$ is the covariance vector between the test point and the training points,
$k_{∗∗} = k(x_∗,x_∗)$ is the variance at the test point.

Of course, we don’t have the array f, but y (our observations) $\rightarrow$ we actually modify Eq. 2 to include y:

(2)#\[\begin{equation} \tag{3} \begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \dots~? \end{equation}\]

Deriving the posterior

Our goal is to obtain the “posterior distribution”, i.e., $p(f_* | y)$ $\rightarrow$ If we have that, we can infer $f_*$ for any new $x_*$.

Let’s derive it $-$ Starting form Eq. 1:

\[ \mathbf{y} \sim \mathbf{f} + \boldsymbol{\epsilon} ~~~,~~~ \boldsymbol{\epsilon} \sim \mathcal{N} \left(0, \sigma^2_n I\right). \]

Since $\mathbf{f} \sim \mathcal{N} \left(\mathbf{m}, K\right)$ and the noise $\epsilon_i$ is Gaussian, the observed data $y$ is also Gaussian:

\[ \mathbf{y} \sim \mathcal{N} \left(\mathbf{m}, K + \sigma^2_n I\right). \tag{4} \]

with variance equalling the sum of variances of f and $\epsilon_i$.

Plugging this into Eq. 2, we get:

(3)#\[\begin{equation} \tag{5} \begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} \mathbf{m} \\ m^* \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & k_* \\ k_*^T & k_{**} \end{bmatrix} \right). \end{equation}\]

For a joint Normal like the one in Eq. 5, there is a rule to derive the conditional probability …

[Detailed derivation] (click here to expand)

\[\begin{split} \begin{bmatrix} a \\ b \end{bmatrix} \sim \mathcal{N}\!\Biggl( \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \Biggr) \end{split}\]

\[ b \mid a \;\sim\; \mathcal{N}\!\Bigl(\mu_{b|a},\, \Sigma_{b|a}\Bigr), \]

\[ \text{where}\quad \mu_{b|a} \;=\; \mu_b \;+\; \Sigma_{ba}\,\Sigma_{aa}^{-1}\,\bigl(a-\mu_a\bigr), \quad \Sigma_{b|a} \;=\; \Sigma_{bb} \;-\; \Sigma_{ba}\,\Sigma_{aa}^{-1}\,\Sigma_{ab}. \]

We identify, in our case:

\[\begin{split} \begin{aligned} a &= \mathbf{y}, \quad \mu_a = \mathbf{m}, \quad \Sigma_{aa} = K + \sigma_n^2 I,\\ b &= f_\ast, \quad \mu_b = m_\ast, \quad \Sigma_{ba} = k_\ast, \quad \Sigma_{ab} = k_\ast^T, \quad \Sigma_{bb} = k_{\ast\ast}. \end{aligned} \end{split}\]

… which gives:

\[ f_* \mid \mathbf{y} \;\sim\; \mathcal{N}\!\Bigl(m_* + k_*\,(K + \sigma_n^2\,I)^{-1}(\mathbf{y} - \mathbf{m}),\; k_{**} - k_* \,(K + \sigma_n^2\,I)^{-1} k_*^T\Bigr). \tag{6} \]

$\rightarrow$ To predict $f_*$ at point $x_*$, we will sample from this Multinormal! $\blacksquare$

Q: BTW, what does it even mean to “condition” a multivariate Normal? (in-class discussion)

[Spoiler] (click here to expand)

Sorry, you should have attended in class, for this!

Interpretation#

\[ \text{posterior mean} ~~~~~~ m_*^{\text{post}} = m_* + k_*\,\bigl(K + \sigma_n^2\,I\bigr)^{-1}\,\bigl(\mathbf{y} - \mathbf{m}\bigr) \]

is our best estimate for the function values at the test inputs $x_*$

\[ \text{posterior covariance} ~~~~~~ \Sigma_*^{\text{post}} = k_{**} - k_*\,\bigl(K + \sigma_n^2\,I\bigr)^{-1}\,k_*^T \]

quantifies our uncertainty about these estimates

The kernel

What was $k(x, x')$, again?

“tells how much the function values at $x$ and $x'$ are expected to vary together” = covariance between $f(x)$ and $f(x')$

In practice, we don’t know $\rightarrow$ we assume a function, e.g., the Radial Basis Function (RBF) kernel:

\[ k(x, x') = \exp \left( -\frac{||x - x'||^2}{2\ell^2} \right), \]

Completing the derivation#

Yes,

Eq. 6

But,

we forgot a detail

$\rightarrow$ We need to derive $\ell$ (or any other kernel hyperparameter) !
$\rightarrow$ We need to derive $\sigma_n$ (Eq. 4)!

The hyperparameter $\sigma_n$ represents the expected noise around the true values of f

How?

We must find the best $\hat{\sigma_n}$ and $\hat{\ell}$ maximize the likelihood that our model returns the observed data:

\[ (\hat{\sigma}_n, \hat{\ell}) \;=\; \underset{\sigma_n,\,\ell}{\mathrm{arg\,max}} \; p\bigl(data \mid model): \]

a.k.a. Maximum Likelihood Estimation (MLE)

To be specific, in our case:

\[ (\hat{\sigma}_n, \hat{\ell}) \;=\; \underset{\sigma_n,\,\ell}{\mathrm{arg\,max}} \; p\bigl(\mathbf{y} \mid \mathbf{X}, \sigma_n, \ell\bigr). \tag{7} \]

$\rightarrow$ We just need $p\bigl(\mathbf{y} \mid \mathbf{X}, \sigma_n, \ell\bigr)$

… But, we have it: Eq. 4 !

\[ \mathbf{y} \sim \mathcal{N} \left(\mathbf{m}, K + \sigma^2_n I\right). \]

In other words:

\[ p\bigl(\mathbf{y}\,\big|\mathbf{X},\sigma_n, \ell\bigr) \;= =\; \mathcal{N}\!\bigl(\mathbf{m},\,K_{\boldsymbol{\ell}} + \sigma_n^2 I\bigr). \]

NOTE: The dependence on X is included in K, which is a matrix with entries $k(x_i, x_j)$.

… which $-$ in matrix notation $-$ is:

\[ \mathcal{N}\!\bigl(\mathbf{m},\,K_{\boldsymbol{\ell}} + \sigma_n^2 I\bigr) \;=\; \frac{1}{\sqrt{(2\pi)^n\,\det\!\bigl(K_{\boldsymbol{\ell}} + \sigma_n^2 I\bigr)}} \;\exp\!\Bigl(-\tfrac{1}{2}\, (\mathbf{y}-\mathbf{m})^\mathsf{T}\, \bigl(K_{\boldsymbol{\ell}} + \sigma_n^2 I\bigr)^{-1}\, (\mathbf{y}-\mathbf{m}) \Bigr). \]

Replacing this in Eq. 7:

\[ (\hat{\sigma}_n, \hat{\ell}) \;=\; \underset{\sigma_n,\,\ell}{\mathrm{arg\,max}} \; \frac{1}{\sqrt{(2\pi)^n\,\det\!\bigl(K_{\boldsymbol{\ell}} + \sigma_n^2 I\bigr)}} \;\exp\!\Bigl(-\tfrac{1}{2}\, (\mathbf{y}-\mathbf{m})^\mathsf{T}\, \bigl(K_{\boldsymbol{\ell}} + \sigma_n^2 I\bigr)^{-1}\, (\mathbf{y}-\mathbf{m}) \Bigr). \]

Q: Which technique we can use to find the max? (in-class discussion)

[Spoiler] (click here to expand)

$\rightarrow$ Any standard optimization algorithm, e.g., Gradient Descent!

I’m gonna sleep, show me the code already#

After all this theory, the sklearn code will look underwhelming, for its simplicity ..

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
rng = np.random.default_rng(seed=1)

def f_noisy(X):
    return f(X.flatten()) + 0.1 * rng.normal(size=X.shape[0])

# Data
X_train = np.array([0.7, 1.1, 1.7, 1.9, 2.8, 3.5, 3.9, 4.3]).reshape(-1, 1)
'''same `X` as above, but reshaped to 2D for sklearn'''
y_train = f_noisy(X)
'''creating y from the "unknown" f (*wink *wink) plus a little noise'''

# Define a kernel:
#    - RBF is the radial basis function kernel
#    - WhiteKernel handles noise in the data
guess_l = 1.0
guess_sigma_n = 0.3
kernel = RBF(length_scale=guess_l) + WhiteKernel(noise_level=guess_sigma_n)
'''
Here we initialize the kernels with the initial guesses for their hyperparameters.
Internally, sklearn's `GaussianProcessRegressor` will optimize them.

Initial guesses are important! A fit may fail if the guesses are too far off.
'''

# Initialize the Gaussian Process Regressor
gpr = GaussianProcessRegressor(kernel=kernel, random_state=42)

# Fit the model
gpr.fit(X_train, y_train)

GaussianProcessRegressor(kernel=RBF(length_scale=1) + WhiteKernel(noise_level=0.3),
                         random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Now that the model is fit, we can print the best hyperparamers:

# Print the chosen kernel parameters after fitting
print("Optimized Kernel:", gpr.kernel_)

Optimized Kernel: RBF(length_scale=1.09) + WhiteKernel(noise_level=0.00492)

And, of course, predict:

yhat = gpr.predict(X_train)

xx = np.linspace(0, 5, 50)

plt.figure()
plt.axhline(y=0, lw=0.5, c='grey')
plt.plot(xx, f(xx), c='C0', lw=0.5, label="Unknown $f(x)$")
plt.scatter(X_train.flatten(), y_train, c='C0', s=65, lw=0.5, label=r"$y$ (noisy)")
plt.scatter(X_train.flatten(), yhat,    c='C3', s=65, lw=0.5, label=r"$\hat{y}$")
'''NOTE: We need to flatten `X_train` for the plot, since it is 2D'''
plt.xlabel('x')
plt.ylabel('f(x)')

plt.legend()
plt.show()

../_images/0067858fb28902f05c9a4276eea0a81e957a1a189cec5c450e8bfea9e80ba888.png

You know what? Let’s predict for a new test point $x_*$ !

No, better, for 50 test points in the interval 0 < $x_*$ < 5:

    X_test = np.linspace(0, 5, 50).reshape(-1, 1)

# Predict on new data
X_test = np.linspace(0, 5, 50).reshape(-1, 1)
yhat_test = gpr.predict(X_test)

plt.figure()
plt.axhline(y=0, lw=0.5, c='grey')
plt.plot(xx, f(xx), c='C0', lw=0.5, label="Unknown $f(x)$")
plt.scatter(X_train.flatten(), y_train,         c='C0', s=65, lw=0.5, label="$y$ (noisy)")
plt.scatter(X_test.flatten(),  yhat_test, c='C3', s=30, lw=0.5, label=r"$\hat{y}_{test}$")
plt.plot(X_test.flatten(), yhat_test, c='C3', lw=1)
'''NOTE: We need to flatten `X` and `X_test` for the plot, since they are 2D'''
plt.xlabel('x')
plt.ylabel('f(x)')

plt.legend()
plt.show()

../_images/e7f20804e73fb57f0585a301502bd3cba211498e1aa44230d1c3cbaf2b71d9c1.png

Q: What we are seeing here?

It is the mean function $m(x)$ $\leftrightarrow$ the most probable function that describes the true $f(x)$

Remember the interpretation ?

\[ \text{posterior mean} ~~~~~~ m_*^{\text{post}} = m_* + k_*\,\bigl(K + \sigma_n^2\,I\bigr)^{-1}\,\bigl(\text{\textbf{y}} - \text{\textbf{m}}\bigr) \]

is our best estimate for the function values at the test inputs $x_*$

If there is a “mean”, there must be uncertainty around it:

\[ \text{posterior covariance} ~~~~~~ \Sigma_*^{\text{post}} = k_{**} - k_*\,\bigl(K + \sigma_n^2\,I\bigr)^{-1}\,k_*^T \]

quantifies our uncertainty about these estimates

We can extract it like this (full covariance matrix):

    yhat_test, yhat_test_cov = gpr.predict(X_test, return_cov=True)

or like this (only diagonal terms, i.e. standard deviations at each $x_*$):

    yhat_test, yhat_test_std = gpr.predict(X_test, return_std=True)

yhat_test, yhat_test_std = gpr.predict(X_test, return_std=True)

plt.figure()
plt.axhline(y=0, lw=0.5, c='grey')
plt.plot(xx, f(xx), c='C0', lw=0.5, label="Unknown $f(x)$")
plt.scatter(X_train.flatten(), y_train, c='C0', s=65, lw=0.5, label="$y$ (noisy)")
plt.plot(X_test.flatten(), yhat_test, c='C3', lw=1, label=r"$\hat{y}_{test}$ (mean function)")
'''NOTE: We need to flatten `X` and `X_test` for the plot, since they are 2D'''
plt.fill_between(
    X_test.ravel(),
    yhat_test - yhat_test_std,
    yhat_test + yhat_test_std,
    alpha=0.2,
    color='tomato',
    label=r'1-$\sigma$ confidence interval'
)
plt.xlabel('x')
plt.ylabel('f(x)')

plt.legend()
plt.show()

../_images/4aef165b84f686b1450f01ea189b0ccb8c0dcdcb2145a7de178e9d6c327ccf50.png

Exercise [45 min]

Objective: Fill-in missing pixels in an image.

Dataset: An astronomical image.

from astropy.io import fits
from astropy.visualization import ZScaleInterval
import requests
from io import BytesIO
from skimage.transform import rescale

# Download NGC 1300 galaxy image from HST archive (FITS format)
fits_url = "https://hla.stsci.edu/cgi-bin/fitscut.cgi?red=hst_08597_11_wfpc2_f606w_pc&size=ALL&format=fits&config=ops"
response = requests.get(fits_url)
fits_file = BytesIO(response.content)

# Load the FITS image
data_fits = fits.getdata(fits_file)

# Crop image around galaxy
data_crop = data_fits[350:650, 425:725]

# Downscale the image (for faster computations)
scale_factor = 5
data = rescale(data_crop, 1/scale_factor, anti_aliasing=True, preserve_range=True).astype(data_crop.dtype)

# > Figure
fig, axes = plt.subplots(nrows=1, ncols=1)
axes = [axes]

# Apply ZScale interval for color scaling
interval = ZScaleInterval()
vmin, vmax = interval.get_limits(data)

axes[0].set_title("Original image", fontsize=12)
axes[0].imshow(data, cmap='magma', origin='lower', vmin=vmin, vmax=vmax)

plt.tight_layout()
plt.show()

../_images/c0fba3bb9cd5b89343a114b525319e47f47527a293695c35964a3b5f5d976a93.png

IMPORTANT:

In this image, each pixel has an associated value $\rightarrow$ indicated by the color.

This is our original data, the image for galaxy NGC 1300.

But, we will “corrupt” it by creating holes in it.

# Generate `n_holes` random circular gaps and set their values to np.nan
rng = np.random.default_rng(seed=2)
masked_data = data.copy()
n_holes = 20

rows, cols = data.shape
for _ in range(n_holes):
    x, y = rng.integers(0, cols), rng.integers(0, rows)
    radius = rng.integers(2, 5) # we add holes of 2 to 5 pixels at random
    y_grid, x_grid = np.ogrid[:rows, :cols]
    mask = (x_grid - x) ** 2 + (y_grid - y) ** 2 < radius ** 2
    masked_data[mask] = np.nan

'''NOTE:
    If we create too many holes (`n_holes`) or too large holes (`radius`),
    we may not have enough data to properly fit the data.

    You may experiment with these quantities and see the effect on the
    final result.
'''

# > Figure
fig, axes = plt.subplots(nrows=1, ncols=1)
axes = [axes]

# Apply ZScale interval for color scaling
interval = ZScaleInterval()
vmin, vmax = interval.get_limits(data)

axes[0].set_title("Image with missing values", fontsize=12)
axes[0].imshow(masked_data, cmap='magma', origin='lower', vmin=vmin, vmax=vmax)

plt.tight_layout()
plt.show()

../_images/2d4aea029aee4732611bab726f4b5215deb01b6883516d2304785a37c8257cdf.png

Task: Use GP to create a model that can predict the missing values.

In particular, you must:

Train the GP model
Predict on missing data
Use your predictions to color the missing spots in the image!

Hint: For step 3, when you need to create an image with your predictions, you literally have to invert the creation of y_test (see next block)

Q: How can we prepare the data for GP? (in-class discussion)

[Spoiler] (click here to expand)

We can place the pixel coordinates in the matrix $\rightarrow$ X

.. and the corresponding values (color, in the image) in the array $\rightarrow$ y

from pandas import option_context

# Put the coordinates of valid and invalid pixels in X ...
X_train = np.column_stack(np.where(~np.isnan(masked_data)))  # Coordinates of pixels that are NOT NaN
X_test  = np.column_stack(np.where(np.isnan(masked_data)))   # Coordinates of pixels that are NaN

# Put the values of valid and invalid pixels in y ...
y_train = data[np.where(~np.isnan(masked_data))]
y_test  = data[np.where(np.isnan(masked_data))]

with option_context("display.max_rows", 4):
    print('Train data:')
    display(pd.DataFrame(np.column_stack((X_train, y_train)), columns=['X1', 'X2', 'y']))
    print('Test data:')
    display(pd.DataFrame(np.column_stack((X_test, y_test)), columns=['X1', 'X2', 'y']))

Train data:

	X1	X2	y
0	0.0	0.0	0.066435
1	0.0	1.0	0.068779
...	...	...	...
3063	59.0	58.0	0.049804
3064	59.0	59.0	0.050841

3065 rows × 3 columns

Test data:

	X1	X2	y
0	3.0	25.0	0.146869
1	3.0	26.0	0.146841
...	...	...	...
533	58.0	41.0	0.104825
534	58.0	42.0	0.100024

535 rows × 3 columns

You can now proceed with the GP code ...

# Your code here. Add blocks as you deem fit.

[Solution below]

%%time

# 1. Train the GP

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
rng = np.random.default_rng(seed=1)

# Define a kernel:
#    - RBF is the radial basis function kernel
#    - WhiteKernel handles noise in the data
guess_l = 1.0
guess_sigma_n = 0.3
kernel = RBF(length_scale=guess_l) + WhiteKernel(noise_level=guess_sigma_n)

# Initialize the Gaussian Process Regressor
gpr = GaussianProcessRegressor(kernel=kernel, random_state=42)

# Fit the model
gpr.fit(X_train, y_train)

# Print the chosen kernel parameters after fitting
print("Optimized Kernel:", gpr.kernel_)

Optimized Kernel: RBF(length_scale=1.96) + WhiteKernel(noise_level=0.0118)
CPU times: user 4min 56s, sys: 2.62 s, total: 4min 58s
Wall time: 21.3 s

# 2. Predict

yhat_test = gpr.predict(X_test, return_std=False)
print('yhat_test shape: %s' % yhat_test.shape)

yhat_test shape: 535

# 3. Color the plot

# Create a new image
filled_data = masked_data.copy()
# Fill the masked points with predictions
filled_data[np.where(np.isnan(masked_data))] = yhat_test

# > Figure
fig, axes = plt.subplots(figsize=(10, 4), nrows=1, ncols=3)

# Apply ZScale interval for color scaling
interval = ZScaleInterval()
vmin, vmax = interval.get_limits(data)

axes[0].set_title("Image with missing values", fontsize=12)
axes[0].imshow(masked_data, cmap='magma', origin='lower', vmin=vmin, vmax=vmax)
axes[1].set_title("Image with filled gaps", fontsize=12)
axes[1].imshow(filled_data, cmap='magma', origin='lower', vmin=vmin, vmax=vmax)
axes[2].set_title("Original image", fontsize=12)
axes[2].imshow(data, cmap='magma', origin='lower', vmin=vmin, vmax=vmax)

plt.tight_layout()
plt.show()

../_images/22336ec468c947aebb9a67942c4a828cf9c35b4d808d3ae4e737e1bc316021aa.png

[End of solution]

Noticed anything? GP is quite sloooow …

Complexity:

Training: $\mathcal{O}(n^3)$ $-$ due to inversion of the covariance matrix
Prediction: $\mathcal{O}(n^2)$ $-$ for computing the mean and variance

Food-for-thought#

Interpretation

Why GP works “reasonably well” here?

GP uses a kernel $k(x_i, x_j)$ which represents the correlation between $f(x_i)$ and $f(x_j)$

Q: What is $f(x_i)$ and $f(x_j)$, in the case of our image?

[Spoiler] (click here to expand)

The pixel values (colors) !

So the question becomes: are the pixel values at $x_i$ and $x_j$ correlated at all?

$\rightarrow$ Yes! Because the galaxy light profile changes smoothly!

Uncertainties

What is the uncertainty on the prediction?

We filled the image with the mean predicted value.

But GP also provides standard deviation.

$\rightarrow$ How could we display that ?

Gaussian Processes

Contents

Gaussian Processes #

Preamble: \(n\) function points are equal to 1 point in \(\mathbb{R}^{n}\)#

Gaussian Processes#

Definition#

Intuition#

In mathematical terms#

Problem Statement#

Solving GP#

Interpretation#

Completing the derivation#

I’m gonna sleep, show me the code already#

Food-for-thought#