Approximation: Interpolation and Regression#
Approximation is a broad concept in mathematics and numerical analysis that involves finding a function or model that closely represents a set of data points or a more complex function. The goal is to provide a simpler or more manageable representation while maintaining accuracy within acceptable bounds. In order to correctly use the library, it is important to understand the differences between the two subdomains, interpolation and regression.
Interpolation:
Definition: Interpolation is the process of constructing a new function that passes exactly through a given set of known data points. The function is typically used to estimate unknown values within the range of the known data points.
Application: Used when you have a set of exact data points and want to estimate values within the range of the data. It is often used in fields like computer graphics, engineering, and numerical analysis.
Methods: Common methods include polynomial interpolation (like Lagrange interpolation), spline interpolation, and piecewise linear interpolation.
Regression:
Definition: Regression is the process of fitting a function or model to a set of data points, typically with the goal of modeling the relationship between variables. Unlike interpolation, regression does not necessarily pass through the given data points; instead, it aims to minimize the difference (or error) between the data points and the model.
Application: Used when the data is noisy or when there is a need to model a trend, often in statistics, machine learning, and economics. Regression is used to predict or infer the relationship between variables.
Methods: Common methods include linear regression, polynomial regression, and non-linear regression techniques.
Interpolation#
📖 Interpolation API Reference
Lagrange Interpolation in 1d#
Given \(n+1\) data points \(((x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n))\), the Lagrange interpolation polynomial \(P(x)\) is expressed as:
where \(L_i(x)\) are the Lagrange basis polynomials defined as
Each \(L_i(x)\) is constructed so that \(L_i(x_j)=1\) if \(i=j\) and \(L_i(x_j)=0\) if \(i \neq j\). This ensures that the polynomial \(P(x)\) passes through all given points exactly.
You can generate Lagrange basis polynomials of any degree using the :func:`~sigmaepsilon.math.approx.lagrange.gen_Lagrange_1d` function. Note that it only generates the basis functions $L_i(x)$, not the interpolation function $P(x)$. If you want the interpolation polynomial $P(x)$ directly, use the :func:`~sigmaepsilon.math.approx.lagrange.approx_Lagrange_1d` function.Generating the Lagrange Basis Polynomials#
[1]:
from sigmaepsilon.math.approx.lagrange import gen_Lagrange_1d
Let say you want to obtain the approximation functions for the case where the values are known at the points \(x_1 = -1\), \(x_2 = 0\) and \(x_3 = 1\).
[2]:
functions = gen_Lagrange_1d(x=[-1, 0, 1])
functions
[2]:
DeepDict({1: DeepDict({'symbol': '\\phi_{1}', 0: x*(x - 1)/2, 1: x - 1/2, 2: 1, 3: 0}), 2: DeepDict({'symbol': '\\phi_{2}', 0: 1 - x**2, 1: -2*x, 2: -2, 3: 0}), 3: DeepDict({'symbol': '\\phi_{3}', 0: x*(x + 1)/2, 1: x + 1/2, 2: 1, 3: 0})})
With these settings, the function returns the necessary approximation functions and their derivatives up to 3rd.
[3]:
type(functions), list(functions.keys())
[3]:
(sigmaepsilon.deepdict.deepdict.DeepDict, [1, 2, 3])
The returned object is a deep dictionary from sigmaepsilon.deepdict. The length of the dictionary is 3, because there are 3 data points. If we look into the first of these, this is what we see:
[4]:
functions[1]
[4]:
DeepDict({'symbol': '\\phi_{1}', 0: x*(x - 1)/2, 1: x - 1/2, 2: 1, 3: 0})
We can see, that every approximation function has 4 items. The first of these is the LaTeX string symbol of the symbol of the function, the others are functions to evaulate the approximation function (with key 0) and its derivatives.
[5]:
functions[1, "symbol"]
[5]:
'\\phi_{1}'
[6]:
functions[1, 0]
[6]:
[7]:
type(functions[1, 0])
[7]:
sympy.core.mul.Mul
If you prefer zero based indexing for the functions, you can have control over it using the argument i as follows:
[8]:
functions = gen_Lagrange_1d(x=[-1, 0, 1], i=[0, 1, 2])
functions
[8]:
DeepDict({0: DeepDict({'symbol': '\\phi_{0}', 0: x*(x - 1)/2, 1: x - 1/2, 2: 1, 3: 0}), 1: DeepDict({'symbol': '\\phi_{1}', 0: 1 - x**2, 1: -2*x, 2: -2, 3: 0}), 2: DeepDict({'symbol': '\\phi_{2}', 0: x*(x + 1)/2, 1: x + 1/2, 2: 1, 3: 0})})
You can see that in this case the LaTeX code of the first function is \(\phi_{0}\), instead of \(\phi_{1}\), hence the data corresponding to the first apprimation function is accessible with the key 0, instead of 1 as it was in the previous case.
[9]:
functions[0, "symbol"]
[9]:
'\\phi_{0}'
You can obtain the approximation functions in purely symbolic form, where the data points are variables in the functions:
[10]:
gen_Lagrange_1d(i=[1, 2], sym=True)[1, 0]
[10]:
Here, the argument \(i=[1, 2]\) only affects the indices used in the output.
[11]:
gen_Lagrange_1d(i=[0, 1], sym=True)[0, 0]
[11]:
[12]:
gen_Lagrange_1d(i=[0, 1], sym=False)[0, 0]
[12]:
Clearly, if we substitute the right values, we can see that the result is the same:
[13]:
gen_Lagrange_1d(i=[0, 1], sym=True)[0, 0].subs({'x_1': -1, 'x_2': 1})
[13]:
Let’s plot the approximation functions for interpolating over 4 points in the standard domain \([-1, 1]\).
[14]:
from sigmaepsilon.math.function import Function
import matplotlib.pyplot as plt
from matplotlib import gridspec
import numpy as np
inds = [1, 2, 3, 4]
data = gen_Lagrange_1d(i=inds)
fig = plt.figure(figsize=(4, 7)) # in inches
fig.patch.set_facecolor('white')
gs = gridspec.GridSpec(len(inds), 1)
xdata = np.linspace(-1, 1, 100)
for i, ind in enumerate(inds):
ax = fig.add_subplot(gs[i])
label = '$' + data[ind]['symbol'] + '$'
ax.set_title(label)
fnc = Function(data[ind][0])
fdata = fnc([xdata])
ax.plot(xdata, fdata)
ax.hlines(y=0, xmin=-1, xmax=1, colors='k', zorder=-10, lw=1.0)
fig.tight_layout()
Generating the Lagrange Interpolation Polynomial#
If you want the interpolation polynomial $P(x)$ directly, use the approx_Lagrange_1d() function.
If you want to obtain a numerical function, it is suggested to specify lambdify=True. With this option, the resulting symbolic expression is turned into a high-performance ‘NumPy function’ via calling sympy.lambdify.
[15]:
from sigmaepsilon.math.approx import approx_Lagrange_1d
points, values = [-1, 1], [0, 10]
approx = approx_Lagrange_1d(points, values, lambdify=True)
approx(-1), approx(1), approx(0)
[15]:
(0, 10, 5)
It is also possible to use symbols in either the locations
[16]:
import sympy as sy
L = sy.symbols('L', real=True, positive=True)
fnc = approx_Lagrange_1d([0, L], [-1, 1])
dfnc = fnc.diff('x')
dfnc
[16]:
or as the values
[17]:
fnc = approx_Lagrange_1d([-1, 1], [0, L])
dfnc = fnc.diff('x')
dfnc
[17]:
Lagrange Interpolation in Higher Dimensions#
Due to the need for the approximation function to pass through each data point precisely, generating Lagrange interpolation polynomials in higher dimensions is a highly challenging task. While it can be relatively straightforward in 2 or 3 dimensions, finding a general solution becomes significantly more complex, and there is currently no practical demand for it. If you’re looking to perform interpolation in 2 or 3 dimensions over domains or arbitrary shapes, we recommend using the
sigmaepsilon.mesh or PyVista libraries.
Regression#
If you don’t need the approximation function to pass through the data points exactly, you need regression.
The library provides two ways to find fit functions by means of linear or nonlinear regression. You can use the least-quares, weighted least-squares or the moving least-squares method. The two ways of arriving at the fit function (and possibly its derivatives) are:
By using the
least_squares(),weighted_least_squares()andmoving_least_squares()functions. These implementations are less performant, but provide great flexibility and customization of behavior.By using the
MLSApproximatorclass. This is a high-performance implementation, at the cost of the weight function being fixed. Also, at the moment this class can only be used to approximate the function values.
Least-Squares (LS) Approximation#
We create a globally defined fit function using 9 points and two sets of known values.
[18]:
import numpy as np
# the coordinates of the known values
points = [(1, 1), (1, -1), (-1, 1), (-1, -1), (0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
points = np.array(points)
# two sets of known values
values = [[1, -0.5, 1, 1, -1, 0, 0, 0, 0], [1, -1, 0, 0, 1, 0, -1, -1, 1]]
# transpose the values because the function expects datasets to be columns
values = np.array(values).T
points.shape, values.shape
[18]:
((9, 2), (9, 2))
From the shapes of these arrays, we can tell that we have 9 points in 2d and we also have 2 points for each point.
We wish to fit a quadratic (2nd order) polynomial for a set of points in a 2 dimensional space (each point has 2 coordinates).
[19]:
from sigmaepsilon.math.approx import least_squares
approx = least_squares(points, values, deg=2, order=1)
We can use this approximator to evaluate the fit function and its derivatives. For instance, if we wanted to evaluate the approximation functions at \(\mathbf{x} = (0, 0)\):
[20]:
f, fdx, fdy, fdxx, fdyy, fdxy = approx([0, 0])
f, fdx, fdy, fdxx, fdyy, fdxy
[20]:
(array([-0.83333333, 0.33333333]),
array([-0.25 , 0.16666667]),
array([0.25, 0. ]),
None,
None,
None)
We can see that the approximation functions for the partial derivatives \(\frac{\partial^2 f}{\partial x^2}\), \(\frac{\partial^2 f}{\partial z^2}\) and \(\frac{\partial^2 f}{\partial x \partial y}\) are None, since we were only asking for the first derivatives by providing the argument order=1.
Let’s now plot the fit function with Matplotlib.
[21]:
from matplotlib import pyplot as plt
from matplotlib import cm
n = 20 # number of sampling points per coordinate
X = np.linspace(-1, 1, n)
Y = np.linspace(-1, 1, n)
X, Y = np.meshgrid(X, Y)
Z = np.zeros([values.shape[-1], n, n])
for i in range(n):
for j in range(n):
f, *_ = approx([X[i, j], Y[i, j]])
Z[:, i, j] = f
fig = plt.figure()
ax1 = fig.add_subplot(221, projection="3d")
ax2 = fig.add_subplot(222, projection="3d")
ax1.plot_surface(X, Y, Z[0, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax2.plot_surface(X, Y, Z[1, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax1.set_xlabel("$x_1$")
ax1.set_ylabel("$x_2$")
ax2.set_xlabel("$x_1$")
ax2.set_ylabel("$x_2$")
plt.show()
Weighted Least-Squares (WLS) Approximation#
For now, we are going to use a prepared dataset, that you can find among the downloads. It consists of a NumPy array, where the first two columns are for the inputs (the points) and the remaining columns represent the outputs (the values).
[22]:
import numpy as np
from sigmaepsilon.math.downloads import download_mls_testdata
data: np.ndarray = download_mls_testdata()
points = data[:, :2]
values = data[:, 2:]
points.shape, values.shape
[22]:
((102, 2), (102, 4))
We can see that we have 102 data points in 2d and the same number of values in 4d.
For the weighted least-squares method, we also have to specify a suitable weight function.
[23]:
from sigmaepsilon.math.approx import weighted_least_squares, CubicWeightFunction
w = CubicWeightFunction(core=[0.0, 0.0], supportdomain=[0.5, 0.5])
approx = weighted_least_squares(points, values, deg=2, order=0, w=w)
Let’s plot the approximation functions for each 4 output dimensions.
[24]:
from matplotlib import pyplot as plt
from matplotlib import cm
n = 20 # number of sampling points per coordinate
X = np.linspace(0, 1, n)
Y = np.linspace(0, 1, n)
X, Y = np.meshgrid(X, Y)
Z = np.zeros([4, n, n])
for i in range(n):
for j in range(n):
f, *_ = approx([X[i, j], Y[i, j]])
Z[:, i, j] = f
fig = plt.figure()
ax1 = fig.add_subplot(221, projection="3d")
ax2 = fig.add_subplot(222, projection="3d")
ax3 = fig.add_subplot(223, projection="3d")
ax4 = fig.add_subplot(224, projection="3d")
ax1.plot_surface(X, Y, Z[0, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax2.plot_surface(X, Y, Z[1, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax3.plot_surface(X, Y, Z[2, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax4.plot_surface(X, Y, Z[3, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax1.set_xlabel("IN 1")
ax1.set_ylabel("IN 2")
ax1.set_zlabel("OUT 1")
ax2.set_xlabel("IN 1")
ax2.set_ylabel("IN 2")
ax2.set_zlabel("OUT 2")
ax3.set_xlabel("IN 1")
ax3.set_ylabel("IN 2")
ax3.set_zlabel("OUT 3")
ax4.set_xlabel("IN 1")
ax4.set_ylabel("IN 2")
ax4.set_zlabel("OUT 4")
plt.show()
Moving Least-Squares (MLS) Approximation#
Using the moving_least_squares() function is the same as with the weighted_least_squares() function. The only difference is in the behaviour, as the MLS adjusts the core of the weight function, hence the term ‘moving’.
[25]:
import numpy as np
from sigmaepsilon.math.downloads import download_mls_testdata
data: np.ndarray = download_mls_testdata()
points = data[:, :2]
values = data[:, 2:]
points.shape, values.shape
[25]:
((102, 2), (102, 4))
[26]:
from matplotlib import pyplot as plt
from matplotlib import cm
n = 20 # number of sampling points per coordinate
X = np.linspace(0, 1, n)
Y = np.linspace(0, 1, n)
X, Y = np.meshgrid(X, Y)
Z = np.zeros([4, n, n])
for i in range(n):
for j in range(n):
f, *_ = approx([X[i, j], Y[i, j]])
Z[:, i, j] = f
fig = plt.figure()
ax1 = fig.add_subplot(221, projection="3d")
ax2 = fig.add_subplot(222, projection="3d")
ax3 = fig.add_subplot(223, projection="3d")
ax4 = fig.add_subplot(224, projection="3d")
ax1.plot_surface(X, Y, Z[0, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax2.plot_surface(X, Y, Z[1, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax3.plot_surface(X, Y, Z[2, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax4.plot_surface(X, Y, Z[3, ::], cmap=cm.coolwarm, linewidth=0, antialiased=False)
ax1.set_xlabel("IN 1")
ax1.set_ylabel("IN 2")
ax1.set_zlabel("OUT 1")
ax2.set_xlabel("IN 1")
ax2.set_ylabel("IN 2")
ax2.set_zlabel("$OUT 2")
ax3.set_xlabel("IN 1")
ax3.set_ylabel("IN 2")
ax3.set_zlabel("$OUT 3")
ax4.set_xlabel("IN 1")
ax4.set_ylabel("IN 2")
ax4.set_zlabel("$OUT 4")
plt.show()
Of course, the difference between the two is hardly noticeable now, but if you create a difference plot, or play around with them, you fill find that the MLS is the more accurate one.
Moving Least Squares Approximation using the MLSApproximator class#
The MLSApproximator class is different from the previous implementation in that it is high-performance. This comes at the cost that the choice of the weight function is fixed. Another difference is that it uses a kNN tree to decide the points around the neighbourhood of the point of evaluation on which the regression is made. Also, the class only approximates the function, not the derivatives.
[27]:
import numpy as np
from sigmaepsilon.math.downloads import download_mls_testdata
data: np.ndarray = download_mls_testdata()
points = data[:, :2]
values = data[:, 2:]
points.shape, values.shape
[27]:
((102, 2), (102, 4))
[28]:
from sigmaepsilon.math.approx import MLSApproximator
approximator = MLSApproximator(points, values)
approximator.approximate(points)[:5, :]
[28]:
array([[6.88029282e+05, 6.88029282e+05, 2.92550220e+05, 1.97073011e+05],
[3.66243143e+04, 3.66243143e+04, 1.34600139e+03, 8.17082870e+01],
[6.88029282e+05, 6.88029282e+05, 2.92550220e+05, 1.97073011e+05],
[6.58356356e+05, 6.77569470e+05, 2.80088371e+05, 1.91808300e+05],
[6.09160300e+05, 6.64341189e+05, 2.59715765e+05, 1.83912612e+05]])
You can approximate n-dimensional data as well:
[29]:
values = np.random.rand(len(points), 5, 2, 6, 10)
approximator = MLSApproximator(points, values)
approximator.approximate(points).shape
[29]:
(102, 5, 2, 6, 10)
The performance of the approximator is perhaps better understood visually using 1d data. For the purpose of this example, we are going to use a sine function, with some random noise added on top of it.
[30]:
from sigmaepsilon.math import atleast2d
plt.ioff() # Turn off interactive plotting
number_of_data_points = 200
number_of_sampling_points = 16
coords = np.linspace(0, 1, number_of_data_points)
random_noise = np.random.normal(0, 0.2, coords.shape)
data = np.sin(4 * np.pi * coords) + random_noise
fig, axs = plt.subplots(1, 1, layout="constrained")
axs.plot(coords, data, "o", color="black")
axs.set_title("MLS approximation using kNN")
data.shape, coords.shape
[30]:
((200,), (200,))
Now we will approximate the underlying function with different settings and plot the approximation functions.
[31]:
coords_approx = np.linspace(0, 1, number_of_sampling_points)
def approximate(k: int):
approximator = MLSApproximator(coords, data, k=k)
data_approx = approximator.approximate(coords_approx)
return data_approx
def plot_on_axis(ax, coords_approx, data_approx, *args, **kwargs):
ax.plot(coords_approx, data_approx, *args, markersize=4, **kwargs)
ax.legend()
for k in [2, 4, 8, 16, 32, 64, 200]:
data_approx = approximate(k)
plot_on_axis(axs, coords_approx, data_approx, label=f"MLS kNN (k={k})")
plt.show()
It is visible that for this particular data, there is no big difference in using 4, 8, 16 or more neighbours, but the accuracy gets worse around the boundaries. Also, when all points are accounted for, the approximation reduces to a least-squares solution.