Mathematical Foundations for Machine Learning
Basis vectors in $\mathbb{R}^n$

It can be shown that any basis of $\mathbb{R}^n$ consists of exactly $n$ linearly independent vectors and that, conversely, any set of $n$ linearly independent vectors is a basis of $\mathbb{R}^n$.
| Term | Definition |
|---|---|
| Image / range | $\operatorname{im}(f) = \{ f(x) : x \in \mathbb{R}^n \}$, the set of all outputs of $f$ |
| Kernel / nullspace | $\ker(f) = \{ x \in \mathbb{R}^n : f(x) = 0 \}$, the set of all vectors mapped to the null vector |
All linear transformations share the property that they map the null vector to the null vector: $f(0) = 0$.

For many applications this is undesirable. Therefore linear transformations are often slightly extended to transformations of the form $f(x) = Ax + b$, which are called affine transformations. An affine transformation is defined by an $m \times n$ matrix $A$ and an $m$-dimensional vector $b$, which, in machine learning and statistics, is usually called the bias vector or intercept.
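As a minimal numpy sketch (the matrix $A$ and bias $b$ below are arbitrary values chosen for illustration), an affine transformation and the fact that it does not map the null vector to the null vector can look like:

```python
import numpy as np

# Hypothetical affine transformation f(x) = A x + b mapping R^3 -> R^2.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])   # 2x3 matrix
b = np.array([0.5, -0.5])          # bias vector / intercept

def affine(x):
    """Apply the affine transformation f(x) = A x + b."""
    return A @ x + b

# Unlike a linear transformation, f sends the null vector to b, not to 0:
print(affine(np.zeros(3)))   # [ 0.5 -0.5]
```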
| Term | Definition |
|---|---|
| Rank | Maximum number of linearly independent rows or columns of a matrix |
| Nullity | Number of vectors in a basis of the null space |
| Dimension | Number of vectors in a basis of a vector space |
| Span | Set of all finite linear combinations of the elements of a subset $S$ of a vector space $V$ |
| Column space | Span of the column vectors of a matrix |
| Row space | Span of the row vectors of a matrix |
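A small numpy check of the relationship between rank and nullity (the example matrix is made up; its second row is a multiple of the first):

```python
import numpy as np

# Rank-deficient example: row 2 is twice row 1, so the rank is 1.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank   # rank-nullity: rank + nullity = number of columns
print(rank, nullity)          # 1 2
```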
| Term | Rules |
|---|---|
| Associativity | $(AB)C = A(BC)$ |
| Distributivity | $A(B + C) = AB + AC$ and $(A + B)C = AC + BC$ |
| Commutativity | $AB \neq BA$ in general |
| Transposing | $(AB)^T = B^T A^T$, $(A^T)^T = A$ |
| Identity matrix | $AI = IA = A$ |
| Inverting | $(AB)^{-1} = B^{-1} A^{-1}$, $A A^{-1} = A^{-1} A = I$ |
| Determinant | $\det(AB) = \det(A)\det(B)$, $\det(A^T) = \det(A)$ |
| Scalars | $\lambda(AB) = (\lambda A)B = A(\lambda B)$ |
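The rules above can be verified numerically with numpy on random matrices (the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
I = np.eye(3)

assert np.allclose((A @ B) @ C, A @ (B @ C))        # associativity
assert np.allclose(A @ (B + C), A @ B + A @ C)      # distributivity
assert not np.allclose(A @ B, B @ A)                # no commutativity in general
assert np.allclose((A @ B).T, B.T @ A.T)            # transpose of a product
assert np.allclose(A @ I, A)                        # identity matrix
assert np.allclose(np.linalg.det(A @ B),
                   np.linalg.det(A) * np.linalg.det(B))  # determinant rule
```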
Let $M_{m \times n}$ be the set of all $m \times n$ matrices. How could a basis of $M_{m \times n}$ look?

Unit matrices: $E_{ij}$, the matrix whose entries are all zero except for a $1$ at row $i$, column $j$; the $m \cdot n$ unit matrices form a basis of $M_{m \times n}$.

Shorthand: $A = (a_{ij})$ denotes the matrix with entry $a_{ij}$ at row $i$, column $j$.

Transposing: $A^T = (a_{ji})$, i.e. rows and columns are swapped.

Vectors = tensors of rank 1, matrices = tensors of rank 2.

The standard basis for $\mathbb{R}^n$ consists of $n$ basis vectors $e_1, \dots, e_n$, where all components of $e_i$ are zero except at position $i$, where the value of the component is $1$.
Let $X$ and $Y$ be two sets. A mapping $f$ that associates to each element $x \in X$ exactly one element $f(x)$ of $Y$ is called a function.
| Term | Definition |
|---|---|
| Domain of $f$ | The set $X$ on which $f$ is defined |
| Range of $f$ | $f(X) = \{ f(x) : x \in X \} \subseteq Y$ |
| Real valued function | $f : X \to \mathbb{R}$ |
| Vector valued function | $f : X \to \mathbb{R}^m$ with $m > 1$ |
| A function of $n$ variables | $f : X \to Y$ with $X \subseteq \mathbb{R}^n$ |
An $h \times w$ pixel image with $3$ color channels (RGB) can be represented as an $h \times w \times 3$ matrix (tensor of rank 3). The red value of a pixel at row $i$, column $j$ can be indexed with $x_{ij1}$. A function to determine whether the image was taken at day ($1$) or night ($0$) would have the following signature: $f : \mathbb{R}^{h \times w \times 3} \to \{0, 1\}$.
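A hypothetical stand-in for such a classifier (the brightness threshold below is made up; a learned model would replace it) illustrates the signature:

```python
import numpy as np

def is_day(image):
    """Hypothetical classifier f: R^(h x w x 3) -> {0, 1}.
    As a stand-in for a trained model, threshold the mean brightness."""
    return 1 if image.mean() > 0.5 else 0

bright = np.full((4, 4, 3), 0.9)   # synthetic "day" image
dark = np.full((4, 4, 3), 0.1)     # synthetic "night" image
print(is_day(bright), is_day(dark))   # 1 0
```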
Residual sum of squares (RSS) is a statistical method that helps identify the level of discrepancy in a dataset not predicted by a regression model. Thus, it measures the variance in the value of the observed data when compared to its predicted value as per the regression model.
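In formulas, $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$; a minimal numpy implementation with made-up data:

```python
import numpy as np

def rss(y_observed, y_predicted):
    """Residual sum of squares: sum of squared differences between
    observed values and the values predicted by the regression model."""
    residuals = y_observed - y_predicted
    return np.sum(residuals ** 2)

y = np.array([1.0, 2.0, 3.0])       # observed values
y_hat = np.array([1.5, 2.0, 2.0])   # model predictions
print(rss(y, y_hat))   # 0.25 + 0.0 + 1.0 = 1.25
```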
If $f$ is a function from $X$ to $Y$, its graph is defined as $\operatorname{graph}(f) = \{ (x, f(x)) : x \in X \}$.

Because the visual imagination of humans is limited to at most 3 dimensions, the graph of a function is a useful concept for graphical illustration if and only if $\dim(\operatorname{graph}(f)) \leq 3$.
| Term | Definition |
|---|---|
| Curves | Vector valued functions of one variable, i.e. they map one-dimensional input to multi-dimensional output |
| $n$-dimensional curve | A continuous function $\gamma : I \to \mathbb{R}^n$ that maps an interval $I \subseteq \mathbb{R}$ into a subset of $\mathbb{R}^n$ |
| Surfaces | Real valued functions of multiple variables, i.e. they map multi-dimensional input to one-dimensional output |
| $n$-dimensional hyper-surface | A continuous real valued function $f$ that maps a sufficiently large subset $D \subseteq \mathbb{R}^n$ into a subset of $\mathbb{R}$. If $n = 2$, a hyper-surface is also simply called a surface. If $n = 1$, a hyper-surface is simply a continuous real valued function of one variable. |
The graph of both an $n$-dimensional curve and an $n$-dimensional hyper-surface is a subset of $\mathbb{R}^{n+1}$. A graphical illustration of such a graph therefore requires $n + 1 \leq 3$. However, unlike surfaces, a curve $\gamma$ is usually not illustrated through its graph $\{ (t, \gamma(t)) : t \in I \}$. Instead curves are typically visualized by skipping the independent variable from the graph, i.e. by visualizing the image of the curve $\{ \gamma(t) : t \in I \}$, which means that we are projecting the graph of the curve onto the plane that is spanned by the dependent variables. As this kind of plot saves one dimension, we can also visualize curves for which the range is a subset of $\mathbb{R}^3$.
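A small numpy sketch of this idea, using the unit circle $\gamma(t) = (\cos t, \sin t)$ as an example curve: what one plots is the set of points $\gamma(t)$, not the graph (which would live in $\mathbb{R}^3$).

```python
import numpy as np

# The circle gamma(t) = (cos t, sin t) as a curve [0, 2*pi) -> R^2.
t = np.linspace(0.0, 2.0 * np.pi, 100)
image = np.column_stack((np.cos(t), np.sin(t)))   # points of the curve's image

# Every point of the image lies on the unit circle:
radii = np.linalg.norm(image, axis=1)
print(radii.min(), radii.max())   # both 1.0 up to floating point error
```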
Unlike in image processing, curves are of little importance in machine learning. On the contrary, hyper-surfaces are among the most important functions in machine learning. This is because, to a large extent, machine learning can be thought of as a branch of statistics. In statistics the most important class of functions are probability distributions, which are functions mapping subsets of $\mathbb{R}^n$ into $[0, 1]$.
One approach to decide whether two images are similar is to first extract features (such as corner points, edges, etc.) from each of the images and store these features in a list of numbers, which is called a feature vector. It is reasonable to assume that similar images are mapped to similar feature vectors.

Two vectors $x$ and $y$ are similar if both vectors head into the same direction and are of similar length, i.e. if the difference between both vectors has a small length. In vector geometry, the length of a vector is called a norm, and the distance between two vectors is called a metric. A norm is thus a function that associates with a vector $x$ a non-negative number $\|x\|$, which is interpreted as the length of $x$.
$n$-dimensional Euclidean norm or $2$-norm: $\|x\|_2 = \sqrt{x_1^2 + \dots + x_n^2}$, also written as $\|x\|$.

Once a vector space is equipped with a norm $\|\cdot\|$, we can also associate a metric with that vector space, which uses the notion of a length to measure the distance between pairs of vectors:

$d(x, y) = \|x - y\|$, the Euclidean metric. The value $d(x, y)$ is identical to the length of the vector that heads from $y$ to $x$.
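Norm and induced metric side by side, as a minimal numpy sketch (the vectors are arbitrary examples):

```python
import numpy as np

def euclidean_norm(x):
    """2-norm: the Euclidean length of x."""
    return np.sqrt(np.sum(x ** 2))

def euclidean_metric(x, y):
    """Metric induced by the norm: length of the vector from y to x."""
    return euclidean_norm(x - y)

x = np.array([3.0, 4.0])
y = np.array([0.0, 0.0])
print(euclidean_norm(x))       # 5.0
print(euclidean_metric(x, y))  # 5.0, the distance from the origin to x
```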
In machine learning many functions are vector valued functions of many variables, like linear transformations. Neural networks are typically defined to be a stack of unknown linear transformations, each of which is followed by a rather simple, predefined non-linear operation. Training a neural network is therefore almost equivalent to finding suitable linear transformations that make the network perform the desired operation.
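The "stack of affine transformations plus simple non-linearities" structure can be sketched in numpy; the weights below are arbitrary placeholders that training would adjust:

```python
import numpy as np

def relu(z):
    """Simple predefined non-linear operation, applied elementwise."""
    return np.maximum(z, 0.0)

# A two-layer network R^2 -> R: affine -> non-linearity -> affine.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.array([0.0, 0.1])
W2 = np.array([[1.0, 2.0]])
b2 = np.array([-0.5])

def network(x):
    hidden = relu(W1 @ x + b1)
    return W2 @ hidden + b2

print(network(np.array([1.0, 2.0])))   # [2.7]
```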
Besides curves and surfaces there are functions that take multi-dimensional input and deliver multi-dimensional output. In order to get an idea of what these functions look like, we often display low-dimensional "sections" of the function, which can be displayed either as curves or as surfaces. There is, however, another way these functions can be displayed, namely as a vector field.
The decomposition of curves into coordinate functions $\gamma(t) = (\gamma_1(t), \dots, \gamma_n(t))$ allows us to generalize most of the concepts from previous analysis courses to curves.

We consider the curve $\gamma(t) = p + tv$ with $p, v \in \mathbb{R}^n$. The component functions are $\gamma_i(t) = p_i + t v_i$. We can thus argue: for any $p, v \in \mathbb{R}^n$, a curve of the form $\gamma(t) = p + tv$ corresponds to a line going through the point $p$ in the direction of $v$.
Let $f$ be an arbitrary real valued function and $x$ an arbitrary vector in the domain of $f$. We consider the domain of $f$ to be sufficiently large for doing calculus at $x$ if for every vector $v$ there is a small interval $(-\epsilon, \epsilon)$ such that for all $t \in (-\epsilon, \epsilon)$ the vector $x + tv$ lies in the domain of $f$. This condition guarantees that any curve that hits the domain of $f$ remains there for at least a short period of time, and it is one of the properties that needs to be satisfied in order to denote $f$ as a hyper-surface. The other property required for $f$ to be a hyper-surface is that $f$ is continuous at every point of its domain.
We can use the notion of dependent variables as an alternative representation of functions; this notion allows us to define functions without explicitly stating their arguments.

Let us assume that $g : X \to Y$ and $f : Y \to Z$ are two functions. The diagram

$$X \xrightarrow{\;g\;} Y \xrightarrow{\;f\;} Z$$

illustrates that $g$ is a function from $X$ to $Y$, $f$ is a function from $Y$ to $Z$, and the range of $g$ is a subset of the domain of $f$. Alternatively, for every set $W$ such that $g(X) \subseteq W \subseteq Y$, we could also have written

$$X \xrightarrow{\;g\;} W \xrightarrow{\;f\;} Z$$

indicating that $g$ maps $X$ into a subset $W$, which in turn is a subset of the domain of $f$ and thus can be mapped to $Z$ using $f$.

A commutative diagram is an extension of the previous diagram in which we add an additional arrow pointing from $X$ to $Z$. By labeling this additional arrow with $f \circ g$, a commutative diagram can be used to define a function from $X$ to $Z$ that maps every $x \in X$ to exactly the same element as the composition of $f$ with $g$.
In the theory of real valued functions of one variable, the linearisation of $f$ at $a$ is done via

$$f(x) \approx f(a) + f'(a)(x - a)$$

and can be interpreted as:

- First order Taylor polynomial of $f$ at $a$, which implies that the linearisation of $f$ at $a$ is that first order polynomial which best approximates $f$ in the vicinity of $a$.
- Identifying that particular line that fits best to the graph of $f$ in a neighborhood of $a$. This line is known as the tangent of $f$ at $a$.
- Definition of the derivative of $f$ at $a$: rearranging to $\frac{f(x) - f(a)}{x - a} \approx f'(a)$ tells us that both sides of this equation become equal for $x \to a$, which is therefore equivalent to the definition of the derivative of $f$ at $a$, namely $f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a}$.
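A numerical sanity check of the linearisation, using $f(x) = x^2$ with $f'(a) = 2a$ as an example: the approximation error shrinks quadratically as $x$ approaches $a$.

```python
def f(x):
    return x ** 2

def linearisation(x, a):
    """First order Taylor polynomial of f at a: f(a) + f'(a) (x - a).
    For f(x) = x^2 the derivative is f'(a) = 2a."""
    return f(a) + 2 * a * (x - a)

a = 1.0
for x in [1.1, 1.01, 1.001]:
    # the error is exactly (x - a)^2 for this f
    print(x, abs(f(x) - linearisation(x, a)))
```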
In the theory of vector valued functions of many variables, the linearisation of a vector valued function $f : \mathbb{R}^n \to \mathbb{R}^m$ at $a$ itself has to be a mapping from $\mathbb{R}^n$ to $\mathbb{R}^m$. Further, the right hand side has to be an affine transformation, i.e. a function of the form

$$f(x) \approx f(a) + A(x - a)$$

in which $A$ is a matrix and $f(a)$ plays the role of the bias vector $b$. Since we expect the linearisation to map $\mathbb{R}^n$ to $\mathbb{R}^m$, we can infer that $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.

The derivative of a vector valued function can thus be expected to be an $m \times n$ matrix. This matrix is called the Jacobian matrix $J_f(a)$, and it can be shown that it is comprised of all partial derivatives of all component functions of $f$: $(J_f(a))_{ij} = \frac{\partial f_i}{\partial x_j}(a)$.
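The Jacobian can be approximated numerically column by column with finite differences; a minimal sketch (forward differences, step size chosen by hand):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian matrix of f at x by forward differences.
    Column j holds the partial derivatives with respect to x_j."""
    fx = f(x)
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - fx) / eps
    return J

# Example f: R^2 -> R^2, f(x, y) = (x*y, x + y); exact Jacobian [[y, x], [1, 1]].
f = lambda v: np.array([v[0] * v[1], v[0] + v[1]])
print(jacobian(f, np.array([2.0, 3.0])))   # approx [[3, 2], [1, 1]]
```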
| Function | Derivative |
|---|---|
| $f(x) = Ax$ | $J_f(x) = A$ |
| $f(x) = a^T x$ | $\nabla f(x) = a$ |
| $f(x) = x^T x = \|x\|^2$ | $\nabla f(x) = 2x$ |
| $f(x) = x^T A x$ | $\nabla f(x) = (A + A^T)x$ |
| $f(x) = \|x\|$ | $\nabla f(x) = \frac{x}{\|x\|}$ |
| Term | Definition |
|---|---|
| Training sample | Knowledge of in- and outputs prior to learning |
| Supervised learning | A process for finding suitable values of parameters that make the system respond as desired for as many training samples as possible |
| Cost function | A function of the parameters that maps them to a high value if the misclassification is high and to a low value otherwise |
| Train | Feed the system with training samples and find the best parameters in the process by searching for the (global) minimum of the cost function |
To a large extent, the world we live in can be envisioned as a two-dimensional surface that is embedded in three-dimensional space. Using this analogy you can take away two important methods for visualizing surfaces: surface plots (3D) and contour plots (2D).
Let $f : D \to \mathbb{R}$ with $D \subseteq \mathbb{R}^n$ be a hyper-surface.
| Term | Definition |
|---|---|
| Upper bound | Any number $u$ such that $f(x) \leq u$ for all $x \in D$ is called an upper bound of $f$ |
| Lower bound | Any number $l$ such that $f(x) \geq l$ for all $x \in D$ is called a lower bound of $f$ |
| Global maximum | An upper bound $u$ is called global maximum value or absolute maximum value of $f$ if $u = f(x^*)$ for some $x^* \in D$. The value $x^*$ is called the global maximum point or absolute maximum point |
| Global minimum | A lower bound $l$ is called global minimum value or absolute minimum value of $f$ if $l = f(x^*)$ for some $x^* \in D$. The value $x^*$ is called the global minimum point or absolute minimum point |
| Local maximum | Any point $x^*$ such that $f(x) \leq f(x^*)$ whenever $\|x - x^*\| < \epsilon$ for some $\epsilon > 0$ is called a local maximum point, and the value $f(x^*)$ is called a local maximum value |
| Local minimum | Any point $x^*$ such that $f(x) \geq f(x^*)$ whenever $\|x - x^*\| < \epsilon$ for some $\epsilon > 0$ is called a local minimum point, and the value $f(x^*)$ is called a local minimum value |
Stationary points can also identify positions at which a function becomes flat in one direction without losing its monotonicity.
Finally, there is a third scenario which can occur: saddle points. Saddle points require functions that depend on at least two variables, because the domain of such functions is large enough to move into at least two different directions without leaving the domain. In such a case it can happen that a stationary point appears as a maximum in one direction and as a minimum in another direction.
Notation for the second derivative: the Hessian matrix $H_f(x) = \left( \frac{\partial^2 f}{\partial x_i \partial x_j}(x) \right)_{ij}$.

Let $x^*$ be a stationary point of $f$, $\gamma$ a curve passing our stationary point at $\gamma(0) = x^*$, and $g(t) = f(\gamma(t))$; then $g''(0) = \gamma'(0)^T H_f(x^*) \gamma'(0)$.

- If you can show that $v^T H_f(x^*) v > 0$ for all $v \neq 0$, then $x^*$ defines a local minimum
- If you can show that $v^T H_f(x^*) v < 0$ for all $v \neq 0$, then $x^*$ defines a local maximum
- If both of those cases occur, i.e. $v^T H_f(x^*) v > 0$ for some $v$ and $< 0$ for others, the point defines a saddle point
- In all other cases it defines nothing and further investigation is needed
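The case distinction above amounts to checking the signs of the Hessian's eigenvalues; a minimal numpy sketch, using $f(x, y) = x^2 - y^2$ (Hessian $\operatorname{diag}(2, -2)$, saddle point at the origin) as an example:

```python
import numpy as np

def classify_stationary_point(hessian):
    """Classify a stationary point by the eigenvalues of the Hessian:
    all positive -> local minimum, all negative -> local maximum,
    mixed signs -> saddle point, otherwise inconclusive."""
    eig = np.linalg.eigvalsh(hessian)   # symmetric matrix, real eigenvalues
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"
    return "inconclusive"

# f(x, y) = x^2 - y^2 has a saddle point at the origin:
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(classify_stationary_point(H))   # saddle point
```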