Mathematical Foundations for Machine Learning
Basis vectors in $\mathbb{R}^n$

It can be shown that any basis of $\mathbb{R}^n$ consists of exactly $n$ linearly independent vectors and that, conversely, any set of $n$ linearly independent vectors is a basis of $\mathbb{R}^n$.
| Term | Definition |
|---|---|
| Image / range | $\operatorname{im}(f) = \{ f(x) : x \in \mathbb{R}^n \}$, the set of all outputs of $f$ |
| Kernel / nullspace | $\ker(f) = \{ x \in \mathbb{R}^n : f(x) = 0 \}$, the set of all vectors mapped to the null vector |
All linear transformations share the property that they map the null vector to the null vector: $f(0) = 0$.

For many applications this is undesirable. Therefore linear transformations are often slightly extended to transformations of the form $f(x) = Ax + b$, which are called affine transformations. An affine transformation is defined by an $m \times n$ matrix $A$ and an $m$-dimensional vector $b$, which, in machine learning and statistics, is usually called the bias vector or intercept.
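As a minimal numpy sketch (the matrix $A$ and bias $b$ below are arbitrary values chosen for illustration), an affine transformation and the fact that it does not map the null vector to the null vector can look like:

```python
import numpy as np

# Hypothetical affine transformation f(x) = A x + b mapping R^3 -> R^2.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])   # 2x3 matrix
b = np.array([0.5, -0.5])          # bias vector / intercept

def affine(x):
    """Apply the affine transformation f(x) = A x + b."""
    return A @ x + b

# Unlike a linear transformation, f sends the null vector to b, not to 0:
print(affine(np.zeros(3)))   # [ 0.5 -0.5]
```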
| Term | Definition |
|---|---|
| Rank | Maximum number of linearly independent rows or columns of a matrix |
| Nullity | Number of vectors in a basis of the null space |
| Dimension | Number of vectors in a basis of a vector space |
| Span | Set of all finite linear combinations of the elements of a subset $S$ of a vector space $V$ |
| Column space | Span of the column vectors of a matrix |
| Row space | Span of the row vectors of a matrix |
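A small numpy check of the relationship between rank and nullity (the example matrix is made up; its second row is a multiple of the first):

```python
import numpy as np

# Rank-deficient example: row 2 is twice row 1, so the rank is 1.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank   # rank-nullity: rank + nullity = number of columns
print(rank, nullity)          # 1 2
```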
| Term | Rules |
|---|---|
| Associativity | $(AB)C = A(BC)$ |
| Distributivity | $A(B + C) = AB + AC$ and $(A + B)C = AC + BC$ |
| Commutativity | $AB \neq BA$ in general |
| Transposing | $(AB)^T = B^T A^T$, $(A^T)^T = A$ |
| Identity matrix | $AI = IA = A$ |
| Inverting | $(AB)^{-1} = B^{-1} A^{-1}$, $A A^{-1} = A^{-1} A = I$ |
| Determinant | $\det(AB) = \det(A)\det(B)$, $\det(A^T) = \det(A)$ |
| Scalars | $\lambda(AB) = (\lambda A)B = A(\lambda B)$ |
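The rules above can be verified numerically with numpy on random matrices (the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
I = np.eye(3)

assert np.allclose((A @ B) @ C, A @ (B @ C))        # associativity
assert np.allclose(A @ (B + C), A @ B + A @ C)      # distributivity
assert not np.allclose(A @ B, B @ A)                # no commutativity in general
assert np.allclose((A @ B).T, B.T @ A.T)            # transpose of a product
assert np.allclose(A @ I, A)                        # identity matrix
assert np.allclose(np.linalg.det(A @ B),
                   np.linalg.det(A) * np.linalg.det(B))  # determinant rule
```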
Let $M_{m \times n}$ be the set of all $m \times n$ matrices. How could a basis of $M_{m \times n}$ look?

Unit matrices: $E_{ij}$, the matrix whose entries are all zero except for a $1$ at row $i$, column $j$; the $m \cdot n$ unit matrices form a basis of $M_{m \times n}$.

Shorthand: $A = (a_{ij})$ denotes the matrix with entry $a_{ij}$ at row $i$, column $j$.

Transposing: $A^T = (a_{ji})$, i.e. rows and columns are swapped.

Vectors = tensors of rank 1, matrices = tensors of rank 2.

The standard basis for $\mathbb{R}^n$ consists of $n$ basis vectors $e_1, \dots, e_n$, where all components of $e_i$ are zero except at position $i$, where the value of the component is $1$.
Let $X$ and $Y$ be two sets. A mapping $f$ that associates to each element $x \in X$ exactly one element $f(x)$ of $Y$ is called a function.
| Term | Definition |
|---|---|
| Domain of $f$ | The set $X$ on which $f$ is defined |
| Range of $f$ | $f(X) = \{ f(x) : x \in X \} \subseteq Y$ |
| Real valued function | $f : X \to \mathbb{R}$ |
| Vector valued function | $f : X \to \mathbb{R}^m$ with $m > 1$ |
| A function of $n$ variables | $f : X \to Y$ with $X \subseteq \mathbb{R}^n$ |
An $h \times w$ pixel image with $3$ color channels (RGB) can be represented as an $h \times w \times 3$ matrix (tensor of rank 3). The red value of a pixel at row $i$, column $j$ can be indexed with $x_{ij1}$. A function to determine whether the image was taken at day ($1$) or night ($0$) would have the following signature: $f : \mathbb{R}^{h \times w \times 3} \to \{0, 1\}$.
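A hypothetical stand-in for such a classifier (the brightness threshold below is made up; a learned model would replace it) illustrates the signature:

```python
import numpy as np

def is_day(image):
    """Hypothetical classifier f: R^(h x w x 3) -> {0, 1}.
    As a stand-in for a trained model, threshold the mean brightness."""
    return 1 if image.mean() > 0.5 else 0

bright = np.full((4, 4, 3), 0.9)   # synthetic "day" image
dark = np.full((4, 4, 3), 0.1)     # synthetic "night" image
print(is_day(bright), is_day(dark))   # 1 0
```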
Residual sum of squares (RSS) is a statistical method that helps identify the level of discrepancy in a dataset not predicted by a regression model. Thus, it measures the variance in the value of the observed data when compared to its predicted value as per the regression model.
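In formulas, $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$; a minimal numpy implementation with made-up data:

```python
import numpy as np

def rss(y_observed, y_predicted):
    """Residual sum of squares: sum of squared differences between
    observed values and the values predicted by the regression model."""
    residuals = y_observed - y_predicted
    return np.sum(residuals ** 2)

y = np.array([1.0, 2.0, 3.0])       # observed values
y_hat = np.array([1.5, 2.0, 2.0])   # model predictions
print(rss(y, y_hat))   # 0.25 + 0.0 + 1.0 = 1.25
```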
If $f$ is a function from $X$ to $Y$, its graph is defined as $\operatorname{graph}(f) = \{ (x, f(x)) : x \in X \}$.

Because the visual imagination of humans is limited to at most 3 dimensions, the graph of a function is a useful concept for graphical illustration if and only if $\dim(\operatorname{graph}(f)) \leq 3$.
| Term | Definition |
|---|---|
| Curves | Vector valued functions of one variable, i.e. they map one-dimensional input to multi-dimensional output |
| $n$-dimensional curve | A continuous function $\gamma : I \to \mathbb{R}^n$ that maps an interval $I \subseteq \mathbb{R}$ into a subset of $\mathbb{R}^n$ |
| Surfaces | Real valued functions of multiple variables, i.e. they map multi-dimensional input to one-dimensional output |
| $n$-dimensional hyper-surface | A continuous real valued function $f$ that maps a sufficiently large subset $D \subseteq \mathbb{R}^n$ into a subset of $\mathbb{R}$. If $n = 2$, a hyper-surface is also simply called a surface. If $n = 1$, a hyper-surface is simply a continuous real valued function of one variable. |
The graph of both an $n$-dimensional curve and an $n$-dimensional hyper-surface is a subset of $\mathbb{R}^{n+1}$. A graphical illustration of such a graph therefore requires $n + 1 \leq 3$. However, unlike surfaces, a curve $\gamma$ is usually not illustrated through its graph $\{ (t, \gamma(t)) : t \in I \}$. Instead curves are typically visualized by skipping the independent variable from the graph, i.e. by visualizing the image of the curve $\{ \gamma(t) : t \in I \}$, which means that we are projecting the graph of the curve onto the plane that is spanned by the dependent variables. As this kind of plot saves one dimension, we can also visualize curves for which the range is a subset of $\mathbb{R}^3$.
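A small numpy sketch of this idea, using the unit circle $\gamma(t) = (\cos t, \sin t)$ as an example curve: what one plots is the set of points $\gamma(t)$, not the graph (which would live in $\mathbb{R}^3$).

```python
import numpy as np

# The circle gamma(t) = (cos t, sin t) as a curve [0, 2*pi) -> R^2.
t = np.linspace(0.0, 2.0 * np.pi, 100)
image = np.column_stack((np.cos(t), np.sin(t)))   # points of the curve's image

# Every point of the image lies on the unit circle:
radii = np.linalg.norm(image, axis=1)
print(radii.min(), radii.max())   # both 1.0 up to floating point error
```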
Unlike in image processing, curves are of little importance in machine learning. On the contrary, hyper-surfaces are among the most important functions in machine learning. This is because, to a large extent, machine learning can be thought of as a branch of statistics. In statistics the most important class of functions are probability distributions, which are functions mapping subsets of $\mathbb{R}^n$ into $[0, 1]$.
One approach to decide whether two images are similar is to first extract features (such as corner points, edges, etc.) from each of the images and store these features in a list of numbers, which is called a feature vector. It is reasonable to assume that similar images are mapped to similar feature vectors.

Two vectors $x$ and $y$ are similar if both vectors head into the same direction and are of similar length, i.e. if the difference between both vectors has a small length. In vector geometry, the length of a vector is called a norm, and the distance between two vectors is called a metric. A norm is thus a function that associates with a vector $x$ a non-negative number $\|x\|$, which is interpreted as the length of $x$.
$n$-dimensional Euclidean norm or $2$-norm: $\|x\|_2 = \sqrt{x_1^2 + \dots + x_n^2}$, also written as $\|x\|$.

Once a vector space is equipped with a norm $\|\cdot\|$, we can also associate a metric with that vector space, which uses the notion of a length to measure the distance between pairs of vectors:

$d(x, y) = \|x - y\|$, the Euclidean metric. The value $d(x, y)$ is identical to the length of the vector that heads from $y$ to $x$.
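Norm and induced metric side by side, as a minimal numpy sketch (the vectors are arbitrary examples):

```python
import numpy as np

def euclidean_norm(x):
    """2-norm: the Euclidean length of x."""
    return np.sqrt(np.sum(x ** 2))

def euclidean_metric(x, y):
    """Metric induced by the norm: length of the vector from y to x."""
    return euclidean_norm(x - y)

x = np.array([3.0, 4.0])
y = np.array([0.0, 0.0])
print(euclidean_norm(x))       # 5.0
print(euclidean_metric(x, y))  # 5.0, the distance from the origin to x
```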
In machine learning many functions are vector valued functions of many variables, like linear transformations. Neural networks are typically defined to be a stack of unknown linear transformations, each of which is followed by a rather simple, predefined non-linear operation. Training a neural network is therefore almost equivalent to finding suitable linear transformations that make the network perform the desired operation.
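The "stack of affine transformations plus simple non-linearities" structure can be sketched in numpy; the weights below are arbitrary placeholders that training would adjust:

```python
import numpy as np

def relu(z):
    """Simple predefined non-linear operation, applied elementwise."""
    return np.maximum(z, 0.0)

# A two-layer network R^2 -> R: affine -> non-linearity -> affine.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.array([0.0, 0.1])
W2 = np.array([[1.0, 2.0]])
b2 = np.array([-0.5])

def network(x):
    hidden = relu(W1 @ x + b1)
    return W2 @ hidden + b2

print(network(np.array([1.0, 2.0])))   # [2.7]
```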
Besides curves and surfaces there are functions that take multi-dimensional input and deliver multi-dimensional output. In order to get an idea of what these functions look like, we often display low-dimensional "sections" of the function, which can be displayed either as curves or as surfaces. There is, however, another way these functions can be displayed, namely as a vector field.
The decomposition of curves into coordinate functions $\gamma(t) = (\gamma_1(t), \dots, \gamma_n(t))$ allows us to generalize most of the concepts from previous analysis courses to curves.

We consider the curve $\gamma(t) = p + tv$ with $p, v \in \mathbb{R}^n$. The component functions are $\gamma_i(t) = p_i + t v_i$. We can thus argue: for any $p, v \in \mathbb{R}^n$, a curve of the form $\gamma(t) = p + tv$ corresponds to a line going through the point $p$ in the direction of $v$.
Let $f$ be an arbitrary real valued function and $x$ an arbitrary vector in the domain of $f$. We consider the domain of $f$ to be sufficiently large for doing calculus at $x$ if for every vector $v$ there is a small interval $(-\epsilon, \epsilon)$ such that for all $t \in (-\epsilon, \epsilon)$ the vector $x + tv$ lies in the domain of $f$. This condition guarantees that any curve that hits the domain of $f$ remains there for at least a short period of time, and it is one of the properties that needs to be satisfied in order to denote $f$ as a hyper-surface. The other property required for $f$ to be a hyper-surface is that $f$ is continuous at every point of its domain.
We can use the notion of dependent variables as an alternative representation of functions; this notion allows us to define functions without explicitly stating their arguments.

Let us assume that $g : X \to Y$ and $f : Y \to Z$ are two functions. The diagram

$$X \xrightarrow{\;g\;} Y \xrightarrow{\;f\;} Z$$

illustrates that $g$ is a function from $X$ to $Y$, $f$ is a function from $Y$ to $Z$, and the range of $g$ is a subset of the domain of $f$. Alternatively, for every set $W$ such that $g(X) \subseteq W \subseteq Y$, we could also have written

$$X \xrightarrow{\;g\;} W \xrightarrow{\;f\;} Z$$

indicating that $g$ maps $X$ into a subset $W$, which in turn is a subset of the domain of $f$ and thus can be mapped to $Z$ using $f$.

A commutative diagram is an extension of the previous diagram in which we add an additional arrow pointing from $X$ to $Z$. By labeling this additional arrow with $f \circ g$, a commutative diagram can be used to define a function from $X$ to $Z$ that maps every $x \in X$ to exactly the same element as the composition of $f$ with $g$.
In the theory of real valued functions of one variable, the linearisation of $f$ at $a$ is done via

$$f(x) \approx f(a) + f'(a)(x - a)$$

and can be interpreted as:

- First order Taylor polynomial of $f$ at $a$, which implies that the linearisation of $f$ at $a$ is that first order polynomial which best approximates $f$ in the vicinity of $a$.
- Identifying that particular line that fits best to the graph of $f$ in a neighborhood of $a$. This line is known as the tangent of $f$ at $a$.
- Definition of the derivative of $f$ at $a$: rearranging to $\frac{f(x) - f(a)}{x - a} \approx f'(a)$ tells us that both sides of this equation become equal for $x \to a$, which is therefore equivalent to the definition of the derivative of $f$ at $a$, namely $f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a}$.
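A numerical sanity check of the linearisation, using $f(x) = x^2$ with $f'(a) = 2a$ as an example: the approximation error shrinks quadratically as $x$ approaches $a$.

```python
def f(x):
    return x ** 2

def linearisation(x, a):
    """First order Taylor polynomial of f at a: f(a) + f'(a) (x - a).
    For f(x) = x^2 the derivative is f'(a) = 2a."""
    return f(a) + 2 * a * (x - a)

a = 1.0
for x in [1.1, 1.01, 1.001]:
    # the error is exactly (x - a)^2 for this f
    print(x, abs(f(x) - linearisation(x, a)))
```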
In the theory of vector valued functions of many variables, the linearisation of a vector valued function $f : \mathbb{R}^n \to \mathbb{R}^m$ at $a$ itself has to be a mapping from $\mathbb{R}^n$ to $\mathbb{R}^m$. Further, the right hand side has to be an affine transformation, i.e. a function of the form

$$f(x) \approx f(a) + A(x - a)$$

in which $A$ is a matrix and $f(a)$ plays the role of the bias vector $b$. Since we expect the linearisation to map $\mathbb{R}^n$ to $\mathbb{R}^m$, we can infer that $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.

The derivative of a vector valued function can thus be expected to be an $m \times n$ matrix. This matrix is called the Jacobian matrix $J_f(a)$, and it can be shown that it is comprised of all partial derivatives of all component functions of $f$: $(J_f(a))_{ij} = \frac{\partial f_i}{\partial x_j}(a)$.
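The Jacobian can be approximated numerically column by column with finite differences; a minimal sketch (forward differences, step size chosen by hand):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian matrix of f at x by forward differences.
    Column j holds the partial derivatives with respect to x_j."""
    fx = f(x)
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - fx) / eps
    return J

# Example f: R^2 -> R^2, f(x, y) = (x*y, x + y); exact Jacobian [[y, x], [1, 1]].
f = lambda v: np.array([v[0] * v[1], v[0] + v[1]])
print(jacobian(f, np.array([2.0, 3.0])))   # approx [[3, 2], [1, 1]]
```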
| Function | Derivative |
|---|---|
| $f(x) = Ax$ | $J_f(x) = A$ |
| $f(x) = a^T x$ | $\nabla f(x) = a$ |
| $f(x) = x^T x = \|x\|^2$ | $\nabla f(x) = 2x$ |
| $f(x) = x^T A x$ | $\nabla f(x) = (A + A^T)x$ |
| $f(x) = \|x\|$ | $\nabla f(x) = \frac{x}{\|x\|}$ |
| Term | Definition |
|---|---|
| Training sample | Knowledge of in- and outputs prior to learning |
| Supervised learning | A process for finding suitable values of parameters that make the system respond as desired for as many training samples as possible |
| Cost function | A function of the parameters that maps them to a high value if the misclassification is high and to a low value otherwise |
| Train | Feed the system with training samples and find the best parameters in the process by searching for the (global) minimum of the cost function |
To a large extent, the world we live in can be envisioned as a two-dimensional surface that is embedded in three-dimensional space. Using this analogy you can take away two important methods for visualizing surfaces: surface plots (3D) and contour plots (2D).
Let $f : D \to \mathbb{R}$ with $D \subseteq \mathbb{R}^n$ be a hyper-surface.
| Term | Definition |
|---|---|
| Upper bound | Any number $u$ such that $f(x) \leq u$ for all $x \in D$ is called an upper bound of $f$ |
| Lower bound | Any number $l$ such that $f(x) \geq l$ for all $x \in D$ is called a lower bound of $f$ |
| Global maximum | An upper bound $u$ is called global maximum value or absolute maximum value of $f$ if $u = f(x^*)$ for some $x^* \in D$. The value $x^*$ is called the global maximum point or absolute maximum point |
| Global minimum | A lower bound $l$ is called global minimum value or absolute minimum value of $f$ if $l = f(x^*)$ for some $x^* \in D$. The value $x^*$ is called the global minimum point or absolute minimum point |
| Local maximum | Any point $x^*$ such that $f(x) \leq f(x^*)$ whenever $\|x - x^*\| < \epsilon$ for some $\epsilon > 0$ is called a local maximum point, and the value $f(x^*)$ is called a local maximum value |
| Local minimum | Any point $x^*$ such that $f(x) \geq f(x^*)$ whenever $\|x - x^*\| < \epsilon$ for some $\epsilon > 0$ is called a local minimum point, and the value $f(x^*)$ is called a local minimum value |
Stationary points can also identify positions at which a function becomes flat in one direction without losing its monotonicity.
Finally, there is a third scenario which can occur: saddle points. Saddle points require functions that depend on at least two variables, because the domain of such functions is large enough to move into at least two different directions without leaving the domain. In such a case it can happen that a stationary point appears as a maximum in one direction and as a minimum in another direction.
Notation for the second derivative: the Hessian matrix $H_f(x) = \left( \frac{\partial^2 f}{\partial x_i \partial x_j}(x) \right)_{ij}$.

Let $x^*$ be a stationary point of $f$, $\gamma$ a curve passing our stationary point at $\gamma(0) = x^*$, and $g(t) = f(\gamma(t))$; then $g''(0) = \gamma'(0)^T H_f(x^*) \gamma'(0)$.

- If you can show that $v^T H_f(x^*) v > 0$ for all $v \neq 0$, then $x^*$ defines a local minimum
- If you can show that $v^T H_f(x^*) v < 0$ for all $v \neq 0$, then $x^*$ defines a local maximum
- If both of those cases occur, i.e. $v^T H_f(x^*) v > 0$ for some $v$ and $< 0$ for others, the point defines a saddle point
- In all other cases it defines nothing and further investigation is needed
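The case distinction above amounts to checking the signs of the Hessian's eigenvalues; a minimal numpy sketch, using $f(x, y) = x^2 - y^2$ (Hessian $\operatorname{diag}(2, -2)$, saddle point at the origin) as an example:

```python
import numpy as np

def classify_stationary_point(hessian):
    """Classify a stationary point by the eigenvalues of the Hessian:
    all positive -> local minimum, all negative -> local maximum,
    mixed signs -> saddle point, otherwise inconclusive."""
    eig = np.linalg.eigvalsh(hessian)   # symmetric matrix, real eigenvalues
    if np.all(eig > 0):
        return "local minimum"
    if np.all(eig < 0):
        return "local maximum"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle point"
    return "inconclusive"

# f(x, y) = x^2 - y^2 has a saddle point at the origin:
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(classify_stationary_point(H))   # saddle point
```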