Advanced Machine Learning
Lectures
1 - Introduction
Measurement space, features, typical learning problems, key concepts, what you should
know
Supervised vs unsupervised learning, generative vs discriminative modeling
2 - Representations
Expected risk (R): conditional and total expected risk
Empirical risk (R^): training error, empirical risk minimizer, test error
Distinguish between test error and expected risk
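To make the distinction concrete, here is a minimal sketch in Python (NumPy assumed). The toy data-generating process, the squared loss and the one-parameter linear model are illustrative assumptions, not taken from the lecture. Training error is the empirical risk on the sample used for fitting; test error is the empirical risk on a finite held-out sample and only estimates the expected risk, which is approximated here with a very large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n pairs from an assumed toy distribution: y = 2x + Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=n)
    return x, y

def empirical_risk(w, x, y):
    """Mean squared loss of the predictor f(x) = w * x on a finite sample."""
    return np.mean((y - w * x) ** 2)

# Empirical risk minimizer on a small training set (closed form for 1-D least squares).
x_train, y_train = sample(20)
w_hat = np.sum(x_train * y_train) / np.sum(x_train ** 2)
train_error = empirical_risk(w_hat, x_train, y_train)

# Test error: empirical risk on a small held-out sample (itself a random quantity).
x_test, y_test = sample(20)
test_error = empirical_risk(w_hat, x_test, y_test)

# Expected risk: approximated by a very large fresh sample (Monte Carlo).
x_big, y_big = sample(1_000_000)
expected_risk_approx = empirical_risk(w_hat, x_big, y_big)

print(train_error, test_error, expected_risk_approx)
```

Rerunning with a different seed changes the test error noticeably, while the expected risk of a fixed predictor is a single number; that is the sense in which the two should be kept apart.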
Taxonomy of data, object space, measurement
Monadic, dyadic (e.g. pairwise), polyadic
Scales
Nominal (categorical): qualitative, but without quantitative measurements
Ordinal: measurement values are meaningful only with respect to other
measurements, i.e., the rank order of measurements carries the information, not the
numerical differences
Quantitative scale
Interval: the relation of numerical differences carries the information. Invariance
w.r.t. translation and scaling
Ratio: zero value of the scale carries information but not the measurement unit
Absolute: absolute values are meaningful
Mathematical spaces: topological, metric, Euclidean vector, metrizable
Probability spaces: elementary event, sample space, family of sets, algebra of events,
probability of events, probability model (triplet)
Stackexchange: Where a distinction is made between probability function and density,
the pmf applies only to discrete random variables, while the pdf applies to continuous
random variables
ml2016tutorial1: Note: Expected value ≠ Most likely value
Describing dependencies in data by covariance is equivalent to approximation of data
distribution by a Gaussian model.
3 - Density Estimation in Regression: Parametric Models
Modeling assumptions for regression, different approaches, Bayesianism and frequentism
Maximum Likelihood Estimation, ML estimation for normal distributions
Procedure: Find the extremum of the log-likelihood function
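For a sample x_1, ..., x_n assumed i.i.d. from N(μ, σ²), the procedure can be written out as follows. This is a worked sketch; the notation x_i, n is assumed here rather than taken from the lecture.

```latex
\begin{aligned}
\ell(\mu,\sigma^2) &= \sum_{i=1}^{n}\log \mathcal{N}(x_i \mid \mu,\sigma^2)
 = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2,\\
\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
 \;\Rightarrow\; \hat\mu = \frac{1}{n}\sum_{i=1}^{n}x_i,\\
\frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0
 \;\Rightarrow\; \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)^2.
\end{aligned}
```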
Wikipedia: Under the additional assumption that the errors are normally distributed,
ordinary least squares (OLS) is the maximum likelihood estimator.
Wikipedia: The Gauss-Markov theorem states that in a linear regression model in which
the errors have expectation zero, are uncorrelated and have equal variances, the best
linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least
squares (OLS) estimator, provided it exists. The errors do not need to be normal, nor do
they need to be independent and identically distributed (only uncorrelated with mean
zero and homoscedastic with finite variance).
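The link between OLS and maximum likelihood under normally distributed errors can be checked numerically. The sketch below (NumPy assumed; the toy design matrix, weights and noise level are made up for illustration) fits OLS and then verifies that the gradient of the Gaussian log-likelihood vanishes at the OLS solution and that random perturbations only lower the likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy problem: y = X w_true + Gaussian noise with known sigma.
n, d, sigma = 200, 3, 0.3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0.0, sigma, size=n)

def gaussian_log_likelihood(w):
    """Log-likelihood of y under y_i ~ N(x_i^T w, sigma^2) with i.i.d. errors."""
    residuals = y - X @ w
    return (-0.5 * n * np.log(2.0 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2.0 * sigma**2))

# OLS estimate: minimizes the residual sum of squares.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# The score (gradient of the log-likelihood w.r.t. w) vanishes at the OLS solution.
score_at_ols = X.T @ (y - X @ w_ols) / sigma**2

# Random perturbations of w_ols all give a lower log-likelihood.
perturbed = [w_ols + 0.05 * rng.normal(size=d) for _ in range(100)]
ll_ols = gaussian_log_likelihood(w_ols)
ll_perturbed_max = max(gaussian_log_likelihood(w) for w in perturbed)

print(np.round(score_at_ols, 8), ll_ols > ll_perturbed_max)
```

The log-likelihood depends on w only through the residual sum of squares, which is why maximizing one is the same as minimizing the other.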
ml2016tutorial3: Note that if we don't know the true value of μ, we can use its
estimate μ^ to calculate σ^; however, in this case the estimate is biased, i.e.
E[σ^²] ≠ σ² (the true variance).
The James–Stein estimator is a biased estimator of the mean of Gaussian random
vectors. It can be shown that, in three or more dimensions, the James–Stein estimator
dominates the "ordinary" least squares approach, i.e., it has lower total mean squared
error. It is the best-known example of Stein's phenomenon.
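A small simulation illustrates the effect. This is a sketch under stated assumptions (NumPy, unit noise variance assumed known, shrinkage towards the origin, an arbitrarily drawn true mean in 10 dimensions); it is not the lecture's own experiment.

```python
import numpy as np

rng = np.random.default_rng(3)

d, trials = 10, 20_000        # the dimension must be at least 3 for domination
theta = rng.normal(size=d)    # assumed true mean vector (arbitrary choice)

# One observation x ~ N(theta, I) per trial; the ML / least squares estimate is x itself.
x = theta + rng.normal(size=(trials, d))

# James-Stein estimator: shrink each observation towards the origin.
shrinkage = 1.0 - (d - 2) / np.sum(x**2, axis=1, keepdims=True)
theta_js = shrinkage * x

# Total squared error, averaged over trials.
mse_ml = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((theta_js - theta) ** 2, axis=1))
print(mse_ml, mse_js)
```

The ML risk is close to d, while the James–Stein risk comes out smaller; the gain shrinks as the true mean moves further from the shrinkage target.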
Maximum likelihood estimation of variance is biased, but it is nevertheless consistent.
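The bias can be seen in a short simulation. This is a sketch (NumPy assumed; true variance and sample sizes chosen arbitrarily) that averages the ML variance estimate, which divides by n and plugs in the sample mean, over many samples and compares it with the theoretical value (n − 1)/n · σ².

```python
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0

def mean_ml_variance(n, trials=100_000):
    """Average of the ML variance estimate (ddof=0, sample mean plugged in) over many samples."""
    x = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
    return np.var(x, axis=1, ddof=0).mean()

for n in (2, 5, 50):
    # Theory: E[sigma_hat^2] = (n - 1) / n * sigma^2, which tends to sigma^2 as n grows.
    print(n, mean_ml_variance(n), (n - 1) / n * true_var)
```

The factor (n − 1)/n tends to 1 as n grows, which is in line with the estimator being consistent despite its bias at any finite n.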
Cramér–Rao inequality, Fisher information, score etc.
Wikipedia: In its simplest form, the bound states that the variance of any unbiased
estimator is at least as high as the inverse of the Fisher information.
Wikipedia: An unbiased estimator which achieves this lower bound is said to be (fully)
efficient. Such a solution achieves the lowest possible mean squared error among all
unbiased methods, and is therefore the minimum variance unbiased (MVU) estimator.
Wikipedia: The Cramér–Rao bound can also be used to bound the variance of biased
estimators of given bias. In some cases, a biased approach can result in both a variance
and a mean squared error that are below the unbiased Cramér–Rao lower bound.
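For reference, the quantities involved can be written out for a scalar parameter θ, under the usual regularity conditions. The notation below (s for the score, I for the Fisher information of the observed data, b for the bias) is assumed here and may differ from the lecture's.

```latex
\begin{aligned}
s(\theta) &= \frac{\partial}{\partial\theta}\log p(x\mid\theta) && \text{(score)}\\
I(\theta) &= \mathbb{E}\!\left[s(\theta)^2\right]
 = -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta)\right] && \text{(Fisher information)}\\
\operatorname{Var}(\hat\theta) &\ge \frac{1}{I(\theta)} && \text{(unbiased } \hat\theta\text{)}\\
\operatorname{Var}(\hat\theta) &\ge \frac{\bigl(1 + b'(\theta)\bigr)^2}{I(\theta)} && \text{(bias } b(\theta) = \mathbb{E}[\hat\theta]-\theta\text{)}
\end{aligned}
```

As an example, for n i.i.d. observations from N(μ, σ²) with known σ², the Fisher information about μ is n/σ², and the sample mean has variance σ²/n, so it attains the bound and is efficient.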
Importance of the Maximum Likelihood Method, realizable model
Summary of MLEs
Consistency, equivariance, asymptotic efficiency, asymptotic normality
Bayesian Learning, on normal distribution, recursive Bayesian estimation
Exercise 2: Having determined the functional form of the prior and likelihood, we want
to compute the posterior. Doing it analytically can be hard in general, but it is easy if
the prior and likelihood form a conjugate pair. Then the posterior will have the same
functional form as the prior, only the parameters differ.
Wikipedia: In Bayesian probability theory, if the posterior distributions are in the same
probability distribution family as the prior probability distribution, the prior and
posterior are then called conjugate distributions, and the prior is called a conjugate
prior for the likelihood function.
ml2016tutorial3: Conjugate priors:
the gamma distribution is conjugate to the exponential distribution (prior on the rate)
the normal distribution is conjugate to the normal one (prior on the mean, known variance)
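The normal-normal pair also shows how recursive Bayesian estimation works: because the posterior stays normal, it can serve as the prior for the next observation. The sketch below assumes a known noise variance and an arbitrary N(0, 10) prior on the mean; the function name `update_normal_mean` is made up for illustration.

```python
import numpy as np

def update_normal_mean(prior_mean, prior_var, x, noise_var):
    """Posterior over mu after observing x ~ N(mu, noise_var), given mu ~ N(prior_mean, prior_var)."""
    post_precision = 1.0 / prior_var + 1.0 / noise_var   # precisions add
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + x / noise_var)
    return post_mean, post_var

rng = np.random.default_rng(4)
noise_var = 1.0
data = rng.normal(2.0, np.sqrt(noise_var), size=50)

# Recursive estimation: the posterior after one point becomes the prior for the next.
mean, var = 0.0, 10.0
for x in data:
    mean, var = update_normal_mean(mean, var, x, noise_var)

# Batch posterior computed from all data at once, for comparison.
n = len(data)
batch_var = 1.0 / (1.0 / 10.0 + n / noise_var)
batch_mean = batch_var * (0.0 / 10.0 + data.sum() / noise_var)

print(mean, var, batch_mean, batch_var)
```

The recursive and batch results agree, and only the two parameters of the normal posterior ever need to be stored.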
ML-Bayes estimation differences
The maximum likelihood method only estimates the parameters μ^, σ^, but not the
distribution!
ml2016tutorial3: simple linear regression corresponds to MLE, regularized linear
regression corresponds to MAP.
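One common instance of this correspondence, assuming Gaussian noise with variance σ² and a zero-mean Gaussian prior w ~ N(0, τ²I) on the weights, is ridge (L2-regularized) regression with penalty λ = σ²/τ². The sketch below uses NumPy and made-up toy data; the helper `ridge` is hypothetical, written only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(0.0, 0.5, size=n)

def ridge(X, y, lam):
    """Closed-form minimizer of ||y - X w||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# No penalty: ordinary least squares, i.e. the MLE under Gaussian noise.
w_mle = ridge(X, y, 0.0)

# MAP with noise variance sigma2 and prior w ~ N(0, tau2 * I): lam = sigma2 / tau2.
sigma2, tau2 = 0.25, 1.0
w_map = ridge(X, y, sigma2 / tau2)

print(np.linalg.norm(w_mle), np.linalg.norm(w_map))
```

The MAP weights are shrunk towards the prior mean (zero); a broader prior (larger τ²) gives a smaller penalty and moves the MAP estimate back towards the MLE.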
Schematic behaviour of bias and variance
4 - Regression
Linear regression models, least squares, residual sum of squares (RSS)