Introduction to Optimization
Week 1
1 Introduction
1.1 Optimization problems and related terminology
Consider a set X, a subset C ⊂ X, and a function f : X → R. A minimization problem consists in finding a point xmin ∈ C such
that f(xmin) ≤ f(x) for all x ∈ C. f is called the objective function, C is called the feasible set,
and its elements are feasible points. The problem is constrained if C ̸= X and unconstrained otherwise.
xmin is called the (global) minimizer of f on C, or a solution to the problem
min_C(f) = min_{x∈C} f(x) = min{f(x) : x ∈ C}
The set of such solutions is denoted by arg minC (f ). If C = X, we often leave out the C.
If a function has no minimizers, we still may be interested in finding the value inf_C(f) = inf{f(x) : x ∈ C}. In any case, the set of minimizers can be described as
arg min_C(f) = ⋂_{γ > inf_C(f)} [f ≤ γ], where [f ≤ γ] = {x ∈ X : f(x) ≤ γ}
[f ≤ γ] is called the γ-sublevel set of f .
A maximization problem can be reduced to a minimization problem through the equivalence
max_{x∈C} f(x) = −min_{x∈C} {−f(x)}
1.2 Two simple examples in R that actually tell us a lot
1.2.1 Cost-efficient cylindrical containers
A cylindrical container of radius r and height h has a top and a bottom, each made of material with cost a per unit area, and a side with cost b per unit area. The total cost is then given by
C(r, h) = 2aπr² + 2bπrh
while the volume of the can is given by
V = πr²h
From these two, fixing the volume V and substituting h = V/(πr²), the cost becomes C(r) = 2aπr² + 2bV/r; setting its derivative 4aπr − 2bV/r² to zero and using V = πr²h shows that the optimal height h̄ depends on the optimal radius r̄ through
h̄ = 2ar̄/b
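A quick numerical check of this relation (a minimal sketch; the cost parameters and the volume below are arbitrary choices, not values from the notes):

# Minimize the cost of a cylinder of fixed volume V and check that the
# optimal dimensions satisfy h = 2*a*r/b.  (a, b, V are arbitrary here.)
import numpy as np
from scipy.optimize import minimize_scalar

a, b, V = 3.0, 1.0, 10.0          # top/bottom cost, side cost, fixed volume

def cost(r):
    # Total cost as a function of the radius, with h eliminated via V = pi r^2 h.
    h = V / (np.pi * r ** 2)
    return 2 * a * np.pi * r ** 2 + 2 * b * np.pi * r * h

res = minimize_scalar(cost, bounds=(1e-3, 10.0), method="bounded")
r_opt = res.x
h_opt = V / (np.pi * r_opt ** 2)
print(h_opt, 2 * a * r_opt / b)   # the two values should (approximately) agree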
Generalizing the techniques we used, we end up with two propositions:
Proposition 1.1 Let I be an open interval, and let f : I → R be twice differentiable, with f′′(x) > 0
for all x ∈ I. If there is x̄ ∈ I such that f′(x̄) = 0, then f(x̄) < f(x) for every x ∈ I \ {x̄}.
Proposition 1.2 Let I be an open interval, and let f : I → R be differentiable. If there is x̄ ∈ I such
that f(x̄) ≤ f(x) for every x ∈ I, then f′(x̄) = 0.
1.2.2 A broken fastest path
Fermat’s Principle of Least Time states that the path taken by a ray of light between two given points
is the path that can be traveled in the shortest time. We can do the math ourselves.
Proposition 1.5 (law of refraction) For a given pair of media, the ratio of the sines of the angles of
incidence and refraction equals the ratio of the speed of light in the corresponding media.
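A numerical illustration of this statement (a minimal sketch; the geometry and the speeds below are arbitrary assumptions, not values from the notes): light travels from a point in the first medium to a point in the second, and we optimize the point where it crosses the interface.

# Minimize the travel time of a ray going from (0, h1) in medium 1 (speed v1)
# to (d, -h2) in medium 2 (speed v2), crossing the interface y = 0 at (x, 0),
# then check the law of refraction: sin(theta_1)/sin(theta_2) = v1/v2.
import numpy as np
from scipy.optimize import minimize_scalar

h1, h2, d = 1.0, 2.0, 3.0         # geometry (arbitrary)
v1, v2 = 3.0, 2.0                 # speeds in the two media (arbitrary)

def travel_time(x):
    return np.hypot(x, h1) / v1 + np.hypot(d - x, h2) / v2

x = minimize_scalar(travel_time, bounds=(0.0, d), method="bounded").x
sin_incidence = x / np.hypot(x, h1)             # angles measured from the normal
sin_refraction = (d - x) / np.hypot(d - x, h2)
print(sin_incidence / sin_refraction, v1 / v2)  # the two ratios should agree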
1.3 Examples from data analysis
Consider a data set
D = {(aj , yj ) ∈ X × Y : j = 1, 2, . . . , J}
where the aj ’s are certain features (within set X) and the yj ’s are observations or labels (in set Y ),
corresponding to a sample or individual j ∈ {1, . . . , J}. The aim is to discover or learn a function
ϕ : X → Y such that ϕ(aj ) ≃ yj . Typically, ϕ will belong to a family of functions depending on a
parameter θ ∈ Rn .
1.3.1 The loss function approach
Define a loss function
L_D(θ) = (1/J) Σ_{j=1}^{J} ℓ(a_j, y_j, θ)
One of the most typical problems in data analysis is the linear least squares problem:
min_{θ∈R^n} (1/(2J)) Σ_{j=1}^{J} (θ^T a_j − y_j)² = min_{θ∈R^n} (1/(2J)) ||Aθ − y||²
where A is the J × n matrix whose rows are the a_j^T and y = (y_1, . . . , y_J)^T. In this case, ℓ(a, y, θ) = (θ^T a − y)².
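As a small illustration (a sketch on synthetic data; the dimensions, noise level, and true parameter below are arbitrary assumptions), the minimizer can be computed with a standard least-squares routine:

# Solve min (1/(2J)) ||A theta - y||^2 on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
J, n = 50, 3
A = rng.normal(size=(J, n))                      # feature matrix (rows a_j^T)
theta_true = np.array([1.0, -2.0, 0.5])
y = A @ theta_true + 0.1 * rng.normal(size=J)    # noisy observations

# The 1/(2J) factor rescales the objective but does not change the minimizer.
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
loss = np.sum((A @ theta_hat - y) ** 2) / (2 * J)
print(theta_hat, loss)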
If, in addition to approximating the data with a linear model, we want to impose structure on θ, we can add a regularization term to the problem. Two common examples:
• Ridge regression: adds λ||θ||² (less sensitive to perturbations); see the sketch after this list
• LASSO: adds λ||θ||₁ (promotes sparsity of θ, but is not differentiable)
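A minimal sketch of ridge regression with the scaling used above (the helper ridge and the commented-out test call are illustrative assumptions): setting the gradient of (1/(2J))||Aθ − y||² + λ||θ||² to zero gives the linear system (AᵀA + 2λJ I)θ = Aᵀy.

# Ridge regression for the objective (1/(2J))||A theta - y||^2 + lam*||theta||^2.
# The gradient vanishes exactly when (A^T A + 2*lam*J*I) theta = A^T y.
import numpy as np

def ridge(A, y, lam):
    J, n = A.shape
    return np.linalg.solve(A.T @ A + 2 * lam * J * np.eye(n), A.T @ y)

# Example use on the synthetic (A, y) from the previous sketch:
# theta_ridge = ridge(A, y, lam=0.1)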
1.3.2 Support Vector Machines
A Support Vector Machine (SVM) is an optimization approach to binary classification problems (say y takes the values −1 and 1). We seek a hyperplane that separates the sets {aj : yj = −1} and {aj : yj = 1}. Such a hyperplane need not exist, in which case we can still look for the best possible fit. If it does exist, we want to maximize its distance to either set; a standard formulation is sketched below.
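One common way to make this precise (a sketch of the usual hard-margin formulation; the symbols w and β are not used elsewhere in these notes): writing the hyperplane as {x : wᵀx + β = 0}, one solves
min_{w,β} (1/2)||w||²  subject to  y_j(wᵀa_j + β) ≥ 1 for j = 1, . . . , J,
and the distance from the optimal hyperplane to each of the two sets is then 1/||w||.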
2 Basic optimization theory
2.1 Metric and topological structure of Rn
x ∈ Rn ⇐⇒ x = (x₁, . . . , xₙ)^T with x₁, . . . , xₙ ∈ R
The Euclidean norm of x ∈ Rn is ||x|| = √(x₁² + · · · + xₙ²). Recall a norm satisfies
• ||0|| = 0 and ||x|| > 0 for x ̸= 0
• ||ax|| = |a| ||x|| for all x ∈ Rn and a ∈ R
• ||x + y|| ≤ ||x|| + ||y|| for all x, y ∈ Rn .
The 1-norm and ∞-norm are defined as
||x||₁ = |x₁| + · · · + |xₙ|,   ||x||∞ = max{|x₁|, . . . , |xₙ|}
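A quick numerical illustration of the three norms (a sketch assuming numpy; the vector is arbitrary):

import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x))              # Euclidean norm: 5.0
print(np.linalg.norm(x, ord=1))       # 1-norm: 7.0
print(np.linalg.norm(x, ord=np.inf))  # infinity-norm: 4.0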
The distance between x, y ∈ Rn is dist(x, y) = ||x − y||. A metric satisfies the following properties:
• dist(x, x) = 0 and dist(x, y) > 0 for x ̸= y
• dist(x, y) = dist(y, x)
• dist(x, y) ≤ dist(x, z) + dist(z, y)
The open and closed balls centered at x ∈ Rn with radius r > 0 are given by
B(x, r) = {y ∈ Rn : ||x − y|| < r}, B̄(x, r) = {y ∈ Rn : ||x − y|| ≤ r}
A subset S ⊂ Rn is open if for each x ∈ S, there exists r > 0 such that B(x, r) ⊂ S.
Open balls are open sets. The intersection of a finite number of open sets is open, while any union (finite
or infinite) of open sets is open.
The interior of a set S ⊂ Rn, denoted by int(S), is the largest open set contained in S. The elements of int(S) are called interior points of S. S is open if and only if int(S) = S.
A subset S ⊂ Rn is closed if and only if its complement is open. Closed balls are closed sets.
A sequence (xk ) in Rn converges to x̄ ∈ Rn (as k → ∞) if, for every ϵ > 0, there is k0 ∈ N such that
xk ∈ B(x̄, ϵ) for every k ≥ k0. In this case, x̄ is the limit of (xk).
Proposition 2.8 A set S ⊂ Rn is closed if and only if every convergent sequence of points in S has its
limit in S.
The closure of a set S ⊂ Rn , denoted by S̄, is the smallest closed set that contains S. A set S is closed
if and only if S̄ = S.
A set S ⊂ Rn is bounded if it is contained in a ball.
Proposition 2.9 (Bolzano-Weierstrass) Every bounded sequence in Rn has a convergent subsequence.
A set C ⊂ Rn is compact if it is both closed and bounded.
Corollary 2.10 Every sequence in a compact set has a subsequence that converges to a point in the set.
2.2 Continuous functions and existence of minimizers
Let f : D ⊂ Rn → R, let x̄ ∈ D, and let ℓ ∈ R. If, for every ϵ > 0 there is δ > 0 such that |f (x) − ℓ| < ϵ
whenever ||x − x̄|| < δ, we say that f (x) converges to ℓ as x tends to x̄, and that ℓ is the limit of f (x)
as x tends to x̄. In this case, we write ℓ = limx→x̄ f (x).
A function f : D ⊂ Rn → R is continuous at a point x̄ ∈ D if limx→x̄ f (x) = f (x̄). If f is continuous at
every point of D, it is continuous in D, or just continuous. f is continuous at x̄ if and only if for every
sequence (xk ) converging to x̄, we have limk→∞ f (xk ) = f (x̄).
Theorem 2.12 (Weierstrass) If C ⊂ Rn is compact and f : C → R is continuous, then there exist
xmin, xmax ∈ C such that f(xmin) ≤ f(x) ≤ f(xmax) for all x ∈ C.
A function f : Rn → R is coercive if for every R > 0, there is ρ > 0 such that f (x) > R for every
x ̸∈ B(0, ρ). In this case, we shall often write lim||x||→∞ f (x) = ∞.
Given γ ∈ R, the γ-level set of f : D ⊂ Rn → R is
[f = γ] = {x ∈ D : f (x) = γ}
The level and sublevel sets of a continuous function are closed.
Proposition 2.15 The sublevel sets of a coercive function must be bounded.
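Combining the previous results gives a standard existence criterion (a sketch, not numbered in these notes): if f : Rn → R is continuous and coercive, then it has a global minimizer. Indeed, pick any x0 ∈ Rn; the sublevel set [f ≤ f(x0)] is nonempty, closed (since f is continuous) and bounded (Proposition 2.15), hence compact, so Theorem 2.12 yields a minimizer of f on this set, and such a point also minimizes f over all of Rn, since every point outside the set has a strictly larger value.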