State the Universal Approximation Theorem and explain its implications for the expressive power of these models. Discuss the limitations of the theorem and its relationship to the number of neurons and network architecture.
The Universal Approximation Theorem states that a neural network with at least one hidden layer containing a sufficient number of neurons and a non-linear activation function can approximate any continuous function to an arbitrary level of accuracy. In other words, a neural network can fit any continuous function as closely as desired, which is why neural networks are called universal approximators. Note that the theorem only guarantees that suitable weights exist; in practice it is further assumed that there is enough data to train the network reasonably well (neural networks generally perform well with large amounts of data) and that overfitting has been avoided, so that the approximation generalizes to new, unseen instances (Pramoditha, 2023).
More formally, the Universal Approximation Theorem states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function. This implies that neural networks can approximate any continuous function to a desired accuracy, given a sufficiently large number of neurons in the hidden layer.
So, given a continuous function f: A ⊂ R^d → R^m, where A is a compact subset of the Euclidean space R^d and m is the output dimension, and given ϵ > 0, there exists a feedforward neural network with a single hidden layer, sigmoid activation functions, and a sufficiently large hidden layer size N, such that for all x ∈ A:
‖f(x) − y(x)‖ < ϵ,
where y(x) is the output of the neural network for input x. The network has an input layer of size d, a hidden layer of size N, and an output layer of size m.
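As a rough numerical illustration (a minimal sketch, not the theorem's constructive proof), the Python snippet below builds such a single-hidden-layer sigmoid network for d = m = 1 and fits it to the continuous function sin on the compact set A = [-π, π]: the hidden weights w_i and biases b_i are drawn at random, and only the output weights c_i in y(x) = c_1 σ(w_1 x + b_1) + … + c_N σ(w_N x + b_N) are solved by least squares. The target function, weight scale, and grid size are all illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_shallow_net(f, N, x):
    """Fit y(x) = sum_i c_i * sigmoid(w_i * x + b_i) to f on the grid x."""
    w = rng.normal(scale=3.0, size=N)             # random hidden weights w_i
    b = rng.uniform(-np.pi, np.pi, size=N) * w    # biases so each transition lies in A
    H = sigmoid(np.outer(x, w) + b)               # hidden activations, shape (len(x), N)
    c, *_ = np.linalg.lstsq(H, f(x), rcond=None)  # output weights c_i by least squares
    return lambda t: sigmoid(np.outer(t, w) + b) @ c

x = np.linspace(-np.pi, np.pi, 400)               # dense grid on the compact set A
net = fit_shallow_net(np.sin, N=50, x=x)
print("sup-norm error on the grid:", np.max(np.abs(np.sin(x) - net(x))))

Increasing N gives the network more sigmoid units to combine, which is exactly the freedom the theorem exploits.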
Implications: Wider and deeper networks have greater expressive power in theory. The sigmoid activation supplies the non-linearity that lets the network learn non-linear relationships; without it, the layers would collapse into a single linear (affine) map. Neural networks are thus powerful function approximators that can, in theory, represent a wide range of complex relationships within data, including highly non-linear and intricate patterns. While the theorem guarantees that an approximation exists, it does not say how fast the error decreases as N increases; such approximation rates are analysed with dedicated bounds (for instance, Barron-type results for sigmoidal networks), whereas tools such as the Rademacher complexity bound the generalization error rather than the approximation error.
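As an empirical illustration of this point, the standalone check below (reusing the assumed random-feature setup from the earlier sketch) sweeps the hidden layer size N and prints the sup-norm error of the least-squares fit on a dense grid. The exact numbers depend on the seed and the fitting scheme, so it illustrates the trend rather than a formal approximation rate.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-np.pi, np.pi, 400)   # dense grid on the compact set A = [-pi, pi]
target = np.sin(x)                    # the continuous function to approximate

for N in (2, 5, 10, 25, 50, 100):
    w = rng.normal(scale=3.0, size=N)               # random hidden weights w_i
    b = rng.uniform(-np.pi, np.pi, size=N) * w      # biases placing transitions inside A
    H = sigmoid(np.outer(x, w) + b)                 # hidden activations, shape (400, N)
    c, *_ = np.linalg.lstsq(H, target, rcond=None)  # least-squares output weights c_i
    print(f"N = {N:4d}  sup-norm error = {np.max(np.abs(target - H @ c)):.4f}")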