Neural Networks and Deep Learning. 2nd Edition.
By Charu C. Aggarwal.
Contents
1 An Introduction to Neural Networks
2 Machine Learning with Shallow Neural Networks
3 Training Deep Neural Networks
4 Teaching Deep Learners to Generalize
5 Radial Basis Function Networks
6 Restricted Boltzmann Machines
7 Recurrent Neural Networks
8 Convolutional Neural Networks
9 Deep Reinforcement Learning
10 Advanced Topics in Deep Learning
Chapter 1
An Introduction to Neural Networks
1. Consider the case of the XOR function in which the two points {(0, 0), (1, 1)} belong to one class, and the
other two points {(1, 0), (0, 1)} belong to the other class. Show how you can use the ReLU activation function
to separate the two classes in a manner
similar to the example in the chapter.
We assume that the hidden layer contains two units. The first layer should implement the transformations
x1 − x2 and x2 − x1 to create the pre-activation values. On applying the ReLU activation to the two
pre-activation values, one obtains the representation {(0, 0), (0, 0)} for the first pair of data points, and the
representation {(1, 0), (0, 1)} for the second pair of data points. The two classes become linearly separable:
the sum of the two hidden values is 0 for the first class and 1 for the second class.
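The hidden representation described above can be checked with a short sketch. The function names and the final thresholding rule (sum of the hidden values compared against 0.5) are assumptions made for illustration; the chapter's own example may use a different output layer.

```python
def relu(v):
    """ReLU activation: max(0, v)."""
    return max(0.0, v)

def hidden(x1, x2):
    """Two hidden units with pre-activations x1 - x2 and x2 - x1."""
    return (relu(x1 - x2), relu(x2 - x1))

def predict(h):
    """Hypothetical output rule: threshold the sum of hidden values at 0.5."""
    return 1 if h[0] + h[1] > 0.5 else 0

class_a = [(0, 0), (1, 1)]  # first XOR class
class_b = [(1, 0), (0, 1)]  # second XOR class

hidden_a = [hidden(*p) for p in class_a]
hidden_b = [hidden(*p) for p in class_b]
labels_a = [predict(h) for h in hidden_a]
labels_b = [predict(h) for h in hidden_b]

print(hidden_a, labels_a)
print(hidden_b, labels_b)
```

Both points of the first class collapse to the origin of the hidden space, which is what makes a single linear threshold sufficient.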
2. Show the following properties of the sigmoid and tanh activation functions (denoted by Φ(·) in each case):
(a) Sigmoid activation: Φ(−v) = 1 − Φ(v)
(b) Tanh activation: Φ(−v) = −Φ(v)
(c) Hard tanh activation: Φ(−v) = −Φ(v)
For sigmoid activation Φ(−v) = 1/(1 + exp(v)) = exp(−v)/(1 + exp(−v)). The last value can easily be
shown to be 1 − 1/(1 + exp(−v)) = 1 − Φ(v).
For tanh activation, the proof is similar: write Φ(−v) = (exp(−2v) − 1)/(exp(−2v) + 1) and multiply the
numerator and denominator by exp(2v) to obtain (1 − exp(2v))/(1 + exp(2v)) = −Φ(v).
The proof for hard tanh is even simpler because it is a linear function clipped to the symmetric range [−1, 1].
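As a numerical sanity check (not a proof), the three symmetry properties can be evaluated at a few sample arguments. The sample values and the definition of hard tanh as clipping to [−1, 1] are assumptions of this sketch.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def hard_tanh(v):
    """Assumed definition: identity clipped to the range [-1, 1]."""
    return max(-1.0, min(1.0, v))

test_values = [-3.0, -1.0, -0.25, 0.0, 0.5, 2.0]

# Largest violation of each identity over the sample values.
sigmoid_gap = max(abs(sigmoid(-v) - (1.0 - sigmoid(v))) for v in test_values)
tanh_gap = max(abs(math.tanh(-v) + math.tanh(v)) for v in test_values)
hard_tanh_gap = max(abs(hard_tanh(-v) + hard_tanh(v)) for v in test_values)

print(sigmoid_gap, tanh_gap, hard_tanh_gap)
```

All three gaps should be zero up to floating-point rounding.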
3. Show that the tanh function is a re-scaled sigmoid function with both horizontal and vertical stretching, as
well as vertical translation:
tanh(v) = 2sigmoid(2v) − 1
This identity follows by substituting the definition of the sigmoid on the right-hand side:
2/(1 + exp(−2v)) − 1 = (1 − exp(−2v))/(1 + exp(−2v)) = tanh(v).
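The identity can also be spot-checked numerically. This is a minimal sketch; the helper name rescaled_sigmoid and the sample arguments are chosen here for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def rescaled_sigmoid(v):
    """Right-hand side of the identity: 2 * sigmoid(2v) - 1."""
    return 2.0 * sigmoid(2.0 * v) - 1.0

test_values = [-5.0, -1.5, -0.1, 0.0, 0.3, 1.0, 4.0]

# Largest absolute difference between tanh(v) and the rescaled sigmoid.
max_gap = max(abs(math.tanh(v) - rescaled_sigmoid(v)) for v in test_values)
print(max_gap)
```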
4. Consider a data set in which the two points {(−1, −1), (1, 1)} belong to one class, and the other two points
{(1, −1), (−1, 1)} belong to the other class. Start with perceptron parameter values at (0, 0), and work out a few
stochastic gradient descent updates with α = 1. While performing the stochastic gradient descent updates, cycle
through the training points in any order.
(a) Does the algorithm converge in the sense that the change in objective function becomes extremely small
over time?
(b) Explain why the situation in (a) occurs.
(a) No, the algorithm does not converge. (b) The two classes are not linearly separable, so no weight vector
classifies all four points correctly, and the updates never settle down.
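The behavior can be traced with a small sketch of the perceptron updates on these four points. Several details are assumptions made here rather than part of the exercise: the labels are encoded as +1 and −1, there is no bias term, the points are visited in the listed order, and a score of exactly zero is counted as a mistake.

```python
# Each entry is ((x1, x2), label), with labels encoded as +1 / -1.
points = [((-1, -1), 1), ((1, 1), 1), ((1, -1), -1), ((-1, 1), -1)]
alpha = 1  # learning rate from the exercise

def run_epoch(w):
    """One pass over the points; returns the new weights and update count."""
    updates = 0
    for (x1, x2), y in points:
        score = w[0] * x1 + w[1] * x2
        if y * score <= 0:  # assumption: ties count as mistakes
            w = (w[0] + alpha * y * x1, w[1] + alpha * y * x2)
            updates += 1
    return w, updates

w = (0, 0)
history = []
for epoch in range(5):
    w, updates = run_epoch(w)
    history.append((w, updates))

print(history)
```

Under these assumptions every point triggers an update in every pass, and the weights return to (0, 0) at the end of each pass, so the procedure cycles rather than settling.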
5. For the data set in Exercise 4, where the two features are denoted by (x1, x2), define a new 1-dimensional
representation z denoted by the following:
z = x1 · x2
Is the data set linearly separable in terms of the 1-dimensional representation corresponding to z? Explain the
importance of nonlinear transformations in classification problems.
The points in one class map to 1, whereas the points in the other class map to −1. Therefore, the
transformation makes the points linearly separable. Note that this is the XOR function, and therefore
nonlinear transformations are required to map inseparable points to separable values.
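A short sketch makes the mapping concrete. The class names and the decision rule (threshold z at 0) are illustrative assumptions.

```python
class_a = [(-1, -1), (1, 1)]
class_b = [(1, -1), (-1, 1)]

def transform(x1, x2):
    """Nonlinear 1-dimensional feature z = x1 * x2."""
    return x1 * x2

z_a = [transform(*p) for p in class_a]
z_b = [transform(*p) for p in class_b]

def predict(z):
    """Hypothetical linear rule in z: threshold at 0."""
    return "A" if z > 0 else "B"

print(z_a, [predict(z) for z in z_a])
print(z_b, [predict(z) for z in z_b])
```

A single threshold on z separates the classes, even though no line in the original (x1, x2) plane does.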
6. Implement the perceptron in a programming language of your choice.
This is an implementation exercise.
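One possible minimal sketch in Python is given below; it is not the book's reference solution. The function names, the use of a bias term, the +1/−1 label encoding, the rule that a zero score counts as a mistake, and the AND-function toy data are all assumptions made for illustration.

```python
def train_perceptron(data, alpha=1.0, max_epochs=100):
    """Mistake-driven perceptron training; returns weights w and bias b."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (ties count as mistakes)
                w = [wi + alpha * y * xi for wi, xi in zip(w, x)]
                b += alpha * y
                mistakes += 1
        if mistakes == 0:  # a full pass with no updates: stop early
            break
    return w, b

def predict(w, b, x):
    """Sign of the linear score, encoded as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# Linearly separable toy data: the logical AND of two binary inputs.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predictions = [predict(w, b, x) for x, _ in and_data]
print(w, b, predictions)
```

Because the AND data are linearly separable, training stops after finitely many updates and all four points are classified correctly.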
7. Show that the derivative of the sigmoid activation function is at most 0.25, irrespective of the value of its
argument. At what value of its argument does the sigmoid activation function take on its maximum value?
The derivative is of the form o(1 − o), where o ∈ (0, 1) is the output of the sigmoid. By differentiating with
respect to o, it is easy to show that this function takes on its maximum at o = 0.5 (i.e., at an argument value
of 0), where its value is 0.25. Furthermore, o(1 − o) is always positive, so the derivative lies in (0, 0.25].
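The bound can be checked numerically on a grid of arguments. This is a sketch only; the grid range [−10, 10] and the step of 0.01 are arbitrary choices.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_grad(v):
    """Derivative of the sigmoid written as o * (1 - o)."""
    o = sigmoid(v)
    return o * (1.0 - o)

# Grid of arguments from -10 to 10 in steps of 0.01.
grid = [i / 100.0 for i in range(-1000, 1001)]

best_v = max(grid, key=sigmoid_grad)  # argument with the largest derivative
max_grad = sigmoid_grad(best_v)
print(best_v, max_grad)
```

On this grid the largest derivative occurs at an argument of 0, where it equals 0.25.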
8. Show that the derivative of the tanh activation function is at most 1, irrespective of the value of its argument.
At what value of its argument does the tanh activation take on its maximum value?
The derivative is of the form 1 − o², where o ∈ (−1, 1) is the output of the tanh. This value is at most 1, and
the maximum of 1 is attained at o = 0 (i.e., at an argument value of 0). Note that the tanh activation function
tends to suffer less from the vanishing gradient problem than the sigmoid.
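The form 1 − o² can be compared against a central finite-difference estimate of the slope of tanh. The grid, the step size h, and the helper names are assumptions of this sketch.

```python
import math

def tanh_grad(v):
    """Derivative of tanh written in terms of its output: 1 - o^2."""
    o = math.tanh(v)
    return 1.0 - o * o

def finite_difference(f, v, h=1e-6):
    """Central-difference estimate of f'(v)."""
    return (f(v + h) - f(v - h)) / (2.0 * h)

# Grid of arguments from -5 to 5 in steps of 0.1.
grid = [i / 10.0 for i in range(-50, 51)]

max_grad = max(tanh_grad(v) for v in grid)
worst_gap = max(abs(tanh_grad(v) - finite_difference(math.tanh, v)) for v in grid)
print(max_grad, worst_gap)
```

The largest derivative on the grid is 1, attained at an argument of 0, which is four times the sigmoid's maximum derivative of 0.25.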
9. Consider a network with two inputs x1 and x2. It has two hidden layers, each of which contains two units. Assume
that the weights in each layer are set so that the top unit in each layer applies sigmoid activation to the sum of its
inputs and the bottom unit in each layer applies tanh activation to the sum of its inputs. Finally, the single
output node applies ReLU activation to the sum of its two inputs. Write the output of this neural network in
closed form as a function of x1 and x2. This exercise should give you an idea of the complexity of functions
computed by neural networks.