“Digging Deep into Neural Networks!”
Deep Learning holds major importance in the field of Artificial Intelligence. It teaches computers to learn from examples in order to perform a wide variety of tasks.
It plays a significant role in our daily lives and has applications such as self-driving cars, virtual assistants, video surveillance, face recognition, spam and malware filtering, search engine result refining, and e-commerce product recommendations, to name a few!
In this blog, we will cover the most important areas in deep learning and neural networks, and we will also go through the mathematical intuition behind the algorithms and various optimization techniques.
So let’s get started!
Introduction to Deep Learning and Neural Networks
In simple terms, Deep Learning is a subset of Machine Learning that is inspired by the structure and functioning of the human brain.
The algorithms in deep learning imitate the inner working of our brain to create patterns and perform decision-making.
The main idea behind Deep Learning is that we can artificially develop a neural network just like our biological neurons. We can create simplified mathematical models of parts of a neuron, such as dendrites, cell bodies, and axons, and connect these model neurons to form a network that supports decision-making, hence the term Artificial Neural Network!
An Artificial Neural Network (ANN) typically has about 10–1000 neurons, compared to the human brain's roughly 86 billion neurons; the brain also has a far more complex topology and asynchronous connections.
Now, before digging deep into neural networks and their various widespread architectures, let us start our learning with a very simple linear classifier and understand the structure of a neural network.
Basic Structure of Neural Network
Let us take the example of a binary linear classification problem, where we have to classify an image as “Cat” or “Not Cat”.
We start by feeding the features of our image data into the input layer, then this input layer is connected to a hidden layer of the neural network.
Neural networks have a layered structure of fully interconnected neurons. These neurons are the building blocks of deep neural networks.
A hidden layer is located between the input and output of the neural network; each of its neurons applies weights to its inputs and passes the result through an activation function to produce its output.
The hidden layers perform non-linear transformations of the inputs entered into the network.
Now, every hidden layer function is specialized to produce a defined output.
For example, hidden layer functions that are used to identify cats' eyes and ears may be used in conjunction with subsequent layers to identify faces in images. While the functions to identify eyes alone are not enough to independently recognize objects, they can function jointly within a neural network.
Finally, the output layer takes the input from the previous hidden layer, performs the calculations via its neurons, and returns the predicted output of the network.
Every Neural Network can have any number of neurons and hidden layers, but only a single output layer. This model can be understood more clearly with the help of the following visual representation of the 2-layer network.
Visual Representation of NN for Image Recognition
Basic Structure of Neuron
Now, let us understand the main function of our artificial neuron and neural network and the intuition behind it with the help of mathematical equations.
An artificial neuron, also known as a “perceptron”, can be thought of as a two-step mathematical function. It takes one or more inputs that are multiplied by values called “weights” and added together.
Then, this value is passed through an activation function which finally becomes neurons output.
We also include a “bias” term in each layer which helps in the better fitting of data.
Z: output of a neuron
W: weight applied to the input of the neuron
b: bias term
xi: inputs to the neuron
Activation function: sigmoid
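To make this concrete, here is a minimal NumPy sketch of a single neuron; the input, weight, and bias values are purely illustrative.

import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# Illustrative inputs, weights, and bias
x = np.array([0.5, -1.2, 3.0])   # inputs x_i
W = np.array([0.4, 0.7, -0.2])   # weights
b = 0.1                          # bias term

Z = np.dot(W, x) + b             # weighted sum of inputs plus bias
a = sigmoid(Z)                   # neuron output after the activation function
print(Z, a)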
Underlying Mathematics behind Neural Network
Now, we shall go through the representation of a neural network using mathematical equations and matrices.
Activation Functions
An activation function is a non-linear function applied by a neuron to introduce non-linear transformations in the network.
By introducing non-linearity in our deep learning network, we can better capture the patterns in our data and better “fit” the model.
For Example,
As seen earlier, an artificial neuron is a “weighted-sum” of its input, i.e.
Z = ∑ (weight*input) + bias
So the value of Z can range from -inf to +inf, and on its own this value gives the neuron no indication of when to “fire” or get activated.
Hence, we use an activation function for this purpose. An activation function is applied to this weighted input which can activate (or deactivate ) the neurons and can also squash the range of Z to a limited range.
Types of Activation Function
Linear Activation Function
It is a straight-line function in which the slope “a” is constant.
Such functions can produce very large values and are unable to capture complex patterns.
def linear(x, a=1.0):
    # Straight line with constant slope a
    return a * x
Sigmoid Activation Function
It is a non-linear function, so it can capture complex patterns very well. Its range is also bounded to (0, 1), so its output does not get very large.
But this type of function can suffer from the “vanishing gradient” problem.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
Tanh Function
It is a very popular and widely used activation function whose plot looks very similar to the sigmoid; in fact, it is a scaled sigmoid function!
The range of the tanh function is bounded to (-1, 1), so it keeps the output bounded.
The point to consider here is that the derivatives of this function are steeper, and hence the gradient is stronger than for the sigmoid function. But tanh suffers from the “vanishing gradient” problem too.
def tanh(x):
    return np.tanh(x)
ReLu Function
This function gives an output x if x is positive and 0 otherwise and is non-linear in nature.
But this function is not range-bounded: its output lies in [0, inf), which means the activations can blow up.
In the case of sigmoid or tanh, the activations are dense, meaning almost all neurons contribute to the output of the network, which results in expensive computation.
But in the case of ReLU, since the output is zero for negative values of x, roughly half of the neurons are deactivated in our neural network, making the activations sparse and efficient.
def relu(x):
    x1 = []
    for i in x:
        if i < 0:
            x1.append(0)   # negative inputs are zeroed out
        else:
            x1.append(i)   # positive inputs pass through unchanged
    return x1
But there’s a drawback in Relu Function!
Because of the horizontal line in ReLU (for negative x), the gradient is 0, so the corresponding weights do not get adjusted during gradient descent.
That means neurons going into that state stop responding to variations in error/input (simply because the gradient is 0, nothing changes). This is called the dying ReLU problem. It can cause several neurons to just die and never respond, making a substantial part of our neural network passive.
Leaky ReLu function
This function helps in overcoming the drawback of the ReLU function by increasing its range into the negative values.
When the slope “a” is fixed at 0.01 we get the standard Leaky ReLU; when “a” is chosen randomly, it is known as Randomised ReLU. The range of the Leaky ReLU is (-inf, +inf).
def leaky_relu(x):
    if x > 0:
        return x
    else:
        return 0.01 * x
SoftMax Activation Function
This type of activation function is only used in the output layer rather than throughout the network. Each output value ranges between 0 and 1 and the values sum to 1, so it can be used to model probability distributions.
def soft_max(x):
    e_x = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e_x / e_x.sum(axis=0)
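For instance, feeding a small, made-up score vector through the soft_max function above produces a valid probability distribution (this assumes numpy has been imported as np, as in the earlier snippets).

import numpy as np

scores = np.array([2.0, 1.0, 0.1])
probs = soft_max(scores)
print(probs)        # roughly [0.66, 0.24, 0.10]
print(probs.sum())  # 1.0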
Training the Neural Network
Let us now revisit some previous topics and understand the mathematical process of training and optimizing a neural network.
The complete process includes 2 stages: forward propagation, where the network computes its predicted output from the current weights, and backward propagation, where the weights are updated so as to minimize the loss function.
This network consists of an input layer and an output layer along with 2 hidden layers, as shown in the given figure.
In the Forward Propagation,
For Layer-2 or first hidden layer , we can write equations as:
z[2] = W[1] x + b[1]
a[2] = sigmoid (z[2])
Similarly, For Layer-3 or second hidden layer:
z[3] = W[2] a[2] + b[2]
a[3] = sigmoid (z[3])
where W[1] and W[2] are the weight matrices of layers 2 and 3, while b[1] and b[2] are the corresponding biases.
Equation for W[1]:
Equation for x and b:
Equation for z[2]:
Equation for Output Layer:-
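Putting the forward-propagation equations above together, here is a minimal NumPy sketch of one forward pass through such a network; the layer sizes and random values are chosen purely for illustration.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative sizes: 3 inputs, hidden layers of 4 and 3 neurons, 1 output
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                    # input column vector
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(3, 4)), np.zeros((3, 1))
W3, b3 = rng.normal(size=(1, 3)), np.zeros((1, 1))

# Forward propagation, mirroring the equations above
z2 = W1 @ x + b1;  a2 = sigmoid(z2)            # first hidden layer
z3 = W2 @ a2 + b2; a3 = sigmoid(z3)            # second hidden layer
z4 = W3 @ a3 + b3; y_hat = sigmoid(z4)         # output layer prediction ŷ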
In the last pass of forward-propagation, we evaluate the predicted value ŷ against actual output, y.
This evaluation is done mainly by Loss Function, which can be as simple as Mean Squared Error(MSE) or some complex computation such as cross-entropy.
Hence, L = Loss-Function(ŷ,y)
Mean Squared Error (MSE) = 1/n * ∑ (ŷ - y)²
Now, in Back Propagation method, we minimise our loss-function by repeatedly adjusting our weight and bias parameters.
This adjustment of parameters is done with the help of gradients of loss function with respect to these parameters by the method of chain rule.
So finally, our updated parameters become W := W - α * ∂L/∂W and b := b - α * ∂L/∂b, where α is the learning rate.
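To see the whole loop in one place, here is a minimal sketch that trains a single sigmoid neuron on one made-up sample with a squared-error loss; the input, target, and learning rate are illustrative.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = np.array([0.5, -1.0]), 1.0      # one training example and its target
w, b = np.array([0.1, 0.2]), 0.0       # initial parameters
lr = 0.1                               # learning rate (alpha)

for epoch in range(100):
    # Forward propagation
    z = np.dot(w, x) + b
    y_hat = sigmoid(z)
    loss = (y_hat - y) ** 2            # squared-error loss for this sample

    # Backward propagation via the chain rule:
    # dL/dz = dL/dŷ * dŷ/dz = 2(ŷ - y) * ŷ(1 - ŷ)
    dz = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    dw, db = dz * x, dz                # dL/dw and dL/db

    # Gradient-descent update of the parameters
    w, b = w - lr * dw, b - lr * db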
Vanishing Gradient Problem
When we train a Deep Neural Network using a back-propagation algorithm,
we basically calculate the gradient of the loss with respect to the weight matrices and then subtract it (scaled by the learning rate) from the respective weight matrices so that the predictions move closer to the actual output.
But when the gradient becomes negligible, subtracting it barely changes the weights, and hence the model stops learning. This problem is called the Vanishing Gradient Problem.
This can be mathematically shown as below:-
Firstly, let us observe the plot of the sigmoid function and its derivative.
From this graph, it is clear that the derivative of the sigmoid reaches a maximum value of 0.25 (at z = 0, since sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))).
Hence, the derivative varies as: 0 < sigmoid'(z) <= 0.25
Now, let us consider a neural network with 4 hidden layers and a single neuron in each layer; in the back-propagation method, we calculate the gradient of the loss function with respect to the weight parameter W¹.
Since each sigmoid derivative lies in (0, 0.25], the chain rule multiplies many such small values together, which makes the gradient extremely small and the model almost stops learning, as the short sketch below illustrates.
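Here is a tiny numerical sketch of the effect: even in the best case where every sigmoid derivative equals 0.25, chaining ten such factors together already shrinks the gradient to almost nothing. The number of layers is, of course, just an example.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)                # peaks at 0.25 when z = 0

grad = 1.0
for layer in range(10):
    grad *= sigmoid_derivative(0.0)   # best case: derivative at z = 0 is 0.25
print(grad)                           # 0.25**10 ≈ 9.5e-07, effectively zero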
Exploding Gradient Problem
This problem mostly occurs when we initialize neurons with high values of weights, which further results in large values of the gradient.
Hence, during training, these large gradients accumulate which causes the model to become unstable and unable to learn from training data.
Since the gradient becomes very large, subtracting it swings the updated weights to extreme values.
Hence, at every epoch, rather than converging to the minimum of the curve, the parameters keep “jumping” around!
Dropout Layers Regularisation in Deep Learning
Deep learning neural networks tend to quickly overfit a training dataset with a small size. This also results in an increase in generalization error.
This overfitting phenomenon can be resolved by randomly dropping out some neurons during training.
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections.
This method is called dropout, and it provides a computationally cheap and effective regularisation method to reduce overfitting and improve generalization in deep neural networks.
Let us get some mathematical intuition behind the Dropout regularisation approach to understand things better.
If we take the expectation E[ ] of the gradients of the dropout network, we find that the expected gradient with dropout is equal to the gradient of the regularised regular network Eɴ if w’ = p*w.
Hence we can conclude that,
In the training phase, each neuron is retained with probability “p” and dropped otherwise, i.e. in every layer we randomly deactivate some neurons.
and,
In the test phase, everything stays interconnected, i.e. no neurons are deactivated, but every weight is multiplied by the value “p”.
We can also use Gaussian-Based Dropout, which will replace the Bernoulli gate with a Gaussian gate. It has been found to work as well as the regular Dropout and sometimes better.
In this method, the expected value of the activation remains unchanged and no weight scaling is required.
Hence, this gives it a computational advantage too!
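To summarise the behaviour described above, here is a minimal NumPy sketch of standard (Bernoulli) dropout, where p is the retention probability, matching the w’ = p*w scaling used at test time; the activation values are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    # Keep each activation with probability p (Bernoulli gate), drop the rest
    mask = rng.random(a.shape) < p
    return a * mask

def dropout_test(a, p):
    # Nothing is dropped at test time; scaling by p keeps the expected
    # activation consistent with training
    return a * p

a = rng.normal(size=5)           # illustrative activations of one layer
print(dropout_train(a, p=0.8))   # some entries zeroed out at random
print(dropout_test(a, p=0.8))    # all entries kept, scaled by p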
Weight initialization techniques in Neural Network
The main objective is to prevent the layer activations (and, in turn, the gradients) from exploding or vanishing during forward and backward propagation.
If we initialize the weights well, then our objective, i.e. optimization of the loss function, will be achieved in the least time; otherwise, converging to a minimum using gradient descent can become extremely slow or fail altogether.
As a rule of thumb, Initialized Weights should be small, different, and have good variance among them.
Before we move on to some methods, let us go through the simple concepts of fan_in and fan_out; a sketch of the commonly used initialization formulas follows the list below.
fan_in of a neuron = number of inputs to the neuron
fan_out of a neuron = number of outputs from the neuron
Basic weight initialization techniques
1. Uniform Distribution
2. Xavier Distribution
- This distribution works well with the Sigmoid function
- Decreases the probability of the gradient vanishing/exploding problem
- This method is not useful when the activation function is non-differentiable
- Dying neuron problems can occur during the training.
2.1 Xavier Normal
2.2 Xavier Uniform
3. He initialization
- This technique is mostly used in ReLU activation functions
- This technique solves dying neuron problems
- But this method is less useful for layers with activation functions other than ReLU or Leaky ReLU (for sigmoid or tanh, Xavier initialization is usually preferred)
3.1 He Uniform
3.2 He Normal
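As a quick reference, here is a NumPy sketch of the commonly used Xavier and He formulas; the fan_in and fan_out values are illustrative and correspond to one dense layer.

import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 100   # illustrative layer sizes

# Xavier normal: std = sqrt(2 / (fan_in + fan_out)), pairs well with sigmoid/tanh
W_xavier_normal = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

# Xavier uniform: limit = sqrt(6 / (fan_in + fan_out))
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_xavier_uniform = rng.uniform(-limit, limit, size=(fan_out, fan_in))

# He normal: std = sqrt(2 / fan_in), commonly used with ReLU
W_he_normal = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

# He uniform: limit = sqrt(6 / fan_in)
limit_he = np.sqrt(6.0 / fan_in)
W_he_uniform = rng.uniform(-limit_he, limit_he, size=(fan_out, fan_in))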
Gradient Descent Variants
As seen previously, Gradient descent is a way to minimize our cost function parameterized by the model’s parameters by updating the parameters in the opposite direction of the gradient of the cost function. The learning rate determines the size of the steps we take to reach a (local) minimum.
We can state 3 variants of gradient descent, and we can choose any of them depending upon the amount of data we have.
1. Batch Gradient Descent
It computes the gradient of the cost function w.r.t. the weight parameters for the entire training dataset. Since the gradient is computed over the whole dataset before a single update, batch gradient descent can be very slow, and large datasets may not fit into memory at all.
Hence, models using batch gradient descent can't be updated in real time.
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
2. Stochastic Gradient Descent
SGD performs a parameter update for each training example, one at a time. Because of a single update at a time, models using SGD can be updated in real-time.
In contrast to batch gradient descent, SGD approaches the minimum with some noise/fluctuations, as the frequent single-example updates keep overshooting.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
3. Mini-Batch Stochastic Gradient Descent
It is the most popular method used in many deep learning algorithms. It performs an update for every mini-batch of “k” training examples.
Hence, it provides more stable convergence due to reduced variance of parameter updates.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
Gradient Descent Optimisation Algorithms
SGD with Momentum
SGD takes more time to reach global minimum because of noisy data in updates. Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction γ of the update vector of the past time step to the current update vector:
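The update rule is v_t = γ * v_(t-1) + η * ∇L(θ), followed by θ = θ - v_t. Here is a minimal, self-contained sketch that applies momentum to a simple one-dimensional quadratic loss L(w) = w²; the starting point, γ, and learning rate are illustrative.

def grad(w):
    return 2 * w                   # gradient of L(w) = w**2

w, v = 5.0, 0.0                    # parameter and velocity (past update vector)
gamma, lr = 0.9, 0.1               # momentum fraction γ and learning rate η

for step in range(200):
    v = gamma * v + lr * grad(w)   # add a fraction of the previous update
    w = w - v                      # move in the accelerated, dampened direction
print(w)                           # has converged very close to the minimum at w = 0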
Adagrad Optimiser
Until now, we have seen that the learning rate stays the same for all variants of gradient descent, and for all layers, neurons, and iterations.
The main idea behind the Adagrad optimiser is to use a dynamic learning rate that differs across parameters (layers and neurons) and adapts over iterations.
The main drawback here is an accumulation of the squared gradients in the denominator. Since every added term is +ve, the accumulated sum keeps growing during training which in turn causes the learning rate to become infinitesimally small, due to which the algorithm is no longer able to acquire additional knowledge.
This drawback is resolved in RMS Optimiser
RMS Optimiser
In the Adagrad optimiser, the learning rate is meant to become smaller, but because of the ever-growing accumulated sum during training, it was becoming vanishingly small.
RMSprop restricts this accumulation by using an exponentially decaying average of the squared gradients instead of the full sum; with this restriction, the effective learning rate decreases slowly and does not shrink to zero during training.
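Here is a minimal sketch of the RMSprop-style update on the same one-dimensional quadratic loss used above; the learning rate, decay factor beta, and number of steps are illustrative.

import numpy as np

def grad(w):
    return 2 * w                            # gradient of L(w) = w**2

w, cache = 5.0, 0.0
lr, beta, eps = 0.05, 0.9, 1e-8

for step in range(300):
    g = grad(w)
    # Exponentially decaying average of squared gradients, instead of
    # Adagrad's ever-growing sum, so the effective step size does not
    # collapse to zero
    cache = beta * cache + (1 - beta) * g ** 2
    w = w - lr * g / (np.sqrt(cache) + eps)
print(w)                                    # has moved close to the minimum at w = 0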
ADAM Optimiser
Adaptive Moment Estimation (Adam) is the best and the most popular optimizer being used these days. Adam can be looked at as a combination of RMSprop and Stochastic Gradient Descent with momentum. It uses the squared gradients to scale the learning rate like RMSprop and it takes advantage of momentum by using the moving average of the gradient instead of the gradient itself like SGD with momentum.
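Below is a minimal sketch of the Adam update on the same toy quadratic loss, combining the momentum-style first moment with the RMSprop-style second moment and the usual bias correction; all hyperparameter values are illustrative.

import numpy as np

def grad(w):
    return 2 * w                                  # gradient of L(w) = w**2

w = 5.0
m, v = 0.0, 0.0                                   # first and second moment estimates
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g               # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g ** 2          # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)                                          # has moved towards the minimum at w = 0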
We can also compare the training cost for the following optimizers used on MNIST Multilayer Neural Network with dropout regularisation.
Implementation of a NN to recognize handwritten digit images using Keras
In the final section of this blog, we will get hands-on experience of implementing a simple neural network to recognize handwritten digit images and classify them in the range 0–9.
For this purpose, we will use the MNIST dataset available in the Keras dataset library.
Understanding the data and initialization process:
1. Dataset contains 60,000 28x28 grayscale images of the 10 digits, along with a test set of 10,000 images.
2. Pixel value of each image ranges from [0,255],0: Black, and 255: White
3. Each image matrix of size 28x28 is flattened into a 1-dimensional array of size 784x1 and fed into the input layer of our neural network.
4. The output layer has 10 neurons for the digits 0–9 and uses a sigmoid function which outputs a score between 0 and 1 for each class, determining to which class our image belongs.
5. First, we will solve this problem by using only the input and output layer and in the next step, we will try to optimize it by adding hidden layers and compare the performance of the models
Dataset used: MNIST dataset from Keras Dataset library
# Importing Modules
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np

(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

print("Number of records in training data:", len(X_train))
print("Number of records in test data:", len(X_test))
print("Size of image (in pixels):", X_train[0].shape)
Now, let us plot a sample random image from our dataset using matplotlib library and compare it to its corresponding output in training data.
plt.matshow(X_train[1000])
print("Output to the given same data point i.e. 1000th data point:" , y_train[1000])
Output to the given same data point i.e. 1000th data point: 0
Let us now flatten each image into a 1-D array.
So, originally we had an array of shape (60000, 28, 28), and finally we will have (60000, 784).
Here, we also divide/scale the whole data by 255 so that all pixel values are in the range [0, 1].
# Scaling and flattening of data
X_train = X_train / 255
X_test = X_test / 255
X_train_flatten = X_train.reshape(len(X_train), 28*28)
X_test_flatten = X_test.reshape(len(X_test), 28*28)
Designing a simple dense neural network with no hidden layers
Output Layer: 10 neurons
Input Layer: 784 neurons
Activation function : Sigmoid Function
Loss Function: sparse categorical cross-entropy, which computes the cross-entropy loss between the labels and the predictions
Optimizer: Adam
Metrics: Accuracy
model = keras.Sequential([
    keras.layers.Dense(10, input_shape=(784,), activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train_flatten, y_train, epochs=5)
We can observe here as the number of epochs increases, accuracy also increases.
Here, the accuracy after 5 epochs came out to be approximately 92%, which means our model predicts the correct digit about 92% of the time!
Now, we will evaluate our model on our Test Data and observe the results.
model.evaluate(X_test_flatten, y_test)
We can also test our model on sample data (let 88th image) and compare the results.
plt.matshow(X_test[88])
y_predicted = model.predict(X_test_flatten)
y_predicted[88] # prediction for 88th image
Here, y_predicted[88] gives 10 scores, one for each of the 10 classes from 0 to 9, and from these we have to choose the class with the maximum score.
np.argmax(y_predicted[88])
Hence, we got the desired output (=6) from our model! Now , we can also build our confusion matrix with the help of the TensorFlow module and visually observe it.
y_predicted_labels = [np.argmax(i) for i in y_predicted]

confusionMatix = tf.math.confusion_matrix(labels=y_test, predictions=y_predicted_labels)

import seaborn as sn
plt.figure(figsize = (10,7))
sn.heatmap(confusionMatix, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
From the above diagram, we can interpret many things, such as:
959 times our model predicted “0” , when the output was actually “0”
9 times our model predicted “1” , when the output was actually “2”
So, the diagonal elements of our confusion matrix count the correct predictions, while all the non-diagonal elements are errors!
Now, let us try to improve our model by adding a hidden layer and evaluate our model again.
# Hidden layer with ReLU activation function and randomly chosen 100 neurons
model = keras.Sequential([
    keras.layers.Dense(100, input_shape=(784,), activation='relu'),
    keras.layers.Dense(10, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train_flatten, y_train, epochs=5)
model.evaluate(X_test_flatten,y_test)
y_predicted = model.predict(X_test_flatten)
y_predicted_labels = [np.argmax(i) for i in y_predicted]
confusionMatix = tf.math.confusion_matrix(labels=y_test, predictions=y_predicted_labels)

plt.figure(figsize=(10, 7))
sn.heatmap(confusionMatix, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
Here, using the hidden layer, the accuracy of our model increased to about 97%, and the non-diagonal (error) counts have also decreased!
Keras also comes with a special Flatten layer, so that we don't have to call .reshape on the input dataset.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(10, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10)
model.evaluate(X_test,y_test)
That’s all for the blog.
Hope you enjoyed reading it! :)