Introduction to Machine Learning and How It Works

Artificial Intelligence

Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction.

Machine Learning

Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

In traditional programming, data and a program are run on the computer to produce the output. In machine learning, data and the expected output are run on the computer to create a program. That program can then be used in traditional programming.
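The contrast can be sketched in a few lines. This is only an illustration: the Fahrenheit-to-Celsius task, the function name and the sample values are assumptions chosen for the example and do not come from the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: a human writes the rule (the program).
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

# Machine learning: we supply data (inputs) and the matching outputs,
# and the algorithm produces the "program" (here a slope and an intercept).
fahrenheit = np.array([[32.0], [50.0], [68.0], [86.0], [104.0]])  # illustrative inputs
celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])                 # known outputs

model = LinearRegression().fit(fahrenheit, celsius)
learned_slope = model.coef_[0]
learned_intercept = model.intercept_

# The learned program can now be used like a traditional one.
prediction = model.predict(np.array([[212.0]]))[0]
print(learned_slope, learned_intercept, prediction)
```

The learned slope and intercept should come out very close to the hand-written rule, because the sample points lie exactly on that line.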

Machine learning algorithms are often categorized as supervised or unsupervised.

Supervised Learning

Supervised learning trains the machine on well-labelled data, that is, data in which each example is already tagged with the correct answer. The supervised learning algorithm analyses this training data (the set of training examples) and is then given a new set of examples, for which it predicts an outcome based on what it learned from the labelled data.

Classification algorithms and regression algorithms are types of supervised learning. Classification algorithms are used when the outputs are restricted to a limited set of values. For a classification algorithm that filters emails, the input would be an incoming email, and the output would be the name of the folder in which to file the email. For an algorithm that identifies spam emails, the output would be the prediction of either “spam” or “not spam”, represented by the Boolean values true and false. Regression algorithms are named for their continuous outputs, meaning they may have any value within a range. Examples of a continuous value are the temperature, length, or price of an object.
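As a rough sketch of the difference, the snippet below trains one classifier and one regressor with scikit-learn. The feature (a count of suspicious words per email) and all data values are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the output is one of a limited set of labels.
# Hypothetical feature: number of suspicious words in an email.
suspicious_words = np.array([[0], [1], [2], [3], [10], [11], [12], [13]])
is_spam = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = not spam, 1 = spam

classifier = LogisticRegression().fit(suspicious_words, is_spam)
labels = classifier.predict(np.array([[1], [12]]))

# Regression: the output is a continuous value.
# Hypothetical data that follows price = 2 * size + 1 exactly.
size = np.array([[1.0], [2.0], [3.0], [5.0]])
price = np.array([3.0, 5.0, 7.0, 11.0])

regressor = LinearRegression().fit(size, price)
predicted_price = regressor.predict(np.array([[4.0]]))[0]

print(labels, predicted_price)
```

The classifier can only ever answer 0 or 1, while the regressor can return any real number.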

Unsupervised Learning

Unsupervised learning trains a machine on information that is neither classified nor labelled, allowing the algorithm to act on that information without guidance. The machine's task is to group unsorted information according to similarities, patterns and differences without any prior training on labelled data. The most common unsupervised learning method is cluster analysis, or clustering, which is used in exploratory data analysis to find hidden patterns or groupings in data.
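A minimal clustering sketch, assuming k-means from scikit-learn as the clustering method (the article does not name a specific algorithm) and a handful of made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled, made-up points that form two visibly separate groups.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],   # another natural group
])

# Ask k-means for two clusters; no labels are supplied.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

print(cluster_ids)
```

The numeric cluster ids are arbitrary. What matters is that points from the same group receive the same id.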

Some simple Machine Learning algorithms

Linear Regression

Here, we establish a relationship between independent and dependent variables by fitting the best line. It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on one or more continuous variables.

The model below is used to predict ice cream sales based on the temperature in a city.

We need a weight (w) and a bias (b) to fit a straight line (y = wx + b), and this can be represented diagrammatically as given below:

The diagram above is the simplest neural network. A neural network is a system of hardware and/or software patterned after the operation of neurons in the human brain.
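The straight line y = wx + b can be written as a single "neuron" with one weight and one bias. The sketch below uses hypothetical points that lie exactly on y = 3x + 2 and recovers w and b with NumPy's least-squares polynomial fit. The function name linear_neuron is an assumption made for this example.

```python
import numpy as np

def linear_neuron(x, w, b):
    """One input, one weight, one bias, no activation: y = w*x + b."""
    return w * x + b

# Hypothetical points lying exactly on y = 3x + 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept].
w, b = np.polyfit(x, y, 1)

print(w, b, linear_neuron(5.0, w, b))
```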

Logistic Regression

Logistic regression is a classification algorithm used to estimate discrete binary values (like 0/1, yes/no, true/false) based on a given set of independent variables. Typically, this involves fitting a curve to separate 2 distinct classes of data points.

The neural network for logistic regression takes multiple inputs, each with its own weight, plus a bias, and has 2 output nodes (one per class), as shown below:
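One way to sketch such a two-output network is a single layer followed by a softmax, which turns the two output values into class probabilities. The softmax choice and all the numbers are assumptions for illustration, and the weights are not trained.

```python
import numpy as np

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)          # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def two_class_network(x, W, b):
    """One layer: 2 output nodes, each a weighted sum of the inputs plus a bias."""
    z = W @ x + b
    return softmax(z)

# Hypothetical input with 3 features and hand-picked (untrained) parameters.
x = np.array([2.0, -1.0, 0.5])
W = np.array([[0.4, -0.2, 0.1],     # weights feeding output node 0
              [-0.3, 0.5, 0.2]])    # weights feeding output node 1
b = np.array([0.1, -0.1])

probabilities = two_class_network(x, W, b)
predicted_class = int(np.argmax(probabilities))

print(probabilities, predicted_class)
```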

Deep Learning

Deep learning is a specific method of machine learning, and it’s based primarily on the use of neural networks.

In traditional supervised machine learning, systems require an expert to use his or her domain knowledge to specify the information (called features) in the input data that will best lead to a well-trained system. In Deep Learning, rather than specifying the features in our data that we think will lead to the best classification accuracy, we let the machine find this information on its own. Often, it can look at the problem in a way that even an expert wouldn’t have been able to imagine.

Neural Network Terminology

Activation function

The activation function of a node defines the output of that node, or “neuron”, given an input or set of inputs. This output is then used as input for the next node, and so on until a desired solution to the original problem is found. Some of the commonly used activation functions are given below.
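Three widely used activation functions (sigmoid, tanh and ReLU) can be sketched with NumPy as follows. This selection is an assumption and is not necessarily the same set the article originally illustrated.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes any real value into the range (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Passes positive values through and replaces negatives with 0."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))
```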

Input / Output / Hidden Layers

As the name suggests, the input layer receives the input and is the first layer of the network. The output layer generates the output and is the final layer of the network. The processing layers are the hidden layers within the network. These hidden layers perform specific tasks on the incoming data and pass their output on to the next layer. The input and output layers are visible to us, while the intermediate layers are hidden.

Forward propagation

Forward propagation refers to the movement of the input through the hidden layers to the output layer. In forward propagation, the information travels in a single direction: forward. The input layer supplies the input to the hidden layers, and then the output is generated. There is no backward movement.
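A minimal sketch of forward propagation through one hidden layer and one output layer. The layer sizes, the sigmoid activation and the random (untrained) weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass the input through each (weights, bias) pair in order, never backwards."""
    activation = x
    for W, b in layers:
        activation = sigmoid(W @ activation + b)
    return activation

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # input (3 values) -> hidden (4 nodes)
    (rng.normal(size=(1, 4)), np.zeros(1)),  # hidden (4 nodes) -> output (1 node)
]

x = np.array([0.5, -1.2, 3.0])
output = forward(x, layers)
print(output)
```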

Cost / Loss function

When we build a network, the network tries to predict the output as close as possible to the actual value. We measure this accuracy of the network using the loss function. The loss function penalizes the network when it makes errors. Our objective while running the network is to increase our prediction accuracy and to reduce the error, hence minimizing the loss function. The most optimized output is the one with the least value of the loss function. If we define the loss function to be the mean squared error, it can be written as:

$C = \frac{1}{m} \sum_{i=1}^{m} (y_i - a_i)^2$, where m is the number of training inputs, $a_i$ is the predicted value and $y_i$ is the actual value of the i-th example.

The learning process revolves around minimizing the cost.
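The mean squared error cost can be computed directly from the actual values y and the predicted values a. The numbers below are made up for illustration.

```python
import numpy as np

def mse_cost(y, a):
    """Mean squared error: average of (actual - predicted)**2 over all examples."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    return np.mean((y - a) ** 2)

y = [3.0, 5.0, 7.0]   # actual values (illustrative)
a = [2.5, 5.0, 8.0]   # predicted values (illustrative)

cost = mse_cost(y, a)
print(cost)  # (0.25 + 0 + 1) / 3
```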

Gradient Descent

Gradient descent is an optimization algorithm for minimizing the cost. Intuitively, when climbing down a hill you should take small steps and walk down instead of jumping down all at once. So, if we start from a point x, we move down a little, i.e. by delta h, update our position to x - delta h, and keep doing the same until we reach the bottom. Consider the bottom to be the minimum-cost point.

Mathematically, to find the local minimum of a function one takes steps proportional to the negative of the gradient of the function.
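A small sketch of this rule on a one-variable function. The function f(x) = (x - 3)^2, the starting point and the step size are assumptions chosen so that the minimum is known in advance (x = 3).

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

# f(x) = (x - 3)**2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(x_min)
```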

Learning Rate

The rate at which we descend towards the minimum of the cost function is the learning rate. We should choose the learning rate carefully: it should be neither so large that the optimal solution is missed nor so small that the network takes forever to converge.
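The effect of the learning rate can be seen on a simple function. The sketch below runs the same number of gradient descent steps on f(x) = x^2 with three learning rates. The function, the starting point and the three rates are assumptions for illustration.

```python
def descend(learning_rate, start=10.0, steps=50):
    """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

# Too small, reasonable, and too large learning rates.
results = {lr: descend(lr) for lr in (0.01, 0.1, 1.1)}
for lr, x in results.items():
    print(f"learning rate {lr}: x after 50 steps = {x:.4g}")
```

With the smallest rate, x is still far from the minimum at 0 after 50 steps. With the middle rate it is very close. With the largest rate every step overshoots and x moves further away.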

Backpropagation

When we define a neural network, we assign random weights and bias values to our nodes. Once we have received the output for a single iteration, we can calculate the error of the network. This error is then fed back to the network along with the gradient of the cost function to update the weights of the network. The weights are updated so that the errors in subsequent iterations are reduced. This updating of weights using the gradient of the cost function is known as backpropagation.
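For a single sigmoid neuron with a squared-error cost, the gradient can be derived with the chain rule and checked against a numerical estimate. This is a sketch under those assumptions. The starting values for w, b, x and y are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    """Squared error of one sigmoid neuron on one example."""
    a = sigmoid(w * x + b)
    return (y - a) ** 2

def gradients(w, b, x, y):
    """Chain rule: dC/dw = dC/da * da/dz * dz/dw (and likewise for b)."""
    a = sigmoid(w * x + b)
    dC_da = -2.0 * (y - a)
    da_dz = a * (1.0 - a)
    dC_dz = dC_da * da_dz
    return dC_dz * x, dC_dz      # dz/dw = x, dz/db = 1

w, b, x, y = 0.5, 0.1, 2.0, 1.0  # arbitrary illustrative values
dw, db = gradients(w, b, x, y)

# Numerical check of dC/dw with a central difference.
eps = 1e-6
numeric_dw = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)

# One update step: move the parameters against the gradient.
learning_rate = 0.5
cost_before = cost(w, b, x, y)
w_new = w - learning_rate * dw
b_new = b - learning_rate * db
cost_after = cost(w_new, b_new, x, y)

print(dw, numeric_dw, cost_before, cost_after)
```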

Steps in training a Neural Network

i. Initialize weights and biases.
ii. Forward propagation: Using the input X, weights W and biases b, for every layer we compute Z and A, the linear and non-linear activations. At the final layer, we compute f(A^(L-1)), which could be a sigmoid, softmax or linear function of A^(L-1), and this gives the prediction y_hat.
iii. Compute the loss function: This is a function of the actual label y and the predicted label y_hat. It captures how far off our predictions are from the actual target. Our objective is to minimize this loss function.
iv. Backward propagation: In this step, we calculate the gradients of the loss function f(y, y_hat) with respect to A, W and b, called dA, dW and db. Using these gradients, we update the values of the parameters from the last layer to the first.
v. Repeat steps ii to iv for n iterations/epochs until we feel we have minimized the loss function without overfitting the training data.
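The steps above can be sketched with plain NumPy for a network with one hidden layer. Everything specific here is an assumption made for illustration: the XOR toy data, the layer sizes, the tanh and sigmoid activations, the cross-entropy loss, the learning rate and the number of epochs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (XOR), one example per column: X is (2, 4), Y is (1, 4).
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])
m = X.shape[1]

# Initialize weights and biases.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 2))
b1 = np.zeros((4, 1))
W2 = rng.normal(scale=0.5, size=(1, 4))
b2 = np.zeros((1, 1))

learning_rate = 0.1
losses = []
for epoch in range(2000):
    # Forward propagation: linear step Z, then non-linear activation A.
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    y_hat = sigmoid(Z2)

    # Loss: binary cross-entropy between the labels and the predictions.
    loss = -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
    losses.append(loss)

    # Backward propagation: gradients from the last layer to the first.
    dZ2 = (y_hat - Y) / m
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * (1 - A1 ** 2)      # derivative of tanh
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)

    # Parameter update using the gradients.
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

print(f"loss at start: {losses[0]:.4f}, loss at end: {losses[-1]:.4f}")
```

In practice, libraries such as Keras, TensorFlow or PyTorch compute these gradients automatically. The manual version is shown only to make each step visible.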

Machine Learning using Python

Simple machine learning models like linear regression can be trained using the Python library scikit-learn. Neural networks are built and trained using libraries such as Keras, TensorFlow or PyTorch.

In the simple example below, we build a linear regression model to predict ice cream sales based on temperature. 80% of the available data is used for training, and the remaining 20% is used for testing our model.

  
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# load the dataset
# note: 44500000 is kept exactly as given; it lies far outside the range of
# the other sales figures, so check it before relying on the fitted line
ice_cream_data = {
    'Temperature_in_Fahrenheit': [58, 62, 52, 60, 66, 74, 68, 80, 76, 74, 64],
    'Ice_Cream_sales': [215, 325, 185, 332, 406, 522, 412, 614, 544, 44500000, 408],
}

df = pd.DataFrame(ice_cream_data, columns=['Temperature_in_Fahrenheit', 'Ice_Cream_sales'])

X = df[['Temperature_in_Fahrenheit']]
Y = df['Ice_Cream_sales']

# split X and Y into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

# create linear regression object
reg = LinearRegression()
# train the model using the training set
reg.fit(X_train, y_train)
# predict sales for the test set
y_predict = reg.predict(X_test)
# report the R^2 score on the test set
print("R^2 on test data:", r2_score(y_test, y_predict))

# plot residual errors in training data
train_predict = reg.predict(X_train)
plt.scatter(train_predict, train_predict - y_train, color="green", s=10, label='Train data')
# plot residual errors in test data
plt.scatter(y_predict, y_predict - y_test, color="blue", s=10, label='Test data')
# plot line for zero residual error
plt.hlines(y=0, xmin=0, xmax=2000, linewidth=2)
# plot legend
plt.legend(loc='upper right')
# plot title
plt.title("Residual errors")
# show the plot
plt.show()