What is Synthetic New Relic?:

New Relic Synthetics is a set of automated scriptable tools to monitor the websites, critical business transactions and API endpoints. A detailed individual results from each monitor run can also be viewed. With access to New relic Insights, in-depth queries of data can be run from Synthetics monitors. Creation of custom dashboards are also possible.

Features of Synthetic New Relic:
  • Easy to set up real time instrumentation and analytics
  • REST API functions
  • Real browsers
  • Comparative charting with Browser
  • New Relic Insights support
  • Advanced scripted monitoring
  • Global test coverage
Different types of Synthetic Monitor:

There are four types of monitor.

a) Ping monitor:

Ping monitors are the simplest type of monitor. These monitors are used to check if an application is online. The Synthetics ping monitor uses a simple Java HTTP client to make requests to your site.

b) API tests:

API tests are used to monitor API endpoints. This can ensure that the app server works in addition to the corresponding website. New Relic uses the “http request module” internally to make HTTP calls to API endpoint and validate the results.

c) Browser:

Simple browser monitors essentially are simple, pre-built scripted browser monitors. These monitors make a request to the site using an instance of Google Chrome.

d) Script_Browser:

Scripted browser monitors are used for more sophisticated, customized monitoring. A custom script can be created to navigate to the website, take specific actions and ensure that the specific resources are present.

Creation of Synthetic Monitor:

API Test Monitor:

Step 1:

  • Login to new relic monitor

Step 2 - Create synthetic monitor

  • Click “synthetic” in new relic dashboard after click on the “Add new” in the right up corner.

Step 3: Enter the Required Details

  • Select on “API Test” in monitor type.
  • Enter the monitor name under details
  • Select one location for the monitor under monitoring locations.
  • Set the Schedule – Set frequency for monitoring. For example On selecting frequency as 10 mins, The monitor would run this monitor and check for every 10 mins.
  • Set Notification – Notification to email ids can be set with help of new alert policy or can be appended to existing alert policy. In case of existing alert policy, Click on “Add to an existing alert policy” and the existing policy can be selected. In case of new policy, email address and policy name has to be given. There are three type of policy,
    1. By Policy – Only one open incident at a time for this alert policy.
    2. By Condition – Only one open incident at a time per alert condition
    3. By condition and entity – open an incident every time a condition is violated.
  • Only on completing the above steps, Script can be written by clicking on “Write Your script”
  • Click on “create monitor” after the monitor creation steps done.
PING Monitor:

Step 1:

  • Login to new relic monitor

Step 2 - Create synthetic monitor

  • Click “synthetic” in new relic dashboard after click on the “Add new” in the right up corner.

Step 3: Enter the Required Details

  • Select on “API Test” in monitor type
  • Enter the monitor name under details
  • Enter the URL and enter the response corresponding URL
  • Select one location for the monitor under monitoring locations.
  • Set the Schedule – Set frequency for monitoring. For example On selecting frequency as 10 mins, The monitor would run this monitor and check for every 10 mins.
  • Set Notification – Notification to email ids can be set with help of new alert policy or can be appended to existing alert policy. In case of existing alert policy, Click on “Add to an existing alert policy” and the existing policy can be selected. In case of new policy, email address and policy name has to be given. There are three type of policy,
    1. By Policy – Only one open incident at a time for this alert policy.
    2. By Condition – Only one open incident at a time per alert condition.
    3. By condition and entity – open an incident every time a condition is violated.
  • Only on completing the above steps, Ping monitor gets created when clicking on “ Create Monitor”

Synthetic Monitor Functionality:

API Test:
Pass Scenario:

Below script is used to store the data using Post method, then pass the value to the call back function .Call back function is nothing but it is a function is passed into another function as an argument.

Here, call back function has three arguments like error, response and body.

In the below script, comparing the value “gear” and “10” with JSON body value. Both the values are same. Hence no assertion error is triggered.

In case of value mismatch, an assertion error is thrown.

Failure scenario:

In the below script, the values do not match with the JSON body value. Hence an assertion error is thrown.

In case of assertion error, an alert will be sent to the mail id given in the notification channel. The Assertion error will not be resolved until the Value is made “10”.

Mail Alert: (Ping & API Test)

The error log can be seen as below:

After the error is fixed, an update would be sent to the notification channel

Delete a Monitor: (Ping & API Test)
  • From the Monitors list, select the monitor which needs to delete.
  • In the selected monitor, under settings click on General to view the monitor settings page.
  • Select the trash icon, it will show alert popup and click on “ok” in alert popup then monitor will delete.

What is Cross Browser Testing?

Cross Browser Testing is a type of Functional Test to check whether web application works as expected on different browsers.


Cross-browser testing is basically running the same set of test cases multiple times on different browsers.

Below two are the most intent of cross-browser testing,

Below two are the most intent of cross-browser testing,
  1. Below two are the most intent of cross-browser testing,
  2. Appearance of the page in different browsers- is it the same, is it different, if one is better than the other, etc

Note: In recent years, testing mobile browsers are included on the Cross-Browser testing scope.

When this testing can be started?

Any testing reaps the best benefits when it is done early on. Therefore, the industry recommendation is to start with it as soon as the page designs are available. Because finding and fixing bugs on early stages are very cost effective. Finding bugs after release or completion of application will not be a cost effective one.

Cross Browser testing through Manual:

Sure, it can be done manually. First, business needs to identify all browsers that the application needs to support. Tester need to run all the testcase against every identified browser and observe whether the appearance and functionality are same.

Through manual testing, it is not possible to cover many browsers and its major versions. So, performing cross browser testing manually will be costly and time-consuming too.

In an Agile world it’s not a good advice to do whole cross browser testing through manual.

Cross Browser testing through Automation:

As stated above, Cross-browser testing is basically running the same set of test cases multiple times on different browsers. This type of repeated task is best suited for automation. Thus, it’s more cost and time effective to perform this testing by using tools.

Selenium for Cross Browser Testing:

Selenium is well known for automated testing of the web-based applications. Just by changing the browser to be used for running the test cases, selenium makes it very easy to run the same test cases multiple times using different browsers.

Note: Rest of this blog we are going to see how Selenium can be used for Cross-Browser Testing.

Advantages of choosing Selenium:
  • Open source
  • Supports programming languages like Java, Perl, Python, C#, Ruby, Groovy, Java Script, etc
  • Platform Independent: Supports (OS) like Windows, Mac, Linux, UNIX, etc.
  • Supports multiple browsers namely, Internet Explorer, Chrome, Firefox, Opera, Safari, etc
  • Ease of implementation
  • Reusability

By using TestNG along with Selenium Grid we can achieve parallel test execution on different browser in different machines. Let’s see TestNG and Selenium Grid on the following topics,


TestNG is an automation testing framework in which NG stands for "Next Generation". TestNG is inspired from JUnit which uses the annotations (@). Default Selenium tests do not generate a proper format for the test results. Using TestNG we can generate test results.

Why TestNG?
  • Multiple test cases can be grouped easily by converting them into testng.xml file. In which you can make priorities which test case should be executed first.
  • The same test case can be executed multiple times without loops just by using keyword called 'invocation count.'
  • Using TestNG, you can execute multiple test cases on multiple browsers
  • It can be easily integrated with tools like Maven, Jenkins, etc.
Selenium Grid

Selenium Grid is a part of the Selenium Suite which specialise in running multiple tests across different browsers, operating system and machines. You can connect to it with Selenium Remote by specifying the browser, browser version, and operating system you want

Components of Selenium Grid

In Selenium Grid, the HUB is a computer which is the central point where we can load our tests into. Hub also acts as a server because of which it acts as a central point to control the network of Test machines. The Selenium Grid has only one hub and it is the master of the network.


In Selenium Grid, a NODE is referred to a Test Machine which opts to connect with the Hub. This test machine will be used by Hub to run tests on. A Grid network can have multiple nodes. A node is supposed to have different platforms i.e. different operating system and browsers. The node does not need the same platform for running as that of hub.

Advantages of Selenium Grid
  • Selenium Grid allows running multiple tests across different web browsers, operating systems, and machines. This ensures compatibility of the application under test across multiple combinations of web browsers, operating system, and hardware architecture
  • It speeds up the test suite completion time as it can run multiple tests in parallel. For example, if we have 10 nodes and we need to execute a test suite of 50 tests then it is going to take 10 times lesser time than a single machine that runs this test suit without Selenium Grid.
Disadvantage of Selenium Grid
  • Extra cost to project as it requires additional machines as Nodes
Grid Code Snippets:

Continuous Integration (CI) is a development practice where developers integrate code into a shared repository frequently, preferably several times a day. Each integration can then be verified by an automated build and automated tests. While automated testing is not strictly part of CI it is typically implied.

One of the key benefits of integrating regularly is that you can detect errors quickly and locate them more easily. As each change introduced is typically small, pinpointing the specific change that introduced a defect can be done quickly.

In recent years CI has become a best practice for software development and is guided by a set of key principles. Among them are revision control, build automation and automated testing.

Benefits and Advantages of Continuous Integration

Continuous Integration has many benefits. A good CI setup speeds up your workflow and encourages the team to push every change without being afraid of breaking anything. There are more benefits to it than just working with a better software release process. Continuous Integration brings great business benefits as well.

  • Reduces the time and effort for integrations of different code changes
  • Enables a quick feedback mechanism on every change
  • Allows earlier detection and prevention of defects
  • Helps collaboration between team members so recent code is always shared
  • Reduces manual testing effort
  • Building features more incrementally saves time on the debugging side so you can focus on adding features
  • First step into fully automating the whole release process
  • Prevents divergence in different branches as they are integrated regularly
Continuous Integration Tools


Jenkins is a cross-platform open source CI tool written in Java. It offers configuration through both the GUI interface and the console commands. Jenkins is a very flexible tool to use because it offers an extension of features through plugins. Its plugin list is very broad, and one can easily add their own plugins to that list. Furthermore, Jenkins can distribute software builds and test loads on several machines.

Travis CI

Travis CI is an open source CI service free for all open source projects hosted on GitHub. Since Travis CI is hosted, it is platform independent. It is configured using Travis.Yml files which contain actionable data. Travis CI supports a variety of software languages, and the build configuration for each of those languages is complete. Travis CI uses virtual machines to create applications.


TeamCity is a Java-based sophisticated CI tool offered by JetBrains. It supports Java,Net and Ruby platforms. TeamCity has a range of free plugins available developed both by JetBrains and third parties. It also offers integration with several IDEs including, Eclipse, IntelliJ IDEA and Visual Studio. Moreover, TeamCity allows simultaneous running of multiple builds and tests in different platforms and environments.

GitLab CI

GitLab CI is hosted on the free hosting service GitLab.com, and it offers Git repository management function with features such as, access control, bug tracking, and code reviewing. GitLab CI is completely unified with GitLab and it can easily be used to link projects using the GitLab API. GitLab CI process builds are coded in the Go language and can execute on several operating systems such as, Windows, Linux, Docker, OSX, and FreeBSD.


CircleCI is a CI tool hosted only on GitHub. It supports several languages, including Java, Python, Ruby/Rails, Node.js, PHP, Skala and Haskell. It offers services based on containers. CircleCI offers one container free, and any number of projects can be built on it. It offers up to five levels of parallelization (1x, 4x, 8x, 12x and 16x). Therefore, maximum parallelization of 16x can be achieved in one build. CircleCI also supports Docker platform.


Bamboo is a CI tool developed by Atlassian. Bamboo is available in two versions, cloud and server. For the cloud version, Atlassian offers hosting service with the help of Amazon EC2 account. For the server version, self-hosting needs to be done. Bamboo supports well known Atlassian products, JIRA and BitBucket.

Machine Learning

Artificial Intelligence

Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions) and self-correction.

Machine Learning

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

In Traditional Programming, data and program are run on the computer to produce the output. In Machine Learning, data and output are run on the computer to create a program. The program can be used in traditional programming.

Machine learning algorithms are often categorized as supervised or unsupervised.

Supervised Learning

Supervised learning is a learning in which we teach or train the machine using data which is well labelled that means some data is already tagged with correct answer. After that, machine is provided with new set of examples(data) so that supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from labelled data.

Classification algorithms and regression algorithms are types of supervised learning. Classification algorithms are used when the outputs are restricted to a limited set of values. For a classification algorithm that filters emails, the input would be an incoming email, and the output would be the name of the folder in which to file the email. For an algorithm that identifies spam emails, the output would be the prediction of either "spam" or "not spam", represented by the Boolean values true and false. Regression algorithms are named for their continuous outputs, meaning they may have any value within a range. Examples of a continuous value are the temperature, length, or price of an object.

Unsupervised Learning

Unsupervised learning is the training of machine using information that is neither classified nor labelled and allowing the algorithm to act on that information without guidance. Here the task of machine is to group unsorted information according to similarities, patterns and differences without any prior training of data. The most common unsupervised learning method is cluster analysis or clustering, which is used for exploratory data analysis to find hidden patterns or grouping in data.

Some simple Machine Learning algorithms

Linear Regression

Here, we establish a relationship between independent and dependent variables by fitting the best line. It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on a continuous variable(s).

Below model is used to predict the Ice cream sales based on the temperature in a city.

We need a weight(w) and a bias(b) to fit a straight-line (y = wx + b) and this can be diagrammatically represented as given below:

Above diagram is the simplest Neural Network. A neural network is a system of hardware and/or software patterned after the operation of neurons in the human brain.

Logistic Regression

Logistic Regression is a classification algorithm used to estimate discrete binary values (like 0/1, yes/no, true/false) based on given set of independent variables. Typically, this involves fitting a curve to separate 2 distinct classes of data points.

The neural network for logistic regression has multiple weights / bias as inputs and 2 output nodes as shown below:

Deep Learning

Deep learning is a specific method of machine learning, and it’s based primarily on the use of neural networks.

In traditional supervised machine learning, systems require an expert to use his or her domain knowledge to specify the information (called features) in the input data that will best lead to a well-trained system. In Deep Learning, rather than specifying the features in our data that we think will lead to the best classification accuracy, we let the machine find this information on its own. Often, it can look at the problem in a way that even an expert wouldn’t have been able to imagine.

Neural Network Terminology

Activation function

The activation function of a node defines the output of that node, or “neuron”, given an input or set of inputs. This output is then used as input for the next node and so on until a desired solution to the original problem is found. Some of the commonly used activation functions are given below

Input / Output / Hidden Layers

Simply as the name suggests the input layer is the one which receives the input and is essentially the first layer of the network. The output layer is the one which generates the output or is the final layer of the network. The processing layers are the hidden layers within the network. These hidden layers are the ones which perform specific tasks on the incoming data and pass on the output generated by them to the next layer. The input and output layers are the ones visible to us, while are the intermediate layers are hidden.

Forward propagation

Forward Propagation refers to the movement of the input through the hidden layers to the output layers. In forward propagation, the information travels in a single direction FORWARD. The input layer supplies the input to the hidden layers and then the output is generated. There is no backward movement.

Cost / Loss function

When we build a network, the network tries to predict the output as close as possible to the actual value. We measure this accuracy of the network using the loss function. The loss function tries to penalize the network when it makes errors. Our objective while running the network is to increase our prediction accuracy and to reduce the error, hence minimizing the loss function. The most optimized output is the one with the least value of the loss function. If we define the loss function to be the mean squared error, it can be written as –

C= 1/m ∑ (y – a)2 where m is the number of training inputs, a is the predicted value and y is the actual value of that example.

The learning process revolves around minimizing the cost.

Gradient Descent

Gradient descent is an optimization algorithm for minimizing the cost. To think of it intuitively, while climbing down a hill you should take small steps and walk down instead of just jumping down at once. Therefore, what we do is, if we start from a point x, we move down a little i.e. delta h, and update our position to x-delta h and we keep doing the same till we reach the bottom. Consider bottom to be the minimum cost point.

Mathematically, to find the local minimum of a function one takes steps proportional to the negative of the gradient of the function.

Learning Rate

rate at which we descend towards the minima of the cost function is the learning rate. We should choose the learning rate very carefully since it should neither be very large that the optimal solution is missed and nor should be very low that it takes forever for the network to converge.


When we define a neural network, we assign random weights and bias values to our nodes. Once we have received the output for a single iteration, we can calculate the error of the network. This error is then fed back to the network along with the gradient of the cost function to update the weights of the network. These weights are then updated so that the errors in the subsequent iterations is reduced. This updating of weights using the gradient of the cost function is known as back-propagation.

Steps in training a Neural Network
  • Initialize weights and biases.
  • ii. Forward propagation: Using the input X, weights W and biases b, for every layer we compute Z and A, the Linear and Non-linear activations. At the final layer, we compute f(A^(L-1)) which could be a sigmoid, softmax or linear function of A^(L-1) and this gives the prediction y_hat.
  • Compute the loss function: This is a function of the actual label y and predicted label y_hat. It captures how far off our predictions are from the actual target. Our objective is to minimize this loss function.
  • Backward Propagation: In this step, we calculate the gradients of the loss function f(y, y_hat) with respect to A, W, and b called dA, dW and db. Using these gradients, we update the values of the parameters from the last layer to the first.
  • Repeat steps 2–4 for n iterations/epochs till we feel we have minimized the loss function, without overfitting the train data
Machine Learning using Python

Simple Machine Learning models like Linear Regression can be trained using the python library scikit-learn. Neural Networks are built and trained using the libraries Keras, TensorFlow or PyTorch.

In below simple example, we are building a linear regression model to predict the ice cream sales based on temperature. 80% of the available data is used for testing and we are using the remaining 20% data for testing our model.

import matplotlib.pyplot as plt   
import numpy as np   
from sklearn.linear_model import LinearRegression  
from sklearn.metrics import r2_score  
import pandas as pd  
 # load the dataset   
 Stock_Market = {'Temprature_in_Fahrenheit' :[58, 62, 52, 60, 66, 74, 68, 80, 76, 74, 64,],  
 'Ice_Cream_sales': [215,325,185,332,406,522,412,614,544,44500000,408]          
 df = pd.DataFrame(Stock_Market,columns=['Temprature_in_Fahrenheit','Ice_Cream_sales'])  
 X = df[['Temprature_in_Fahrenheit']]  
 Y = df['Ice_Cream_sales']  
 # splitting X and y into training and testing sets   
 from sklearn.model_selection import train_test_split   
 X_train, X_test, y_train, y_test = train_test_split(X, Y, 
 test_size=0.2, random_state=1)
 # create linear regression object   
 reg = LinearRegression()  
 # train the model using the training sets   
 reg.fit(X_train, y_train)  
  y_predict = reg.predict(X_test)  
  ## plotting residual errors in training data   
  plt.scatter(reg.predict(X_train), reg.predict(X_train) - 
  y_train, color = "green", s = 10, label = 'Train data')   
  ## plotting residual errors in test data   
  plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test, 
  color = "blue", s = 10, label = 'Test data')   
  ## plotting line for zero residual error   
  plt.hlines(y = 0, xmin = 0, xmax = 2000, linewidth = 2)   
  ## plotting legend   
  plt.legend(loc = 'upper right')   
  ## plot title   
  plt.title("Residual errors")     
  ## function to show plot   

Spring Initializer – using Java and Maven

Spring Initializr is a web tool which is provided by Spring on official site https://start.spring.io/ We can create Spring Boot project by providing project details.

In the below example, we added the springboot-starter-web dependency to write REST Endpoints.


After providing the Group, Artifact, Dependencies, Build Project, Platform and Version, click Generate Project button. The zip file will get downloaded and the files will be extracted. After the project is downloaded, unzip the file.

The maven file pom.xml will have the Web dependency we had selected above.


Note that only the Spring boot starter parent has a version number. Spring boot starter web doesn’t have a version as it is automatically configured based on version of the parent.

You can find the main class file under src/java/main directories with the default package.


To write a simple Hello World Rest Endpoint in the Spring Boot Application main class file itself, follow the steps shown below:

  • Firstly, add the @RestController annotation at the top of the class.
  • Now, write a Request URI method with @RequestMapping annotation.
  • Then, the Request URI method should return the Hello World string.

Create an executable JAR by executing the below Maven command in the folder having pom.xml
C:\Users\SaravananP\Downloads\demo\mvn clean install


The .jar file will be created in the target folder as indicated above

Run the Jar file using java –jar and verify the results



Application Properties

In the above examples, we have seen that Spring boot automatically configured Tomcat to run in port 8080. We can override this by specifying the port in the file src\main\resources\application.properties


If we rebuild the jar and execute it, we will get an error in http://localhost:8080 and be able to see the Hello World message in http://localhost:9090



Spring Boot

Spring Boot is an open source Java-based framework used to create Micro Services. It is used to build stand-alone and production ready spring applications.

What is Micro Service?

Micro Service is an architecture that allows the developers to develop and deploy services independently. Each service running has its own process, and this achieves the lightweight model to support business applications.

Features and benefits of Spring Boot

  • Spring boot provides a flexible way to configure Java Beans, XML configurations, and Database Transactions.
  • It provides a powerful batch processing and manages REST endpoints.
  • In Spring Boot, everything is auto configured; no manual configurations are needed.
  • It offers annotation-based spring application.
  • Eases dependency management.
  • It includes Embedded Servlet Container.
  • It is highly dependent on the starter templates feature.

How Spring Boot works

Spring Boot automatically configures our application based on the dependencies we have added to the project by using @EnableAutoConfiguration annotation. For example, if MySQL database is on our classpath, but we have not configured any database connection, then Spring Boot auto-configures an in-memory database.

Spring Boot Starters

Handling dependency management is a difficult task for big projects. Spring Boot resolves this problem by providing a set of dependencies for developer’s convenience.

For example, if we want to create a web application with REST Endpoints, it is sufficient if we include spring-boot-starter-web dependency in our project.

Note that all Spring Boot starters follow the same naming pattern spring-boot-starter-*, where * indicates that it is a type of the application.


Spring Boot Starter Test dependency is used for writing Test cases. Its code is shown below:


Spring Boot Application

The entry point of the Spring Boot Application is the class containing @SpringBootApplication annotation. This class should have the main method to run the Spring Boot application. @SpringBootApplication annotation includes @EnableAutoConfiguration, @ComponentScan, and @SpringBootConfiguration annotations.

Spring Boot automatically scans all the components included in the project by using @ComponentScan annotation.

Observe the following code for a better understanding:

        import org.springframework.boot.SpringApplication;
        import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
        public class DemoApplication {
        public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);

Spring Boot – Quick Start – using Groovy

The Spring Boot CLI is a command line tool and it allows us to run the Groovy scripts. Create a simple groovy file which contains the Rest Endpoint script.


        class Example {
        public String hello() {
        "Hello Spring Boot"

The above file can be run using the command “spring run Hello.groovy”


Once we run the groovy file, required dependencies will download automatically and it will start the application in Tomcat 8080 port as shown in the screenshot above. You can also see that sping ‘Mapped "{[/]}" onto public java.lang.String Example.hello()’.

We can go to the web browser and hit the URL http://localhost:8080/, and see the output from hello() function as shown below:


Click here to continue >>

When you are deploying a new change into production, the associated deployment should be in a predictable manner. In simple terms, this means no disruption and zero downtime! In case you do encounter a problem or a bottleneck, the deployment strategy should include a quick roll back.

The safe strategy can be achieved by working with two identical infrastructures - the “green” environment hosting the current production and the “blue” environment with the new changes.

The business and IT teams will have an opportunity to conduct sanity, smoke test or any other test in the “blue” environment before making a “Go” decision. Upon “Go”, the team can switch “blue” to “green” and “green” to “blue”.

In Azure, different processes are available for implementing the Blue-Green strategy with two environments.

We have listed below some of these techniques. Naturally, this list is not fixed and will grow continuously as new tool sets and services emerge.

  • Deployment slots - For Web Apps, deployment slots provide an easy way to implement Blue-Green deployments.
  • Azure Traffic Manager – This can be leveraged to realize Blue-Green deployments for smoother deployments with weighted round-robin routing method. The detailed configuration and implementation methods are available in Azure Documentation.
  • Using an Application Gateway with two backend pools and a routing rule - Have two backend resource pools with one as a stage pool and another one as a prod pool. Add stage VMSS to stage pool, prod VMSS to prod pool and have one routing rule in the app gateway. Depending on the need to use stage or prod VMSS, this rule will be changed to point to the appropriate backend address pool.

CloudIQ architects and engineers have implemented Blue-Green deployment for multiple clients, and in each case, we have customized our strategies to suit their use-cases. If you are looking for a completely safe way to deploy new software versions and applications, then reach out to us at sales@cloudiq.io

Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.

Spark provides distributed task transmission, scheduling, and I/O functionality. It provides programmers with a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied.

How Apache Spark works

Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores.

The Spark Core engine uses the resilient distributed data set, or RDD, as its primary data type. The RDD is designed in such a way to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn't have to define where specific files are sent or what computational resources are used to store or retrieve files.

Given below is a sample Spark program written in Python to count the number of records with each rating in the input file given in next page:

 from pyspark import SparkConf, SparkContext
        import collections
        conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
        sc = SparkContext(conf = conf)
        lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
        ratings = lines.map(lambda x: x.split()[2])
        result = ratings.countByValue()
        sortedResults = collections.OrderedDict(sorted(result.items()))
        for key, value in sortedResults.items():
            print("%s %i" % (key, value))

In the above code, sc is the SparkContext associated with input file u.data. ratings is a RDD created by mapping the 3rd column in input file (array occurrence [2] – Ratings). Here map() is a transformation function which produces a new RDD.

We can have multiple transformations in a single spark program each producing a new RDD from an existing RDD or an input file. countByValue() is an Action that is performed.
In Spark, the transformations are not executed until an Action is triggered. This is called Lazy Evaluation.

Apache Spark works

Figure 1

Spark languages

Spark was written in Scala, which is considered the primary language for interacting with the Spark Core engine. Out of the box, Spark also comes with API connectors for using Java, R, and Python.

Spark libraries

  • The Spark Core engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data.
  • Spark SQL -- One of the most commonly used libraries, Spark SQL enables users to query data stored in disparate applications using the common SQL language.
  • Spark Streaming -- This library allows users to build applications that analyze and present data in real time.
  • MLlib -- A library of machine learning code that enables users to apply advanced statistical operations to data in their Spark cluster and to build applications around these analyses.
  • GraphX -- A built-in library of algorithms for graph-parallel computation.

RDDs, DataFrames, and Datasets

An RDD is an immutable distributed collection of elements of data, partitioned across nodes in a cluster that can be operated in parallel with a low-level API that offers transformations and actions.

Like an RDD, a DataFrame is an immutable distributed collection of data. However, unlike an RDD, data is organized into named columns, like a table in a relational database.

Datasets in Apache Spark are an extension of DataFrame API which provides type-safe, object-oriented programming interface.

Executing SQL-style functions on a Dataframe

Given below is a map-reduce program to get the list of popular movies (which has been rated by many customers using the same input data as Figure 1 above).

 from pyspark import SparkConf, SparkContext
        conf = SparkConf().setMaster("local").setAppName("PopularMovies")
        sc = SparkContext(conf = conf)
        lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
        movies = lines.map(lambda x: (int(x.split()[1]), 1))
        movieCounts = movies.reduceByKey(lambda x, y: x + y)
        flipped = movieCounts.map( lambda xy: (xy[1],xy[0]) )
        sortedMovies = flipped.sortByKey()
        results = sortedMovies.collect()
        for result in results:

The same program, when written using DataFrames, will look like this

 from pyspark.sql import SparkSession
        from pyspark.sql import Row
        from pyspark.sql import functions
        def loadMovieNames():
            movieNames = {}
            with open("ml-100k/u.ITEM") as f:
                for line in f:
                    fields = line.split('|')
                    movieNames[int(fields[0])] = fields[1]
            return movieNames
        # Create a SparkSession (the config bit is only for Windows!)
        spark = SparkSession.builder.config("spark.sql.warehouse.dir", 
        # Load up our movie ID -> name dictionary
        nameDict = loadMovieNames()
        # Get the raw data
        lines = spark.sparkContext.textFile("file:///SparkCourse/ml-100k/u.data")
        # Convert it to a RDD of Row objects
        movies = lines.map(lambda x: Row(movieID =int(x.split()[1])))
        # Convert that to a DataFrame
        movieDataset = spark.createDataFrame(movies)
        # Some SQL-Style magic to sort all movies by popularity in one line!
        topMovieIDs = movieDataset.groupBy("movieID").count().orderBy
        ("count", ascending=False).cache()
        # Show the results at this point:
        #|     50|  584|
        #|    258|  509|
        #|    100|  508|
        # Grab the top 10
        top10 = topMovieIDs.take(10)
        # Print the results
        for result in top10:
            # Each row has movieID, count as above.
            print("%s: %d" % (nameDict[result[0]], result[1]))
        # Stop the session

As you can see DataFrames gives us the flexibility to use SQL style functions to get the required results. Because DataFrames APIs are built on top of the Spark SQL engine, it uses Catalyst to generate an optimized logical and physical query plan.

Job Scheduling

Spark has several facilities for scheduling resources between computations.

  • Each Spark application (instance of SparkContext) runs an independent set of executor processes. The cluster managers that Spark runs on provide facilities for scheduling across applications.
  • Within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads. This is common if the application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext.

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

Spark Streaming

The Python program shown below counts the number of words in text data received from a data server listening on a TCP socket.

Sample input entered for this program at a terminal through NetCat and the output of the program is given below.

# Running Netcat

$ nc -lk 9999

hello world

# TERMINAL 2: RUNNING network_wordcount.py
$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py 
localhost 9999
Time: 2014-10-14 15:25:21



Launched for the first time in May 2014, Apache Spark has become the go-to program for companies that work with large-scale Big Data applications. The speed and agility of Spark have made it incredibly useful across a wide range of industries.

From FMCG giants to BFSI companies to digital advertising firms – Apache Spark has proved to be indispensable when it comes to aggregating data, gleaning insights and forecasting industry trends.

CloudIQ is a leading Cloud Consulting and Solutions firm that helps businesses solve today’s problems and plan the enterprise of tomorrow by integrating intelligent cloud solutions. We help you leverage the technologies that make your people more productive, your infrastructure more intelligent, and your business more profitable. 


626 120th Ave NE, B102, Bellevue,

WA, 98005.



Chennai One IT SEZ,

Module No:5-C, Phase ll, 2nd Floor, North Block, Pallavaram-Thoraipakkam 200 ft road, Thoraipakkam, Chennai – 600097

© 2020 CloudIQ Technologies. All rights reserved.

Get in touch

Please contact us using the form below


626 120th Ave NE, B102, Bellevue, WA, 98005.

+1 (206) 203-4151



Chennai One IT SEZ,

Module No:5-C, Phase ll, 2nd Floor, North Block, Pallavaram-Thoraipakkam 200 ft road, Thoraipakkam, Chennai – 600097