Provisioning Cloud Infrastructure using AWS CloudFormation Templates

Amazon Web Services (AWS) is a secure cloud services platform, offering compute power, database storage, content delivery, and other functionalities to help businesses scale and grow.

It gives organizations a secure and robust platform to develop their custom cloud-based solutions and has several unique features that make it one of the most reliable and flexible cloud platform such as

  • Mobile-friendly access through AWS Mobile Hub and AWS Mobile SDK
  • Fully managed purpose-built Databases
  • Serverless cloud functions
  • Range of storage options that are affordable and scalable.
  • Unbeatable security and compliance

Following are some core services offered by AWS:

AWS Core services
  1. An EC2 instance is a virtual server in Amazon's Elastic Compute Cloud (EC2) for running applications on the AWS infrastructure.
  2. Amazon Elastic Block Store (EBS) is a cloud-based block storage system provided by AWS that is best used for storing persistent data.
  3. Amazon Virtual Private Cloud (Amazon VPC) enables us to launch AWS resources into a virtual network that we have defined. This virtual network closely resembles a traditional network that we would operate in our own data center, with the benefits of using the scalable infrastructure of AWS.
  4. Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that uses to run its global e-commerce network.
  5. AWS security groups (SGs) are associated with EC2 instances and provide security at the protocol and port access level. Each security group — working much the same way as a firewall — contains a set of rules that filter traffic coming into and out of an EC2 instance.

Let us look more deeply at one of AWS's core services – AWS CloudFormation – that is key for managing workloads on AWS.

1.   CloudFormation

AWS CloudFormation is a service that helps us model and set up our Amazon Web Services resources so that we can spend less time managing those resources and more time focusing on our applications that run in AWS.  We create a template that describes all the AWS resources that we want (like Amazon EC2 instances or S3 buckets), and AWS CloudFormation takes care of provisioning and configuring those resources for us. We don't need to individually create and configure AWS resources and figure out what's dependent on what; AWS CloudFormation handles all of that.

A stack is a collection of AWS resources that you can manage as a single unit. In other words, we can create, update, or delete a collection of resources by creating, updating, or deleting stacks. All the resources in a stack are defined by the stack's AWS CloudFormation template.

2.   CloudFormation template

CloudFormation templates can be written in either JSON or YAML.  The structure of the template in YAML is given below:

AWSTemplateFormatVersion: "version date"

  template metadata
  set of parameters
  set of mappings
  set of conditions
  set of resources
  set of outputs

In the above yaml file,

  1. AWSTemplateFormatVersion - The AWS CloudFormation template version that the template conforms to.
  2. Description - A text string that describes the template.
  3. Metadata - Objects that provide additional information about the template.
  4. Parameters - Values to pass to our template at runtime (when we create or update a stack). We can refer to parameters from the Resources and Outputs sections of the template.
  5. Mappings - A mapping of keys and associated values that we can use to specify conditional parameter values, like a lookup table. We can match a key to a corresponding value by using the Fn::FindInMap intrinsic function in the Resources and Outputs sections.
  6. Conditions - Conditions that control whether certain resources are created or whether certain resource properties are assigned a value during stack creation or update. For example, we can conditionally create a resource that depends on whether the stack is for a production or test environment.
  7. Resources - Specifies the stack resources and their properties, such as an Amazon Elastic Compute Cloud instance or an Amazon Simple Storage Service bucket.  We can refer to resources in the Resources and Outputs sections of the template.
  8. Outputs - Describes the values that are returned whenever we view our stack's properties. For example, we can declare an output for an S3 bucket name and then call the AWS cloudformation describe-stacks AWS CLI command to view the name.

Resources is the only required section in the CloudFormation template.  All other sections are optional.

3.   CloudFormation template to create S3 bucket


    Type: AWS::S3::Bucket

In AWS Console, go to CloudFormation and click on Create Stack

Upload the template file which we created.  This will get stored in an S3 location, as shown below.

Click next and give a stack name

Click Next and then "Create stack".  After a few minutes, you can see that the stack creation is completed.

Clicking on the Resource tab, you can see that the S3 bucket has been created with name “s3-stack-hellobucket-buhpx7oucrgn”.  AWS has provided this same since we didn’t specify the BucketName property in YAML.

Note that deleting the stack will delete the S3 bucket which it had created.

4.   Intrinsic functions

AWS CloudFormation provides several built-in functions that help you manage your stacks.

In the below example, we create two resources - a Security Group and an EC2 Instance, which uses this Security Group.  We can refer to the Security Group resource using the !Ref function.


    Type: 'AWS::EC2::Instance'
        - !Ref InstanceSecurityGroup
      KeyName: mykey
      ImageId: ''
    Type: 'AWS::EC2::SecurityGroup'
      GroupDescription: Enable SSH access via port 22
        - IpProtocol: tcp
          FromPort: '22'
          ToPort: '22'

Some other commonly used intrinsic functions are

  1. Fn::GetAtt - returns the value of an attribute from a resource in the template.
  2. Fn::Join - appends a set of values into a single value, separated by the specified delimiter. If a delimiter is an empty string, the set of values are concatenated with no delimiter.
  3. Fn::Sub - substitutes variables in an input string with values that you specify. In our templates, we can use this function to construct commands or outputs that include values that aren't available until we create or update a stack.
5.   Parameters

Parameters enable us to input custom values to your template each time you create or update a stack.


    Type: String
    Default: t2.micro
      - t2.micro
      - m1.small
      - m1.large
    Description: Enter t2.micro, m1.small, or m1.large. Default is t2.micro.
    Type: AWS::EC2::Instance
        Ref: InstanceTypeParameter
      ImageId: ami-0ff8a91507f77f867
6.   Pseudo Parameters

Pseudo parameters are parameters that are predefined by AWS CloudFormation. We do not declare them in our template. Use them the same way as we would a parameter as the argument for the Ref function.

Commonly used pseudo parameters:

  1. AWS: Region - Returns a string representing the AWS Region in which the encompassing resource is being created, such as us-west-2
  2. AWS::StackName - Returns the name of the stack as specified during cloudformation create-stack, such as teststack
7.   Mappings

The optional Mappings section matches a key to a corresponding set of named values. For example, if you want to set values based on a region, we can create a mapping that uses the region name as a key and contains the values we want to specify for each specific region. We use the Fn::FindInMap intrinsic function to retrieve values in a map.

We cannot include parameters, pseudo parameters, or intrinsic functions in the Mappings section.


AWSTemplateFormatVersion: "2010-09-09"
      HVM64: ami-0ff8a91507f77f867
      HVMG2: ami-0a584ac55a7631c0c
      HVM64: ami-0bdb828fd58c52235
      HVMG2: ami-066ee5fd4a9ef77f1
      HVM64: ami-047bb4163c506cd98
      HVMG2: ami-0a7c483d527806435
      HVM64: ami-06cd52961ce9f0d85
HVMG2: ami-053cdd503598e4a9d
      HVM64: ami-08569b978cc4dfa10
      HVMG2: ami-0be9df32ae9f92309
    Type: "AWS::EC2::Instance"
      ImageId: !FindInMap [RegionMap, !Ref "AWS::Region", HVM64]
      InstanceType: m1.small
8.   Outputs

The optional Outputs section declares output values that we can import into other stacks (to create cross-stack references), return in response (to describe stack calls), or view on the AWS CloudFormation console. For example, we can output the S3 bucket name for a stack to make the bucket easier to find.

In the below example, the output named StackVPC returns the ID of a VPC, and then exports the value for cross-stack referencing with the name VPCID appended to the stack's name.

    Description: The ID of the VPC
    Value: !Ref MyVPC
      Name: !Sub "${AWS::StackName}-VPCID"

How to Create and Run Spark Clusters with Qubole using AWS

There is little doubt that data will guide the next generation of business strategy and will bring new efficiencies across industries. But for that to happen, organizations must be able to extract insights from their data.

Qubole is an ideal platform to activate end-to-end data processing in organizations. It combines all types of data – structured, unstructured, and legacy offline data – into a single data pipeline and turns it into rich insights by adding AI, ML, and deep analytics tools to the mix.

It scales seamlessly to accommodate more users and new data without adding administrative overheads and lowers cloud costs significantly. Simply put, Qubole is a platform that puts big data on the cloud to power business decisions based on real-time analytics.

At CloudIQ Technologies, our data experts have deployed Qubole’s cloud-native data systems for many of our clients, and the results have been outstanding. Here is an article from one of our data engineers that provides an overview of how to setup Qubole to use AWS environment and create and run spark clusters.

AWS Access Configuration:

In order for Qubole to create and run a cluster, we have to grant Qubole access to our AWS environment. We can grant access based on a key or a role. We will use role-based authentication.

Step 1: Login to Qubole

Step 2: Click on the menu at the top left corner and select "Account Settings" under the Control Panel.

Step 3: Scroll down to Access settings

Step 4: Switch Access mode to “IAM Role”

Step 5: Copy the Trusted Principal AWS Account ID and External ID

Step 6: Use the copied values to create a QuboleAccessRole in the AWS account (using the cloudformation template)

Step 7: Copy the Role ARN of the QuboleAccessRole and enter it in the Role ARN field

Step 8: Enter the S3 bucket location where the Qubole metadata will be stored in the "Default Location" field.

Step 9: Click Save

Spark Cluster
Create a cluster

The below steps will help create a new Spark cluster in Qubole.

Step 1: Click on the top-left dropdown menu and select "Cluster"

Step 2: Click on “+New” button

Step 3: Select “Spark” and click “Next”

Step 4: Provide a name for the cluster in the “Cluster Labels” field

Step 5: Select the version of Spark to run, Master Node Type, Worker Node Type, Minimum and Maximum nodes

Step 6: Select Region as us-west-2

Step 7: Select Availability Zone as us-west-2a

Step 8: Click “Next”

Step 9: In the Composition screen, you can select the type of nodes that will be spun up.

Step 10: In the Advanced Configuration screen, proceed to EC2 settings

Step 11: Enter “QuboleDualIAMRole” in the “Instance Profile” field

Step 12: Select “AppVPC” in VPC field

Step 13: Select “AppPrivateSNA” under Subnet field

Step 14: Enter the ip address of the Bastion node in the “Bastion Node” field

Step 15: Scroll to the bottom and enter “AppQuboleClusterSG” (security group for the cluster) in the “Persistent Security Group” field

Step 16: Click on “Create”

Run a cluster

To start a cluster, click on the dropdown menu on the top left corner and select cluster. Now click on "Start" button next to the cluster that needs to be started. A cluster is also automatically started when a job is submitted for the cluster.

Submit a job

One of the simplest ways to run a spark job is to submit it through the workbench. You can navigate to the workbench from the drop-down menu at the top left corner. In the workbench, click on "+Create New". Then select "Spark" next to the title of the job. Once you select Spark, an optional drop-down appears where you can choose "Python". In the last drop-down menu, select the spark cluster where you want to execute the job. If this cluster is not active, it will be activated automatically. Enter your spark job in the window below. When complete, click on "Run" to run the job.

Airflow Cluster

Airflow scheduler can be used to run various jobs in a sequence. Let’s take a look at configuring an Airflow cluster in Qubole.

Setting up DataStore

The first step in creating an airflow cluster is to set up a datastore. Make sure that the MySQL db is up and running and contains a database for airflow. Now, select “Explore” from the dropdown menu at the top left corner. On the left hand menu, drop down the selection menu showing “Qubole Hive” and select “+Add Data Store”

In the new screen, provide a name for the data store. Select “MySQL” as the database type. Enter the database name for the airflow database (The database should already be created in MySQL). Enter the host address as “". Enter the username and password. Make sure to select “Skip Validation”. Since the MySQL db is in a private VPC, Qubole does not have access to it and will not be able to validate.

Configuring Airflow Cluster

Step 1: Click on the top left drop-down menu and select "Cluster"

Step 2: Click on “+New” button

Step 3: Select “Airflow” in the type of cluster and click “Next”

Step 4: Give a cluster name. Select the airflow version, node type.

Step 5: Select the datastore which points to the MySQL

Step 6: Select the us-west-2 as the Region

Step 7: Select us-west-2a as the Availability zone

Step 8: Click next to go to Advanced Configuration

Step 9: Select AppVPC as the VPC

Step 10: Select AppPrivateSNA as the Subnet

Step 11: Enter the Bastion Node information

Step 12: Scroll to the bottom and enter AppQuboleClusterSG as the Persistent Security Groups

Step 13: Click on create

Once the cluster is created, you can run it by clicking on "Start" next to the cluster's name.

How to Set Up a Kubernetes Cluster on RHEL 7.6 on AWS

Containers are being embraced at a breakneck speed – developers love them, and they are great for business because they deliver speed and scale in a cost-efficient manner. So much so, that container technology seems to be overtaking VMs – especially with container orchestration tools like Kubernetes, making them simpler to manage and extracting higher efficiency and speed from them.

Kubernetes cluster architecture

Kubernetes provides an open-source platform for simplifying multi-cloud environments. The disparities between different cloud providers are a roadblock for developers and Kubernetes helps by streamlining and standardizing container-based applications.

Kubernetes clusters are the architectural foundation that drives this simplicity and makes it possible for users to get the functionality they need at scale and with ease. Here are some of the functionalities of Kubernetes -

  • Kubernetes distributes workload efficiently across all open resources and reduces traffic spikes or outages.
  • It simplifies application deployment regardless of the size of the cluster
  • It automates horizontal scaling
  • It monitors against app failure with constant node and container health checks and performs self-healing and replication to resolve any failure issues.

All this makes the work of developers faster and frees up their time and attention from trivial repetitive tasks allowing them to build applications better and faster! For the organization, the benefits are three-fold - higher productivity, better products and, finally, cost efficiencies.

Let’s move to the specifics now and find out how to set up a Kubernetes Cluster on the RHEL 7.6 operating system on AWS.

  • You should have a VPC available.
  • A subnet within that VPC, into which you will place your cluster.
  • You should have Security Groups for the Control Plane Load Balancer and the Nodes created.
  • You should have created the Control Plane Load Balancer.
  • A bastion host, or jump box, with a public IP within your VPC from which you can secure shell into your VMs.
  • A pem file for your AWS region, which you will use to secure shell into your VMs.
Creating the IAM Roles

You will need to create 2 IAM roles: one for the Master(s), and one for the worker nodes.

Master Role

To create an IAM role, go to the IAM (Identity and Access Management) page in the AWS console. On the left-hand menu, click ‘Roles’. Then click ‘Create Role’.

Select the service that will use this role. By default, it is EC2, which is what we want. Then click "Next: Permissions".

Click ‘Create Policy’. The Create Policy page opens in a new tab.

Click on the ‘JSON tab’. Then paste this json into it:

    "Version": "2012-10-17",
    "Statement": [
            "Action": [
            "Resource": "*",
            "Effect": "Allow"

This json defines the permissions that your master nodes will need.

Click ‘Review Policy’. Then give your policy a name and a description.

Click ‘Create Policy’ and your policy is created!

Back on the Create Role page, refresh your policy list, and filter for the policy you just created. Select it and click ‘Next: Tags’.

You should add 2 tags: Name, with a name for your role, and KubernetesCluster, with the name of the cluster that you are going to create. Click ‘Next: Review’.

Give your role a name and a description. Click ‘Create Role’ and your role is created!

Node Role

For the node role, you will follow similar steps, except that you will use the following json:

    "Version": "2012-10-17",
    "Statement": [
            "Action": [
            "Resource": "*",
            "Effect": "Allow"
Provisioning the VMs
Provisioning the Master

We will use RHEL 7.6 for our cluster because RHEL 8.0 uses iptables v1.8, and kube-proxy does not work well with iptables v1.8. However, kube-proxy works with iptables v1.4, which is installed on RHEL 7.6. We will use the x86_64 architecture.

Log into the AWS console. Go to the EC2 home page and click ‘Launch Instance’. We will search under Community AMIs for our image.

Click ‘Select’. Then choose your instance type. T2.medium should suffice for a Kubernetes master. Click ‘Next: Configure Instance Details’.

We will use only 1 instance. For an HA cluster, you will want more. Select your network and your subnet. For the purposes of this tutorial, we will enable auto-assigning a public IP.  In production, you would probably not want your master to have a public IP.  But you would need to make sure that your subnet is configured correctly with the appropriate NAT and route tables. Select the IAM role you created. Then click ‘Next: Add Storage’.

The default, 10 GB of storage, should be adequate for a Kubernetes master. Click ‘Next: Add Tags’.

We will add 3 tags: Name, with the name of your master; KubernetesCluster, with the name of your cluster; and<name of your cluster>, with the value owned. Click ‘Next: Configure Security Group’.

Select "Select an existing security group" and select the security group you created for your Kubernetes nodes. Click ‘Review and Launch’.

Click ‘Launch’. Select "Choose an Existing Key Pair". Select the key pair from the drop-down. Check the "I acknowledge" box. You should have the private key file saved on the machine from which you plan to secure shell into your master; otherwise you will not be able to ssh into the master! Click ‘Launch Instances’ and your master is created.

Provisioning the Auto Scaling Group

Your worker nodes should be behind an Auto Scaling group. Under Auto Scaling in the left-hand menu of the AWS console, click ‘Auto Scaling Groups’. Click ‘Create Auto Scaling Group’. On the next page, click ‘Get Started’.

Under “Choose AMI”, select RHEL 7.6 x86_64 under Community AMIs, as you did for the master.

When choosing your instance type, be mindful of what applications you want to run on your Kubernetes cluster and their resource needs. Be sure to provision a size with sufficient CPUs and memory.

Under “Configure Details”, give your autoscaling group a name and select the IAM role you configured for your Kubernetes nodes.

When selecting your storage size, be mindful of the storage requirements of your applications that you want to run on Kubernetes. A database application, for example, would need plenty of storage.

Select the security group that you configured for Kubernetes nodes.

Click ‘Create Launch Configuration’. Then select your key pair as you did for the master. Click ‘Create Launch Configuration’ and you are taken to the ‘Configure Auto Scaling Group Details’ page. Give your group a name. Select a group size. For our purpose, 2 nodes will suffice. Select the same subnet on which you placed your master. Click ‘Next: Configure Scaling policies’.

For this tutorial, we will select "Keep this group at its initial size". For a production cluster with variability in usage, you may want to use scaling policies to adjust the capacity of the group. Click ‘Next: Configure Notifications’.

We will not add any notifications in this tutorial. Click ‘Next: Configure Tags’.

We will add 3 tags: Name, with the name of your nodes; KubernetesCluster, with the name of your cluster; and<your cluster name>, with the value owned. Click ‘Review’.

Click Create Auto Scaling Group and your auto-scaling group is created!

Installing Kubernetes

Specific steps need to be followed to install Kubernetes. Run the following steps as sudo on your master(s) and worker nodes.

 # add docker repo

yum-config-manager --add-repo

# install container-selinux

 yum install -y

# install docker

yum install docker-ce

# enable docker

systemctl enable --now docker

# create Kubernetes repo. The 2 urls after gpgkey have to be on 1 line.

cat <<EOF > /etc/yum.repos.d/kubernetes.repo

# configure selinux

setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# install kubelet, kubeadm, kubectl, and Kubernetes-cni. We found that version 1.13.2 works well with RHEL 7.6.

yum install -y kubelet-1.13.2 kubeadm-1.13.2 kubectl-1.13.2 kubernetes-cni-0.6.0-0.x86_64 --disableexcludes=kubernetes –nogpgcheck

# enable kubelet

systemctl enable --now kubelet

# Run the following command as a regular user.

sudo usermod -a -G docker $USER
Creating the Kubernetes Cluster

First, add your master(s) to the control plane load balancer as follows. Log into the AWS console, EC2 service, and on the left-hand menu, under Load Balancing, click ‘Load Balancers’. Select your load balancer and click the Instances tab in the bottom window. Click ‘Edit Instances’.

Select your master(s) and click ‘Save’.

We will create the Kubernetes cluster via a config file. You will need a token, the master’s private DNS name taken from the AWS console, the Load Balancer’s IP, and the Load Balancer’s DNS name. You can generate a Kubernetes token by running the following command on a machine on which you have installed kubeadm:

kubeadm token generate

To get the load balancer’s IP, you must execute a dig command. You install dig by running the following command as sudo:

yum install bind-utils

Then you execute the following command:

dig +short <load balancer dns>

Then you create the following yaml file:

 kind: InitConfiguration
 - groups:
   - "system:bootstrappers:kubeadm:default-node-token"
   token: "<token>"
   ttl: "0s"
   - signing
   - authentication
   name: "<master private dns>"
     cloud-provider: "aws"
 kind: ClusterConfiguration
 kubernetesVersion: "v1.13.2"
   timeoutForControlPlane: 10m0s
   - "<Load balancer IPV4>"
     cloud-provider: "aws"
 clusterName: kubernetes
 controlPlaneEndpoint: "<load balancer DNS>:6443"
     cloud-provider: "aws"
     allocate-node-cidrs: "false"
   type: CoreDNS 

You then bootstrap the cluster with the following command as sudo:

kubeadm init --config kubeadm.yaml --ignore-preflight-errors=all

I had a timeout error on the first attempt, but the command ran successfully the second time. Make a note of the output because you will need it to configure the nodes.

You then configure kubectl as follows:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

After this there are some components that need to be installed on Kubernetes on AWS:

# Grant the "admin" user complete access to the cluster

kubectl create clusterrolebinding admin-cluster-binding --clusterrole=cluster-admin --user=admin

# Add-on for networking providers, so pods can communicate. 
# Currently either calico.yaml or weave.yaml

kubectl apply -f

# Install the Kubernetes dashboard

kubectl apply -f

# Install the default StorageClass

kubectl apply -f

# Set up the network policy blocking the AWS metadata endpoint from the default namespace.

kubectl apply -f

Then you have to configure kubelet arguments:

sudo vi /var/lib/kubelet/kubeadm-flags.env

And add the following parameters:

--cloud-provider=aws --hostname-override=<the node name>

After editing the kubeadm-flags.env file:

sudo systemctl restart kubelet

Finally, you have to label your master with the provider ID. That way, any load balancers you create for this node will automatically add the node as an AWS instance:

kubectl patch node <node name> -p '{"spec":{"providerID":"aws:///<availability zone>/<instance ID>"}}'

You can join worker nodes to the cluster by running the following command as sudo, which should have been printed out after running kubeadm init on the master:

kubeadm join <load balancer dns>:6443 --token <token> --discovery-token-ca-cert-hash <discovery token ca cert hash> --ignore-preflight-errors=all

Be sure to configure kubelet arguments on each node and patch them using kubectl as you did for the master.

Your Kubernetes cluster on AWS is now ready!

Creating AWS Security Groups for Kubernetes

In this article we will discuss how to create security groups in AWS for Kubernetes. The goal is to set up a Kubernetes cluster on AWS EC2, having provisioned your virtual machines. You are going to need two security groups: one for the control plane load balancer, and another for the VMs.

Creating a Security Group through the AWS Console

Prerequisite: You should have a VPC (virtual private cloud) set up.

Log into the AWS EC2 (or VPC) console. On the left-hand menu, under Network and Security, click Security Groups.

Click on Create Security Group.

Enter a Name and a Description for your Security Group. Then select your VPC from the drop-down menu. Click Add Rule.

You will need 2 TCP ingress rules, one over port 6443, another over port 443. We are choosing to allow the Source from anywhere. In production you may want to restrict the CIDR, IP, or security group that can reach this load balancer.

We are choosing to leave the outbound rules as default, in which all outbound traffic is permitted.

Click Create and your security group is created!

Select your security group in the console. You may want to give your security group a Name (in addition to the Group Name that you specified when creating it).

But you are not done yet: you must add tags to your security group. These tags will alert AWS that this security group is to be used for Kubernetes. Click on the Tags tab at the bottom of the window. Then click Add/Edit Tags.

You will need 2 tags:
  • Name: KubernetesCluster. Value: <the name of your Kubernetes cluster>
  • Name:<the name of your Kubernetes cluster>. Value: owned

Click Save and your tags are saved!

Creating a Security Group for the Virtual Machines

Follow the steps above to create a security group for your virtual machines. Here are the ports that you will need to open for your control plane VMs:

The master node:
  1. 22 for SSH from your bastion host
  2. 6443 for the Kubernetes API Server
  3. 2379-2380 for the ETCD server
  4. 10250 for the Kubelet health check
  5. 10252 for the Kube controller manager
  6. 10255 for the read only kubelet API
The worker nodes:
  1. 22 for SSH
  2. 10250 for the kubelet health check
  3. 30000-32767 for external applications. However, it is more likely that you will expose external applications to outside the cluster via load balancers, and restrict access to these ports to within your vpc.
  4. 10255 for the read only kubelet API

We have chosen to combine the master and the worker rules into one security group for convenience. You may want to separate them into 2 security groups for extra security.

Follow the step-by-step instructions detailed above and you will have successfully created AWS Security Groups for Kubernetes.

Faster OLAP Setup Using AWS Glue, AWS Athena & AWS QuickSight


Analytics is the discovery, interpretation, and communication of meaningful patterns in data; and the process of applying those patterns towards effective decision making. In other words, analytics can be understood as the connective tissue between data and effective decision making, within an organization. Organizations may apply analytics to business data to describe, predict, and improve business performance.
Big data analytics is the complex process of examining large and varied data sets -- or big data -- to uncover information including hidden patterns, unknown correlations, market trends and customer preferences that can help organizations make informed business decisions.

Glue, Athena and QuickSight are 3 services under the Analytics Group of services offered by AWS. Glue is used for ETL, Athena for interactive queries and Quicksight for Business Intelligence (BI).


AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. We can create and run an ETL job with a few clicks in the AWS Management Console. We simply point AWS Glue to our data stored on AWS, and AWS Glue discovers our data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, our data is immediately searchable, queryable, and available for ETL.

In this blog we will look at 2 components of Glue – Crawlers and Jobs

Glue Crawlers

Glue crawlers can scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. From there it can be used to guide ETL operations.

Suppose we have a file named people.json in S3 with the below contents:


Below are the steps to crawl this data and create a table in AWS Glue to store this data:

  1. On the AWS Glue Console, click “Crawlers” and then “Add Crawler”
  2. Give a name for your crawler and click next
  3. Select S3 as data source and under “Include path” give the location of json file on S3.
  4. Since we are going to crawl data from only 1 dataset, select No in next screen and click Next
  5. In next screen select an IAM role which has access to the S3 data store
  6. Select Frequency as “Run on demand” in next screen.
  7. Select a Database to store the crawler’s output. I chose a database named “saravanan” in the screen below. If no database exists, Add a database using the link given
  8. Review all details in next step and click Finish
  9. On next screen, click on “Run it now” to run the crawler
  10. The crawler runs for around a minute and finally you will be able to see status as Stopping / Ready with Tables added count as 1.
  11. Now you can go to Tables link and see that a table named “people_json” has been created under “Saravanan” database.
  12. Using the “View details” Action, and then scrolling down, you can see the schema for the table which Glue has automatically inferred and generated.
Glue jobs

The AWS Glue Jobs system provides managed infrastructure to orchestrate our ETL workflow. We can create jobs in AWS Glue that automate the scripts we use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.

To add a new job using the console

  1. Open the AWS Glue console, and choose the Jobs tab.
  2. Choose Add job and follow the instructions in the Add job wizard. Below screens copy data from the table we created earlier to a parquet file named people-parquet in same S3 bucket.

    After the above job runs and completes, you will be able to verify in S3 that the output Parquet has been created.

Glue Jobs use a data structure named DynamicFrame. A DynamicFrame is similar to a Spark DataFrame, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type.

Instead of just using the python job which Glue generates, we can code our own jobs using DynamicFrames and have

        import sys
        from awsglue.transforms import *
        from awsglue.utils import getResolvedOptions
        from pyspark.context import SparkContext
        from awsglue.context import GlueContext
        glueContext = GlueContext(SparkContext.getOrCreate())
        users = glueContext.create_dynamic_frame.from_catalog(
        users_courses = glueContext.create_dynamic_frame.from_catalog(
        users = users.select_fields(['AccountName','Id','UserName','FullName','Active'])
         users_courses = users_courses.select_fields(['UserId', 'Id','Name','Code','Active',
        joined_table = Join.apply(users, users_courses, 'Id', 'UserId').drop_fields(['Id'])


it run through Glue. The same Glue job on next page selects specific fields from 2 Glue tables, renames some of the fields, joins the tables and writes the joined table to S3 in parquet format.


Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and we pay only for the queries that we run.

Athena is easy to use. We must simply point to our data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare our data for analysis. This makes it easy for anyone with SQL skills to quickly analyse large-scale datasets.

Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing us to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning.

Since Athena uses same Data Catalog as Glue, we will be able to query and view properties of the people_json table which we created earlier using Glue.

Also, we can create new table using data from S3 bucket data as shown below:

Unlike Glue, we have to explicitly give the data format (CSV, JSON, etc) and specify the column names and types while creating the table in Athena.

We can also manually create and query the tables using SQL as shown below:


Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy for us to deliver insights to everyone in our organization.

QuickSight lets us create and publish interactive dashboards that can be accessed from browsers or mobile devices. We can embed dashboards into our applications, providing our customers with powerful self-service analytics.

QuickSight easily scales to tens of thousands of users without any software to install, servers to deploy, or infrastructure to manage.

Below are the steps to create a sample Analysis in QuickSight:

  1. Any Analysis in QuickSight requires data from a Data Set. First click on the “Manage data” link at top right to list the Data Sets we currently have.
  2. To create a new Data Set, click the “New data set” link
  3. We can create Data Set from any of the Data sources listed here – uploading a file, S3, Athena table, etc.
  4. For our example, I am selecting Athena as data source and giving it a name “Athena source”. Then we must map this to a database / table in Athena.
  5. After we select the Athena table, QuickSight provides us an option to import the data to SPICE. SPICE is Amazon QuickSight's in-memory optimized calculation engine, designed specifically for fast, adhoc data visualization. SPICE stores our data in a system architected for high availability, where it is saved until we choose to delete it.
  6. Using the Edit/Preview Data option above allows us to select the columns to be included in Data set and rename them if required.
  7. Once we click the “Save & visualize” link above, QuickSight starts creating an Analysis for us. For our exercise we will select the Table visual type from the list.
  8. Add Account Name and User_id by dragging them from “Fields list” to “Group by” and course_active to “Value”
  9. Now we will add 2 parameters for Account Name and Learner id by clicking on Parameters at bottom left. While creating the parameter use the option “Link to a data set field” for Values and link the parameter to the appropriate column in the Athena table
  10. Once the parameters are added, create controls for the parameters. If we are adding 2 parameters with controls, we have option of showing only relevant values for second parameter based on the values selected for first parameter. For this select the “Show relevant values only” checkbox.
  11. Next add 2 Custom filters for Account Name and Learner id. These filters should me mapped to the parameters we had created earlier. For this choose the Filter type as “Custom filter” and select the “Use parameters” checkbox.
  12. Now using the Visualize option, we can verify if our Controls are working correctly
  13. To share the Dashboard with others, use the share option on top towards the right and use publish dashboard. We can search for users of the AWS account by email and publish to selective users.

Amazon Alexa Custom Skills – How to Build One Step-by-Step

Introduction to Alexa

Amazon’s Alexa is the voice activated, interactive  AI Bot, or intelligent personal assistant in the cloud that lets people speak with their Amazon Echo, Echo dot and other Amazon smart home devices. Alexa is designed to respond to number of commands and converse with people.

Alexa Skills are apps that give Alexa even more abilities. These skills can let her speak to more devices or websites.When the Alexa device is connected through wi-fi or Bluetooth to the internet, it wakes up by merely saying “ALEXA”.  Alexa Skills radically expands the bots repertoire, allowing users to perform more actions with voice-activated control through Alexa.

Overview of Alexa Skill

The most important part of ALEXA skill is its interaction design. Alexa skills don’t have visual feedback  like web or desk top applications and will guide the user through the skill using voice. All Alexa skill replies needs to tell the user clearly what the next options are.

Functional Architechture

An Alexa skill is a small application that interacts with Alexa via an AWS Lambda function

Functional Architechture

Designing the Alexa Skill

The most important aspect of the skill is its Vocal interface. The skill should be interacting naturally with the user. The components of Alexa Skill are :

  • Alexa requires a word, often called as Wake phrase which would alert the device that they can expect a command immediately after.The Default wake phrase is ALEXA. It can also be Amazon, Echo,Computer.
  • Launch Phrase is the word that tells Alexa to trigger a skill. Examples of Launch phrase are “ OPEN”, “ASK” , “START”, “LAUNCH”.
  • Invocation name is the name of the skill.
  • Intents are the goals that the user is trying to achieve by invoking the skill.
  • Utterance tells alexa what the skill should do. Apart from Static utterances such as Start and Launch, dynamic commands can be added. These Dynamic commands are called slots.
  • Each intent can contain one or more slots. A Slot is the variable that is parsed and exposed to the application code.


Alexa has a built in natural language processing engine. To Map the verbal phrase to an intent, Alexa handles the complexity of natural language processing through the help of a manually curated file Sampleutterances.txt.

The first word in each line of SampleUtteranaces.txt is the intent name. The application code reads the value of the intent name and responds appropriately. Following the intent name, is the phrase that the user says to achieve that intent. The User might tell a phrase apart from those defined as Slots in the intent. The application is free to react differently based on the presence or value of the slot. To give Alexa the best chance of understanding users, it is recommended to include as many sample utterances as possible. Depending on the skill there could be n number of ever changing sample utterances.

The below example sums up the entire vocal interface

entire vocal interface

Build and Publish a new skill

Building and publishing a new skill in Alexa comprises of  the below steps:

1. Create and Configure Skill :

Create an Alexa skill using This will open the skill information where we can specify the name of the skill and the invocation name.

2. Create Interaction Model:

Interaction model is a set of rules that defines the way the user interacts with your skill. As a part of an interaction model, Intents, Utterances  are defined. The intent schema should be in JSON format and it should define an array of intents, each with a name, and an optional list of dynamic parts — slots. Alexa will automatically train itself with the provided interaction model.

3. Coding the Backend system:

Once the interaction model has been designed Code and deployment of the Lambda function has to be done.

For each intent, an input/output contract has to be implemented. The input is an IntentRequest which is a representation of the user’s request and includes all the slot values.

The response from alexa can be of  multiple ways.

  • Ask the user a question and wait for response.
  • Give the details to the user and shut down.
  • Say nothing and shut down.

Alexa can either respond verbally or the response could be displayed on the phone.

4. Deploying the Backend system:

The skills can be deployed as an AWS Lambda function with code written in Java or Node.Js, Python or C#. The simplest approach would be the code in Node.Js.

5. Testing the Skill:

Testing of the Skill can either be done through the test simulator available in the Developer console account or through the device connected to the development account.

6. Publishing the Skill:

To Publish the skill, The skill has to be  submitted by filling out the “Publishing Information” and the “Privacy & Compliance” sections

Azure vs AWS for Analytics & Big Data

This is the fifth blog in our series helping you understand all about cloud, when you are in a dilemma to choose Azure or AWS or both, if needed.

Before we jumpstart on the actual comparison chart of Azure and AWS, we would like to bring you some basics on data analytics and the current trends on the subject.

If you would rather like to have quick look at the comparison table, Click here

This blog is intended to help you strategize your data analytics initiatives so that you can make the most informed decision possible by analyzing all the data you need in real time. Furthermore, we also will help you draw comparisons between Azure and AWS, the two leaders in cloud, and their capabilities in Big Data and Analytics as published in a handout released by Microsoft.

Beyond doubts, this is an era of data. Every touch point of your business generates volumes of data and these data cannot be simply whisked away, cast aside as valuable business insights can be unearthed with a little effort. Here’s where your Data Analytics infrastructure helps.

A 2017 Planning Guide for Data and Analytics published by Gartner written by the Analyst John Hagerty states that

The Key Findings as per the report are as follows:

  • Data and analytics must drive modern business operations, not just reflect them. Technical professionals must holistically manage an end-to-end data and analytics architecture to acquire, organize, analyze and deliver insights to support that goal.
  • Analytics are now infused in places where they never existed before.
  • Executives will seek strategies to better manage and monetize data for internal and external business ecosystems.
  • Data gravity is rapidly shifting to the cloud, with IoT, data providers and cloud-native applications leading the way. It is no longer a question of "if" for using cloud for data and analytics; it's "how."

The last point emphasizes on how cloud is playing a prominent role when it comes to Data Analytics and if you have thoughts on who and how, Gartner in its latest magic quadrant has said that AWS and Azure are the top leaders. Now, if you are in doubt whether to go the Azure way or AWS or should it be the both, here’s the comparison table showing their respective Big Data and Analytics Capabilities


Elastic data warehouseA fully managed data warehouse that analyzes data using business intelligence tools. RedshiftSQL Data Warehouse
Big data processingSupports technologies that break up large data processing tasks into multiple jobs, and then combine the results to enable massive parallelism.Elastic MapReduce (EMR)HDInsight
Data orchestrationProcesses and moves data between different compute and storage services, as well as on-premises data sources at specifed intervals. Data PipelineData Factory
Cloud-based ETL/data integration service that orchestrates and automates the movement and transformation of data from various sources.AWS Glue Data CatalogData Factory + Data Catalog
AnalyticsStorage and analysis platforms that create insights from massive quantities of data, or data that originates from many sources.Kinesis AnalyticsStream Analytics

Data Lake Analytics

Data Lake Store
Streaming dataAllow mass ingestion of small data inputs, typically from devices and sensors, to process and route data.Kinesis Streams

Kinesis Firehose
Event Hubs

Event Hubs Capture
Visualizationperform ad-hoc analysis, and develop business insights from data.QuickSight (Preview)Power BI
Allows visualization and data analysis tools to be embedded in applications.Power BI Embedded
SearchA scalable search server based on Apache Lucene.Elasticsearch ServiceMarketplace—Elasticsearch
Delivers full-text search and related search analytics and capabilities.CloudSearchSearch
Machine learningProduces an end-to-end workfow to create, process, refne, and publish predictive models from complex data sets.Machine LearningMachine Learning
Data discoveryProvides the ability to better register, enrich, discover, understand, and consume data sources.Data Catalog
A serverless interactive query service that uses standard SQL for analyzing databases.Amazon AthenaData Lake Analytics

Click here to read the entire guide published by Microsoft Azure Team:

Azure vs AWS for Database

This is our fourth blog in the series of blogs intended to help you embark on a cloud strategy, most importantly when you are in dilemma to choose AWS or Azure, the two prominent cloud players today.

If you had missed our earlier blogs, click here

1st Blog - Compute

2nd Blog- Storage

3rd Blog- CDN & Networking

Before we jumpstart on the actual comparison chart of Azure and AWS, we would like to bring you some basics on the database aspect of cloud strategy.

If you would rather like to have quick look at the database comparison table, click here

Through this blog, let’s understand the database aspect of your cloud strategy. As per the guide, Database services refers to options for storing data, whether it’s a managed relational SQL database that’s globally distributed or a multi-model NoSQL database designed for any scale.

When you decide cloud, one of the critical decisions you face is which database to use - SQL or NoSQL. Though SQL has an impressive track record, NoSQL is not far behind as it is gradually making notable gains and has many proponents. Once you have picked your database, the other big decision to make is which cloud vendor to choose amongst the many vendors.

Here’s where you consider Gartner’s prediction; the research company published a document that states

“Public cloud services, such as Amazon Web Services (AWS), Microsoft Azure and IBM Cloud, are innovation juggernauts that offer highly operating-cost-competitive alternatives to traditional, on-premises hosting environments.

Cloud databases are now essential for emerging digital business use cases, next-generation applications and initiatives such as IoT. Gartner recommends that enterprises make cloud databases the preferred deployment model for all new business processes, workloads, and applications. As such, architects and tech professionals should start building a cloud-first data strategy now, if they haven't done so already”

Reinstating the trend, recently Gartner has published a new magic quadrant for infrastructure-as-a-service (IaaS) that – surprising nobody – has Amazon Web Services and Microsoft alone in the leader's quadrant and a few others thought outside of the box.


Now, the question really is, Azure or AWS for your cloud data? Or should it be both? Here’s a quick comparison table to guide you.

Relational databaseSQL Database is a high-performance, reliable, and secure database you can use to build data-driven applications and websites, without needing to manage infrastructure.RDSSQL Database including Postgres and MySQL
NoSQL—document storageA globally-distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar.DynamoDBCosmos DB
NoSQL—key/value storageA non-relational data store for semi-structured data. DynamoDB and SimpleDBTable Storage
CachingAn in-memory–based, distributed-caching service that provides a high-performance store typically used to offoad non-transactional work from a database.ElastiCacheRedis Cache
Database migrationFocuses on migration of database schema and data from one database format to a specifc database technology in the cloud.Database Migration Service (Preview)SQL Database Migration Wizard

Click here to read the entire guide published by Microsoft Azure Team:

Azure vs AWS for Networking & Content Delivery

In line with our latest blog series highlighting how common cloud services are made available via Azure and Amazon Web Services (AWS), as published by Microsoft, this third blog in the series helps you understand Cloud Networking and Content Delivery capabilities of both Azure and AWS.

Before we jumpstart on the actual comparison chart of Azure and AWS, we would like to bring you some basics on cloud content delivery networking and the current trends on the subject.

If you would rather like to have quick look at the comparison table, click here

When we talk about cloud Content Delivery Network (CDN) and the related networking capabilities it includes all the hardware and software that allows you to easily provision private networks, connect your cloud application to your on-premises datacenters, and more.

According to Gartner, Content delivery networks (CDNs) are a type of distributed computing infrastructure, where devices (servers or appliances) reside in multiple points of presence on multi-hop packet-routing networks, such as the Internet, or on private WANs. A CDN can be used to distribute rich media downloads or streams, deliver software packages and updates, and provide services such as global load balancing, Secure Sockets Layer acceleration and dynamic application acceleration via WAN optimization techniques.

In simpler terms, this highly distributed server platforms are optimized to deliver content in a way that improves customer experience. Hence, it is important to decrease latency by keeping the data closer to the users, protect it from security threats while ensuring rapid streamlined content delivery including general web delivery, content purge, content caching and tracking history as long as 90 days.

As per, most organizations use CDN services, such as web caching, request routing, and server-load balancing, to reduce load times and improve website performance. Further to qualify as a CDN provider, a service provider must:

  • Allow access to a geographically dispersed network of PoPs in multiple data centers
  • Help websites access this network to deliver content to website visitors
  • Offer services designed to improve website performance
  • Provide scalable Internet bandwidth allowances according to customer needs
  • Maintain data center(s) of servers to reduce the possibility of overloading individual instances

With this background, let’s look at the AWS vs Azure comparison chart in terms of Networking and Content Delivery Capabilities:

Cloud virtual networkingProvides an isolated, private environment in the cloud.Virtual Private CloudVirtual Network
Cross-premises connectivityConnects Azure virtual networks to other Azure virtual networks or customer on-premises networks. It also supports VPN tunneling.AWS VPN GatewayVPN Gateway
Domain name system managementManage DNS records using the same credentials, billing, and support contract as other Azure services.

Service that hosts domain names, routes users to Internet applications, manages traf c to apps, and improves app availability with automatic failover.
Route 53

Route 53

Traffic Manager
Content delivery networkGlobal content delivery network that transfers audio, video, applications, images, and other les.CloudFrontContent Delivery Network
Dedicated networkEstablishes a dedicated, private network connection from a location to the cloud provider.Direct ConnectExpressRoute
Load balancingAutomatically distributes incoming application traf c to add scale, handle failover, and route to a collection of resources.Elastic Load BalancingLoad Balancer

Application Gateway

To read more about the Microsoft guide which briefs all about cloud by drawing comparisons between Azure or AWS, click here (link to PDF download)

You may also like to read our previous blogs in these series, if so, please click here:

AWS vs AZURE for Cloud Storage

Azure or AWS or Azure & AWS? What’s your cloud strategy for Storage?

This is our second blog, in our latest blog series helping you understand all about cloud, especially when you are in doubt whether to go Azure or AWS or both.

To read our first blog talking about Cloud strategy in general and Compute in particular, click here…

Moving on, in this blog let’s find what Azure or AWS offer when it comes to Storage Capabilities for your Cloud Infrastructure.

Globally CIOs are increasingly looking to cease running their own data centers and move to cloud which is evident when we read the projection made by a leading researcher, MarketsandMarkets. They had reported that the global cloud storage business sector to grow from $18.87 billion in 2015 to $65.41 billion by 2020, at a compound annual growth rate (CAGR) of 28.2 percent during the forecast period.

Reinstating the fact, 451 Research’s Voice of the Enterprise survey last year stated that Public cloud storage spending will double by next year (2017). "IT managers are recognizing the need for storage transformation to meet the realities of the new digital economy, especially in terms of improved efficiency and agility in the face of relentless data growth," said Simon Robinson, research vice president at 451 and research director of the new Voice of the Enterprise: Storage service. "It's clear from our Q4 study that emerging options, especially public cloud storage and all-flash array technologies, will be increasingly important components in this transformation" he added further.

As we see, many companies are in for Cloud Storage, undoubtedly. But the big question - Whom to choose from a gamut of leading public cloud players including big players like AZURE, AWS; Should it be AZURE alone for your cloud storage or AWS or a combination of both still prevails.

This needs a thorough understanding. To help you decide for good, we have decided to re-produce a guide, published by Microsoft that briefs Azure‘s capabilities in comparison to AWS when it comes to Cloud Strategy. And we will see the Storage part in this blog, but before, that a little backgrounder on Cloud Storage.

When we talk about cloud storage device mechanisms, we include all logical units of data storage covering from files, blocks, and datasets to objects and their relative storage interfaces. These instances of virtual storage devices are designed specifically for cloud-based provisioning and can be scaled as per need. It is to be noted that different cloud service consumers utilize different technologies to interface with virtualized cloud storage devices.

Object storageObject storage service for use cases including cloud apps,
content distribution, backup, archiving, disaster recovery,
and big data analytics.
Simple Storage Services (S3) Storage—Block Blob (for content logs, files) (Standard—Hot)
Virtual Server disk
SSD storage optimized for I/O intensive
read/write operations.
Elastic Block Store (EBS)Disk Storage—Page Blobs (for VHDs or other random-write type data)

Disk Storage—Premium Storage
Shared file storageA simple interface to create and configure file
systems quickly as well as share common files.
Elastic File SystemFile Storage (file share between VMs)
Archiving—cool storageA lower cost tier for storing data that is
infrequently accessed and long-lived.
S3 IA GlacierStorage—Hot, Cool & Archive Tier
BackupBackup and archival solutions that allow files and folders
to be backed-up and recovered from the cloud, and
provide off-site protection against data loss.
Backup and RecoveryBackup
Hybrid storageIntegrates on-premises IT environments with cloud
storage. Automates data management and storage, plus
supports disaster recovery.
Storage GatewayStorSimple
Bulk data transferA data transport solution that uses secure disks and
appliances to transfer substantial amounts of data.

Petabyte- to Exabyte-scale data transport solution.
AWS Import/Export Disk

AWS Import/Export Snowball

AWS Snowball Edge

AWS Snowmobile

Data Box
Disaster recoveryAutomates protection and replication of virtual
machines with health monitoring, recovery plans,
and recovery plan testing.
Site Recovery

For a more detailed understanding download the document here

Azure vs. AWS or Azure & AWS?

Surprisingly, as per an article published by Gartner, “Cloud Computing is still perplexing to many CIOs even after a decade of cloud’. While cloud computing is a foundation for digital business, Gartner estimates that less than one-third of enterprises have a documented cloud strategy. This indeed comes as a surprise given the fact that cloud has evolved from a disruption to the indispensable tech of today and tomorrow, all along strategically adopted by many progressive companies.

In the same article Donna Scott, Vice President and distinguished analyst at Gartner states that “Cloud computing will become the dominant design style for new applications and for refactoring a large number of existing applications over the next 10-plus years”. She also added that “A cloud strategy clearly defines the business outcomes you seek, and how you are going to get there. Having a cloud strategy will enable you to apply its tenets quickly with fewer delays, thus speeding the arrival of your ultimate business outcomes.”

However, it is easier said than done. Many top businesses still have questions like how to make the most from cloud computing? What kind of architectures and techniques need to be strategized to support the many flavors of evolving cloud computing? Private or Public? Hybrid or Public? Azure or AWS, or it should be a hybrid combo?

Through a series of blogs we intent to bring answers to these questions. As a first one, we would like to highlight and represent a comparative cloud service map focusing on both Azure and AWS both leaders in public cloud platforms, as published by Microsoft.

The well-researched article draws detailed comparisons between Azure and AWS and how common cloud services across parameters such as Marketplace, Compute, Storage, Networking, Database, Analytics, Big Data, Intelligence, IOT, Mobile and Enterprise Integration are made available via Azure and Amazon Web Services (AWS)

It should be noted that as prominent public cloud platforms providers, Azure and AWS each offer businesses a wide and comprehensive capabilities across the globe. Many organizations have chosen either one of them or both depending upon their needs in order to gain more agility, and flexibility while minimizing the risk and maximizing the larger benefits of a multi-cloud environment.

For starters, let’s start with COMPUTE and the points one should consider and compare before deciding the Azure or AWS approach or a combination of both.

Virtual serversAllows users to deploy, manage, and maintain
OS and server software; instance types provide
configurations of CPU/RAM.

Offers a lightweight, simplified product offering users can
choose from from when building out a virtual machine.
Elastic Compute Cloud (EC2)

Amazon Lightsail
Virtual Machines

Virtual Machine Images
Container managementSupports Docker containers and allows users to run
applications on managed instance clusters.

Allows customers to store Docker formatted images. Used
to create all types of container deployments on Azure.
EC2 Container Service (ECS)

EC2 Container Registry
Container Service

Container Registry
Orchestrates and manages the execution, lifetime, and
resilience of complex, interrelated code components
that can be either stateless or stateful.
Service Fabric
Backend process logicIntegrates systems and runs backend processes
in response to events or schedules without
provisioning or managing servers.

Event Grid
Job orchestrationWhen processing across hundreds or thousands
of compute nodes, this tool orchestrates the
tasks and interactions between compute
resources that are necessary.
AWS BatchBatch
ScalabilityAutomatically changes the number of instances
providing a compute workload. Users set defined
metrics and thresholds that determine if the platform
adds or removes instances.
AWS Auto ScalingVirtual Machine Scale Sets

App Service Scale Capability (PAAS)

Pre-defined templatesCommunity-led templates for creating and
deploying virtual machine-based solutions.
AWS Quick StartQuickstart templates

For a more detailed understanding download the document here

Microservices – Aligning business and technology for closer collaboration, agility & flexibility

A microservices-based architecture introduces agility, flexibility and supports a sustainable DEVOPS culture ensuring closer collaboration within businesses and the news is that it’s actually happening for those who embraced it.

True, monolith apps architectures have enabled businesses to benefit from IT all along as it is single coded, simple to develop, test and run. As they are also based on a logical modular hexagonal or layered architectures (Presentation Layer responsible for handling HTTP requests and responding with either HTML or JSON/XML, Business logic layer, Database access and Apps integration) they cover and tie all processes, functions and gaps to an extent.

Despite these ground level facts, monolith software, which is instrumental for businesses embrace IT in their initial stages and which even exists today, is seeing problems. The growing complex business operation conditions are purely to be blamed.

So, how do businesses today address new pressures caused by digitization, continuous technology disruptions, increased customer awareness & interceptions and sudden regulatory interventions? The answer lies in agility, flexibility and scalability of the underlying IT infrastructure- the pillars of rapid adaptability to changes.

Monolith Apps, even though it is based on a well-designed 3 tier architecture, in the long run, loses fluidity and turns rigid. Irrespective of its modularity, modules are still dependent on each other and any minimal change in one module needs generation and deployment of all artifacts in each server pool, touched across the distributed environment.

Besides whenever there is a critical problem, the blame game starts amongst the UI developers, business logic experts, backend developers, database programmers, etc as they are predominantly experts in their domains, but have little knowledge about other processes. As the complexity of business operations sets in, the agility, flexibility and scalability part of your software is highly tested in a monolithic environment.

Here’s where Microservices plays a huge role as the underlying architecture helps you break your software applications into independent loosely coupled services that can be deployed and managed solely at that level and needn’t have to depend on other services.

For example, if your project needs you to design and manage inventory, sales, shipping, and billing and UI shopping cart modules, you can break each service down as an independently deployable module. Each has its own database, where monitoring and maintenance of application servers are done independently as the architecture allows you to decentralize the database, reducing complexity. Besides it enables continuous delivery/deployment of large, complex applications which means technology also evolves along with the business.

The other important aspect is that microservices promotes a culture wherein whoever develops the service is also responsible to manage it. This avoids the handover concept and the following misunderstandings and conflicts whenever there is a crisis.

In line with the DevOps concept, Microservices enables easy collaboration between the development and operations team as they embrace and work on a common toolset that establishes common terminology, as well as processes for requirements, dependencies, and problems. There is no denying the fact that DevOps and microservices work better when applied together.

Perhaps that’s the reason companies like Netflix, Amazon, etc are embracing the concept of microservices in their products. And for other new businesses embracing it, a new environment where agility, flexibility and closer collaboration between business and technology becomes a reality providing the much-needed edge in these challenging times.

Restrict s3 bucket Access to VPC

Allow access to s3 bucket only from vpc

Currently I am evaluating options to lockdown permissions to my S3 Buckets as part of Security Enhancements. These are the steps I followed to lock down S3 Bucket access only to my VPC

Create VPC End Points

VPC End Points Screen Shot

Attach the S3 Bucket Policy to Restrict Access
	"Version": "2012-10-17",
	"Id": "Policy123456789",
	"Statement": [
			"Sid": "Stmt123456789",
			"Effect": "Deny",
			"Principal": "*",
			"Action": "s3:*",
			"Resource": "arn:aws:s3:::example-confidential/*",
			"Condition": {
				"StringNotEquals": {
					"aws:sourceVpc": "vpc-2f2b202b"
Access the Buckets Outside VPC

Once you have attached the policy, if you access the S3 Files through console not being on VPC, you will receive the error.

AccessDenied Access Denied 14FB0BEFD8A0C8E5 JrFOr/6Fe20lyMxjCy6lPhJIJ8sj3kG7zSiel2kcvv6OUssHQ2W/e7bYTjD3hXjX2m1/aHB+G1I=
Access the Buckets from VPC

If you log into a EC2 Instance which is hosted on VPC, you will be able to access the s3 Bucket.

SSH Into your EC2 Machine and verify your VPC through Instance Meta Data Store.

[ec2-user]# curl http: //
01:ed:88:51:f6:29/ [ec2-user]# curl http: //

If you execute s3 commands to access the bucket, you will be able to access the S3 Bucket without access denied error.

aws s3 ls example-confidential aws s3 cp s3:: //example-confidential/SampleConfidentialFile.txt  SampleConfidentialFile.txt

MYSQL / AURORA Database Troubleshooting

Here is a look at some of the common queries that will be useful when troubleshooting AURORA database.

Number of Connections by Host
    FROM information_schema.processlist 
    group by SUBSTRING(HOST, 0, 10) , DB,USER 
    ORDER BY Count desc ;
    -- '', 'Portal', 'webguest-dev', '46'
Aurora Max Connections
        select * from mysql.slow_log 
        where sql_text not like '%LOAD DATA%'
        order by query_time desc
        limit 1000 ;
        select count(*) from mysql.general_log 
        where user_host not like 'rdsadmin%'
        and user_host not like '[rdsadmin]%'
        and event_time > '2017-06-15 18:51:14';
        select current_timestamp();
        desc mysql.general_log  ;
        select @@MAX_CONNECTIONS
        -- '4000'
        select *  from mysql.general_log 
        where command_type like '%Connect%';
        select *  from mysql.general_log_backup 
        where command_type like '%Connect%' ;
        SHOW GLOBAL STATUS LIKE '%Connection_errors%';
        SHOW STATUS WHERE `variable_name` = 'Threads_connected';

Memory-Optimized Tables

Monitor Memory Optimized Table Space Usage

Memory-optimized tables are fully durable by default, and, like transactions on (traditional) disk-based tables, transactions on memory-optimized tables are fully atomic, consistent, isolated, and durable (ACID). Memory-optimized tables and natively compiled stored procedures support only a subset of Transact-SQL features. The following blog post shows how to monitor the table space usage.

        WITH    system_allocated_memory ( system_allocated_memory_in_mb )
              AS ( SELECT   ISNULL(( SELECT CONVERT(DECIMAL(18, 2), 
             ( SUM(TMS.memory_allocated_for_table_kb)
             + SUM(TMS.memory_allocated_for_indexes_kb) )
             / 1024.00)
             FROM   [sys].[dm_db_xtp_table_memory_stats] TMS
             WHERE  TMS.object_id <= 0
             ), 0.00)
             table_index_memory ( table_used_memory_in_mb, table_unused_memory_in_mb, 
             index_used_memory_in_mb, index_unused_memory_in_mb )
             ( SUM(TMS.memory_used_by_table_kb)
             / 1024.00 ))
             ), 0.00) AS table_used_memory_in_mb ,
             ISNULL(( SELECT CONVERT(DECIMAL(18, 2), ( SUM(TMS.memory_allocated_for_table_kb)
             - SUM(TMS.memory_used_by_table_kb) )
             / 1024.00)
             ), 0.00) AS table_unused_memory_in_mb ,
             ISNULL(( SELECT CONVERT(DECIMAL(18, 2), ( SUM(TMS.memory_used_by_indexes_kb)
             / 1024.00 ))
             ), 0.00) AS index_used_memory_in_mb ,
             ISNULL(( SELECT CONVERT(DECIMAL(18, 2), ( SUM(TMS.memory_allocated_for_indexes_kb)
             - SUM(TMS.memory_used_by_indexes_kb) )
             / 1024.00)
             ), 0.00) AS index_unused_memory_in_mb
             FROM     [sys].[dm_db_xtp_table_memory_stats] TMS
            WHERE    TMS.object_id > 0
        SELECT  s.system_allocated_memory_in_mb ,
                t.table_used_memory_in_mb ,
                t.table_unused_memory_in_mb ,
                t.index_used_memory_in_mb ,
                t.index_unused_memory_in_mb ,
            ), 0) AS has_memory_optimized_filegroup
        FROM    system_allocated_memory s ,
           table_index_memory t
        SELECT  t.object_id ,
            ISNULL(( SELECT CONVERT(DECIMAL(18, 2), ( TMS.memory_used_by_table_kb )
            / 1024.00)
            ), 0.00) AS table_used_memory_in_mb ,
                ISNULL(( SELECT CONVERT(DECIMAL(18, 2), ( TMS.memory_allocated_for_table_kb
                    - TMS.memory_used_by_table_kb )
            / 1024.00)
            ), 0.00) AS table_unused_memory_in_mb ,
                ISNULL(( SELECT CONVERT(DECIMAL(18, 2), ( TMS.memory_used_by_indexes_kb )
            / 1024.00)
            ), 0.00) AS index_used_memory_in_mb ,
                ISNULL(( SELECT CONVERT(DECIMAL(18, 2), ( TMS.memory_allocated_for_indexes_kb
                    - TMS.memory_used_by_indexes_kb )
            / 1024.00)
            ), 0.00) AS index_unused_memory_in_mb
        FROM    sys.tables t
            JOIN sys.dm_db_xtp_table_memory_stats TMS ON ( t.object_id = TMS.object_id )
All Memory Used by Memory Optimized Table across Database Engine
        -- this DMV accounts for all memory used by the hek_2 engine 
        SELECT type ,
        name ,
        memory_node_id ,
        pages_kb / 1024 AS pages_MB
        FROM sys.dm_os_memory_clerks
        WHERE type LIKE '%xtp%' 
Enable Natively Compiled Stored Procedure Stats Collection
        EXEC [sys].[sp_xtp_control_proc_exec_stats] @new_collection_value = 1  
        DECLARE @c BIT  
        EXEC sp_xtp_control_proc_exec_stats @old_collection_value = @c OUTPUT  
        SELECT  @c AS 'collection status' 
DBCC FREEPROCCACHE does not remove natively compiled stored procedures from Plan Cache
        DECLARE @sql NVARCHAR(MAX) = N''
        SELECT  @sql += N'EXECUTE sp_recompile N'''
                + QUOTENAME(SCHEMA_NAME(o.schema_id)) + N'.' + QUOTENAME( + '''
        FROM    sys.sql_modules sm
                JOIN sys.objects o ON sm.object_id = o.object_id
        WHERE   uses_native_compilation = 1
        EXECUTE sp_executesql @sql
        -- Reset wait and latch statistics.
        DBCC SQLPERF('sys.dm_os_latch_stats' , CLEAR)
        DBCC SQLPERF('sys.dm_os_wait_stats' , CLEAR) 
Errors Encountered During Migration :

Msg 41317, Level 16, State 5, Line 6
A user transaction that accesses memory optimized tables or natively compiled modules cannot access more than one user database or databases model and msdb, and it cannot write to master.

Bulk Load Data Files into Aurora RDS from S3 Bucket using AWS Data Pipeline

Bulk Load Data Files in S3 Bucket into Aurora RDS

We typically get data feeds from our clients ( usually about ~ 5 – 20 GB) worth of data. We download these data files to our lab environment and use shell scripts to load the data into AURORA RDS . We wanted to avoid unnecessary data transfers and decided to setup data pipe line to automate the process and use S3 Buckets for file uploads from the clients.

In theory it’s very simple process of setting up data pipeline to load data from S3 Bucket into Aurora Instance .Even though it’s trivial , setting up this process is very convoluted multi step process . It’s not as simple as it sounds . Welcome to Managed services world.

  • Create ROLE and Attach S3 Bucket Policy :
  • Create Cluster Parameter Group
  • Modify Custom Parameter Groups to use ROLE

By default aurora cannot access S3 Buckets and we all know it’s just common sense default setup to reduce the surface area for better security.

For EC2 Machines you can attach a role and the EC2 machines can access other AWS services on behalf of role assigned to the Instance.Same method is applicable for AURORA RDS. You Can associate a role to AURORA RDS which has required permissions to S3 Bucket .

There are ton of documentation on how to create a role and attach policies . It’s pretty widely adopted best practice in AWS world. Based on AWS Documentation, AWS Rotates access keys attached to these roles automatically. From security aspect , its lot better than using hard coded Access Keys.

In Traditional Datacenter world , you would typically run few configuration commands to change configuration options .( Think of sp_configure in SQL Server ).

In AWS RDS World , its tricky . By default configurations gets attached to your AURORA Cluster . If you need to override any default configuration , you have to create your own DB Cluster Parameter Group and modify your RDS instance to use the custom DB Cluster Parameter Group you created.Now you can edit your configuration values .

The way you attach a ROLE to AURORA RDS is through Cluster parameter group .

These three configuration options are related to interaction with S3 Buckets.

  • aws_default_s3_role
  • aurora_load_from_s3_role
  • aurora_select_into_s3_role

Get the ARN for your Role and modify above configuration values from default empty string to ROLE ARN value.

Then you need to modify your Aurora instance and select to use the role . It should show up in the drop down menu in the modify role tab.

        GRANT LOAD FROM S3 ON *.* TO user@domain-or-ip-address
        GRANT LOAD FROM S3 ON *.* TO 'aurora-load-svc'@'%' 

Without Reboot you will be spending lot of time troubleshooting. You need to reboot to the AURORA Instance for new cluster parameter values to take effect.

After this you will be be able to execute the LOAD FILE FROM S3 to AURORA .

Screen Shots :
Create ROLE and Attach Policy :

Attach S3 Bucket Policy :

Create Parameter Group :

Modify Custom Parameter Groups

Modify AURORA RDS Instance to use ROLE

Troubleshooting :
Errors :

Error Code: 1871. S3 API returned error: Missing Credentials: Cannot instantiate S3 Client 0.078 sec

Usually means , AURORA Instance can’t reach S3 Bucket. Make sure you have applied the role and rebooted the Instance.

Sample BULK LOAD Command :

You could use following sample scripts to test your Setup.

        LOAD DATA FROM S3 's3://yourbucket/allusers_pipe.txt'
        INTO TABLE ETLStage.users
        (@var1, @var2, @var3, @var4, @var5, @var6, @var7, @var8, @var9, @var10, @var11, @var12, @var13, @var14, @var15, @var16, @var17, @var18)
        userid = @var1,
        username = @var2,
        firstname = @var3,
        lastname = @var4,

Sample File in S3 Public Bucket : s3://awssampledbuswest2/tickit/allusers_pipe.txt

        SELECT * FROM ETLStage.users INTO OUTFILE S3's3-us-west-2://s3samplebucketname/outputestdata'
        OVERWRITE ON; 
        create table users_01(
        userid integer not null primary key,
        username char(8),
        firstname varchar(30),
        lastname varchar(30),
        city varchar(30),
        state char(2),
        email varchar(100),
        phone char(14),
        likesports varchar(100),
        liketheatre varchar(100),
        likeconcerts varchar(100),
        likejazz varchar(100),
        likeclassical varchar(100),
        likeopera varchar(100),
        likerock varchar(100),
        likevegas varchar(100),
        likebroadway varchar(100),
        likemusicals varchar(100)) 

Getting Started with AWS Data Pipeline and Loading S3 File Into MySQL

Getting started with AWS Data Pipeline

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.

AWS Data Pipe Line Sample Workflow

AWS Data Pipe Line Sample Workflow
Default IAM Roles

AWS Data Pipeline requires IAM roles to determine what actions your pipelines can perform and who can access your pipeline’s resources.

The AWS Data Pipeline console creates the following roles for you:



                "Version": "2012-10-17",
                "Statement": [
                 "Effect": "Allow",
                 "Action": [
                 "Resource": [
                "Version": "2012-10-17",
                "Statement": [
                 "Effect": "Allow",
                 "Action": [
                 "Resource": [
Error Message:

Error MessageUnable to create resource for @EC2ResourceObj_2017-05-05T04:25:32 due to: No default VPC for this user (Service: AmazonEC2; Status Code: 400; Error Code: VPCIdNotSpecified; Request ID: bd2f3abb-d1c9-4c60-977f-6a83426a947d)


When you look at your VPC, you would notice Default VPC is not configured. While launching EC2 Instance on Data Pipeline, by default it can’t figure out which VPC to use and that needs to be explicitly specified in Configurations.

SubNetID for EC2 Resource

Default VaPC
Build Sample Data Pipeline to Load S3 File into MySQL Table :

Use Cases for AWS Data Pipeline
Setup sample Pipeline in our develop environment
Import Text file from AWS S3 Bucket to AURORA Instance
Send out notifications through SNS to
Export / Import Data Pipe Line Definition.


Have MySQL Instance
Access to Invoke Data Pipeline with appropriate permissions
Target Database and Target Table
SNS Notification setup with right configuration

Steps to Follow:

Create Data Pipeline with Name
Create MySQL Schema and Table
Configure Your EC2 Resource ( Make sure EC2 instance has access to MySQL Instance ).
If MySQL instance allows only certain IPS’s and VPC, then you need to configure your EC2 Resource in the same VPC or Subnet.
Configure Data Source and appropriate Data Format ( Notice this is Pipe Delimited File ant CSV File ).
Configure your SQL Insert Statement
Configure SNS Notification for PASS / FAIL Activity.
Run your Pipeline and Troubleshoot if errors occur.

Data Pipe Line JSON Definiton:


Create Table SQL :
                create table users_01(
                userid integer not null primary key,
                username char(8),
                firstname varchar(30),
                lastname varchar(30),
                city varchar(30),
                state char(2),
                email varchar(100),
                phone char(14),
                likesports varchar(100),
                liketheatre varchar(100),
                likeconcerts varchar(100),
                likejazz varchar(100),
                likeclassical varchar(100),
                likeopera varchar(100),
                likerock varchar(100),
                likevegas varchar(100),
                likebroadway varchar(100),
                likemusicals varchar(100)) 
                INSERT INTO `ETLStage`.`users_01`
Errors Encountered:

errorMessageQuote character must be defined in record format

You can use “TSV” type as your custom format type and provide:

  • “Column separator” as pipe(|),
  • “Record separator” as new line(\n),
  • “Escape Char” as backslash(\) or any other character you wa

errorId : ActivityFailed:SQLException
errorMessage : No value specified for parameter
errorMessage : Parameter index out of range (1 > number of parameters, which is 0).
errorMessage : Incorrect integer value: ‘FALSE’ for column ‘likesports’ at row 1

Ensure the Table Column Data Type set to correct . By Default MySQL Doesn’t covert TRUE / FALSE into Boolean Data Type.

errorMessage : Parameter index out of range (1 > number of parameters, which is 0).
errorMessage for Load script: ERROR 1227 (42000) at line 1: Access denied; you need (at least one of) the LOAD FROM S3 privilege(s) for this operation

AWS Regions and Availability Zones

AWS region codeAWS region nameNumber of AZsAZ names
us-west-2Oregon3us-west-2a, us-west-2b, us-west-2c
ap-southeast-2 Sydney2ap-southeast-2a, ap-southeast-2b, ap-southeast-2c
us-east-1 Virginia4us-east-1a, us-east-1b, us-east-1c, us-east-1e
us-west-1 N. California 2us-west-1a, us-west-1b
eu-west-1 Ireland3eu-west-1a, eu-west-1b, eu-west-1c
eu-central-1 Frankfurt2eu-central-1a, eu-central-1b
ap-southeast-1 Singapore 2ap-southeast-1a, ap-southeast-1b
ap-northeast-1 Tokyo2ap-northeast-1a, ap-northeast-1c
sa-east-1 Sao Paulo 3sa-east-1a, sa-east-1b, sa-east-1c

                C:\Users\Raju>AWS ec2 describe-regions
                C:\Users\Raju> aws ec2 describe-availability-zones 
AWS Regions for EC2

Availability Zones

Install SQL Server Docker Container on Linux

This blog post explains how to setup and configure SQL Server docker container on a linux machine. Microsoft recently started supporting running SQL Server on Linux and the entire process takes only few steps to run.

Install SQL Server Docker Image

//Pull the SQL Server Image from the docker registry
$docker pull microsoft/mssql-server-linux
$docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Password' -p 1433:1433 -d microsoft/mssql-server-linux
$docker exec -it 40dd "bash"
$/opt/mssql-tools/bin/sqlcmd -S localhost -U SA

[root@ip-10-0-0-110 ec2-user]# docker pull microsoft/mssql-server-linux
[root@ip-10-0-0-110 ec2-user]# docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Password' -p 1433:1433 -d microsoft/mssql-server-linux
[root@ip-10-0-0-120 ec2-user]# docker exec -it 40dd "bash"
root@40dde973f4a0:/# /opt/mssql-tools/bin/sqlcmd -S localhost -U SA

Command History
[root@ip-10-0-0-110 ec2-user]# docker pull microsoft/mssql-server-linux
Using default tag: latest
latest: Pulling from microsoft/mssql-server-linux
4c0c60131530: Pull complete
Digest: sha256:604d27fe5d3d9b4434fb1657e9bf4f2c2bf55ea9bd29dc0cb3660d84bc6f56a8
Status: Downloaded newer image for microsoft/mssql-server-linux:latest
[root@ip-10-0-0-110 ec2-user]# docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=Password' -p 1433:1433 -d microsoft/mssql-server-linux
[root@ip-10-0-0-120 ec2-user]# docker -it 40dd "bash"
flag provided but not defined: -it
See 'docker --help'.
[root@ip-10-0-0-120 ec2-user]# docker exec -it 40dd "bash"
root@40dde973f4a0:/# /opt/mssql-tools/bin/sqlcmd -S localhost -U SA
1> select @@servername;
2> go


(1 rows affected)
1> select db_name();
2> go


(1 rows affected)
Version Info:

select @@version

Microsoft SQL Server 2017 (CTP2.1) - 14.0.600.250 (X64) 
May 10 2017 12:21:23 
Copyright (C) 2017 Microsoft Corporation. All rights reserved.
Developer Edition (64-bit) on Linux (Ubuntu 16.04.2 LTS)
Backup Database on Docker Container and Copy to Host:

Connect to SQL Server Management Studio or SQLCMD and issue the backup command

Backup Database on Docker Container

TO DISK ='/var/opt/mssql/data/SalaryDatabase_V2_06132017.bak'
Copy the file from Container to Host and Sync with S3 Bucket:
 $ docker cp <containerId>:/file/path/within/container /host/path/target

$ docker cp aabb19ca439f:/var/opt/mssql/data/SalaryDatabase_06132017.bak /Docker/

$ aws s3 sync ./ s3://docker-backups

upload: ./SalaryDatabase_06132017.bak  to s3://docker-backups/SalaryDatabase_06132017.bak

Completed 18.4 GiB/25.8 GiB (46.4 MiB/s) with 1 file(s) remaining
Troubleshooting :

root@e83b4048db28:/var/opt/mssql/log# /opt/mssql-tools/bin/sqlcmd -S localhost -U SA
Sqlcmd: Error: Microsoft ODBC Driver 13 for SQL Server : Login failed for user ‘SA’..
root@e83b4048db28:/var/opt/mssql/log# exit
[root@ip-10-0-0-120 ec2-user]# docker rm $(docker ps -a -q)
Error response from daemon: You cannot remove a running container e83b4048db28505951f20fff4aff9f5132695fd1e1c7251c8daeb79d15ac403d. Stop the container before attempting removal or use -f
[root@ip-10-0-0-120 ec2-user]# docker rm -f $(docker ps -a -q)
Unable to telnet without allowing access to port 1433
MacBook:.ssh Raju$ telnet 1433
telnet: connect to address Operation timed out
telnet: Unable to connect to remote host
MacBook:.ssh Raju$ telnet 1433
Connected to
Escape character is ‘^]’.
^CError while connecting through SQL Server Management Studio without allowing access to port 1433
TITLE: Connect to Server
Cannot connect to

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 53)

For help, click:


The network path was not found






Error while connecting through SQLCMD without allowing access to port 1433

C:\Users\>sqlcmd -S -U SA
Password: HResult 0x35, Level 16, State 1
Named Pipes Provider: Could not open a connection to SQL Server [53].
Sqlcmd: Error: Microsoft SQL Server Native Client 10.0 : A network-related or in
stance-specific error has occurred while establishing a connection to SQL Server
. Server is not found or not accessible. Check if instance name is correct and i
f SQL Server is configured to allow remote connections. For more information see
SQL Server Books Online..
Sqlcmd: Error: Microsoft SQL Server Native Client 10.0 : Login timeout expired.

Open port 1433 in Security Groups

Allow inbound traffic from the IP’s or Security Groups you need SQL Server Access.

Restart Docker Container and See Docker Logs


1001 docker ps -a
1002 docker restart bb1b1
1003 docker logs bb
SQL Server Docker Container Errors:

017-06-09 00:08:39.35 spid9s Starting up database ‘tempdb’.
2017-06-09 00:08:39.45 spid26s Recovery of database ‘UserDBName’ (7) is 0% complete (approximately 1717 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2017-06-09 00:08:40.15 spid9s The tempdb database has 1 data file(s).
2017-06-09 00:08:40.16 spid36s The Service Broker endpoint is in disabled or stopped state.
2017-06-09 00:08:40.17 spid36s The Database Mirroring endpoint is in disabled or stopped state.
2017-06-09 00:08:40.35 spid36s Service Broker manager has started.
2017-06-09 00:08:41.05 spid33s [INFO] HkRecoverFromLogOpenRange(): Database ID: [5]. Log recovery scan from 00000495:00005D20:006B to 000004C7:00010F48:0002.
2017-06-09 00:08:59.47 spid26s Recovery of database ‘UserDBName’ (7) is 16% complete (approximately 110 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2017-06-09 00:09:11.16 Logon Error: 18456, Severity: 14, State: 38.
2017-06-09 00:09:11.16 Logon Login failed for user ‘UserName’. Reason: Failed to open the explicitly specified database ‘UserDBName’. [CLIENT:]
2017-06-09 00:09:19.51 spid26s Recovery of database ‘UserDBName’ (7) is 30% complete (approximately 94 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2017-06-09 00:09:39.59 spid26s Recovery of database ‘UserDBName’ (7) is 44% complete (approximately 76 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2017-06-09 00:09:51.38 spid26s Recovery of database ‘UserDBName’ (7) is 54% complete (approximately 62 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2017-06-09 00:09:51.40 spid26s Recovery of database ‘UserDBName’ (7) is 54% complete (approximately 62 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
DBSTARTUP (UserDBName, 7): FCBOpenTime took 164 ms
DBSTARTUP (UserDBName, 7): FCBHeaderReadTime took 135 ms
DBSTARTUP (UserDBName, 7): FileMgrPreRecoveryTime took 277 ms
DBSTARTUP (UserDBName, 7): MasterFilesScanTime took 144 ms
DBSTARTUP (UserDBName, 7): AnalysisRecTime took 1470 ms
DBSTARTUP (UserDBName, 7): RedoRecTime took 71938 ms
DBSTARTUP (UserDBName, 7): UndoRecTime took 4903 ms
DBSTARTUP (UserDBName, 7): PhysicalRecoveryTime took 73408 ms
DBSTARTUP (UserDBName, 7): PhysicalCompletionTime took 4913 ms
DBSTARTUP (UserDBName, 7): RecoveryCompletionTime took 102 ms
DBSTARTUP (UserDBName, 7): StartupInDatabaseTime took 136 ms
DBSTARTUP (UserDBName, 7): RemapSysfiles1Time took 125 ms
2017-06-09 00:10:11.44 spid6s Recovery of database ‘UserDBName’ (7) is 63% complete (approximately 55 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2017-06-09 00:10:31.48 spid6s Recovery of database ‘UserDBName’ (7) is 63% complete (approximately 65 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2017-06-09 00:10:34.61 spid33s [INFO] HkRedoCloseLastOpenRangeSegment(): Database ID: [5]. Log recovery open segment scan from 00000495:00005D20:006B to 000004C7:00010E78:002F.
2017-06-09 00:10:34.67 spid25s [INFO] redoOpenRangeSegment(): Database ID: [5]. Log recovery open segment scan completed at 000004C7:00010E78:002F.
2017-06-09 00:10:34.67 spid25s [INFO] HkPrintUndoRowStats(): Database ID: [5]. Undo Rows Stats. [UndoRowsSeen] = 0, [UndoRowsMatched] = 0, [InsertRowsMatched] = 0, [InsertRowsSeen] = 0, [UndoRowsAborted] = 0
DBSTARTUP (UserDBName, 5): FCBOpenTime took 202 ms
DBSTARTUP (UserDBName, 5): FCBHeaderReadTime took 133 ms
DBSTARTUP (UserDBName, 5): FileMgrPreRecoveryTime took 308 ms
DBSTARTUP (UserDBName, 5): MasterFilesScanTime took 158 ms
DBSTARTUP (UserDBName, 5): StreamFileMgrPreRecoveryTime took 141 ms
DBSTARTUP (UserDBName, 5): LogMgrPreRecoveryTime took 478 ms
DBSTARTUP (UserDBName, 5): PhysicalCompletionTime took 116181 ms
DBSTARTUP (UserDBName, 5): HekatonRecoveryTime took 116167 ms
2017-06-09 00:10:34.84 spid24s Recovery completed for database UserDBName (database ID 5) in 117 second(s) (analysis 12 ms, redo 0 ms, undo 50 ms.) This is an informational message only. No user action is required.
2017-06-09 00:10:51.48 spid6s Recovery of database ‘UserDBName’ (7) is 71% complete (approximately 53 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2017-06-09 00:11:11.54 spid6s Recovery of database ‘UserDBName’ (7) is 99% complete (approximately 1 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.
2017-06-09 00:11:11.54 spid6s 23 transactions rolled back in database ‘UserDBName’ (7:0). This is an informational message only. No user action is required.
2017-06-09 00:11:11.55 spid6s Recovery is writing a checkpoint in database ‘UserDBName’ (7). This is an informational message only. No user action is required.
2017-06-09 00:11:11.55 spid6s Recovery completed for database UserDBName (database ID 7) in 154 second(s) (analysis 1405 ms, redo 71933 ms, undo 80147 ms.) This is an informational message only. No user action is required.
2017-06-09 00:11:11.56 spid6s Parallel redo is shutdown for database ‘UserDBName’ with worker pool size [2].
2017-06-09 00:11:11.57 spid6s Recovery is complete. This is an informational message only. No user action is required.

Errors while reading log file
EXEC sp_readerrorlog

Msg 22004, Level 16, State 1, Line 0
The log file is not using Unicode format.
The log file is not using Unicode format. (.Net SqlClient Data Provider)
For help, click:
Server Name:
Error Number: 22004
Severity: 16
State: 1

Running out of Space

Msg 3202, Level 16, State 1, Line 5 Write on “/var/opt/mssql/data/HWageInfo_06132017.bak” failed: Insufficient bytes transferred. Common causes are backup configuration, insufficient disk space, or other problems with the storage subsystem such as corruption or hardware failure. Check errorlogs/application-logs for detailed messages and correct error conditions. Msg 3013, Level 16, State 1, Line 5 BACKUP DATABASE is terminating abnormally. Ensure you have enough disk space. Ensure you have enough disk space.


Types of Cloud Computing

Cloud computing is providing developers and IT departments with the ability to focus on what matters most and avoid undifferentiated work like procurement, maintenance, and capacity planning. As cloud computing has grown in popularity, several different models and deployment strategies have emerged to help meet specific needs of different users. Each type of cloud service, and deployment method, provides you with different levels of control, flexibility, and management. Understanding the differences between Infrastructure as a Service, Platform as a Service, and Software as a Service, as well as what deployment strategies you can use, can help you decide what set of services is right for your needs.


Cloud Computing Models

There are three main models for cloud computing. Each model represents a different part of the cloud computing stack.


Infrastructure as a Service (IaaS):

Infrastructure as a Service, sometimes abbreviated as IaaS, contains the basic building blocks for cloud IT and typically provide access to networking features, computers (virtual or on dedicated hardware), and data storage space. Infrastructure as a Service provides you with the highest level of flexibility and management control over your IT resources and is most similar to existing IT resources that many IT departments and developers are familiar with today.


Platform as a Service (PaaS):

Platforms as a service remove the need for organizations to manage the underlying infrastructure (usually hardware and operating systems) and allow you to focus on the deployment and management of your applications. This helps you be more efficient as you don’t need to worry about resource procurement, capacity planning, software maintenance, patching, or any of the other undifferentiated heavy lifting involved in running your application.


Software as a Service (SaaS):

Software as a Service provides you with a completed product that is run and managed by the service provider. In most cases, people referring to Software as a Service are referring to end-user applications. With a SaaS offering you do not have to think about how the service is maintained or how the underlying infrastructure is managed; you only need to think about how you will use that particular piece software. A common example of a SaaS application is web-based email where you can send and receive email without having to manage feature additions to the email product or maintaining the servers and operating systems that the email program is running on.

AWS RDS vs Azure vs Managed Services for SQL Server

We are evaluating pros and cons of different hosting solutions for SQL Server which best suits our business needs.

Our business needs

Our demand is very predictable seasonal demand. We are very small and can’t afford dedicated team for managing database infrastructure.( No DBA Team) Sky high expectation from Customers on availability and reliability for about 2 months in a year. Few minutes of downtown during peak period can cause havoc to our business . Fixed budget with very little wiggle room.   Our plan is to evaluate AWS SQL Server RDS, Azure RDS , Managed solutions from hosting provider. Evaluate each option in these categories.

  1. Performance and Reliability
  2. Ability to scale up during peak loads
  3. Cost ( Based on Network , Storage, Memory and CPU )
  4. Operations Efficiency
  5. Compliance
Infrastructure Requirements :

SQL Server Enterprise Edition since we use enterprise features AlwaysOn Availability group for High Availability Geo Replication or Multi Availability zone implementation for Cloud based databases Ability to route Read/Write workloads 128 Gig RAM – Minimum 1 – 2 TB Storage with 500 Gigs of SSD for TempDB Database and High Volume Tables Memory Optimized OLTP Support which needs SQL Server 2016 Edition Ability to handle ~ 30 K IOPS during peak load.

Amazon AWS SQL Server RDS

RDS Pricing Link  AWS SQL Server RDS Pricing

Enterprise Edition  Single-AZ Deployment
  Price Per Hour
Memory Optimized Instances – Current Generation
db.r3.2xlarge $5.810
db.r3.4xlarge $11.404
db.r3.8xlarge $19.271


Multi-AZ Deployment
  Price Per Hour
Memory Optimized Instances – Current Generation
db.r3.2xlarge $11.620
db.r3.4xlarge $22.808
db.r3.8xlarge $38.542

AWS SQL Server RDS Configurations On-Demand for SQL Server (License Included) Multi-AZ Deployment Region:  US East (N. Virginia) Memory Optimized Instances – Current Generation Price Per Hour RAM : 244 GB 10 Gigabit 32 vCPU 20,000 Provisioned IOPS  

db.r3.8xlarge 244 GB 2 x 320 SSD Intel Xeon E5-2670 v2 (Ivy Bridge) 32 vCPUs 10 Gigabit
Azure Pricing Calculator

Azure performance is measured in DTU. We have been collecting our performance metrics during load test. The following link provides lightweight utility to convert perfmon counters to Azure DTU’s.

Perfmon Counters to Azure DTU Conversion Utility Link:

Perfmon to Azure DTU calculator

Understanding DTUs Based on Microsoft definition :  

The Database Transaction Unit (DTU) is the unit of measure in SQL Database that represents the relative power of databases based on a real-world measure: the database transaction. We took a set of operations that are typical for an online transaction processing (OLTP) request, and then measured how many transactions could be completed per second under fully loaded.

Azure RDS Pricing Calculator Link
 : Azure RDS Pricing Calculator
Azure SQL Server Pricing Calculator
Azure Options for SQL Server
10010 GB2005~$149/mo
20020 GB4005~$298/mo
40039 GB4005~$595/mo
80078 GB4005~$1,198/mo
1200117 GB4005~$1,800/mo
100100 GB200100~$223/mo
200200 GB400100~$446/mo
400400 GB400100~$900/mo
800800 GB400100~$1,800/mo
12001.2 TB400100~$2,701/mo
125250 GB50125~$697/mo
250500 GB50250~$1,399/mo
500750 GB50500~$2,790/mo
1000750 GB501000~$5,580/mo
1500750 GB501000~$8,370/mo

Launching AWS SQL Server RDS

Business Needs:

Our goal is to identify whether Amazon SQL Server RDS Service provides elastic , highly available , Scalable and operationally efficient solution for our use case. We are evaluating options to migrate our read/write heavy production SQL Server database to amazon SQL Server RDS. We have pretty high throughput needs for few hours a day for few months in a year ,which is mission critical for our business success. Any downtown during peak usuage would be catastrophic for our business. We are evaluating pros and cons of moving to amazon RDS with provisioned IOPS.

Caveats of AWS RDS SQL Server:

These information we gathered while working with SQL Server RDS.

Feature Name Yes/No Description
SQL Server 2016 Support No Microsoft says SQL Server 2016 comes with very rich feature set and ton on OLTP Enhancements
Native Backup Restore Yes AWS RDS Released this feature a week ago , which makes moving databases across environments lot more easier.
Elastic IOPS No Storage and IOPS needs to be incremented linearly for higher performance. You can’t get higher IOPS without increasing storage.
Elastic Storage No Scaling Storage is not an option after launching an instance
RAID Support No We usually have RAID 10 for production workload and RDS doesn't have options to configure RAID
Point in Time Restore on Same Instance No You can't do Point in Time Restore on the existing database. You have to spin up new Instance
AlwaysOn Availability Groups No This provides ability to failover group of databases to your secondary instance
Mirroring Yes Mirroring is Deprecated feature and its replaced with AlwaysOn Availability Groups
Linked Servers from RDS No But Linked Servers to RDS is Allowed.
Service Broker No Comes handy for services

  No Admin privileges. You can’t execute normal sql server system stored procedures and you need to work with options groupand parameters group to modify configuration. You can’t execute sp_configure to change configurations.

AWS SLA Summary

AWS SLA Summary
AWS Service SLA Service Credit Pct SLA Resource
RDS 99.95% Less than 99.95% but equal to or greater than 99.0% 10 %
RDS 99.95% Less than 99.0% 25 %
S3 99.9% Equal to or greater than 99.0% but less than 99.9% 10 %
S3 99.9% Less than 99.0% 25 %
EC2 99.95% Less than 99.95% but equal to or greater than 99.0% 10 %
EC2 99.95% Less than 99.0% 30 %
Route 53 100 % 5 - 30 minutes in a Billing Cycle 1 day Service Credit
Route 53 100 % 31 minutes - 4 hours in a Billing Cycle 7 days Service Credit
Route 53 100 % More than 4 hours in a Billing Cycle 30 days Service Credit
SLA Percentages
Availability % Downtime/Month Downtime/Week Downtime/Day
90% ("one nine") 72 hours 16.8 hours 2.4 hours
95% 36 hours 8.4 hours 1.2 hours
97% 21.6 hours 5.04 hours 43.2 minutes
98% 14.4 hours 3.36 hours 28.8 minutes
99% ("two nines") 7.20 hours 1.68 hours 14.4 minutes
99.5% 3.60 hours 50.4 minutes 7.2 minutes
99.8% 86.23 minutes 20.16 minutes 2.88 minutes
99.9% ("three nines") 43.8 minutes 10.1 minutes 1.44 minutes
99.95% 21.56 minutes 5.04 minutes 43.2 seconds
99.99% ("four nines") 4.38 minutes 1.01 minutes 8.66 seconds

  Useful Links Cloud Provider Service Availability

AWS RDS First Touch Penalty

According to AWS Documentation The first time a DB instance is started and accesses an area of disk for the first time, the process can take longer than all subsequent accesses to the same disk area. This is known as the “first touch penalty.” Once an area of disk has incurred the first touch penalty, that area of disk does not incur the penalty again for the life of the instance, even if the DB instance is rebooted, restarted, or the DB instance class changes. Note that a DB instance created from a snapshot, a point-in-time restore, or a read replica is a new instance and does incur this first touch penalty.

Reference :
I captured number of cached pages per database on our SQL Server RDS Instance and rebooted the instance and captured cached pages again. Based on the documentation, the cached pages should be available and shouldn’t be affected by Reboot process. Its such a neat feature which makes life lot easier for DBA’s to respond to unexpected situations. But I noticed significant differences between the number of  cached pages before and after reboot.

Cached Pages Before and After SQL Server RDS Reboot
DBName Bef_Buf_Pages Size_MB Aft_Buf_Pages Size_MB
DatabaseOne 558482 4363 596 4
DatabaseTwo 1017 7 487 3
DatabaseThree 609 4 201 1
master 190 1 107 0
model 18 0 37 0
msdb 882 6 284 2
rdsadmin 1253 9 87 0
DatabaseFour 133 1 59 0
Resource Database 1877 14 319 2
tempdb 779280 6088 123 0

For DatabaseOne , Cached Pages dropped from 558482 to 596. I am not sure whether others have encountered the same issue. Not sure what to think of the First Touch Penalty Promise to keep the cache intact. Maybe its not true for SQL Server RDS. :-)

Remote Host Identification SSH Error

Remote Host Identification SSH Error

Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
Please contact your system administrator.
Add correct host key in /Users/username/.ssh/known_hosts to get rid of this message.
Offending RSA key in /Users/ username /.ssh/known_hosts:1
RSA host key for has changed and you have requested strict checking.
Host key verification failed.
Edit known_hosts and remove the offending host entry and SSH into Host Again

username -MBP:.ssh username $ vi known_hosts
username -MBP:.ssh username $ ssh -i Key.pem ec2-user@
The authenticity of host ‘ (’ can’t be established.
ECDSA key fingerprint is SHA256:2VcR+DiKNRwyQwZ2LqtZ6EHxQYv5MWMAsrrI.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added ‘’ (ECDSA) to the list of known hosts.
Last login: Fri Jul 29 23:08:35 2016 from
__| __|_ )
_| ( / Amazon Linux AMI
4 package(s) needed for security, out of 4 available
Run “sudo yum update” to apply all updates.

CloudIQ is a leading Cloud Consulting and Solutions firm that helps businesses solve today’s problems and plan the enterprise of tomorrow by integrating intelligent cloud solutions. We help you leverage the technologies that make your people more productive, your infrastructure more intelligent, and your business more profitable. 


626 120th Ave NE, B102, Bellevue,

WA, 98005.


Chennai One IT SEZ,

Module No:5-C, Phase ll, 2nd Floor, North Block, Pallavaram-Thoraipakkam 200 ft road, Thoraipakkam, Chennai – 600097

© 2020 CloudIQ Technologies. All rights reserved.

Get in touch

Please contact us using the form below


626 120th Ave NE, B102, Bellevue, WA, 98005.

+1 (206) 203-4151


Chennai One IT SEZ,

Module No:5-C, Phase ll, 2nd Floor, North Block, Pallavaram-Thoraipakkam 200 ft road, Thoraipakkam, Chennai – 600097