Cosmos DB is Microsoft Azure’s hugely successful service for helping its clients manage data on a global scale. This multi-model database service allows Azure platform users to elastically and independently scale throughput and storage across any number of Azure regions worldwide.

As Cosmos DB supports multiple data models, you can take advantage of fast, single-digit-millisecond data access using any of your favorite APIs, including SQL, MongoDB, Cassandra, Tables, or Gremlin. As a NoSQL database, it is easy to pick up for anyone with MongoDB experience, while its SQL API lets developers apply their existing SQL knowledge.

Why Cosmos DB?

For organizations looking to build a flexible and scalable database that is globally distributed, Cosmos DB is especially useful as it

  • provides a ready-to-use, highly elastic database service
  • guarantees low latency: under 10 milliseconds for reads and under 15 milliseconds for writes
  • offers customers a fast, seamless experience
  • offers 99.99% availability

Here are some tried and tested tips from our senior Azure expert on how to get the most out of Cosmos DB.

Data Modeling

Cosmos DB shines at modeling semi-structured or unstructured aggregates and dynamic entities. This makes it easy to model ever-changing entities, entities that don’t all share the same attributes, and hierarchical aggregates. To model for Cosmos, you need to think in terms of hierarchies and aggregates instead of entities and relations. NoSQL lets you store a thing that contains other things, which have things of their own, and ask for the whole hierarchy of things back. So you don’t have a person, rental addresses, and a relation between them. Instead, you have rental records, which aggregate, for each person, the rental addresses they’ve had.
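As a sketch, such an aggregate might look like the following JSON document (the field names here are purely illustrative, not from any real schema):

```json
{
  "id": "person-1001",
  "name": "Jane Doe",
  "rentalAddresses": [
    { "street": "12 Elm St", "city": "Springfield", "from": "2018-01", "to": "2020-06" },
    { "street": "7 Oak Ave", "city": "Shelbyville", "from": "2020-07", "to": null }
  ]
}
```

One read of this single document returns the person together with their entire rental history, with no join needed.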

The following NoSQL rules apply equally well to Cosmos DB:

  • NoSQL should be used as a complement to an existing or additional database.
  • Design with the PACELC theorem, an extension of the CAP theorem, in mind.
  • A data modeler should think in terms of queries instead of in terms of storage.

Connection Types

Cosmos DB can be connected to an application in two modes:

  • Gateway mode
  • Direct mode

Gateway mode is the default in the Microsoft.Azure.DocumentDB SDK; it uses HTTPS with a single endpoint. Direct mode is the default for the .NET V3 SDK and uses TCP and HTTPS for connectivity.

Gateway mode is better when your application runs within a corporate network with strict network rules, because its single endpoint can be allowed through the firewall for security. However, gateway mode performance is lower than direct mode.

There is also the option to connect through the RESTful programming model the service exposes. All CRUD operations can be done through REST calls. This method is recommended when a client app needs to access the database directly instead of going through an API you provide: it removes the overhead of writing an API wrapper around Cosmos DB and avoids the performance penalty of an extra hop.

In most scenarios, the recommended mode is direct mode, which provides better performance.

I used the popular volcano dataset to compare response times between the SDK and the RESTful model.

The following query was executed in both versions:

SELECT * FROM c WHERE c.Status = "Holocene"

Response details of the SDK

The query returned its results in 3,710 ms.

Response details of the RESTful model

The query returned its results in 5,810 ms.

If we build an API on top of this model, the response time added by our own API must be considered too. Using the RESTful model inside an API is therefore a trade-off against performance; use this mode when querying from the client directly.

Partitioning the DB

The logical partition key is the primary lever for Cosmos DB query performance. For example, suppose a database holds around 1,500 student records for a school. A simple search for a student named “Peter” scans all 1,500 entries, consuming a lot of throughput to get the result. Now split the data logically by the “Grade” the students belong to. A query for a student named “Peter” in “Grade 5” only has to search the 30 or 40 students in that grade out of the total 1,500, saving throughput and improving performance compared to the earlier approach.
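The effect can be illustrated with a plain-Python sketch (an in-memory stand-in for the database, not the Cosmos SDK): bucketing records by a partition key means a query that supplies the key only touches one bucket instead of scanning everything.

```python
from collections import defaultdict

# 1,500 students spread across grades, plus the one we will search for.
students = [{"name": f"Student{i}", "grade": i % 12} for i in range(1500)]
students.append({"name": "Peter", "grade": 5})

# Unpartitioned: every query scans all records.
full_scan = [s for s in students if s["name"] == "Peter"]

# Partitioned: bucket records by grade (the "partition key") up front...
by_grade = defaultdict(list)
for s in students:
    by_grade[s["grade"]].append(s)

# ...so a query that supplies the partition key scans only one bucket.
partition_scan = [s for s in by_grade[5] if s["name"] == "Peter"]

print(len(students), len(by_grade[5]))  # docs scanned: whole container vs. one grade
```

Both approaches return the same result; the partitioned lookup simply examines far fewer documents, which is exactly what saves RUs in Cosmos DB.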

The general guidelines are:

 a. Any key such as a city, state, or country property can be used as the partition key.
 b. No partitioning is required for containers up to 10 GB.
 c. The query should be provided with the partition key to be searched.
 d. The property selected as the partition key must be present in all the documents in the container.

I used the popular volcano dataset to test performance with and without a partition key.

1. First, I created a collection without a partition key. The performance for the given query is:

SELECT * FROM c WHERE c.Status = "Holocene"

Resultset

  Metric                                      Value
  Partition key range id                      0
  Retrieved document count                    200
  Retrieved document size (bytes)             100769
  Output document count                       200
  Output document size (bytes)                101069
  Index hit document count                    200
  Index lookup time (ms)                      0.21
  Document load time (ms)                     1.29
  Query engine execution time (ms)            0.33
  System function execution time (ms)         0
  User defined function execution time (ms)   0
  Document write time (ms)                    0.52

2. Then I recreated the same collection with “/country” as the partition key. The same query, with “Japan” as the partition-key value, produced the following results:

Resultset

  Metric                                      Value
  Partition key range id                      0
  Retrieved document count                    16
  Retrieved document size (bytes)             7887
  Output document count                       16
  Output document size (bytes)                7952
  Index hit document count                    16
  Index lookup time (ms)                      0.23
  Document load time (ms)                     0.17
  Query engine execution time (ms)            0.06
  System function execution time (ms)         0
  User defined function execution time (ms)   0
  Document write time (ms)                    0.01

Tune the Index

Indexing is always a top-priority item in the performance-tuning checklist. Indexing is an internal job that keeps track of metadata about the data, which helps find the result set for a query. By default, all the properties of a Cosmos container are indexed. But this is usually unnecessary overhead for the DB, and keeping track of so much data consumes a significant number of RUs, which is not cost-effective. The better approach is to exclude all paths from indexing and include only the paths that the application actually queries on.
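As a sketch, a container-level indexing policy along these lines excludes everything by default and indexes only a path that queries actually filter on (the /Status path here is taken from the example query used earlier; adapt it to your own query patterns):

```json
{
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/Status/?" }
  ],
  "excludedPaths": [
    { "path": "/*" }
  ]
}
```

The `/?` suffix indexes the scalar value at that path, while `/*` in excludedPaths covers every other property in the document.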

  Metric                                  With Default Indexing   With Custom Indexing
  RUs consumed                            3146.15                 19.83
  Output doc count                        100                     100
  Doc load time (ms)                      646.51                  2.32
  Query engine execution time (ms)        434.26                  4.96
  System function execution time (ms)     57.03                   2.41

Paging

By default, a query execution returns up to 100 documents. We can increase this by setting the “maxItemCount” value, up to a maximum of 1,000 documents per page. Fetching 1,000 documents at a time rarely makes sense, though; to improve performance and show a crisp result set to the user, keep “maxItemCount” small. Unlike SQL databases, pagination is the default behavior in Cosmos DB: if the result set is larger than the page size, Cosmos returns a “Continuation Token”, an opaque value that identifies the query and the position of the next page. The front end can use this token to implement its pagination logic. By reducing the number of documents per response, we save throughput, reduce network traffic, and increase performance.
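The mechanics can be sketched in plain Python (an in-memory stand-in for the SDK, with a simple integer offset playing the role of the opaque continuation token):

```python
DOCS = [{"id": i} for i in range(250)]  # stand-in for a query's full result set

def query_page(max_item_count=100, continuation_token=None):
    """Return one page of results plus a token for the next page (None at the end)."""
    start = continuation_token or 0
    page = DOCS[start:start + max_item_count]
    has_more = start + max_item_count < len(DOCS)
    next_token = start + max_item_count if has_more else None
    return page, next_token

# Drain the query page by page, the way a front end would.
pages, token = [], None
while True:
    page, token = query_page(max_item_count=100, continuation_token=token)
    pages.append(page)
    if token is None:
        break

print([len(p) for p in pages])  # 250 docs arrive in pages of at most 100
```

Each response carries only one page, so a smaller `max_item_count` means less data per round trip; the caller keeps requesting pages until no token comes back.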

Throughput Management

RUs, or Request Units, are a term you constantly encounter when using Cosmos DB. When you read a document from a container or write a document to it, you trade RUs with Cosmos for your operation. RUs are like currency: without money, you can’t buy anything, and without RUs, you can’t query anything. Just as you can only buy items that cost no more than the money in hand, you can only run queries whose cost fits within the RUs you have.

If you have a large dataset and a query needs to traverse deep into the collection, you need enough RUs. Every property added to the index consumes RUs on writes, so if all properties are indexed, your RU spend is high and you are left with fewer RUs for querying. So, always add to the index only the properties that are needed while querying.

Index properly, save RUs, and spend them on querying. For example, suppose a container holds 100K documents and, with default indexing, a query exhausts 1,000 RUs after scanning only 50K documents; the query never reaches the remaining 50K documents, and they never appear in the result set. Appropriately indexed, the same query might consume only 400 RUs to cover all 100K documents.

Startup latency

The very first query is always a bit slower because of the time it takes to warm up the connection. To overcome this latency, it is best practice in SDK 2 to call “OpenAsync()” once, up front, when creating the connection.

await client.OpenAsync();

Singleton Connection

The best approach is to create one connection to the DB and keep it alive across all instances of the application. Periodically polling the DB also keeps the connection alive. This reduces DB connectivity latency.
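One common way to get a single long-lived client is to memoize its constructor. Here is a sketch using a stand-in class (in a real application, the client type would come from your Cosmos SDK, and the endpoint and key shown are placeholders):

```python
from functools import lru_cache

class CosmosClientStandIn:
    """Placeholder for the real SDK client; constructing it is the expensive step."""
    def __init__(self, endpoint, key):
        self.endpoint = endpoint
        self.key = key

@lru_cache(maxsize=1)
def get_client():
    # Built once on first call, then the same instance is reused
    # by every caller for the lifetime of the process.
    return CosmosClientStandIn("https://example-account.documents.azure.com", "<key>")

assert get_client() is get_client()  # every call returns the same instance
```

Because the cached client (and its underlying connections) survives between requests, later queries skip the connection warm-up entirely.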

Regions

Make sure the application and Cosmos DB are deployed in the same Azure region; the lowest possible latency is achieved when the calling application runs in the same region as the provisioned Azure Cosmos DB endpoint.

Programming Best Practices
  • Always use the latest SDK version.
  • Use the Streaming API (in SDK 3), which can receive and return data without serializing it. This is helpful when your API is just a relay and performs no logic on the data.
  • Tune your queries.
  • Implement retry logic with a reasonable wait time to handle throttling during busy periods.

By carefully analyzing all the above factors, we can improve the Cosmos DB query performance substantially.



Amazon Web Services (AWS) is a secure cloud services platform, offering compute power, database storage, content delivery, and other functionalities to help businesses scale and grow.

It gives organizations a secure and robust platform to develop custom cloud-based solutions, and it has several unique features that make it one of the most reliable and flexible cloud platforms, such as:

  • Mobile-friendly access through AWS Mobile Hub and AWS Mobile SDK
  • Fully managed, purpose-built databases
  • Serverless cloud functions
  • A range of affordable and scalable storage options
  • Unbeatable security and compliance

Following are some core services offered by AWS:

AWS Core services
  1. An EC2 instance is a virtual server in Amazon’s Elastic Compute Cloud (EC2) for running applications on the AWS infrastructure.
  2. Amazon Elastic Block Store (EBS) is a cloud-based block storage system provided by AWS that is best used for storing persistent data.
  3. Amazon Virtual Private Cloud (Amazon VPC) enables us to launch AWS resources into a virtual network that we have defined. This virtual network closely resembles a traditional network that we would operate in our own data center, with the benefits of using the scalable infrastructure of AWS.
  4. Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network.
  5. AWS security groups (SGs) are associated with EC2 instances and provide security at the protocol and port access level. Each security group — working much the same way as a firewall — contains a set of rules that filter traffic coming into and out of an EC2 instance.

Let us look more deeply at one of AWS’s core services – AWS CloudFormation – that is key for managing workloads on AWS.

1.   CloudFormation

AWS CloudFormation is a service that helps us model and set up our Amazon Web Services resources so that we can spend less time managing those resources and more time focusing on our applications that run in AWS.  We create a template that describes all the AWS resources that we want (like Amazon EC2 instances or S3 buckets), and AWS CloudFormation takes care of provisioning and configuring those resources for us. We don’t need to individually create and configure AWS resources and figure out what’s dependent on what; AWS CloudFormation handles all of that.

A stack is a collection of AWS resources that you can manage as a single unit. In other words, we can create, update, or delete a collection of resources by creating, updating, or deleting stacks. All the resources in a stack are defined by the stack’s AWS CloudFormation template.

2.   CloudFormation template

CloudFormation templates can be written in either JSON or YAML.  The structure of the template in YAML is given below:

---
AWSTemplateFormatVersion: "version date"

Description:
  String
Metadata:
  template metadata
Parameters:
  set of parameters
Mappings:
  set of mappings
Conditions:
  set of conditions
Resources:
  set of resources
Outputs:
  set of outputs

In the above yaml file,

  1. AWSTemplateFormatVersion – The AWS CloudFormation template version that the template conforms to.
  2. Description – A text string that describes the template.
  3. Metadata – Objects that provide additional information about the template.
  4. Parameters – Values to pass to our template at runtime (when we create or update a stack). We can refer to parameters from the Resources and Outputs sections of the template.
  5. Mappings – A mapping of keys and associated values that we can use to specify conditional parameter values, like a lookup table. We can match a key to a corresponding value by using the Fn::FindInMap intrinsic function in the Resources and Outputs sections.
  6. Conditions – Conditions that control whether certain resources are created or whether certain resource properties are assigned a value during stack creation or update. For example, we can conditionally create a resource that depends on whether the stack is for a production or test environment.
  7. Resources – Specifies the stack resources and their properties, such as an Amazon Elastic Compute Cloud instance or an Amazon Simple Storage Service bucket.  We can refer to resources in the Resources and Outputs sections of the template.
  8. Outputs – Describes the values that are returned whenever we view our stack’s properties. For example, we can declare an output for an S3 bucket name and then call the AWS cloudformation describe-stacks AWS CLI command to view the name.

Resources is the only required section in the CloudFormation template.  All other sections are optional.

3.   CloudFormation template to create S3 bucket

S3template.yml

Resources:
  HelloBucket:
    Type: AWS::S3::Bucket

In the AWS Console, go to CloudFormation and click on Create Stack.

Upload the template file we created. It will be stored in an S3 location.

Click Next and give the stack a name.

Click Next and then “Create stack”. After a few minutes, you can see that the stack creation is complete.

On the Resources tab, you can see that the S3 bucket has been created with the name “s3-stack-hellobucket-buhpx7oucrgn”. AWS generated this name because we didn’t specify the BucketName property in the YAML.

Note that deleting the stack will delete the S3 bucket which it had created.

4.   Intrinsic functions

AWS CloudFormation provides several built-in functions that help you manage your stacks.

In the below example, we create two resources – a Security Group and an EC2 Instance, which uses this Security Group.  We can refer to the Security Group resource using the !Ref function.

Ec2template.yml

Resources:
  Ec2Instance:
    Type: 'AWS::EC2::Instance'
    Properties:
      SecurityGroups:
        - !Ref InstanceSecurityGroup
      KeyName: mykey
      ImageId: ''
  InstanceSecurityGroup:
    Type: 'AWS::EC2::SecurityGroup'
    Properties:
      GroupDescription: Enable SSH access via port 22
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: '22'
          ToPort: '22'
          CidrIp: 0.0.0.0/0

Some other commonly used intrinsic functions are

  1. Fn::GetAtt – returns the value of an attribute from a resource in the template.
  2. Fn::Join – appends a set of values into a single value, separated by the specified delimiter. If the delimiter is an empty string, the values are concatenated with no delimiter.
  3. Fn::Sub – substitutes variables in an input string with values that you specify. In our templates, we can use this function to construct commands or outputs that include values that aren’t available until we create or update a stack.
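As a sketch, the three functions above might appear together in an Outputs section like this (the resource name Ec2Instance follows the earlier example; the output names are illustrative):

```yaml
Outputs:
  InstanceAZ:
    # Fn::GetAtt (short form !GetAtt) reads an attribute from a resource.
    Value: !GetAtt Ec2Instance.AvailabilityZone
  InstanceUrl:
    # Fn::Sub substitutes ${...} variables into a string.
    Value: !Sub "http://${Ec2Instance.PublicDnsName}/"
  StackTag:
    # Fn::Join concatenates values with the given delimiter.
    Value: !Join ["-", [!Ref "AWS::StackName", "web"]]
```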
5.   Parameters

Parameters enable us to input custom values to our template each time we create or update a stack.

TemplateWithParameters.yaml

Parameters: 
  InstanceTypeParameter: 
    Type: String
    Default: t2.micro
    AllowedValues: 
      - t2.micro
      - m1.small
      - m1.large
    Description: Enter t2.micro, m1.small, or m1.large. Default is t2.micro.
Resources:
  Ec2Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType:
        Ref: InstanceTypeParameter
      ImageId: ami-0ff8a91507f77f867
6.   Pseudo Parameters

Pseudo parameters are parameters that are predefined by AWS CloudFormation. We do not declare them in our template. Use them the same way as we would a parameter as the argument for the Ref function.

Commonly used pseudo parameters:

  1. AWS::Region – Returns a string representing the AWS Region in which the encompassing resource is being created, such as us-west-2
  2. AWS::StackName – Returns the name of the stack as specified during cloudformation create-stack, such as teststack
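A short sketch of both pseudo parameters in use (the output name is illustrative):

```yaml
Outputs:
  DeploymentInfo:
    # AWS::StackName and AWS::Region are resolved by CloudFormation at deploy time;
    # nothing needs to be declared in the Parameters section.
    Value: !Sub "Stack ${AWS::StackName} is running in ${AWS::Region}"
```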
7.   Mappings

The optional Mappings section matches a key to a corresponding set of named values. For example, if you want to set values based on a region, we can create a mapping that uses the region name as a key and contains the values we want to specify for each specific region. We use the Fn::FindInMap intrinsic function to retrieve values in a map.

We cannot include parameters, pseudo parameters, or intrinsic functions in the Mappings section.

TemplateWithMappings.yaml

AWSTemplateFormatVersion: "2010-09-09"
Mappings: 
  RegionMap: 
    us-east-1:
      HVM64: ami-0ff8a91507f77f867
      HVMG2: ami-0a584ac55a7631c0c
    us-west-1:
      HVM64: ami-0bdb828fd58c52235
      HVMG2: ami-066ee5fd4a9ef77f1
    eu-west-1:
      HVM64: ami-047bb4163c506cd98
      HVMG2: ami-0a7c483d527806435
    ap-northeast-1:
      HVM64: ami-06cd52961ce9f0d85
      HVMG2: ami-053cdd503598e4a9d
    ap-southeast-1:
      HVM64: ami-08569b978cc4dfa10
      HVMG2: ami-0be9df32ae9f92309
Resources: 
  myEC2Instance: 
    Type: "AWS::EC2::Instance"
    Properties: 
      ImageId: !FindInMap [RegionMap, !Ref "AWS::Region", HVM64]
      InstanceType: m1.small
8.   Outputs

The optional Outputs section declares output values that we can import into other stacks (to create cross-stack references), return in response to describe-stack calls, or view in the AWS CloudFormation console. For example, we can output the S3 bucket name for a stack to make the bucket easier to find.

In the below example, the output named StackVPC returns the ID of a VPC, and then exports the value for cross-stack referencing with the name VPCID appended to the stack’s name.

Outputs:
  StackVPC:
    Description: The ID of the VPC
    Value: !Ref MyVPC
    Export:
      Name: !Sub "${AWS::StackName}-VPCID"

As organizations start to create and maintain clusters in AKS (Azure Kubernetes Service), they also need to use cloud-based identity and access management service to access other Azure cloud resources and services. The Azure Active Directory (AAD) pod identity is a service that gives users this control by assigning identities to individual pods.  

Without these controls, accounts may get access to resources and services they don’t require. And it can also become hard for IT teams to track which set of credentials were used to make changes.

Azure AD Pod identity is just one small part of the container and Kubernetes management process and as you delve deeper, you will realize the true power that Kubernetes and Containers bring to your DevOps ecosystem.

Here is a more detailed look at how to use AAD pod identity for connecting pods in AKS cluster with Azure Key Vault.

Pod Identity

Integrate your key management system with Kubernetes using pod identity. Secrets, certificates, and keys in a key management system become a volume accessible to pods. The volume is mounted into the pod, and its data is available directly in the container file system for your application.

On an existing AKS cluster:

Deploy Key Vault FlexVolume to your AKS cluster with this command:

  • kubectl create -f https://raw.githubusercontent.com/Azure/kubernetes-keyvault-flexvol/master/deployment/kv-flexvol-installer.yaml
1. Create the Deployment

Run this command to create the aad-pod-identity deployment on an RBAC-enabled cluster:

  • kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/master/deploy/infra/deployment-rbac.yaml

Or run this command to deploy to a non-RBAC cluster:

  • kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/master/deploy/infra/deployment.yaml
2. Create an Azure Identity

Create an Azure managed identity.

Command: az identity create -g <ResourceGroupNameOfAKSService> -n aks-pod-identity

Output:

{
  "clientId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "clientSecretUrl": "https://control-westus.identity.azure.net/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/aks_dev_rg_wu/providers/Microsoft.ManagedIdentity/userAssignedIdentities/aks-pod-identity/credentials?tid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&oid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx&aid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/aks_dev_rg_wu/providers/Microsoft.ManagedIdentity/userAssignedIdentities/aks-pod-identity",
  "location": "westus",
  "name": "aks-pod-identity",
  "principalId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "resourceGroup": "au10515_aks_dev_rg_wu",
  "tags": {},
  "tenantId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "type": "Microsoft.ManagedIdentity/userAssignedIdentities"
}
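The clientId and principalId from this output are needed in the commands that follow. A small Python sketch (a hypothetical helper with made-up sample values, not part of the az CLI) showing which fields feed which later step:

```python
import json

# Sample shape of `az identity create` output (values are made up)
raw = '''{
  "clientId": "11111111-2222-3333-4444-555555555555",
  "principalId": "66666666-7777-8888-9999-000000000000",
  "tenantId": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
  "id": "/subscriptions/sub/resourcegroups/rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/aks-pod-identity"
}'''

identity = json.loads(raw)
# clientId  -> AzureIdentity manifest and `az keyvault set-policy --spn`
# principalId -> assignee for the role assignments
# id -> scope for the "Managed Identity Operator" assignment
print(identity["clientId"], identity["principalId"])
```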

Assign Cluster SPN Role

Command for getting the AKSServicePrincipalID: az aks show -g <resourcegroup> -n <name> --query servicePrincipalProfile.clientId -o tsv

Command: az role assignment create --role "Managed Identity Operator" --assignee <AKSServicePrincipalId> --scope <ID of Managed identity>

Assign Azure Identity Roles

Command: az role assignment create --role Reader --assignee <Principal ID of Managed identity> --scope <KeyVault Resource ID>

Set policy to access keys in your Key Vault

Command: az keyvault set-policy -n dev-kv --key-permissions get --spn <Client ID of Managed identity>

Set policy to access secrets in your Key Vault

Command: az keyvault set-policy -n dev-kv --secret-permissions get --spn <Client ID of Managed identity>

Set policy to access certs in your Key Vault

Command: az keyvault set-policy -n dev-kv --certificate-permissions get --spn <Client ID of Managed identity>

3. Install the Azure Identity

Save this Kubernetes manifest to a file named aadpodidentity.yaml:

apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentity
metadata:
  name: <a-idname>
spec:
  type: 0
  ResourceID: /subscriptions/<subid>/resourcegroups/<resourcegroup>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<name>
  ClientID: <clientId>

Replace the placeholders with your user identity values. Set type: 0 for user-assigned MSI or type: 1 for Service Principal.
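The manifest fields map directly onto the identity created earlier. A quick Python sketch (an illustration, not an official tool) that builds the same structure as a dict, making the type-0 versus type-1 distinction explicit:

```python
def azure_identity_manifest(name, subscription_id, resource_group,
                            identity_name, client_id, user_assigned=True):
    """Build an AzureIdentity manifest; type 0 = user-assigned MSI, type 1 = Service Principal."""
    resource_id = (f"/subscriptions/{subscription_id}/resourcegroups/{resource_group}"
                   f"/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity_name}")
    return {
        "apiVersion": "aadpodidentity.k8s.io/v1",
        "kind": "AzureIdentity",
        "metadata": {"name": name},
        "spec": {"type": 0 if user_assigned else 1,
                 "ResourceID": resource_id,
                 "ClientID": client_id},
    }

m = azure_identity_manifest("demo-identity", "sub-id", "rg", "aks-pod-identity", "client-id")
print(m["spec"]["type"])  # 0
```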

Finally, save your changes to the file, then create the AzureIdentity resource in your cluster:

kubectl apply -f aadpodidentity.yaml

4. Install the Azure Identity Binding

Save this Kubernetes manifest to a file named aadpodidentitybinding.yaml:

apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentityBinding
metadata:
  name: demo1-azure-identity-binding
spec:
  AzureIdentity: <a-idname>
  Selector: <label value to match>

Replace the placeholders with your values. Ensure that the AzureIdentity name matches the one in aadpodidentity.yaml.

Finally, save your changes to the file, then create the AzureIdentityBinding resource in your cluster:

kubectl apply -f aadpodidentitybinding.yaml

Sample nginx pod for accessing a Key Vault secret using Pod Identity

Save this sample nginx pod manifest to a file named nginx-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: nginx-flex-kv-podid
    aadpodidbinding: 
  name: nginx-flex-kv-podid
spec:
  containers:
  - name: nginx-flex-kv-podid
    image: nginx
    volumeMounts:
    - name: test
      mountPath: /kvmnt
      readOnly: true
  volumes:
  - name: test
    flexVolume:
      driver: "azure/kv"
      options:
        usepodidentity: "true"         # [OPTIONAL] if not provided, will default to "false"
        keyvaultname: ""               # the name of the KeyVault
        keyvaultobjectnames: ""        # list of KeyVault object names (semi-colon separated)
        keyvaultobjecttypes: secret    # list of KeyVault object types: secret, key or cert (semi-colon separated)
        keyvaultobjectversions: ""     # [OPTIONAL] list of KeyVault object versions (semi-colon separated), will get latest if empty
        resourcegroup: ""              # the resource group of the KeyVault
        subscriptionid: ""             # the subscription ID of the KeyVault
        tenantid: ""            # the tenant ID of the KeyVault
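Note that keyvaultobjectnames and keyvaultobjecttypes are parallel semicolon-separated lists: the first name pairs with the first type, and so on. A short Python sketch of how the pairing conceptually works (illustrative only; the secret names here are hypothetical):

```python
def pair_keyvault_objects(names: str, types: str):
    """Pair semicolon-separated KeyVault object names with their object types."""
    name_list = [n for n in names.split(";") if n]
    type_list = [t for t in types.split(";") if t]
    if len(name_list) != len(type_list):
        raise ValueError("keyvaultobjectnames and keyvaultobjecttypes must align")
    return list(zip(name_list, type_list))

print(pair_keyvault_objects("db-password;api-key", "secret;secret"))
# [('db-password', 'secret'), ('api-key', 'secret')]
```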

Points to remember when implementing Azure AD Pod Identity in a cluster
  • Azure AD Pod Identity is currently bound to the default namespace. Deploying an Azure Identity and its binding to other namespaces will not work!
  • Pods from all namespaces can be executed in the context of an Azure Identity deployed to the default namespace (related to the point above)
  • Any pod developer can add the aadpodidbinding label to his/her pod and use your Azure Identity
  • Azure Identity Binding does not use the default Kubernetes label selection mechanism

There is little doubt that data will guide the next generation of business strategy and will bring new efficiencies across industries. But for that to happen, organizations must be able to extract insights from their data.

Qubole is an ideal platform to activate end-to-end data processing in organizations. It combines all types of data – structured, unstructured, and legacy offline data – into a single data pipeline and turns it into rich insights by adding AI, ML, and deep analytics tools to the mix.

It scales seamlessly to accommodate more users and new data without adding administrative overheads and lowers cloud costs significantly. Simply put, Qubole is a platform that puts big data on the cloud to power business decisions based on real-time analytics.

At CloudIQ Technologies, our data experts have deployed Qubole’s cloud-native data systems for many of our clients, and the results have been outstanding. Here is an article from one of our data engineers that provides an overview of how to set up Qubole to use an AWS environment and create and run Spark clusters.

AWS Access Configuration:

In order for Qubole to create and run a cluster, we have to grant Qubole access to our AWS environment. We can grant access based on a key or a role. We will use role-based authentication.

Step 1: Login to Qubole

Step 2: Click on the menu at the top left corner and select “Account Settings” under the Control Panel.

Step 3: Scroll down to Access settings

Step 4: Switch Access mode to “IAM Role”

Step 5: Copy the Trusted Principal AWS Account ID and External ID

Step 6: Use the copied values to create a QuboleAccessRole in the AWS account (using the cloudformation template)

Step 7: Copy the Role ARN of the QuboleAccessRole and enter it in the Role ARN field

Step 8: Enter the S3 bucket location where the Qubole metadata will be stored in the “Default Location” field.

Step 9: Click Save

Spark Cluster
Create a cluster

The below steps will help create a new Spark cluster in Qubole.

Step 1: Click on the top-left dropdown menu and select “Cluster”

Step 2: Click on “+New” button

Step 3: Select “Spark” and click “Next”

Step 4: Provide a name for the cluster in the “Cluster Labels” field

Step 5: Select the version of Spark to run, Master Node Type, Worker Node Type, Minimum and Maximum nodes

Step 6: Select Region as us-west-2

Step 7: Select Availability Zone as us-west-2a

Step 8: Click “Next”

Step 9: In the Composition screen, you can select the type of nodes that will be spun up.

Step 10: In the Advanced Configuration screen, proceed to EC2 settings

Step 11: Enter “QuboleDualIAMRole” in the “Instance Profile” field

Step 12: Select “AppVPC” in VPC field

Step 13: Select “AppPrivateSNA” under Subnet field

Step 14: Enter the ip address of the Bastion node in the “Bastion Node” field

Step 15: Scroll to the bottom and enter “AppQuboleClusterSG” (security group for the cluster) in the “Persistent Security Group” field

Step 16: Click on “Create”

Run a cluster

To start a cluster, click on the dropdown menu in the top left corner and select “Cluster”. Now click on the “Start” button next to the cluster that needs to be started. A cluster is also started automatically when a job is submitted to it.

Submit a job

One of the simplest ways to run a Spark job is to submit it through the workbench. You can navigate to the workbench from the drop-down menu at the top left corner. In the workbench, click on “+Create New”. Then select “Spark” next to the title of the job. Once you select Spark, an optional drop-down appears where you can choose “Python”. In the last drop-down menu, select the Spark cluster where you want to execute the job. If this cluster is not active, it will be activated automatically. Enter your Spark job in the window below. When complete, click on “Run” to run the job.
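As a concrete example, here is the classic Monte Carlo estimate of pi. It is written in plain Python below so it runs anywhere; in the workbench you would typically distribute the sampling step across the cluster (for example with `sc.parallelize`):

```python
import random

def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(100_000))  # approximately 3.14
```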

Airflow Cluster

The Airflow scheduler can be used to run various jobs in a sequence. Let’s take a look at configuring an Airflow cluster in Qubole.

Setting up DataStore

The first step in creating an Airflow cluster is to set up a datastore. Make sure that the MySQL DB is up and running and contains a database for Airflow. Now, select “Explore” from the dropdown menu at the top left corner. On the left-hand menu, drop down the selection menu showing “Qubole Hive” and select “+Add Data Store”.

In the new screen, provide a name for the data store. Select “MySQL” as the database type. Enter the database name for the airflow database (The database should already be created in MySQL). Enter the host address as “hmklabsbienvironment.cq8z1kp7ikd8.us-west-2.rds.amazonaws.com”. Enter the username and password. Make sure to select “Skip Validation”. Since the MySQL db is in a private VPC, Qubole does not have access to it and will not be able to validate.

Configuring Airflow Cluster

Step 1: Click on the top left drop-down menu and select “Cluster”

Step 2: Click on “+New” button

Step 3: Select “Airflow” in the type of cluster and click “Next”

Step 4: Provide a cluster name, then select the Airflow version and node type.

Step 5: Select the datastore which points to the MySQL

Step 6: Select the us-west-2 as the Region

Step 7: Select us-west-2a as the Availability zone

Step 8: Click next to go to Advanced Configuration

Step 9: Select AppVPC as the VPC

Step 10: Select AppPrivateSNA as the Subnet

Step 11: Enter the Bastion Node information

Step 12: Scroll to the bottom and enter AppQuboleClusterSG as the Persistent Security Groups

Step 13: Click on “Create”

Once the cluster is created, you can run it by clicking on “Start” next to the cluster’s name.

Containers are being embraced at a breakneck speed – developers love them, and they are great for business because they deliver speed and scale in a cost-efficient manner. So much so, that container technology seems to be overtaking VMs – especially with container orchestration tools like Kubernetes, making them simpler to manage and extracting higher efficiency and speed from them.

Kubernetes cluster architecture

Kubernetes provides an open-source platform for simplifying multi-cloud environments. The disparities between different cloud providers are a roadblock for developers and Kubernetes helps by streamlining and standardizing container-based applications.

Kubernetes clusters are the architectural foundation that drives this simplicity and makes it possible for users to get the functionality they need at scale and with ease. Here are some of the functionalities of Kubernetes –

  • Kubernetes distributes workload efficiently across all open resources and reduces traffic spikes or outages.
  • It simplifies application deployment regardless of the size of the cluster
  • It automates horizontal scaling
  • It monitors against app failure with constant node and container health checks and performs self-healing and replication to resolve any failure issues.

All this makes the work of developers faster and frees up their time and attention from trivial repetitive tasks allowing them to build applications better and faster! For the organization, the benefits are three-fold – higher productivity, better products and, finally, cost efficiencies.

Let’s move to the specifics now and find out how to set up a Kubernetes Cluster on the RHEL 7.6 operating system on AWS.

Prerequisites:
  • You should have a VPC available.
  • A subnet within that VPC, into which you will place your cluster.
  • You should have Security Groups for the Control Plane Load Balancer and the Nodes created.
  • You should have created the Control Plane Load Balancer.
  • A bastion host, or jump box, with a public IP within your VPC from which you can secure shell into your VMs.
  • A pem file for your AWS region, which you will use to secure shell into your VMs.
Creating the IAM Roles

You will need to create 2 IAM roles: one for the Master(s), and one for the worker nodes.

Master Role

To create an IAM role, go to the IAM (Identity and Access Management) page in the AWS console. On the left-hand menu, click ‘Roles’. Then click ‘Create Role’.

Select the service that will use this role. By default, it is EC2, which is what we want. Then click “Next: Permissions”.

Click ‘Create Policy’. The Create Policy page opens in a new tab.

Click on the ‘JSON tab’. Then paste this json into it:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:*",
                "elasticloadbalancing:*",
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:BatchGetImage",
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:UpdateAutoScalingGroup"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

This json defines the permissions that your master nodes will need.

Click ‘Review Policy’. Then give your policy a name and a description.

Click ‘Create Policy’ and your policy is created!

Back on the Create Role page, refresh your policy list, and filter for the policy you just created. Select it and click ‘Next: Tags’.

You should add 2 tags: Name, with a name for your role, and KubernetesCluster, with the name of the cluster that you are going to create. Click ‘Next: Review’.

Give your role a name and a description. Click ‘Create Role’ and your role is created!

Node Role

For the node role, you will follow similar steps, except that you will use the following json:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:Describe*",
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:BatchGetImage"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
Provisioning the VMs
Provisioning the Master

We will use RHEL 7.6 for our cluster because RHEL 8.0 uses iptables v1.8, and kube-proxy does not work well with iptables v1.8. However, kube-proxy works with iptables v1.4, which is installed on RHEL 7.6. We will use the x86_64 architecture.

Log into the AWS console. Go to the EC2 home page and click ‘Launch Instance’. We will search under Community AMIs for our image.

Click ‘Select’. Then choose your instance type. A t2.medium should suffice for a Kubernetes master. Click ‘Next: Configure Instance Details’.

We will use only 1 instance. For an HA cluster, you will want more. Select your network and your subnet. For the purposes of this tutorial, we will enable auto-assigning a public IP.  In production, you would probably not want your master to have a public IP.  But you would need to make sure that your subnet is configured correctly with the appropriate NAT and route tables. Select the IAM role you created. Then click ‘Next: Add Storage’.

The default, 10 GB of storage, should be adequate for a Kubernetes master. Click ‘Next: Add Tags’.

We will add 3 tags: Name, with the name of your master; KubernetesCluster, with the name of your cluster; and kubernetes.io/cluster/<name of your cluster>, with the value owned. Click ‘Next: Configure Security Group’.

Select “Select an existing security group” and select the security group you created for your Kubernetes nodes. Click ‘Review and Launch’.

Click ‘Launch’. Select “Choose an Existing Key Pair”. Select the key pair from the drop-down. Check the “I acknowledge” box. You should have the private key file saved on the machine from which you plan to secure shell into your master; otherwise you will not be able to ssh into the master! Click ‘Launch Instances’ and your master is created.

Provisioning the Auto Scaling Group

Your worker nodes should be behind an Auto Scaling group. Under Auto Scaling in the left-hand menu of the AWS console, click ‘Auto Scaling Groups’. Click ‘Create Auto Scaling Group’. On the next page, click ‘Get Started’.

Under “Choose AMI”, select RHEL 7.6 x86_64 under Community AMIs, as you did for the master.

When choosing your instance type, be mindful of what applications you want to run on your Kubernetes cluster and their resource needs. Be sure to provision a size with sufficient CPUs and memory.

Under “Configure Details”, give your autoscaling group a name and select the IAM role you configured for your Kubernetes nodes.

When selecting your storage size, be mindful of the storage requirements of your applications that you want to run on Kubernetes. A database application, for example, would need plenty of storage.

Select the security group that you configured for Kubernetes nodes.

Click ‘Create Launch Configuration’. Then select your key pair as you did for the master. Click ‘Create Launch Configuration’ and you are taken to the ‘Configure Auto Scaling Group Details’ page. Give your group a name. Select a group size. For our purpose, 2 nodes will suffice. Select the same subnet on which you placed your master. Click ‘Next: Configure Scaling policies’.

For this tutorial, we will select “Keep this group at its initial size”. For a production cluster with variability in usage, you may want to use scaling policies to adjust the capacity of the group. Click ‘Next: Configure Notifications’.

We will not add any notifications in this tutorial. Click ‘Next: Configure Tags’.

We will add 3 tags: Name, with the name of your nodes; KubernetesCluster, with the name of your cluster; and kubernetes.io/cluster/<your cluster name>, with the value owned. Click ‘Review’.

Click Create Auto Scaling Group and your auto-scaling group is created!

Installing Kubernetes

Specific steps need to be followed to install Kubernetes. Run the following commands as sudo on your master(s) and worker nodes.

 # add docker repo

yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

# install container-selinux

 yum install -y http://mirror.centos.org/centos/7/extras/x86_64/Packages/container-selinux-2.107-1.el7_6.noarch.rpm

# install docker

yum install docker-ce

# enable docker

systemctl enable --now docker

# create Kubernetes repo. The 2 urls after gpgkey have to be on 1 line.

cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF

# configure selinux

setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config

# install kubelet, kubeadm, kubectl, and Kubernetes-cni. We found that version 1.13.2 works well with RHEL 7.6.

yum install -y kubelet-1.13.2 kubeadm-1.13.2 kubectl-1.13.2 kubernetes-cni-0.6.0-0.x86_64 --disableexcludes=kubernetes --nogpgcheck

# enable kubelet

systemctl enable --now kubelet

# Run the following command as a regular user.

sudo usermod -a -G docker $USER
Creating the Kubernetes Cluster

First, add your master(s) to the control plane load balancer as follows. Log into the AWS console, EC2 service, and on the left-hand menu, under Load Balancing, click ‘Load Balancers’. Select your load balancer and click the Instances tab in the bottom window. Click ‘Edit Instances’.

Select your master(s) and click ‘Save’.

We will create the Kubernetes cluster via a config file. You will need a token, the master’s private DNS name taken from the AWS console, the Load Balancer’s IP, and the Load Balancer’s DNS name. You can generate a Kubernetes token by running the following command on a machine on which you have installed kubeadm:

kubeadm token generate
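For reference, a bootstrap token is simply 6 lowercase alphanumeric characters, a dot, then 16 more (the format `[a-z0-9]{6}.[a-z0-9]{16}`). A Python sketch that produces a token in that shape (an illustration of the format, not a replacement for kubeadm):

```python
import secrets
import string

def generate_bootstrap_token() -> str:
    """Generate a token matching kubeadm's format: 6 chars, a dot, 16 chars."""
    alphabet = string.ascii_lowercase + string.digits
    part = lambda n: "".join(secrets.choice(alphabet) for _ in range(n))
    return f"{part(6)}.{part(16)}"

print(generate_bootstrap_token())  # e.g. 'abcdef.0123456789abcdef'
```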

To get the load balancer’s IP, you must execute a dig command. You install dig by running the following command as sudo:

yum install bind-utils

Then you execute the following command:

dig +short <load balancer dns>
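If installing bind-utils is not an option, the same lookup can be done from Python’s standard library (a stand-in for dig +short, shown for illustration; the load balancer hostname in the comment is a placeholder):

```python
import socket

def resolve(hostname: str) -> str:
    """Return one IPv4 address for the given hostname, like `dig +short`."""
    return socket.gethostbyname(hostname)

# e.g. resolve("<load balancer dns>") for your actual load balancer
print(resolve("localhost"))
```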

Then you create the following yaml file:

 ---
 apiVersion: kubeadm.k8s.io/v1beta1
 kind: InitConfiguration
 bootstrapTokens:
 - groups:
   - "system:bootstrappers:kubeadm:default-node-token"
   token: "<token>"
   ttl: "0s"
   usages:
   - signing
   - authentication
 nodeRegistration:
   name: "<master private dns>"
   kubeletExtraArgs:
     cloud-provider: "aws"
 ---
 apiVersion: kubeadm.k8s.io/v1beta1
 kind: ClusterConfiguration
 kubernetesVersion: "v1.13.2"
 apiServer:
   timeoutForControlPlane: 10m0s
   certSANs:
   - "<Load balancer IPV4>"
   extraArgs:
     cloud-provider: "aws"
 clusterName: kubernetes
 controlPlaneEndpoint: "<load balancer DNS>:6443"
 controllerManager:
   extraArgs:
     cloud-provider: "aws"
     allocate-node-cidrs: "false"
 dns:
   type: CoreDNS

You then bootstrap the cluster with the following command as sudo:

kubeadm init --config kubeadm.yaml --ignore-preflight-errors=all

I had a timeout error on the first attempt, but the command ran successfully the second time. Make a note of the output because you will need it to configure the nodes.

You then configure kubectl as follows:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

After this there are some components that need to be installed on Kubernetes on AWS:

# Grant the “admin” user complete access to the cluster

kubectl create clusterrolebinding admin-cluster-binding --clusterrole=cluster-admin --user=admin

# Add-on for networking providers, so pods can communicate. 
# Currently either calico.yaml or weave.yaml

kubectl apply -f https://aws-quickstart.s3.amazonaws.com/quickstart-vmware/scripts/weave.yaml

# Install the Kubernetes dashboard

kubectl apply -f https://aws-quickstart.s3.amazonaws.com/quickstart-vmware/scripts/dashboard.yaml

# Install the default StorageClass

kubectl apply -f https://aws-quickstart.s3.amazonaws.com/quickstart-vmware/scripts/default.storageclass.yaml

# Set up the network policy blocking the AWS metadata endpoint from the default namespace.

kubectl apply -f https://aws-quickstart.s3.amazonaws.com/quickstart-vmware/scripts/network-policy.yaml

Then you have to configure kubelet arguments:

sudo vi /var/lib/kubelet/kubeadm-flags.env

And add the following parameters:

--cloud-provider=aws --hostname-override=<the node name>

After editing the kubeadm-flags.env file:

sudo systemctl restart kubelet

Finally, you have to label your master with the provider ID. That way, any load balancers you create for this node will automatically add the node as an AWS instance:

kubectl patch node <node name> -p '{"spec":{"providerID":"aws:///<availability zone>/<instance ID>"}}'
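The patch body is easy to get wrong (note the three slashes after `aws:`). A small Python helper (hypothetical, shown only to make the argument construction explicit; the zone and instance ID below are examples) that builds it:

```python
import json

def provider_id_patch(availability_zone: str, instance_id: str) -> str:
    """Build the JSON patch that sets a node's AWS provider ID."""
    provider_id = f"aws:///{availability_zone}/{instance_id}"
    return json.dumps({"spec": {"providerID": provider_id}})

patch = provider_id_patch("us-west-2a", "i-0abc123def456")
print(patch)  # {"spec": {"providerID": "aws:///us-west-2a/i-0abc123def456"}}
# Then: kubectl patch node <node name> -p '<patch>'
```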

You can join worker nodes to the cluster by running the following command as sudo, which should have been printed out after running kubeadm init on the master:

kubeadm join <load balancer dns>:6443 --token <token> --discovery-token-ca-cert-hash <discovery token ca cert hash> --ignore-preflight-errors=all

Be sure to configure kubelet arguments on each node and patch them using kubectl as you did for the master.

Your Kubernetes cluster on AWS is now ready!

As one of the most popular cloud platforms, Microsoft Azure is the backbone of thousands of businesses – 80% of the Fortune 500 companies are on Microsoft cloud, and Azure holds 31% of the global cloud market! 

Microsoft’s customer-centricity shines through the entire Azure stack, and a critical part of it is the Azure Alerts that allows you to monitor the metrics and log data for the whole stack across your infrastructure, application, and Azure platform.

Azure Alerts offers organizations and IT managers access to faster alerts and a unified monitoring platform. Once set up, the software requires minimal technical effort and gives the IT team a centralized monitoring experience through a single dashboard that manages ALL the alerts.

The platform is designed to provide low latency log alerts and metric alerts which gives IT managers the opportunity to identify and fix production and performance issues almost in real-time. Naturally, in complex IT environments, this level of control and overview of the IT infrastructure leads to higher productivity and reduced costs.

Here are more details of how Azure Alerts work

Alerts proactively notify you when important conditions are found in your monitoring data. They allow you to identify and address issues before your users notice them.

This diagram represents the flow of alerts

Alerts can be created from

  • Metric values of resources
  • Log search queries results
  • Activity log events
  • Health of the underlying Azure platform

This is what a typical alert dashboard for a single/multiple subscriptions looks like

You can see 5 entities on the dashboard
  • Severity
    • Defines how severe the alert is and how quickly action needs to be taken.
  • Total alerts
    • Total number of alerts received aggregated by the severity of the alert.
  • New
    • The issue has just been detected and hasn’t yet been reviewed.
  • Acknowledged
    • An administrator has reviewed the alert and started working on it.
  • Closed
    • The issue has been resolved. After an alert has been closed, you can reopen it by changing it to another state.

We will now take you through the steps to create Metric Alerts, Log Search Query Alerts, Activity Log Alerts, and Service Health Alerts.

STEPS TO CREATE A METRIC ALERT

Go to Azure monitor. Click ‘alerts’ found on the left side.

To create a new alert, click on the ‘+ New alert rule’.

After clicking ‘+ New alert rule’ this window will appear.

To select a resource, click ‘Select’. A window will appear where you can choose the resource by filtering by subscription, resource type, and location. Then click ‘Done’ at the bottom.

Once the resource is selected, configure the condition. Click ‘Select’ to configure the signal. The signal type will show both metrics and the activity log for the selected resource.

Select the signal for which you need to create the alert. After selecting the signal, a new window is displayed where you need to define the alert logic.

Set the threshold sensitivity above which the alert should trigger. Threshold sensitivity applies to static thresholds only.

For dynamic threshold, the value is determined by continuously learning the data of the metric series and trying to model it using a set of algorithms and methods. It detects patterns in the data such as seasonality (Hourly / Daily / Weekly) and can handle noisy metrics (such as machine CPU or memory) as well as metrics with low dispersion (such as availability and error rate).
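To make the static-versus-dynamic distinction concrete, here is a toy Python sketch of a data-driven threshold (mean plus three standard deviations over recent history). This is not Azure’s actual algorithm, which also models seasonality; it only illustrates the idea that the threshold follows the metric’s own behavior:

```python
import statistics

def dynamic_threshold(history, k=3.0):
    """A naive data-driven threshold: mean + k standard deviations of recent values."""
    return statistics.mean(history) + k * statistics.pstdev(history)

cpu_history = [40, 42, 41, 39, 43, 40, 41, 42]  # hypothetical CPU% samples
threshold = dynamic_threshold(cpu_history)
print(90 > threshold)  # a 90% spike would fire the alert: True
```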

Now select an ‘action group’ if you already have one or create a new action group.

  • Provide a name for the action group.
  • Select the subscription and resource group where the action group needs to be deployed.
  • If you have selected the action type as Email/SMS/Push/Voice, that will display another window to configure the necessary details like email ID, contact number for SMS and voice notifications, etc., provide the information and select OK.
  • You can see the different action types available in the image below.

Input the alert details, alert rule name, description of the alert, and severity of the rule. Select ‘enable rule upon creation’.

Finally click ‘create alert rule’. It might take some time to create the alert and for it to start working.

HOW TO CREATE LOG SEARCH QUERY ALERTS

Repeat steps 1 to 3 as outlined in the Metric alert creation. In step 3 select the resource type as “log analytics workspace”.

Now select the condition. You can choose “Log (saved query)” or “Custom log search”.

Select the signal name as per your requirements; a new signal window will be displayed containing the attributes corresponding to the selected signal.

Here we have selected a saved query, which provides the result shown above.

The rule can be created based on:
  1. “Number of results” and the threshold provided
  2. Metric measurement and the threshold provided. The alert triggers based on either total breaches or continuous breaches of the threshold provided in the metric measurement.

Provide the evaluation period and the frequency in minutes at which the alert rule needs to be evaluated.

Follow the steps 8 and 9 as outlined in the Metric alert creation.

STEPS TO CREATE ACTIVITY LOG ALERTS

Repeat steps 1 to 3 as outlined in the Metric alert creation. In step 3 select the resource type as “log analytics workspace”.

On selecting the condition, click ‘Monitor Service’ and select ‘Activity Log – Administrative’.

Here we have selected ‘All Administrative operations’ as the signal.

Now configure the alert logic. The event level has many types; select as per your requirement and click ‘Done’ at the bottom.

Follow the steps 8 and 9 as outlined in the Metric alert creation.

STEPS TO CREATE A SERVICE HEALTH ALERT

You can receive an alert when Azure sends service health notifications to your Azure subscription. You can configure the alert based on:

  • The class of service health notification (Service issues, Planned maintenance, Health advisories).
  • The subscription affected.
  • The service(s) affected.
  • The region(s) affected.

Log in to the Azure portal and search for “service health” if it is not already on the left side. Click ‘Service Health’.

The Service Health service is now visible; select ‘Health alerts’ in the Alerts section.

Select Create service health alert and fill in the fields.

Select the subscription and services for which you need to be alerted.

Select the ‘region’ where your resources are located and select the ‘Event type’. Azure provides the following event types,

Select “all the event types” so that you can receive alerts irrespective of the event type.

Follow the steps 8 and 9 as outlined in the Metric alert creation. Then click on ‘Create Alert rule’. The service health alert can be seen in the Health Alerts section.

CloudIQ is a leading Cloud Consulting and Solutions firm that helps businesses solve today’s problems and plan the enterprise of tomorrow by integrating intelligent cloud solutions. We help you leverage the technologies that make your people more productive, your infrastructure more intelligent, and your business more profitable. 



© 2019 CloudIQ Technologies. All rights reserved.