Cosmos DB is Microsoft Azure's hugely successful multi-model database service, built to help clients manage data on a global scale. It allows Azure platform users to elastically and independently scale throughput and storage across any number of Azure regions worldwide.

As Cosmos DB supports multiple data models, you can take advantage of fast, single-digit-millisecond data access using any of your favorite APIs, including SQL, MongoDB, Cassandra, Tables, or Gremlin. Because it is a NoSQL database, anyone with MongoDB experience can work with Cosmos DB easily, while the SQL API makes it just as approachable for those with a SQL background.

Why Cosmos DB?

For organizations looking to build a flexible and scalable database that is globally distributed, Cosmos DB is especially useful as it

  • provides a ready-to-use, extremely dynamic database service
  • guarantees low latency of less than 10 milliseconds when reading data and less than 15 milliseconds when writing data
  • offers customers a faster, completely seamless experience
  • offers 99.99% availability

Here are some tried and tested tips from our senior Azure expert on how to get the most out of Cosmos DB.

Data Modeling

Cosmos DB is great because it lets you model semi-structured and unstructured aggregates and dynamic entities. This makes it easy to model ever-changing entities, entities that don't all share the same attributes, and hierarchical aggregates. To model for Cosmos, you need to think in terms of hierarchy and aggregates instead of entities and relations. NoSQL lets you, say, store a thing that has other things, which in turn have things of their own, and then get the whole hierarchy of things back in one read. So, you don't have a person, rental addresses, and a relation between them. Instead, you have rental records that aggregate, for each person, the rental addresses they've had.
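
For illustration, such a rental-history aggregate might be stored as a single document like the hypothetical one below (all property names and values are made up):

{
  "id": "rental-history-001",
  "personName": "Peter",
  "rentalAddresses": [
    { "street": "12 Example Lane", "city": "Seattle", "from": "2017-01-01", "to": "2018-06-30" },
    { "street": "7 Sample Street", "city": "Portland", "from": "2018-07-01", "to": "2020-03-31" }
  ]
}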

The following NoSQL rules apply to Cosmos DB as well.

  • It should be used as a complement to an existing or additional database.
  • Design with the PACELC theorem in mind, an extension of the CAP theorem.
  • A data modeler should think in terms of queries instead of in terms of storage.

Connection Types

An application can connect to Cosmos DB in two modes:

  • Gateway mode
  • Direct mode

Gateway mode is the default in the Microsoft.Azure.DocumentDB SDK; it uses HTTPS over a single endpoint. Direct mode is the default for the .NET V3 SDK and uses TCP and HTTPS for connectivity.

Gateway mode is the better choice when your application runs within a corporate network with strict network rules, because it exposes a single endpoint that can be allowed through the firewall. However, gateway mode performance is lower than that of direct mode.
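
As a rough illustration, here is a minimal sketch assuming the .NET V3 SDK (Microsoft.Azure.Cosmos); the account endpoint and key are placeholders. The connection mode is chosen when the client is created:

using Microsoft.Azure.Cosmos;

// Direct (TCP) is the V3 default; switch to Gateway when only HTTPS/443
// is allowed out of the corporate network.
CosmosClient client = new CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    "<primary-key>",
    new CosmosClientOptions { ConnectionMode = ConnectionMode.Gateway });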

There is also the option of connecting through the RESTful programming model instead of the SDK. All CRUD operations can be done through REST calls. This method is recommended when a client app needs to access the database directly rather than going through an API you provide; it removes the overhead of building an API wrapper around Cosmos DB and avoids the performance cost of that extra hop.

In most scenarios, direct mode is the recommended option, as it provides better performance.

I am using the popular volcano dataset to compare the response time between the SDK and the RESTful model.

Query executed in both versions:

SELECT * FROM c where c.Status="Holocene"
Response details of the SDK

The query returned the data in 3,710 ms.

Response details of the RESTful model

The query returned the data in 5,810 ms.

If we build an API on top of this model, the response time of our own API also has to be added on top. So, using the RESTful model inside an API trades away performance; use it when you need to query from the client directly.

Partitioning the DB

The logical partition key is the primary lever for Cosmos transaction performance. For example, suppose you have a database with around 1,500 student records from a school. A simple search for a student named “Peter” scans all 1,500 entries, which consumes a lot of throughput to get the result. Now split the data logically by the “Grade” the students belong to. Querying for a student named “Peter” in “Grade 5” makes the system search only the 30 or 40 students in that grade out of the total 1,500, saving throughput and improving performance compared to the earlier approach.

The common guidelines are:

a. Any key, such as a city, state, or country property, can be used as the partition key.
 b. No partition key is required for containers holding up to 10 GB of data.
 c. The query should be provided with the partition key to be searched.
 d. The property selected as the partition key must be present in all the documents in the container (see the sketch below).
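
To make guidelines (c) and (d) concrete, here is a minimal sketch, again assuming the .NET V3 SDK; the database, container, and property names are hypothetical. It creates a container partitioned on /grade and runs a partition-scoped query:

using System;
using Microsoft.Azure.Cosmos;

CosmosClient client = new CosmosClient(
    "https://<your-account>.documents.azure.com:443/", "<primary-key>");

Database database = (await client.CreateDatabaseIfNotExistsAsync("school")).Database;

// "/grade" must be present in every document written to this container.
Container students = (await database.CreateContainerIfNotExistsAsync("students", "/grade")).Container;

QueryDefinition query = new QueryDefinition("SELECT * FROM c WHERE c.name = @name")
    .WithParameter("@name", "Peter");

// Supplying the partition key limits the search to a single logical partition.
FeedIterator<dynamic> iterator = students.GetItemQueryIterator<dynamic>(
    query,
    requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey("Grade 5") });

while (iterator.HasMoreResults)
{
    foreach (var student in await iterator.ReadNextAsync())
    {
        Console.WriteLine(student);
    }
}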

I am using the popular volcano data to test the performance with and without a partition key.

1. I initially created a collection without a partition key. The performance for the given query is

SELECT * FROM c where c.Status="Holocene"
Resultset
Metric                                       Value
Partition key range id                       0
Retrieved document count                     200
Retrieved document size (in bytes)           100769
Output document count                        200
Output document size (in bytes)              101069
Index hit document count                     200
Index lookup time (ms)                       0.21
Document load time (ms)                      1.29
Query engine execution time (ms)             0.33
System function execution time (ms)          0
User defined function execution time (ms)    0
Document write time (ms)                     0.52

2. Then I recreated the same collection with "/country" as the partition key. The same query, with Japan as the partition key value, now produces the following results.

Resultset
Metric                                       Value
Partition key range id                       0
Retrieved document count                     16
Retrieved document size (in bytes)           7887
Output document count                        16
Output document size (in bytes)              7952
Index hit document count                     16
Index lookup time (ms)                       0.23
Document load time (ms)                      0.17
Query engine execution time (ms)             0.06
System function execution time (ms)          0
User defined function execution time (ms)    0
Document write time (ms)                     0.01

Tune the Index

Indexing is always a top-priority item on the performance-tuning checklist. Indexing is an internal job that keeps track of metadata about the data, which helps in finding the result set for a query. By default, every property of a Cosmos container is indexed. But that is unnecessary overhead for the DB, and keeping track of that much data consumes a significant number of RUs, which is not cost-effective. The better approach is to exclude all paths from indexing and add back only the paths that are used for querying in the application.
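
A rough sketch of that approach with the .NET V3 SDK, reusing the database object from the partitioning example above (the container and path names are just examples):

using Microsoft.Azure.Cosmos;

ContainerProperties props = new ContainerProperties(id: "volcanoes", partitionKeyPath: "/Country");

// Exclude everything by default, then index only the properties used in queries.
props.IndexingPolicy.IndexingMode = IndexingMode.Consistent;
props.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/*" });
props.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/Status/?" });

Container container = (await database.CreateContainerIfNotExistsAsync(props)).Container;

The comparison table below shows the difference custom indexing made.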

Indexing Mode                             With Default Indexing    With Custom Indexing
RU's Consumed                             3146.15                  19.83
Output Doc Count                          100                      100
Doc load time (in ms)                     646.51                   2.32
Query engine execution time (in ms)       434.26                   4.96
System function execution time (in ms)    57.03                    2.41

Paging

By default, a query execution returns 100 documents. We can increase that by setting the “maxItemCount” value, up to a maximum of 1,000 documents per page. But outside a few scenarios, it is not meaningful to pull 1,000 documents at a time from the DB. To improve performance and show a crisp result set to the user, keep “maxItemCount” small. Unlike SQL databases, pagination is the default behavior in Cosmos. So even if you set the maximum count to 1,000 and the query result is larger than that, Cosmos returns a “continuation token”. This token uniquely identifies the query and the page reached, and if Cosmos reports the total result size we can implement the pagination logic at the front end. By reducing the number of documents per response, we save throughput consumption and network traffic, and increase performance.
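
A minimal .NET V3 sketch of that flow; the page size here is arbitrary, and the container object is assumed to come from the earlier snippets:

using Microsoft.Azure.Cosmos;

// Ask for at most 50 documents per page; pass the previous page's token to continue.
FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(
    new QueryDefinition("SELECT * FROM c WHERE c.Status = 'Holocene'"),
    continuationToken: null,
    requestOptions: new QueryRequestOptions { MaxItemCount = 50 });

FeedResponse<dynamic> page = await iterator.ReadNextAsync();
string continuationToken = page.ContinuationToken; // null when there are no more pages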

Throughput Management

RUs, or Request Units, is a term we constantly come across when using Cosmos DB. When you read a document from a container or write a document to it, you are trading RUs with Cosmos for your operation. It is like currency in the everyday world: without money you can't buy anything, and without RUs you can't query anything. You can buy only items that cost no more than the money in your hand; similarly, you can only run the queries your provisioned RUs can pay for.

If you have a lot of data and a query needs to traverse deep into the collection, you need enough RUs. Every property added to the index consumes some RUs, so if all properties are indexed, your RU spend on indexing will be high and you will run short of RUs for querying. So, add to the index only the properties that are needed while querying.

Index properly, save RUs, and spend them on querying. For example, say you have 100K documents in the container and Cosmos DB consumes 1,000 RUs to cover only 50K of those documents with the default indexing; the query never reaches the remaining 50K documents, and they never appear in the result set. Properly indexed, the same query consumes only 400 RUs to cover all 100K documents.

Startup latency

The very first query is always a bit slow because of the time it takes to establish the connection. To overcome this startup latency, it is best practice to call “OpenAsync()” (SDK 2) up front when creating the connection.

await client.OpenAsync();
Singleton Connection

The best approach is to connect to the DB once and keep the connection alive for all instances of the application. Polling the DB periodically also keeps the connection alive. This reduces DB connectivity latency.
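
One simple way to do this in .NET is to hold a single shared CosmosClient for the lifetime of the application; a sketch with placeholder endpoint and key:

using Microsoft.Azure.Cosmos;

public static class CosmosClientHolder
{
    // CosmosClient is thread-safe and intended to be created once and reused.
    public static readonly CosmosClient Client = new CosmosClient(
        "https://<your-account>.documents.azure.com:443/",
        "<primary-key>");
}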

Regions

Make sure Cosmos DB and the applications that use it are grouped within the same Azure region. The lowest possible latency is achieved by ensuring the calling application is located within the same Azure region as the provisioned Azure Cosmos DB endpoint.

Programming Best Practices
  • Always use the latest SDK version.
  • Use the Streaming API (in SDK 3), which can receive and return data without serializing it. This is helpful when your API is just a relay and performs no logic on the data.
  • Tune the queries.
  • Implement retry logic with a reasonable wait time to prevent throttling during busy periods (see the sketch below).
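
For the last point, the .NET V3 SDK has built-in retry settings for rate-limited (HTTP 429) requests; a sketch, with the limits picked arbitrarily:

using System;
using Microsoft.Azure.Cosmos;

CosmosClientOptions options = new CosmosClientOptions
{
    // Retry up to 5 times and wait at most 30 seconds in total when throttled.
    MaxRetryAttemptsOnRateLimitedRequests = 5,
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
};

CosmosClient client = new CosmosClient(
    "https://<your-account>.documents.azure.com:443/", "<primary-key>", options);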

By carefully analyzing all the above factors, we can improve the Cosmos DB query performance substantially.

As organizations start to create and maintain clusters in AKS (Azure Kubernetes Service), they also need a cloud-based identity and access management service to access other Azure cloud resources and services. Azure Active Directory (AAD) pod identity is a service that gives users this control by assigning identities to individual pods.

Without these controls, accounts may get access to resources and services they don't require. And it can also become hard for IT teams to track which set of credentials were used to make changes.

Azure AD Pod identity is just one small part of the container and Kubernetes management process and as you delve deeper, you will realize the true power that Kubernetes and Containers bring to your DevOps ecosystem.

Here is a more detailed look at how to use AAD pod identity to connect pods in an AKS cluster to Azure Key Vault.

Pod Identity

Integrate your key management system with Kubernetes using pod identity. Secrets, certificates, and keys in a key management system become a volume accessible to pods. The volume is mounted into the pod, and its data is available directly in the container file system for your application.

On an existing AKS cluster -

Deploy Key Vault FlexVolume to your AKS cluster with this command:

  • kubectl create -f https://raw.githubusercontent.com/Azure/kubernetes-keyvault-flexvol/master/deployment/kv-flexvol-installer.yaml
1. Create the Deployment

Run this command to create the aad-pod-identity deployment on an RBAC-enabled cluster:

  • kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/master/deploy/infra/deployment-rbac.yaml

Or run this command to deploy to a non-RBAC cluster:

  • kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/master/deploy/infra/deployment.yaml
2. Create an Azure Identity

Create azure managed identity

Command:- az identity create -g ResourceGroupNameOfAKsService -n aks-pod-identity(ManagedIdentity)

Output:-  

{
"clientId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ",
"clientSecretUrl": "https://control-westus.identity.azure.net/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/aks_dev_rg_wu/providers/Microsoft.ManagedIdentity/userAssignedIdentities/aks-pod-identity/credentials?tid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&oid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx&aid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ",
"id": "/subscriptions/xxxxxxxx-xxxx-XXXX-XXXX-XXXXXXXXXXXX/resourcegroups/aks_dev_rg_wu/providers/Microsoft.ManagedIdentity/userAssignedIdentities/aks-pod-identity",
"location": "westus",
"name": "aks-pod-identity",
"principalId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
"resourceGroup": "au10515_aks_dev_rg_wu",
"tags": {},
"tenantId": XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXX ",
"type": "Microsoft.ManagedIdentity/userAssignedIdentities"
}

Assign Cluster SPN Role

Command for Getting AKSServicePrincipalID:- az aks show -g <resourcegroup> -n <name> --query servicePrincipalProfile.clientId -o tsv

Command:-az role assignment create --role "Managed Identity Operator" --assignee <AKSServicePrincipalId> --scope < ID of Managed identity>

Assign Azure Identity Roles

Command:- az role assignment create --role Reader --assignee <Principal ID of Managed identity> --scope <KeyVault Resource ID>

Set policy to access keys in your Key Vault

Command:- az keyvault set-policy -n dev-kv --key-permissions get --spn  <Client ID of Managed identity>

Set policy to access secrets in your Key Vault

Command:- az keyvault set-policy -n dev-kv --secret-permissions get --spn <Client ID of Managed identity>

Set policy to access certs in your Key Vault

Command:- az keyvault set-policy -n dev-kv --certificate-permissions get --spn <Client ID of Managed identity>

3. Install the Azure Identity

Save this Kubernetes manifest to a file named aadpodidentity.yaml:

apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentity
metadata:
name: <a-idname>
spec:
type: 0
ResourceID: /subscriptions/<subid>/resourcegroups/<resourcegroup>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<name>
ClientID: <clientId>

Replace the placeholders with your user identity values. Set type: 0 for user-assigned MSI or type: 1 for Service Principal.

Finally, save your changes to the file, then create the AzureIdentity resource in your cluster:

kubectl apply -f aadpodidentity.yaml

4. Install the Azure Identity Binding

Save this Kubernetes manifest to a file named aadpodidentitybinding.yaml:

apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzureIdentityBinding
metadata:
  name: demo1-azure-identity-binding
spec:
  AzureIdentity: <a-idname>
  Selector: <label value to match>

Replace the placeholders with your values. Ensure that the AzureIdentity name matches the one in aadpodidentity.yaml.

Finally, save your changes to the file, then create the AzureIdentityBinding resource in your cluster:

kubectl apply -f aadpodidentitybinding.yaml

Sample Nginx Deployment for accessing key vault secret using Pod Identity

Save this sample nginx pod manifest file named nginx-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: nginx-flex-kv-podid
    aadpodidbinding: 
  name: nginx-flex-kv-podid
spec:
  containers:
  - name: nginx-flex-kv-podid
    image: nginx
    volumeMounts:
    - name: test
      mountPath: /kvmnt
      readOnly: true
  volumes:
  - name: test
    flexVolume:
      driver: "azure/kv"
      options:
        usepodidentity: "true"         # [OPTIONAL] if not provided, will default to "false"
        keyvaultname: ""               # the name of the KeyVault
        keyvaultobjectnames: ""        # list of KeyVault object names (semi-colon separated)
        keyvaultobjecttypes: secret    # list of KeyVault object types: secret, key or cert (semi-colon separated)
        keyvaultobjectversions: ""     # [OPTIONAL] list of KeyVault object versions (semi-colon separated), will get latest if empty
        resourcegroup: ""              # the resource group of the KeyVault
        subscriptionid: ""             # the subscription ID of the KeyVault
        tenantid: ""            # the tenant ID of the KeyVault
Azure AD Pod Identity points to remember when implementing in cluster
  • Azure AD Pod Identity is currently bound to the default namespace. Deploying an Azure Identity and its binding to other namespaces will not work!
  • Pods from all namespaces can be executed in the context of an Azure Identity deployed to the default namespace (related to the point above).
  • Every pod developer can add the aadpodidbinding label to their pod and use your Azure Identity.
  • Azure Identity Binding does not use the default Kubernetes label selection mechanism.

As one of the most popular cloud platforms, Microsoft Azure is the backbone of thousands of businesses – 80% of the Fortune 500 companies are on Microsoft cloud, and Azure holds 31% of the global cloud market! 

Microsoft's customer-centricity shines through the entire Azure stack, and a critical part of it is the Azure Alerts that allows you to monitor the metrics and log data for the whole stack across your infrastructure, application, and Azure platform.

Azure Alerts offers organizations and IT managers access to faster alerts and a unified monitoring platform. Once set up, it requires minimal technical effort and gives the IT team a centralized monitoring experience through a single dashboard that manages ALL the alerts.

The platform is designed to provide low latency log alerts and metric alerts which gives IT managers the opportunity to identify and fix production and performance issues almost in real-time. Naturally, in complex IT environments, this level of control and overview of the IT infrastructure leads to higher productivity and reduced costs.

Here is a more detailed look at how Azure Alerts work.

Alerts proactively notify you when important conditions are found in your monitoring data. They allow you to identify and address issues before your users notice them.

This diagram represents the flow of alerts

Alerts can be created from

  • Metric values of resources
  • Log search queries results
  • Activity log events
  • Health of the underlying Azure platform

This is what a typical alert dashboard for a single/multiple subscriptions looks like

You can see 5 entities on the dashboard
  • Severity
    • Defines how severe the alert is and how quickly action needs to be taken.
  • Total alerts
    • Total number of alerts received aggregated by the severity of the alert.
  • New
    • The issue has just been detected and hasn't yet been reviewed.
  • Acknowledged
    • An administrator has reviewed the alert and started working on it.
  • Closed
    • The issue has been resolved. After an alert has been closed, you can reopen it by changing it to another state.

We will now take you through the steps to create Metric Alerts, Log Search Query Alerts, Activity Log Alerts, and Service Health Alerts.

STEPS TO CREATE A METRIC ALERT

Go to Azure monitor. Click ‘alerts’ found on the left side.

To create a new alert, click on the ‘+ New alert rule’.

After clicking ‘+ New alert rule’ this window will appear.

To select a resource, click ‘Select’. A window is displayed where you can choose the resource by filtering on subscription, resource type, and resource location. Then select ‘Done’ at the bottom.

Once the resource is selected, configure the condition. Click ‘Select’ to configure the signal. The signal type shows both metrics and the activity log for the selected resource.

Select the signal for which you need to create the alert. After selecting the signal, a new window is displayed where you describe the alert logic.

Set the threshold sensitivity above which you need to trigger the alert. Setting the threshold sensitivity is applicable for static threshold only.

For dynamic threshold, the value is determined by continuously learning the data of the metric series and trying to model it using a set of algorithms and methods. It detects patterns in the data such as seasonality (Hourly / Daily / Weekly) and can handle noisy metrics (such as machine CPU or memory) as well as metrics with low dispersion (such as availability and error rate).

Now select an ‘action group’ if you already have one or create a new action group.

  • Provide a name for the action group.
  • Select the subscription and resource group where the action group needs to be deployed.
  • If you have selected Email/SMS/Push/Voice as the action type, another window is displayed to configure the necessary details such as the email ID and the contact number for SMS and voice notifications. Provide the information and select OK.
  • You can see the different action types available in the image below.

Input the alert details: the alert rule name, a description of the alert, and the severity of the rule. Select ‘Enable rule upon creation’.

Finally, click ‘Create alert rule’. It might take some time for the alert to be created and start working.
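
If you prefer scripting to the portal, roughly the same metric alert can be created with the Azure CLI. This is only a sketch; the resource group, VM resource ID, and action group below are placeholders:

az monitor metrics alert create \
  --name cpu-above-80 \
  --resource-group my-rg \
  --scopes /subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.Compute/virtualMachines/my-vm \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action <action-group-id> \
  --severity 2 \
  --description "Alert when average CPU exceeds 80%"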

HOW TO CREATE LOG SEARCH QUERY ALERTS

Repeat steps 1 to 3 as outlined in the Metric alert creation. In step 3 select the resource type as "log analytics workspace".

Now select the condition. You can choose “Log (saved query)” or select “Custom log search”.
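
For a “Custom log search”, the condition wraps a Kusto query. A hypothetical example (the Perf table and counter names depend on what your workspace actually collects):

Perf
| where CounterName == "% Processor Time"
| summarize AggregatedValue = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)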

Select the signal name as per your requirements; a new signal window will be displayed containing the attributes corresponding to the selected signal.

Here we have selected a saved query, which provides the result shown as above.

  1. A rule created based on “Number of results” and the threshold provided.
  2. A rule created based on “Metric measurement” and the threshold provided. The alert triggers based on
    • total breaches, or
    • continuous breaches of the threshold provided in the metric measurement.

Provide the evaluation period and the frequency, in minutes, at which the alert rule should be evaluated.

Follow steps 8 and 9 as outlined in the metric alert creation.

STEPS TO CREATE ACTIVITY LOG ALERTS

Repeat steps 1 to 3 as outlined in the Metric alert creation. In step 3 select the resource type as "log analytics workspace".

On selecting the condition, click ‘Monitor Service’ and select ‘Activity Log – Administrative’.

Here we have selected all administrative operations as the signal.

Now configure the alert logic. The event level has several types; select one as per your requirement and click Done at the bottom.

Follow steps 8 and 9 as outlined in the metric alert creation.

STEPS TO CREATE A SERVICE HEALTH ALERT

You can receive an alert when Azure sends service health notifications to your Azure subscription. You can configure the alert based on:

  • The class of service health notification (Service issues, Planned maintenance, Health advisories).
  • The subscription affected.
  • The service(s) affected.
  • The region(s) affected.

Log in to the Azure portal and search for Service Health, or find it on the left-side menu. Click Service Health.

The Service Health service is now visible; select “Health alerts” in the Alerts section.

Select Create service health alert and fill in the fields.

Select the subscription and services for which you need to be alerted.

Select the ‘region’ where your resources are located and select the ‘Event type’. Azure provides the following event types,

Select “all the event types” so that you can receive alerts irrespective of the event type.

Follow steps 8 and 9 as outlined in the metric alert creation, then click ‘Create alert rule’. The service health alert can be seen in the Health Alerts section.

Automation Testing helps complete the entire software testing life cycle (STLC) in less time and improves the efficiency of the testing process.

Test Automation enables teams to verify functionality, test for regression and run simultaneous tests efficiently. In this article we will take a detailed look at the Automation Testing Tools available, standards and best practices to be followed during Test Automation.

Following the best practices for Software Testing Life Cycle (Unit testing, Integration Testing & System Testing) ensures that the client gets the software as intended without any bugs. End-to-end testing is the methodology used to test whether the flow of an application is performing as designed from start to finish. Carrying out end-to-end tests helps identify system dependencies and ensure the flow of right information across various system components and the system.

Ultimately Automation Testing increases the speed of test execution and the test coverage.

When to Choose Automation Testing
  • There is a lot of regression work
  • The GUI is the same, but there are frequent functional changes
  • Requirements do not change frequently
  • Load and performance testing with many virtual users
  • Repetitive test cases that lend themselves well to automation and save time
  • Huge projects
  • Projects that need to test the same areas repeatedly

Steps to Implement Automation Testing
  • Identify areas within software to automate
  • Choose the appropriate tool for test automation
  • Write test scripts
  • Develop test suites
  • Execute test scripts
  • Build result reports
  • Find possible bugs or performance issues
Choosing your Automation Testing Tool

The strategy to adopt test automation should clearly define when to opt for automation, its scope and selection of the right kind of tools for execution. And when it comes to tools the top ones to go for are

  • Cypress
  • Selenium
  • Protractor
  • Appium(Mobile)
Why Cypress?

Cypress is a JavaScript-based testing framework built for the modern web. Cypress helps create end-to-end tests, integration tests, and unit tests. It takes a different approach compared to other testing frameworks, since it is executed in the same run loop as the application. It also leverages a Node.js server to handle any task that needs to happen outside of the browser. Because it can see everything happening inside and outside of the browser, it produces more consistent results.

Key Features of Cypress
  • Automatic Waiting – No need for adding wait and sleep.
  • Spies, Stubs, and Clocks – Verify and control the behaviour of functions, server responses, or timers.
  • Network traffic control and monitoring – Easily control, stub, and test edge cases without involving your server. You can stub network traffic however you like.
  • Consistent Results – The Cypress architecture doesn’t use Selenium or WebDriver. It is fast and consistent and produces reliable, flake-free tests.
  • Screenshots and Videos – View screenshots taken automatically on failure, or videos of your entire test suite when run from the CLI.
Azure CICD Setup with Cypress

Cypress runs on most of the following CI providers.

Azure DevOps / VSTS CI / TeamFoundation
BitBucket
CircleCI
Docker
GitLab
Jenkins
TravisCI

Azure DevOps – Steps to Integrate Cypress Automation Tests
  • Pre-Build Testing
    • Install the Node modules and run the application in test mode
    • Run the tests
    • Publish the test results
  • Cypress Containerization
    • Build the Docker container for Cypress
    • Push the image to the container registry
    • Publish the build

Before we get started, here are the basic Cypress commands.

Clean up the old results:
$ rm -rf cypress/reports/
 
Run the Cypress application with the required spec file:
$ cypress run --spec "cypress/integration/**/*.spec.ts"   // mention your spec file
 
Configure the mocha reporter path for publishing test results:
--reporter junit --reporter-options 'mochaFile=cypress/reports/test-output-[hash].xml,toConsole=true'
 
Uninstall the application:
$ npm uninstall cypress-multi-reporters; npm uninstall cypress-promise; npm uninstall cypress

Pre-Build Testing

It is critical to test the application before the build, deployment, or release. Essentially, the process involves regression and smoke testing. And don’t forget the sanity checks before the build is deployed to the staging environment.

Cypress comes in handy for testing Angular / JavaScript applications before they are deployed to a staging or production environment.

Install the Node module and run application in test mode

Install the required Node modules for the application, then run the application in test mode.

$ npm install --save-dev start-server-and-test

$ start-server-and-test start http://localhost:4200

Publish the test results

The results of the Cypress test execution are stored in the specified path and are added to the Azure DevOps test results. Cypress supports the JUnit, Mocha, and Mochawesome test result reporter formats, and provides options to create customised test results and to merge all the test results as well.

Cypress Containerization

Cypress supports Docker containerization, which makes it easy to set up in a cluster environment like AKS. The Cypress base images are available at the link below.

https://github.com/cypress-io/cypress-docker-images

Copy the package.json and UI source code to the app folder and run the Cypress tests. The following commands are used to run Docker and execute the tests.

  - script: |
      docker run -d -it --name cypressName:cypressImageTag bash
      docker commit -p cypressName:cypressImageTag
      docker stop cypressName
      docker rm -f cypressName

  - script: docker tag cypressName:cypressImageTag
    displayName: Tag Cypress image

  - task: Docker@1
    displayName: Push image To Registry
    inputs:
      command: push
      azureSubscriptionEndpoint: azureSubscriptionEndpoint
      azureContainerRegistry: $(azureContainerRegistry)
      imageName: acrImageName:BuildId

  - script: sudo rm -rf /test-results/*
    displayName: Removing Previous Results

  - task: ShellScript@2
    displayName: 'Bash Script - cypress base image post-deployment'
    inputs:
      scriptPath: ./cypress-deployment.sh
      args: $(azureRegistry) $(cypressImageName) $(azureContainerValue) $(CYPRESS_OPTIONS)
    continueOnError: true

  - task: PublishTestResults@1
    displayName: 'Publish Test Results ./test-results-*.xml'
    inputs:

  cypress-base-image-post-deployment.sh

  docker run -v $systemSourceDirectory:/app/cypress/reports --name vca-arp-ui \
    $cypress_Latestimage npx cypress run $cypressOptions bash

Now the container should be set up on your local machine and start running your specs.

Cypress is simple and integrates easily with your CI environment. Apart from its broad browser support, Cypress reduces the effort of manual testing and is relatively fast compared to other automation testing tools.

How to Import BACPAC File Created from Azure SQL Database?

When you need to export a database for archiving or for moving to another platform, you can export the database schema and data to a BACPAC file. A BACPAC file is a ZIP file with an extension of BACPAC containing the metadata and data from a SQL Server database. A BACPAC file can be stored in Azure Blob storage or in local storage in an on-premises location and later imported back into Azure SQL Database or into a SQL Server on-premises installation.

Import a BACPAC File to an On-Premises SQL Server:

  • C:\Program Files (x86)\Microsoft SQL Server\140\DAC\bin>
  • SqlPackage.exe /a:import /sf:\\Userdb0.bacpac /tsn:SERVER-SQL\DEV2016 /tdn:Azure_Test /p:CommandTimeout=2400

Error :

When you try to import a BACPAC file created from an Azure environment, you might encounter the following error if the database contains an external data source reference.

TITLE: Microsoft SQL Server Management Studio
            
    Could not import package.
    Warning SQL72012: The object [AzureProd] exists in the target,
    but it will not be dropped even though you selected the
    'Generate drop statements for objects that are in the target database but that
    are not in the source' check box.
    Warning SQL72012: The object [AzureProd_Log] exists in the target,
    but it will not be dropped even though you selected the
    'Generate drop statements for objects that are in the target database but that
    are not in the source' check box.
    Error SQL72014: .Net SqlClient Data Provider: Msg 102, Level 15, State 1,
    Line 1 Incorrect syntax near 'EXTERNAL'.
    Error SQL72045: Script execution error. The executed script:
    CREATE EXTERNAL DATA SOURCE [DB_EXT_EDS]
    WITH (
    TYPE = RDBMS,
    LOCATION = N'sqlserver.database.windows.net',
    DATABASE_NAME = N'AdventureWorks',
    CREDENTIAL = [DB_EXT_CRED] );
            
    

Solution :

Drop the external tables and external data source in the Azure SQL database and create the BACPAC file again without those references.

Drop External Tables and External Data Source

            
        IF EXISTS 
        (
        SELECT 'x' FROM sys.external_tables)
        BEGIN
        DROP EXTERNAL TABLE EXT_Table1 
        DROP EXTERNAL TABLE EXT_Table2 
        DROP EXTERNAL TABLE EXT_Table3 
        END   
         
        IF EXISTS 
        ( 
        SELECT * FROM sys.external_data_sources 
        WHERE name ='DB_EXT_EDS' 
        ) 
        BEGIN 
        DROP EXTERNAL DATA SOURCE DB_EXT_EDS; 
        END  
            
        

If you can’t recreate BACPAC without dropping the tables, you can follow these steps.

  1. Change the file extension to zip, then decompress it into a folder. Surprisingly, a bacpac is actually just a zip file, not something proprietary and hard to get into.
  2. Find the model.xml file and edit it to remove the section that looks like this:
<Element Type="SqlExternalDataSource" Name="[BoxDataSrc]">
  <Property Name="DataSourceType" Value="1" />
  <Property Name="Location" Value="MYAZUREServer.database.windows.net" />
  <Property Name="DatabaseName" Value="MyAzureDb" />
  <Relationship Name="Credential">
    <Entry>
      <References Name="[SQL_Credential]" />
    </Entry>
  </Relationship>
</Element>

If you have multiple external data sources of this type, you will probably need to repeat step 2 for each one.

Save and close model.xml.

Now you need to re-generate the checksum for model.xml so that the bacpac doesn’t think it was tampered with (since you just tampered with it). Create a PowerShell file named computeHash.ps1 and put this code into it.

Generate Checksum

             
$modelXmlPath = Read-Host "model.xml file path"
$hasher = [System.Security.Cryptography.HashAlgorithm]::Create("System.Security.Cryptography.SHA256CryptoServiceProvider")
$fileStream = New-Object System.IO.FileStream -ArgumentList @($modelXmlPath, [System.IO.FileMode]::Open)
$hash = $hasher.ComputeHash($fileStream)
$hashString = ""
foreach ($b in $hash) { $hashString += $b.ToString("X2") }
$fileStream.Close()
$hashString
            
     

Run the PowerShell script and give it the filepath to your unzipped and edited model.xml file. It will return a checksum value.

Copy the checksum value, then open up Origin.xml and replace the existing checksum, toward the bottom on the line that looks like this:

<Checksum Uri="/model.xml">9EA0F06B282G4F42955C78A98822A31AA0ED0225CB131B8759379055A482D01G</Checksum>

Save and close Origin.xml, then select all the files and put them into a new zip file and rename the extension to bacpac.

Now you can use this new bacpac to import the database without getting the error.

Deadlocks in Azure SQL Database

Recently we were working with Azure Logic Apps to invoke Azure Functions. By default, a Logic App runs parallel threads, and we didn't explicitly control the concurrency, so we left the default values in place.

So the Logic App invoked several concurrent threads, which in turn invoked several Azure Functions. The problem was that the Azure Functions made database calls, which caused deadlocks. In an ideal world, the database should be able to handle numerous concurrent functions without deadlocks. Our process handles a high percentage of shared data, and we wanted to ensure consistency, so we had explicit transactions in our stored procedure calls. That was the root cause of the problem, and we didn't want to remove the explicit transactions.

The solution we implemented to alleviate this problem was to run the process in sequence instead of in parallel threads.

Logic App Concurrency Control Behavior

For each loops execute in parallel by default. Customize the degree of parallelism, or set it to 1 to execute in sequence.
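
For reference, this is roughly how the “For each” concurrency setting appears in the underlying workflow definition; the action names here are hypothetical:

"For_each": {
  "type": "Foreach",
  "foreach": "@body('Get_items')",
  "actions": {},
  "runAfter": {},
  "runtimeConfiguration": {
    "concurrency": {
      "repetitions": 1
    }
  }
}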

Logic_App_Concurrency

Deadlock Exception

Transaction (Process ID 166) was deadlocked on lock
| communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

Deadlock_Exception

Troubleshooting Deadlocks

So we identified, through Application Insights, that a deadlock happened in the database. The next logical question is: what caused this deadlock?

Azure SQL Server Deadlock Count

These queries identify the deadlock event time as well as the deadlock event details.

                
        SELECT * FROM sys.event_log
        WHERE event_type = 'deadlock';

        WITH CTE AS (
            SELECT CAST(event_data AS XML) AS [target_data_XML]
            FROM sys.fn_xe_telemetry_blob_target_read_file('dl', null, null, null)
        )
        SELECT
            target_data_XML.value('(/event/@timestamp)[1]', 'DateTime2') AS Timestamp,
            target_data_XML.query('/event/data[@name=''xml_report'']/value/deadlock') AS deadlock_xml,
            target_data_XML.query('/event/data[@name=''database_name'']/value').value('(/value)[1]', 'nvarchar(100)') AS db_name
        FROM CTE
                
        

You can save the deadlock XML as an .xdl file to view the deadlock diagram. This provides all the information we need to identify the root cause of the deadlock and take the necessary steps to resolve the issue.


Grafana is an open-source, general purpose dashboard and graph composer, which runs as a web application.

You can monitor Azure services and applications from Grafana using the Azure Monitor data source plugin. The plugin gathers application performance data collected by the Application Insights SDK as well as infrastructure data provided by Azure Monitor. You can then display this data on your Grafana dashboard.

Grafana uses an Azure Active Directory service principal to connect to Azure Monitor APIs and collect metrics data. You must create a service principal to manage access to your Azure resources.

Why Grafana?

Grafana provides more visualization options than the Azure Portal. It also supports multiple data sources. One can combine data from multiple sources in a single dashboard. Grafana is designed for analyzing and visualizing metrics such as system CPU, memory, disk and I/O utilization. Users can create comprehensive charts with smart axis formats (such as lines and points) as a result of Grafana’s fast, client-side rendering — even over long ranges of time.

Grafana dashboards are what made Grafana such a popular visualization tool. Visualizations in Grafana are called panels, and users can create a dashboard containing panels for different data sources. Grafana supports graph, singlestat, table, heatmap, and freetext panel types, and its users can draw on a large ecosystem of ready-made dashboards for different data types and sources. Note that Grafana is only a visualization solution: time series storage is not part of its core functionality.

Some of the features of Grafana are as follows

  • Optimized for Time series
  • Can pull data from Azure Metrics, Log Analytics and Application Insights
  • Azure Data Explorer (formerly known as Kusto) plugin also released.
  • Rich ecosystem of plugins for data sources and dashboards.
  • Open Source, easy to onboard using Docker, Azure App Service etc.

Some of the requirements of Grafana are described below.

  • Azure SPN (Service Principal Name) with reader access to subscription
  • Deploy in Azure web apps.
  • Data source plugin “grafana-azure-monitor-datasource”
  • Supports AD integration via LDAP
  • Easy to export/import and templatize
  • Very DevOps friendly
  • Huge collection of panels https://play.grafana.org

Azure Monitor Data Source For Grafana

Azure Monitor is the platform service that provides a single source for monitoring Azure resources. The Azure Monitor Data Source plugin supports Azure Monitor, Azure Log Analytics and Application Insights metrics in Grafana.

Features

  • Support for all the Azure Monitor metrics
    • includes support for the latest API version that allows multi-dimensional filtering for the Storage and SQL metrics.
    • Automatic time grain mode which will group the metrics by the most appropriate time grain value
  • Application Insights metrics
    • Write raw log analytics queries, and select x-axis, y-axis, and grouped values manually.
    • Automatic time grain support
  • Support for Log Analytics (both for Azure Monitor and Application Insights)
  • You can combine metrics from both services in the same graph.

Azure Monitor for VMs provides an in-depth view of VM health, performance trends, and dependencies. Azure Monitor for VMs includes a set of performance charts that target several key performance indicators (KPIs) to help you determine how well a virtual machine is performing. Azure Monitor for VMs is focused on the operating system as manifested through the processor, memory, network adapters, and disks.

Azure Dashboards

Azure dashboards allow you to combine different kinds of data, including both metrics and logs, into a single pane in the Azure portal. You can optionally share the dashboard with other Azure users. Elements throughout Azure Monitor can be added to an Azure dashboard in addition to the output of any log query or metrics chart. Azure Monitor is the single source for monitoring Azure resources; it is Azure's time series database for all Azure metrics.

Some of the important aspects of Azure Dashboard

  • No setup required, already available within Azure Portal.
  • Zoom in zoom out for metrics not available
  • All data from Azure resources.
  • Log Analytics/AI queries cannot be parameterized based on Dashboard selection.
  • Query results can be pinned to dashboards
  • Some useful panels are tied to specific products and can't be customized.
    • E.g., percentile panels are only available in “Container Insights” and VM Insights.
    • These panels cannot be used against a “Log Analytics” source or a Metric source.

Some of the features of Azure Dashboard are as follows

  • Supports visualizing most Azure resources
  • OOB Integrated with Azure RBAC
  • Supports Log Analytics, App Insights and Metrics
  • No Auto refresh per panel
  • No Zoom in Zoom out.
  • Dashboard queries don’t support variables

Azure Dashboards (VM insights/ Container Insights)

  • These tiles can only be accessed by navigating to the VM resource.
  • They cannot be pinned as is, but the detailed version of this can be pinned.
  • No zoom in zoom out capability.

Azure Dashboards – Metrics

  • These are pinnable
  • Don’t support percentiles
  • No drill ability
  • Each Panel is hard coded to a specific data source even if they might be the same behind the scenes.

A comparison between Grafana and Azure Dashboard is shown below.

Azure Dashboard
  • Multiple Azure resource types
  • Limited configuration options. Requires JSON editing
  • Application Insights → OOB Azure Dashboard
  • Only static queries
  • No setup required
  • Not intuitive for overlaying.
Grafana
  • Mostly Time Series
  • Highly configurable
  • Global variables as filters
  • Dashboard and individual panel refresh.
  • Supports query macros
  • Setup required (minimal)
  • Intuitive overlays

1. Security and Compliance:

If you are wondering why we are starting with security, check out this number: $6 trillion. That's the amount of annual damage cybercrime is predicted to cost us by 2021.

Which is precisely why the first thing you need to check while picking your cloud service provider is their security and compliance levels – both physical as well as virtual – this includes the geographical location of their data centers and the local laws of the country they are based in.

There are a number of certifications and standards which guarantee the security preparedness of cloud vendors; their validity must be checked and additional investigations must be carried out by checking internal and third-party audits or reports.

You need to do a deep check of:

  • Security infrastructure and procedures followed by the vendor
  • Identity management and authorizations
  • Physical security controls including the process for natural disasters
  • Policies for data back-up and disaster recovery

2. Technical Capabilities

An obvious point, but it still needs to be reiterated.

Your service provider should have a full stack of technologies that support your current applications and also has the capability to match your future needs.

Cloud partnerships last a long time, and it's important to check the future roadmap of the service provider to understand if they have the mindset to catch trends early and innovate.

Some questions to focus on:

  • Will your current software and applications integrate easily with the service provider’s cloud infrastructure?
  • Do they use standard interfaces and APIs for easy integration?
  • Do they have the capability of providing hybrid cloud computing options and do they have the flexibility to host different cloud environments and systems?
  • Are they backing their capabilities with SLAs?
  • Are they willing and capable to architect solutions tailored to your business?

3. Costs

No two cloud service providers have similar or comparable pricing packages. They each have their own formula of computing cloud costs, and it is almost impossible to make a side-by-side comparison of different vendors. What you need to do is map out your organization’s requirement as minutely as possible and then decide which pricing model suits your needs.

Keep in mind:

  • Consumption timelines as long-term contracts are better priced
  • The flexibility offered by service providers in scaling up or down
  • Check for hidden costs

4. Business Health

The stability of your business depends on the stability of its partners, and you cannot underestimate the importance of a cloud partner. Before finalizing your cloud vendor, it is important to check their business and financial health.

You should check:

  • The company’s financial records
  • Management structure and other third-party relationships
  • Reputation, reviews, and referrals from existing customers
  • For any legal run-ins
  • All available third-party audits

5. Support

Do you just have phone or chat access, or does your service provider offer dedicated account management? How much support you can get from your vendor is another important criterion that must be considered before finalizing a service provider.

Find out about:

  • Time guarantees for solving technical issues
  • Access to support services – 24x7 or 12x5
  • Cost of opting for dedicated resources

Deciding on a cloud service provider is a long process that demands complete thoroughness and analysis from the CIO and the rest of the team.

Before we leave you to navigate your way to your future cloud partner, here are two more important points that must be considered – Right size and exit strategy.

Keep in mind that to get the best service you need to find a vendor who connects with you and for whom you are a valuable client.

And always plan an exit strategy in case things don’t work out.

Best of Luck!

When you deploy a new change into production, the associated deployment should happen in a predictable manner. In simple terms, this means no disruption and zero downtime! In case you do encounter a problem or a bottleneck, the deployment strategy should allow a quick rollback.

This safe strategy is achieved by working with two identical infrastructures: the “green” environment hosting the current production and the “blue” environment with the new changes.

The business and IT teams will have an opportunity to conduct sanity, smoke test or any other test in the “blue” environment before making a “Go” decision. Upon “Go”, the team can switch “blue” to “green” and “green” to “blue”.

In Azure, different processes are available for implementing the Blue-Green strategy with two environments.

We have listed below some of these techniques. Naturally, this list is not fixed and will grow continuously as new tool sets and services emerge.

  • Deployment slots - For Web Apps, deployment slots provide an easy way to implement Blue-Green deployments (see the CLI sketch after this list).
  • Azure Traffic Manager – This can be leveraged to realize Blue-Green deployments for smoother deployments with weighted round-robin routing method. The detailed configuration and implementation methods are available in Azure Documentation.
  • Using an Application Gateway with two backend pools and a routing rule - Have two backend resource pools with one as a stage pool and another one as a prod pool. Add stage VMSS to stage pool, prod VMSS to prod pool and have one routing rule in the app gateway. Depending on the need to use stage or prod VMSS, this rule will be changed to point to the appropriate backend address pool.
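
As a quick illustration of the first option, a staging slot can be swapped into production with a single Azure CLI command; the app and resource group names below are placeholders:

az webapp deployment slot swap \
  --resource-group my-rg \
  --name my-webapp \
  --slot staging \
  --target-slot production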

CloudIQ architects and engineers have implemented Blue-Green deployment for multiple clients, and in each case, we have customized our strategies to suit their use-cases. If you are looking for a completely safe way to deploy new software versions and applications, then reach out to us at sales@cloudiq.io

Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.

Spark provides distributed task transmission, scheduling, and I/O functionality. It provides programmers with a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied.

How Apache Spark works

Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores.

The Spark Core engine uses the resilient distributed data set, or RDD, as its primary data type. The RDD is designed in such a way to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn't have to define where specific files are sent or what computational resources are used to store or retrieve files.

Given below is a sample Spark program written in Python that counts the number of records with each rating in the input file:

 from pyspark import SparkConf, SparkContext
        import collections
        
        conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
        sc = SparkContext(conf = conf)
        
        lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
        ratings = lines.map(lambda x: x.split()[2])
        result = ratings.countByValue()
        
        sortedResults = collections.OrderedDict(sorted(result.items()))
        for key, value in sortedResults.items():
            print("%s %i" % (key, value))
        
         

In the above code, sc is the SparkContext associated with the input file u.data. ratings is an RDD created by mapping the third column of the input file (array index [2] – the rating). Here, map() is a transformation function that produces a new RDD.

We can have multiple transformations in a single Spark program, each producing a new RDD from an existing RDD or an input file. countByValue() is an action that is performed. In Spark, transformations are not executed until an action is triggered. This is called lazy evaluation.

Apache Spark works

Figure 1

Spark languages

Spark was written in Scala, which is considered the primary language for interacting with the Spark Core engine. Out of the box, Spark also comes with API connectors for using Java, R, and Python.

Spark libraries

  • The Spark Core engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data.
  • Spark SQL -- One of the most commonly used libraries, Spark SQL enables users to query data stored in disparate applications using the common SQL language.
  • Spark Streaming -- This library allows users to build applications that analyze and present data in real time.
  • MLlib -- A library of machine learning code that enables users to apply advanced statistical operations to data in their Spark cluster and to build applications around these analyses.
  • GraphX -- A built-in library of algorithms for graph-parallel computation.

RDDs, DataFrames, and Datasets

An RDD is an immutable distributed collection of elements of data, partitioned across nodes in a cluster that can be operated in parallel with a low-level API that offers transformations and actions.

Like an RDD, a DataFrame is an immutable distributed collection of data. However, unlike an RDD, data is organized into named columns, like a table in a relational database.

Datasets in Apache Spark are an extension of DataFrame API which provides type-safe, object-oriented programming interface.

Executing SQL-style functions on a Dataframe

Given below is a map-reduce program to get the list of popular movies (which has been rated by many customers using the same input data as Figure 1 above).

 from pyspark import SparkConf, SparkContext
        
        conf = SparkConf().setMaster("local").setAppName("PopularMovies")
        sc = SparkContext(conf = conf)
        
        lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
        movies = lines.map(lambda x: (int(x.split()[1]), 1))
        movieCounts = movies.reduceByKey(lambda x, y: x + y)
        
        flipped = movieCounts.map( lambda xy: (xy[1],xy[0]) )
        sortedMovies = flipped.sortByKey()
        
        results = sortedMovies.collect()
        
        for result in results:
            print(result)
        
         

The same program, when written using DataFrames, will look like this

 from pyspark.sql import SparkSession
        from pyspark.sql import Row
        from pyspark.sql import functions
        
        def loadMovieNames():
            movieNames = {}
            with open("ml-100k/u.ITEM") as f:
                for line in f:
                    fields = line.split('|')
                    movieNames[int(fields[0])] = fields[1]
            return movieNames
        
        # Create a SparkSession (the config bit is only for Windows!)
        spark = SparkSession.builder.config("spark.sql.warehouse.dir", 
        "file:///C:/temp").appName("PopularMovies").getOrCreate()
        # Load up our movie ID -> name dictionary
        nameDict = loadMovieNames()
        
        # Get the raw data
        lines = spark.sparkContext.textFile("file:///SparkCourse/ml-100k/u.data")
        # Convert it to a RDD of Row objects
        movies = lines.map(lambda x: Row(movieID =int(x.split()[1])))
        # Convert that to a DataFrame
        movieDataset = spark.createDataFrame(movies)
        
        # Some SQL-Style magic to sort all movies by popularity in one line!
        topMovieIDs = movieDataset.groupBy("movieID").count().orderBy
        ("count", ascending=False).cache()
        
        # Show the results at this point:
        
        #|movieID|count|
        #+-------+-----+
        #|     50|  584|
        #|    258|  509|
        #|    100|  508|
        
        topMovieIDs.show()
        
        # Grab the top 10
        top10 = topMovieIDs.take(10)
        
        # Print the results
        print("\n")
        for result in top10:
            # Each row has movieID, count as above.
            print("%s: %d" % (nameDict[result[0]], result[1]))
        
        # Stop the session
        spark.stop()
        
         

As you can see, DataFrames give us the flexibility to use SQL-style functions to get the required results. Because the DataFrame API is built on top of the Spark SQL engine, it uses Catalyst to generate an optimized logical and physical query plan.

Job Scheduling

Spark has several facilities for scheduling resources between computations.

  • Each Spark application (instance of SparkContext) runs an independent set of executor processes. The cluster managers that Spark runs on provide facilities for scheduling across applications.
  • Within each Spark application, multiple “jobs” (Spark actions) may run concurrently if they were submitted by different threads. This is common if the application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext (see the sketch after this list).
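
As a minimal sketch of the configuration involved (the property names are Spark's documented settings; the pool name "production" is only an illustrative assumption), enabling the fair scheduler looks like this:

from pyspark import SparkConf, SparkContext

# Switch the in-application scheduler from the default FIFO mode to FAIR.
conf = SparkConf().setAppName("FairSchedulingDemo").set("spark.scheduler.mode", "FAIR")
sc = SparkContext(conf=conf)

# Optionally group the jobs submitted by this thread into a named pool
# ("production" is just an example pool name).
sc.setLocalProperty("spark.scheduler.pool", "production")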

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

The Python program shown below counts the number of words in text data received from a data server listening on a TCP socket.
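
A sketch of such a program, closely following Spark's bundled network_wordcount.py example, is as follows:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two worker threads and a batch interval of 1 second.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that connects to localhost:9999.
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words and count each word in each batch.
words = lines.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream.
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for it to terminate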

Sample input entered for this program at a terminal through Netcat, along with the corresponding output of the program, is given below.

 
     
# TERMINAL 1: running Netcat

$ nc -lk 9999

hello world
...

# TERMINAL 2: running network_wordcount.py

$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
...
-------------------------------------------
Time: 2014-10-14 15:25:21
-------------------------------------------
(hello,1)
(world,1)
...

         

Conclusion

Since its 1.0 release in May 2014, Apache Spark has become the go-to framework for companies that work with large-scale Big Data applications. The speed and agility of Spark have made it incredibly useful across a wide range of industries.

From FMCG giants to BFSI companies to digital advertising firms – Apache Spark has proved to be indispensable when it comes to aggregating data, gleaning insights and forecasting industry trends.

This is the fifth blog in our series helping you understand all about cloud, especially when you are in a dilemma over whether to choose Azure, AWS, or both.

Before we jump into the actual comparison chart of Azure and AWS, we would like to bring you some basics on data analytics and the current trends in the subject.

If you would rather have a quick look at the comparison table, click here.

This blog is intended to help you strategize your data analytics initiatives so that you can make the most informed decision possible by analyzing all the data you need in real time. We will also help you draw comparisons between Azure and AWS, the two leaders in cloud, and their capabilities in Big Data and Analytics as published in a handout released by Microsoft.

Beyond doubt, this is the era of data. Every touchpoint of your business generates volumes of data, and this data cannot simply be whisked away or cast aside, as valuable business insights can be unearthed from it with a little effort. Here’s where your data analytics infrastructure helps.

A 2017 Planning Guide for Data and Analytics, published by Gartner and written by analyst John Hagerty, maps out where the discipline is heading.

The key findings of the report are as follows:

  • Data and analytics must drive modern business operations, not just reflect them. Technical professionals must holistically manage an end-to-end data and analytics architecture to acquire, organize, analyze and deliver insights to support that goal.
  • Analytics are now infused in places where they never existed before.
  • Executives will seek strategies to better manage and monetize data for internal and external business ecosystems.
  • Data gravity is rapidly shifting to the cloud, with IoT, data providers and cloud-native applications leading the way. It is no longer a question of "if" for using cloud for data and analytics; it's "how."

The last point emphasizes how prominent a role the cloud now plays in data analytics, and if you are wondering who leads there, Gartner in its latest Magic Quadrant names AWS and Azure as the top leaders. Now, if you are in doubt whether to go the Azure way, the AWS way, or both, here is a comparison table showing their respective Big Data and Analytics capabilities.

 

Service | Description | AWS | Azure
Elastic data warehouse | A fully managed data warehouse that analyzes data using business intelligence tools. | Redshift | SQL Data Warehouse
Big data processing | Supports technologies that break up large data processing tasks into multiple jobs, and then combine the results to enable massive parallelism. | Elastic MapReduce (EMR) | HDInsight
Data orchestration | Processes and moves data between different compute and storage services, as well as on-premises data sources, at specified intervals. | Data Pipeline | Data Factory
Data orchestration | Cloud-based ETL/data integration service that orchestrates and automates the movement and transformation of data from various sources. | AWS Glue Data Catalog | Data Factory + Data Catalog
Analytics | Storage and analysis platforms that create insights from massive quantities of data, or data that originates from many sources. | Kinesis Analytics | Stream Analytics, Data Lake Analytics, Data Lake Store
Streaming data | Allow mass ingestion of small data inputs, typically from devices and sensors, to process and route data. | Kinesis Streams, Kinesis Firehose | Event Hubs, Event Hubs Capture
Visualization | Perform ad-hoc analysis and develop business insights from data. | QuickSight (Preview) | Power BI
Visualization | Allows visualization and data analysis tools to be embedded in applications. |  | Power BI Embedded
Search | A scalable search server based on Apache Lucene. | Elasticsearch Service | Marketplace—Elasticsearch
Search | Delivers full-text search and related search analytics and capabilities. | CloudSearch | Search
Machine learning | Produces an end-to-end workflow to create, process, refine, and publish predictive models from complex data sets. | Machine Learning | Machine Learning
Data discovery | Provides the ability to better register, enrich, discover, understand, and consume data sources. |  | Data Catalog
Data discovery | A serverless interactive query service that uses standard SQL for analyzing databases. | Amazon Athena | Data Lake Analytics

Click here to read the entire guide published by the Microsoft Azure team.

This is the fourth blog in our series intended to help you embark on a cloud strategy, especially when you are in a dilemma over whether to choose AWS or Azure, the two prominent cloud players today.

If you missed our earlier blogs, click here:

  • 1st Blog - Compute
  • 2nd Blog - Storage
  • 3rd Blog - CDN & Networking

Before we jump into the actual comparison chart of Azure and AWS, we would like to bring you some basics on the database aspect of cloud strategy.

If you would rather have a quick look at the database comparison table, click here.

Through this blog, let’s understand the database aspect of your cloud strategy. As per the guide, database services refer to options for storing data, whether it’s a managed relational SQL database that’s globally distributed or a multi-model NoSQL database designed for any scale.

When you move to the cloud, one of the critical decisions you face is which database to use: SQL or NoSQL. Though SQL has an impressive track record, NoSQL is not far behind, as it is gradually making notable gains and has many proponents. Once you have picked your database, the other big decision is which cloud vendor to choose among the many on offer.

Here’s where Gartner’s prediction comes in; the research company published a document that states:

“Public cloud services, such as Amazon Web Services (AWS), Microsoft Azure and IBM Cloud, are innovation juggernauts that offer highly operating-cost-competitive alternatives to traditional, on-premises hosting environments.

Cloud databases are now essential for emerging digital business use cases, next-generation applications and initiatives such as IoT. Gartner recommends that enterprises make cloud databases the preferred deployment model for all new business processes, workloads, and applications. As such, architects and tech professionals should start building a cloud-first data strategy now, if they haven't done so already”

Reinforcing the trend, Gartner recently published a new Magic Quadrant for infrastructure-as-a-service (IaaS) that – surprising nobody – has Amazon Web Services and Microsoft alone in the leaders’ quadrant, with the remaining vendors well outside it.

 

Now, the question really is, Azure or AWS for your cloud data? Or should it be both? Here’s a quick comparison table to guide you.

Service | Description | AWS | Azure
Relational database | SQL Database is a high-performance, reliable, and secure database you can use to build data-driven applications and websites, without needing to manage infrastructure. | RDS | SQL Database, including Postgres and MySQL
NoSQL—document storage | A globally distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar. | DynamoDB | Cosmos DB
NoSQL—key/value storage | A non-relational data store for semi-structured data. | DynamoDB and SimpleDB | Table Storage
Caching | An in-memory, distributed caching service that provides a high-performance store typically used to offload non-transactional work from a database. | ElastiCache | Redis Cache
Database migration | Focuses on migration of database schema and data from one database format to a specific database technology in the cloud. | Database Migration Service (Preview) | SQL Database Migration Wizard

Click here to read the entire guide published by the Microsoft Azure team.

