Automation Testing helps complete the entire software testing life cycle (STLC) in less time and improves the efficiency of the testing process.

Test Automation enables teams to verify functionality, test for regression, and run simultaneous tests efficiently. In this article we will take a detailed look at the automation testing tools available, and the standards and best practices to be followed during test automation.

Following the best practices for the Software Testing Life Cycle (unit testing, integration testing & system testing) ensures that the client gets the software as intended, without any bugs. End-to-end testing is the methodology used to test whether the flow of an application is performing as designed from start to finish. Carrying out end-to-end tests helps identify system dependencies and ensures that the right information flows between the various system components.

Ultimately Automation Testing increases the speed of test execution and the test coverage.

When to Choose Automation Testing
  • There is a lot of regression work
  • The GUI is stable, but there are frequent functional changes
  • Requirements do not change frequently
  • Load and performance testing with many virtual users
  • Repetitive test cases that lend themselves well to automation and save time
  • Large projects
  • Projects that need to test the same areas repeatedly

Steps to Implement Automation Testing
  • Identify areas within software to automate
  • Choose the appropriate tool for test automation
  • Write test scripts
  • Develop test suites
  • Execute test scripts
  • Build result reports
  • Find possible bugs or performance issues
Choosing your Automation Testing Tool

The test automation strategy should clearly define when to opt for automation, its scope, and the selection of the right kind of tools for execution. When it comes to tools, the top ones to go for are:

  • Cypress
  • Selenium
  • Protractor
  • Appium (Mobile)
Why Cypress?

Cypress is a JavaScript-based testing framework built for the modern web. Cypress helps to create end-to-end tests, integration tests and unit tests. Cypress takes a different approach compared to other testing frameworks, since it’s executed in the same run loop as the application. It also leverages a Node.js server to handle any task that needs to happen outside of the browser. With its ability to understand everything happening inside and outside of the browser, it produces more consistent results.

Key Features of Cypress
  • Automatic Waiting – No need for adding wait and sleep.
  • Spies, Stubs, and Clocks – Verify and control the behaviour of functions, server responses, or timers.
  • Network traffic control and monitoring – Easily control, stub, and test edge cases without involving your server. You can stub network traffic however you like.
  • Consistent Results – The Cypress architecture doesn’t use Selenium or WebDriver. It is fast and delivers consistent, reliable, flake-free tests.
  • Screenshots and Videos – View screenshots taken automatically on failure, or videos of your entire test suite when run from the CLI.
Azure CI/CD Setup with Cypress

Cypress runs on most of the popular CI providers, including the following:

  • Azure DevOps / VSTS CI / TeamFoundation
  • BitBucket
  • CircleCI
  • Docker
  • GitLab
  • Jenkins
  • TravisCI

Azure DevOps – Steps to Integrate Cypress Automation Tests
  • Pre-Build Testing
    • Install the Node module and run the application in test mode
    • Run the tests
    • Publish the test results
  • Cypress Containerization
    • Build the Docker container of Cypress
    • Push the image to the container registry
    • Publish the build

Before we get started, here are the basic Cypress commands used in the pipeline.

Clean up the old results:
$ rm -rf cypress/reports/

Run the Cypress application with the required spec file:
$ cypress run --spec "cypress/integration/**/*.spec.ts"   # mention your spec file

Configure the mocha reports path for publishing test results:
--reporter junit --reporter-options 'mochaFile=cypress/reports/test-output-[hash].xml,toConsole=true'

Uninstall the Cypress packages:
$ npm uninstall cypress-multi-reporters; npm uninstall cypress-promise; npm uninstall cypress

Pre-Build Testing

It is critical to test the application before the Build, Deployment or Release. Essentially the process involves regression and smoke testing. And don’t forget the sanity checks before the build is deployed in the staging environment.

Cypress comes in handy for testing Angular / JavaScript applications before they are deployed to a staging or production environment.

Install the Node module and run the application in test mode

Install the required node modules of the application, then run the application in test mode.

$ npm install --save-dev start-server-and-test

$ start-server-and-test start http://localhost:4200

Publish the test results

The results of the Cypress test execution are stored in the specified path and are added to the Azure DevOps test results. Cypress supports the JUnit, Mocha, and Mochawesome test result reporter formats, and provides options to create customised test results and merge all the test results as well.

Cypress Containerization

Cypress supports Docker containerization, which makes it easy to set up in a cluster environment like AKS. The Cypress base images are available at the link below.

https://github.com/cypress-io/cypress-docker-images

Copy the package.json and UI source code to the app folder and run the Cypress test. The following pipeline steps are used to run the Docker container and execute the tests.

    - script: |
        # cypressBaseImage, cypressName and cypressImageTag are placeholders for your own image and names
        docker run -d -it --name cypressName cypressBaseImage bash
        docker commit -p cypressName cypressName:cypressImageTag
        docker stop cypressName
        docker rm -f cypressName
      displayName: Build Cypress container

    - script: docker tag cypressName:cypressImageTag $(azureContainerRegistry)/acrImageName:$(Build.BuildId)
      displayName: Tag Cypress image

    - task: Docker@1
      displayName: Push image To Registry
      inputs:
        command: push
        azureSubscriptionEndpoint: azureSubscriptionEndpoint
        azureContainerRegistry: $(azureContainerRegistry)
        imageName: acrImageName:BuildId

    - script: sudo rm -rf /test-results/*
      displayName: Removing Previous Results

    - task: ShellScript@2
      displayName: 'Bash Script - cypress base image post-deployment'
      inputs:
        scriptPath: ./cypress-deployment.sh
        args: $(azureRegistry) $(cypressImageName) $(azureContainerValue) $(CYPRESS_OPTIONS)
      continueOnError: true

    - task: PublishTestResults@1
      displayName: 'Publish Test Results ./test-results-*.xml'
      inputs:
        # (inputs were not included in the original snippet)

    # cypress-base-image-post-deployment.sh
    docker run -v $systemSourceDirectory:/app/cypress/reports --name vca-arp-ui \
      $cypress_Latestimage npx cypress run $cypressOptions

Now the container should be set up on your local machine and start running your specs.

Cypress is simple and integrates easily with your CI environment. Apart from its browser support, Cypress reduces the effort of manual testing and is relatively fast when compared to other automation testing tools.

How to Import a BACPAC File Created from an Azure SQL Database?

When you need to export a database for archiving or for moving to another platform, you can export the database schema and data to a BACPAC file. A BACPAC file is a ZIP file with an extension of BACPAC containing the metadata and data from a SQL Server database. A BACPAC file can be stored in Azure Blob storage or in local storage in an on-premises location and later imported back into Azure SQL Database or into a SQL Server on-premises installation.

Import a BACPAC File to an On-Premises SQL Server:

  • C:\Program Files (x86)\Microsoft SQL Server\140\DAC\bin>
  • SqlPackage.exe /a:import /sf:\\Userdb0.bacpac /tsn:SERVER-SQL\DEV2016 /tdn:Azure_Test /p:CommandTimeout=2400

Error:

When you try to import a BACPAC file created from an Azure environment, you might encounter the following error if it contains an external data source reference.

TITLE: Microsoft SQL Server Management Studio

    Could not import package.
    Warning SQL72012: The object [AzureProd] exists in the target,
    but it will not be dropped even though you selected the
    'Generate drop statements for objects that are in the target database but that
    are not in the source' check box.
    Warning SQL72012: The object [AzureProd_Log] exists in the target,
    but it will not be dropped even though you selected the
    'Generate drop statements for objects that are in the target database but that
    are not in the source' check box.
    Error SQL72014: .Net SqlClient Data Provider: Msg 102, Level 15, State 1,
    Line 1 Incorrect syntax near 'EXTERNAL'.
    Error SQL72045: Script execution error. The executed script:
    CREATE EXTERNAL DATA SOURCE [DB_EXT_EDS]
    WITH (
        TYPE = RDBMS,
        LOCATION = N'sqlserver.database.windows.net',
        DATABASE_NAME = N'AdventureWorks',
        CREDENTIAL = [DB_EXT_CRED]
    );


Solution:

Drop the external tables and external data sources in the Azure SQL Database and create the BACPAC file again without those references.

Drop External Tables and External Data Source

            
        IF EXISTS (SELECT 'x' FROM sys.external_tables)
        BEGIN
            DROP EXTERNAL TABLE EXT_Table1
            DROP EXTERNAL TABLE EXT_Table2
            DROP EXTERNAL TABLE EXT_Table3
        END

        IF EXISTS (
            SELECT * FROM sys.external_data_sources
            WHERE name = 'DB_EXT_EDS'
        )
        BEGIN
            DROP EXTERNAL DATA SOURCE DB_EXT_EDS;
        END
            
        

If you can’t recreate the BACPAC without dropping the tables, you can follow these steps.

  1. Change the file extension to zip, then decompress it into a folder. Surprisingly, a bacpac is actually just a zip file, not something proprietary and hard to get into.
  2. Find the model.xml file and edit it to remove the section that looks like this:
<Element Type="SqlExternalDataSource" Name="[BoxDataSrc]">
    <Property Name="DataSourceType" Value="1" />
    <Property Name="Location" Value="MYAZUREServer.database.windows.net" />
    <Property Name="DatabaseName" Value="MyAzureDb" />
    <Relationship Name="Credential">
        <Entry>
            <References Name="[SQL_Credential]" />
        </Entry>
    </Relationship>
</Element>

If you have multiple external data sources of this type, you will probably need to repeat step 2 for each one.

Save and close model.xml.

Now you need to re-generate the checksum for model.xml so that the bacpac doesn’t think it was tampered with (since you just tampered with it). Create a PowerShell file named computeHash.ps1 and put this code into it.

Generate Checksum

             
            $modelXmlPath = Read-Host "model.xml file path"
            $hasher = [System.Security.Cryptography.HashAlgorithm]::Create("System.Security.Cryptography.SHA256CryptoServiceProvider")
            $fileStream = New-Object System.IO.FileStream -ArgumentList @($modelXmlPath, [System.IO.FileMode]::Open)
            $hash = $hasher.ComputeHash($fileStream)
            $hashString = ""
            Foreach ($b in $hash) { $hashString += $b.ToString("X2") }
            $fileStream.Close()
            $hashString
            
     

Run the PowerShell script and give it the filepath to your unzipped and edited model.xml file. It will return a checksum value.

Copy the checksum value, then open up Origin.xml and replace the existing checksum, toward the bottom on the line that looks like this:

<Checksum Uri="/model.xml">9EA0F06B282G4F42955C78A98822A31AA0ED0225CB131B8759379055A482D01G</Checksum>

Save and close Origin.xml, then select all the files and put them into a new zip file and rename the extension to bacpac.

Now you can use this new bacpac to import the database without getting the error.

Deadlocks in Azure SQL Database

Recently we were working with Azure Logic Apps to invoke Azure Functions.
By default, the Logic App runs parallel threads; we didn’t explicitly control the concurrency and left the default values in place.

So the Logic App invoked several concurrent threads, which in turn invoked several Azure Functions.
The problem was that the Azure Functions made database calls, which caused deadlocks. In an ideal world, the database should be able to handle numerous concurrent calls without deadlocks. However, our process works on a high percentage of shared data and we wanted to ensure consistency, so we had explicit transactions in our stored procedure calls. That was the root cause of the problem, and we didn’t want to remove the explicit transactions.

The solution we implemented to alleviate this problem was to run the process in sequence instead of in parallel threads.

Logic App Concurrency Control Behavior

“For each” loops execute in parallel by default. You can customize the degree of parallelism, or set it to 1 to execute in sequence.

Logic_App_Concurrency

Deadlock Exception

Transaction (Process ID 166) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

Deadlock_Exception
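
The exception message itself points to the immediate, tactical mitigation: rerun the transaction when it is chosen as the deadlock victim. As a minimal illustration only (this is not part of the solution described above; the helper name, connection string, SQL text and retry counts are placeholders), a caller written in Python with pyodbc could retry on SQL Server's deadlock error 1205 along these lines:

import time
import pyodbc

def execute_with_deadlock_retry(conn_str, sql, retries=3):
    """Run a statement and rerun it if it is chosen as a deadlock victim (error 1205)."""
    for attempt in range(1, retries + 1):
        try:
            with pyodbc.connect(conn_str) as conn:
                conn.execute(sql)   # runs inside the connection's implicit transaction
                conn.commit()
                return
        except pyodbc.Error as ex:
            # SQL Server reports deadlock victims with error number 1205
            if "1205" in str(ex) and attempt < retries:
                time.sleep(attempt)  # simple backoff before rerunning the transaction
                continue
            raise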

Troubleshooting Deadlocks

We identified through our Application Insights that a deadlock happened in the database. The next logical question is: what caused this deadlock?

Azure SQL Server Deadlock Count

These queries identify the deadlock event time as well as the deadlock event details.

                
        SELECT * FROM sys.event_log
        WHERE event_type = 'deadlock';

        WITH CTE AS (
            SELECT CAST(event_data AS XML) AS [target_data_XML]
            FROM sys.fn_xe_telemetry_blob_target_read_file('dl', null, null, null)
        )
        SELECT
            target_data_XML.value('(/event/@timestamp)[1]', 'DateTime2') AS Timestamp,
            target_data_XML.query('/event/data[@name=''xml_report'']/value/deadlock') AS deadlock_xml,
            target_data_XML.query('/event/data[@name=''database_name'']/value').value('(/value)[1]', 'nvarchar(100)') AS db_name
        FROM CTE
                
        

You can save the deadlock XML as an .xdl file to view the deadlock diagram. This provides all the information we need to identify the root cause of the deadlock and take the necessary steps to resolve the issue.


Grafana is an open-source, general purpose dashboard and graph composer, which runs as a web application.

You can monitor Azure services and applications from Grafana using the Azure Monitor data source plugin. The plugin gathers application performance data collected by the Application Insights SDK as well as infrastructure data provided by Azure Monitor. You can then display this data on your Grafana dashboard.

Grafana uses an Azure Active Directory service principal to connect to Azure Monitor APIs and collect metrics data. You must create a service principal to manage access to your Azure resources.

Why Grafana?

Grafana provides more visualization options than the Azure Portal. It also supports multiple data sources. One can combine data from multiple sources in a single dashboard. Grafana is designed for analyzing and visualizing metrics such as system CPU, memory, disk and I/O utilization. Users can create comprehensive charts with smart axis formats (such as lines and points) as a result of Grafana’s fast, client-side rendering — even over long ranges of time.

Grafana dashboards are what made Grafana such a popular visualization tool. Visualizations in Grafana are called panels, and users can create a dashboard containing panels for different data sources. Grafana supports graph, singlestat, table, heatmap and freetext panel types. Grafana users can make use of a large ecosystem of ready-made dashboards for different data types and sources. Grafana has no time series storage support. Grafana is only a visualization solution. Time series storage is not part of its core functionality.

Some of the features of Grafana are as follows

  • Optimized for Time series
  • Can pull data from Azure Metrics, Log Analytics and Application Insights
  • Azure Data Explorer (formerly known as Kusto) plugin also released.
  • Rich ecosystem of plugins for data sources and dashboards.
  • Open Source, easy to onboard using Docker, Azure App Service etc.
Azure-app-service

Some of the requirements of Grafana are described below.

  • Azure SPN (Service Principal Name) with reader access to subscription
  • Deploy in Azure web apps.
  • Data source plugin “grafana-azure-monitor-datasource”
  • Supports AD integration via LDAP
  • Easy to export/import and templatize
  • Very DevOps friendly
  • Huge collection of panels https://play.grafana.org
Grafana

Azure Monitor Data Source For Grafana

Azure Monitor is the platform service that provides a single source for monitoring Azure resources. The Azure Monitor Data Source plugin supports Azure Monitor, Azure Log Analytics and Application Insights metrics in Grafana.

Features

  • Support for all the Azure Monitor metrics
    • includes support for the latest API version that allows multi-dimensional filtering for the Storage and SQL metrics.
    • Automatic time grain mode which will group the metrics by the most appropriate time grain value
  • Application Insights metrics
    • Write raw log analytics queries, and select x-axis, y-axis, and grouped values manually.
    • Automatic time grain support
  • Support for Log Analytics (both for Azure Monitor and Application Insights)
  • You can combine metrics from both services in the same graph.
Grafana-graph

Azure Monitor for VMs provides an in-depth view of VM health, performance trends, and dependencies. Azure Monitor for VMs includes a set of performance charts that target several key performance indicators (KPIs) to help you determine how well a virtual machine is performing. Azure Monitor for VMs is focused on the operating system as manifested through the processor, memory, network adapters, and disks.

Azure Dashboards

Azure dashboards allow you to combine different kinds of data, including both metrics and logs, into a single pane in the Azure portal. You can optionally share the dashboard with other Azure users. Elements throughout Azure Monitor can be added to an Azure dashboard in addition to the output of any log query or metrics chart. Azure Monitor is the single source for monitoring Azure resources; it is Azure’s time series database for all Azure metrics.

Some of the important aspects of Azure Dashboard

  • No setup required, already available within the Azure Portal.
  • Zoom in/zoom out for metrics is not available.
  • All data comes from Azure resources.
  • Log Analytics/AI queries cannot be parameterized based on dashboard selection.
  • Query results can be pinned to dashboards.
  • Good panels are tied to specific products and can’t be customized.
    • E.g., percentile panels are only available in “Container Insights” and VM Insights.
    • These panels cannot be used against a “Log Analytics” source or Metric source.

Some of the features of Azure Dashboard are as follows

  • Supports visualizing most Azure resources
  • OOB Integrated with Azure RBAC
  • Supports Log Analytics, App Insights and Metrics
  • No Auto refresh per panel
  • No Zoom in Zoom out.
  • Dashboard queries don’t support variables

Azure Dashboards (VM insights/ Container Insights)

  • These tiles can only be accessed by navigating to the VM resource.
  • They cannot be pinned as is, but the detailed version of this can be pinned.
  • No zoom in zoom out capability.

Azure Dashboards – Metrics

  • These are pinnable
  • Don’t support percentiles
  • No drill ability
  • Each Panel is hard coded to a specific data source even if they might be the same behind the scenes.

Comparison between Grafana and Azure Dashboard is shown below.

Azure Dashboard
  • Multiple Azure resource types
  • Limited configuration options. Requires JSON editing
  • Application Insights → OOB Azure Dashboard
  • Only static queries
  • No setup required
  • Not intuitive for overlaying.
Grafana
  • Mostly Time Series
  • Highly configurable
  • Global variables as filters
  • Dashboard and individual panel refresh.
  • Supports query macros
  • Setup required (minimal)
  • Intuitive overlays

1. Security and Compliance:

If you are wondering why we are starting with security, then check out this number: $6 trillion. That’s the amount of annual damage cybercrime is predicted to cost us by 2021.

That is precisely why the first thing you need to check while picking your cloud service provider is their security and compliance levels, both physical and virtual. This includes the geographical location of their data centers and the local laws of the country they are based in.

There are a number of certifications and standards which guarantee the security preparedness of cloud vendors; their validity must be checked and additional investigations must be carried out by checking internal and third-party audits or reports.

You need to do a deep check of:

  • Security infrastructure and procedures followed by the vendor
  • Identity management and authorizations
  • Physical security controls including the process for natural disasters
  • Policies for data back-up and disaster recovery

2. Technical Capabilities

An obvious point, but it still needs to be reiterated.

Your service provider should have a full stack of technologies that support your current applications and should also have the capability to match your future needs.

Cloud partnerships last a long time, and it's important to check the future roadmap of the service provider to understand if they have the mindset to catch trends early and innovate.

Some questions to focus on:

  • Will your current software and applications integrate easily with the service provider’s cloud infrastructure?
  • Do they use standard interfaces and APIs for easy integration?
  • Do they have the capability of providing hybrid cloud computing options and do they have the flexibility to host different cloud environments and systems?
  • Are they backing their capabilities with SLAs?
  • Are they willing and capable to architect solutions tailored to your business?

3. Costs

No two cloud service providers have similar or comparable pricing packages. They each have their own formula of computing cloud costs, and it is almost impossible to make a side-by-side comparison of different vendors. What you need to do is map out your organization’s requirement as minutely as possible and then decide which pricing model suits your needs.

Keep in mind:

  • Consumption timelines as long-term contracts are better priced
  • The flexibility offered by service providers in scaling up or down
  • Check for hidden costs

4. Business Health

The stability of your business depends on the stability of its partners, and you cannot underestimate the importance of a cloud partner. Before finalizing your cloud vendor, it is important to check their business and financial health.

You should check:

  • The company’s financial records
  • Management structure and other third-party relationships
  • Reputation, reviews, and referrals from existing customers
  • For any legal run-ins
  • All available third-party audits

5. Support

Do you just have phone or chat access, or does your service provider offer dedicated account management? How much support you can get from your vendor is another important criterion that must be considered before finalizing a service provider.

Find out about:

  • Time guarantees for solving technical issues
  • Access to support services – 24x7 or 12x5
  • Cost of opting for dedicated resources

Deciding on a cloud service provider is a long process that demands complete thoroughness and analysis from the CIO and the rest of the team.

Before we leave you to navigate your way to your future cloud partner, here are two more important points that must be considered – Right size and exit strategy.

Keep in mind that to get the best service you need to find a vendor who connects with you and for whom you are a valuable client.

And always plan an exit strategy in case things don’t work out.

Best of Luck!

When you are deploying a new change into production, the associated deployment should happen in a predictable manner. In simple terms, this means no disruption and zero downtime! In case you do encounter a problem or a bottleneck, the deployment strategy should include a quick rollback.

This safe strategy can be achieved by working with two identical infrastructures: the “green” environment hosting the current production and the “blue” environment with the new changes.

The business and IT teams will have an opportunity to conduct sanity, smoke test or any other test in the “blue” environment before making a “Go” decision. Upon “Go”, the team can switch “blue” to “green” and “green” to “blue”.

In Azure, different processes are available for implementing the Blue-Green strategy with two environments.

We have listed below some of these techniques. Naturally, this list is not fixed and will grow continuously as new tool sets and services emerge.

  • Deployment slots - For Web Apps, deployment slots provide an easy way to implement Blue-Green deployments.
  • Azure Traffic Manager – This can be leveraged to realize Blue-Green deployments with the weighted round-robin routing method for smoother deployments. The detailed configuration and implementation methods are available in the Azure documentation.
  • Using an Application Gateway with two backend pools and a routing rule - Have two backend resource pools with one as a stage pool and another one as a prod pool. Add stage VMSS to stage pool, prod VMSS to prod pool and have one routing rule in the app gateway. Depending on the need to use stage or prod VMSS, this rule will be changed to point to the appropriate backend address pool.

CloudIQ architects and engineers have implemented Blue-Green deployment for multiple clients, and in each case, we have customized our strategies to suit their use-cases. If you are looking for a completely safe way to deploy new software versions and applications, then reach out to us at sales@cloudiq.io

Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.

Spark provides distributed task transmission, scheduling, and I/O functionality. It provides programmers with a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied.

How Apache Spark works

Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores.

The Spark Core engine uses the resilient distributed data set, or RDD, as its primary data type. The RDD is designed in such a way to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn't have to define where specific files are sent or what computational resources are used to store or retrieve files.

Given below is a sample Spark program written in Python to count the number of records with each rating in the input file:

from pyspark import SparkConf, SparkContext
import collections

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)

lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
ratings = lines.map(lambda x: x.split()[2])
result = ratings.countByValue()

sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))

In the above code, sc is the SparkContext associated with the input file u.data. ratings is an RDD created by mapping the 3rd column in the input file (array index [2] – ratings). Here map() is a transformation function which produces a new RDD.

We can have multiple transformations in a single spark program each producing a new RDD from an existing RDD or an input file. countByValue() is an Action that is performed.
In Spark, the transformations are not executed until an Action is triggered. This is called Lazy Evaluation.

Figure 1: How Apache Spark works

Spark languages

Spark was written in Scala, which is considered the primary language for interacting with the Spark Core engine. Out of the box, Spark also comes with API connectors for using Java, R, and Python.

Spark libraries

  • The Spark Core engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data.
  • Spark SQL -- One of the most commonly used libraries, Spark SQL enables users to query data stored in disparate applications using the common SQL language.
  • Spark Streaming -- This library allows users to build applications that analyze and present data in real time.
  • MLlib -- A library of machine learning code that enables users to apply advanced statistical operations to data in their Spark cluster and to build applications around these analyses.
  • GraphX -- A built-in library of algorithms for graph-parallel computation.

RDDs, DataFrames, and Datasets

An RDD is an immutable distributed collection of elements of data, partitioned across nodes in a cluster that can be operated in parallel with a low-level API that offers transformations and actions.

Like an RDD, a DataFrame is an immutable distributed collection of data. However, unlike an RDD, data is organized into named columns, like a table in a relational database.

Datasets in Apache Spark are an extension of DataFrame API which provides type-safe, object-oriented programming interface.

Executing SQL-style functions on a Dataframe

Given below is a map-reduce program to get the list of popular movies (i.e., movies that have been rated by many customers), using the same input data as in Figure 1 above.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("PopularMovies")
sc = SparkContext(conf = conf)

lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
movies = lines.map(lambda x: (int(x.split()[1]), 1))
movieCounts = movies.reduceByKey(lambda x, y: x + y)

flipped = movieCounts.map(lambda xy: (xy[1], xy[0]))
sortedMovies = flipped.sortByKey()

results = sortedMovies.collect()

for result in results:
    print(result)

The same program, when written using DataFrames, will look like this

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions

def loadMovieNames():
    movieNames = {}
    with open("ml-100k/u.ITEM") as f:
        for line in f:
            fields = line.split('|')
            movieNames[int(fields[0])] = fields[1]
    return movieNames

# Create a SparkSession (the config bit is only for Windows!)
spark = SparkSession.builder.config("spark.sql.warehouse.dir",
    "file:///C:/temp").appName("PopularMovies").getOrCreate()

# Load up our movie ID -> name dictionary
nameDict = loadMovieNames()

# Get the raw data
lines = spark.sparkContext.textFile("file:///SparkCourse/ml-100k/u.data")
# Convert it to a RDD of Row objects
movies = lines.map(lambda x: Row(movieID=int(x.split()[1])))
# Convert that to a DataFrame
movieDataset = spark.createDataFrame(movies)

# Some SQL-style magic to sort all movies by popularity in one line!
topMovieIDs = movieDataset.groupBy("movieID").count().orderBy("count", ascending=False).cache()

# Show the results at this point:
# |movieID|count|
# +-------+-----+
# |     50|  584|
# |    258|  509|
# |    100|  508|
topMovieIDs.show()

# Grab the top 10
top10 = topMovieIDs.take(10)

# Print the results
print("\n")
for result in top10:
    # Each row has movieID, count as above.
    print("%s: %d" % (nameDict[result[0]], result[1]))

# Stop the session
spark.stop()

As you can see, DataFrames give us the flexibility to use SQL-style functions to get the required results. Because the DataFrame API is built on top of the Spark SQL engine, it uses Catalyst to generate optimized logical and physical query plans.

Job Scheduling

Spark has several facilities for scheduling resources between computations.

  • Each Spark application (instance of SparkContext) runs an independent set of executor processes. The cluster managers that Spark runs on provide facilities for scheduling across applications.
  • Within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads. This is common if the application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext, as sketched below.
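
As a minimal sketch of this last point, a Spark application can opt into the fair scheduler through its configuration and, optionally, assign the jobs submitted from a given thread to a named pool (the pool name below is just an example):

from pyspark import SparkConf, SparkContext

# Enable fair scheduling so that jobs submitted from different threads share
# resources, instead of the default FIFO behaviour.
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("FairSchedulingDemo")
        .set("spark.scheduler.mode", "FAIR"))
sc = SparkContext(conf=conf)

# Jobs submitted from this thread are grouped into the "reports" pool (example name).
sc.setLocalProperty("spark.scheduler.pool", "reports")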

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.

Spark Streaming

The Python program shown below counts the number of words in text data received from a data server listening on a TCP socket.
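
A minimal version of this program, closely following the network_wordcount.py example that ships with Spark (referenced in the spark-submit command below), would look roughly like this:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that connects to the data server, e.g. localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words and count each word within each batch
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate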

Sample input entered for this program at a terminal through Netcat, and the corresponding output of the program, are given below.

 
     
# TERMINAL 1:
# Running Netcat

$ nc -lk 9999

hello world
...

         
 
     
# TERMINAL 2: RUNNING network_wordcount.py
        
$ ./bin/spark-submit examples/src/main/python/streaming/network_wordcount.py localhost 9999
...
-------------------------------------------
Time: 2014-10-14 15:25:21
-------------------------------------------
(hello,1)
(world,1)
...

         

Conclusion

Launched for the first time in May 2014, Apache Spark has become the go-to program for companies that work with large-scale Big Data applications. The speed and agility of Spark have made it incredibly useful across a wide range of industries.

From FMCG giants to BFSI companies to digital advertising firms – Apache Spark has proved to be indispensable when it comes to aggregating data, gleaning insights and forecasting industry trends.

This is the fifth blog in our series helping you understand all about cloud, when you are in a dilemma to choose Azure or AWS or both, if needed.

Before we jumpstart on the actual comparison chart of Azure and AWS, we would like to bring you some basics on data analytics and the current trends on the subject.

If you would rather have a quick look at the comparison table, click here.

This blog is intended to help you strategize your data analytics initiatives so that you can make the most informed decision possible by analyzing all the data you need in real time. Furthermore, we also will help you draw comparisons between Azure and AWS, the two leaders in cloud, and their capabilities in Big Data and Analytics as published in a handout released by Microsoft.

Beyond doubt, this is an era of data. Every touch point of your business generates volumes of data, and this data cannot simply be whisked away or cast aside, as valuable business insights can be unearthed from it with a little effort. Here’s where your data analytics infrastructure helps.

A 2017 Planning Guide for Data and Analytics published by Gartner, written by the analyst John Hagerty, lists the following key findings:

  • Data and analytics must drive modern business operations, not just reflect them. Technical professionals must holistically manage an end-to-end data and analytics architecture to acquire, organize, analyze and deliver insights to support that goal.
  • Analytics are now infused in places where they never existed before.
  • Executives will seek strategies to better manage and monetize data for internal and external business ecosystems.
  • Data gravity is rapidly shifting to the cloud, with IoT, data providers and cloud-native applications leading the way. It is no longer a question of "if" for using cloud for data and analytics; it's "how."

The last point emphasizes how cloud is playing a prominent role when it comes to data analytics, and if you have thoughts on who and how, Gartner in its latest magic quadrant has named AWS and Azure as the top leaders. Now, if you are in doubt whether to go the Azure way or the AWS way, or both, here’s the comparison table showing their respective Big Data and Analytics capabilities.

 

  • Elastic data warehouse – A fully managed data warehouse that analyzes data using business intelligence tools. AWS: Redshift. Azure: SQL Data Warehouse.
  • Big data processing – Supports technologies that break up large data processing tasks into multiple jobs, and then combine the results to enable massive parallelism. AWS: Elastic MapReduce (EMR). Azure: HDInsight.
  • Data orchestration – Processes and moves data between different compute and storage services, as well as on-premises data sources, at specified intervals. AWS: Data Pipeline. Azure: Data Factory.
  • Data orchestration (ETL) – Cloud-based ETL/data integration service that orchestrates and automates the movement and transformation of data from various sources. AWS: AWS Glue Data Catalog. Azure: Data Factory + Data Catalog.
  • Analytics – Storage and analysis platforms that create insights from massive quantities of data, or data that originates from many sources. AWS: Kinesis Analytics. Azure: Stream Analytics, Data Lake Analytics, Data Lake Store.
  • Streaming data – Allows mass ingestion of small data inputs, typically from devices and sensors, to process and route data. AWS: Kinesis Streams, Kinesis Firehose. Azure: Event Hubs, Event Hubs Capture.
  • Visualization – Perform ad-hoc analysis and develop business insights from data. AWS: QuickSight (Preview). Azure: Power BI.
  • Visualization (embedded) – Allows visualization and data analysis tools to be embedded in applications. Azure: Power BI Embedded.
  • Search – A scalable search server based on Apache Lucene. AWS: Elasticsearch Service. Azure: Marketplace—Elasticsearch.
  • Search (managed) – Delivers full-text search and related search analytics and capabilities. AWS: CloudSearch. Azure: Search.
  • Machine learning – Produces an end-to-end workflow to create, process, refine, and publish predictive models from complex data sets. AWS: Machine Learning. Azure: Machine Learning.
  • Data discovery – Provides the ability to better register, enrich, discover, understand, and consume data sources. Azure: Data Catalog.
  • Serverless query – A serverless interactive query service that uses standard SQL for analyzing databases. AWS: Amazon Athena. Azure: Data Lake Analytics.

Click here to read the entire guide published by Microsoft Azure Team:

This is our fourth blog in the series of blogs intended to help you embark on a cloud strategy, most importantly when you are in dilemma to choose AWS or Azure, the two prominent cloud players today.

If you had missed our earlier blogs, click here

1st Blog - Compute

2nd Blog- Storage

3rd Blog- CDN & Networking

Before we jumpstart on the actual comparison chart of Azure and AWS, we would like to bring you some basics on the database aspect of cloud strategy.

If you would rather have a quick look at the database comparison table, click here.

Through this blog, let’s understand the database aspect of your cloud strategy. As per the guide, Database services refers to options for storing data, whether it’s a managed relational SQL database that’s globally distributed or a multi-model NoSQL database designed for any scale.

When you decide cloud, one of the critical decisions you face is which database to use - SQL or NoSQL. Though SQL has an impressive track record, NoSQL is not far behind as it is gradually making notable gains and has many proponents. Once you have picked your database, the other big decision to make is which cloud vendor to choose amongst the many vendors.

Here’s where you consider Gartner’s prediction; the research company published a document that states

“Public cloud services, such as Amazon Web Services (AWS), Microsoft Azure and IBM Cloud, are innovation juggernauts that offer highly operating-cost-competitive alternatives to traditional, on-premises hosting environments.

Cloud databases are now essential for emerging digital business use cases, next-generation applications and initiatives such as IoT. Gartner recommends that enterprises make cloud databases the preferred deployment model for all new business processes, workloads, and applications. As such, architects and tech professionals should start building a cloud-first data strategy now, if they haven't done so already”

Reinforcing the trend, Gartner has recently published a new magic quadrant for infrastructure-as-a-service (IaaS) that – surprising nobody – has Amazon Web Services and Microsoft alone in the leaders’ quadrant, with a few others outside of it.

 

Now, the question really is, Azure or AWS for your cloud data? Or should it be both? Here’s a quick comparison table to guide you.

  • Relational database – SQL Database is a high-performance, reliable, and secure database you can use to build data-driven applications and websites, without needing to manage infrastructure. AWS: RDS. Azure: SQL Database, including Postgres and MySQL.
  • NoSQL—document storage – A globally-distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar. AWS: DynamoDB. Azure: Cosmos DB.
  • NoSQL—key/value storage – A non-relational data store for semi-structured data. AWS: DynamoDB and SimpleDB. Azure: Table Storage.
  • Caching – An in-memory–based, distributed-caching service that provides a high-performance store typically used to offload non-transactional work from a database. AWS: ElastiCache. Azure: Redis Cache.
  • Database migration – Focuses on migration of database schema and data from one database format to a specific database technology in the cloud. AWS: Database Migration Service (Preview). Azure: SQL Database Migration Wizard.

Click here to read the entire guide published by Microsoft Azure Team:

In line with our latest blog series highlighting how common cloud services are made available via Azure and Amazon Web Services (AWS), as published by Microsoft, this third blog in the series helps you understand Cloud Networking and Content Delivery capabilities of both Azure and AWS.

Before we jumpstart on the actual comparison chart of Azure and AWS, we would like to bring you some basics on cloud content delivery networking and the current trends on the subject.

If you would rather have a quick look at the comparison table, click here.

When we talk about cloud Content Delivery Network (CDN) and the related networking capabilities it includes all the hardware and software that allows you to easily provision private networks, connect your cloud application to your on-premises datacenters, and more.

According to Gartner, Content delivery networks (CDNs) are a type of distributed computing infrastructure, where devices (servers or appliances) reside in multiple points of presence on multi-hop packet-routing networks, such as the Internet, or on private WANs. A CDN can be used to distribute rich media downloads or streams, deliver software packages and updates, and provide services such as global load balancing, Secure Sockets Layer acceleration and dynamic application acceleration via WAN optimization techniques.

In simpler terms, these highly distributed server platforms are optimized to deliver content in a way that improves the customer experience. Hence, it is important to decrease latency by keeping the data closer to the users, protect it from security threats, and ensure rapid, streamlined content delivery, including general web delivery, content purge, content caching, and tracking history for as long as 90 days.

As per G2Crowd.com, most organizations use CDN services, such as web caching, request routing, and server-load balancing, to reduce load times and improve website performance. Further, to qualify as a CDN provider, a service provider must:

  • Allow access to a geographically dispersed network of PoPs in multiple data centers
  • Help websites access this network to deliver content to website visitors
  • Offer services designed to improve website performance
  • Provide scalable Internet bandwidth allowances according to customer needs
  • Maintain data center(s) of servers to reduce the possibility of overloading individual instances

With this background, let’s look at the AWS vs Azure comparison chart in terms of Networking and Content Delivery Capabilities:

  • Cloud virtual networking – Provides an isolated, private environment in the cloud. AWS: Virtual Private Cloud. Azure: Virtual Network.
  • Cross-premises connectivity – Connects Azure virtual networks to other Azure virtual networks or customer on-premises networks. It also supports VPN tunneling. AWS: AWS VPN Gateway. Azure: VPN Gateway.
  • Domain name system management – Manage DNS records using the same credentials, billing, and support contract as other Azure services; a service that hosts domain names, routes users to Internet applications, manages traffic to apps, and improves app availability with automatic failover. AWS: Route 53. Azure: DNS, Traffic Manager.
  • Content delivery network – Global content delivery network that transfers audio, video, applications, images, and other files. AWS: CloudFront. Azure: Content Delivery Network.
  • Dedicated network – Establishes a dedicated, private network connection from a location to the cloud provider. AWS: Direct Connect. Azure: ExpressRoute.
  • Load balancing – Automatically distributes incoming application traffic to add scale, handle failover, and route to a collection of resources. AWS: Elastic Load Balancing. Azure: Load Balancer, Application Gateway.

To read more about the Microsoft guide, which briefs all about cloud by drawing comparisons between Azure and AWS, click here (link to PDF download).

You may also like to read our previous blogs in this series; if so, please click here:

http://cloudiqtech.com/azure-vs-aws-compute/
http://cloudiqtech.com/aws-vs-azure-cloud-storage/

Azure or AWS or Azure & AWS? What’s your cloud strategy for Storage?

This is our second blog, in our latest blog series helping you understand all about cloud, especially when you are in doubt whether to go Azure or AWS or both.

To read our first blog talking about Cloud strategy in general and Compute in particular, click here…

Moving on, in this blog let’s find what Azure or AWS offer when it comes to Storage Capabilities for your Cloud Infrastructure.

Globally, CIOs are increasingly looking to cease running their own data centers and move to the cloud, which is evident when we read the projection made by a leading researcher, MarketsandMarkets. They reported that the global cloud storage market is expected to grow from $18.87 billion in 2015 to $65.41 billion by 2020, at a compound annual growth rate (CAGR) of 28.2 percent during the forecast period.

Reinforcing the fact, 451 Research’s Voice of the Enterprise survey last year stated that public cloud storage spending will double by next year (2017). "IT managers are recognizing the need for storage transformation to meet the realities of the new digital economy, especially in terms of improved efficiency and agility in the face of relentless data growth," said Simon Robinson, research vice president at 451 and research director of the new Voice of the Enterprise: Storage service. "It's clear from our Q4 study that emerging options, especially public cloud storage and all-flash array technologies, will be increasingly important components in this transformation," he added.

As we see, many companies are undoubtedly in for cloud storage. But the big question still prevails: whom to choose from the gamut of leading public cloud players, including big players like Azure and AWS? Should it be Azure alone for your cloud storage, or AWS, or a combination of both?

This needs a thorough understanding. To help you decide for good, we have decided to reproduce a guide published by Microsoft that briefs Azure‘s capabilities in comparison to AWS when it comes to cloud strategy. We will cover the storage part in this blog, but before that, a little backgrounder on cloud storage.

When we talk about cloud storage device mechanisms, we include all logical units of data storage covering from files, blocks, and datasets to objects and their relative storage interfaces. These instances of virtual storage devices are designed specifically for cloud-based provisioning and can be scaled as per need. It is to be noted that different cloud service consumers utilize different technologies to interface with virtualized cloud storage devices.

  • Object storage – Object storage service for use cases including cloud apps, content distribution, backup, archiving, disaster recovery, and big data analytics. AWS: Simple Storage Services (S3). Azure: Storage—Block Blob (for content logs, files) (Standard—Hot).
  • Virtual Server disk infrastructure – SSD storage optimized for I/O intensive read/write operations. AWS: Elastic Block Store (EBS). Azure: Disk Storage—Page Blobs (for VHDs or other random-write type data), Disk Storage—Premium Storage.
  • Shared file storage – A simple interface to create and configure file systems quickly as well as share common files. AWS: Elastic File System. Azure: File Storage (file share between VMs).
  • Archiving—cool storage – A lower cost tier for storing data that is infrequently accessed and long-lived. AWS: S3 IA, Glacier. Azure: Storage—Hot, Cool & Archive Tier.
  • Backup – Backup and archival solutions that allow files and folders to be backed up and recovered from the cloud, and provide off-site protection against data loss. AWS: Backup and Recovery. Azure: Backup.
  • Hybrid storage – Integrates on-premises IT environments with cloud storage. Automates data management and storage, plus supports disaster recovery. AWS: Storage Gateway. Azure: StorSimple.
  • Bulk data transfer – A data transport solution that uses secure disks and appliances to transfer substantial amounts of data; petabyte- to exabyte-scale data transport solutions. AWS: AWS Import/Export Disk, AWS Import/Export Snowball, AWS Snowball Edge, AWS Snowmobile. Azure: Import/Export, Data Box.
  • Disaster recovery – Automates protection and replication of virtual machines with health monitoring, recovery plans, and recovery plan testing. Azure: Site Recovery.

For a more detailed understanding download the document here

Surprisingly, as per an article published by Gartner, "Cloud computing is still perplexing to many CIOs even after a decade of cloud." While cloud computing is a foundation for digital business, Gartner estimates that less than one-third of enterprises have a documented cloud strategy. This indeed comes as a surprise, given the fact that cloud has evolved from a disruption to the indispensable tech of today and tomorrow, strategically adopted by many progressive companies along the way.

In the same article Donna Scott, Vice President and distinguished analyst at Gartner states that “Cloud computing will become the dominant design style for new applications and for refactoring a large number of existing applications over the next 10-plus years”. She also added that “A cloud strategy clearly defines the business outcomes you seek, and how you are going to get there. Having a cloud strategy will enable you to apply its tenets quickly with fewer delays, thus speeding the arrival of your ultimate business outcomes.”

However, it is easier said than done. Many top businesses still have questions like: how to make the most of cloud computing? What kind of architectures and techniques need to be strategized to support the many flavors of evolving cloud computing? Private or public? Hybrid or public? Azure or AWS, or should it be a hybrid combo?

Through a series of blogs we intend to bring answers to these questions. As a first one, we would like to highlight and present a comparative cloud service map focusing on Azure and AWS, both leaders in public cloud platforms, as published by Microsoft.

The well-researched article draws detailed comparisons between Azure and AWS and shows how common cloud services across parameters such as Marketplace, Compute, Storage, Networking, Database, Analytics, Big Data, Intelligence, IoT, Mobile and Enterprise Integration are made available via Azure and Amazon Web Services (AWS).

It should be noted that as prominent public cloud platform providers, Azure and AWS each offer businesses a wide and comprehensive set of capabilities across the globe. Many organizations have chosen either one of them, or both, depending upon their needs, in order to gain more agility and flexibility while minimizing the risk and maximizing the larger benefits of a multi-cloud environment.

For starters, let’s start with COMPUTE and the points one should consider and compare before deciding the Azure or AWS approach or a combination of both.

  • Virtual servers – Allows users to deploy, manage, and maintain OS and server software; instance types provide configurations of CPU/RAM. Also offers a lightweight, simplified product offering users can choose from when building out a virtual machine. AWS: Elastic Compute Cloud (EC2) VMs, Amazon Lightsail. Azure: Virtual Machines, Virtual Machine Images.
  • Container management – Supports Docker containers and allows users to run applications on managed instance clusters. Also allows customers to store Docker formatted images, used to create all types of container deployments on Azure. AWS: EC2 Container Service (ECS), EC2 Container Registry. Azure: Container Service, Container Registry.
  • Microservice-based applications – Orchestrates and manages the execution, lifetime, and resilience of complex, interrelated code components that can be either stateless or stateful. Azure: Service Fabric.
  • Backend process logic – Integrates systems and runs backend processes in response to events or schedules without provisioning or managing servers. AWS: Lambda. Azure: Functions, Event Grid.
  • Job orchestration – When processing across hundreds or thousands of compute nodes, this tool orchestrates the tasks and interactions between compute resources that are necessary. AWS: AWS Batch. Azure: Batch.
  • Scalability – Automatically changes the number of instances providing a compute workload. Users set defined metrics and thresholds that determine if the platform adds or removes instances. AWS: AWS Auto Scaling. Azure: Virtual Machine Scale Sets, App Service Scale Capability (PaaS), AutoScaling.
  • Pre-defined templates – Community-led templates for creating and deploying virtual machine-based solutions. AWS: AWS Quick Start. Azure: Quickstart templates.

For a more detailed understanding download the document here
