
There are several open source tools available for managing infrastructure as code, backed by large communities of contributors, enterprise offerings, and good documentation. So why do we choose Terraform, and what makes it stand out? Terraform is used to provision infrastructure and to manage infrastructure changes through versioned configuration. It can manage low-level components such as compute instances, storage, and networking, as well as high-level components such as DNS entries.

 

A Good Fit for a Cloud-Agnostic Strategy

Enterprises want to mitigate the availability risk of mission-critical systems in the cloud by spreading their services across multiple cloud providers. They also look for avenues to reduce infrastructure cost by avoiding vendor lock-in. Terraform serves both of these use cases well: it is cloud-agnostic, it allows a single configuration to manage multiple providers, and it can even handle cross-cloud dependencies, simplifying management and orchestration.

 

An Orchestration Tool

Chef, Puppet, Ansible, and SaltStack are all “configuration management” tools, designed to install and manage software on existing servers, whereas Terraform is an “orchestration” tool, designed to provision the servers themselves and leave the configuration job to other tools. While there is some overlap between orchestration and configuration management tools, each is a better fit for certain use cases. For example, when an infrastructure is dominated by containers and all you need to do is provision a fleet of servers, an orchestration tool like Terraform is typically a better fit than a configuration management tool.

 

Combat Configuration Drift

While configuration management tools are best known for combating configuration drift in the infrastructure, they typically manage only a subset of a machine's state, which leaves gaps in the overall infrastructure state. Closing those gaps yields diminishing returns compared with the work that matters most for daily operations. This set of issues can be mitigated by using Terraform along with containers.

 

For example, if you tell a configuration management tool to install a new version of OpenSSL, it will run the software update on your existing servers and the changes will happen in place. Over time, as you apply more and more updates, each server builds up a unique history of changes, causing configuration drift. If you are using Docker and an orchestration tool such as Terraform, the new Docker image is already built and ready for the new servers: new servers are deployed from that image, the old servers are then decommissioned, and the resulting server state is tracked by Terraform. This approach reduces the likelihood of configuration drift bugs.

 

Conclusion

Overall, Terraform is an open source, cloud-agnostic orchestration tool with a strong feature set. While it may be less mature than some other tools in the market, Terraform is still a good candidate for meeting a specific set of requirements.


Apache Spark

Apache Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads.

 

Spark provides distributed task transmission, scheduling, and I/O functionality. It provides programmers with a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied.

 

How Apache Spark works

Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores.

 

The Spark Core engine uses the resilient distributed dataset, or RDD, as its primary data type. The RDD is designed to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn't have to define where specific files are sent or what computational resources are used to store or retrieve files.

 

Given below is a sample Spark program, written in Python, that counts the number of records with each rating in the input file.
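
A minimal sketch of such a program, assuming the MovieLens-style u.data file referenced below (tab-separated columns: user ID, movie ID, rating, timestamp):

    from pyspark import SparkConf, SparkContext

    # Connect to a local Spark instance; the application name is illustrative.
    conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
    sc = SparkContext(conf=conf)

    # Load the raw input file and extract the 3rd column (index [2]), the rating.
    lines = sc.textFile("u.data")
    ratings = lines.map(lambda x: x.split()[2])

    # countByValue() is an action; it triggers execution of the lazy map() above.
    result = ratings.countByValue()

    for rating, count in sorted(result.items()):
        print(rating, count)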

In the above code, sc is the SparkContext associated with the input file u.data. ratings is an RDD created by mapping the 3rd column of each record in the input file (array index [2], the rating). Here map() is a transformation function that produces a new RDD.

 

We can have multiple transformations in a single Spark program, each producing a new RDD from an existing RDD or an input file. countByValue() is an action. In Spark, transformations are not executed until an action is triggered; this is called lazy evaluation.

 

Figure 1: How Apache Spark works

 

Spark languages

Spark was written in Scala, which is considered the primary language for interacting with the Spark Core engine. Out of the box, Spark also comes with API connectors for using Java, R, and Python.

 

Spark libraries

  • Spark Core — The core engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data.
  • Spark SQL — One of the most commonly used libraries, Spark SQL enables users to query data stored in disparate applications using the common SQL language (see the sketch after this list).
  • Spark Streaming — This library allows users to build applications that analyze and present data in real time.
  • MLlib — A library of machine learning code that enables users to apply advanced statistical operations to data in their Spark cluster and to build applications around these analyses.
  • GraphX — A built-in library of algorithms for graph-parallel computation.
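
As a quick illustration of the Spark SQL library, the sketch below registers a DataFrame as a temporary view and queries it with plain SQL; the people.json file and its name and age fields are assumptions made for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

    # Read a JSON file into a DataFrame and expose it to SQL as a temporary view.
    df = spark.read.json("people.json")
    df.createOrReplaceTempView("people")

    # Query the view using the common SQL language via the Spark SQL engine.
    adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()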

 

RDDs, DataFrames, and Datasets

An RDD is an immutable, distributed collection of data elements, partitioned across the nodes of a cluster, that can be operated on in parallel through a low-level API offering transformations and actions.

 

Like an RDD, a DataFrame is an immutable distributed collection of data. However, unlike an RDD, data is organized into named columns, like a table in a relational database.

 

Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.
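
The difference is easiest to see side by side. The short sketch below builds the same data as an RDD and as a DataFrame; note that the typed Dataset API is available in Scala and Java, while in Python the DataFrame API plays that role. The column names are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RDDvsDataFrame").getOrCreate()

    # An RDD: an immutable, partitioned collection of raw tuples.
    rdd = spark.sparkContext.parallelize([(1, 5.0), (2, 3.0), (1, 4.0)])

    # A DataFrame: the same data, organized into named columns like a table.
    df = spark.createDataFrame(rdd, ["movieID", "rating"])
    df.printSchema()
    df.show()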

 

Executing SQL-style functions on a DataFrame

Given below is a map-reduce style program, using the RDD API, to get the list of popular movies (those rated by the most customers), using the same input data as Figure 1 above.
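
A sketch of such a program, assuming the same u.data input where the 2nd column (index [1]) is the movie ID:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("PopularMovies")
    sc = SparkContext(conf=conf)

    # Map each record to (movieID, 1), then reduce to count ratings per movie.
    lines = sc.textFile("u.data")
    movieCounts = lines.map(lambda x: (int(x.split()[1]), 1)) \
                       .reduceByKey(lambda x, y: x + y)

    # Flip to (count, movieID) and sort so the most-rated movies come last.
    sortedMovies = movieCounts.map(lambda x: (x[1], x[0])).sortByKey()

    for count, movieID in sortedMovies.collect():
        print(movieID, count)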

 

The same program, when written using the DataFrame API, will look like this.
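
Below is a comparable sketch using DataFrames (the way u.data is parsed into rows is an assumption made for the example):

    from pyspark.sql import SparkSession, Row
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("PopularMovies").getOrCreate()

    # Parse u.data into Row objects and convert them into a DataFrame.
    lines = spark.sparkContext.textFile("u.data")
    movies = lines.map(lambda x: Row(movieID=int(x.split()[1])))
    movieDF = spark.createDataFrame(movies)

    # SQL-style grouping and ordering replace the hand-written map/reduce steps.
    topMovieIDs = movieDF.groupBy("movieID").count().orderBy(F.desc("count"))
    topMovieIDs.show(10)

    spark.stop()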

 

 

As you can see, DataFrames give us the flexibility to use SQL-style functions to get the required results. Because the DataFrame API is built on top of the Spark SQL engine, it uses Catalyst to generate an optimized logical and physical query plan.

 

Job Scheduling

Spark has several facilities for scheduling resources between computations.

 

  • Each Spark application (instance of SparkContext) runs an independent set of executor processes. The cluster managers that Spark runs on provide facilities for scheduling across applications.
  • Within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads. This is common if the application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext.
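
For example, here is a minimal sketch of enabling the fair scheduler within a single application; the pool name "analytics" is illustrative.

    from pyspark.sql import SparkSession

    # Enable the fair scheduler for jobs submitted within this SparkContext.
    spark = (SparkSession.builder
             .appName("FairSchedulingExample")
             .config("spark.scheduler.mode", "FAIR")
             .getOrCreate())

    # Jobs submitted from this thread go into the "analytics" scheduler pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "analytics")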

 

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards.


The Python program shown below counts the number of words in text data received from a data server listening on a TCP socket.
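
A minimal sketch of that program, assuming a data server listening on localhost port 9999:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Two local worker threads and a 1-second batch interval.
    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, 1)

    # Create a DStream that connects to the data server on localhost:9999.
    lines = ssc.socketTextStream("localhost", 9999)

    # Split each line into words, then count each word within every batch.
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)

    # Print the first ten counts of each batch to the console.
    wordCounts.pprint()

    ssc.start()             # Start the computation
    ssc.awaitTermination()  # Wait for it to terminate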

 

Sample input entered for this program at a terminal through NetCat, and the corresponding output of the program, are shown below.
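
For illustration only, if NetCat is started with nc -lk 9999 in one terminal and the line of text shown is typed, the streaming program in the other terminal would print word counts for that batch along these lines:

    # Terminal 1: data server started with NetCat
    $ nc -lk 9999
    hello world hello spark

    # Terminal 2: output of the streaming program for that batch
    -------------------------------------------
    Time: <batch time>
    -------------------------------------------
    ('hello', 2)
    ('world', 1)
    ('spark', 1)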

 

Conclusion

Since the release of Spark 1.0 in May 2014, Apache Spark has become the go-to framework for companies that work with large-scale Big Data applications. The speed and agility of Spark have made it incredibly useful across a wide range of industries.

 

From FMCG giants to BFSI companies to digital advertising firms, Apache Spark has proved to be indispensable when it comes to aggregating data, gleaning insights, and forecasting industry trends.
