Spark and Spark SQL Ignite Hadoop With Speed and Accuracy

There are several tools for Big Data analysis, including Hadoop, which is favored by many developers.

The reason is that the Hadoop framework is built on a simple programming model named MapReduce, which enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. But a major concern is speed: there is a noticeable delay when processing large datasets, both in waiting time between queries and in waiting time to run a program. Spark solves that problem. Spark was introduced by the Apache Software Foundation for the specific purpose of speeding up Hadoop's computational process.
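The MapReduce model mentioned above can be sketched in plain Python (a toy illustration of the idea, not Hadoop itself): a map phase emits key/value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group of values into one result per key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big data tools"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

In Hadoop, each phase runs distributed across many nodes and the shuffle writes to disk; that disk traffic between stages is exactly the delay Spark attacks.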

Apache Spark builds on the Hadoop MapReduce model, extending it to efficiently support more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

But other Spark features also give it an advantage:

Speed – Spark can run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. Its advanced DAG execution engine supports cyclic data flow and in-memory computing; the speedup comes from reducing the number of read/write operations to disk, since Spark stores intermediate processing data in memory.

Ease of Use – Spark offers over 80 high-level operators that make it easy to build parallel apps, and thanks to its versatile API collection it can be used interactively from the Scala, Python and R shells.

Advanced Analytics – Spark supports not only 'map' and 'reduce' but also SQL queries, streaming data, machine learning (ML) and graph algorithms. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and these libraries can be combined seamlessly in the same application. All of them are stacked on Spark's core.


Runs Everywhere – Spark can run in its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos, and it can access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.

Spark SQL

Spark introduced a programming module for structured and semi-structured data processing called Spark SQL. It is a component on top of Spark's core that provides a domain-specific language for manipulating DataFrames in Scala, Java and Python. It also provides SQL language support, with a command-line interface and an ODBC/JDBC server. Overall, it delivers a programming abstraction and acts as a distributed SQL query engine. But there are other features that make it compelling to employ.

Spark SQL Features

Integrated – The ability to seamlessly mix SQL queries with Spark programs. Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Python, Scala and Java. This tight integration makes it easy to run SQL queries alongside complex analytic algorithms.

Uniform Data Access – DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. Spark SQL can even join data across these sources. SchemaRDDs provide a single interface for efficiently working with structured data, including Apache Hive tables, Parquet files and JSON files.

Hive Compatibility – The ability to run unmodified Hive queries on existing warehouses. Spark SQL reuses the Hive frontend and metastore, giving full compatibility with existing Hive data, queries, and UDFs. It simply needs to be installed alongside Hive.

Standard Connectivity– The ability to connect through JDBC or ODBC. Spark SQL includes a server mode with industry standard JDBC and ODBC connectivity.

Scalability – Spark SQL takes advantage of the RDD model to support mid-query fault tolerance, letting it scale to large jobs as well. So there is no need to worry about using a different engine for historical data.

 

Spark SQL Architecture – To understand Spark SQL's workflow, we need to review its architecture.

[Figure: Spark SQL architecture]

The figure above shows that Spark SQL's general architecture consists of three layers: Language API, Schema RDD, and Data Sources.

Language API – Spark itself supports different languages, and so does Spark SQL: it supports Python, Scala, Java and HiveQL.

Schema RDD – Spark Core is designed around a special data structure called the RDD. Generally, Spark SQL works on schemas, tables, and records, so a Schema RDD can be employed as a temporary table. Such a Schema RDD is also called a DataFrame.

Data Sources – The usual data sources for Spark Core are text files or Avro files, but Spark SQL's data sources are different: it is compatible with Parquet files, JSON documents, Hive tables and the Cassandra database.

Why Spark?

It's a more robust tool for data analysis than the plain MapReduce model, which depends only on mapping and shuffling. Spark additionally builds a DAG (directed acyclic graph) of operations and can use it to maintain data consistency: for example, if a data computation fails, Spark can track the failed partition's lineage and restart computation from that point. And with Spark SQL, it is easier to compute large queries in a shorter time.
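The lineage-based recovery described above can be illustrated with a small pure-Python toy model (not Spark's actual implementation): each dataset remembers the transformation that produced it, so lost results can be rebuilt by replaying the lineage instead of restoring a replica.

```python
class ToyRDD:
    """A toy dataset that records its lineage (parent + transformation)."""

    def __init__(self, data=None, parent=None, transform=None):
        self._cache = data          # materialized data, if any
        self.parent = parent        # upstream ToyRDD, if derived
        self.transform = transform  # function applied to the parent's data

    def map(self, fn):
        # Record the transformation lazily instead of applying it eagerly.
        return ToyRDD(parent=self, transform=lambda data: [fn(x) for x in data])

    def compute(self):
        # Use cached data if present; otherwise replay the lineage.
        if self._cache is None:
            self._cache = self.transform(self.parent.compute())
        return self._cache

    def lose_partition(self):
        # Simulate a node failure: the materialized data is gone.
        self._cache = None

base = ToyRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
assert doubled.compute() == [2, 4, 6]

doubled.lose_partition()        # the "node" holding the result fails
recovered = doubled.compute()   # rebuilt from lineage, not from a replica
print(recovered)  # [2, 4, 6]
```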

However, it should be pointed out that Spark SQL has some drawbacks. In DataFrame mode, it is tricky to update and delete data in a temporary table. For this, the Hive table is a solution: Spark SQL supports permanent Hive tables, and on a Hive table it is possible to run normal data manipulation language statements, as with MySQL queries.
