There are several tools for Big Data analysis, including Hadoop, which is favored by many developers.
The reason is that the Hadoop framework is based on a simple programming model, MapReduce, which enables computing solutions that are scalable, flexible, fault-tolerant, and cost-effective. But a major concern is speed: processing large datasets involves noticeable delays, both in the waiting time between queries and in the time to run a program. Spark solves that problem. Spark was introduced by the Apache Software Foundation for the specific purpose of speeding up Hadoop's computational process.
Apache Spark builds on Hadoop MapReduce: it extends the MapReduce model to efficiently support more computation types, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
But other Spark features also give it an advantage:
Speed- Spark can run an application in a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. It has an advanced DAG execution engine that supports cyclic data flow and in-memory computing: it stores intermediate processing data in memory, which reduces the number of read/write operations to disk.
Ease of Use- Spark offers over 80 high-level operators that make it easy to build parallel apps, and it can be used interactively from the Scala, Python, and R shells thanks to its versatile API collection.
Advanced Analytics- Spark not only supports 'map' and 'reduce'; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming, and these libraries can be combined seamlessly in the same application. They are all stacked on Spark's core.
Spark introduced a programming module for structured and semi-structured data processing called Spark SQL. It is a component on top of Spark's core that provides a domain-specific language for manipulating DataFrames in Scala, Java, and Python. It also provides SQL language support, with a command-line interface and an ODBC/JDBC server. Overall, it delivers a programming abstraction and acts as a distributed SQL query engine. But there are other features that make it compelling to employ.
Spark SQL Features
Integrated– The ability to seamlessly mix SQL queries with Spark programs. Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Python, Scala, and Java. This tight integration makes it easy to run SQL queries alongside complex analytic algorithms.
Uniform Data Access– DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. Spark SQL can even join data across these sources. Schema RDDs provide a single interface for efficiently working with structured data, including Apache Hive tables, Parquet files, and JSON files.
Hive Compatibility– The ability to run unmodified Hive queries on existing warehouses. Spark SQL reuses the Hive frontend and metastore, giving full compatibility with existing Hive data, queries, and UDFs. It simply needs to be installed alongside Hive.
Standard Connectivity– The ability to connect through JDBC or ODBC. Spark SQL includes a server mode with industry standard JDBC and ODBC connectivity.
Scalability- Spark SQL takes advantage of the RDD model to support mid-query fault tolerance, letting it scale to large jobs as well, so there is no need to use a different engine for historical data.
Spark SQL Architecture– To understand Spark SQL's workflow, we need to review its architecture.
The figure above shows Spark SQL's general architecture, which consists of three layers: Language API, Schema RDD, and Data Sources.
Language API– Spark itself supports different languages, and so does Spark SQL. Spark SQL supports Python, Scala, Java, and HiveQL.
Schema RDD- Spark Core is designed around a special data structure called the RDD. Spark SQL, by contrast, works on schemas, tables, and records, so a Schema RDD can be employed as a temporary table. This Schema RDD is also called a DataFrame.
Data sources- The typical data sources for Spark Core are text files or Avro files, but Spark SQL's data sources are different: it is compatible with Parquet files, JSON documents, Hive tables, and Cassandra databases.
Spark is a robust tool for data analysis compared to the MapReduce model, which depends only on mapping and shuffling. Spark additionally has a DAG (directed acyclic graph) of the computation and can use it to recover from failures: if a data computation fails, Spark can identify the failed node and restart the computation from that point rather than from scratch. And using Spark SQL, it is easier to compute large queries in a shorter time.
However, it should be pointed out that Spark SQL has some drawbacks. In DataFrame mode, it is tricky to update and delete data in a temporary table. The Hive table is a solution here: Spark SQL supports permanent Hive tables, and in a Hive table it is possible to run normal data-manipulation language, much as with MySQL queries.