Data Skipping technology ported to Apache Spark 3.0

The Apache Spark community announced the release of Spark 3.0 on June 18 and is the first major release of the 3.x series. The release contains many new features and improvements. It is a result of more than 3,400 fixes and improvements from more than 440 contributors worldwide. BigDataStack partner IBM focuses on a number of selective open source technologies on machine learning, AI workflow, trusted AI, metadata, and big data process platform, etc., and has delivered approximately 200 commits, including a couple of key features in this Spark 3.0 release. A few of these comments were related to the IBM Research developed BigDataStack software component Data Skipping. 

Spark 3.0 for BigDataStack

SQL queries are a widespread technique to analyse datasets. In BigDataStack, data skipping for SQL queries has been further researched and developed. This technology is relevant when the dataset resides in an Object Store. This research has already been contributed to the IBM SQL Query service for now as a closed beta.

Watch the explanation by Yosef Moatti of IBM Research. 

[Embed video: https://youtu.be/jEujY_NYmjg ]

Spark 3.0 highlights

The announcement of release 3.0 introduces a number of important features and improvements:

  • Adaptive query execution — Reoptimizing and adjusting query plans based on runtime statistics collected during query execution

  • Dynamic partition pruning — Optimized execution during runtime by reusing the dimension table broadcast results in hash joins

  • SQL compatibility — A number of enhancements on ANSI SQL compliance on syntax, function and store assignment policy, etc.

  • Language — 3.0 moved to Python 3, Scala 2.12, and JDK 11

  • PySpark — Significant improvements in pandas APIs

  • New UI for structured streaming — Now contains a structured streaming tab, which provides information about running and completed queries statistics

  • Performance — Up to 40x speedups for calling R user-defined functions and a 2x performance improvement in TPC-DS benchmark

  • Accelerator-aware scheduler — Users can now specify and use hardware accelerators (e.g. GPUs) to improve performance on tasks such as deep learning

  • SQL reference documentation — Detailed, easily navigable Spark SQL reference documentation includes syntax, semantics, keywords, and examples for common SQL usage

BigDataStack software component Data Skipping ported to Spark 3.0

IBM contributed to the changes made to support the Spark 3.0:

  • Changes to keep data skipping working with datasource v1

  • Changes to support data source v2 code path 

 

By default Spark 3.0 still uses data source v1 as there are still unfinished functionalities in datasource v2, which is expected to become the default in Spark 3.1. 

 

Curious what IBM Research commented in the Spark Open Source Community? Dive into the IBM comments to Spark 3.0 

Want to know more?

IBM is strongly involved in the advancement of AI, machine learning, big data, and analytics tools globally, actively supporting ongoing improvements in Apache Spark. Have a look at the overview of IBM contributions to Spark 3.0 here

Data Skipping was presented and demonstrated at 2 of the BigDataStack webinars: