How to Lay Out Big Data in IBM Cloud Object Storage for Spark SQL
Are you familiar with storage systems like IBM Cloud Object Storage (COS) or Apache Spark SQL? Dr. Paula Ta-Shma from IBM gives us some tips and tricks to improve your daily data journey.
Storing vast quantities of Rectangular Data
When you have vast quantities of rectangular data, the way you lay it out in object storage systems like IBM Cloud Object Storage (COS) makes a big difference to both the cost and performance of SQL queries; however, this task is not as simple as it sounds. Here we survey some tricks of the trade. A rectangular dataset can be conceptually thought of as a table with rows and columns, where each column is assigned a certain data type (integer, string etc.). This blog is relevant for anyone using Apache Spark SQL (Spark SQL) on data in COS. In particular, it is relevant to users of:
- IBM Cloud SQL Query (SQL Query), which uses Spark SQL as its query engine
- IBM Analytics Engine
- IBM Analytics for Apache Spark
- IBM Watson Studio
The value of the BigDataStack project
The new data-driven industrial revolution highlights the need for big data technologies to unlock the potential in various application domains. BigDataStack delivers a complete, high-performance stack of technologies addressing the emerging needs of data operations and applications. The stack is based on a frontrunner infrastructure management system that drives decisions according to data aspects, making it fully scalable, runtime-adaptable and performant for big data operations and data-intensive applications.
BigDataStack promotes automation and quality and ensures that the provided data are meaningful, of value and fit-for-purpose through its Data as a Service offering, which addresses the complete data path with approaches for data cleaning, modelling, semantic interoperability, and distributed storage. BigDataStack builds steadily on the research performed in these areas to enhance opportunities for achieving impact, such as the research results presented by Dr. Paula Ta-Shma from IBM, funded by the European Community’s Horizon 2020 research and innovation programme under grant agreements n°779747 and 644182.
Learning from the GridPocket Use Case
GridPocket is a smart grid company developing energy management applications and cloud services for electricity, water, and gas utilities. We used their open source meter_gen data generator to generate meter readings, accompanied by timestamps, geospatial locations and additional information, as CSV data, and stored it in COS in a bucket called metergen. Then we used SQL Query to run queries against this dataset.
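As a rough illustration, the same dataset can also be queried from a Spark SQL session. This is a minimal PySpark sketch rather than the exact setup used in the research: the bucket name metergen comes from the text, while the COS service alias (myCos), the object path and the credential configuration are illustrative assumptions that would need to be adapted to your environment.

```python
from pyspark.sql import SparkSession

# Assumes COS credentials are already configured for the "cos://" scheme
# (e.g. via the Stocator connector); "myCos" is an assumed service alias.
spark = SparkSession.builder.appName("metergen-demo").getOrCreate()

# Read the generated CSV meter readings. Schema inference is used here only
# for quick exploration; the Parquet sketch later in this post uses an
# explicit schema instead.
readings = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("cos://metergen.myCos/csv/"))

readings.createOrReplaceTempView("readings")
spark.sql("SELECT COUNT(*) AS readings_count FROM readings").show()
```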
The main takeaways from the research performed
- Aim for object sizes of around 128 MB, avoiding objects that are too large or too small (see the Parquet conversion sketch after this list):
  - Use maxRecordsPerFile to avoid large objects
  - Use the coalesce and repartition commands to avoid small objects
- Use Parquet, a columnar storage format. Conversion to Parquet or ORC can provide significant benefit when the dataset will be queried extensively. It will:
  - Decrease storage costs by using column-wise compression
  - Reduce query cost and improve query performance by scanning less data
  - Avoid schema inference
- Never compress large CSV or JSON objects with Gzip. Since Spark distributes work across multiple tasks, each task ideally reads some byte range of an object. However, Gzip is not a “splittable” compression algorithm, so reading any byte range of a gzipped object requires decompressing it from byte 0. In practice a single task ends up decompressing the entire object and sharing the result with the other tasks, which can lead to out-of-memory conditions and requires a shuffle to repartition the result across the workers (see the compression sketch after this list).
- Use Hive-style partitioning to store big datasets: it mimics virtual folder hierarchies and encodes information about object contents in object names (see the partitioning sketch after this list).
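The following PySpark sketches illustrate these takeaways. First, converting the CSV dataset to Parquet with an explicit schema (so no schema inference is needed) while using repartition and maxRecordsPerFile to keep object sizes near the 128 MB target. The paths, column names and partition count are illustrative assumptions, not the exact code used in the research; it builds on the spark session from the sketch above.

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

# An explicit schema avoids a costly schema-inference pass over the CSV data.
# Column names here are assumed for illustration.
schema = StructType([
    StructField("vid", StringType()),       # meter id
    StructField("date", TimestampType()),   # reading timestamp
    StructField("index", DoubleType()),     # cumulative meter index
    StructField("lat", DoubleType()),
    StructField("lng", DoubleType()),
])

readings = (spark.read
            .option("header", "true")
            .schema(schema)
            .csv("cos://metergen.myCos/csv/"))

(readings
 .repartition(32)                         # avoid producing many tiny objects
 .write
 .option("maxRecordsPerFile", 5000000)    # cap object size from above
 .mode("overwrite")
 .parquet("cos://metergen.myCos/parquet/"))
```

Tune the repartition (or coalesce) factor and maxRecordsPerFile together so that the resulting Parquet objects land near the 128 MB target for your data.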
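If some data must remain in CSV or JSON form and needs compression, a splittable codec such as bzip2 lets several tasks each read a byte range of the same object in parallel, unlike Gzip. A short sketch, continuing with the readings DataFrame from above and using assumed paths:

```python
# Prefer a splittable codec over gzip for compressed CSV/JSON objects.
(readings
 .write
 .option("header", "true")
 .option("compression", "bzip2")   # splittable, unlike gzip
 .mode("overwrite")
 .csv("cos://metergen.myCos/csv-bzip2/"))
```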
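Finally, a sketch of Hive-style partitioning: partitionBy encodes the value of a chosen column in the object names (for example .../dt=2017-01-01/part-...), so queries that filter on that column scan only the matching “folders”. The partition column dt and the paths are assumptions for illustration, again continuing from the readings DataFrame above.

```python
from pyspark.sql.functions import to_date

# Derive a date partition column and write one "folder" per date value.
(readings
 .withColumn("dt", to_date("date"))
 .write
 .partitionBy("dt")
 .mode("overwrite")
 .parquet("cos://metergen.myCos/parquet-partitioned/"))
```

A query with a predicate such as WHERE dt = '2017-01-01' then only reads the objects whose names contain dt=2017-01-01, which reduces both the data scanned and the query cost.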
Over to you
Start storing your rectangular data, learn from the GridPocket use case, and read the full article HERE.