Enterprise-scale Analytics Performance with Cloud Object Storage

BigDataStack at IBM THINK 2019

Speaker: Paula Ta-Shma, IBM

Highlights

Big data processing is a big deal: service providers charge on average $5 per TB scanned, although some providers bill on a time basis instead. Regardless of the business you are in, the shared aim is the same: to make your data less “big” or… to reduce the number of queries you run to process your data! And this is where BigDataStack comes into play. We are implementing several best practices in our code to make it perform better. For instance, Hive-style partitioning of your data gives you a structured, compact pool of information to work on, and a correct metadata index lets us skip over objects we don’t need, saving queries, time, bytes scanned and, finally, money. Minimizing the number of queries could really be life-changing for your business, and such improvements could reduce the number of objects scanned in a single query to as little as 1/64 of the original number.
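To make the effect of Hive-style partitioning on cost concrete, here is a minimal Python sketch. It assumes a hypothetical layout in which each object key carries `name=value` path segments (for example `logs/dt=2019-02-12/part-0.parquet`), decimal terabytes, and the average price of $5 per TB scanned quoted above. The object names, sizes and helper functions are made up for illustration and are not part of any IBM or BigDataStack API.

```python
from typing import Dict, List, Tuple

PRICE_PER_TB_USD = 5.0   # average price per TB scanned, as quoted in the text
BYTES_PER_TB = 10**12    # assumption: decimal terabytes

# (object key, size in bytes); a hypothetical listing of a bucket
ObjectListing = List[Tuple[str, int]]


def partition_values(key: str) -> Dict[str, str]:
    """Extract Hive-style `name=value` path segments from an object key."""
    values: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            values[name] = value
    return values


def prune(objects: ObjectListing, column: str, wanted: str) -> ObjectListing:
    """Keep only objects whose partition value for `column` equals `wanted`."""
    return [
        (key, size)
        for key, size in objects
        if partition_values(key).get(column) == wanted
    ]


def scan_cost_usd(objects: ObjectListing) -> float:
    """Cost of scanning every listed object at the per-TB price."""
    total_bytes = sum(size for _, size in objects)
    return total_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


# Four days of data, two 250 GB objects per day (illustrative numbers only).
listing: ObjectListing = [
    (f"logs/dt=2019-02-{day:02d}/part-{part}.parquet", 250 * 10**9)
    for day in (11, 12, 13, 14)
    for part in (0, 1)
]

one_day = prune(listing, "dt", "2019-02-12")
print(len(listing), scan_cost_usd(listing))   # full scan
print(len(one_day), scan_cost_usd(one_day))   # scan after partition pruning
```

With this made-up listing, a query restricted to a single day touches two of the eight objects, so the bytes scanned and the resulting charge fall to a quarter of the full scan.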

BigDataStack at IBM Think 2019

BigDataStack was at the IBM business and technology event THINK 2019. Paula Ta-Shma of IBM addressed Enterprise-scale Analytics Performance with Cloud Object Storage at the 2019 IBM flagship event. The session covered IBM's work on data skipping, which is being researched in the context of BigDataStack. The ability to run SQL directly on data in object storage is a comparatively young technology. But IBM Cloud offers a rapidly growing set of optimisation mechanisms that allow you to achieve very competitive SQL performance while still benefiting from highly elastic and highly available object storage as your data persistence layer. In this session we walk you through a set of IBM innovations that we apply to our object storage and SQL services to enable this competitive performance.

Challenge

Object storage is still often considered to be cold storage. However, this perspective is less and less justified. So what are the mechanisms that allow you to run serious analytic workloads in place, right where the data is stored on object storage?

Solution

Use data skipping indexes on IBM Cloud Object Storage in combination with IBM Cloud SQL Query and IBM Analytics Engine to push the filtering of data partitions down into object storage. Run SQL queries on encrypted columns of object storage data, saving large amounts of compute resources and thus dramatically accelerating your SQL queries.
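The idea behind a data skipping index can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the IBM implementation: it assumes the index stores only a minimum and maximum value of one column per object, and the `ObjectStats` type, the `objects_to_scan` helper and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ObjectStats:
    """Per-object metadata kept in the (hypothetical) skipping index."""
    key: str
    min_value: float
    max_value: float


def objects_to_scan(index: List[ObjectStats], lo: float, hi: float) -> List[str]:
    """Return keys of objects that may contain rows with lo <= value <= hi.

    An object is skipped when its [min, max] range cannot overlap the query
    range, so it never has to be read from object storage.
    """
    return [
        stats.key
        for stats in index
        if stats.max_value >= lo and stats.min_value <= hi
    ]


# Illustrative index over a numeric column, e.g. a temperature reading.
index = [
    ObjectStats("part-0", 0, 10),
    ObjectStats("part-1", 8, 25),
    ObjectStats("part-2", 24, 35),
    ObjectStats("part-3", 36, 50),
]

# Equivalent of: SELECT ... WHERE value BETWEEN 30 AND 40
print(objects_to_scan(index, 30, 40))
```

Only the small index is consulted before the query runs. Objects whose value range cannot match the predicate are never fetched, which is where the savings in bytes scanned come from. The test is conservative: an object that is kept may still contain no matching rows, but an object that is skipped cannot contain any.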

Benefits

Be surprised by the competitive SQL performance that is possible with data stored on object storage. Understand how to use indexing and column encryption to accelerate your SQL workloads by orders of magnitude.

Enterprise-scale Analytics Performance with Cloud Object Storage in BigDataStack

Have a look at the full presentation on Enterprise-scale Analytics Performance with Cloud Object Storage and how this is researched in the context of BigDataStack or visit our website to discover more about the project!