BigDataStack Innovation Potential: Initial Plan and Activities

Enterprises today often have to use different database systems to fulfil different purposes: 

  • relational operational databases for handling operational load
  • data warehouses to submit analytical queries
  • key-value stores to hold historical data
  • dedicated systems to perform big data analysis.


Combining data from different sources is not an easy task, and moving data from one source to another is costly: it requires ETL processes and offline batch processing, which are often run at night. Existing solutions for polyglot applications often introduce the concept of data lakes or implement a federation layer on top of the different sources to provide common access. However, all of these solutions work across separate data sources, and the federation layer typically relies on technologies such as Spark, which can be very resource-consuming and cannot exploit the specific capabilities of each data store.


The new component (to be) developed by BigDataStack

The seamless analytical framework will federate data from two different types of datastores that share the same dataset: the relational operational LXS datastore and object stores. This single component will be used across the different datastores in order to exploit the unique characteristics of each one, transparently to the user and without compromising some requirements for the benefit of others. LXS will be used for the operational workload, storing data that is continuously ingested. Once data can be considered historical and is no longer needed in operational transactions, it will be safely removed from LXS and moved to the object store for heavy analytical processing.


The LXS Query Engine will be used as the federator on top of the two datastores. Offering a common JDBC interface, it will provide a single way to access data and will push down operations accordingly:

  1. Write operations will be pushed down to LXS, as they will require transactional semantics. 
  2. Read operations will be pushed down both to LXS and to IBM object store by the federator, which will retrieve the intermediate results and merge them on the fly before sending back the accumulated response.
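
The routing rule above can be illustrated with a minimal sketch. It is not the LXS Query Engine itself: `InMemoryStore` and `Federator` are hypothetical names, both stores are simulated in memory, and a Python predicate stands in for a query that would really arrive over JDBC.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a datastore; rows are plain dicts."""
    name: str
    rows: list = field(default_factory=list)

    def insert(self, row):
        self.rows.append(row)

    def scan(self, predicate):
        # The filter runs "inside" the store, mimicking a pushed-down read.
        return [row for row in self.rows if predicate(row)]


class Federator:
    """Illustrative federator: routes writes and reads as described above."""

    def __init__(self, operational, object_store):
        self.operational = operational
        self.object_store = object_store

    def write(self, row):
        # Writes need transactional semantics, so they go only to the
        # operational store.
        self.operational.insert(row)

    def read(self, predicate):
        # Reads are pushed down to both stores; the partial results are
        # merged before the combined response is returned.
        historical = self.object_store.scan(predicate)
        recent = self.operational.scan(predicate)
        return historical + recent
```

In this sketch a caller only ever talks to the `Federator`; which store served which row is not visible in the returned result.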

The LXS query engine offers intra-query parallelism in order to exploit the distributed nature of its storage, which can accept pushdowns of aggregate operations, execute them locally, and return the intermediate results for the corresponding data node. The same concept will be applied here: the submitted query will be executed in a distributed manner, and the results will be returned to the caller.
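
As a rough illustration of aggregate pushdown, the sketch below computes an average over several partitions. Each partition returns only a partial `(sum, count)` pair, and the caller merges the pairs. The function names are hypothetical, and threads stand in for data nodes; the actual engine's execution model is not described in this text.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(values):
    """Runs 'locally' on one data node and returns a partial (sum, count)."""
    return (sum(values), len(values))


def federated_average(partitions):
    """Push the aggregate to every partition in parallel, then merge."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    # Only the small partial pairs travel to the merger, not the raw rows.
    return total / count if count else None
```

An average cannot be merged from per-node averages alone, which is why each node returns a sum and a count rather than its own mean.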


Moreover, as already mentioned, the relational datastore and the object store will share the same dataset: data will be periodically moved from one datastore to the other. LXS can periodically export a dump of all write operations while ensuring data consistency, which allows the seamless framework to use a queue between the two datastores. LXS will push these dumps to the queue in a predefined format, and the IBM side will pull this information and finally ingest the historical data into the object store. Only when the object store has received the data can the relational datastore drop them, so that data loss is prevented. In this scenario, data may temporarily co-exist in both stores, which has to be taken into account when the federator produces the final results.
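
The hand-off described above can be sketched as follows, under simplifying assumptions: both stores are plain dictionaries keyed by a primary key, the queue is an in-process `queue.Queue`, and all function names are hypothetical. The sketch only shows the ordering (export, ingest, acknowledge, drop) and how a keyed merge avoids counting a row twice while it exists in both stores.

```python
import queue


def export_dump(operational, keys):
    """Operational side: export the selected rows without deleting them."""
    return {k: operational[k] for k in keys}


def ingest(object_store, channel):
    """Object store side: pull one dump, ingest it, return the keys stored."""
    dump = channel.get()
    object_store.update(dump)
    return set(dump)


def drop_acknowledged(operational, acked_keys):
    """Drop rows only after the object store has confirmed receiving them."""
    for k in acked_keys:
        operational.pop(k, None)


def federated_read(operational, object_store):
    """Merge by key so a row present in both stores appears only once."""
    merged = dict(object_store)
    merged.update(operational)
    return merged
```

Between `ingest` and `drop_acknowledged` the moved rows exist in both stores, and `federated_read` still returns each of them exactly once.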


The results so far


At the moment, the initial design of the seamless analytical framework has been completed, including an analysis of the potential bottlenecks:

  • problems due to the distributed nature of the system
  • race conditions when moving a dataset across different data stores that do not share knowledge of when data are stored or accessed
  • grammar issues when a common language has to be executed by different systems.


Therefore, the design is based on well-known standards (JDBC) and tools that can support them on either side. Moreover, the capabilities offered by JDBC connectivity on top of Spark are being investigated in order to determine which operations the federator can push down directly to Spark. Pushing operations down improves performance by executing them locally, and it reduces memory consumption at the federator level, which would otherwise have to merge intermediate results that have not yet been filtered. Finally, an initial design of the data pipeline between the two datastores has been produced.


The expected innovation


The main innovation of the seamless analytical framework is that it allows the same dataset to be used via two different datastores, each providing different capabilities. Since analytical operations compete with operational ones, transactional datastores cannot support both types of load at the same time. Because of this, enterprises move data that can be considered historical to data warehouses using costly ETL processes that run at night, splitting the dataset into two separate ones that are examined separately. The seamless analytical framework, however, allows the dataset to be treated as a whole. It does this by transmitting data live from the operational datastore to the object store while ensuring data consistency at runtime, without any intervention from the system administrator and without the need for costly and demanding operations. At the same time, access is also transparent: the federator, using the widely adopted JDBC standard, forwards requests to each datastore in parallel and returns the results. From the application level, the seamless framework can be considered a black-box database that combines the features and special characteristics of two different datastores, without compromising the performance that each of the two stores could have achieved on its own.