One of the major goals of the BigDataStack project is to build and give access to a set of data services to facilitate the ingestion of big data as well as its exploration through analytics.
In this blog post we describe the major data services that are being built. We first cover the data ingestion path, which starts the instant a data record is created (e.g., a sensor issuing an IoT record) and ends when the data is stored in its final form in a permanent store. In the second part, we describe the data services that ease data exploration, whether through direct SQL queries or through machine learning algorithms.
Author: Yosef Moatti – IBM Research – BigDataStack Coordinator
Reading time: 5 minutes
1. The Data Ingestion Path
The Complex Event Processing (CEP) service can be applied both at the edge, where the data is created, and within the data center. At the edge, the CEP will typically be used to detect major faults such as defective sensors. If such problems are detected, the data will be flagged and alerts will be triggered. Within the data center, the CEP will typically be used to check rules related to the business logic, for instance checking that a vessel's average fuel consumption during the past hour is under a given threshold, or detecting wrong sensor values (e.g., if a car is moving, then either the fuel consumption is strictly positive or the gearbox is in “neutral”). A rule infringement points to a data inconsistency whose causes should be understood and fixed. In such scenarios, alarms will typically be created and sent to trigger corrective actions. The CEP meets the requirement for quick corrective action because it detects problems on streaming data before the data reaches a permanent data store.
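The two kinds of rules mentioned above can be sketched in a few lines of Python. This is only an illustration and not the BigDataStack CEP engine: the `Reading` fields, the `HourlyAverageRule` class and the alert strings are hypothetical names chosen for this example, and the threshold value is arbitrary.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Reading:
    timestamp: float  # seconds
    speed: float      # km/h
    fuel_rate: float  # litres per hour
    gear: str         # e.g. "neutral", "drive"


def check_consistency(r: Reading) -> bool:
    """Rule from the text: a moving car either burns fuel or is in neutral."""
    if r.speed > 0:
        return r.fuel_rate > 0 or r.gear == "neutral"
    return True


class HourlyAverageRule:
    """Checks that the average fuel rate over a sliding window stays under a threshold."""

    def __init__(self, threshold: float, window_seconds: float = 3600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._window = deque()
        self._total = 0.0

    def update(self, r: Reading) -> bool:
        """Add a reading; return True if the rule still holds."""
        self._window.append(r)
        self._total += r.fuel_rate
        # Evict readings that fell out of the time window.
        while r.timestamp - self._window[0].timestamp > self.window_seconds:
            old = self._window.popleft()
            self._total -= old.fuel_rate
        return self._total / len(self._window) <= self.threshold


def process(stream, rule):
    """Yield (reading, alerts) pairs before the data reaches permanent storage."""
    for r in stream:
        alerts = []
        if not check_consistency(r):
            alerts.append("inconsistent sensor values")
        if not rule.update(r):
            alerts.append("average fuel consumption above threshold")
        yield r, alerts
```

Because `process` is a generator, each record is checked as it arrives, which mirrors the idea of raising alarms on the stream rather than after the data has been stored.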
The CEP concept is not new. However, in BigDataStack, using the Data-driven Infrastructure Management services, we implemented an elastic service that can handle a data stream with variable load.
BigDataStack Data Cleaning Service
Once the data stream has gone through the CEP, the data can be permanently stored. However, the data owner will typically need to know to what extent each data record can be relied upon. This is where the BigDataStack data cleaning service comes in. It analyzes each record and adds an accuracy rate (in %) which quantifies the odds of the record being accurate. The accuracy rate is typically used when analyzing the data, and it permits more precise analysis by taking the accuracy of each record into account. One novel aspect of this service is that it is context independent: the cleaning service may be applied to any dataset. In a first phase a model is built, which can then be applied to incoming data records.
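The two-phase flow (build a model, then rate incoming records) can be illustrated with a deliberately simple sketch. The source does not describe the model used by the cleaning service, so everything below is an assumption: the per-field mean and standard deviation "model", the `max_z` cut-off and the `accuracy_pct` field name are hypothetical, and the resulting score is a heuristic rather than a calibrated probability.

```python
import statistics


def build_model(records):
    """Phase 1: learn per-field mean and standard deviation from historical records."""
    model = {}
    for field in records[0].keys():
        values = [r[field] for r in records]
        model[field] = (statistics.mean(values), statistics.pstdev(values))
    return model


def accuracy_rate(model, record, max_z=4.0):
    """Phase 2: score a record in [0, 100]; 100 means every field sits at its mean."""
    scores = []
    for field, (mean, std) in model.items():
        if std == 0:
            # Constant field in the history: only an exact match is trusted.
            scores.append(1.0 if record[field] == mean else 0.0)
            continue
        z = abs(record[field] - mean) / std
        # The score falls linearly with distance from the mean and is 0 beyond max_z.
        scores.append(max(0.0, 1.0 - z / max_z))
    return 100.0 * sum(scores) / len(scores)


def annotate(model, record):
    """Return a copy of the record with the accuracy rate attached."""
    return {**record, "accuracy_pct": round(accuracy_rate(model, record), 1)}
```

The sketch works on any dataset with numeric fields, which is a very reduced version of the context independence described above.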
Once the accuracy of the data has been rated, we want to store the data permanently. Here we encounter a classic big data dilemma: on the one hand, Object Stores are much cheaper than traditional databases; on the other hand, Object Stores cannot perform in-place data updates and do not support transactions. The following innovative data services of BigDataStack aim to address this dilemma:
- First, LeanXcale is an easy-to-use database, ready for operational and analytic workloads, that provides fast insertion and aggregation over real-time data. It is fully ACID and scales out linearly from a single node to hundreds of nodes. The BigDataStack project contributed to further improving this database by making its storage mechanism adaptable in a distributed environment. That is, LeanXcale can identify which of its data nodes are overprovisioned and, by splitting, merging and redeploying its datasets, balance the overall incoming workload and resource consumption. Furthermore, LeanXcale can be dynamically scaled out by requesting resources from the infrastructure and then splitting and redeploying its data. The whole process runs dynamically, without downtime and without sacrificing the data consistency level. This is the main innovation added within BigDataStack, as databases typically suffer from one of these issues (downtime or weakened consistency) when they scale out.
- However, LeanXcale is not meant to provide a solution when the size of the data is larger than 5-10 terabytes. For this reason, and because of the lower storage price per GB, the Object Store technology is needed as a complementary solution. However, splitting a dataset between a database and an Object Store is not a trivial task. In BigDataStack Innovation Potential we detailed our plans for the seamless analytical framework, whose goal is to federate the data of a given dataset across these two different datastores: the relational operational LXS datastore and object stores. We are now well on our way to implementing this component, after improving our design to solve race conditions and other problems.
- The primary advantage of the seamless component is that it exposes the LeanXcale database and the Object Store as a single logical data store. Fresh data is ingested into the LeanXcale database, and historical data slices (configurable per dataset) are seamlessly moved to the Object Store. Moreover, a dataset stored within the LeanXcale / Object Store federation (the seamless component) can also be queried seamlessly, without knowing which store holds the data.
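The idea of one logical store over two physical ones can be illustrated with a toy, in-memory sketch. It is not the BigDataStack seamless component: the `SeamlessStore` class, its method names and the `ts` record field are hypothetical, two Python lists stand in for the LeanXcale database and the Object Store, and the sketch is single-threaded, so it ignores the race conditions between data movement and queries that the real design has to solve.

```python
class SeamlessStore:
    """Toy federation of an operational store and an object store behind one interface."""

    def __init__(self, retention_seconds):
        # Per-dataset setting: how long a record stays in the operational store.
        self.retention_seconds = retention_seconds
        self.operational = []   # stands in for the LeanXcale database
        self.object_store = []  # stands in for the Object Store (append-only here)

    def ingest(self, record):
        """Fresh data always lands in the operational store."""
        self.operational.append(record)

    def move_historical(self, now):
        """Move records older than the retention period to the object store."""
        cutoff = now - self.retention_seconds
        old = [r for r in self.operational if r["ts"] < cutoff]
        self.object_store.extend(old)
        self.operational = [r for r in self.operational if r["ts"] >= cutoff]
        return len(old)

    def query(self, predicate):
        """Return matching records from both stores; the caller never names a store."""
        combined = self.object_store + self.operational
        return sorted((r for r in combined if predicate(r)), key=lambda r: r["ts"])
```

A caller only uses `ingest` and `query`; whether a given record currently lives in the operational list or in the object-store list is invisible to it, which is the property the federation aims for.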
2. The Data Query Path
In the data query path, BigDataStack provides two critical services to ease data exploration:
- As mentioned in the data ingestion path, the seamless component makes it possible not only to store but also to query the data seamlessly, without knowing where the data resides: only in the LeanXcale database, or in both the LeanXcale database and the Object Store.
- Naturally, a performance gap is expected between the two stores when analyzing the data (e.g., with SQL queries). To narrow this gap, we developed a technology called data skipping, which can significantly improve Spark SQL query performance. This technology has matured to the point where it has been announced as a beta for the IBM SQL Query service. It was also demonstrated at the Think 2019 conference over the Danaos use case.
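To give an intuition for data skipping, here is a minimal sketch based on per-object min/max metadata, which is one common form of the technique. The text above does not say which metadata the BigDataStack implementation keeps, so this choice is an assumption, and `StoredObject`, `build_index` and `query_greater_than` are hypothetical names. Every object is assumed to contain at least one row.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    name: str
    rows: list  # each row is a dict


def build_index(objects, column):
    """Record the min and max of `column` for every object (done once, at write time)."""
    index = {}
    for obj in objects:
        values = [row[column] for row in obj.rows]
        index[obj.name] = (min(values), max(values))
    return index


def query_greater_than(objects, index, column, threshold):
    """Return rows with row[column] > threshold, plus the names of the objects read."""
    results, scanned = [], []
    for obj in objects:
        _, max_value = index[obj.name]
        if max_value <= threshold:
            continue  # no row in this object can match, so it is never read
        scanned.append(obj.name)
        results.extend(r for r in obj.rows if r[column] > threshold)
    return results, scanned
```

The benefit depends on how the data is laid out: if values of the queried column are clustered within objects, most objects can be skipped, whereas a random layout leaves little to skip.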
Much more could be written about this technology; we refer the reader to this IBM blog for details. Also, this blog details how to lay out a dataset to achieve a good performance acceleration.