The Weather Company (TWC) collects weather data across the globe at the rate of 34 million records per hour, and the TWC History on Demand application serves that historical weather data to users via an API, averaging 600,000 requests per day. Users are increasingly consuming large quantities of historical data to train analytics models, and require efficient asynchronous APIs in addition to existing synchronous ones which use ElasticSearch. The Weather Company (TWC) presented its architecture for asynchronous data retrieval and explained how they use Spark together with leading edge technologies to achieve an order of magnitude cost reduction while at the same time boosting performance by several orders of magnitude and tripling weather data coverage from land only to global.
The Weather Company (TWC) used IBM Cloud SQL Query, a serverless SQL service based on Spark, which supports a library of built-in geospatial analytics functions, as well as (geospatial) data skipping, which uses metadata to skip over data irrelevant to queries using those functions. They covered best practices they applied including adoption of the Parquet format, use of a multi-tenant Hive Metastore in SQL Query, continuous ingestion pipelines and periodic geospatial data layout and indexing, and explain the relative importance of each one. They also explained how they implemented geospatial data skipping for Spark, cover additional performance optimizations that were important in practice, and analyze the performance acceleration and cost reductions they achieved.
Speakers: Paula Ta-Shma and Erik Goepfert