In the previous two posts in this series, I focused on the organizational and cultural factors required to become a faster and more insightful data company. In the next two posts, I will outline technology trends that we at RStor are seeing in companies that are actively pursuing, and successfully building, a faster and more insightful data culture. Many technologies contribute to delivering rapid insights; this installment focuses on three that are instrumental:
- Data stores, including data lakes
- In-memory databases and analytics
- Data streaming technology
The Pursuit of Real Time – Options for Faster Data
To support near-real-time data warehousing, organizations will often use either change data capture (CDC) to update changed data in the warehouse or data virtualization to speed access to sources without having to move or replicate the data to a central physical store. Recent trends indicate that more organizations are using data virtualization or federation now than just two years ago.
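To make the CDC idea concrete, here is a minimal sketch of applying a batch of change events to a warehouse table so that only the changed rows are touched. The `ChangeEvent` shape, the `apply_changes` helper, and the sample rows are hypothetical illustrations, not the format of any particular CDC product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    """Hypothetical change record captured from a source system."""
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the changed row
    row: Optional[dict] = None   # new row values (None for deletes)


def apply_changes(warehouse: dict, events: list) -> dict:
    """Apply change events, in order, to a table keyed by primary key."""
    for event in events:
        if event.op in ("insert", "update"):
            # Upsert: only the changed row is rewritten, not the whole table.
            warehouse[event.key] = dict(event.row)
        elif event.op == "delete":
            warehouse.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return warehouse


# Assumed sample data for illustration only.
warehouse = {1: {"customer": "Acme", "balance": 100}}
events = [
    ChangeEvent("update", 1, {"customer": "Acme", "balance": 250}),
    ChangeEvent("insert", 2, {"customer": "Globex", "balance": 75}),
    ChangeEvent("delete", 1),
]
apply_changes(warehouse, events)
print(warehouse)
```

In practice the events would come from a database log or a CDC tool rather than a hand-built list, and ordering and delivery guarantees would need careful attention.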
Other organizations are using an operational data store (ODS), a system that typically complements a data warehouse by providing users with access to a selected, trusted set of integrated near-real-time data, usually for operational reporting and notification. Use of the ODS appears to be declining, although it remains significant.
Data lakes remain popular. Increasingly stored in the cloud, they are flexible platforms that can contain any type of data. Organizations use them to consolidate data from multiple sources, which could include operational, time-series, and near-real-time data. However, unlike an ODS, a data lake is typically set up for exploratory analytics and AI/machine learning that look for patterns and other insights. Some organizations create operational data lakes or set up portions of their data lake for fast SQL queries on big data. Organizations can also develop templates and preconfigured views of selected operational data for consistent and repeatable reports or for developing OLAP cubes.
In-memory analytics and databases have been used in the enterprise for more than a decade and today their use is becoming more common.
By reducing the need to read data stored on disk, in-memory platforms can enable faster access to data for visualization, data exploration, and model testing. As larger random access memory (RAM) caches become available, organizations are able to keep more “hot” data available for computation. Technologies are evolving to make it possible to store entire data warehouses, data marts, or OLAP cubes in memory. Commercial as well as open source solutions using Apache Spark or the more recent Apache Ignite can support in-memory analytics and database systems. They can also support streaming workloads.
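As a small illustration of keeping "hot" data in RAM, the sketch below loads a few rows into an in-memory database and runs an aggregate query without touching disk. It uses SQLite's `:memory:` mode from the Python standard library purely as a stand-in; it is not Spark or Ignite, and the `sales` table and its values are assumptions made for the example.

```python
import sqlite3

# ":memory:" creates a database that lives entirely in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Assumed sample rows for illustration only.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0)],
)

# Aggregate directly against the in-memory table.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
print(totals)
conn.close()
```

The same pattern scales up conceptually: a distributed in-memory platform spreads the table across the RAM of many nodes, but the query still avoids disk reads on the hot path.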
Organizations are implementing streaming and event processing technologies.
One of the more popular solutions in use by many organizations is Apache Kafka. Now an established platform for distributed streaming of large numbers of events, Kafka began as a messaging system optimized for performance. Organizations are using Kafka to build streaming data pipelines for automated applications that must react to real-time data or make that data available for analytics and machine learning. Kafka can also be a source for Spark Streaming, a module of Apache Spark. Spark Streaming allows organizations to integrate a variety of workloads, including streaming, on the same processing platform, which can reduce programming and modeling complexity.
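The kind of computation a streaming pipeline performs can be sketched without a broker. The example below counts events per key in fixed, non-overlapping (tumbling) time windows and emits each window's counts as soon as the window closes. The `event_stream` generator is a hypothetical stand-in for a Kafka topic, the events are made up, and the sketch assumes events arrive in timestamp order, which real streams do not guarantee.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event key)


def event_stream() -> Iterator[Event]:
    """Hypothetical stand-in for a topic; yields events in timestamp order."""
    yield from [
        (0.5, "click"), (1.2, "view"), (4.9, "click"),
        (5.1, "click"), (9.0, "view"), (12.3, "click"),
    ]


def tumbling_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (window_start, counts_by_key) each time a window closes."""
    current_window = None
    counts: Dict[str, int] = defaultdict(int)
    for ts, key in events:
        window = int(ts // window_seconds)
        if current_window is not None and window != current_window:
            # A later window has started, so emit the finished one.
            yield current_window * window_seconds, dict(counts)
            counts.clear()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        # Flush the last, still-open window when the stream ends.
        yield current_window * window_seconds, dict(counts)


results = list(tumbling_counts(event_stream(), window_seconds=5.0))
for window_start, window_counts in results:
    print(window_start, window_counts)
```

A production pipeline would read from a real broker and handle late or out-of-order events, typically through the windowing features of a stream processing engine rather than hand-written loops.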
Clearly, there is no single technology approach to managing and analyzing near-real-time or true real-time data, including data streams. Organizations need to define their requirements for data freshness and for the scale, speed, and volume of their data flows. They should also assess their in-house skill sets in data management, data engineering, and data science to determine whether they would benefit from working with a multi-cloud service provider. Customers today have a rich set of choices, with a number of data management automation, cloud, and SaaS options available. Organizations should begin with proofs of concept (POCs) and test applications on smaller, well-defined projects. The data experts at RStor are one such starting point.