In today’s fast-paced world, patience seems to be a disappearing virtue: people no longer want to wait for anything. When Netflix takes too long to load or the nearest Lyft is too far away, users quickly turn to alternatives. The demand for instant results isn’t limited to consumer services like video streaming and ridesharing; it extends to data analysis, especially when serving users at scale and powering automated decision-making flows. The ability to provide timely insights, make informed decisions, and act immediately on real-time data is becoming increasingly important. Companies like Confluent and Target stay at the forefront of their industries in part because they leverage real-time analytics and data architectures that enable analytics-driven operations.
This blog post introduces real-time analytics for data architects who are beginning to explore the space: how to define it, and the preferred building blocks and data architectures commonly used to deliver it.
What exactly is real-time analytics?
Real-time analytics are defined by two fundamental characteristics: fresh data and fast insights. They are used in time-sensitive applications where new events must be turned into actionable insights within seconds.
Traditional analytics, by contrast, commonly known as business intelligence, are static representations of business data used primarily for reporting. These analyses run on data warehouses such as Snowflake and Amazon Redshift and are visualized with business intelligence tools such as Tableau or Power BI.
Unlike traditional analytics, which are based on historical data that can be days or weeks old, real-time analytics leverage fresh data and are deployed in operational workflows that require rapid responses to potentially complex queries.
For example, imagine a supply chain manager looking for historical trends in monthly inventory changes. In this scenario, traditional analytics is the ideal choice because the manager can afford to wait a few extra minutes for the report to process. Now consider a security team that needs to detect and diagnose anomalies in network traffic. This is where real-time analytics comes in: the SecOps team needs to analyze thousands to millions of real-time log entries with sub-second response times to identify patterns and investigate abnormal behavior.
Does the choice of architecture matter?
Many database vendors claim real-time analytics capabilities, and to some extent they deliver. For example, consider a weather monitoring scenario in which temperature readings are collected from thousands of weather stations every second, and queries include threshold-based alerts and trend analysis. SingleStore, InfluxDB, MongoDB, and even PostgreSQL handle this with ease: build a push API that sends the metrics directly to the database, then run a simple query.
So when does real-time analytics become complex? In the example above, the dataset is relatively small and the analysis is simple. With only one temperature event per station per second and a simple SELECT query with a WHERE clause to retrieve the most recent events, minimal processing power is required, making the workload manageable for any time-series or OLTP database.
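To make that concrete, here is a minimal sketch of the push-and-query pattern in Python, assuming a local PostgreSQL instance reachable with psycopg2; the weather_readings table, its schema, and the 35 °C alert threshold are illustrative assumptions rather than details from the original scenario.

```python
# Sketch: push a reading into the database and query recent events.
# The table, schema, and threshold are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=weather user=postgres host=localhost")
conn.autocommit = True

with conn.cursor() as cur:
    # Hypothetical schema for the weather scenario described above.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS weather_readings (
            station_id  TEXT,
            temp_c      DOUBLE PRECISION,
            observed_at TIMESTAMPTZ DEFAULT now()
        )
    """)

    # Stand-in for the push API: insert a reading as it arrives.
    cur.execute(
        "INSERT INTO weather_readings (station_id, temp_c) VALUES (%s, %s)",
        ("station-0042", 21.7),
    )

    # The simple SELECT ... WHERE query the text refers to: readings from
    # the last minute that cross an alert threshold.
    cur.execute("""
        SELECT station_id, temp_c, observed_at
        FROM weather_readings
        WHERE observed_at > now() - interval '1 minute'
          AND temp_c > 35.0
        ORDER BY observed_at DESC
    """)
    for row in cur.fetchall():
        print(row)

conn.close()
```

Nothing here requires a specialized engine; any time-series or OLTP database handles a workload of this shape comfortably.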
The real challenges arise, and databases reach their limits, as the volume of ingested events increases, queries become more complex and multi-dimensional, and datasets grow to terabytes or even petabytes. Although Apache Cassandra is often considered for high-throughput ingestion, its analytical performance may not live up to expectations. When the use case requires joining multiple real-time data sources at scale, alternative solutions need to be explored.
Here are some factors to consider to help determine the required specifications for the appropriate architecture:
- Do you ingest a high rate of events per second, from thousands to millions?
- Is it important to minimize the latency between an event being created and it being queryable?
- Is your total dataset large, not just a few GB?
- How important is query performance: sub-second or minutes per query?
- How complex are the queries: retrieving a few rows or running large aggregations?
- Is it important to avoid downtime in the data stream and the analytics engine?
- Are you trying to join multiple event streams for analysis?
- Do you need to correlate real-time data with historical data?
- Do you expect many concurrent queries?
If any of these considerations apply to your use case, let’s discuss the characteristics of the ideal architecture.
Building blocks
Real-time analytics require more than just a capable database. They start with the need to connect to, transmit, and process real-time data, which leads us to the first building block: event streaming.
1. Event streaming
In situations where real time is paramount, traditional batch-based data pipelines are simply too late, which is what gave rise to messaging queues. In the past, message delivery relied on tools like ActiveMQ, RabbitMQ, and TIBCO. The modern approach, however, is event streaming with technologies such as Apache Kafka and Amazon Kinesis.
Apache Kafka and Amazon Kinesis address the scalability limitations often associated with traditional messaging queues: they provide high-throughput publish/subscribe mechanisms that efficiently collect large-scale event streams from various sources (called producers) and distribute them to various targets (called consumers) in real time.
These systems seamlessly collect real-time data from a range of sources such as databases, sensors and cloud services, encapsulating it as event streams and facilitating its transmission to other applications, databases and services.
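As a concrete illustration, here is a minimal publish/subscribe sketch using the confluent-kafka Python client; the broker address, topic name, and consumer group are assumptions made for the example, not details of any particular deployment.

```python
# Minimal event-streaming sketch with the confluent-kafka Python client.
# The broker address, topic, and group id are illustrative assumptions.
import json
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"   # assumed local Kafka broker
TOPIC = "sensor-readings"   # hypothetical topic

# Producer: publish an event as soon as it is observed.
producer = Producer({"bootstrap.servers": BROKER})
event = {"station_id": "station-0042", "temp_c": 21.7}
producer.produce(TOPIC, key=event["station_id"], value=json.dumps(event))
producer.flush()  # block until the event is delivered

# Consumer: subscribe and process events one at a time.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "demo-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()
```

The same pattern scales from one reading per second to millions: producers and consumers are decoupled, so each side can be scaled out independently.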
Because of its impressive scalability (as evidenced by Apache Kafka’s support for over seven trillion messages per day at LinkedIn) and its ability to handle numerous concurrent data sources, event streaming has emerged as the dominant mechanism for delivering real-time data to applications.
Now that we’re able to collect real-time data, the next step is to explore how we can analyze it in real-time.
2. Real-time analytics database
Real-time analytics require a dedicated database that can take full advantage of streaming data from Apache Kafka and Amazon Kinesis and provide real-time insights. Apache Druid is exactly that database.
Because of its high performance and ability to handle streaming data, Apache Druid has become the database of choice for real-time analytics applications. With support for true stream ingestion and efficient processing of large amounts of data in a fraction of a second, even under heavy loads, Apache Druid excels at providing rapid insights into new data. Seamless integration with Apache Kafka and Amazon Kinesis further cements its position as the top choice for real-time analytics.
When choosing an analytics database for streaming data, considerations such as scale, latency, and data quality are critical. The ability to handle full-scale event streaming, ingest and correlate multiple Kafka topics or Kinesis shards, support event-based ingestion, and ensure data integrity in the event of disruptions are key requirements. Apache Druid not only meets these criteria but goes beyond them, providing additional functionality.
Druid was deliberately designed to excel at fast ingestion and real-time querying of events as they occur. It takes a unique approach to streaming data, ingesting events one at a time rather than relying on sequential batch files to simulate a stream, which eliminates the need for connectors to Kafka or Kinesis. In addition, Druid ensures data quality by supporting exactly-once semantics, guaranteeing the integrity and accuracy of ingested data.
Like Apache Kafka, Apache Druid was specifically designed for internet-scale event data. Its services-based architecture lets ingestion and query processing scale independently, allowing for near-limitless scalability. By mapping ingestion tasks to Kafka partitions, Druid scales seamlessly alongside Kafka clusters, ensuring efficient, parallel data processing.
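To show what connector-free stream ingestion looks like in practice, below is a trimmed sketch of a Kafka ingestion supervisor spec submitted to Druid’s supervisor API; the datasource, topic, dimensions, and host addresses are hypothetical, and a production spec would include a fuller dataSchema and a tuningConfig.

```python
# Sketch: register a Kafka ingestion supervisor with Druid.
# Datasource, topic, dimensions, and addresses are illustrative assumptions.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "sensor-readings",   # hypothetical datasource
            "timestampSpec": {"column": "observed_at", "format": "iso"},
            "dimensionsSpec": {
                "dimensions": [
                    "station_id",
                    {"name": "temp_c", "type": "double"},
                ]
            },
            "granularitySpec": {"segmentGranularity": "hour"},
        },
        "ioConfig": {
            "topic": "sensor-readings",        # hypothetical Kafka topic
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "taskCount": 2,  # ingestion tasks map onto Kafka partitions
        },
    },
}

# POST the spec; Druid then consumes the topic directly, event by event,
# with no separate connector process in between.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor",  # assumed router address
    json=supervisor_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the supervisor id on success
```

Raising taskCount (up to the number of Kafka partitions) is how ingestion parallelism scales alongside the Kafka cluster.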
It is becoming increasingly common for organizations to ingest millions of events per second into Apache Druid. For example, Confluent, the creators of Kafka, built their observability platform with Druid and successfully ingest over five million events per second from Kafka. This demonstrates Druid’s scalability and high performance in handling massive amounts of events.
However, real-time analytics goes beyond access to real-time data alone. To gain insight into patterns and behaviors, historical data must be correlated as well. Apache Druid excels here, seamlessly supporting both real-time and historical analytics through a single SQL query. Druid efficiently manages large amounts of data, up to petabytes, behind the scenes, enabling comprehensive, integrated analysis across time periods.
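As a sketch of what such a single query can look like, the following posts a Druid SQL statement to Druid’s SQL endpoint; the datasource and columns carry over from the hypothetical ingestion spec above.

```python
# Sketch: one Druid SQL query spanning freshly streamed and historical data.
# The datasource and columns are the same illustrative assumptions as above.
import requests

query = """
SELECT
  station_id,
  AVG(temp_c) AS avg_temp,
  COUNT(*)    AS readings
FROM "sensor-readings"
WHERE __time > CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY station_id
ORDER BY avg_temp DESC
LIMIT 10
"""

resp = requests.post(
    "http://localhost:8888/druid/v2/sql",  # assumed Druid router address
    json={"query": query},
)
resp.raise_for_status()
for row in resp.json():
    print(row)
```

The same statement transparently scans both freshly streamed rows and older historical segments, which is what makes combined real-time and historical analysis possible in one query.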
When all the pieces come together, it creates a highly scalable data architecture for real-time analytics. This architecture is the preferred choice of thousands of data architects when they need high scalability, low latency, and the ability to perform complex aggregations on real-time data. By leveraging event streaming with Apache Kafka or Amazon Kinesis combined with the power of Apache Druid for efficient real-time and historical analysis, organizations can gain robust and comprehensive insights from their data.
Case Study: Ensuring a Premium Viewing Experience – The Netflix Approach
Real-time analytics are a critical component of Netflix’s relentless pursuit of delivering an exceptional experience to over 200 million users, who collectively consume 250 million hours of content every day. With an observability application tailored for real-time monitoring, Netflix effectively monitors more than 300 million devices to ensure optimal performance and customer satisfaction.
Leveraging real-time logs generated by playback devices, seamlessly streamed through Apache Kafka and ingested event by event into Apache Druid, Netflix gains valuable insights and quantifiable metrics on user device performance during browsing and playback activities.
With a staggering throughput of over two million events per second and lightning-fast, sub-second queries over a massive dataset of 1.5 trillion rows, Netflix engineers can identify and investigate anomalies within their infrastructure, endpoint activity, and content flow.
Get real-time insights with Apache Druid, Apache Kafka, and Amazon Kinesis
If you’re interested in building real-time analytics solutions, I highly recommend exploring Apache Druid in conjunction with Apache Kafka and Amazon Kinesis.