In this blog series, we are looking at the matchup between Near-Realtime and Realtime data scenarios. What considerations go into handling those types of data sources? How do we process them, and how do we provide data visualizations based on those data sources that meet the needs of our audiences?
Realtime vs Near-Realtime Overview
Before we dive in, let’s get an overview of those Realtime data scenarios. By that, I mean scenarios where we need to evaluate data that’s in-flight so we can make determinations or predictions that could result in an intervention from a business standpoint.
For example, we might need to evaluate credit card transactions to attempt to detect fraud and then prevent that fraudulent sale from occurring. In those cases, it’s going to be very important that we look at that data in a Realtime scenario.
A Near-Realtime case is one where having up-to-date data available is a priority, but we don’t have that requirement to intervene. In those cases, we want to make sure our end users see “up-to-date” data that they recently entered in business systems reflected in their data visualizations and reports. By “up-to-date” we mean data that’s from a couple of minutes to fifteen minutes (or so) old. We may want some additional capabilities along with that Near-Realtime case (which we will cover later in this blog series).
In general, the need for Realtime versus Near-Realtime data largely depends on the cost of delayed action. Where that delayed action of a couple minutes to an hour could result in significant cost, then we want to apply Realtime data processing to those scenarios. If the cost of delayed action is not high, but we want additional analytical capabilities, we might choose a Near-Realtime scenario.
In the Microsoft space, there are a couple of different ways to provide Realtime and Near-Realtime capabilities. We show some examples of these in the infographic below. You can see in the top layer how we can use Realtime data processing, a Realtime pipeline that leverages tools like Event Hubs and Stream Analytics to move data quickly through to Power BI from your devices. Those could be IoT devices or even IIoT (Industrial Internet of Things) data from machines on the shop floor.
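As a rough illustration of that top-layer flow, here is a minimal in-memory sketch. It stands in for the real pipeline (no actual Event Hubs, Stream Analytics, or Power BI calls); the device name, field names, and alert threshold are all made up for illustration:

```python
import time
from typing import Iterator

def device_readings() -> Iterator[dict]:
    """Simulate IoT telemetry arriving on a stream (a stand-in for Event Hubs)."""
    samples = [72.1, 73.4, 98.7, 71.9]  # temperature readings, one bad spike
    for i, temp in enumerate(samples):
        yield {"device_id": "press-01", "seq": i, "temp_f": temp, "ts": time.time()}

def stream_process(events, high_temp=95.0):
    """Stand-in for a Stream Analytics query: flag events as they pass through."""
    for event in events:
        event["alert"] = event["temp_f"] > high_temp
        yield event  # downstream, flagged events would feed a Power BI streaming dataset

alerts = [e for e in stream_process(device_readings()) if e["alert"]]
print(len(alerts))  # → 1
```

The key property of this shape is that each event is evaluated as it moves through; nothing waits on a batch load.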
When you get into the two scenarios in the lower section, you may want to use different types of storage mechanisms like Azure Data Lake Store or Azure SQL Data Warehouse (recently rebranded as Azure Synapse Analytics) in conjunction with Azure Data Factory to do some processing. Those scenarios often include data coming out of ERP systems, point-of-sale systems, and manufacturing execution systems. They typically have that more Near-Realtime analysis capability that we want to enable.
This is just a high-level overview. The blog from here on is going to dig into some of the specific advantages and disadvantages of Realtime and Near-Realtime.
Pros of Near-Realtime Data Processing
First, let’s talk about some of the pros of Near-Realtime data processing and evaluation.
One of the big advantages to using Near-Realtime data processing is the ability to persist data, meaning store it somewhere that’s not coming directly off the stream. That also provides us the ability to combine that data with data from other systems or other historical data to help us look at trends. We can incorporate all of that into our data modeling in those Near-Realtime scenarios.
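The persistence idea above can be sketched with an in-memory SQLite table standing in for a data lake or warehouse table (the table, store, and figures are illustrative):

```python
import sqlite3

# In-memory database standing in for persisted storage (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, day TEXT, amount REAL)")

# Historical data already persisted from earlier loads...
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("north", "2024-01-01", 100.0), ("north", "2024-01-02", 120.0)])

# ...plus a Near-Realtime batch that just landed from the source system.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("north", "2024-01-03", 90.0)])

# Because everything is persisted, we can trend across the combined history.
total, days = conn.execute(
    "SELECT SUM(amount), COUNT(DISTINCT day) FROM sales WHERE store = 'north'"
).fetchone()
print(total, days)  # → 310.0 3
```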
A second capability that’s enabled in Near-Realtime is the ability to look at larger windows of time to do historical analysis and possibly to do even more complex analysis: what we call cross-domain analysis. Those could be scenarios where we’re looking at how many orders we’ve taken that have ultimately converted into completed sales – invoices, for example. Even further upstream we could look at how many opportunities we’ve converted out of our CRM system into those orders and invoices.
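That cross-domain funnel can be sketched as a couple of conversion-rate calculations over records pulled from different systems (the record counts and field names here are hypothetical):

```python
# Hypothetical records pulled from CRM, order, and invoicing systems.
opportunities = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
orders = [{"id": 10, "opp_id": 1}, {"id": 11, "opp_id": 3}]
invoices = [{"id": 100, "order_id": 10}]

# Conversion rates across domains: opportunity → order → invoice.
opp_to_order = len({o["opp_id"] for o in orders}) / len(opportunities)
order_to_invoice = len({i["order_id"] for i in invoices}) / len(orders)
print(opp_to_order, order_to_invoice)  # → 0.5 0.5
```

This kind of join across CRM, ERP, and billing data is only practical once all three sources have been persisted and modeled together.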
Another potential advantage of Near-Realtime is that we can still have a very high speed related to how we refresh that data and make it available. Even though it may not be second or sub-second latency, we can still have a very tight time window as far as how we refresh that data – think minutes.
Cons of Near-Realtime Data Processing
These are some strong advantages, but Near-Realtime data processing also brings a couple of potential drawbacks to the table: latency and the fact that we can only see historical data.
We typically see Near-Realtime latency as 5-15 minutes or longer. That’s due to the need to first persist the data and then process it. Persisting the data may require bringing it together from multiple data sources. Every time we have to perform some kind of extract, and then ultimately process that data (whether through an ETL process or a model processing step), that introduces latency. For very large models, the typical 5-15 minute latency could stretch into the 20-30 minute range.
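One way to see how that latency accumulates is to add up the stages of a single refresh cycle. The stage names and timings below are purely illustrative:

```python
# Illustrative stage timings (in minutes) for one Near-Realtime refresh cycle.
stages = {
    "extract_from_sources": 3.0,   # pull data out of ERP / POS / MES systems
    "etl_transform": 5.0,          # cleanse, conform, and load into storage
    "model_process": 4.0,          # refresh the analytical model
}
end_to_end = sum(stages.values())
print(end_to_end)  # → 12.0  (lands within the typical 5-15 minute band)
```

Each additional source or transformation step adds its own term to that sum, which is how very large models drift toward the 20-30 minute range.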
I have historical data listed as both a pro and a con. The advantage is that we get historical analysis, and we can see those longer trends. What we don’t get (because it’s historical) is the ability to do that immediate intervention. We have to evaluate the data after an event has already occurred. We can’t get that snap-action type of Realtime intervention that we might want with data that’s coming across the wire.
Pros of Realtime Streaming
Near-Realtime data processing has a lot of good points, but some of those disadvantages make it necessary to look at other options. Below, I’ve laid out some of the potential advantages of Realtime streaming data.
First, we get very low latency: that is second or sub-second data availability and visualization based on that data. That has tremendous uses in cases where immediate decision-making and action are required. For example, when a shop floor operator or manager is tracking part quality or potential machine failures, seconds and minutes count and could be the difference between thousands of dollars of scrapped parts or damaged equipment.
Unlike when you are working with Near-Realtime data, with Realtime we can do that intervention. We can perform in-flight transaction evaluation, make recommendations, and potentially change outcomes. This makes things possible like personalized marketing based on previous choices in an e-commerce system. This intervention option is very important when there’s a high value to immediate feedback or (conversely) a high cost of delayed action.
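A minimal sketch of in-flight evaluation might look like the following. The rule, threshold, and amounts are made up for illustration; a real fraud system would use far richer models:

```python
def evaluate_transaction(txn, recent_amounts, max_ratio=5.0):
    """Flag a card transaction in flight if it dwarfs the cardholder's recent
    spend. The rule and threshold are illustrative, not a production model."""
    baseline = sum(recent_amounts) / len(recent_amounts)
    return "decline" if txn["amount"] > max_ratio * baseline else "approve"

history = [25.0, 40.0, 35.0]  # recent spend; average ≈ 33.3
print(evaluate_transaction({"amount": 500.0}, history))  # → decline
print(evaluate_transaction({"amount": 60.0}, history))   # → approve
```

The point is that the decision happens before the transaction completes, which is exactly what the Near-Realtime path can’t offer.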
Cons of Realtime Streaming
Like Near-Realtime, Realtime streaming has some drawbacks as well.
No Summarized Data
We can’t summarize data. In Realtime scenarios, if we’re doing any aggregation at all, we’re doing it over very small windows of time. We’re not capturing those individual data points for aggregation. There are some potential ways we can combine these two forms of processing (which I’ll cover more in a later blog), but the lack of summarized data can certainly be a Realtime drawback.
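Those small windows of time are typically tumbling windows, which can be sketched as follows (the window size and event timestamps are illustrative):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=10):
    """Aggregate a stream over small fixed windows. Only per-window totals
    survive; the individual data points are not retained."""
    counts = defaultdict(int)
    for ts, _value in events:
        counts[ts // window_seconds] += 1
    return dict(counts)

# (timestamp_seconds, payload) pairs arriving on the stream
events = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (25, "e")]
print(tumbling_window_counts(events))  # → {0: 2, 1: 2, 2: 1}
```

Once a window closes, its events are gone; there is no way to go back and re-aggregate them over a month or a quarter.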
Lack of Persistence
Another disadvantage of Realtime is that streaming data isn’t persisted for deeper analysis. Events pass through the pipeline and are gone unless we explicitly route them to storage.
No Complex Calculations
Lastly, from a calculation standpoint, it can be difficult to add even moderately complex calculation logic. It’s often very difficult, for example, to evaluate a transaction against averages computed over other transactions that have already moved through our system; in many cases, it’s effectively unsupported. Some calculations are supported, but even moderately complex calculation logic becomes difficult to do in a Realtime scenario.
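The distinction comes down to how much state a calculation needs. A running mean fits streaming because it needs only constant state per event; an exact median does not, because it needs every prior value. A small sketch (values are illustrative):

```python
class RunningMean:
    """A streaming-friendly aggregate: constant state, updated per event."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, x):
        self.count += 1
        self.total += x
        return self.total / self.count

# The mean updates event-by-event; an exact median over the same stream
# would require retaining every prior transaction, which a pure
# streaming pipeline doesn't keep.
rm = RunningMean()
means = [rm.update(x) for x in [10.0, 20.0, 60.0]]
print(means[-1])  # → 30.0
```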
In summary, choosing the right option depends on your use cases. Either way, solutions exist to address these scenarios at reasonable costs. If you need help evaluating specific scenarios and related technology approaches to manage them, please feel free to reach out.