In previous blogs in this series, I dug into advantages, disadvantages, and best practices related to Realtime vs Near-Realtime data processing and availability. In this final installment, I want to offer a few quick notes on data sources and data integrity related to these two solution types.
Realtime Dataset Options
Let’s start with a couple quick notes on Realtime datasets. Power BI supports several different Realtime data sources including push datasets, streaming datasets, and what’s known as a PubNub streaming dataset.
With a push dataset, data is pushed into the Power BI service, and the service creates a new underlying database to store that data. We can think of this as being very similar to archiving the data in the back end of Power BI.
That also creates the ability to look at a historical version of what’s happened over time. Once a report has been created using a push dataset, any visuals from that dataset can be pinned to a dashboard. Those visuals will update in Realtime whenever the data is updated. It’s like a trigger – within the service, the dashboard is triggered to refresh that tile when new data is received.
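As a rough illustration of what "pushing" means in practice, here is a minimal Python sketch that builds the REST request for adding rows to a push dataset table. The endpoint shape follows the Power BI REST API for push datasets; the dataset ID, table name, column names, and token below are placeholders I made up for the example, and the request is only constructed, not sent.

```python
import json
import urllib.request

# Root of the Power BI REST API for the signed-in user's workspace.
API_ROOT = "https://api.powerbi.com/v1.0/myorg"


def build_push_rows_request(dataset_id, table_name, rows, access_token):
    """Build (but do not send) the POST request that adds rows to a push dataset table."""
    url = f"{API_ROOT}/datasets/{dataset_id}/tables/{table_name}/rows"
    body = json.dumps({"rows": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
    )


# Placeholder values for illustration only; none of these come from a real tenant.
request = build_push_rows_request(
    dataset_id="00000000-0000-0000-0000-000000000000",
    table_name="SensorReadings",
    rows=[{"sensorId": "pump-01", "temperature": 71.3, "timestamp": "2024-01-01T12:00:00Z"}],
    access_token="<access-token>",
)
# Passing `request` to urllib.request.urlopen would actually send the rows.
```

Each successful call like this adds rows to the stored table, which is what lets pinned visuals refresh as new data arrives.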
Data is also being pushed into the Power BI service with a streaming dataset, but with one important difference: Power BI is storing the data in a temporary cache, which quickly expires. That temporary cache is used to display visuals, which have a transient sense of history. We may be able to see a little trending information (like a line chart with a time window of an hour), but it expires. We can’t look back over longer periods of time.
With the streaming dataset, there is no underlying database. You can’t build report visuals using the data that flows in from the stream. There is no filtering, and there are no custom visuals or other report functions that we would typically have in Power BI. The only way to visualize data coming from that streaming dataset is to add a tile to your dashboard that uses the streaming dataset as its data source. Those streaming tiles are optimized for displaying Realtime data quickly.
The result is that there’s very little latency between the time when the data’s being pushed into the Power BI service, and when that visual is being updated.
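To make the idea of a temporary, expiring cache more concrete, here is a small Python sketch of a rolling time window. This is only a conceptual model of "transient history" and not how Power BI implements its cache; the class name, the one-hour window, and the assumption that readings arrive in timestamp order are all mine.

```python
from collections import deque


class RollingWindow:
    """Keep only the readings that fall inside the most recent time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._points = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        # Assumes timestamps arrive in non-decreasing order.
        self._points.append((timestamp, value))
        self._expire(timestamp)

    def _expire(self, now):
        # Drop everything older than the window; it cannot be recovered later.
        cutoff = now - self.window_seconds
        while self._points and self._points[0][0] < cutoff:
            self._points.popleft()

    def values(self):
        return [value for _, value in self._points]


window = RollingWindow(window_seconds=3600)  # a one-hour window
window.add(0, 10.0)
window.add(1800, 12.5)
window.add(4000, 11.0)  # the reading from t=0 is now older than one hour
print(window.values())
```

Once a reading falls out of the window it is gone, which is why a streaming tile can show a short trend but cannot answer questions about last week.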
PubNub Streaming Dataset
The third scenario is PubNub. With a PubNub streaming dataset, the Power BI web client uses the PubNub software development kit to read an existing PubNub data stream. I won’t go into the details of building a PubNub data stream in this blog, but a PubNub dataset behaves like a traditional streaming dataset: it can only be visualized by adding a tile to the dashboard.
Each of these three types of Realtime datasets offers different advantages and update rates, so it is worth matching them against your Power BI use case.
Storage Approaches for High-Volume/High-Velocity
Of course, there are different concerns if you need to store high-volume or high-velocity data, rather than just have it displayed or temporarily stored in Power BI.
Persisting data in an Azure SQL DB instance is a solid and highly scalable option. Microsoft has invested heavily in scalability, and a single Azure SQL DB instance can be scaled up a long way. In many cases, that instance will suffice.
However, if we’ve got high-velocity, high-volume data, we will want to think about Azure Data Lake storage or Azure Blob storage. I prefer Azure Data Lake, especially the Gen2 capabilities. Azure Data Lake lets us store big data at a very low cost. It also provides a table structure, which is familiar to those used to working in Azure SQL DB or a regular SQL Server instance. We have many clients with IoT or IIoT Realtime scenarios who are persisting data in an Azure Data Lake storage environment.
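One common way to keep high-velocity data manageable in a lake is to land it in folders partitioned by source and time. The Python sketch below only computes such a path. The folder layout, the `device=`/`year=` naming, and the file name are hypothetical conventions chosen for illustration, not something Azure Data Lake requires.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_path(root, device_id, event_time):
    """Return a date-partitioned path for one device's readings (hypothetical layout)."""
    utc = event_time.astimezone(timezone.utc)  # partition on UTC to avoid DST gaps
    return (
        PurePosixPath(root)
        / f"device={device_id}"
        / f"year={utc:%Y}"
        / f"month={utc:%m}"
        / f"day={utc:%d}"
        / f"hour={utc:%H}"
        / "readings.jsonl"
    )


example = partition_path(
    "raw/telemetry",
    "pump-01",
    datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
)
print(example)
```

A layout like this lets downstream jobs read only the hours or devices they need instead of scanning everything that has ever been stored.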
Data Integrity with Cloud-Based Data Solutions
As a last comment, one of the questions we’ve received when we talk about data solutions is, “Have you seen any issues with the data and data integrity when it comes to cloud-based solutions?” Users can be concerned with keeping transactions in sync. Locking and blocking issues can happen in a transactional system when a record is being updated and we can’t load that same record into another data storage area at the same time. We can engineer around potential data integrity issues, but it does require an awareness of the capabilities and limitations of the data technologies being used.
One of the other keys to maintaining that integrity is implementing a framework that identifies where a potential data integrity issue has occurred – like where a transaction hasn’t loaded or where we’ve had a failure in processing. At Skyline, we implement different logging and auditing capabilities so we can watch for those issues and potentially quarantine the data if something comes up.
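As a very simplified picture of that kind of framework, here is a Python sketch of a batch load that validates each record, logs any problem, and sets bad records aside instead of loading them. It is an illustration only and not Skyline’s actual tooling; the required fields, the validation rules, and all of the names are assumptions made for the example.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("load_audit")

# Hypothetical rule: every record needs an id and a numeric amount.
REQUIRED_FIELDS = ("id", "amount")


@dataclass
class LoadResult:
    loaded: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)


def validate(record):
    """Return None if the record is acceptable, otherwise a short reason string."""
    missing = [name for name in REQUIRED_FIELDS if record.get(name) is None]
    if missing:
        return f"missing fields: {', '.join(missing)}"
    if not isinstance(record["amount"], (int, float)):
        return "amount is not numeric"
    return None


def load_batch(records):
    """Load valid records; log and quarantine the rest so they can be reviewed."""
    result = LoadResult()
    for record in records:
        reason = validate(record)
        if reason is None:
            result.loaded.append(record)
        else:
            logger.warning("quarantined record %r: %s", record.get("id"), reason)
            result.quarantined.append({"record": record, "reason": reason})
    return result


batch = [
    {"id": 1, "amount": 19.99},
    {"id": 2, "amount": None},
    {"id": 3, "amount": "n/a"},
]
result = load_batch(batch)
print(len(result.loaded), "loaded,", len(result.quarantined), "quarantined")
```

The point of the design is that a failed record never disappears silently: it is either loaded or it is sitting in quarantine with a reason attached.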