There are a few “hot topics” that come up in every data strategy, innovation or future-focused conversation, and none come up more frequently than real-time data. For some years now it’s been the talk of the town, and it is often (incorrectly) treated as a flex that shows how customer-centric and modern you are as an organisation.

“We want real-time data!”

“Let’s talk about streaming!”

“I hear that BigTechCompanyXYZ has an event-first data architecture”

Yes, they probably do. And chances are you aren’t burdened by the same expectations that forced them down that extremely complicated and costly technical path. That’s a good thing! Simplicity is great if you can get away with it.

The reality is that the cost and complexity that go with building and operating an effective real-time data architecture will, to be blunt, simply not make financial sense for the vast majority of organisations.

Ahh, but Mal, have you forgotten the prediction that 30% of all data generated will be real-time by 2025?

Nope. Remember that you can consume and utilise real-time data without having to keep it real-time, reducing complexity without sacrificing value. And this is more often than not the answer that we should be defaulting to, at least until the technology catches up and is as easy to use as batch tech.

Now that you know the likely answer, let’s delve into a bit of background as to how we should arrive there.

What is Real-Time Data?

If you ask this question point-blank you’ll get many answers - for some it’s <5 minutes, for others <0.5 seconds. For the sake of this article we’ll consider real-time data to be data that requires an action/response in a sub-second timeframe - the kind of latency that will force you to consider different technologies and architectural patterns. If your organisational requirements are closer to the 5 minute timeframe, you’re doing batch and you can stop reading here because you’ve already won.

An important distinction to call out is the difference between real-time data, streaming data and event data. These terms can sometimes be used interchangeably, but the difference between them is important.

Real-time data: This is a requirement more than a technical construct. It is the need to respond to or act on data “in the moment”, which typically means you’re measuring your response latency in seconds or less.

Streaming data: This is the emission and ingestion mechanism for much real-time data. Often associated with technologies like Kafka and its various cloud-flavoured equivalents, the term ‘streaming’ refers to the technical mechanism by which a real-time data requirement can be realised.

Event data: These are the atomic data packets that will most often flow through a streaming technology to meet a real-time requirement. Events (or signals) are self-contained pieces of data associated with a specific point in time. Using streaming data technologies these events can be ordered, replayed, stored or processed - but more on that in the next section.
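To make that concrete, here is a minimal sketch of what one such event might look like. The field names, values and schema are purely illustrative - assume whatever your own domain and serialisation format require.

```python
import json
import uuid
from datetime import datetime, timezone

# A self-contained event: everything a consumer needs in order to act on it,
# tied to a specific point in time. Field names here are illustrative only.
event = {
    "event_id": str(uuid.uuid4()),            # unique id, so the event can be deduplicated or replayed
    "event_type": "card_payment_authorised",  # what happened
    "occurred_at": datetime.now(timezone.utc).isoformat(),  # when it happened
    "payload": {                               # the self-contained facts needed to act on the event
        "card_token": "tok_abc123",
        "amount": 4.50,
        "currency": "AUD",
    },
}

# Events are typically serialised (e.g. to JSON or Avro) before being
# published onto a stream.
print(json.dumps(event, indent=2))
```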

Examples of Real-Time Data in Action

Now that we have level-set some of the terminology, let’s look at some real-time requirements already in use all over the world today:

Payments - as soon as you tap your card at a payment terminal, the transactions that flow back and forth between your payment provider and various financial institutions are events moving through a streaming pipeline with a real-time requirement, as latency matters to both the outcome of the transaction and the customer experience.

Streaming services for music or video - each video/audio file is broken up into chunks and delivered through a stream, allowing just-in-time stitching of those chunks on the user’s side to present fluid video/audio. If this weren’t built on a real-time architecture we’d have to wait for an entire movie to download before we could enjoy the first few minutes.

Online games - when an enemy fires an arrow in your direction you need immediate notification in order to respond before the consequences become unavoidable. This multi-way, real-time action-and-consequence loop requires large-scale streaming technologies to create a playable experience.

The list could go on, but all of these use cases have one fundamental thing in common: the value of the data diminishes dramatically if it is slowed down.

For a genuine real-time requirement, the value of the data MUST decrease significantly within seconds.

What are Real-Time Data Architectures?

There is one major architectural choice that you will need to make early on when introducing any level of real-time data into your data ecosystem, and that is the choice between a Lambda and a Kappa architecture.

I’m going to provide a surface-level explanation of both - if you want in-depth details please refer to one of the many reference architecture sites out there - here is one example from Microsoft.

Lambda Architecture

Lambda Architecture (Credit to Microsoft Reference Architecture)

The Lambda Architecture will fork your data down both a batch and a streaming pathway. All of your real-time requirements will be served directly from the streaming “hot path”, which is built only for those requirements (and is therefore deliberately kept light on data to limit expense and complexity). Optionally, at some point any real-time analytics or insights are pushed back into your “cold path” batch pathway so the two worlds are married up at batch latency.

Your “cold path” batch pathway will be used for any use cases that don’t require sub-second latency and this pathway will typically be richer in data as the processing is cheaper and simpler to maintain.

The technologies across the two pathways will be different - you may use Spark for your batch processing and Kafka / ksqlDB for your streaming workloads.

This “best of both worlds” architecture is a safe first step into the world of real-time, but the criticisms are obvious - you are maintaining two patterns, two technologies and two flows for the same data, with all of the downsides that come with that (cost, effort, synchronisation etc.).
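To make the “two flows for the same data” point concrete, here is a rough Python sketch of the Lambda fork. It assumes the kafka-python client, a locally reachable broker and made-up topic/file names - treat it as an illustration of the pattern rather than a reference implementation.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # assumes the kafka-python package is installed

# Hot path target: a Kafka topic serving sub-second consumers.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle_event(event: dict) -> None:
    # Hot path: publish immediately for real-time consumers
    # (fraud checks, customer notifications, etc.).
    producer.send("payments.events", event)

    # Cold path: append to a landing file that a batch job (e.g. Spark)
    # picks up later on its own schedule, where the richer processing lives.
    with open("payments_events.jsonl", "a") as landing_file:
        landing_file.write(json.dumps(event) + "\n")

handle_event({
    "event_type": "card_payment_authorised",
    "amount": 4.50,
    "occurred_at": datetime.now(timezone.utc).isoformat(),
})
producer.flush()
```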

Kappa Architecture

Kappa Architecture (Credit to Microsoft Reference Architecture)

The Kappa Architecture treats all data as streaming data, separating it into bounded streams (e.g. batch data) and unbounded streams (e.g. continuously flowing event data). This architecture has only one path for data processing and will integrate and enrich any master or reference data into your incoming data on the fly.
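As a rough illustration of the “everything is a stream” idea, here is a minimal PyFlink sketch that replays a bounded collection through a pipeline which would look identical if an unbounded source (e.g. a Kafka connector) were plugged in instead. The data, transformation and job name are made up - it’s a conceptual sketch, not production Flink.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Bounded stream: a finite batch of historical (order_id, amount) events,
# replayed through the same pipeline you would use for live data.
historical_events = env.from_collection([
    (1, 42.00),
    (2, 13.50),
    (3, 7.25),
])

# The single processing path. With Kappa there is no separate batch flow -
# any enrichment or aggregation is defined once, here.
historical_events \
    .map(lambda e: (e[0], round(e[1] * 1.1, 2))) \
    .print()

# For live data you would swap in an unbounded source (e.g. a Kafka
# connector); the pipeline definition above would not change.
env.execute("kappa_sketch")
```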

Whilst the Kappa architectural pattern is more elegant and simpler (one flow as opposed to two, one technology as opposed to two, no downstream synchronisation issues), the downsides lie in the complexity and cost of the underlying “streaming is everything” technologies. Having first-hand experience designing, deploying and operating an Apache Flink environment (one of the premier technologies for realising a Kappa architecture), I can say the complexity was several times higher, the operational costs were significantly greater (several orders of magnitude, in fact) and the engineering was also more challenging.

There will be a time in the future (probably the near future) when Kappa is the no-brainer architecture of choice, as ingesting and processing both bounded and unbounded streams will be equally painless. But as of writing this article, this architecture should be seen as pretty serious business and you should definitely know you will get value out of it before jumping in the deep end. Remediating a foundational architecture is expensive.

Will I get value from Real-Time?

This is the multi-million-dollar question. As in, if you get it wrong be prepared to sink many millions of dollars rectifying it.

My default answer is still to ignore the trendiness and to go batch-first as a matter of simplicity. When, and only when, you find that your batch latency is causing an unsolvable issue with your business model or customer experience should you entertain dipping your toes into a Real-Time Data Architecture.

As always, stand on the shoulders of giants - use the real-time data provided by others around you (social data, customer data, IoT data), but always consider whether there is any value loss in turning that data into batch before you use it. In my experience, 95%+ of the time the value lost is far outweighed by the expense you would incur processing and consuming that data in real-time.

Try batch. If it isn’t working for you, try micro-batch (a sketch of what that can look like follows below). If that isn’t working for you, reframe the problem and make sure you’re answering the right question. If you’ve done that and still come up with real-time as the answer, do a high-level cost/benefit analysis.
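For the micro-batch step, here is a hedged PySpark Structured Streaming sketch of what that middle ground can look like: the same real-time Kafka feed, but landed on a relaxed trigger. The topic, paths and interval are illustrative, and the Kafka connector package needs to be available to Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro_batch_sketch").getOrCreate()

# Read the real-time feed...
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "payments.events")
    .load()
)

# ...but only land it every five minutes: near-real-time value,
# batch-level operational simplicity.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "lake/payments_events/")
    .option("checkpointLocation", "checkpoints/payments_events/")
    .trigger(processingTime="5 minutes")
    .start()
)

query.awaitTermination()
```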

If you’ve done all of that and you’re convinced that real-time data architecture is for you - strap yourself in and enjoy the ride. The technology is awesome, boundaryless, and extremely finicky and frustrating - you will go through a learning journey and wear some scars…but for the (very small) number of use cases where it truly is required, it can reap enormous benefits.