- Real-time data/stream processing is attractive for data-driven applications with low latency requirements such as network monitoring and real-time price analysis applications.
- An in-memory distributed is an ideal platform for the development of real-time stream processing applications due to extremely fast data access and linear scalability.
- Modern distributed caching solutions facilitate real-time communication for .NET and .NET Core stream processing applications through Pub/Sub messaging, SQL and LINQ searching, events, and continuous query.
- NCache is a .NET/.NET Core distributed cache that helps such applications to process fast and scale out with ease.
Real-Time Data/Stream Processing with Distributed Caching
With technological advancement, data-driven applications with low latency requirements are emerging rapidly. Real-time stream processing applications in .NET/.NET Core are particularly attractive since they quickly process huge amounts of incoming data from various sources before it is stored in the database. This allows quick decision-making for businesses and in turn, reduces the response time for data events in an application.
In-memory storage and computing: the de facto standard
Real-time stream processing is used for a wide range of business applications that deal with incoming data at high volume and require immediate analysis. For example, e-commerce, risk management, fraud detection, network monitoring, real-time price analysis applications, etc. The applications that work with data streams crave storage and processing to record large streams of data and perform computations. The legacy databases aren’t fast enough to meet the real-time processing needs and easily become a performance bottleneck when it comes to scalability.
Distributed caching paradigm
Recently, distributed caching has emerged as a promising paradigm for in-memory storage and computing. A distributed cache is a system where random-access memory (RAM) of multiple networked computers is pooled into a single in-memory cache cluster for providing fast access to data. With the distributed architecture, a distributed cache can grow beyond the memory limits of a single physical server. Hence, the capacity and processing power are aggregated by avoiding any single point of failure.
Distributed caches are particularly useful in environments with high transaction data and load. The distributed architecture attributes scaling by adding more physical servers to the cluster and thus allows the cache to scale with an increase in data. The popular distributed caches used in the industry are Eh-cache, Memcache, Redis, Riak, Hazelcast, and NCache.
Rethinking stream processing with distributed caching
In-memory distributed caching is an ideal platform for real-time stream processing applications. With different streaming modes in a cache you can easily:
- Add or update cache data with streams in write mode.
- Read data from a stream with or without locking.
- Write data in parallel to read operations without locking.
- Read data simultaneously for multiple clients with locking.
Here, we discuss why and how distributed caching compliments real-time stream processing applications.
Distributed caching features benefiting stream processing
Let us see what specific features in a distributed cache assist real-time stream processing applications.
A typical stream processing app consists of various several producers including data ingestion, integration, and real-time analytics. The underlying responsibilities are handled by different modules which are usually codependent. Hence, these modules need to communicate with each other to collaborate. Pub/Sub messaging feature allows asynchronous microservice communication while maintaining decoupling. Therefore, it can enable real-time information sharing on a large scale for stream processing applications.
The performance of Pub/Sub messaging can be scaled when it is integrated with a distributed in-memory cache that acts as a messaging bus. The basic working of the Pub/Sub mechanism is something like this: A Topic is created inside the cache. A publisher publishes messages on this Topic. Client applications (also known as subscribers) make a subscription against this Topic to receive the messages published on it by the Publisher. The subscription made by the subscribers against a topic is of two types:
- Durable: Subscribers can receive messages intended for them even after a period of disconnection. Meaning, the messages are maintained for them.
- Non-Durable: Subscribers receive messages from the topic as long as they are connected to the network. Once a subscriber leaves the network and then, later on, re-joins the network, it will not receive messages intended for it during that disconnection phase. Meaning, messages for a subscriber aren’t maintained.
Let's assume an e-commerce application, that processes thousands of customers each day for their online purchases. The customers can belong to different categories based on their purchases. To make the processing of customers efficient, the unfiltered customers are categorized and filtered as Gold, Silver, and Bronze customers based on the number of their orders. Using the Pub/Sub messaging feature of a cache, a Publisher can create a topic Orders and publish relevant updates to it. Meanwhile, a subscriber can subscribe to the topic for order information to further process the customers.
Searching with SQL & LINQ in Cache
Stream processing applications handle the aggregation and filtering of huge datasets. Similar to a database, caches also offer a searching mechanism where you can fetch the cache keys/data using SQL-like query syntax while using a rich set of operators. Hence, you can bring SQL querying capabilities to your data streams.
LINQ or Language Integrated Query is a Microsoft .NET component that adds data querying features to .NET languages. Syntax-wise it’s quite similar to SQL. LINQ can be integrated seamlessly into a cache via a LINQ provider. The LINQ provider facilitates the execution of LINQ queries over the distributed cache while improving the application’s performance without changing the LINQ object model.
The stream processing applications work on data in motion where actions need to be performed on the data flow as soon as possible. Meanwhile, the underlying events generated need to be handled efficiently since the real-time response is the fundamental requirement here. A distributed cache seamlessly enables real-time notifications through events. These notifications are sent asynchronously to the client so there is no overhead on the client’s activities.
It allows the client applications to register events for cache activities of their interest. This facilitates run-time data sharing when a single cache is shared by separate applications or separate modules of the same application. Hence, the client applications can monitor the state of their data and perform actions accordingly. You have the flexibility to register for cache-level events, item-level events, and management-level events. Moreover, you can either get data or metadata along with the event by using event filters in a cache.
In cache level events, the registered clients can be notified of the addition, insertion, and removal of any cache item. On the other hand, item-level events can be registered for specific keys or chunk of keys. In that case, the cache is responsible for monitoring changes for the specified key/keys, and notifications are triggered when data against those keys are altered. As the name indicates, management level events can be registered for notifying when any management operation such as cache clear and cache stop are performed.
Continuous Queries (CQ)
A typical stream application consists of several producers generating new events and a set of consumers processing these events. In this regard, there is a need to monitor a Window of time for specific types of data additions and changes. Cache provides Continuous Query for runtime monitoring and sharing of data related to an observable data set. This allows applications to be notified of all changes that occur within the queried result set in the cache.
Scalable distributed stream processing
Designing scalable applications is crucial while working with streaming data. An in-memory distributed cache is extremely fast due to the storage mode and underlying architecture. And, it also provides linear scalability due to which it never becomes a bottleneck for your .NET / .NET Core Stream Processing application performance even under peak loads.
While handling incoming data from various sources and in varying formats, your system should be able to tolerate disruptions from a single point of failure. A distributed cache provides a self-healing peer-to-peer clustering architecture that has no single point of failure. Additionally, caches implement intelligent replication strategies with minimum replication cost so there is no data loss in case a cache server goes down.
NCache is the only truly native .NET/.NET Core distributed cache available in the market. All others are developed in C++ on Linux and later ported to Windows with limited compatibility of .NET. NCache fits nicely into your .NET / .NET Core application stack and simplifies your cost of development and maintenance. Meanwhile, it provides different streaming modes, a dedicated Pub/Sub store, event notifications, cache searching with SQL queries, and LINQ. It's an ideal choice for your .NET/.NET Core stream processing applications.