If you are evaluating high-performance NoSQL solutions such as Apache Cassandra, or, in rarer cases, caching solutions such as Memcached or Ehcache, it's possible that your best choice is Hazelcast. Hazelcast takes a considerably different approach from any of the above projects, and yet for some classes of users looking for a key-value store, it may be the best option.
So what is the Hazelcast approach and how is it different from the above solutions?
Hazelcast is an in-memory data grid (IMDG), not a NoSQL database. What does this mean in practice?
Pros and Cons of Going In-memory
First of all, your key-value store (the Map) lives in RAM. This has some natural advantages and disadvantages. The most obvious advantage is raw speed. That speed comes with tradeoffs, however.
On the downside, we face two new problems: scalability and volatility. Available RAM is almost always smaller than available disk space, and RAM is volatile, subject to data loss when servers crash or otherwise fail.
More and more deployments are seeing the advantage of ever-larger RAM sizes at ever-lower cost, and Hazelcast IMDG exploits this trend while mitigating its shortcomings. We keep hearing the phrase “RAM is the new disk, and disk is the new tape.”
Scalability: Size of RAM vs Disk
How does Hazelcast overcome the problem of scalability? The first answer to this is through clustering. By adding Hazelcast IMDG, a small Java library (about 2.6MB) to any JVM in the network, the processor and RAM resources of that node become available to the network, with the goal of linear scalability.
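Embedding the library is simply a build-dependency addition; for example, with Maven (the version number here is illustrative, not prescriptive):

```xml
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast</artifactId>
    <!-- pick the release appropriate for your deployment -->
    <version>3.6</version>
</dependency>
```

Once the jar is on the classpath, starting a Hazelcast instance in any JVM makes that node's RAM and CPU part of the cluster.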
By joining hundreds of nodes in a cluster, you can aggregate terabytes of RAM to accommodate your map in memory. This lets you keep very large maps entirely in memory. Of course, disks can hold up to petabytes, so depending on your use case, Hazelcast may be appropriate standalone or in conjunction with a back-end data store.
Volatility: RAM vs Disk
How does Hazelcast overcome the problem of volatility? First and foremost, Hazelcast uses peer-to-peer data distribution to provide no single point of failure.
By default, Hazelcast stores each entry in two locations in the cluster (a primary and one backup), so any node can spontaneously go down with no loss of data, and the application can continue operating with no apparent change.
As you scale out and increase the number of nodes, the resiliency of the network increases, as the exposure to the loss of any single node shrinks. The number of backup copies of data in the network can be tuned, but of course carrying many backups also reduces the overall working memory available.
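As a sketch of that tuning, the backup count can be configured per map in hazelcast.xml (the map name and values here are illustrative):

```xml
<map name="customers">
    <!-- keep two synchronous backup copies of every entry -->
    <backup-count>2</backup-count>
    <!-- additional asynchronous backups, if any -->
    <async-backup-count>0</async-backup-count>
</map>
```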
Hazelcast provides many ways to tune the availability and reliability of your distributed Map, including backup groups: one rack of servers can be set to back up another rack, so the failure of an entire group can be managed gracefully. It even provides WAN replication to survive the loss of an entire location. Finally, the grid provides write-through and write-behind paradigms for persisting data to disk storage as needed.
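The write-through versus write-behind choice is typically expressed in the map store configuration: a write delay of 0 means every update is written to the backing store synchronously (write-through), while a positive delay batches writes asynchronously (write-behind). A sketch, where com.example.CustomerMapStore is a hypothetical user-supplied implementation of Hazelcast's MapStore interface:

```xml
<map name="customers">
    <map-store enabled="true">
        <!-- hypothetical class implementing the MapStore interface -->
        <class-name>com.example.CustomerMapStore</class-name>
        <!-- 0 = write-through; a positive value = write-behind -->
        <write-delay-seconds>60</write-delay-seconds>
    </map-store>
</map>
```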
In dynamically scalable partitioned storage systems, whether a NoSQL database, a filesystem, or an in-memory data grid, changes in the cluster (adding or removing a node) can trigger large data movements across the network to re-balance the cluster. Re-balancing is needed for both primary and backup data on those nodes.
If a node crashes, for example, the dead node’s data has to be re-owned (become primary) by other nodes, and new backups of that data have to be made immediately for the cluster to be fail-safe again. Shuffling megabytes of data around has a negative effect on the cluster, as it consumes valuable resources such as network, CPU, and RAM. It can also lead to higher latency for your operations during that period.
Hazelcast IMDG focuses on latency and makes it easier to cache/share/operate TBs of data in-memory. Storing terabytes of data in-memory is not a problem, but avoiding Java garbage collection to achieve predictable, low latency and resiliency to crashes is a big challenge.
By default, Hazelcast stores your distributed data (Map entries, Queue items) in the Java heap, which is subject to garbage collection. As your heap gets bigger, garbage collection might pause your application for tens of seconds, badly affecting application performance and response times.
The High-Density Memory Store is Hazelcast’s native memory storage, built to avoid garbage collection pauses. Even if you have terabytes of cache in-memory with lots of updates, garbage collection will have almost no effect, resulting in more predictable latency and throughput.
The High-Density Memory Store implementation uses NIO DirectByteBuffers and doesn’t require any defragmentation.
Here is how things work:
The user defines how many GB of storage to keep in native memory per JVM; let’s say it is 40GB. Hazelcast will then create 40 DirectByteBuffers, each with 1GB capacity.
If you have, say, 100 nodes, you have a total of 4TB of native memory storage capacity. Each buffer is divided into configurable chunks, or blocks (the default chunk size is 1KB), and Hazelcast keeps a queue of available (writable) blocks. A 3KB value, for example, will be stored in 3 blocks. When the value is removed, those blocks are returned to the available-blocks queue so they can be reused to store another value.
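The block mechanics above can be sketched in plain Java. This is a simplified, single-threaded illustration of the idea, not Hazelcast’s actual implementation, using a small capacity instead of gigabytes:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of a block-based native-memory store: one DirectByteBuffer
// carved into fixed-size blocks, with a queue of free block indices.
class BlockStore {
    private final ByteBuffer memory;
    private final int blockSize;
    private final Queue<Integer> freeBlocks = new ArrayDeque<>();

    BlockStore(int capacityBytes, int blockSize) {
        this.memory = ByteBuffer.allocateDirect(capacityBytes);
        this.blockSize = blockSize;
        for (int i = 0; i < capacityBytes / blockSize; i++) {
            freeBlocks.add(i);
        }
    }

    // Stores the value across as many blocks as needed; returns their indices.
    int[] store(byte[] value) {
        int needed = (value.length + blockSize - 1) / blockSize; // round up
        int[] blocks = new int[needed];
        for (int i = 0; i < needed; i++) {
            Integer block = freeBlocks.poll();
            if (block == null) throw new IllegalStateException("out of native memory");
            blocks[i] = block;
            int len = Math.min(blockSize, value.length - i * blockSize);
            memory.position(block * blockSize);
            memory.put(value, i * blockSize, len);
        }
        return blocks;
    }

    // Returns blocks to the free queue so they can be reused.
    void free(int[] blocks) {
        for (int b : blocks) {
            freeBlocks.add(b);
        }
    }

    int freeBlockCount() {
        return freeBlocks.size();
    }
}
```

Because every block has the same fixed size, freed blocks are directly reusable for any new value, which is why this layout needs no defragmentation.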
With the new backup implementation, data owned by a node is divided into chunks and evenly backed up by all the other nodes.
In other words, every node takes equal responsibility for backing up every other node. This leads to better memory usage and less disruption to the cluster when you add or remove nodes.
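The even spread described above can be sketched as a round-robin assignment of the owner’s chunks to every other node. This is a simplification with hypothetical node names; Hazelcast actually distributes at the partition level:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: every chunk of the owner's data is assigned round-robin to the
// remaining nodes, so each node carries an equal share of the backups.
class BackupDistribution {
    static Map<String, List<Integer>> distribute(String owner, List<String> nodes, int chunkCount) {
        List<String> others = new ArrayList<>(nodes);
        others.remove(owner);
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String node : others) {
            assignment.put(node, new ArrayList<>());
        }
        for (int chunk = 0; chunk < chunkCount; chunk++) {
            assignment.get(others.get(chunk % others.size())).add(chunk);
        }
        return assignment;
    }
}
```

With four nodes and nine chunks owned by one of them, each of the other three nodes ends up backing up three chunks, so losing any single backup holder costs only a third of the backups rather than all of them.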