Get Started

Get Started

These guides demonstrate the operational flexibility and speed of the Hazelcast In-Memory Computing Platform. Set-up in seconds, data in microseconds. Operations and developer friendly.

Hazelcast IMDG

Find out for yourself how to get a Hazelcast IMDG cluster up and running. In this Getting Started guide you’ll learn how to:

  • Create a Cluster of 3 Members.
  • Start the Hazelcast Management Center.
  • Add data to the cluster using a sample client in the language of your choice.
  • Add and remove some cluster members to demonstrate the automatic rebalancing of data and back-ups.

Hazelcast Jet

Learn how to run a distributed data stream processing pipeline in Java. In this Getting Started guide you’ll learn how to:

  • Start a Hazelcast Jet cluster in your JVM.
  • Build the Word Count application.
  • Execute the application wit Jet.
  • Push testing data to the cluster and explore results

As a next step, you will be able explore the code samples and extend your setup more features and connectors.

Jet Architecture

Programming Model

Hazelcast Jet is a framework for continuous computations over streams of data. What exactly does this mean?

A data stream is continuous flow of objects or records without a notion of explicit beginning or end. An abstract concept of the stream can be thought of as a:

  • Queue in Java
  • Topic in a message broker such as Apache Kafka or JMS
  • Unix pipe

Data streams don’t originate in these messaging systems. Rather, there are data producers creating the objects and publishing them to the stream. These producers could be:

  • Applications producing log events, business events, transaction records, etc.
  • Sensors or devices producing measurement events.
  • Databases producing change events. For example, every INSERT, UPDATE, or REMOVE could produce a new record in the stream.

Data streams are generally infinite, growing as new objects are produced, but the memory and disk space of computers have capacity constraints. Physical stream storage systems can hold only a portion of the stream. Therefore, consumers must work in parallel with producers to continuously process data as objects are created or within a reasonable delay, based on the capacity of the stream storage system.

Continuous processing, unlike staged processing, makes the data pipeline real-time, reducing the end-to-end processing latency to milliseconds in optimized in-memory systems such as Hazelcast.

The benefits of continuous processing are valuable for data-driven and latency-sensitive use cases such as:

  • Analytics and decision-making. Analytical computations driven by incoming data can produce insights in real time.
  • Data integration. Systems and applications can stay in sync by sharing the stream of change events.
  • Event-driven applications. An application can change its state based on a stream of events.

Hazelcast Jet is a framework for building continuous data processing applications that scale.

Building Blocks

The main APIs of Hazelcast Jet are the Hazelcast Jet Pipeline API, the Hazelcast Jet Connector APIs, and the Hazelcast IMDG API.

Hazelcast Jet Pipeline API

Hazelcast Jet Pipeline API is an elegant, functional-style declarative Java API that allows users to implement computations by linking the respective transformations, then connecting those to the upstream data sources and downstream sinks.

Hazelcast Jet Pipeline API follows the continuous processing model and provides you with the tools to implement stateful computations over both bounded (batch) and unbounded (streaming) data.

Hazelcast Jet Connector APIs

Data to be processed may live outside of Jet in databases, Apache Kafka, JMS, Hadoop, files, REST services, sockets, Twitter, or many other sources, or it may live in the robust, distributed in-memory data storage of Hazelcast IMDG. Hazelcast Jet connectors make data available to the Jet pipelines.

The connectors provide an abstraction layer over many flavors of data sets such as unbounded streams, bounded collections, push or pull models, and more, so that the Pipeline API can approach the data with a unified set of operations.

You can combine several data sources with one pipeline to join or correlate them. Or you can fork the pipeline to output to more than one destination.

Visit the Hub for a complete list of connectors or use a convenient builder API to integrate your custom data source or sink.

Hazelcast IMDG API

Hazelcast Jet builds on a tight integration with Hazelcast IMDG – the robust, distributed in-memory storage with querying and event-driven programming support. The services of Hazelcast IMDG are available in the Jet cluster to be used in conjunction with the Pipeline API and Connector API to:

  • Cache your reference data and enrich the event stream with it. [code sample]
  • Store the results of a computation. [code sample]
  • Cache the input data for faster processing with Jet. [code sample]

Coordinate your application using a linear and distributed implementation of the Java concurrency primitives backed by the Raft consensus algorithm such as locks, atomics, semaphores, and latches. [code sample]

Architecture

Hazelcast Jet follows the client-server architecture to decouple application declaration and execution. The Jet SDK exposes the Pipeline API, Connector API, and IMDG API. Use it to build Jet applications. To run the application, you must submit it to the Jet server for execution.

The Hazelcast Jet server can run either embedded or as a standalone data processing cluster. In embedded mode, you package the Jet JAR with your application and start the Jet cluster member from your application code. The Jet member then runs in-process, in the same JVM as the application. The member discovers and joins the running cluster and the processing job, starting it if absent. Embedded mode simplifies the packaging and deployment because everything is distributed in one self-contained package such as a JAR or Docker container.

Also, you can start a standalone Hazelcast Jet cluster from a command line and use the CLI or a Java client to control the cluster and submit Jet Jobs. The cluster members use the same execution engine as they would in embedded mode.

The SDK, client, and server are all distributed in one Java library with no dependencies. This significantly simplifies the architecture of your solution.

Continuous Execution

After the Pipeline is submitted for execution, the Hazelcast Jet planner translates it into a Directed Acyclic Graph (DAG) representing the data flow. The DAG is distributed and replicated to the entire cluster and executed in parallel on each member.

Each DAG node specifies how many parallel instances will run per cluster member. This varies based on the operator—some operations can be parallelized while others can not—and available CPU cores. Every member has the same number of parallel instances, or processors.

Hazelcast Jet establishes channels between respective processors to mediate the data flow. The processors run concurrently. Each processor continuously consumes data from the upstream channels and publishes the results downstream. The capacity of the channels is limited to keep the amount of in-flight data under control and to provide natural back pressure.

Threading

Job execution is done on a user-configurable number of threads. Each worker thread has a list of processors of which it is in charge. As the processors complete at different rates, the unutilized processors back off and stay idle, allowing busy workers to catch up to keep the load balanced.

This approach, called cooperative multi-threading, is one of the core features of Hazelcast Jet and can be roughly compared to green threads. Each processor can perform a small amount of work each time it is invoked, then yield back to the Hazelcast Jet engine. The engine manages a thread pool of fixed size, and the instances take their turn on each thread in a round-robin fashion.

The purpose of cooperative multi-threading is to improve performance. Two key factors contribute to this:

  • The overhead of context switching is much lower since the operating system’s thread scheduler is not involved.
  • The worker thread driving the instances stays on the same core for longer periods, preserving the CPU cache lines.

Scaling and Elasticity

Continuous streaming jobs tend to be long-running tasks. The elasticity of a Hazelcast Jet cluster allows scaling up and down with the load to manage load spikes and to prevent overprovisioning. Hazelcast Jet uses data parallelism to scale the computations.

Hazelcast Jet splits partitions of the input data among the DAG instances concurrently running in the cluster to effectively use all available CPU. Not all data can be split. For example, reading from a single file or counting all records in the stream has to be done by a single processor in the cluster. In such cases, Jet uses fewer processors—just one file reader processor instance for example—for the respective DAG operation.

The cluster is elastic. You can scale the cluster up and down by adding and removing cluster members without downtime. When the cluster topology changes, Jet redistributes the data partitions among available DAG instances.

Fault Tolerance

Jet keeps processing data without loss even if a node fails, so it’s easy to build fault-tolerant data processing pipelines. Under the hood, Jet regularly creates a snapshot of the cluster and its jobs to recover after failure. The snapshot is stored in the resilient data storage of Jet in a configurable number of backup copies. Computations can be restored if at least one copy of the data survives.

For increased fault tolerance, one can use the Lossless Cluster restart feature to persist the snapshots to the disk.

The fault tolerance protocol of Jet is based on replayable input data and deterministic computations. The snapshot consistently captures the state of the computation including the “last committed offset” of the input data. The cluster can use the snapshot to recover the computation by restoring the state from the snapshot and replaying all the data following the snapshotted source offset. You can use transactions to prevent duplicate output.

Regular snapshots are a very lightweight means of fault-tolerance compared to performing a full  cluster backup with each state change.

 

Clustering

Hazelcast Jet automatically handles clustering, including node discovery, cluster and job management, and fault tolerance, without any runtime dependencies. Fewer moving parts means simpler architecture.

The nodes will automatically start to discover each other and form a cluster. Jet rescales running jobs to use all available resources. A built-in resilient storage backs up cluster state for automatic fault tolerance.

The nodes in a Jet cluster are symmetrical – there is no dedicated master node or worker node. The Jet cluster assigns these roles dynamically. Roles may change over time as the cluster topology changes.

There are many cluster and resource management tools available, such as YARN, Mesos, Kubernetes, and more. Jet isn’t associated with any of them, nor does it try to replicate what they do. Instead, Jet focuses on running jobs and dealing with topology changes. Cluster resource allocation is left to a tool of your choice.

Delegating resource management to different layers allows you to simplify deployments where you easily launch Jet nodes manually by copying necessary JARs and running Jet from a command line.

A Hazelcast-supported Docker image is available on Docker Hub for container deployments. Hazelcast Jet Docker images are Kubernetes-ready. There is a Helm chart available that bootstraps Hazelcast Jet deployments on a Kubernetes cluster using the Helm package manager.

Free Hazelcast Online Training Center

Whether you're interested in learning the basics of in-memory systems, or you're looking for advanced, real-world production examples and best practices, we've got you covered.

Open Gitter Chat