Jet is a single, embeddable Java library for building fault-tolerant and elastic data processing pipelines. The nodes automatically discover each other and form a cluster. You can add more nodes that immediately share the computation load. Jet continues processing data without loss even if a node fails. Jet runs in any cloud and functions seamlessly in Kubernetes.
Jet is designed for predictable low latency. It uses a combination of a directed acyclic graph (DAG) computation model, parallel execution, in-memory processing and storage, data locality, partition mapping affinity, single-producer/single-consumer queues, and green threads to achieve very high and predictable performance.
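To make the DAG model concrete, here is a minimal sketch against the Jet 4.x pipeline API (the map name "temperatures" is made up for illustration); each stage becomes a vertex of the DAG that Jet executes in parallel across the cluster:

    import com.hazelcast.jet.aggregate.AggregateOperations;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;

    import java.util.Map;

    public class DagSketch {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();
            p.readFrom(Sources.<String, Integer>map("temperatures"))  // source vertex
             .filter(e -> e.getValue() > 20)                          // transform vertex
             .groupingKey(Map.Entry::getKey)                          // partitioned edge
             .aggregate(AggregateOperations.averagingLong(Map.Entry::getValue)) // aggregation vertex
             .writeTo(Sinks.logger());                                // sink vertex
        }
    }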
When working within a single Java Virtual Machine, a developer would use the java.util collections to store operational data and the java.util.stream API to process it at a higher level. Hazelcast Jet carries this approach over to a multi-JVM setup for scalability and high availability.
Hazelcast Jet is a distributed and robust implementation of Java Collections and Concurrency APIs extended with a functional-style, declarative data processing Java API inspired by Java Streams.
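To illustrate the resemblance, here is a word count written first with java.util.stream in one JVM and then as a distributed Jet pipeline (a sketch against the Jet 4.x API; the map names "words" and "counts" are made up):

    import com.hazelcast.jet.aggregate.AggregateOperations;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    List<String> words = List.of("jet", "imdg", "jet");

    // Single JVM, java.util.stream:
    Map<String, Long> counts = words.stream()
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

    // The same declarative shape, distributed across a Jet cluster:
    Pipeline p = Pipeline.create();
    p.readFrom(Sources.<Long, String>map("words"))   // entries of a distributed IMap
     .map(Map.Entry::getValue)
     .groupingKey(w -> w)
     .aggregate(AggregateOperations.counting())
     .writeTo(Sinks.map("counts"));                  // results land in another IMap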
Hazelcast Jet builds on its tight integration with Hazelcast IMDG, the robust, distributed in-memory storage with querying and event-driven programming support. The services of Hazelcast IMDG are available inside the Jet cluster and can be used in conjunction with the data processing pipelines.
Hazelcast IMDG deployments can be upgraded to Jet. Jet can also use a remote Hazelcast IMDG cluster as a data layer.
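As a small sketch of that integration (Jet 4.x API; the map and cluster names are made up), the same JetInstance that runs jobs also exposes the IMDG data structures, and a pipeline can read from a remote IMDG cluster:

    import com.hazelcast.client.config.ClientConfig;
    import com.hazelcast.jet.Jet;
    import com.hazelcast.jet.JetInstance;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;
    import com.hazelcast.map.IMap;

    JetInstance jet = Jet.newJetInstance();

    // IMDG storage service on the Jet cluster itself:
    IMap<String, Long> counters = jet.getMap("counters");
    counters.put("events-seen", 42L);

    // Reading from a remote IMDG cluster used as the data layer:
    ClientConfig remote = new ClientConfig();
    remote.setClusterName("imdg-cluster");           // assumed remote cluster name
    Pipeline p = Pipeline.create();
    p.readFrom(Sources.remoteMap("inputs", remote))
     .writeTo(Sinks.logger());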
Hazelcast Jet combines services for stateful streaming, batch processing, caching, operational storage, messaging, and coordination into one integrated package.
Hazelcast Jet is a framework for building continuous data processing applications that scale.
Jet makes a data stream actionable by building and maintaining a queryable view of it. Jet continuously reads the streaming source, processes (deduplicates, aggregates, correlates, joins) new data as it arrives, and updates the cache with fresh results. Consumers fetch the pre-processed data from the cache rather than running queries on raw sequences of records, which reduces their latency: they can query the cache or subscribe to cache updates to get fresh data within milliseconds.
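A minimal sketch of such a pipeline (Jet 4.x API; the test source stands in for a real stream and "live-view" is a made-up cache name):

    import com.hazelcast.jet.aggregate.AggregateOperations;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.WindowDefinition;
    import com.hazelcast.jet.pipeline.test.TestSources;

    Pipeline p = Pipeline.create();
    p.readFrom(TestSources.itemStream(100))        // stand-in for the real streaming source
     .withNativeTimestamps(0)
     .window(WindowDefinition.tumbling(1_000))     // refresh the view every second
     .groupingKey(event -> event.sequence() % 10)  // synthetic key for the demo
     .aggregate(AggregateOperations.counting())
     .writeTo(Sinks.map("live-view"));             // consumers query this IMap, not the raw stream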
Use Hazelcast Jet pipelines to replace and speed up your MapReduce, Spark, or custom Java data processing jobs. Load data sets to a cluster cache and perform fast compute jobs on top of the cached data. The distributed storage of Jet can reasonably accommodate terabytes of data, and cached data can be reused by multiple jobs. Pipelines can combine in-memory collections with external data sets to cache only the hot, frequently accessed data.
Significant performance gains come from combining the in-memory approach with job and data co-location and parallel execution.
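For instance (a sketch; the file path and map name are hypothetical), a batch job can enrich an external data set with hot data already cached in the cluster:

    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;

    Pipeline p = Pipeline.create();
    p.readFrom(Sources.files("/data/orders"))      // external data set, one "orderId,customerId" line each
     .map(line -> line.split(","))
     .mapUsingIMap("customers",                    // hot data cached in the cluster
             (String[] parts) -> parts[1],
             (String[] parts, String name) -> parts[0] + " ordered by " + name)
     .writeTo(Sinks.logger());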
Integrate data in real time with continuous pipelines. Hazelcast Jet talks to various systems (messaging, databases, caches, file systems, RPC services) using connectors to continuously move data from place to place.
The distributed execution engine of Jet makes the ETL process scalable and resilient.
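For example, a continuous ETL pipeline from Kafka into an IMap might look like this sketch (it requires the hazelcast-jet-kafka module; the broker address and topic name are assumptions):

    import com.hazelcast.jet.Util;
    import com.hazelcast.jet.kafka.KafkaSources;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;

    import java.util.Properties;

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092");  // assumed broker
    props.setProperty("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
    props.setProperty("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

    Pipeline p = Pipeline.create();
    p.readFrom(KafkaSources.<String, String>kafka(props, "events"))  // assumed, keyed topic
     .withoutTimestamps()
     .map(e -> Util.entry(e.getKey(), e.getValue().trim()))          // the "transform" step
     .writeTo(Sinks.map("events"));                                  // load into an IMap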
Hazelcast Jet can run either embedded in an application or as a standalone data processing cluster. In embedded mode, you package the Jet JAR with the application and start the Jet cluster member from application code; the member then runs in the same JVM as the application.
Embedding Jet provides services for data processing, storage, messaging, and distributed coordination at an application level. It simplifies the packaging and deployment because everything is distributed in one self-contained package (such as a JAR or Docker container) with no runtime dependencies.
Embedding Jet is possible because the whole engine ships as a single, self-contained library with no external runtime dependencies.
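A minimal sketch of the embedded mode (Jet 4.x API):

    import com.hazelcast.jet.Jet;
    import com.hazelcast.jet.JetInstance;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class EmbeddedJet {
        public static void main(String[] args) {
            // Starts a full Jet cluster member inside this JVM; members started
            // the same way on other machines discover each other and form a cluster.
            JetInstance jet = Jet.newJetInstance();

            Pipeline p = Pipeline.create();
            p.readFrom(TestSources.items(1, 2, 3, 4))
             .map(i -> i * i)
             .writeTo(Sinks.logger());

            jet.newJob(p).join();   // runs on the embedded member
            jet.shutdown();
        }
    }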
Deploying Hazelcast-powered applications in a cloud-native way becomes even easier with the introduction of Hazelcast Cloud Enterprise, a fully-managed service built on the Enterprise edition of Hazelcast IMDG.
Common themes emerge when people describe their reasons for rearchitecting legacy business applications: at a technical level, speed and scalability; at a business level, the need to gain new real-time insights. These legacy applications commonly center on a single datastore such as a relational database, and moving away from this architecture requires a massive migration effort. The talk is a practical introduction to Change Data Capture (CDC), covering an architecture, trade-offs, tooling, and demos.
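As a taste of what such a pipeline can look like in Jet, here is a sketch using the hazelcast-jet-cdc-mysql connector (all connection details below are made up, and the exact builder methods may differ between versions):

    import com.hazelcast.jet.cdc.ChangeRecord;
    import com.hazelcast.jet.cdc.mysql.MySqlCdcSources;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.StreamSource;

    StreamSource<ChangeRecord> source = MySqlCdcSources.mysql("inventory-cdc")
            .setDatabaseAddress("127.0.0.1")   // assumed connection details
            .setDatabasePort(3306)
            .setDatabaseUser("debezium")
            .setDatabasePassword("dbz")
            .setClusterName("dbserver1")
            .setDatabaseWhitelist("inventory")
            .build();

    Pipeline p = Pipeline.create();
    p.readFrom(source)
     .withNativeTimestamps(0)
     .writeTo(Sinks.logger());   // in practice: update caches, feed analytics, and so on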
In this talk, Marko shows an approach that lets you write a low-latency, auto-parallelized, distributed stream processing pipeline in Java that seamlessly integrates a data scientist's work, taken almost unchanged from their Python development environment. The talk includes a live demo using the command line and walks through some Python and Java code snippets.
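A rough sketch of the Java side of such an integration, using the hazelcast-jet-python module (the base directory and handler module name are hypothetical; the module is expected to expose a transform_list(items) function, and Python must be installed on the cluster members):

    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;
    import com.hazelcast.jet.python.PythonServiceConfig;
    import com.hazelcast.jet.python.PythonTransforms;

    Pipeline p = Pipeline.create();
    p.readFrom(TestSources.itemStream(10))              // stand-in for the real stream
     .withNativeTimestamps(0)
     .map(event -> String.valueOf(event.sequence()))
     .apply(PythonTransforms.mapUsingPython(
             new PythonServiceConfig()
                     .setBaseDir("src/main/python")     // assumed project layout
                     .setHandlerModule("transform")))   // Python module with transform_list(items)
     .setLocalParallelism(1)
     .writeTo(Sinks.logger());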