-
RocksDB is a persistent key-value store for fast storage environments. Pricing:
- Open Source
Hudi tables can be used as sinks for Spark/Flink pipelines, and the Hudi write path provides several enhanced capabilities over the file writing done by vanilla Parquet/Avro sinks. Hudi classifies write operations carefully into incremental operations (insert, upsert, delete) and batch/bulk operations (insert_overwrite, insert_overwrite_table, delete_partition, bulk_insert), and provides the relevant functionality for each operation in a performant and cohesive way. Both upsert and delete operations automatically handle merging of records with the same key in the input stream (say, a CDC stream obtained from an upstream table), then look up the index, and finally invoke a bin-packing algorithm to pack data into files while respecting a pre-configured target file size. An insert operation, on the other hand, is intelligent enough to avoid the precombining and index lookup, while retaining the benefits of the rest of the pipeline. Similarly, the bulk_insert operation provides several sort modes for controlling initial file sizes and file counts when importing data from an external table into Hudi. The other batch write operations provide MVCC-based implementations of the typical overwrite semantics used in batch data pipelines, while retaining all of the transactional and incremental processing capabilities, making it seamless to switch between incremental pipelines for regular runs and batch pipelines for backfilling or dropping older partitions (see the sketch after this entry). The write pipeline also contains lower-level optimizations, such as handling large merges by spilling to RocksDB or an external spillable map, and multi-threaded/concurrent I/O to improve write performance.
#NoSQL Databases #Databases #Key-Value Database 14 social mentions
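The write operation is chosen per batch via the Spark datasource options. Below is a minimal Scala sketch, assuming the standard Hudi Spark datasource write options; the table name, field names and paths are hypothetical placeholders, not taken from the article.

```scala
// Sketch: selecting a Hudi write operation from Spark (hypothetical table/fields/paths).
import org.apache.spark.sql.{DataFrame, SaveMode}

object HudiWriteSketch {
  // Incremental path: upsert a CDC batch; records sharing a key are merged,
  // the index is looked up, and records are bin-packed into sized files.
  def upsertChangeLog(cdcBatch: DataFrame, basePath: String): Unit = {
    cdcBatch.write
      .format("hudi")
      .option("hoodie.table.name", "orders")                          // hypothetical table name
      .option("hoodie.datasource.write.recordkey.field", "order_id")  // key used for merging duplicates
      .option("hoodie.datasource.write.precombine.field", "ts")       // latest record per key wins
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .option("hoodie.datasource.write.operation", "upsert")
      .mode(SaveMode.Append)
      .save(basePath)
  }

  // Bulk path: import an external snapshot; skips the index lookup and uses a
  // sort mode to control initial file sizes and counts.
  def bulkImport(snapshot: DataFrame, basePath: String): Unit = {
    snapshot.write
      .format("hudi")
      .option("hoodie.table.name", "orders")
      .option("hoodie.datasource.write.recordkey.field", "order_id")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT")           // one of the available sort modes
      .mode(SaveMode.Overwrite)
      .save(basePath)
  }
}
```

Under the same assumptions, switching to a batch overwrite run is just a matter of setting the operation to insert_overwrite or insert_overwrite_table on the same writer.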
-
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. Pricing:
- Open Source
The following stack captures the layers of software components that make up Hudi, with each layer depending on and drawing strength from the layer below. Typically, data lake users write data out once using an open file format like Apache Parquet/ORC, stored on top of extremely scalable cloud storage or distributed file systems. Hudi provides a self-managing data plane to ingest, transform and manage this data in a way that unlocks incremental data processing on it (see the read sketch after this entry).
#Databases #Big Data #Key-Value Database 25 social mentions
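A minimal Scala sketch of the read side under the same assumptions as the write sketch above: the table lives as Parquet base files plus Hudi metadata under a storage prefix, and the path below is a hypothetical placeholder.

```scala
// Sketch: reading a Hudi table stored as Parquet files on cloud storage (hypothetical path).
import org.apache.spark.sql.SparkSession

object HudiReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-read-sketch")
      .getOrCreate()

    // Hypothetical table location on cloud storage.
    val orders = spark.read
      .format("hudi")
      .load("s3a://my-bucket/lake/orders")

    orders.createOrReplaceTempView("orders")
    spark.sql("SELECT dt, COUNT(*) AS cnt FROM orders GROUP BY dt").show()
  }
}
```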
-
Apache ORC is a columnar storage format for Hadoop workloads. Pricing:
- Open Source
The following stack captures the layers of software components that make up Hudi, with each layer depending on and drawing strength from the layer below. Typically, data lake users write data out once using an open file format like Apache Parquet/ORC, stored on top of extremely scalable cloud storage or distributed file systems. Hudi provides a self-managing data plane to ingest, transform and manage this data in a way that unlocks incremental data processing on it.
#Big Data #Data Management #Data Dashboard 3 social mentions
-
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation, written in Scala. Pricing:
- Open Source
Streaming: At its core, by optimizing for fast upserts and change streams, Hudi provides data lake workloads with primitives comparable to what Apache Kafka provides for event streaming, namely incremental produce/consume of events and a state store for interactive querying (see the incremental-pull sketch after this entry).
#Stream Processing #Data Integration #ETL 146 social mentions
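The "incremental consume" half of that analogy can be sketched as an incremental query: pull only the records committed after a given instant, much like resuming from a Kafka offset. A minimal Scala sketch, assuming the standard Hudi incremental read options; the path and commit time are hypothetical placeholders.

```scala
// Sketch: incremental pull of changes committed after a checkpointed instant (hypothetical values).
import org.apache.spark.sql.SparkSession

object HudiIncrementalPull {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-incremental-pull").getOrCreate()

    val basePath       = "s3a://my-bucket/lake/orders"  // hypothetical table location
    val lastSeenCommit = "20240101000000"               // checkpoint saved by the previous run

    val changes = spark.read
      .format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", lastSeenCommit)
      .load(basePath)

    // Only rows added/changed since lastSeenCommit reach the downstream pipeline.
    changes.show(false)
  }
}
```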
-
Javalin offers simple REST APIs for Java and Kotlin. Pricing:
- Open Source
Storing and serving table metadata right on the lake storage is scalable, but can be much less performant than RPCs against a scalable meta server. Most cloud warehouses are internally built on a metadata layer that leverages an external database (e.g., Snowflake uses FoundationDB). Hudi also provides a metadata server, called the "Timeline server", which offers an alternative backing store for Hudi's table metadata. Currently, the timeline server runs embedded in the Hudi writer processes, serving file listings out of a local RocksDB store over a Javalin REST API during the write process, without needing to repeatedly list the cloud storage (see the sketch after this entry). Given that we have hardened this as the default option since the 0.6.0 release, we are considering standalone timeline server installations, with support for horizontal scaling, database/table mappings, security and all the features necessary to turn it into a highly performant next-generation lake metastore.
#Developer Tools #Web Frameworks #Runtime 36 social mentions
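A minimal Scala sketch of the writer-side toggle described above; the embedded timeline server is the default in recent releases, so the option below only makes it explicit, and the table/field names and path are hypothetical placeholders.

```scala
// Sketch: writing with the embedded timeline server serving file listings (hypothetical table/fields).
import org.apache.spark.sql.{DataFrame, SaveMode}

object TimelineServerSketch {
  def writeWithEmbeddedTimelineServer(batch: DataFrame, basePath: String): Unit = {
    batch.write
      .format("hudi")
      .option("hoodie.table.name", "orders")
      .option("hoodie.datasource.write.recordkey.field", "order_id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")
      // Serve file listings from the writer-embedded timeline server instead of
      // repeatedly listing cloud storage (already the default since 0.6.0).
      .option("hoodie.embed.timeline.server", "true")
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
```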
-
Apache Avro is a comprehensive data serialization system that also serves as a data exchange service for Apache Hadoop. Pricing:
- Open Source
Hudi is designed around the notion of a base file and delta log files that store updates/deltas to a given base file (together called a file slice). Their formats are pluggable, with Parquet (columnar access) and HFile (indexed access) being the supported base file formats today. The delta logs encode data in the Avro (row-oriented) format for speedier logging (much like Kafka topics, for example). Going forward, we plan to inline any base file format into log blocks in the coming releases, providing columnar access to delta logs depending on block sizes (see the sketch after this entry). Future plans also include ORC base/log file formats, unstructured data formats (free-form JSON, images), and even tiered storage layers in event-streaming systems/OLAP engines/warehouses, working with their native file formats.
#Development #OS & Utilities #Tool 14 social mentions
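A minimal Scala sketch of the layout described above: a MERGE_ON_READ table pairs columnar base files (Parquet here) with row-oriented Avro delta logs that are later compacted. The table/field names and path are hypothetical, and the base-file-format option is an assumption about the config surface.

```scala
// Sketch: a merge-on-read table with Parquet base files and Avro delta logs (hypothetical names).
import org.apache.spark.sql.{DataFrame, SaveMode}

object FileSliceSketch {
  def writeMergeOnRead(updates: DataFrame, basePath: String): Unit = {
    updates.write
      .format("hudi")
      .option("hoodie.table.name", "orders")
      .option("hoodie.datasource.write.recordkey.field", "order_id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")
      // Updates land in Avro log files and are compacted into columnar base files later.
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
      .option("hoodie.table.base.file.format", "PARQUET")  // assumed key; HFile is the indexed alternative
      .mode(SaveMode.Append)
      .save(basePath)
  }
}
```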