Apache Hudi

Apache Hudi is a streaming data lake platform that brings core warehouse and database functionality directly to a data lake. Hudi provides tables, transactions, efficient upserts and deletes, advanced indexes, streaming ingestion services, data clustering and compaction optimizations, and concurrency control, all while keeping your data in open source file formats.

Apache Hudi can be used on any cloud storage platform. Hudi’s performance optimizations make analytical workloads faster with popular query engines, including Apache Spark, Flink, Presto, Trino, and Hive.

If you are interested in a direct Decodable Connector for Hudi, please contact support@decodable.co or join our Slack community and let us know!

Overview

Connector name: hudi

Type: sink

Getting Started

Sending a Decodable data stream to Hudi takes two stages: first, create a sink connector to a data source that is supported by Hudi; then, add that data source to your Hudi configuration. Decodable and Hudi both support several technologies, including Apache Kafka.

Configure As A Sink

This example demonstrates using Kafka as the sink from Decodable and the source for Hudi. Sign in to Decodable Web and follow the configuration steps provided in the Apache Kafka topic to create a sink connector. For examples of using the command line tools or scripting, see the How To guides.

Create Kafka Data Source

There are multiple ways of ingesting data streams into Hudi, including DeltaStreamer and Kafka Connect. For example, the steps for using Kafka Connect are:

Create the environment
Set up the schema registry
Create the Hudi Control Topic for coordination of the transactions
Create the Hudi Topic for the Sink and insert data into the topic
Run the Sink connector worker
Add the Hudi Sink to the Connector
Run async compaction and clustering if scheduled
Query via Hive

For more detailed information, please refer to Hudi’s Kafka Connect documentation.
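
As an illustration of the "Add the Hudi Sink to the Connector" step, the following minimal Python sketch builds the JSON body that a Kafka Connect worker expects on its REST endpoint (POST /connectors). The connector name, topic, paths, field names, and URLs are hypothetical placeholders. The hoodie.* property names are recalled from Hudi's Kafka Connect quick start and are assumptions here, so check them against the documentation for your Hudi version before use.

```python
import json
import urllib.request

# Assumption: the Kafka Connect worker's REST API is reachable here
# (8083 is the Kafka Connect default port).
CONNECT_URL = "http://localhost:8083/connectors"


def build_hudi_sink_config(topic, base_path, record_key, partition_field,
                           schema_registry_url):
    """Return the request body for registering a Hudi sink connector.

    The hoodie.* keys are assumptions taken from Hudi's Kafka Connect
    quick start; verify them for your Hudi release.
    """
    return {
        "name": "hudi-sink",  # hypothetical connector name
        "config": {
            "connector.class": "org.apache.hudi.connect.HoodieSinkConnector",
            "tasks.max": "1",
            "key.converter": "org.apache.kafka.connect.storage.StringConverter",
            "value.converter": "org.apache.kafka.connect.storage.StringConverter",
            "value.converter.schemas.enable": "false",
            # The Kafka topic that the Decodable sink connector writes to.
            "topics": topic,
            "hoodie.table.name": topic,
            "hoodie.table.type": "MERGE_ON_READ",
            "hoodie.base.path": base_path,
            "hoodie.datasource.write.recordkey.field": record_key,
            "hoodie.datasource.write.partitionpath.field": partition_field,
            "hoodie.schemaprovider.class":
                "org.apache.hudi.schema.SchemaRegistryProvider",
            "hoodie.deltastreamer.schemaprovider.registry.url":
                schema_registry_url,
        },
    }


def build_register_request(payload, connect_url=CONNECT_URL):
    """Build (but do not send) the POST request that registers the connector."""
    return urllib.request.Request(
        connect_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_connector(payload, connect_url=CONNECT_URL):
    """Send the request to a running Connect worker and return the HTTP status."""
    request = build_register_request(payload, connect_url)
    with urllib.request.urlopen(request) as response:
        return response.status
```

To use it, call build_hudi_sink_config with the topic written by your Decodable Kafka sink connector and pass the result to register_connector once the Connect worker is running. Building the request separately from sending it lets you inspect or log the payload first.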

Apache Kafka, Kafka®, Apache® and associated open source project names are either registered trademarks or trademarks of The Apache Software Foundation.