Apache Pinot

Apache Pinot is a real-time distributed OLAP datastore, purpose-built to provide ultra-low-latency analytics, even at extremely high throughput. It can ingest directly from streaming data sources such as Apache Kafka and Amazon Kinesis, and make the events available for querying instantly. It can also ingest from batch data sources such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage.

At the heart of the system is a columnar store, with several smart indexing and pre-aggregation techniques for low latency. This makes Pinot an ideal fit for user-facing real-time analytics. At the same time, Pinot is also a great choice for other analytical use cases, such as internal dashboards, anomaly detection, and ad-hoc data exploration.

Getting Started

There are two ways to ingest data into Apache Pinot from Decodable:

  • The Decodable Pinot Connector, which pushes a finished Pinot segment directly to Pinot at each internal Decodable checkpoint.

    • This option is best for proofs of concept and certain moderate-throughput use cases. It is simpler in that you can get nearly immediate results in Pinot without dealing with an intermediate streaming provider.

    • For sustained use it may require some additional Pinot table configuration.

  • Indirectly via an intermediary streaming technology supported by both Decodable and Pinot, such as Amazon Kinesis, Apache Kafka, or Apache Pulsar.

    • This option gives you more control at high throughput, including how and when Pinot creates segments from incoming data. This is a more typical operating mode for Pinot, and may be better supported by Pinot. However, it requires you to configure and manage the streaming service and its topics and partitions.

This document describes each of these options in turn.

Option 1: The Decodable Pinot Connector

The Decodable Pinot Connector pushes one Pinot Segment per Decodable checkpoint.

Note that this Connector uses a checkpoint interval of 5 minutes, rather than the 10 seconds used by default for most Connectors and all Pipelines. This supports Pinot’s need for larger segment sizes at longer intervals. It may still be valuable to configure Pinot and your Pinot Table with a rollup task; see below.

Prerequisites

The following directions assume you have:

  • A Decodable Account.

  • A Pinot instance.

  • A Pinot Table of type OFFLINE, with a corresponding Pinot Schema that matches the schema of the Connection you’ll create here.

Optional Pinot rollup task

To maintain Pinot query performance over time, this Table may require configuration with a MergeRollupTask. Whether it is needed depends on your use case and your Pinot provider.

We recommend daily rollup as a first step, but other rollup granularities may be appropriate for your use case.

Note that the Pinot provider must be configured to support this table configuration; otherwise it is silently ignored and no rollup occurs.
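As an illustration, the following sketch enables a daily rollup by adding a MergeRollupTask to an existing OFFLINE table config through the Pinot Controller REST API. The Controller address and table name are hypothetical placeholders; the task keys follow the Pinot MergeRollupTask documentation, and your provider may expect different values.

    import requests  # any HTTP client works; requests is assumed to be installed

    CONTROLLER = "http://localhost:9000"  # hypothetical Controller endpoint
    TABLE = "my_events"                   # hypothetical table name

    # Fetch the current OFFLINE table config, add a daily MergeRollupTask,
    # and write the updated config back through the Controller API.
    config = requests.get(f"{CONTROLLER}/tables/{TABLE}").json()["OFFLINE"]
    config["task"] = {
        "taskTypeConfigsMap": {
            "MergeRollupTask": {
                "1day.mergeType": "rollup",     # aggregate rows that share dimension values
                "1day.bucketTimePeriod": "1d",  # merge segments into daily buckets
                "1day.bufferTimePeriod": "1d",  # leave the most recent day untouched
            }
        }
    }
    requests.put(f"{CONTROLLER}/tables/{TABLE}", json=config).raise_for_status()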

See the Pinot documentation on the MergeRollupTask for more detail.

Directions: create a Pinot Connection via the Decodable UI

To create and configure a Connection for Pinot, sign in to Decodable Web, navigate to the Connections tab, click on New Connection, and follow the steps below. For examples of using the command line tools or scripting, see the How To guides.

  1. Click "Connect" in the "Pinot" item.

  2. The Connector Type will default to sink, since that is the only option for Pinot Connections.

  3. Specify the URL of the Pinot Controller API endpoint.

  4. Specify the Name of your Pinot Table. Do not include the Type (OFFLINE) in the Name.

  5. The Table Type defaults to OFFLINE, which should be correct for most uses of this Connector. Type REALTIME is allowed for advanced use.

  6. Specify the Username and Password for authentication to your Controller API endpoint.

For more detailed information about Apache Pinot, see the Pinot Getting Started guide and related documentation.

Reference

Connector name      pinot
Type                sink
Delivery guarantee  Exactly once (note that Pinot itself operates as at least once)

Properties

If you are using the Decodable CLI to create or edit a connection to a Pinot table, use the following table as a reference for the properties that are required and supported.

Property             Disposition  Description
url                  required     Pinot Controller endpoint URL
table.name           required     Name of the Pinot Table, without the Type (OFFLINE)
table.type           optional     Typically OFFLINE, the default. May be REALTIME in some advanced cases.
auth.basic.username  required     Username for authentication to the Pinot Controller at url
auth.basic.password  required     Password to use with auth.basic.username
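Once a Connection is created, one quick way to confirm that the url, auth.basic.username, and auth.basic.password values are usable is to query the Controller API directly. A minimal sketch, assuming a Controller with basic authentication enabled at a hypothetical address:

    import requests  # any HTTP client works; requests is assumed to be installed

    # Hypothetical values matching the url, auth.basic.username, and
    # auth.basic.password connection properties described above.
    url = "http://pinot-controller.example.com:9000"
    auth = ("admin", "secret")

    # List the tables visible to this user; a 200 response confirms that
    # the endpoint and credentials work.
    resp = requests.get(f"{url}/tables", auth=auth)
    resp.raise_for_status()
    print(resp.json()["tables"])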

Option 2: Indirect streaming through an external service

Sending a Decodable data stream to Pinot is accomplished in two stages: first, create a sink connection from Decodable to a streaming system that Pinot supports; then add that system as a data source in your Pinot configuration. Decodable and Pinot mutually support several technologies, including the following:

  • Amazon Kinesis

  • Apache Kafka

  • Apache Pulsar

Example: Use Kafka As A Sink

This example demonstrates using Kafka as the sink from Decodable and the source for Pinot. Sign in to Decodable Web and follow the configuration steps provided for the Apache Kafka Connector to create a sink connection. For examples of using the command line tools or scripting, see the How To guides.

Create Kafka Data Source in Pinot

Pinot has out-of-the-box real-time ingestion support for Kafka: it lets users consume data from streams and push it directly into the database, in a process known as stream ingestion. Stream ingestion makes it possible to query data within seconds of publication, and it supports checkpoints to prevent data loss. Setting up stream ingestion involves the following steps, sketched in code after the list:

  1. Create the schema configuration. The schema defines the fields along with their data types, and whether each field serves as a dimension, a metric, or the timestamp.

  2. Create the table configuration. The real-time table configuration consists of the following fields:

    • tableName, the name of the table where the data should flow.

    • tableType, the internal type for the table. Should always be set to REALTIME for real-time ingestion.

    • tableIndexConfig, which defines the columns to index and the type of index to use. It has the following required fields:

      • loadMode, which specifies how the segments should be loaded; must be either heap or mmap.

      • streamConfig, which specifies the data source along with the configs necessary to start consuming the real-time data. The streamConfig can be thought of as the equivalent of the job spec for batch ingestion.

  3. Upload table and schema spec. Once the table and schema configurations have been created, they can be uploaded to the Pinot cluster. As soon as the configs are uploaded, Pinot will start ingesting available records from the topic.
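The sketch below walks through all three steps against a hypothetical local cluster, using the classic streamConfigs block inside tableIndexConfig as described above. The table name, topic, broker address, and field names are illustrative placeholders; substitute the values from your own schema and Kafka deployment.

    import requests  # any HTTP client works; requests is assumed to be installed

    CONTROLLER = "http://localhost:9000"  # hypothetical Controller endpoint

    # Step 1: schema configuration, with one dimension, one metric, and a timestamp.
    schema = {
        "schemaName": "orders",  # hypothetical; matches tableName below
        "dimensionFieldSpecs": [{"name": "product", "dataType": "STRING"}],
        "metricFieldSpecs": [{"name": "amount", "dataType": "DOUBLE"}],
        "dateTimeFieldSpecs": [{
            "name": "ts",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }],
    }

    # Step 2: REALTIME table configuration; streamConfigs points at the Kafka
    # topic that the Decodable sink connection writes to.
    table = {
        "tableName": "orders",
        "tableType": "REALTIME",
        "segmentsConfig": {
            "timeColumnName": "ts",
            "schemaName": "orders",
            "replicasPerPartition": "1",
        },
        "tableIndexConfig": {
            "loadMode": "MMAP",
            "streamConfigs": {
                "streamType": "kafka",
                "stream.kafka.topic.name": "orders",           # topic fed by Decodable
                "stream.kafka.broker.list": "localhost:9092",  # hypothetical broker
                "stream.kafka.consumer.type": "lowlevel",
                "stream.kafka.consumer.factory.class.name":
                    "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
                "stream.kafka.decoder.class.name":
                    "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            },
        },
        "tenants": {},
        "metadata": {},
    }

    # Step 3: upload both specs; Pinot begins consuming from the topic as soon
    # as the table config is accepted.
    requests.post(f"{CONTROLLER}/schemas", json=schema).raise_for_status()
    requests.post(f"{CONTROLLER}/tables", json=table).raise_for_status()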

For more detailed information, please refer to Pinot’s Kafka documentation.


Apache Kafka, Kafka®, Apache® and associated open source project names are either registered trademarks or trademarks of The Apache Software Foundation.