Apache Kafka

Apache Kafka is a distributed streaming platform. It is designed to handle streams of data with high availability and is mainly used to build real-time streaming applications.

Why use Kafka?

Kafka offers several advantages for streaming data processing:

  • Scalability: Kafka can handle large volumes of data and high message throughput by distributing the load across multiple brokers.

  • Durability: messages are replicated across multiple brokers to ensure data availability in case of failures.

  • Real-time processing: Kafka offers low-latency message delivery and supports stream processing frameworks like Apache Spark and Apache Flink.

  • Reliability: Kafka guarantees message delivery and ordering within a partition.

Kafka Components

[Figure: Kafka architecture]

Kafka’s architecture comprises various components:

  • Topics:

    Events are organized and durably stored in topics. A topic is similar to a folder in a filesystem, and the events are like the files within that folder. Kafka topics provide a logical and structured way to group related events together.

    Topics in Kafka support multiple producers and subscribers. The events within a topic are partitioned and replicated across multiple brokers, ensuring scalability and fault tolerance. Furthermore, these events are ordered and persistently stored for a designated retention period.

    Producers submit messages to specific topics, while consumers subscribe to topics to consume the messages. (A minimal topic-creation sketch appears after this component list.)

    [Figure: Kafka topics]

  • Producers:

    Producers are client applications that publish (write) events to Kafka. Streams of events are published to one or more Kafka topics. Producers are responsible for choosing which topic, and which partition within it, each record is written to (see the producer sketch after this list).

  • Consumers and Consumer Groups:

    Consumers are applications or systems that subscribe to topics and consume the messages published to those topics.

    A consumer group consists of multiple consumers that pull data from the same topic or set of topics. Consumers choose the offset to start reading from, and within a consumer group each partition is assigned to exactly one member, so each record is delivered to a single consumer in the group (see the consumer sketch after this list).

    [Figure: Kafka consumers]

  • Brokers:

    Kafka brokers are the servers that oversee the storage, management, replication, and distribution of topic data. Each partition within a topic has a designated “leader” broker that handles read and write requests, while the remaining “follower” brokers replicate the data for fault tolerance and high availability.

  • ZooKeeper:

    Kafka uses ZooKeeper for coordination and maintaining metadata information such as broker and topic configurations, consumer group offsets, etc.
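To make these components concrete, here is a minimal sketch of creating a topic with Kafka’s Java AdminClient. The broker address (broker-1:9092), topic name (events), partition count, and replication factor are illustrative assumptions, not values taken from this document.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker-1:9092"); // hypothetical broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions for parallelism; replication factor 2 so follower
                // brokers hold a copy of each partition for fault tolerance
                NewTopic topic = new NewTopic("events", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }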
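Next, a minimal producer sketch against the same hypothetical topic. The record key determines the target partition, so records sharing a key keep their relative order.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker-1:9092"); // hypothetical broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // the key ("sensor-42") is hashed to pick the partition
                producer.send(new ProducerRecord<>("events", "sensor-42", "temperature=21.5"));
                producer.flush(); // wait until buffered records have been sent
            }
        }
    }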
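Finally, a matching consumer sketch. Every consumer started with the same group.id joins the same consumer group, and Kafka assigns each partition of the subscribed topic to exactly one member of that group; demo-group is an illustrative name.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker-1:9092"); // hypothetical broker address
            props.put("group.id", "demo-group"); // members of this group share the topic's partitions
            props.put("auto.offset.reset", "earliest"); // read from the oldest offset if none is committed
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                record.partition(), record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }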

Kafka in TDP

The components described above are distributed as follows on the TDP getting started cluster:

  • Kafka Brokers are deployed on the worker nodes.
  • Kafka Client runs on the edge node along with other clients.
  • ZooKeeper for Kafka: a ZooKeeper instance dedicated to Kafka. The ZooKeeper servers run on the master nodes, and the zkForKafkaCli.sh command-line interface is available on the edge node.

TDP also provides Kafka command-line interfaces (CLIs) on the edge node, which are used to interact with the Kafka services. The client properties files can be found in /etc/kafka/conf/*.properties.

Kafka on TDP is SSL-enabled and Kerberos-authenticated, using the SASL_SSL listener and the GSSAPI mechanism. The Kafka Ranger plugin is also installed and enabled by default. A sketch of the corresponding client settings follows below.
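As a sketch, a client properties file for such a SASL_SSL/GSSAPI setup typically contains entries like the ones below. The truststore path and password are placeholders; the actual values are defined in the files under /etc/kafka/conf/.

    security.protocol=SASL_SSL
    sasl.mechanism=GSSAPI
    sasl.kerberos.service.name=kafka
    # placeholder truststore path and password
    ssl.truststore.location=/path/to/truststore.jks
    ssl.truststore.password=changeit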

What to learn next?

Now that you’re familiar with Kafka, the next tutorial will guide you through its fundamental commands.