Skip to content

Kafka

Notes about Kafka.

  • The key data structure is append only logs (immutable data) or ledger.
  • Events are facts about the world.
  • Data are segmented into topics, producer sends events to topics, and consumers pull from it.

The architectures runs with zookeepers, servers and topics. Zookeeper managed servers.

--from-begininng in consumer console can show all the data in the topics.

Topics

  • Number of replicas [how many copies]
  • Number of partitions [number of splits for a topics]
  • Retention policy (either size, or time)
  • Cleanup policy can be either compact (keep only a single observation per key) or delete. Compact is useful when consumers always expect on observation per key.
  • Compression type (lz4 [fast], snappy [fast], gzip [compression/network], zstd [compression/network]).

Producer

  • Acks: acknowledgments settings, how many brokers have to receive a message for a producer to consider a message received.

Insure to set:

  • client id
  • batch size
  • linger time (time waiting between sending message)
  • compression type

Consummers

  • Topics (avoid automatic creation)
  • client id
  • poll/consume
  • offset number [number consumed message, earliest known offset], where to start on restart.

Avro

  • Schema definition.
  • Schema registry: a web server to save your schema.

Terms

  • Broker: a single server use for kafka.
  • Cluster: a group of Kafka servers/brokers.
  • Node: a compute instance.
  • Zookeeper: determine which broker is the leader of a given partition and topic as well s track cluster membership and configuration for Kafka.
  • Data partition: topics consist of one or more partitions. A partition provides ordering guarantees for all data contained within it, partition are chosen by hashing key values. [Nested map logic with topic->partition->log ]. The log is split into multiple partitions.
  • Data Replication: a mechanism by which data is written to more than one broker to ensure that if a single broker is lost, a replicated copy of the data is available.
  • In-Sync-Replica (ISR): A broker which is up to date with the leader for a particular broker for all message in the current topic. This number may be less than the replication factor for a topic.
  • Rebalance: A process in which the current set of consumer changes (addition or removal of consumer). When this occurs, assignment of partitions to the various consumers in a consumer group must be change.
  • Batch size: number of data sent/received by Kafka.
  • acks: numbers of broker acknowledgments that must be received from Kafka before a producer continue processing.
  • Synchronous production: producers which waits for a response before additional processing. Async: does not wait.
  • Consumer Offset: a value indicating the last seen and processed message of a given consumer, by ID.
  • Consumer Group: a collection of one or more consumers, identified by group.id, which collaborate to consume data from Kafka and share a consumer offset.
  • Consumer Group Leader: the consumer in charge of working with the Group Coordinator to manage the consumer group.
  • Topic Subscription: Kafka consumers indicate to the Kafka Cluster that they would like to consume from one or more topics by specifying those they want to subscribe to.
  • Consumer Lag: difference between the offset of a consumer group and the latest message offset in Kafka itself.

Streaming Application

  • Two differences: KStream and KTable. The difference is KStream are stateless and KTable are stateful, they usually keep their state in RocksDB database to maintain the state and speed up recovery time.

Kafka Connect

API allowing to bind (source and sink) to multiple data sources without the need to write kafka specific code.

Kafka Rest

HTTP server to Create Kafka application with Rest endpoint.

See also (generated)