Kafka is an open-source distributed streaming platform designed to handle large volumes of data in real time. It is used for building real-time streaming data pipelines, powering analytics and machine learning applications, and supporting event-driven architectures. Prepare for your next Kafka interview with the top Apache Kafka interview questions and answers compiled by industry experts. These will help you crack your Kafka interview as a beginner, intermediate, or expert DevOps professional. The following Apache Kafka interview questions discuss the key features of Kafka, how it differs from other messaging frameworks, partitions, brokers and their usage, and more. With these Kafka interview questions, you can be confident that you will be well prepared for your next interview. So, if you are looking to advance your career in big data, this guide is the perfect resource for you. Prepare well and crack your interview with ease and confidence!
Kafka is a messaging framework developed by the Apache Software Foundation. It is used to create messaging systems that provide a fault-tolerant cluster with low latency, ensuring end-to-end delivery of messages.
Below are the key points:
Kafka requires another component, ZooKeeper, to form a cluster; ZooKeeper acts as the coordination server.
Kafka provides reliable delivery of messages from sender to receiver, and it has several other key features as well.
To utilize all these key features, the Kafka cluster needs to be configured properly, along with the ZooKeeper configuration.
Nowadays Kafka is a key messaging framework, not only because of its feature set but also because of its reliable transmission of messages from sender to receiver; the key points to consider are listed below.
Considering the above features, Kafka is one of the best options in big data technologies for handling large volumes of messages smoothly.
This is one of the most frequently asked Apache Kafka interview questions for freshers in recent times.
There is a plethora of use cases where Kafka fits into real-world applications; the most frequently used ones are listed below.
The above are the use cases that predominantly require a Kafka framework; beyond these, there are other cases that depend on the requirements and design.
Let's talk about modern sources of data: transactional data such as orders, inventory, and shopping carts is being augmented with events such as clicks, likes, recommendations, and searches on a web page. All this data is deeply important for analyzing consumer behavior, and it can feed a set of predictive analytics engines that can be the differentiator for companies.
So, when we need to handle this kind of data volume, we need Kafka to solve the problem.
The Kafka process diagram comprises the essential components below, which are required to set up the messaging infrastructure.
Communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions.
A topic is a logical feed name to which records are published. Topics in Kafka support a multi-subscriber model, so a topic can have zero, one, or many consumers that subscribe to the data written to it.
Every partition holds an ordered, immutable sequence of records that is continuously appended to, forming a structured commit log. The Kafka cluster durably persists all published records, whether or not they have been consumed, using a configurable retention period.
A Kafka topic is split into partitions, each of which contains messages in an unmodifiable sequence.
This is one of the most frequently asked Apache Kafka interview questions and answers for freshers in recent times.
The offset is a unique identifier of a record within a partition. It denotes the position of the consumer in the partition. Consumers can read messages starting from a specific offset and can read from any offset point they choose.
A topic can also have multiple partition logs (as the click-topic does in the accompanying diagram). This allows multiple consumers to read from a topic in parallel.
The answer to this question encompasses two main aspects – Partitions in a topic and Consumer Groups.
A Kafka topic is divided into partitions. The message sent by the producer is distributed among the topic’s partitions based on the message key. Here we can assume that the key is such that messages would get equally distributed among the partitions.
A Consumer Group is a way to bunch together consumers so as to increase the throughput of the consumer application. Each consumer in a group latches onto a partition in the topic, i.e. if there are 4 partitions in the topic and 4 consumers in the group, then each consumer reads from a single partition. However, if there are 6 partitions and only 4 consumers, then some consumers read from more than one partition, so the degree of parallelism is limited to 4. Hence it is ideal to maintain a 1-to-1 mapping of partition to consumer in the group.
Now in order to scale up processing at the consumer end, two things can be done:
Doing this helps read data from the topic in parallel and hence scales up the consumer, for example from 2,500 messages per second to 10,000 messages per second.
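As a minimal, illustrative sketch only (the broker address, the group id "order-processors", and the topic name "orders" are assumed placeholders), the consumer below can be started multiple times; every instance that shares the same group.id gets its own slice of the topic's partitions, which is exactly how the scale-up described above works:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "order-processors");          // same group.id on every instance
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));  // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Running four copies of this program against a four-partition topic gives one partition per instance; a fifth copy would sit idle.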
Don't be surprised if this question pops up as one of the top interview questions on Kafka in your next interview.
Dumb broker/Smart consumer implies that the broker does not attempt to track which messages have been read by each consumer and retain only the unread ones; rather, the broker retains all messages for a set amount of time, and consumers are responsible for tracking which messages they have read.
Apache Kafka employs exactly this model: the broker stores messages for a period of time (7 days by default), while consumers are responsible for keeping track of which messages they have read using offsets.
The opposite of this is the Smart Broker/Dumb Consumer model wherein the broker is focused on the consistent delivery of messages to consumers. In such a case, consumers are dumb and consume at a roughly similar pace as the broker keeps track of consumer state. This model is followed by RabbitMQ.
Kafka is a distributed system wherein data is stored across multiple nodes in the cluster. There is a real possibility that one or more nodes in the cluster might fail. Fault tolerance means that the data in the system remains protected and available even when some of the nodes in the cluster fail.
One of the ways in which Kafka provides fault tolerance is by making copies of the partitions. A commonly used replication factor is 3, which means that for every partition in a topic, two additional copies are maintained. In case one of the brokers fails, data can be fetched from a replica. This way Kafka can withstand N-1 failures, N being the replication factor.
Kafka also follows the leader-follower model. For every partition, one broker is elected as the leader while the others are designated as followers. The leader is responsible for interacting with the producer/consumer. If the leader node goes down, then one of the remaining followers is elected as the new leader.
Kafka also maintains a list of In Sync replicas. Say the replication factor is 3. That means there will be a leader partition and two follower partitions. However, the followers may not be in sync with the leader. The ISR shows the list of replicas that are in sync with the leader.
As we already know, a Kafka topic is divided into partitions. The data inside each partition is ordered and can be accessed using an offset. An offset is the position within a partition of the next message to be read by the consumer. There are two types of offsets maintained by Kafka:
Current Offset: the position up to which the consumer has already read records from a partition, i.e. the offset of the next record that poll() will return.
Committed Offset: the offset that the consumer has confirmed (committed) as processed; it is used to resume consumption after a restart or a rebalance.
There are two ways to commit an offset: automatically (enable.auto.commit=true, where the consumer commits periodically in the background) or manually from the application using commitSync() or commitAsync(); the manual path is illustrated in the sketch further below.
Prior to Kafka v0.9, ZooKeeper was used to store consumer offsets; from v0.9 onwards, the offset information for a topic's partitions is stored in an internal topic called __consumer_offsets.
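A minimal sketch of the manual-commit path, assuming a local broker and a hypothetical topic "payments" and group id; auto-commit is switched off and the application commits after each processed batch:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "manual-committers");          // hypothetical group id
        props.put("enable.auto.commit", "false");            // application takes over commit responsibility
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));  // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                          // application-specific processing
                }
                consumer.commitAsync();                       // non-blocking commit after each batch
                // consumer.commitSync();                     // blocking alternative that retries on failure
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```

commitAsync keeps throughput high; commitSync blocks and retries, so it is often used once more on shutdown to make sure the last offsets are recorded.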
An ack or acknowledgment is sent by a broker to the producer to acknowledge receipt of the message. Ack level can be set as a configuration parameter in the Producer and it defines the number of acknowledgments the producer requires the leader to have received before considering a request complete. The following settings are allowed:
acks=0: In this case, the producer doesn't wait for any acknowledgment from the broker. There is no guarantee that the broker has received the record.
acks=1: In this case, the leader writes the record to its local log file and responds without waiting for acknowledgment from all its followers. The message can be lost only if the leader fails just after acknowledging the record but before the followers have replicated it.
acks=all: In this case, the leader waits for the entire set of in-sync replicas to acknowledge the record. This ensures that the record does not get lost as long as at least one replica is alive, and it provides the strongest possible guarantee. However, it also considerably reduces throughput, as the leader must wait for all in-sync replicas to acknowledge before responding.
acks=1 is often the preferred way of sending records, as it ensures receipt of the record by the leader, giving reasonable durability while keeping throughput high. For the highest throughput set acks=0, and for the highest durability set acks=all.
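A short illustrative producer configuration (broker address and topic name are placeholders) showing where the acks setting goes; switching the value between "0", "1" and "all" changes the durability/throughput trade-off described above:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AcksExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("acks", "all");                           // wait for the full ISR; use "1" or "0" to trade durability for throughput
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("audit-log", "key-1", "record payload"),  // hypothetical topic
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();    // the record was not acknowledged
                        } else {
                            System.out.println("acked at offset " + metadata.offset());
                        }
                    });
        }
    }
}
```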
Kafka is an open-source message broker system originally developed at LinkedIn and now maintained by the Apache Software Foundation. The underlying technologies used in Kafka are Java and Scala. It supports the publisher-subscriber model of communication, where publishers publish messages and subscribers get notified when a new message is published. It is categorized as distributed streaming software. Earlier, messaging queues and enterprise messaging systems like RabbitMQ were used for the same purpose, but Kafka became an industry leader in a short time. It is used for building streaming pipelines in high-volume applications to reliably transfer and transform data between different systems. It has an inbuilt fault-tolerant mechanism to store messages. It is distributed and supports partitioning as well. Kafka runs in a clustered environment, which makes it highly available and fault tolerant. It is gaining popularity due to its high message throughput in microservice architectures. The software component that publishes messages is called the producer, while consumers are the components to which messages are broadcast. In the below diagram you can see the different parts of a Kafka system:
There are some important components of any Kafka architecture. Please find below an overview of components:
The below diagram explains Producer, Topic, Partition consumer & consumer group.
ZooKeeper is an open-source system. As we know that Kafka brokers work in a cluster environment where several servers process the incoming messages before broadcasting to subscribed consumers. ZooKeeper in Kafka is used to manage broker state within the cluster and to perform leader election for partitions among other coordination tasks. It ensures that the Kafka cluster remains consistent and can handle failures of brokers gracefully. However, starting with Kafka version 2.8, it is possible to run Kafka without ZooKeeper using the KRaft (Kafka Raft Metadata mode). KRaft simplifies Kafka's architecture by removing the external dependency on ZooKeeper and managing metadata directly within Kafka itself. This advancement aims to streamline operations and potentially improve performance by reducing the number of moving parts in the system.
In a cluster, partitions are distributed across nodes, and each server shares the responsibility of processing requests for its own partitions. Partitions are also replicated across several nodes configured in a Kafka environment. This is the reason why Kafka is a fault-tolerant system. The node that acts as the primary server for a partition is termed the leader, while the other nodes where the data gets replicated are called followers. It is the leader's responsibility to read or write data on a given partition, while the follower systems passively replicate the leader's data. In case of failure of the leader, one of the follower systems becomes the leader. In a cluster environment, each system plays the role of leader for some partitions and follower for others, which makes Kafka popular as a fault-tolerant system. This is the reason why Kafka has taken centre stage among stream messaging platforms.
Replication is one of the key concepts in Kafka. It ensures data remains safe even in the case of system failure. Kafka stores published messages, which are broadcast to subscribed systems; but what if a Kafka server goes down? Will the published messages still be available? The answer is yes, and it is all possible due to replication, where messages are replicated across multiple servers or nodes. There could be multiple reasons for system failure, such as program failure, system error, or frequent software upgrades. Replication safeguards published data in case of any such failures. This fault-tolerant behaviour is one of the key reasons why Kafka became a market leader in a very short period of time. ISR, which stands for In-Sync Replicas, tracks the synchronization between the leader and follower systems. If a replica is not in the ISR, it indicates that the follower system is lagging behind the leader and not catching up with the leader's activity. Please refer to the below diagram, which will make your understanding more clear:
As the name suggests, this exception occurs when producer systems send more messages than the broker system can handle; the queue gets full at the broker end, so no more incoming requests can be handled. Since producer systems have no information about the capacity of the broker system, such exceptions occur and messages overflow at the broker end. To avoid such a scenario, we should have multiple systems working as brokers so messages can be evenly distributed across them. A cluster environment, where multiple nodes service message processing, prevents such exceptions. Clustering and partitioning help in avoiding any such exceptions.
Although there are many benefits of using Kafka, we can list down some key benefits that are making the tool more popular among stream messaging platforms. The below diagram summarizes the key benefits:
The leader and follower nodes serve the purpose of load balancing in Kafka. As we know, the leader does the actual writing/reading of data on a given partition, while the follower systems do the same in passive mode. This ensures data gets replicated across different nodes, so in case of any failure due to a system error or software upgrade, the data remains available. If the leader system goes down for any reason, then a follower system that was working in passive mode becomes the leader and ensures data remains available to external systems irrespective of any internal outage. A load balancer does the same thing: it distributes load across multiple systems when load increases. In the same way, Kafka balances load by replicating messages on different systems, and when the leader system goes down, another follower system becomes the leader and ensures data is available to the subscriber systems.
A staple in Kafka technical interview questions and answers, be prepared to answer this one using your hands-on experience.
Log compaction in Kafka ensures that even if a topic logs millions of messages, the log will retain at least the last known value for each unique key within the topic. This feature benefits users by providing a compact, historical record of state changes, essential for restoring state after a failure or for systems that require a complete history of changes.
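As an illustration, log compaction is enabled per topic with the cleanup.policy=compact configuration; the sketch below uses the AdminClient API with a hypothetical topic name, partition count, and replication factor:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("customer-latest-state", 3, (short) 1)  // hypothetical topic, 3 partitions, RF 1
                    .configs(Map.of("cleanup.policy", "compact"));                // keep at least the last value per key
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```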
In Kafka, custom serialization and deserialization involve converting data between the byte format used for Kafka message storage and the specific data types used in your application. This is necessary to efficiently send and receive messages through Kafka, enabling tailored handling of complex data structures or specific encoding schemes.
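A minimal sketch of a custom serializer/deserializer pair for a hypothetical Payment type (with recent client versions only serialize/deserialize need overriding); a real implementation would usually delegate to JSON, Avro, or Protobuf rather than hand-rolled string encoding:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

// A trivial value type used only for illustration.
class Payment {
    final String id;
    final long amountCents;
    Payment(String id, long amountCents) { this.id = id; this.amountCents = amountCents; }
}

// Encodes a Payment as "id:amountCents".
class PaymentSerializer implements Serializer<Payment> {
    @Override
    public byte[] serialize(String topic, Payment data) {
        if (data == null) return null;
        return (data.id + ":" + data.amountCents).getBytes(StandardCharsets.UTF_8);
    }
}

class PaymentDeserializer implements Deserializer<Payment> {
    @Override
    public Payment deserialize(String topic, byte[] bytes) {
        if (bytes == null) return null;
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split(":", 2);
        return new Payment(parts[0], Long.parseLong(parts[1]));
    }
}
```

The classes are wired in through configuration, e.g. props.put("value.serializer", PaymentSerializer.class.getName()) on the producer side.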
Kafka achieves message durability through data replication across various brokers within a cluster. You can set a replication factor for each topic, which dictates how many copies of each partition are maintained. Initially, messages are written to a leader partition and subsequently replicated to follower partitions. For added durability, Kafka commits these messages to disk.
A producer is a client that sends or publishes records. Producer applications write data to topics, and consumer applications read from topics.
Messages sent by a producer to a topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
A consumer is a subscriber that consumes messages, which are stored in partitions. A consumer is a separate process and can be a separate application altogether, running on an individual machine.
If all the consumers fall into the same consumer group, the messages are load-balanced across the consumer instances; if the consumer instances fall into different groups, then each message is broadcast to all consumer groups.
The working principle of Kafka follows the order below.
Apart from other benefits, below are the key advantages of using the Kafka messaging framework.
Considering all the above advantages, Kafka is one of the most popular frameworks utilized in microservice architecture, big data architecture, enterprise integration architecture, and publish-subscribe architecture.
Expect to come across this, one of the most important Kafka interview questions for experienced professionals in data management, in your next interviews.
Despite its advantages, setting up and configuring the Kafka ecosystem is a bit difficult, and one needs good knowledge to implement it; apart from that, I have listed some more points below.
ZooKeeper is a distributed, open-source configuration and synchronization service, along with a naming registry, for distributed applications.
ZooKeeper is a separate component; it is not part of Kafka itself, but when we need to implement a cluster, we have to set it up as the coordination server.
ZooKeeper plays a significant role when it comes to cluster management, such as fault tolerance: it helps detect when a node goes down so that the affected partitions can be served from replicas on other nodes.
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.2.0

groupId = org.apache.zookeeper
artifactId = zookeeper
version = 3.4.5
These dependencies come with child (transitive) dependencies, which are downloaded and added to the application as part of the parent dependency.
Here are the packages that typically need to be imported in Java/Scala (a representative list is sketched below).
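The exact imports depend on the code being written, but with the spark-streaming-kafka-0-10 integration above, a typical set looks roughly like this:

```java
// Typical imports for the spark-streaming-kafka-0-10 integration; the exact set depends on your application code.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
```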
A must-know for anyone looking for top Kafka interview questions, this is one of the frequently asked interview questions on Kafka.
Kafka follows a pub-sub mechanism wherein a producer writes to a topic and one or more consumers read from that topic. However, reads in Kafka always lag behind writes, as there is always some delay between the moment a message is written and the moment it is consumed. This delta between the latest offset and the consumer offset is called consumer lag.
There are various open-source tools available to measure consumer lag, e.g. LinkedIn's Burrow. Confluent Kafka also comes with out-of-the-box tools to measure lag.
With the Kafka messaging system, three different types of delivery semantics can be achieved: at-most-once, at-least-once, and exactly-once.
Kafka transactions help achieve exactly-once semantics between Kafka brokers and clients. In order to achieve this, we need to set the following properties at the producer end: enable.idempotence=true and transactional.id=<some unique id>. We also need to call initTransactions to prepare the producer to use transactions. With these properties set, if the producer (identified by its producer id) accidentally sends the same message to Kafka more than once, the Kafka broker detects and de-duplicates it.
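A minimal transactional producer sketch (broker address, transactional id, and topic names are placeholders); records sent between beginTransaction and commitTransaction become visible to read_committed consumers atomically:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // assumed broker address
        props.put("enable.idempotence", "true");                  // de-duplicate retried sends
        props.put("transactional.id", "orders-tx-producer-1");    // hypothetical unique transactional id
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();                           // register the transactional id with the broker
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("orders", "order-42", "created"));       // hypothetical topics
                producer.send(new ProducerRecord<>("order-audit", "order-42", "created"));
                producer.commitTransaction();                      // both records become visible atomically
            } catch (Exception e) {
                producer.abortTransaction();                       // neither record is exposed to read_committed consumers
                throw e;
            }
        }
    }
}
```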
Kafka is a durable, distributed and scalable messaging system designed to support high volume transactions. Use cases that require a publish-subscribe mechanism at high throughput are a good fit for Kafka. In case you need a point to point or request/reply type communication then other messaging queues like RabbitMQ can be considered.
Kafka is a good fit for real-time stream processing. It uses a dumb broker/smart consumer model, with the broker merely acting as a message store. So a scenario wherein the consumer cannot be smart and requires the broker to be smart instead is not a good fit for Kafka. In such a case, RabbitMQ can be used, which follows a smart broker model, with the broker responsible for consistent delivery of messages at a roughly similar pace.
Also, in cases where protocols like AMQP or MQTT, or features like message routing, are needed, RabbitMQ is a better alternative to Kafka.
This is a common yet one of the most tricky Kafka interview questions and answers for experienced professionals, don't miss this one.
A producer publishes messages to one or more Kafka topics. The message contains information about which topic and partition it should be published to.
There are three different ways in which a producer can publish messages: fire-and-forget, synchronous send, and asynchronous send with a callback.
Kafka messages are key-value pairs. The key is used for partitioning messages being sent to the topic. When writing a message to a topic, the producer has the option to provide a message key. This key determines which partition of the topic the message goes to. If the key is not specified, then the messages are sent to the partitions of the topic in a round-robin fashion.
Note that Kafka orders messages only inside a partition, hence choosing the right partition key is an important factor in application design.
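A small illustration of the difference (topic name and payloads are hypothetical): a record constructed with a key is always hashed to the same partition, while a key-less record leaves partition selection to the producer:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedRecordExample {
    public static void main(String[] args) {
        // Records with the same key ("user-42") always hash to the same partition,
        // so this user's events stay ordered relative to each other.
        ProducerRecord<String, String> keyed =
                new ProducerRecord<>("user-activity", "user-42", "clicked:home");

        // Without a key the partition is chosen by the producer (round robin in older clients),
        // so no per-user ordering is guaranteed.
        ProducerRecord<String, String> unkeyed =
                new ProducerRecord<>("user-activity", "clicked:home");

        System.out.println(keyed.key() + " / " + unkeyed.key());  // prints "user-42 / null"
    }
}
```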
Kafka supports data replication within the cluster to ensure high availability. But enterprises often need data availability guarantees that span beyond a single cluster and even withstand site failures.
The solution to this is Mirror Maker – a utility that helps replicate data between two Kafka clusters within the same or different data centers.
MirrorMaker is essentially a Kafka consumer and producer hooked together. The origin and destination clusters are completely different entities and can have different numbers of partitions and different offsets; however, the topic names should be the same between the source and destination clusters. The MirrorMaker process also retains and uses the partition key so that ordering is maintained within a partition.
Both belong to the same league of message streaming platforms. RabbitMQ belongs to the traditional league of messaging platforms and supports several protocols. It is an open-source message broker platform with a reasonable number of features, and it supports the AMQP messaging protocol with routing features.
Kafka was written in Scala and first introduced at LinkedIn to facilitate intra-system communication. Kafka is now developed under the umbrella of the Apache Software Foundation and is more suitable for an event-driven ecosystem. Now let's compare the two platforms. Kafka is a distributed, scalable, high-throughput system compared to RabbitMQ. In terms of performance as well, Kafka scores much better: RabbitMQ can process about 20,000 messages per second, while Kafka can process roughly five times as many.
Please find below a diagram detailing the key differences.
There are four core APIs available in Kafka: the Producer API, the Consumer API, the Streams API, and the Connector API. Please find below an overview of the core APIs:
Geo-replication enables replication across different data centers or different clusters. Kafka MirrorMaker enables geo-replication; this process is called mirroring. The mirroring process is different from replication across different nodes in the same cluster. Kafka's MirrorMaker ensures that messages from topics belonging to one or more source Kafka clusters are replicated to the destination cluster with the same topic names.
We should use at least one MirrorMaker to replicate one source cluster. We can have multiple MirrorMaker processes, within the same consumer group, to mirror topics. This enables high throughput and makes the setup fault tolerant: if one of the MirrorMaker processes goes down, the others can take over the additional load. One thing is very important here: the source and destination clusters are independent of each other and can have different partition counts and offsets.
Kafka depends on ZooKeeper, so we must first start ZooKeeper before starting the Kafka server. Please find below the step-by-step process to start the Kafka server (typical commands for each step are sketched after the list):
1: Start ZooKeeper by typing the following command in a terminal:
2: Once ZooKeeper is running, we can start the Kafka server by running the following command:
3: The next step is checking the services running in the background with the following command:
4: Once the Kafka server is running, we can create a topic by running the following command:
5: We can check the available topics by triggering the following command in the terminal:
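The exact commands depend on the Kafka version and installation layout, but in a typical ZooKeeper-based distribution the five steps above look roughly like this (on versions older than 2.2, the topic commands take --zookeeper localhost:2181 instead of --bootstrap-server):
$ bin/zookeeper-server-start.sh config/zookeeper.properties
$ bin/kafka-server-start.sh config/server.properties
$ jps
$ bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
$ bin/kafka-topics.sh --list --bootstrap-server localhost:9092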
A staple in Kafka advanced interview questions with answers, be prepared to answer this one using your hands-on experience. This is also one of the top interview questions to ask a Kafka developer.
As we know, producers publish messages to the different partitions of a topic. Messages consist of chunks of data, and along with the data, the producer system also sends a key. This key is called the partition key. Data that carries the same key always gets stored in the same partition. Consider a real-world system where we have to track a user's activity while they use the application. We can store that user's data in the same partition by using the partition key; tagging the user's data with a key helps us achieve this objective. Let's say we have to store user u0's data in partition p0; we can then tag u0's data with a specific key, which will ensure that this user's data always gets stored in partition p0. But it does not mean that partition p0 cannot store other users' data. To summarize, the partition key is used to determine the destination partition where messages will be stored. Let's have a look at the below diagram, which clearly explains the usage of the partition key in Kafka architecture.
Kafka stores messages in topics which in turn gets stored in different partitions. The partition is an immutable sequence of ordered messages which is continuously appended to. A message is uniquely identified in the partition by a sequential number called offset. The FIFO behaviour can only be achieved inside the partitions. We can achieve FIFO behaviour by following the steps below :
1: We first need to set the enable auto-commit property to false:
Set enable.auto.commit=false
2: Once messages get processed, we should not make a call to consumer.commitSync();
3: We can then call subscribe() to register the consumer system to the topic.
4: A ConsumerRebalanceListener should be implemented, and within the listener we should call consumer.seek(topicPartition, offset).
5: Once a message gets processed, the offset associated with the message should be stored along with the processed message.
6: We should also ensure idempotent processing as a safety measure (a sketch of a rebalance listener covering steps 4 and 5 follows below).
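A sketch of steps 4 and 5, assuming offsets are kept in an application-side store (here a simple in-memory map stands in for it); on every rebalance the listener seeks each newly assigned partition back to the offset the application last recorded:

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Seeks each newly assigned partition to an offset we stored ourselves (e.g. alongside the processed data).
class SeekToStoredOffsetListener implements ConsumerRebalanceListener {
    private final KafkaConsumer<String, String> consumer;
    private final Map<TopicPartition, Long> storedOffsets = new ConcurrentHashMap<>();  // stand-in for an external offset store

    SeekToStoredOffsetListener(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Offsets are persisted together with the processed messages, so nothing to do here in this sketch.
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        for (TopicPartition tp : partitions) {
            Long offset = storedOffsets.get(tp);
            if (offset != null) {
                consumer.seek(tp, offset);   // resume exactly where our own store says we left off
            }
        }
    }

    void recordProcessed(TopicPartition tp, long nextOffset) {
        storedOffsets.put(tp, nextOffset);   // called by the processing code after each message
    }
}
```

It would be registered with consumer.subscribe(topics, new SeekToStoredOffsetListener(consumer)).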
By default, Kafka does not handle large messages: the maximum message size is 1 MB, but there are ways to increase that limit. We should also make sure to increase the network buffers for our consumer and producer systems. We need to adjust a few properties to achieve this:
So it is clear that if we would like to send large Kafka messages, it can be achieved by tweaking a few properties (typical properties are sketched below). The broker-related config can be found at $KAFKA_HOME/config/server.properties, while the consumer-related config can be found at $KAFKA_HOME/config/consumer.properties.
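The property names below are the standard ones for raising the message size limit; the 10 MB values are examples only and should be sized for your workload:
# Broker (server.properties)
message.max.bytes=10485760
replica.fetch.max.bytes=10485760
# Producer
max.request.size=10485760
# Consumer (consumer.properties)
max.partition.fetch.bytes=10485760
fetch.max.bytes=10485760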
A staple in Kafka advanced interview questions with answers, be prepared to answer this one using your hands-on experience. This is also one of the top interview questions to ask a Kafka developer.
Kafka and Flume are both offerings from the Apache Software Foundation, but there are some key differences. Please find below an overview of both to understand the differences:
Kafka
Flume
When to use which:
1. Flume: when working with non-relational data sources, for example log files that are to be streamed into Hadoop. Kafka: when you need a highly reliable and scalable enterprise-level framework to connect many different systems (including Hadoop).
2. Kafka for Hadoop: Kafka acts like a pipeline that gathers data in real time and pushes it to Hadoop. Hadoop processes it internally and then, according to the requirement, either serves it to different consumers (dashboards, BI, and so on) or stores it for further processing.
| Kafka | Flume |
|---|---|
| Apache Kafka is a general-purpose tool for multiple producers and consumers. | Apache Flume is a special-purpose tool for specific applications. |
| It replicates the events. | It does not replicate the events. |
| Kafka supports data streams for multiple applications. | Flume is specific to Hadoop and big data analysis. |
| Apache Kafka can process and monitor data in distributed systems. | Apache Flume gathers data from distributed systems into a centralized data store. |
| Kafka supports large sets of publishers, subscribers and multiple applications. | Flume supports a large set of source and destination types to land data on Hadoop. |
One can easily follow the below steps to install Kafka :
Step 1: Ensure Java is installed on the machine by running the below command in a terminal:
$ java -version
You will see the Java version if it is installed. In case Java is not installed, we can follow the below steps to install it successfully:
1: Download the latest JDK by visiting the below link: JDK
2: Extract the executables and then move them to the /opt directory.
3: The next step is setting the JAVA_HOME environment variable. We can set this by adding an entry in the ~/.bashrc file.
4: Ensure the above changes take effect in the running shell (for example by sourcing ~/.bashrc), and update the Java alternatives by invoking the appropriate command.
Step 2: The next step is the ZooKeeper framework installation, by visiting the below link: ZooKeeper
1: Once the files have been extracted, we need to modify the config file before starting the ZooKeeper server. Open “conf/zoo.cfg” in an editor to make the changes.
After making the changes, ensure the config file is saved before executing the following command to start the server:
$bin/zkServer.sh start
Once you execute above command below response can be seen:
$JMX enabled by default $Using config: /Users/../zookeeper-3.4.6/bin/../conf/zoo.cfg $Starting zookeeper ...STARTED
2: The next step is starting the CLI:
$ bin/zkCli.sh
The above command connects us to ZooKeeper, and the below response will appear:
Connecting to localhost:2181
……………………
……………………
…………………….
Welcome to ZooKeeper!
……………………
……………………
WATCHER::
WatchedEvent state:SyncConnected type: None path:null
[zk: localhost:2181(CONNECTED) 0]
We can also stop the ZooKeeper server after doing all the basic validations :
Step 3: Now we can move to the Apache Kafka installation by visiting the below link: Kafka
Once Kafka is downloaded locally, we can extract the files by running the extract command:
The above command completes the Kafka installation. After the installation, we need to start the Kafka server:
Shared Message Queue
A shared message queue allows a stream of messages from a producer to reach a single consumer. Each message pushed to the queue is read only once and by only one consumer. Consumers pull messages from the end of the queue, and the queuing system then removes the message from the queue once it has been pulled successfully.
Downsides:
Traditional Publish-Subscribe Systems
The publish-subscribe model allows multiple publishers to publish messages to topics hosted by message brokers, which can be subscribed to by multiple subscribers. A message is thereby broadcast to all the subscribers of a topic.
Downsides:
Let's first understand the concept of the consumer in Kafka architecture. Consumers are the systems or processes that subscribe to topics created on the Kafka broker. Producer systems send messages to topics, and only once messages are committed successfully are subscriber systems allowed to read them. A consumer group is a way of tagging consumer systems together so that consumption can be spread across multiple threads or machines.
As we can see in the above diagram, consumers 1 and 2 are tagged in the same group. We can also see individual consumers reading data from different partitions of the topic. Some common characteristics of consumer groups are as follows:
The recommendation for a consumer group is to have the number of consumer instances equal to the number of partitions. If we go with a greater number of consumers, the excess consumers sit idle, wasting resources. If the number of partitions is greater, some consumers read from more than one partition, which is not an issue unless the ordering of messages matters for the use case; Kafka does not have built-in support for ordering of messages across different partitions.
This is the reason why it is recommended to have the same number of consumers as partitions, to maintain the ordering of messages.
The core part of the Kafka producer API is the “KafkaProducer” class. When we instantiate this class, its constructor takes the options for connecting to the Kafka broker. It has the method “send”, which allows the producer system to send messages to a topic asynchronously:
The Kafka producer also has a flush method, which is used to ensure that previously sent messages have actually left the buffer.
public void send(KeyedMessage<K,V> message) - sends the data to a single topic, partitioned by key, using either the sync or async producer.
public void send(List<KeyedMessage<K,V>> messages) - sends data to multiple topics.
Properties prop = new Properties();
prop.put("producer.type", "async");
ProducerConfig config = new ProducerConfig(prop);
The producer is broadly classified into two types: Sync & Async
A message is sent directly to the broker by a sync producer, while it is sent in the background in the case of an async producer. The async producer is used when we need higher throughput.
The following are the configuration settings available in the producer API:

| S.No | Configuration Setting | Description |
|---|---|---|
| 1 | client.id | Identifies the producer application. |
| 2 | producer.type | Either sync or async. |
| 3 | acks | Controls the criteria under which producer requests are considered complete. |
| 4 | retries | If a producer request fails, automatically retry the specified number of times. |
| 5 | bootstrap.servers | Bootstrapping list of brokers. |
| 6 | linger.ms | To reduce the number of requests, set linger.ms to a value greater than zero. |
| 7 | key.serializer | Serializer class for the key. |
| 8 | value.serializer | Serializer class for the value. |
| 9 | batch.size | Buffer (batch) size. |
| 10 | buffer.memory | Controls the total amount of memory available to the producer for buffering. |
public ProducerRecord (String topic, int partition, K key, V value)
This ProducerRecord constructor creates a record with an explicit partition, along with the key and value.
public ProducerRecord (String topic, K key, V value)
This ProducerRecord constructor creates a record with a key and value, without specifying a partition.
public ProducerRecord (String topic, V value)
This ProducerRecord constructor creates a record without a partition and key.
Typical microservice deployments have many microservices collaborating, and that is a huge problem if not handled properly. It is not practical for each service to have a direct connection with every service it needs to talk to, for two reasons: first, the number of such connections would grow quickly; second, the services being called might be down or may have moved to another server.
If you have 2 services, there are up to 2 direct connections. With 3 services, there are 6. With 4 services, there are 12, and so on. In a sense, such connections can be seen as the coupling between objects in an OO program: you have to interact with other objects, but the lower the coupling between their classes, the more manageable your program is.
Message brokers are a way of decoupling the sending and receiving services through the idea of publish and subscribe. The sending service (producer) posts its message/payload on the message queue, and the receiving service (consumer), which is listening for messages, picks it up. Message brokering is one of the key use cases for Kafka.
Another thing message brokers do is queue or hold the message until the consumer picks it up. If the consumer service is down or busy when the sender sends the message, it can always take it up later. The result is that producer services don't need to worry about checking whether the message has gone through, retrying on failure, and so on.
Kafka is great because it provides both pub-sub and queuing features (traditionally, a broker supported only one or the other). It also ensures that the order of the messages is maintained and not subject to network latency or other factors. Kafka also lets us 'broadcast' messages to multiple consumers if necessary. Kafka's importance lies in building reliable, scalable microservice solutions with minimal configuration.
Kafka has established itself as a market leader among stream processing platforms and is one of the most popular message broker platforms. It works on the publisher-subscriber model of messaging and provides decoupling between producer and consumer systems: they are unaware of each other and work independently. The consumer system has no information about the source system that pushed the messages into the Kafka system. Producer systems publish messages to a topic (a tagged group of messages), and the messages are broadcast to the consumer systems that are subscribed to those topics. It is an event-driven architecture and solves most of the problems faced by traditional messaging platforms. Key features like data partitioning, scalability, low latency, and high throughput are the reasons why it has become a top choice for real-time data integration and data processing needs.
The topic is a very important feature of Kafka architecture. Messages are grouped into topics: the producer system sends messages to a specific topic, while the consumer system reads messages from a specific topic only. Messages in a topic are further distributed into several partitions, which allows the data of a single topic to be spread across multiple brokers. An individual partition can reside on an individual machine, which allows messages from the same topic to be read in parallel. Multiple subscriber systems can process data from multiple partitions, which results in high messaging throughput. Each message within a partition is tagged with a unique identifier called an offset. The offset is sequentially incremented to ensure the ordering of messages. A subscriber system can read data from a specified offset, but it is also allowed to read data from any other offset point.
A multi-tenant system allows multiple clients to be served at the same time. Kafka has inbuilt support for multi-tenancy if we are not concerned with isolation and security; in that sense Kafka is already a multi-tenant system, as everyone can read/write data to the Kafka broker. But a real multi-tenant system should provide isolation and security while serving multiple clients. The security and isolation can be achieved with the setup below:
Two-way SSL (mutual TLS) can be used for authentication. We can also use a token-based identity provider for the same purpose, and we can set up role-based access to topics using ACLs.
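A sketch of the client-side configuration for two-way SSL (the listener address, keystore/truststore paths, and passwords are placeholders):

```java
import java.util.Properties;

public class SslClientConfig {
    public static Properties sslProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");                            // assumed SSL listener address
        props.put("security.protocol", "SSL");                                      // encrypt traffic and authenticate via certificates
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");   // hypothetical paths
        props.put("ssl.truststore.password", "changeit");
        props.put("ssl.keystore.location", "/etc/kafka/client.keystore.jks");       // client cert enables two-way (mutual) TLS
        props.put("ssl.keystore.password", "changeit");
        props.put("ssl.key.password", "changeit");
        return props;
    }
}
```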
The first step for any consumer joining a consumer group is to send a request to the group coordinator. There is a group leader in a consumer group, which is usually the first member of the group. The group leader gets the list of all members from the coordinator; members that have recently sent heartbeats to the group are considered alive, while the other members are dropped from the group. It is the responsibility of the group leader to assign partitions to individual consumers, and it implements a PartitionAssignor to do so.
There is a built-in partition assignment policy for assigning partitions to consumers. Once the assignment is done, the group leader sends that information to the group coordinator, which in turn informs the respective consumers about their assignments. Individual consumers only know their own assignments, while the group leader keeps track of all assignments. This whole process is called partition rebalancing; it happens whenever a new consumer joins the group or an existing one leaves. This step is very critical to the performance and high throughput of messages.
As we know, a consumer system subscribes to topics in Kafka, but it is the poll loop that informs consumers whether any new data has arrived. It is the poll loop's responsibility to handle coordination, partition rebalances, heartbeats, and data fetching. It is the core function in the consumer API, and it keeps polling the server for new data. Let's try to understand the poll loop in Kafka:
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %d, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(), record.key(), record.value());
            // keep a running count of records seen per country (the record value)
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));
        }
    }
} finally {
    consumer.close();  // always close the consumer so the group rebalances promptly
}
Let's consider a scenario where we need to read data from a Kafka topic and, only after some custom validation, add the data into some data storage system. To achieve this, we would develop a consumer application that subscribes to the topic. This ensures that our application starts receiving messages from the topic, on which the data validation and storage process would then run. Now suppose the rate at which messages are published to the topic exceeds the rate at which they are consumed by our consumer application.
If we go with a single consumer, we may fall behind in keeping up with the incoming messages. The solution to this problem is adding more consumers, which scales up the consumption of the topic. This can be easily achieved by creating a consumer group: a consortium of consumers with similar behaviour that read messages from the same topic by splitting the workload. Consumers from the same group usually get their own partitions of the topic, which eventually scales up message consumption and throughput. If we have a single consumer for a given topic with 4 partitions, then it reads messages from all partitions:
The ideal architecture for the above scenario is as below when we have four consumers reading messages from individual partition :
If there are more consumers than partitions, the extra consumers sit idle, which is also not a good architectural design:
There is another scenario as well where we can have more than one consumer groups subscribed to the same topic:
This, along with other Kafka basic questions for freshers, is a regular feature in Kafka interviews, be ready to tackle it with the approach mentioned.
Kafka is a messaging framework developed by apache foundation, which is to create the create the messaging system along with can provide fault tolerant cluster along with the low latency system, to ensure end to end delivery.
Below are the bullet points:
Kafka required other component such as the zookeeper to create a cluster and act as a coordination server
Kafka provide a reliable delivery for messages from sender to receiver apart from that it has other key features as well.
To utilize all this key feature, we need to configure the Kafka cluster properly along with the zookeeper configuration.
Now a days kafka is a key messaging framework, not because of its features even for reliable transmission of messages from sender to receiver, however, below are the key points which should consider.
Considering the above features Kafka is one of the best options to use in Bigdata Technologies to handle the large volume of messages for a smooth delivery.
This is one of the most frequently asked Apache Kafka interview questions for freshers in recent times.
There is plethora of use case, where Kafka fit into the real work application, however I listed below are the real work use case which is frequently using.
Above are the use case where predominately require a Kafka framework, apart from that there are other cases which depends upon the requirement and design.
Let’s talk about some modern source of data now a days which is a data—transactional data such as orders, inventory, and shopping carts — is being augmented with things such as clicking, likes, recommendations and searches on a web page. All this data is deeply important to analyze the consumers behaviors, and it can feed a set of predictive analytics engines that can be the differentiator for companies.
So, when we need to handle this kind of volume of data, we need Kafka to solve this problem.
Kafka process diagram comprises the below essential component which is require to setup the messaging infrastructure.
Communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older version
Topic is a logical feed name to which records are published. Topics in Kafka supports multi-subscriber model, so that topic can have zero, one, or many consumers that subscribe to the data written to it.
Every partition has an ordered and immutable sequence of records which is continuously appended to—a structured commit log. The Kafka cluster durably persists all published records—whether they have been consumed—using a configurable retention period.
Kafka topic is shared into the partitions, which contains messages in an unmodifiable sequence.
This is one of the most frequently asked Apache Kafka interview questions and answers for freshers in recent times.
The offset is a unique identifier of a record within a partition. It denotes the position of the consumer in the partition. Consumers can read messages starting from a specific offset and can read from any offset point they choose.
Topic can also have multiple partition logs like the click-topic has in the image to the right. This allows for multiple consumers to read from a topic in parallel.
The answer to this question encompasses two main aspects – Partitions in a topic and Consumer Groups.
A Kafka topic is divided into partitions. The message sent by the producer is distributed among the topic’s partitions based on the message key. Here we can assume that the key is such that messages would get equally distributed among the partitions.
Consumer Group is a way to bunch together consumers so as to increase the throughput of the consumer application. Each consumer in a group latches to a partition in the topic. i.e. if there are 4 partitions in the topic and 4 consumers in the group then each consumer would read from a single partition. However, if there are 6 partitions and 4 consumers, then the data would be read in parallel from 4 partitions only. Hence its ideal to maintain a 1 to 1 mapping of partition to the consumer in the group.
Now in order to scale up processing at the consumer end, two things can be done:
Doing this would help read data from the topic in parallel and hence scale up the consumer from 2500 messages/sec to 10000 messages per second.
Don't be surprised if this question pops up as one of the top interview questions on Kafka in your next interview.
Dumb broker/Smart producer implies that the broker does not attempt to track which messages have been read by each consumer and only retain unread messages; rather, the broker retains all messages for a set amount of time, and consumers are responsible to track what all messages have been read.
Apache Kafka employs this model only wherein the broker does the work of storing messages for a time (7 days by default), while consumers are responsible for keeping track of what all messages they have read using offsets.
The opposite of this is the Smart Broker/Dumb Consumer model wherein the broker is focused on the consistent delivery of messages to consumers. In such a case, consumers are dumb and consume at a roughly similar pace as the broker keeps track of consumer state. This model is followed by RabbitMQ.
Kafka is a distributed system wherein data is stored across multiple nodes in the cluster. There is a high probability that one or more nodes in the cluster might fail. Fault tolerance means that the data is the system is protected and available even when some of the nodes in the cluster fail.
One of the ways in which Kafka provides fault tolerance is by making a copy of the partitions. The default replication factor is 3 which means for every partition in a topic, two copies are maintained. In case one of the broker fails, data can be fetched from its replica. This way Kafka can withstand N-1 failures, N being the replication factor.
Kafka also follows the leader-follower model. For every partition, one broker is elected as the leader while others are designated, followers. A leader is responsible for interacting with the producer/consumer. If the leader node goes down, then one of the remaining followers is elected as a leader.
Kafka also maintains a list of In Sync replicas. Say the replication factor is 3. That means there will be a leader partition and two follower partitions. However, the followers may not be in sync with the leader. The ISR shows the list of replicas that are in sync with the leader.
As we already know, a Kafka topic is divided into partitions. The data inside each partition is ordered and can be accessed using an offset. Offset is a position within a partition for the next message to be sent by the consumer. There are two types of offsets maintained by Kafka:
Current Offset
Committed Offset
There are two ways to commit an offset:
Prior to Kafka v0.9, Zookeeper was being used to store topic offset, however from v0.9 onwards, the information regarding offset on a topic’s partition is stored on a topic called _consumer_offsets.
An ack or acknowledgment is sent by a broker to the producer to acknowledge receipt of the message. Ack level can be set as a configuration parameter in the Producer and it defines the number of acknowledgments the producer requires the leader to have received before considering a request complete. The following settings are allowed:
In this case, the producer doesn’t wait for any acknowledgment from the broker. No guarantee can be that the broker has received the record.
In this case, the leader writes the record to its local log file and responds back without waiting for acknowledgment from all its followers. In this case, the message can get lost only if the leader fails just after acknowledging the record but before the followers have replicated it, then the record would be lost.
In this case, a set leader waits for all entire sets of in sync replicas to acknowledge the record. This ensures that the record does not get lost as long as one replica is alive and provides the strongest possible guarantee. However it also considerably lessens the throughput as a leader must wait for all followers to acknowledge before responding back.
acks=1 is usually the preferred way of sending records as it ensures receipt of record by a leader, thereby ensuring high durability and at the same time ensures high throughput as well. For highest throughput set acks=0 and for highest durability set acks=all.
It is an open-source message broker system developed by LinkedIn and supported by Apache. The underlying technology being used in Kafka is Java and Scala. It supports the publisher-subscriber model of communication where publisher publishes messages and subscribers get notified when any new message gets published. It is categorised in distributed streaming messages software. Earlier we were using messaging queue and different enterprise messaging systems like RabbitMQ and many others for the same purpose but Kafka has become an industry leader in quick time. It is being used in building stream pipeline in high volume applications to reliably transfer/transform data between different systems. It has an inbuilt fault-tolerant system to store messages. It is distributed and supports partition as well. Kafka runs in a clustered environment which makes it. It is gaining popularity due to its high throughput of messages in a microservice architecture. The software component which publishes messages is called producer while consumers are the one to which messages are broadcasted. In the below diagram you can see different parts of Kafka system:
There are some important components of any Kafka architecture. Please find below an overview of components:
The below diagram explains Producer, Topic, Partition consumer & consumer group.
ZooKeeper is an open-source system. As we know that Kafka brokers work in a cluster environment where several servers process the incoming messages before broadcasting to subscribed consumers. ZooKeeper in Kafka is used to manage broker state within the cluster and to perform leader election for partitions among other coordination tasks. It ensures that the Kafka cluster remains consistent and can handle failures of brokers gracefully. However, starting with Kafka version 2.8, it is possible to run Kafka without ZooKeeper using the KRaft (Kafka Raft Metadata mode). KRaft simplifies Kafka's architecture by removing the external dependency on ZooKeeper and managing metadata directly within Kafka itself. This advancement aims to streamline operations and potentially improve performance by reducing the number of moving parts in the system.
In a cluster, partitions are distributed across nodes where each server share the responsibility of request processing for individual partitions. Even partitions are replicated across several nodes configured in a Kafka environment. This is the reason why Kafka is the system. The system or node which act as a primary server for each partition is termed as a leader while other systems where data gets replicated is called a follower. It is a leader's responsibility to read or write data on given partitions but followers system passively replicates data of leader’s system. In case of failure of the leader system, any of the follower systems becomes a leader. Now we can understand cluster environment architecture where each system playing the role of leader for some of the partitions while other system playing as a follower and making Kafka popular as a fault-tolerant system. This is the reason why Kafka has taken a centre stage in-stream messaging platforms.
It is one of the key concepts in Kafka. The replication ensures data is safe and secure even in case of system failure. The Kafka stores the published messages which are broadcasted to subscribed systems but what if Kafka server goes down, will published messages be available when the system goes down? The answer is yes and it is all possible due to replication behaviour where messages are replicated across multiple servers or nodes. There could be multiple reasons for system failure like program failure, system error or frequent software upgrades. The replication safeguard published data in case of any such failures. The fault-tolerant behaviour is one of the key reasons why Kafka has become a market leader in a very short period of time. ISR which stands for In sync replicas ensures sync between the leader and follower systems. If replicas are not in sync with ISR then it points that follower systems are lagging behind leaders and not catching up with leader activities. Please refer to the below diagram which will make your understanding more crystal clear :
As the name suggests, this exception occurs when producer systems send messages faster than the broker can handle. The queue at the broker end gets full, so no further incoming requests can be accepted. Since producer systems have no information about the capacity of the broker, such exceptions occur and messages overflow at the broker end. To avoid this scenario, we should have multiple systems working as brokers so that messages can be evenly distributed across them. A clustered environment, where multiple nodes share message processing, together with proper partitioning, helps avoid such exceptions.
Although there are scores of benefits of using Kafka, we can list down some key benefits which make the tool popular among stream messaging platforms. The below diagram summarizes the key benefits:
The leader and follower nodes serve the purpose of load balancing in Kafka. As we know, the leader does the actual writing and reading of data for a given partition, while follower systems replicate it passively. This ensures data gets replicated across different nodes, so that in case of any failure, whether a system crash or a software upgrade, the data remains available. If the leader system goes down for any reason, the follower system that was working in passive mode becomes the leader and ensures the data remains available to external systems irrespective of the internal outage. A load balancer does a similar thing: it distributes load across multiple systems when the load increases. In the same way, Kafka balances load by replicating messages on different systems, and when the leader system goes down, another follower system becomes the leader and keeps the data available to subscriber systems.
A staple in Kafka technical interview questions and answers, be prepared to answer this one using your hands-on experience.
Log compaction in Kafka ensures that even if a topic logs millions of messages, the log will retain at least the last known value for each unique key within the topic. This feature benefits users by providing a compact, historical record of state changes, essential for restoring state after a failure or for systems that require a complete history of changes.
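As a minimal sketch (assuming a recent kafka-clients library and a hypothetical topic name "user-balances"), a compacted topic can be created programmatically with the AdminClient by setting cleanup.policy=compact:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "user-balances" is a hypothetical topic; with compaction enabled, at least the
            // latest value per key (for example, per user id) is retained in the log.
            NewTopic topic = new NewTopic("user-balances", 3, (short) 2)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}

With this policy, Kafka periodically discards older records that share a key with a newer record, so the topic converges to the latest value per key.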
In Kafka, custom serialization and deserialization involve converting data between the byte format used for Kafka message storage and the specific data types used in your application. This is necessary to efficiently send and receive messages through Kafka, enabling tailored handling of complex data structures or specific encoding schemes.
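A minimal sketch of a custom serializer/deserializer pair for a hypothetical Order type, assuming a recent kafka-clients version where configure() and close() have default implementations:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical domain object used only for illustration
class Order {
    String id;
    double amount;
    Order(String id, double amount) { this.id = id; this.amount = amount; }
}

// Converts an Order to bytes before it is written to a topic
class OrderSerializer implements Serializer<Order> {
    @Override
    public byte[] serialize(String topic, Order order) {
        if (order == null) return null;
        return (order.id + "," + order.amount).getBytes(StandardCharsets.UTF_8);
    }
}

// Rebuilds the Order from the raw bytes read by the consumer
class OrderDeserializer implements Deserializer<Order> {
    @Override
    public Order deserialize(String topic, byte[] data) {
        if (data == null) return null;
        String[] parts = new String(data, StandardCharsets.UTF_8).split(",");
        return new Order(parts[0], Double.parseDouble(parts[1]));
    }
}

These classes are then registered through the key.serializer/value.serializer configs on the producer and the corresponding deserializer configs on the consumer.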
Kafka achieves message durability through data replication across various brokers within a cluster. You can set a replication factor for each topic, which dictates how many copies of each partition are maintained. Initially, messages are written to a leader partition and subsequently replicated to follower partitions. For added durability, Kafka commits these messages to disk.
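A small sketch of producer settings commonly used to favour durability; the broker address, topic name and values below are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for the leader and all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);  // retry transient failures instead of dropping records

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "order-1", "created"));
        }
    }
}

On the topic side, a replication factor greater than 1 and the min.insync.replicas setting complement acks=all so that a write is only acknowledged once enough replicas have it.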
A producer is a client that sends or publishes records. Producer applications write data to topics and consumer applications read from topics.
Messages sent by a producer to a topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
A consumer is a subscriber that consumes messages, which are stored in partitions. A consumer is a separate process and can be a separate application altogether, running on an individual machine.
If all the consumers fall into the same consumer group, the messages are load balanced across the consumer instances; if the consumer instances fall into different groups, each message is broadcast to every consumer group.
The working principle of Kafka follows the below order.
Apart from other benefits, below are the key advantages of using Kafka messaging framework.
Considering all the above advantages, Kafka is one of the most popular frameworks utilized in microservice, big data, enterprise integration and publish-subscribe architectures.
Expect to come across this, one of the most important Kafka interview questions for experienced professionals in data management, in your next interviews.
Despite these advantages, setting up and configuring the Kafka ecosystem is a bit difficult and requires good knowledge to implement. Apart from that, listed below are some more use cases.
ZooKeeper is a distributed, open-source configuration and synchronization service that also provides a naming registry for distributed applications.
ZooKeeper is a separate component and is not something we implement inside Kafka itself; however, when we need to run a Kafka cluster, we have to set it up as a coordination server.
ZooKeeper plays a significant role in cluster management, for example in fault tolerance: it detects when a broker goes down so that its partitions can be served from replicas on other brokers.
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.11
version = 2.2.0

groupId = org.apache.zookeeper
artifactId = zookeeper
version = 3.4.5
These dependencies come with child (transitive) dependencies, which are downloaded and added to the application as part of the parent dependency.
Here are the packages which need to be imported in Java/Scala.
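The exact list is not shown above; as an assumption based on the spark-streaming-kafka-0-10 dependency, a typical set of imports for a Java application looks like this:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;   // Subscribe(...) to choose topics
import org.apache.spark.streaming.kafka010.KafkaUtils;           // createDirectStream(...)
import org.apache.spark.streaming.kafka010.LocationStrategies;   // PreferConsistent()

The concrete imports depend on which integration (Spark Streaming, plain consumer, etc.) the application uses.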
A must-know for anyone looking for top Kafka interview questions, this is one of the frequently asked interview questions on Kafka.
Kafka follows a pub-sub mechanism wherein a producer writes to a topic and one or more consumers read from that topic. However, reads in Kafka always lag behind writes, as there is always some delay between the moment a message is written and the moment it is consumed. This delta between the latest offset and the consumer offset is called consumer lag.
There are various open-source tools available to measure consumer lag, e.g. LinkedIn Burrow. Confluent Kafka also comes with out-of-the-box tools to measure lag.
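Lag can also be checked with the kafka-consumer-groups.sh tool shipped with Kafka, or computed programmatically. A minimal sketch using the AdminClient (the group name "payments-app" is hypothetical):

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(adminProps)) {
            // Offsets the group has committed so far
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("payments-app")
                         .partitionsToOffsetAndMetadata().get();

            Properties consumerProps = new Properties();
            consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
                // Latest (log end) offsets for the same partitions
                Map<TopicPartition, Long> latest = consumer.endOffsets(committed.keySet());
                committed.forEach((tp, om) ->
                        System.out.printf("%s lag = %d%n", tp, latest.get(tp) - om.offset()));
            }
        }
    }
}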
With the Kafka messaging system, three different types of delivery semantics can be achieved: at-most-once, at-least-once and exactly-once.
Kafka transactions help achieve exactly-once semantics between Kafka brokers and clients. In order to achieve this, we need to set the below properties at the producer end: enable.idempotence=true and transactional.id=<some unique id>. We also need to call initTransactions() to prepare the producer to use transactions. With these properties set, if the producer (identified by its producer id) accidentally sends the same message to Kafka more than once, the Kafka broker detects and de-duplicates it.
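A minimal sketch of a transactional producer with the kafka-clients API; the topic, key, value and transactional id below are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-tx-1"); // must be unique per producer instance

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();      // registers the transactional id with the broker
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("payments", "order-42", "PAID"));
            producer.commitTransaction();     // or abortTransaction() if processing fails
        }
    }
}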
Kafka is a durable, distributed and scalable messaging system designed to support high volume transactions. Use cases that require a publish-subscribe mechanism at high throughput are a good fit for Kafka. In case you need a point to point or request/reply type communication then other messaging queues like RabbitMQ can be considered.
Kafka is a good fit for real-time stream processing. It uses a dumb broker/smart consumer model, with the broker merely acting as a message store. So a scenario wherein the consumer cannot be smart and requires the broker to be smart instead is not a good fit for Kafka. In such a case, RabbitMQ can be used, which uses a smart-broker model where the broker is responsible for consistent delivery of messages at a roughly similar pace.
Also, in cases where protocols like AMQP or MQTT and features like message routing are needed, RabbitMQ is a better alternative to Kafka.
This is a common yet one of the most tricky Kafka interview questions and answers for experienced professionals, don't miss this one.
A producer publishes messages to one or more Kafka topics. The message contains information related to what topic and partition should the message be published to.
There are three different types of producer APIs –
Kafka messages are key-value pairs. The key is used for partitioning messages being sent to the topic. When writing a message to a topic, the producer has an option to provide the message key. This key determines which partition of the topic the message goes to. If the key is not specified, then the messages are sent to partitions of the topic in round robin fashion.
Note that Kafka orders messages only inside a partition, hence choosing the right partition key is an important factor in application design.
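A minimal sketch showing keyed versus unkeyed sends with the kafka-clients producer; the topic, key and values are hypothetical:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key ("user-1001") always hash to the same partition,
            // so their relative order is preserved for that user.
            producer.send(new ProducerRecord<>("user-activity", "user-1001", "clicked:home"));
            producer.send(new ProducerRecord<>("user-activity", "user-1001", "added:cart"));
            // A record without a key is spread across partitions by the default partitioner.
            producer.send(new ProducerRecord<>("user-activity", "anonymous-view"));
        }
    }
}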
Kafka supports data replication within the cluster to ensure high availability. But enterprises often need data availability guarantees to span the entire cluster and even withstand site failures.
The solution to this is Mirror Maker – a utility that helps replicate data between two Kafka clusters within the same or different data centers.
MirrorMaker is essentially a Kafka consumer and producer hooked together. The origin and destination clusters are completely different entities and can have a different number of partitions and offsets, however, the topic names should be the same between source and a destination cluster. The MirrorMaker process also retains and uses the partition key so that ordering is maintained within the partition.
Both belong to the same league of message streaming platforms. RabbitMQ belongs to the traditional league of messaging platforms and supports several protocols. It is an open-source message broker with a reasonable number of features, and it supports the AMQP messaging protocol with routing features.
Kafka was written in Scala and first introduced at LinkedIn to facilitate intra-system communication. Kafka is now developed under the umbrella of the Apache Software Foundation and is better suited to an event-driven ecosystem. Now let's compare both platforms. Kafka is a distributed, scalable and high-throughput system compared to RabbitMQ, and it also scores much better in terms of performance: RabbitMQ can process around 20,000 messages per second, while Kafka can process roughly five times more.
Please find below diagram detailing out key differences.
There are four core APIs available in Kafka: the Producer API, Consumer API, Streams API and Connect API. Please find below an overview of each:
Geo-replication enables replication across different data centres or different clusters. Kafka MirrorMaker enables geo-replication; this process is called mirroring. Mirroring is different from replication across nodes within the same cluster. Kafka's MirrorMaker ensures that messages from topics belonging to one or more source Kafka clusters are replicated to the destination cluster with the same topic names.
We should use at least one MirrorMaker process to replicate one source cluster. We can run multiple MirrorMaker processes mirroring topics within the same consumer group. This enables high throughput and makes the system fault tolerant: if one of the MirrorMaker processes goes down, the others can take over the additional load. One important point is that the source and destination clusters are independent of each other and can have different partitions and offsets.
Kafka is dependent on ZooKeeper, so we must first start ZooKeeper before starting the Kafka server. Please find below the step-by-step process to start the Kafka server; typical commands are sketched after the steps:
1: Start ZooKeeper by typing the following command in a terminal:
2: Once ZooKeeper is running, we can start the Kafka server by running the following command:
3: The next step is checking the services running in the background with the below command:
4: Once the Kafka server is running, we can create a topic by running the below command:
5: We can check the available topics by triggering the below command in the terminal:
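The commands themselves are not shown above; for a typical local installation the steps map to the commands below (a sketch only, since script names and flags vary by Kafka version, e.g. older releases use --zookeeper localhost:2181 in place of --bootstrap-server):

$ bin/zookeeper-server-start.sh config/zookeeper.properties      (step 1)
$ bin/kafka-server-start.sh config/server.properties             (step 2)
$ jps                                                            (step 3: QuorumPeerMain and Kafka should be listed)
$ bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1   (step 4)
$ bin/kafka-topics.sh --list --bootstrap-server localhost:9092   (step 5)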
A staple in Kafka advanced interview questions with answers, be prepared to answer this one using your hands-on experience. This is also one of the top interview questions to ask a Kafka developer.
As we know, producers publish messages to the different partitions of a topic. A message consists of a chunk of data, and along with the data the producer system can also send a key; this key is called the partition key. Data that carries the same key is always stored in the same partition. Consider a real-world system where we have to track a user's activity while they use the application. We can store that user's data in the same partition by using the partition key, so tagging the user's data with a key helps us achieve this objective. Let's say we have to store user u0's data in partition p0; we can tag u0's data with a key which ensures that this user's data always goes to partition p0. That does not mean partition p0 cannot store other users' data. To summarize, the partition key is used to determine the destination partition where a message will be stored. Let's have a look at the below diagram, which clearly explains the usage of the partition key in the Kafka architecture.
Kafka stores messages in topics which in turn gets stored in different partitions. The partition is an immutable sequence of ordered messages which is continuously appended to. A message is uniquely identified in the partition by a sequential number called offset. The FIFO behaviour can only be achieved inside the partitions. We can achieve FIFO behaviour by following the steps below :
1: We need to first set the enable auto-commit property to false:
Set enable.auto.commit=false
2: Once messages get processed, we should not call consumer.commitSync(); the offset is stored manually instead (see step 5).
3: We should then call "subscribe", which registers the consumer system to the topic.
4: A ConsumerRebalanceListener should be implemented, and within the listener we should call consumer.seek(topicPartition, offset).
5: Once a message gets processed, the offset associated with the message should be stored along with the processed message.
6: We should also ensure idempotent processing as a safety measure.
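Putting the above steps together, here is a minimal sketch with the kafka-clients consumer API; the topic, group id and the offset-store helpers (loadStoredOffset, saveProcessedRecord) are hypothetical placeholders for whatever store keeps the offset next to the processed result:

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderedConsumerSketch {
    // Placeholders for an external offset store (database, file, etc.)
    static long loadStoredOffset(TopicPartition tp) { return 0L; }
    static void saveProcessedRecord(ConsumerRecord<String, String> rec) { /* write result + rec.offset() atomically */ }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // step 1
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singleton("orders"), new ConsumerRebalanceListener() {   // steps 3 and 4
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Resume each assigned partition from the offset stored alongside the processed data
                for (TopicPartition tp : partitions) {
                    consumer.seek(tp, loadStoredOffset(tp));
                }
            }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                saveProcessedRecord(record);   // step 5: store result and offset together, idempotently (step 6)
            }
        }
    }
}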
By default, Kafka does not handle large messages. The maximum message size is 1 MB, but there are ways to increase that limit. We should also make sure the network buffers are increased for our consumer and producer systems. We need to adjust a few properties to achieve this (a sketch of typical settings is given below):
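These are the properties typically tuned; the 10 MB value below is only an assumed example and the same limit should be applied consistently on the broker, producer and consumer sides:

# broker side ($KAFKA_HOME/config/server.properties)
message.max.bytes=10485760
replica.fetch.max.bytes=10485760

# producer side
max.request.size=10485760

# consumer side ($KAFKA_HOME/config/consumer.properties)
max.partition.fetch.bytes=10485760
fetch.max.bytes=10485760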
Now it is clear that if we would like to send large Kafka messages, it can be achieved by tweaking the few properties explained above. The broker-related config can be found at $KAFKA_HOME/config/server.properties, while the consumer-related config is found at $KAFKA_HOME/config/consumer.properties.
A staple in Kafka advanced interview questions with answers, be prepared to answer this one using your hands-on experience. This is also one of the top interview questions to ask a Kafka developer.
Kafka and Flume are both offerings from Apache, but there are some key differences. Please find below an overview of both to understand the differences:
Kafka
Flume
When to use which:
1. Flume: when working with non-relational data sources, for example log files that are to be streamed into Hadoop. Kafka: when you need a highly reliable and scalable enterprise-level framework to connect numerous different systems (including Hadoop).
2. Kafka for Hadoop: Kafka resembles a pipeline that gathers data continuously and pushes it to Hadoop. Hadoop processes it internally and then, as per the requirement, either serves it to different consumers (dashboards, BI, and so on) or stores it for further processing.
Kafka | Flume |
---|---|
Apache Kafka is a general-purpose tool supporting multiple producers and consumers. | Apache Flume is a special-purpose tool for specific applications. |
It replicates the events. | It does not replicate the events. |
Kafka supports data streams for multiple applications. | Flume is specific to Hadoop and big data analysis. |
Apache Kafka can process and monitor data in distributed systems. | Apache Flume gathers data from distributed systems to a centralized data store. |
Kafka supports large sets of publishers, subscribers and multiple applications. | Flume supports a large set of source and destination types to land data on Hadoop. |
One can easily follow the below steps to install Kafka :
Step 1: Ensure Java is installed on the machine by running the below command in a terminal:
$ java -version
You will see the Java version if it is installed. In case Java is not installed, follow the below steps to install it:
1: Download the latest JDK by visiting below link: JDK
2: Extract the archive and then move it to the /opt directory.
3: The next step is setting the JAVA_HOME environment variable. We can set this by adding the export command to the ~/.bashrc file.
4: Ensure the above changes take effect in the running shell, and register the new JDK with the Java alternatives system by invoking the update-alternatives command.
Step 2: Next step is ZooKeeper framework installation by visiting the below link: ZooKeeper
1: Once the files have been extracted, we need to modify the configuration file before starting the ZooKeeper server. We can use the below command to open conf/zoo.cfg.
After making the changes, ensure the config file is saved before executing the following command to start the server:
$bin/zkServer.sh start
Once you execute the above command, the below response can be seen:
JMX enabled by default
Using config: /Users/../zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
2: The next step is starting the ZooKeeper CLI:
$ bin/zkCli.sh
The above command connects to the ZooKeeper server, and the below response will appear:
Connecting to localhost:2181
...
Welcome to ZooKeeper!
...
WATCHER::
WatchedEvent state:SyncConnected type: None path:null
[zk: localhost:2181(CONNECTED) 0]
We can also stop the ZooKeeper server after doing all the basic validations :
Step 3: Now we can move to apache Kafka installation by visiting the below link: Kafka
Once Kafka is downloaded locally we can extract the files by running the command :
The above command extracts the Kafka files. After installation, we need to start the Kafka server:
Shared Message Queue
A shared message queue allows a stream of messages from a producer to be served to a single consumer. Each message pushed to the queue is read only once and by only one consumer. Consumers pull messages from the end of the queue, and the queuing system removes a message from the queue once it has been pulled successfully.
Downsides:
Traditional Publish-Subscribe Systems
The publish-subscribe model allows multiple publishers to publish messages to topics hosted by message brokers, which can be subscribed to by multiple subscribers. A message is thus broadcast to all the subscribers of a topic.
Downsides:
Let's first understand the concept of the consumer in Kafka architecture. Consumers are the systems or processes which subscribe to topics created on the Kafka broker. Producer systems send messages to topics, and only once messages are committed successfully are subscriber systems allowed to read them. A consumer group is a way of tagging consumer systems together so that consumption can be spread across multiple threads or machines.
As we can see in the above diagram, two consumers, 1 & 2, are tagged into the same group. Also, we can see that individual consumers read data from different partitions of the topic. Some common characteristics of consumer groups are as follows:
The recommendation for a consumer group is to have the number of consumer instances in line with the number of partitions. If we go with a greater number of consumers, the excess consumers will sit idle and waste resources. If the number of partitions is greater, some consumers will read from more than one partition, which is not an issue unless the ordering of messages matters for the use case, because Kafka does not have built-in support for ordering messages across different partitions.
This is why it is recommended to have the same number of consumers as partitions when the ordering of messages must be maintained.
The core part of the Kafka producer API is the "KafkaProducer" class. Once we instantiate this class, it connects to the Kafka broker through the configuration passed to its constructor. It has the method "send", which allows the producer system to send messages to a topic asynchronously:
The Kafka producer also has a flush method, which is used to ensure that previously sent messages have actually left the buffer and been sent to the broker.
public void send(KeyedMessage<k,v> message) - sends the data to a single topic, partitioned by key, using either the sync or async producer.
public void send(List<KeyedMessage<k,v>> messages) - sends data to multiple topics.

Properties prop = new Properties();
prop.put("producer.type", "async");
ProducerConfig config = new ProducerConfig(prop);
The producer is broadly classified into two types: Sync & Async
A message is sent directly to the broker and the call blocks until it is acknowledged with a sync producer, while it is sent in the background with an async producer. Async producers are used when we need higher throughput.
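With the current kafka-clients API the same distinction is expressed through how send() is used; a minimal sketch (topic, key and value are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SyncVsAsyncSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("events", "key-1", "value-1");

            // Synchronous style: block until the broker acknowledges the write
            RecordMetadata meta = producer.send(record).get();
            System.out.println("written to partition " + meta.partition() + " at offset " + meta.offset());

            // Asynchronous style: return immediately and handle the result in a callback (higher throughput)
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                }
            });
        }
    }
}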
The following are the configuration settings listed in producer API :
S.No | Configuration Settings and Description |
---|---|
1 | client.id identifies producer application |
2 | producer.type either sync or async |
3 | acks Controls the criteria under which producer requests are considered complete. |
4 | retries If a producer request fails, it is automatically retried up to the configured number of times. |
5 | bootstrap.servers Bootstrapping list of brokers. |
6 | linger.ms If you want to reduce the number of requests, set linger.ms to a value greater than 0 so the producer waits briefly and batches records together. |
7 | key.serializer Serializer class for the record key. |
8 | value.serializer Serializer class for the record value. |
9 | batch.size Maximum size (in bytes) of a batch of records sent to the same partition. |
10 | buffer.memory Controls the total amount of memory available to the producer for buffering. |
public ProducerRecord(String topic, Integer partition, K key, V value)
This constructor creates a record with a topic, an explicit partition, a key and a value.
public ProducerRecord(String topic, K key, V value)
This constructor creates a record with a key and value but without an explicit partition.
public ProducerRecord(String topic, V value)
This constructor creates a record without a partition and without a key.
Typical microservice deployments have many microservices collaborating, and that is a huge issue if not handled properly. It is not practical for each service to have a direct connection with every service it needs to talk to, for two reasons: first, the number of such connections grows quickly; second, the services being called might be down or may have moved to another server.
If you have 2 services, there are up to 2 direct connections. With 3 services, there are 6. With 4 services, there are 12, and so on (with n services there are up to n(n-1) directed connections). Such connections can be seen as the coupling between objects in an OO program: you have to interact with other objects, but the lower the coupling between their classes, the more manageable your program is.
Message brokers are a way of decoupling the sending and receiving services through the idea of publish and subscribe. The sending service (producer) posts its message/payload on the message queue, and the receiving service (consumer), which is listening for messages, picks it up. Message brokering is one of the key use cases for Kafka.
Another thing message brokers do is queue, or hold, the message until the consumer picks it up. If the consumer service is down or busy when the sender sends the message, it can always pick it up later. The result is that producer services do not need to worry about checking whether the message got through, retrying on failure, and so on.
Kafka is powerful because it enables us to have both pub-sub and queuing features (traditionally, only one or the other was supported by such brokers). It also ensures that the order of messages is maintained and not subject to network latency or other factors. Kafka additionally enables us to broadcast messages to multiple consumers if necessary. Kafka's importance lies in building reliable, scalable microservice solutions with minimal configuration.
Kafka has established itself as a market leader among stream-processing platforms and is one of the most popular message broker platforms. It works on the publish-subscribe model of messaging and provides decoupling between producer and consumer systems: they are unaware of each other and work independently. The consumer system has no information about the source system which pushed the messages into Kafka. Producer systems publish messages on a topic (a grouping of messages), and messages are broadcast to the consumer systems subscribed to those topics. It is an event-driven architecture and solves most of the problems faced by traditional messaging platforms. Key features like data partitioning, scalability, low latency and high throughput are the reasons why it has become a top choice for real-time data integration and data processing needs.
The topic is a very important feature of the Kafka architecture. Messages are grouped into topics. The producer system sends messages to a specific topic, while the consumer system reads messages from a specific topic only. The messages in a topic are further distributed into several partitions, and each partition is replicated across multiple brokers. An individual partition can reside on an individual machine, which allows reading from the same topic in parallel. Multiple subscriber systems can process data from multiple partitions, which results in high messaging throughput. A unique identifier, called the offset, is tagged to each message within a partition; the offset is incremented sequentially to preserve the ordering of messages. A subscriber system can read data from a specified offset, but it is also allowed to read from any other offset point.
A multi-tenant system serves multiple clients at the same time. Kafka has built-in support for multi-tenancy if we are not concerned with isolation and security: everyone can read and write data to a Kafka broker. But a real multi-tenant system should provide isolation and security while serving multiple clients. That security and isolation can be achieved with the below setup:
Two-way SSL can be used for authentication/authorization. We can also use a token-based identity provider for the same purpose, and we can set up role-based access to topics using ACLs.
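As a minimal sketch, the client-side settings for two-way (mutual) TLS look roughly like the following; the broker address, keystore/truststore paths and passwords are placeholders, and topic-level ACLs are usually managed separately with the kafka-acls.sh tool:

import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SslConfigs;

public class MutualTlsClientConfig {
    public static Properties tlsProps() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093");
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");
        // The keystore holds the client certificate used for two-way (mutual) TLS
        props.put(SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG, "/etc/kafka/client.keystore.jks");
        props.put(SslConfigs.SSL_KEYSTORE_PASSWORD_CONFIG, "changeit");
        props.put(SslConfigs.SSL_KEY_PASSWORD_CONFIG, "changeit");
        return props;
    }
}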
The first step for any consumer to join a consumer group is raising a request to the group coordinator. Every consumer group has a group leader, which is usually the first member of the group. The group leader gets the list of all members from the coordinator; consumers that have recently sent heartbeats are considered alive, while other members are dropped from the group. It is the responsibility of the group leader to assign partitions to individual consumers, which it does by implementing a PartitionAssignor.
There is a built-in partition assignment policy for assigning partitions to consumers. Once the assignment is done, the group leader sends that information to the group coordinator, which in turn informs the respective consumers about their assignments. Individual consumers only know their own assignments, while the group leader keeps track of all assignments. This whole process is called partition rebalancing, and it happens whenever a new consumer joins the group or an existing one leaves. This step is very critical to performance and high message throughput.
As we know, a consumer system subscribes to topics in Kafka, but it is the poll loop which informs the consumer whether any new data has arrived. The poll loop is responsible for handling coordination, partition rebalances, heartbeats and data fetching. poll is the core function in the consumer API, and the consumer keeps calling it to ask the server for new data. Let's try to understand the poll loop in Kafka:
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %d, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(), record.key(), record.value());
            // keep a running count per country (the record value)
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));
        }
    }
} finally {
    consumer.close();   // leave the group cleanly when the loop exits
}
Let’s consider a scenario where we need to read data from the Kafka topic and only after some custom validation, we can add data into some data storage system. To achieve this we would develop some consumer application which will subscribe to the topic. This ensures that our application will start receiving messages from the topic on which data validation and storage process would run eventually. Now we come across a scenario where messages publishing rate to topic exceed the rate at which it is consumed by our consumer application.
If we go with a single consumer then we may fall behind keeping our system updated with incoming messages. The solution to this problem is by adding more consumers. This will scale up the consumption of topics. This can be easily achieved by creating a consumer group, the consortium under which similar behaviour consumers would reside which can read messages from the same topic by splitting the workload. Consumers from the same group usually get their partition of the topic which eventually scales up message consumption and throughput. In case if we have a single consumer for a given topic with 4 partitions then it will read messages from all partitions :
The ideal architecture for the above scenario is as below when we have four consumers reading messages from individual partition :
If there are more consumers than partitions, the extra consumers sit idle, which is also not a good architecture design:
There is another scenario as well where we can have more than one consumer groups subscribed to the same topic:
This, along with other Kafka basic questions for freshers, is a regular feature in Kafka interviews, be ready to tackle it with the approach mentioned.
Apache Kafka is an open-source stream-processing software program developed by Linkedin and donated to the Apache Software Foundation.
The increase in popularity of Apache Kafka has led to an extensive increase in demand for professionals who are certified in the field of Apache Kafka. It is a highly appealing option for data integration as it provides a unified, low-latency, high-throughput platform to handle real-time data feeds. Other features such as scalability, data partitioning and its ability to handle numerous diverse consumers make it even more desirable for data integration use cases. To mention, Apache Kafka has a market share of about 9.1%. It is the best opportunity to move ahead in your career. This is also the best time to enroll for Apache Kafka training.
There are many companies who use Apache Kafka. According to cwiki.apache.org, the top companies that use Kafka are LinkedIn, Yahoo, Twitter, Netflix, etc.
According to indeed.com, the average salary for apache kafka architect for Senior Technical Lead ranges from $101,298 per year to $148,718 per year for Enterprise Architect.
With a lot of research, we have brought you a few Apache Kafka interview questions that you might encounter in your upcoming interview. These Apache Kafka interview questions and answers for experienced professionals and freshers alone will help you crack the Apache Kafka interview and give you an edge over your competitors. So, in order to succeed in the interview, you need to read, re-read and practice these Apache Kafka interview questions as much as possible. You can do this with ease if you are enrolled in a big data training program.
If you wish to make a career and have Apache Kafka interviews lined up, then you need not fret. Take a look at the set of Apache Kafka interview questions assembled by experts. These kafka interview questions for experienced as well as freshers with detailed answers will guide you in a whole new manner to crack the Apache Kafka interviews. Stay focused on the essential interview questions on Kafka and prepare well to get acquainted with the types of questions that you may come across in your interview on Apache Kafka.
Hope these Kafka Interview Questions will help you to crack the interview. All the best!