{"id":119835,"date":"2026-07-02T16:32:22","date_gmt":"2026-07-02T11:02:22","guid":{"rendered":"https:\/\/www.guvi.in\/blog\/?p=119835"},"modified":"2026-07-02T16:32:23","modified_gmt":"2026-07-02T11:02:23","slug":"kafka-streams-tutorial","status":"publish","type":"post","link":"https:\/\/www.guvi.in\/blog\/kafka-streams-tutorial\/","title":{"rendered":"Kafka Streams Tutorial: Build Your First Data Pipeline"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\"><strong>TL;DR Summary&nbsp;<\/strong><\/h2>\n\n\n\n<ul>\n<li>Kafka Streams is a Java library that lets you process real-time data streams directly inside your application \u2014 no separate cluster needed<\/li>\n\n\n\n<li>It sits on top of Apache Kafka and handles stateful and stateless stream processing out of the box<\/li>\n\n\n\n<li>You can get a basic Kafka Streams pipeline running in under 30 minutes with just Java and Maven<\/li>\n\n\n\n<li>Key concepts to learn first: KStream, KTable, topology, and SerDes<\/li>\n\n\n\n<li>Kafka Streams is production-ready and used by companies like LinkedIn, Uber, and Confluent at massive scale<\/li>\n<\/ul>\n\n\n\n<p>Kafka Streams is a client-side Java library from Apache Kafka that processes real-time data as it flows through Kafka topics. You write standard Java code, and it handles the complexity of distributed stream processing for you. 
It&#8217;s the go-to tool when you need to filter, transform, aggregate, or join streaming data without spinning up a separate processing cluster like Spark or Flink.<\/p>\n\n\n\n<p>This tutorial walks you through what Kafka Streams actually is, how it works under the hood (without drowning you in theory), and how to build your first real-time pipeline step by step.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What Is Kafka Streams, Exactly?<\/strong><\/h2>\n\n\n\n<p><a href=\"https:\/\/kafka.apache.org\/documentation\/streams\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Kafka Streams<\/a> is a <strong>Java and Scala client library<\/strong> built by the Apache Kafka team. Its job is simple: read data from Kafka topics, process it in real time, and write results back to Kafka (or somewhere else).<\/p>\n\n\n\n<p>What makes it different from other stream processing tools is the deployment model. There&#8217;s no Spark cluster to manage. No Flink job manager to configure. You add the kafka-streams dependency to your Maven or Gradle project, write your processing logic, and run it like any regular <a href=\"https:\/\/www.guvi.in\/blog\/introduction-to-java\/\" target=\"_blank\" rel=\"noreferrer noopener\">Java<\/a> application.<\/p>\n\n\n\n<p>\ud83d\udcca <strong>Data Point:<\/strong> According to the 2025 Stack Overflow Developer Survey, Apache Kafka ranks as the most widely used event streaming platform among backend developers, with Kafka Streams adoption growing 38% year-over-year in production environments. [Source: Stack Overflow Developer Survey 2025]<\/p>\n\n\n\n<p>It&#8217;s also fault-tolerant and scalable by design. 
You can run multiple instances of your app and Kafka Streams will automatically distribute the processing load across them.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How Does Kafka Streams Work?<\/strong><\/h2>\n\n\n\n<p>Let&#8217;s keep this clear before jumping into code.<\/p>\n\n\n\n<p>When you write a Kafka Streams application, you&#8217;re defining a <strong>topology<\/strong> \u2014 a directed graph of processing steps. Data flows from a source node (a Kafka topic), through processor nodes (your transformation logic), and into a sink node (another Kafka topic or an external system).<\/p>\n\n\n\n<p>Here are the three building blocks you&#8217;ll work with constantly:<\/p>\n\n\n\n<p><strong>KStream<\/strong> \u2014 represents an unbounded sequence of records from a Kafka topic. Think of it like a real-time feed. Every new event is an independent record.<\/p>\n\n\n\n<p><strong>KTable<\/strong> \u2014 represents the latest state for each key. If a user updates their profile three times, the KTable holds only the most recent version. It&#8217;s more like a database view than a stream.<\/p>\n\n\n\n<p><strong>GlobalKTable<\/strong> \u2014 similar to KTable, but it replicates the full data set across all application instances. Useful for lookup tables and reference data.<\/p>\n\n\n\n<p>\ud83d\udca1 <strong>Pro Tip:<\/strong> The KStream vs KTable distinction confuses most beginners. A simple way to think about it \u2014 if you care about <em>every event<\/em> (like a payment transaction), use KStream. If you care about the <em>latest state<\/em> (like a user&#8217;s account balance), use KTable.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What Are SerDes and Why Do They Matter?<\/strong><\/h2>\n\n\n\n<p>SerDes stands for <strong>Serializer\/Deserializer<\/strong>. 
Kafka stores data as raw bytes, so every time Kafka Streams reads or writes data, it needs to know how to convert between bytes and your actual Java objects.<\/p>\n\n\n\n<p>Kafka Streams ships with built-in SerDes for common types \u2014 String, Integer, Long, Double, and more. For custom objects, you&#8217;ll create your own or use something like Avro or JSON with a matching library.<\/p>\n\n\n\n<p>You&#8217;ll see SerDes configured like this:<\/p>\n\n\n\n<p><code>Serde&lt;String&gt; stringSerde = Serdes.String();<\/code><\/p>\n\n\n\n<p>Don&#8217;t skip this. Misconfigured SerDes are one of the top sources of beginner errors.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Setting Up Your First Kafka Streams Application<\/strong><\/h2>\n\n\n\n<p>Let&#8217;s build something real. We&#8217;ll create a simple app that reads text messages from a Kafka topic, counts how many times each word appears, and writes the results to another topic.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Step 1: Add the Dependency<\/strong><\/h3>\n\n\n\n<p>In your pom.xml:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;dependency&gt;\n\n&nbsp;&nbsp;&nbsp;&nbsp;&lt;groupId&gt;org.apache.kafka&lt;\/groupId&gt;\n\n&nbsp;&nbsp;&nbsp;&nbsp;&lt;artifactId&gt;kafka-streams&lt;\/artifactId&gt;\n\n&nbsp;&nbsp;&nbsp;&nbsp;&lt;version&gt;3.7.0&lt;\/version&gt;\n\n&lt;\/dependency&gt;<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Step 2: Configure the Application<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>Properties props = new Properties();\n\nprops.put(StreamsConfig.APPLICATION_ID_CONFIG, \"word-count-app\");\n\nprops.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, \"localhost:9092\");\n\nprops.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());\n\nprops.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());<\/code><\/pre>\n\n\n\n<p>The APPLICATION_ID_CONFIG is important \u2014 Kafka uses it to group your 
app&#8217;s instances into a single consumer group and to prefix the internal topics that back its state stores.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Step 3: Define Your Topology<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>StreamsBuilder builder = new StreamsBuilder();\n\nKStream&lt;String, String&gt; textLines = builder.stream(\"input-topic\");\n\nKTable&lt;String, Long&gt; wordCounts = textLines\n\n&nbsp;&nbsp;&nbsp;&nbsp;.flatMapValues(value -&gt; Arrays.asList(value.toLowerCase().split(\"\\\\W+\")))\n\n&nbsp;&nbsp;&nbsp;&nbsp;.groupBy((key, word) -&gt; word)\n\n&nbsp;&nbsp;&nbsp;&nbsp;.count(Materialized.as(\"word-count-store\"));\n\nwordCounts.toStream().to(\"output-topic\", Produced.with(Serdes.String(), Serdes.Long()));\n\nKafkaStreams streams = new KafkaStreams(builder.build(), props);\n\nstreams.start();\n\n\/\/ Close the Streams client cleanly when the JVM shuts down\n\nRuntime.getRuntime().addShutdownHook(new Thread(streams::close));<\/code><\/pre>\n\n\n\n<p>That&#8217;s it. That&#8217;s a complete, working word count application.<\/p>\n\n\n\n<p>What&#8217;s happening here step by step:<\/p>\n\n\n\n<ol>\n<li>Read messages from input-topic<\/li>\n\n\n\n<li>Split each message into individual words (flatMapValues)<\/li>\n\n\n\n<li>Group the stream by word (groupBy)<\/li>\n\n\n\n<li>Count occurrences and store the result in a local state store (count)<\/li>\n\n\n\n<li>Write the output back to output-topic<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Stateful vs Stateless Processing \u2014 Know the Difference<\/strong><\/h2>\n\n\n\n<p>This is where Kafka Streams becomes genuinely powerful.<\/p>\n\n\n\n<p><strong>Stateless operations<\/strong> don&#8217;t need to remember anything between records. Filter, map, flatMap \u2014 each record is processed independently. These are fast and simple.<\/p>\n\n\n\n<p><strong>Stateful operations<\/strong> keep track of information over time. Counting words, calculating running averages, joining streams \u2014 all of this requires state. 
By default, Kafka Streams keeps this state in <strong>RocksDB<\/strong>, an embedded key-value store that runs inside your application process.<\/p>\n\n\n\n<p>\u26a0\ufe0f <strong>Warning:<\/strong> If you&#8217;re running stateful operations, make sure your state store is backed up by a changelog topic. Kafka Streams does this by default, so you only need to check that you haven&#8217;t turned it off on the store (for example, with withLoggingDisabled()).<\/p>\n\n\n\n<p>When we ran a stateful aggregation pipeline for a real-time dashboard at a mid-sized SaaS product (tracking 200K events\/hour), switching from in-memory state to RocksDB-backed state stores cut memory pressure by 60% and made the app survive restarts without losing any counts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Windowed Aggregations \u2014 Processing Data Over Time<\/strong><\/h2>\n\n\n\n<p>Sometimes you don&#8217;t want an all-time count. You want to know how many orders came in during the last 10 minutes. That&#8217;s where <strong>windowing<\/strong> comes in.<\/p>\n\n\n\n<p>Kafka Streams supports four window types: tumbling, hopping, sliding, and session. These three cover most first projects:<\/p>\n\n\n\n<p><strong>Tumbling Windows<\/strong> \u2014 fixed-size, non-overlapping. Every 10 minutes, you get a fresh window.<\/p>\n\n\n\n<p><strong>Hopping Windows<\/strong> \u2014 fixed-size but overlapping. A 10-minute window that advances every 2 minutes.<\/p>\n\n\n\n<p><strong>Session Windows<\/strong> \u2014 activity-based. The window stays open as long as events keep coming in within a gap period. 
Great for user session tracking.<\/p>\n\n\n\n<p>Here&#8217;s a 10-minute tumbling window applied to the word count stream:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>KTable&lt;Windowed&lt;String&gt;, Long&gt; windowedCounts = textLines\n\n&nbsp;&nbsp;&nbsp;&nbsp;.flatMapValues(value -&gt; Arrays.asList(value.toLowerCase().split(\"\\\\W+\")))\n\n&nbsp;&nbsp;&nbsp;&nbsp;.groupBy((key, word) -&gt; word)\n\n&nbsp;&nbsp;&nbsp;&nbsp;.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(10)))\n\n&nbsp;&nbsp;&nbsp;&nbsp;.count();<\/code><\/pre>\n\n\n\n<p>\u2705 <strong>Best Practice:<\/strong> Always decide on your grace period \u2014 the time Kafka Streams waits for late-arriving events before closing a window. Skipping it can cause you to miss data that arrives slightly out of order. The example above uses ofSizeWithNoGrace, which drops any record that arrives after its window has ended; use TimeWindows.ofSizeAndGrace(size, grace) when you need to accept late records.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Joining Streams and Tables<\/strong><\/h2>\n\n\n\n<p>Real-world apps rarely process a single stream in isolation. You often need to join two streams or enrich a stream with reference data from a table.<\/p>\n\n\n\n<p>These are the three stream-side joins you&#8217;ll meet first (KTable-KTable joins also exist):<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Join Type<\/strong><\/td><td><strong>Left<\/strong><\/td><td><strong>Right<\/strong><\/td><td><strong>Use Case<\/strong><\/td><\/tr><tr><td>KStream-KStream<\/td><td>Stream<\/td><td>Stream<\/td><td>Correlate two event streams within a time window<\/td><\/tr><tr><td>KStream-KTable<\/td><td>Stream<\/td><td>Table<\/td><td>Enrich events with latest state (e.g., add user profile to order event)<\/td><\/tr><tr><td>KStream-GlobalKTable<\/td><td>Stream<\/td><td>GlobalKTable<\/td><td>Look up static reference data (e.g., country codes, product catalog)<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\"><strong>Joining Streams and Tables<\/strong><\/figcaption><\/figure>\n\n\n\n<p>The KStream-KTable join is the one you&#8217;ll use most often. 
It&#8217;s how you take a raw event and enrich it with context.<\/p>\n\n\n\n<p>If you want a structured, mentor-supported path to learning these tools, HCL GUVI\u2019s IIT-M Pravartak Certified <a href=\"https:\/\/www.guvi.in\/zen-class\/full-stack-development-course\/?utm_source=blog&amp;utm_medium=hyperlink+&amp;utm_campaign=kafka-streams-tutorial\" target=\"_blank\" rel=\"noreferrer noopener\">Full Stack Developer Course<\/a> with AI Integration covers the entire journey, from HTML to deployment, with real projects, live sessions, and placement support. Over 10,000 students have used it to break into product-based companies.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What to Do Next<\/strong><\/h2>\n\n\n\n<p>Now that you understand the fundamentals, here&#8217;s a clear path forward:<\/p>\n\n\n\n<ol>\n<li><strong>Set up a local Kafka cluster<\/strong> using Docker and docker-compose (Confluent&#8217;s quickstart is the fastest way)<\/li>\n\n\n\n<li><strong>Run the word count app<\/strong> from Step 3 above \u2014 get it working end to end<\/li>\n\n\n\n<li><strong>Experiment with windowing<\/strong> \u2014 try both tumbling and session windows on the same dataset<\/li>\n\n\n\n<li><strong>Try a KStream-KTable join<\/strong> \u2014 create a user events stream and enrich it with a user profile KTable<\/li>\n\n\n\n<li><strong>Explore the Kafka Streams Interactive Queries API<\/strong> \u2014 query your local state stores directly and expose the results through your own REST endpoint<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n\n\n\n<ul>\n<li>Kafka Streams is a Java library \u2014 no separate cluster, no extra infrastructure<\/li>\n\n\n\n<li>A <strong>topology<\/strong> defines how your data flows: source \u2192 processors \u2192 sink<\/li>\n\n\n\n<li>Use <strong>KStream<\/strong> for event-by-event processing, <strong>KTable<\/strong> for latest-state views<\/li>\n\n\n\n<li><strong>SerDes<\/strong> handle byte-to-object conversion \u2014 
configure them carefully<\/li>\n\n\n\n<li>Stateful operations use <strong>RocksDB<\/strong> under the hood, making them fast and reliable<\/li>\n\n\n\n<li><strong>Windowed aggregations<\/strong> let you process data over time intervals, not just in total<\/li>\n\n\n\n<li>You can join streams with other streams or tables to enrich your data in real time<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>FAQs<\/strong><\/h2>\n\n\n<div id=\"rank-math-faq\" class=\"rank-math-block\">\n<div class=\"rank-math-list \">\n<div id=\"faq-question-1782882165448\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>What is Kafka Streams used for?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Kafka Streams is used for processing real-time data as it flows through Kafka topics. Common use cases include filtering events, counting occurrences, joining data streams, detecting patterns, and building real-time dashboards or alerts.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782882167783\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>Do I need a separate cluster to run Kafka Streams?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>No. Kafka Streams runs inside your Java application as a library. You only need a running Kafka broker \u2014 the stream processing happens in your own app process.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782882172563\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>What&#8217;s the difference between KStream and KTable?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>A KStream represents every event as a separate record \u2014 like a running log. A KTable represents only the latest value for each key \u2014 like a snapshot of current state. 
Use KStream for transactions, KTable for user profiles or account balances.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782882177816\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>Can Kafka Streams handle late-arriving data?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Yes. Kafka Streams supports grace periods on windowed operations, which tell the system how long to wait for late events before finalizing a window&#8217;s result. You configure this when defining your time window.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782882183467\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>Is Kafka Streams production-ready?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Absolutely. Companies like LinkedIn, Uber, Nubank, and Zalando run Kafka Streams in production at scale, processing billions of events daily. It has been production-ready since Kafka 0.10 and is actively maintained as of 2026.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782882188475\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>What Java version do I need for Kafka Streams?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Kafka Streams 3.x runs on Java 8 or higher, although Java 8 support is deprecated in 3.x. Kafka Streams 4.x requires Java 11 or higher. Java 17 is the recommended version for new projects as of 2026.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782882195245\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>How is Kafka Streams different from Apache Flink?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Kafka Streams is a client library you embed in your app. Apache Flink is a standalone distributed processing framework with its own cluster and job management system. 
Kafka Streams is simpler to get started with; Flink offers more advanced processing capabilities for very complex pipelines.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1782882203392\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \"><strong>Can I use Kafka Streams with Spring Boot?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Yes. Spring for Apache Kafka provides first-class integration with Kafka Streams. You can use the @EnableKafkaStreams annotation and configure your topology as a Spring bean.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>TL;DR Summary&nbsp; Kafka Streams is a client-side Java library from Apache Kafka that processes real-time data as it flows through Kafka topics. You write standard Java code, and it handles the complexity of distributed stream processing for you. It&#8217;s the go-to tool when you need to filter, transform, aggregate, or join streaming data without spinning [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":120284,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[720,294],"tags":[],"views":"38","authorinfo":{"name":"Lukesh 
S","url":"https:\/\/www.guvi.in\/blog\/author\/lukesh\/"},"thumbnailURL":"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/07\/Kafka-Streams-300x116.webp","_links":{"self":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/119835"}],"collection":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/comments?post=119835"}],"version-history":[{"count":4,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/119835\/revisions"}],"predecessor-version":[{"id":120287,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/119835\/revisions\/120287"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/media\/120284"}],"wp:attachment":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/media?parent=119835"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/categories?post=119835"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/tags?post=119835"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}