Persistent Structural Memory for AI: The Architecture Behind Infigraph

June 28, 2026 | July 12, 2026 | Sandeep Mewara

In my previous article, I wrote about what I called Code Blindness – the hidden operational cost of forcing AI assistants to repeatedly rediscover the structure and architectural relationships that already exist inside our codebases.

https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-blog-banner.png

Today’s coding assistants can inspect local files, trace explicit imports and painstakingly piece together relationships to answer familiar engineering questions:

– Who calls this function?
– What breaks if I alter this API route?
– Which services depend on this component?
– What is the true blast radius of this change?

These aren’t difficult questions because the code is hard to read. They’re difficult because the relationships that answer them aren’t explicitly available. Every new AI session reconstructs them from source code, only to discard that understanding when the conversation ends.

That observation eventually led us to build Infigraph – an attempt to turn software structure into reusable, local infrastructure.

When we recently open-sourced the project, one question came up repeatedly:

“What makes Infigraph different from the other code intelligence and code graph projects already out there?”

It’s a fair question.

The ecosystem already has tools for code search, static analysis, architecture visualization and AI-assisted development. Some focus on helping engineers navigate code. Others generate knowledge graphs for LLMs, visualize architecture or build richer retrieval pipelines.

We weren’t trying to build another code intelligence tool.

We were trying to build a local-first, persistent structural memory layer that AI assistants could query directly instead of repeatedly reconstructing software relationships from source code.

That objective influenced nearly every architectural decision we made – from how code is parsed, to how relationships are extracted and stored, to how AI agents retrieve information.

Looking back, those decisions weren’t independent optimizations. They were consequences of a single design principle:

If software structure changes far more slowly than AI conversations, then structural knowledge should be treated as infrastructure and not something rebuilt from scratch for every prompt.

This article walks through the engineering decisions that followed from that principle, the tradeoffs we accepted and the lessons we learned while building Infigraph.

The System Blueprint

Before discussing the individual architectural decisions, it’s worth understanding how the pieces fit together. At a high level, Infigraph continuously transforms a codebase into a persistent structural representation that AI assistants can query directly. Instead of rediscovering relationships for every conversation, those relationships become shared infrastructure.

https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-architecture-overview.png

The graph doesn’t replace the language model. It changes the question the language model has to answer. Rather than treating every prompt as an isolated reasoning exercise, Infigraph treats structural understanding as persistent infrastructure.

The important observation isn’t the individual technologies. It’s where the computational effort moves.

Traditional AI workflows spend most of their effort reconstructing architecture from source code every time a question is asked. Infigraph moves that work to indexing time. Parsing source code, resolving symbols, understanding imports and discovering relationships happen once – when the repository is indexed. Every subsequent question becomes a retrieval problem instead of a reconstruction problem.
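The shift from reconstruction to retrieval can be illustrated with a minimal sketch. It assumes SQLite from Python's standard library as a stand-in store (Infigraph itself uses an embedded graph database), and the table layout and function names are hypothetical:

```python
import sqlite3

def index_repository(conn, call_edges):
    """Indexing time: persist (caller, callee) relationships once."""
    conn.execute("CREATE TABLE IF NOT EXISTS calls (caller TEXT, callee TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_callee ON calls (callee)")
    conn.executemany("INSERT INTO calls VALUES (?, ?)", call_edges)
    conn.commit()

def who_calls(conn, symbol):
    """Query time: a lookup against stored edges, with no source parsing."""
    rows = conn.execute(
        "SELECT caller FROM calls WHERE callee = ? ORDER BY caller", (symbol,)
    )
    return [caller for (caller,) in rows]

# ":memory:" keeps the example self-contained; a file path would let the
# index survive across sessions.
conn = sqlite3.connect(":memory:")
index_repository(conn, [
    ("login_handler", "validate_user"),
    ("signup_handler", "validate_user"),
    ("validate_user", "hash_password"),
])
print(who_calls(conn, "validate_user"))  # ['login_handler', 'signup_handler']
```

The expensive step (discovering the edges) runs once in index_repository; every later question is an indexed lookup.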

https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-workflow.png

That architectural shift immediately imposed a new set of engineering requirements. We needed:

A storage engine optimized for relationship traversal rather than document retrieval
A retrieval layer that could combine graph queries with traditional search
A parsing architecture capable of understanding modern polyglot codebases without becoming language-specific

The next three sections explain how those requirements shaped Infigraph’s architecture.

Decision #1: Represent Code as a Persistent Graph

The first architectural decision was how software itself should be represented.

A software system isn’t just a collection of source files. It’s a network of explicit relationships. A function call is an explicit relationship. An import statement expresses a dependency. A class hierarchy defines inheritance. Module boundaries already exist whether an AI model discovers them or not. We needed a statically discoverable representation where relationships were first-class citizens.

That naturally led us to a graph.

Instead of storing source files as isolated text, Infigraph persists a connected topology of software entities and the relationships between them.

https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-graph-setup.png

Once relationships become explicit, architectural questions stop being text-search problems. Familiar engineering questions become graph traversals. The system isn’t reading raw source code inside an LLM reasoning loop to find callers. It is traversing an index that already knows they exist.
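As a rough illustration of why these questions become traversals, here is a small in-memory sketch. The CallGraph class and the symbol names are hypothetical and are not Infigraph's data model; the point is only that "who calls this?" is a reverse-edge lookup and "blast radius" is a breadth-first walk over those same edges:

```python
from collections import defaultdict, deque

class CallGraph:
    def __init__(self):
        self.callers = defaultdict(set)  # callee -> set of direct callers

    def add_call(self, caller, callee):
        self.callers[callee].add(caller)

    def who_calls(self, symbol):
        """Direct callers: a single reverse-edge lookup."""
        return sorted(self.callers.get(symbol, ()))

    def blast_radius(self, symbol):
        """Every symbol that transitively reaches `symbol` (BFS on reverse edges)."""
        seen, queue = set(), deque([symbol])
        while queue:
            current = queue.popleft()
            for caller in self.callers.get(current, ()):
                if caller not in seen:
                    seen.add(caller)
                    queue.append(caller)
        return sorted(seen)

graph = CallGraph()
graph.add_call("api_route", "login_handler")
graph.add_call("login_handler", "validate_user")
graph.add_call("validate_user", "hash_password")

print(graph.who_calls("hash_password"))     # ['validate_user']
print(graph.blast_radius("hash_password"))  # ['api_route', 'login_handler', 'validate_user']
```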

Once we committed to representing software as a graph, the next question became much more practical:

What kind of graph engine could support interactive AI workflows without becoming another server-side dependency?

That question shaped our next architectural decision.

Decision #2: Persist Structural Memory Locally

We could have stored the graph in a traditional client-server database. We could have relied on a managed graph service. Or we could have generated structural context on demand through cloud-hosted retrieval pipelines.

All of those approaches work.

But they conflicted with one of our architectural constraints from the very beginning:

Structural knowledge should live alongside the repository, not behind another network boundary.

That single constraint influenced far more than our storage engine. It shaped the entire architecture.

If AI assistants increasingly become part of the developer’s inner loop, structural knowledge should be available with the same characteristics developers already expect from their source code:

local
immediately accessible
private
independent of cloud connectivity

That immediately narrowed our design space. We needed a graph engine that was:

embedded rather than server-based
lightweight enough to ship with the developer environment
optimized for large relationship traversals
capable of answering structural queries within an interactive AI workflow

That led us to KuzuDB, an embedded, columnar graph database designed around analytical graph workloads rather than transactional business operations. Our workload wasn’t updating records; it was traversing relationships. A columnar storage engine aligns well with that access pattern because it can efficiently scan relationship data without repeatedly loading complete records.

This shift in layout is central to performance. The difference isn’t the graph model – it’s the storage layout:

https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-kuzu-db.png

Performance Benchmarking

When traversing deep, multi-hop dependency paths across half a million nodes, we rarely need to unpack full, heavy rows. We ran benchmarks on representative repositories and consistently observed substantially lower traversal latency for deep dependency walks.

The important result wasn’t the absolute latency. It was that the storage layout aligned far better with the traversal-heavy workload of AI-assisted development.

https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-perf-kuzuDB.png

Like every architectural decision, it came with tradeoffs. KuzuDB is a younger ecosystem than some of the established graph platforms. We consciously traded ecosystem maturity for an embedded architecture that better matched the interaction model we were trying to enable. Looking back, that tradeoff shaped much more than storage. Once structural memory became local and inexpensive to traverse, the next challenge was no longer storage.

Decision #3: Retrieve Structural Context Before Reasoning

Persisting structural knowledge solved only half the problem. The remaining challenge was retrieving the right structural context quickly enough that an AI assistant never needed to fall back to reading large portions of the repository.

At first, it seemed tempting to rely on a single retrieval strategy. Keyword search is excellent when an engineer already knows the exact symbol they’re looking for. Semantic search is better when they describe an idea rather than an identifier. Graph traversal is indispensable when the question is fundamentally about relationships. But none of these approaches is sufficient on its own.

Different questions require different retrieval strategies.

Instead of trying to force every question through a single search engine, Infigraph combines multiple retrieval mechanisms, each optimized for a different type of query.

https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-hybrid-retrieval.png

We built a local-first, parallel hybrid retrieval pipeline where each engine contributes a different signal:

BM25 (Exact Retrieval): Fast, deterministic lookup for symbols, filenames, identifiers and keywords
Semantic Retrieval (Model2Vec): A bundled 29 MB embedding model retrieves conceptually similar code without relying on external embedding APIs
Regex Retrieval: Captures explicit syntactic conventions, decorators, annotations and language-specific patterns that keyword and semantic search may overlook

Once these candidate starting points are identified, Graph Traversal takes over. The retrieval layer expands those candidate matches into architectural context.
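The flow above (several retrievers producing candidates, then graph expansion) can be sketched as follows. This article does not say how the signals are merged, so reciprocal rank fusion is used here purely as an assumed stand-in, and the hit lists, function names and neighbor map are hypothetical:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked candidate lists; symbols ranked high by several engines win."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, symbol in enumerate(ranking, start=1):
            scores[symbol] += 1.0 / (k + rank)
    # Sort by score (descending), then by name for a deterministic order.
    return sorted(scores, key=lambda s: (-scores[s], s))

def expand_with_graph(seeds, neighbors, limit=2):
    """Expand the top seeds into structural context via one-hop graph neighbors."""
    context = []
    for seed in seeds[:limit]:
        for symbol in [seed, *neighbors.get(seed, [])]:
            if symbol not in context:
                context.append(symbol)
    return context

# Hypothetical results from the three engines for one query.
bm25_hits = ["validate_user", "validate_token"]
semantic_hits = ["check_credentials", "validate_user"]
regex_hits = ["validate_user"]

seeds = reciprocal_rank_fusion([bm25_hits, semantic_hits, regex_hits])
neighbors = {"validate_user": ["login_handler", "hash_password"]}

print(seeds)  # ['validate_user', 'check_credentials', 'validate_token']
print(expand_with_graph(seeds, neighbors))
# ['validate_user', 'login_handler', 'hash_password', 'check_credentials']
```

Each engine only needs to return an ordered list of candidates, which keeps the engines independent of one another and of the graph expansion step.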

If retrieval is part of the developer’s inner loop, it should remain just as local and self-contained as the graph itself. That led us to build the entire retrieval pipeline, including keyword indexes and semantic embeddings, to execute locally without depending on external services.

The goal wasn’t simply lower latency. It was to ensure that structural understanding remained available regardless of network connectivity, while keeping source code inside the developer’s environment. The retrieval layer shouldn’t decide what the model thinks. It should decide what the model needs to think about.

Making Structural Memory Consumable

Building a retrieval pipeline solves only part of the problem. The other half is exposing that structural knowledge in a way AI assistants can consume naturally. Rather than embedding graph traversal logic into individual coding assistants, Infigraph exposes focused capabilities – symbol lookup, dependency traversal, call graph exploration and structural search – through the Model Context Protocol (MCP).

That separation was intentional.

The graph remains the system of record. MCP becomes the interface through which AI assistants access that knowledge. Whether the client is Claude Code, Cursor, GitHub Copilot, Windsurf or another MCP-compatible tool, they all interact with the same persistent structural memory instead of rebuilding it independently.
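To make the separation concrete, here is a simplified sketch of a tool call in the JSON-RPC shape that MCP uses (a `tools/call` request carrying a tool name and arguments, answered with text content). It does not use an MCP SDK, the tool name `find_callers` is hypothetical rather than one of Infigraph's actual tools, and the dictionary stands in for the real graph:

```python
import json

# Stand-in for the persistent graph; a real server would query the graph store.
CALLERS = {"validate_user": ["login_handler", "signup_handler"]}

def find_callers(symbol):
    return CALLERS.get(symbol, [])

TOOLS = {"find_callers": find_callers}  # hypothetical tool registry

def handle_request(raw):
    """Dispatch one JSON-RPC request to a registered tool."""
    request = json.loads(raw)
    request_id = request.get("id")
    if request.get("method") != "tools/call":
        return {"jsonrpc": "2.0", "id": request_id,
                "error": {"code": -32601, "message": "Method not found"}}
    params = request.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        return {"jsonrpc": "2.0", "id": request_id,
                "error": {"code": -32602, "message": "Unknown tool"}}
    result = tool(**params.get("arguments", {}))
    return {"jsonrpc": "2.0", "id": request_id,
            "result": {"content": [{"type": "text", "text": json.dumps(result)}]}}

request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "find_callers", "arguments": {"symbol": "validate_user"}},
})
response = handle_request(request)
print(response["result"]["content"][0]["text"])  # ["login_handler", "signup_handler"]
```

Because every client sends the same kind of request, the assistant on the other end does not need to know how the graph is stored or traversed.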

This reinforces the same architectural principle that shaped the rest of Infigraph:

Structural knowledge should be shared infrastructure. MCP simply makes that infrastructure accessible.

The final challenge was making that extraction scale across the reality of modern polyglot systems.

Decision #4: Decouple Structural Extraction from Language

Very few systems live entirely within a single language. A typical request may begin in a TypeScript frontend, flow through a Java service, invoke a Python-based machine learning component and finally interact with SQL or infrastructure configuration. Supporting that reality required more than adding parsers. It required separating the extraction engine from language-specific syntax.

That became our final architectural decision:

The extraction pipeline should remain stable as language support grows.

Instead of writing language-specific logic inside the core engine, Infigraph separates parsing from extraction.

https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-multi-lang.png

To support both mainstream languages and enterprise-specific grammars, we built a dual-extraction architecture:

For mainstream languages, we rely on Tree-sitter grammars and declarative queries to identify structural entities such as symbols, imports, calls and inheritance.
For proprietary languages, internal DSLs or environments where Tree-sitter isn’t the right fit, Infigraph provides an ANTLR-based extension path. New grammars can be added without modifying the extraction engine itself.

That separation turned out to be more valuable than we initially expected.

Once parsing produces a common structural representation, everything else in the architecture remains unchanged. Every additional language increases the capability of the platform without increasing the complexity of its core.
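The idea of language-specific parsing feeding a common structural representation can be sketched like this. Python's built-in `ast` module stands in for Tree-sitter or ANTLR here, and the Relation record, the registry and the function names are assumptions made for illustration, not Infigraph's actual interfaces:

```python
import ast
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """Language-neutral record that the rest of the pipeline consumes."""
    source: str
    kind: str
    target: str

EXTRACTORS = {}  # file extension -> extractor function

def register(extension):
    """Adding a language means registering an extractor; the core stays unchanged."""
    def decorator(func):
        EXTRACTORS[extension] = func
        return func
    return decorator

@register(".py")
def extract_python(code):
    relations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.FunctionDef):
            # Record calls to plain names inside each function body.
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    relations.append(Relation(node.name, "calls", child.func.id))
    return relations

def extract(filename, code):
    """Core engine: pick an extractor by extension and return common records."""
    for extension, extractor in EXTRACTORS.items():
        if filename.endswith(extension):
            return extractor(code)
    return []  # unsupported language: nothing extracted

source = "def login(user):\n    return validate_user(user)\n"
print(extract("auth.py", source))
# [Relation(source='login', kind='calls', target='validate_user')]
```

Everything downstream of extract() only ever sees Relation records, which is what keeps the core independent of any one grammar.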

Today, that approach allows Infigraph to support 62 languages out of the box while remaining extensible for environments that need more. Persistent structural memory shouldn’t become more complicated every time your software ecosystem grows. By separating extraction from language, we made language diversity an extension point instead of an architectural constraint.

The Landscape: Where Infigraph Fits

Most code intelligence platforms are ultimately designed around one of two consumers:

Humans, who need to search, visualize, analyze or understand software systems.
Analysis engines, which evaluate code for correctness, security, compliance or quality.

Our primary consumer is different. It’s an AI assistant operating inside a developer’s editing loop.

Projects such as SciTools Understand, Sourcegraph, Joern and newer AI-native graph initiatives have each pushed the ecosystem forward in different ways. Many engineers already rely on them successfully.

Our goal wasn’t to replace those tools. It was to optimize for a different execution model.

The architectural differences become clearer when viewed through the problems each category was designed to solve. The differences aren’t primarily about features. They’re about architectural optimization. Each category solves a different problem and therefore makes different tradeoffs.

Dimension	Human-Centric Platforms	AI Knowledge Builders	Infigraph
Typical Examples	SciTools Understand, Sourcegraph, Joern	Understand-Anything, Graphiti, Nomik and similar projects	Infigraph
Primary Consumer	Engineers & Architects	AI knowledge generation workflows	AI coding assistants
Structural Extraction	Parser / index-based	Often combines parsing with LLM summarization	Deterministic parser-based extraction
Deployment Model	Desktop or centralized infrastructure	Frequently cloud-assisted	Local-first embedded infrastructure
Primary Interaction	Search, navigation, visualization	Repository understanding and documentation	Real-time MCP tool calls
Optimization Target	Human understanding	AI-generated repository knowledge	Persistent structural memory for AI

These categories aren’t mutually exclusive. In many organizations they complement one another. The difference lies in which problem each one is optimized to solve. This distinction matters because our optimization target was fundamentally different.

We weren’t building another interface for engineers to explore repositories, or another cloud pipeline that asks an external LLM to understand a repository before a developer can ask a question.

We were trying to answer a much narrower architectural question:

How do we make structural knowledge continuously available to AI assistants without paying to rediscover it every conversation?

That single question explains almost every architectural decision described in this article.

Represent software as a persistent graph
Persist structural memory locally
Retrieve structural context instead of raw files
Expose that knowledge through MCP
Keep extraction extensible across languages

Everything else follows from that design center.

Choose Infigraph when…

Your primary development workflow revolves around AI coding assistants such as Claude Code, Cursor or GitHub Copilot
Your agents repeatedly ask structural questions about callers, dependencies, ownership or impact analysis
You want local-first execution without repeatedly sending repository context to external services
You want persistent structural context that survives beyond individual AI conversations

Continue using existing tools when…

Your primary need is enterprise-scale code search
You’re performing security or compliance analysis
You need architecture visualization or reverse engineering for human exploration

Instead of asking the AI to reconstruct relationships every session, Infigraph provides them as persistent structural memory that can be queried locally in milliseconds.

Our goal isn’t to replace the existing code intelligence ecosystem. It’s to become the lightweight, local-first structural memory layer that complements it for AI-native software development.

Looking Ahead

I don’t think Infigraph is the final answer to AI-native software development. In fact, I suspect we’re only beginning to define what this architecture layer should become.

Today, persistent structural memory captures relationships between software entities. Tomorrow, it may also incorporate architectural evolution, ownership boundaries, runtime behavior, operational telemetry, organizational knowledge and historical change patterns.

The better AI becomes at generating code, the more important these structural layers become. Generated code is only valuable if it fits coherently inside the system around it. I believe our responsibility is gradually shifting toward building better representations of the systems AI increasingly helps us evolve.

That’s ultimately why we open-sourced Infigraph.

Not because we think we’ve solved the problem, but because we believe persistent structural memory is an architectural direction worth exploring together.

If this way of thinking resonates with you, I’d encourage you to try Infigraph against your own repositories, challenge the assumptions we’ve made and contribute where you think the architecture can be improved.

We’re still learning.

Hopefully, we’ll learn together.


https://learnbyinsight.com/wp-content/uploads/2026/07/infigraph-vertical-light.png

GitHub : https://github.com/intuit/infigraph
Documentation : Detailed design specs and contribution guidelines are included in the repo.

The Hidden Cost of Code Blindness in the Age of AI

June 14, 2026 | Sandeep Mewara

Last month, I was looking over the shoulder of one of our engineers as they worked with an AI coding assistant. They asked a question that should have been entirely straightforward: “Who calls the validate_user function in our codebase?” The answer eventually came back. But watching them get there required a familiar and surprisingly expensive loop: reading multiple files, tracing imports, reconstructing call paths and inferring relationships that already existed inside the system.

As we stood there brainstorming around their screen, a realization struck us. What broke the workflow wasn’t the token count. It was the repetition. If anyone on the team opened a new session tomorrow and asked the exact same question, the model would perform much of the same work all over again. The relationship hadn’t changed. The code hadn’t changed. Only the cost and our collective time had.

The Cost of Rediscovery

That moment exposed a fundamental flaw in how we approach AI-assisted development. The problem isn’t that AI is inherently expensive. The problem is that AI keeps paying a premium to repeatedly rediscover the same foundational knowledge. What looks like a token limitation is actually a structural understanding problem. And that’s ultimately why our engineering team set out to build Infigraph.

AI Has Context. It Doesn’t Have Structure.

The last few years have been dominated by a single, brute-force idea: give AI more context. Bigger context windows, more capable models, better reasoning and smarter agents. All of those advances matter. But many of the questions engineers ask every day aren’t really code-understanding questions. They are system-understanding questions.

Engineers ask questions like: Who calls this function? What breaks if I change this API? Which services depend on this component? What is the blast radius of this change?

These are not primarily language problems. They are relationship problems. They are graph problems. A model can read raw text files incredibly well, but what it lacks is a persistent understanding of the architecture that connects those files together. Software systems are not just collections of files, they are collections of relationships. The industry has spent years teaching machines how to read code, but we are only beginning to teach them how to understand systems.

The Economics of Reconstructing Knowledge

Every engineering organization already possesses a vast amount of implicit structural knowledge. The system already knows which modules depend on each other, which symbols are reachable, which services communicate and which changes create downstream impact. Yet, most AI workflows require that knowledge to be rediscovered from first principles, repeatedly.

When you ask who calls validate_user, the model reads files and reconstructs relationships. Open a new session tomorrow, ask the same question and the model performs much of the same work again. The relationship didn’t change, but the cost did.

We don’t rebuild database schemas every time a SQL query executes and we don’t rebuild search indexes every time a user types a keyword. We persist structure because persistence is more efficient than rediscovery. Software systems deserve the same treatment:

Persist the knowledge once. Query it many times.

The Shift I Think We’re Entering

I don’t pretend to have all the answers for how AI and complex architectures will evolve together. But as an architect looking at how our workflows are changing, I know where the responsibility is moving. Historically, our primary effort as developers was spent translating intent into syntax. Increasingly, AI handles that translation smoothly. As that happens, the bottleneck shifts away from writing code and toward understanding architecture, change impact, dependency boundaries and system behavior.

The better AI becomes at generating code, the more critical structural understanding becomes. Generated code is only an asset if it fits correctly inside the system around it. Otherwise, it’s just technical debt written at supersonic speed. We would never build an application that rediscovered its data schema for every transaction, yet that is effectively how many AI-assisted workflows approach codebases today.

Why We Built Infigraph

As we discussed this pattern internally, a simple question emerged: If structural knowledge is repeatedly rediscovered, why aren’t we persisting it? Instead of parsing relationships from raw source files every time a question is asked, what if those relationships were represented directly? What if structural understanding became infrastructure?

That idea became Infigraph. Infigraph creates a persistent representation of codebase structure that AI agents can query directly. Rather than repeatedly reading files to discover relationships, agents can ask questions about relationships that already exist. The goal was never to replace AI reasoning; the goal was to make AI contextually aware of the broader systems it operates within.

Same Question. Same Codebase. Different Architecture.

Three principles shaped our approach:

Structure First: Code contains explicit relationships. Those relationships deserve first-class, deterministic representation.
Local First: Code intelligence should be private, fast, and fully available even when disconnected from the cloud.
Polyglot Reality: Real systems span many languages, frameworks, technologies, and internal platforms. Infigraph currently supports 63 languages out of the box because the tool should adapt to your system—not the other way around.

The Byproducts of Structural Awareness

Cost is simply the easiest metric to measure, but it isn’t the most important outcome. The more important outcome is quality. When structural relationships are treated as a foundational layer, the system answers questions with greater consistency and more complete coverage than transient inference from raw files can reliably provide.

A cheaper answer is useful, but a more complete answer is transformative. Architects care about correctness, engineering leaders care about confidence and developers care about understanding impact before making a change. Structural awareness improves all three.

When we stopped asking, “How do we slash our token bill?” and started asking, “Why are we repeatedly paying to rediscover the same relationships?” the economics fell into place naturally. Fewer files needed to be pulled into context, tool call chains became shorter, latency dropped and cost followed. Cost savings are not the primary innovation but they are a consequence of eliminating redundant engineering work.

Why We Open-Sourced It

We originally built Infigraph to solve systemic problems inside our own development workflows. But as more engineers and teams began using it, we realized that this challenge isn’t unique to us. The entire industry is moving aggressively toward AI-assisted development while software systems continue growing larger and more interconnected.

Those two trends collide around a simple question: How do we help machines understand software systems, not just individual files? We know the current trajectory: repeatedly paying to rediscover knowledge that already exists within our own codebases. That model isn’t sustainable. We believe the next step deserves community participation, scrutiny and collective engineering.

That’s why we released Infigraph as an open-source project under the Apache 2.0 license. Not because we think it’s finished, but because we believe this is a direction worth building together.

What’s Next

This article focused on the core problem. The next article, coming in the next week, will focus entirely on the engineering decisions behind our approach, from graph-based representations and retrieval strategies to the tradeoffs we encountered while building local-first code intelligence.

But you don’t have to wait for that deep dive to start exploring.

⭐ Star the repository on GitHub and follow the project.
👥 Assess, contribute and raise a PR.
🚀 Install it and try it now against your own codebase.

If you hit issues, open a GitHub issue. If you want to contribute, whether that’s a new language parser, search improvements or new MCP integrations, we’d love to collaborate.

Thanks for reading. And, a special thanks to the engineers on our team who transformed a whiteboard conversation into a tool we can now share with the broader community.


Kubernetes – Evolution of application deployment

August 9, 2020 | September 25, 2020 | Sandeep Mewara

Kubernetes (K8s) is emerging as the cutting edge of application deployment. It is becoming core to the creation and operation of modern software (some call it modern SaaS). So I decided to look into what Kubernetes is and what kind of application design helps in adopting it as application deployment evolves.

Kubernetes is a portable, extensible, open-source platform for automating deployment, scaling, and management of containerized applications.

History

Google originally designed the Kubernetes project and open-sourced it in 2014. Kubernetes builds on over 15 years of Google’s experience running production workloads at scale, combined with the best ideas and practices from the community. It is now maintained by the Cloud Native Computing Foundation. Its current development repository is here.

First challenge …

With modern goals such as recoverability, release cycle time and release frequency, applications need to be designed and deployed in a way that lets them improve year over year.

This leads to the first step: breaking the monolith into microservices so that changes and their impact are compartmentalized for easy deployment and recovery.

A monolithic application puts all its functionality in a single process. To scale, it replicates the entire monolith on multiple servers. A microservice architecture, on the other hand, puts each piece of functionality into a separate service. To scale, these services are distributed across servers as required.

Second challenge …

With multiple microservices in play, variance in stack versions and deployment styles becomes a problem. Each team has its own set of tools and versions to build artifacts, store them and then deploy them. As a result, different applications and services can have different patterns and network topologies. This in turn makes managing security and infrastructure more challenging.

This leads to the next step: abstracting the infrastructure away to ease maintenance and relieve teams of security and other infrastructure-related concerns.

Figure: Deployment scheme evolution

Traditional: Applications running on a physical server. No way to define resource boundaries for applications.
Virtualization: Allows multiple Virtual Machines (VMs) to run on a single physical server’s CPU. This leads to better utilization of resources and better scalability, as an application can be added or updated easily. Also, if needed, applications can be isolated in different VMs to provide a level of security.
Containers: Like a VM, a container has its own filesystem, CPU, memory, process space, etc. Containers are environment-consistent, easy to scale and portable across clouds and OS distributions. This leads to a loosely coupled setup where the application is fully decoupled from the infrastructure, which makes it easy to move toward smaller, modular microservices.

Containers take abstraction to the next level. It does not matter which OS you are on (although there can be different containers for different operating systems, and they work differently underneath); all we need is to package our code and the libraries it needs together, which then run inside a container based on the configured resource needs. Docker is an example of a container runtime and packaging tool.

Final challenge …

So, the packaging has been simplified and running the application on a single node has been simplified. When we move to enterprise, we need to scale up/down our containers on need basis automatically. Further, one would scale the application to be served from multiple servers instead of just one for better load distribution and easy recovery/fail safe. Now, while distributing the load, we would need to ensure the availability of nodes, resources like space on node for running a container, etc.

This is where Kubernetes pitch in. It acts as a container orchestrator that help provides with a framework to run distributed systems resiliently. It takes care of scaling and failover of containers having application, provides deployment patterns, and more.

Kubernetes has master-slave architecture where there is one master node and multiple worker nodes. A Pod is the smallest deployable unit in it. In order to run a single container, we would need to create a Pod for that container. A Pod can contain more than one container if those containers are relatively tightly coupled (like a container to download all secret configs related before application starts in other container).

The API Server is the heart of the architecture. Users interact with Kubernetes through it, and the master node communicates with the worker nodes through it. The number of containers requested is stored in etcd (a key-value store). The Controller acts as a manager that keeps a constant check on the store, queues the request for the Scheduler to pick up and execute, and spins up another worker node if needed.
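The "constant check on the store" can be pictured as a reconcile loop that compares the requested state with what is actually running. The following is only a toy Python sketch of that idea, not Kubernetes code: the dictionaries stand in for etcd and the worker nodes, and the name reconcile is hypothetical.

```python
def reconcile(desired, running):
    """Compare desired and running container counts per application.

    Returns {app: difference}: positive means start more, negative means stop some.
    """
    actions = {}
    for app in sorted(desired.keys() | running.keys()):
        diff = desired.get(app, 0) - running.get(app, 0)
        if diff != 0:
            actions[app] = diff
    return actions


# Stand-in for the requested state kept in the key-value store (assumption).
desired_state = {"web": 3, "worker": 2}
# Stand-in for what is actually running on the worker nodes (assumption).
running_state = {"web": 1, "worker": 2, "old-job": 1}

plan = reconcile(desired_state, running_state)
print(plan)  # {'old-job': -1, 'web': 2}
```

In this sketch the plan says: start two more "web" containers and stop the one "old-job" container that nobody requested. A real controller would repeat this comparison continuously.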

Wrap Up …

I have just scratched the surface of both containerization and Kubernetes. There is much more to both, and they can be explored in depth. Along with vast benefits, moving to the cloud can also bring new challenges to the table, such as security and networking.

It was good to know how application design and deployment are evolving, getting abstracted and loosely coupled.

Keep learning!

Reference: https://kubernetes.io/docs/home/

GitHub Readme Samples

Beginner’s Guide to understand Kafka

July 26, 2020 | September 25, 2020 | Sandeep Mewara

It’s a digital age. Wherever there is data, we hear about Kafka these days. One of the projects I work on involves an entire data system (Java backend) that leverages Kafka to handle tonnes of data flowing through various channels and departments. While working on it, I thought of exploring the setup on Windows. This guide therefore introduces Kafka and walks through setting up and testing a data pipeline on Windows.

Introduction

An OpenSource Project in Java & Scala

Apache Kafka is a distributed streaming platform with three key capabilities:

Messaging system – Publish-Subscribe to streams of records
Availability & Reliability – Store streams of records in a fault-tolerant, durable way
Scalable & Real time – Process streams of records as they occur

Data system components

Kafka is generally used to stream data into applications, data lakes and real-time stream analytics systems.

An application publishes messages to the Kafka server. These messages can carry any defined information that we plan to capture. They are passed reliably (thanks to Kafka’s distributed architecture) to another application or service, which can process or re-process them.

Internally, Kafka uses a data structure to manage its messages. These messages have a retention policy applied at a unit level of this data structure. Retention is configurable – time based or size based. By default, the data sent is stored for 168 hours (7 days).
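To make time-based and size-based retention concrete, here is a toy Python sketch. It is not Kafka's implementation: the Segment class, the apply_retention function and the sample numbers are all hypothetical, and the sketch assumes retention is applied to whole units (segments), oldest first.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000


@dataclass
class Segment:
    base_offset: int    # offset of the first message in this unit
    size_bytes: int
    last_write_ms: int  # time of the newest message in this unit


def apply_retention(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the segments that survive retention (input is ordered oldest first)."""
    kept = list(segments)
    if retention_ms is not None:
        # Time based: drop units whose newest message is older than the limit.
        kept = [s for s in kept if now_ms - s.last_write_ms <= retention_ms]
    if retention_bytes is not None:
        # Size based: drop the oldest units until the total fits.
        # Simplification: the newest unit is always kept.
        while len(kept) > 1 and sum(s.size_bytes for s in kept) > retention_bytes:
            kept.pop(0)
    return kept


segments = [
    Segment(base_offset=0, size_bytes=100, last_write_ms=0),
    Segment(base_offset=50, size_bytes=100, last_write_ms=100 * HOUR_MS),
    Segment(base_offset=100, size_bytes=100, last_write_ms=160 * HOUR_MS),
]
now = 200 * HOUR_MS

by_time = apply_retention(segments, now, retention_ms=168 * HOUR_MS)
by_size = apply_retention(segments, now, retention_bytes=150)
print([s.base_offset for s in by_time])  # [50, 100]
print([s.base_offset for s in by_size])  # [100]
```

With the 168-hour default, only the first unit (last written 200 hours ago) is old enough to be removed.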

Kafka Architecture

Typically, there are multiple producers, consumers and clusters exchanging messages. Horizontal scaling is done easily by adding more brokers. Diagram below depicts the sample architecture:

Kafka clients and servers communicate over the TCP protocol. For more details, refer: Kafka Protocol Guide

The Kafka ecosystem also provides a REST proxy that allows easy integration via HTTP and JSON.

Primarily it has four key APIs: Producer API, Consumer API, Streams API, Connector API

Key Components & related terminology

Messages/Records – byte arrays of an object. Consists of a key, value & timestamp
Topic – feeds of messages in categories
Producer – processes that publish messages to a Kafka topic
Consumer – processes that subscribe to topics and process the feed of published messages
Broker – It hosts topics. Also referred to as Kafka Server or Kafka Node
Cluster – comprises one or more brokers
Zookeeper – keeps the state of the cluster (brokers, topics, consumers)
Connector – connect topics to existing applications or data systems
Stream Processor – consumes an input stream from a topic and produces an output stream to an output topic
ISR (In-Sync Replica) – a replica of a partition that is caught up with the leader and can take over to support failover
Controller – broker in a cluster responsible for maintaining the leader/follower relationship for all the partitions

Zookeeper

Apache ZooKeeper is an open-source service that helps build distributed applications. It is a centralized service for maintaining configuration information. Its responsibilities include:

Broker state – maintains list of active brokers and which cluster they are part of
Topics configured – maintains list of all topics, number of partitions for each topic, location of all replicas, who is the preferred leader, list of ISR for partitions
Controller election – selects a new controller whenever a node shuts down. Also, makes sure that there is only one controller at any given time
ACL info – maintains Access control lists (ACLs) for all the topics

Kafka Internals

Brokers in a cluster are differentiated by an ID, which is typically a unique number. Connecting to one broker bootstraps a client to the entire Kafka cluster. Brokers receive messages from producers and allow consumers to fetch messages by topic, partition and offset.

A Topic is spread across a Kafka cluster as a logical group of one or more partitions. A partition is an ordered sequence of messages, and the partitions of a topic are distributed across multiple brokers. The number of partitions per topic is configurable at creation time.

Producers write to Topics. Consumers read from Topics.

Kafka uses a Log data structure to manage its messages. A Log is an ordered set of Segments, each of which is a collection of messages. Each segment has files that help locate a message:

Log file – stores message
Index file – stores message offset and its starting position in the log file
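The role of the index file can be illustrated with a small lookup. The Python sketch below is an assumption-laden toy, not Kafka's on-disk format: the index is a plain list of (offset, byte position) pairs, the sample values are made up, and the name locate is hypothetical.

```python
from bisect import bisect_right

# Toy index: (message offset, starting byte position in the log file).
# Sorted by offset; not every offset needs an entry in this sketch.
index = [(0, 0), (4, 512), (8, 1100)]


def locate(index, target_offset):
    """Return the closest index entry at or before target_offset.

    A reader would seek to that byte position in the log file and scan
    forward until it reaches the message it wants.
    """
    offsets = [offset for offset, _ in index]
    i = bisect_right(offsets, target_offset) - 1
    if i < 0:
        raise KeyError(f"offset {target_offset} is before the start of this segment")
    return index[i]


print(locate(index, 6))  # (4, 512)
print(locate(index, 8))  # (8, 1100)
```

The binary search is the point of the index: the reader jumps close to the message instead of scanning the log file from the beginning.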

Kafka appends records from a producer to the end of a topic log. Consumers can start reading from any offset they choose, but only committed records are visible to them. A record is considered committed only when all ISRs for the partition have written it to their logs.
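The commit rule can be expressed in a few lines. This Python sketch is a simplified model, not broker code: the broker names, the sample offsets and the function names are hypothetical, and each in-sync replica is assumed to report the last offset it has written.

```python
def last_committed_offset(last_written_by_isr):
    """Highest offset that every in-sync replica has written to its log."""
    return min(last_written_by_isr.values())


def readable(records, last_written_by_isr):
    """Records a consumer may see: everything up to the committed offset."""
    committed = last_committed_offset(last_written_by_isr)
    return records[: committed + 1]


records = ["m0", "m1", "m2", "m3"]  # list index doubles as the offset
isr_progress = {"broker-1": 3, "broker-2": 2, "broker-3": 3}

print(last_committed_offset(isr_progress))  # 2
print(readable(records, isr_progress))      # ['m0', 'm1', 'm2']
```

Here "m3" exists on two replicas but is not yet committed, because broker-2 has only written up to offset 2.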

Among the replicas of a partition, one is the leader and the rest are followers that serve as backups. If the leader fails, an ISR is chosen as the new leader. The leader performs all reads and writes for a particular topic partition. Followers passively replicate the leader. By default, consumers read only from the leader of a partition.

A leader and follower of a partition can never reside on the same node.
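Failover to an in-sync replica can be sketched as below. This is a toy Python model with hypothetical names; in particular, picking the first eligible broker in the replica list is an assumption made for illustration, not a statement of how Kafka orders its choice.

```python
def elect_leader(replicas, isr, failed_leader):
    """Pick a new leader for a partition from its in-sync replicas.

    replicas: broker IDs holding a copy of the partition, in assignment order
    isr: set of broker IDs that are currently in sync
    Returns the new leader's broker ID, or None if no in-sync replica is left.
    """
    for broker_id in replicas:
        if broker_id != failed_leader and broker_id in isr:
            return broker_id
    return None


replicas = [1, 2, 3]  # each copy lives on a different broker
in_sync = {1, 3}      # broker 2 has fallen behind

print(elect_leader(replicas, in_sync, failed_leader=1))  # 3
```

Broker 2 is skipped because it is not in sync, so broker 3 takes over. If no in-sync replica remains, the sketch returns None and the partition has no eligible leader.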

Kafka also supports log compaction for records. With it, Kafka will keep the latest version of a record and delete the older versions. This leads to a granular retention mechanism where the last update for each key is kept.
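Log compaction can be pictured as "keep only the last record per key". The Python sketch below is a minimal in-memory illustration with made-up data and a hypothetical compact function; it is not how Kafka's log cleaner is implemented.

```python
def compact(records):
    """Keep only the latest record for each key.

    records: list of (key, value) pairs; the list index is the offset.
    Returns (offset, key, value) tuples for the surviving records, in offset order.
    """
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)  # a later record replaces an earlier one
    return sorted((offset, key, value) for key, (offset, value) in latest.items())


log = [
    ("user-1", "email=a@example.com"),
    ("user-2", "email=b@example.com"),
    ("user-1", "email=c@example.com"),  # newer update for user-1
]

print(compact(log))
# [(1, 'user-2', 'email=b@example.com'), (2, 'user-1', 'email=c@example.com')]
```

The older value for user-1 at offset 0 is dropped, while the surviving records keep their original offsets.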

The offset manager is responsible for storing, fetching and maintaining consumer offsets. Every live broker has one instance of an offset manager. By default, a consumer is configured to commit offsets automatically at a periodic interval. Alternatively, a consumer can use the commit API for manual offset management.

Kafka uses a particular topic, __consumer_offsets, to save consumer offsets. These offsets record the read position of each consumer in each group. This helps a consumer resume from its last position when needed. By committing offsets to the broker, the consumer no longer depends on ZooKeeper.

Older versions of Kafka (pre 0.9) stored offsets only in ZooKeeper, while newer versions store offsets, by default, in the internal Kafka topic __consumer_offsets.
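The idea of committed offsets can be modelled as a small lookup table keyed by group, topic and partition. This Python sketch is an in-memory stand-in with hypothetical names (OffsetStore, commit, fetch, "billing"); it does not use the Kafka client API, and it assumes the committed value is the next offset to read.

```python
class OffsetStore:
    """Toy stand-in for the offsets kept in __consumer_offsets."""

    def __init__(self):
        self._offsets = {}

    def commit(self, group, topic, partition, next_offset):
        # Assumption: the stored value is the next offset the group should read.
        self._offsets[(group, topic, partition)] = next_offset

    def fetch(self, group, topic, partition, default=0):
        # A group with no committed offset starts from the default position.
        return self._offsets.get((group, topic, partition), default)


store = OffsetStore()
messages = ["m0", "m1", "m2", "m3"]

# First run: the consumer reads two messages, commits, then stops.
position = store.fetch("billing", "testkafka", 0)
first_batch = messages[position : position + 2]
store.commit("billing", "testkafka", 0, position + len(first_batch))

# After a restart: the consumer resumes where it left off.
resume_at = store.fetch("billing", "testkafka", 0)
print(first_batch, messages[resume_at:])  # ['m0', 'm1'] ['m2', 'm3']
```

Because the position lives in the store rather than in the consumer, a restarted consumer picks up at offset 2 instead of re-reading from the beginning.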

Kafka allows consumer groups to read data from a topic in parallel. All the consumers in a group have the same group ID. At any time, only one consumer from a group can consume messages from a given partition, which guarantees the order in which messages from that partition are read. A consumer can read from more than one partition.
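The "one consumer per partition within a group" rule can be sketched as an assignment function. This Python sketch uses a simple round-robin spread, chosen here for illustration; the function name and consumer names are hypothetical, and it is not Kafka's assignor code.

```python
def assign(partitions, consumers):
    """Give every partition to exactly one consumer in the group (round-robin)."""
    if not consumers:
        raise ValueError("a group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


print(assign([0, 1, 2], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1]}
print(assign([0], ["c1", "c2"]))
# {'c1': [0], 'c2': []}
```

The first call shows a consumer owning more than one partition. The second shows that with more consumers than partitions, the extra consumer in this sketch stays idle.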

Kafka Setup On Windows

Pre-Requisite

Java SE Runtime Environment
- System has: jre-8u261-windows-x64.exe
Kafka
- Sample app uses: Scala 2.12 – kafka_2.12-2.5.0.tgz
Any unzip tool to extract files out of *.tgz
- I used a Mac, which extracted it with a double click

Setup files

Install JRE – default settings should be fine
Un-tar Kafka files at C:\Installs (could be any location by choice). All the required script files for Kafka data pipeline setup will be located at: C:\Installs\kafka_2.12-2.5.0\bin\windows
Configuration changes as per Windows need
- Setup for Kafka logs – Create a folder ‘logs’ at location C:\Installs\kafka_2.12-2.5.0
- Set this logs folder location in Kafka config file: C:\Installs\kafka_2.12-2.5.0\config\server.properties as log.dirs=C:/Installs/kafka_2.12-2.5.0/logs (use forward slashes or doubled backslashes here, since a single backslash is treated as an escape character in .properties files)
- Setup for Zookeeper data – Create a folder ‘data’ at location C:\Installs\kafka_2.12-2.5.0
- Set this data folder location in Zookeeper config file: C:\Installs\kafka_2.12-2.5.0\config\zookeeper.properties as dataDir=C:/Installs/kafka_2.12-2.5.0/data (same note on slashes applies)

Execute

ZooKeeper – Get a quick-and-dirty single-node ZooKeeper instance using the convenience script already packaged along with Kafka files.
- Open a command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- Execute script: zookeeper-server-start.bat C:\Installs\kafka_2.12-2.5.0\config\zookeeper.properties
- ZooKeeper started at localhost:2181. Keep it running.
Kafka Server – Get a single-node Kafka instance.
- Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- ZooKeeper is already configured in the properties file as zookeeper.connect=localhost:2181
- Execute script: kafka-server-start.bat C:\Installs\kafka_2.12-2.5.0\config\server.properties
- Kafka server started at localhost:9092. Keep it running.
- Now, topics can be created and messages can be stored. We can produce and consume data from any client. We will use command prompt for now.
Topic – Create a topic named ‘testkafka’
- Use replication factor as 1 & partitions as 1 given we have made a single instance node
- Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- Execute script: kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic testkafka
- Execute script to see created topic: kafka-topics.bat --list --bootstrap-server localhost:9092
- Keep the command prompt open just in case.
Producer – setup to send messages to the server
- Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- Execute script: kafka-console-producer.bat --bootstrap-server localhost:9092 --topic testkafka
- It will show a ‘>’ as a prompt to type a message. Type: “Kafka demo – Message from server”
- Keep the command prompt open. We will come back to it to push more messages
Consumer – setup to receive messages from the server
- Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- Execute script: kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic testkafka --from-beginning
- You should see the message sent by the Producer in this command prompt window – “Kafka demo – Message from server”
- Go back to the Producer command prompt and type any other message to see it appear in real time in the Consumer command prompt
Check/Observe – a few key changes behind the scenes
- Files under topic created – they keep track of the messages pushed for a given topic
- Data inside the log file – All the messages that are pushed by producer are stored here
- Topics present in Kafka – once a consumer starts reading messages from topic, __consumer_offsets is automatically created as a topic

NOTE: If you want to create and list topics through ZooKeeper instead of the Kafka server, use the following commands (the --zookeeper option is deprecated and was removed from kafka-topics in Kafka 3.0):

Topic create: kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
Topics view: kafka-topics.bat --list --zookeeper localhost:2181

With the above, we are able to see messages sent by the Producer and received by the Consumer using a Kafka setup.

When I tried to set up Kafka, I faced a few issues along the way. I have documented them for reference. This should also help others who face something similar: Troubleshoot: Kafka setup on Windows.

You should not encounter any issues if you use the files shared below together with the steps and commands above.

Download entire modified setup files for Windows from here: https://github.com/sandeep-mewara/kafka-demo-windows

References:
https://kafka.apache.org
https://cwiki.apache.org/confluence/display/KAFKA
https://docs.confluent.io/2.0.0/clients/consumer.html
