Linear time partition – a three way split

Linear-time partition is a divide & conquer based selection algorithm. With it, data is split into three groups using a pivot.
.

linear-time-partioning

An integral part of Quick Sort algorithm which uses this partitioning logic recursively. All the elements smaller than the pivot are put on one side and all the larger ones on the other side of the pivot.

Similar to discussion of Dynamic Programming, this algorithm plays on solving sub-problems to solve complex problem.

Algorithm

Post selecting the pivot, Linear-time partition routine separates the data into three groups with values:

  • less than the pivot
  • equal to the pivot
  • greater than the pivot

Generally, this algorithm is done in place. This results in partially sorting the data. There are handful of problems that makes use of this fact, like:

  • Sort an array that contains only 0’s, 1’s & 2’s
  • Dutch national flag problem
  • Print all negative integers followed by positive for an array full of them
  • Print all 0’s first and then 1’s or vice-versa for an array with only 0’s & 1’s
  • Move all the 0’s to the end maintaining relative order of other elements for an array of integers

If done out of place, (i.e. not changing the original data), it would cost O(n) additional space

Example

Let’s take an example of: sort a array that contains only 0’s, 1’s & 2’s

First thought for such problem is to perform a count of 0’s, 1’s and 2’s. Once we have the counts, reset the array with them. Though it has time complexity O(n), it takes two traversal of the array or uses an extra array.

Below is an attempt to solve using Linear-time partition algorithm to avoid that extra traversal/space.

def threeWayPartition(A):
    start = mid = 0
    end = len(A)-1
    
    # define a Pivot
    pivot = 1
    
    while (mid <= end):
        # mid element is less than pivot
        # current element is 0
        
        # so lets move it to start
        # current start is good. 
        # move start to next element
        # move mid to next element to move forward
        if (A[mid] < pivot) :
            swap(A, start, mid)
            start = start + 1
            mid = mid + 1
            
        # mid element is more than pivot
        # current element is 2
        
        # so lets move it to end
        # current end is good. 
        # move end to previous element
        elif (A[mid] > pivot) :
            swap(A, mid, end)
            end = end - 1
        
        # mid element is same as pivot
        # current element is 1
        
        # just move forward: 
        # mid to next element
        else :
            mid = mid + 1
            
# Swap two elements A[i] and A[j] in the list
def swap(A, i, j):
    A[i], A[j] = A[j], A[i]


# Define an array
inputArray = [0, 1, 2, 2, 1, 0, 0, 2]

# Call the Linear-time partition routine
threeWayPartition(inputArray)

# print the final result
print(inputArray)

# Outputs
# [0, 0, 0, 1, 1, 2, 2, 2]

With a defined pivot, we segregated the data on the either side which resulted in desired output. Dutch nation flag problem or printing all negative first and then positive, or printing all 0s first follows the same code.

For moving all 0’s to end maintaining other elements order, we do a tweak in swap index to maintain order:

def threeWayPartition(A):
    current = 0
    nonzero = 0
    end = len(A)-1
    
    # define a Pivot
    pivot = 0
    
    while (current <= end):
        if (A[current] != pivot) :
            swap(A, current, nonzero)
            nonzero = nonzero + 1
        current = current + 1
            
# Swap two elements A[i] and A[j] in the list
def swap(A, i, j):
    A[i], A[j] = A[j], A[i]


# Define an array
inputArray = [7,0,5,1,2,0,2,0,6]

# Call the Linear-time partition routine
threeWayPartition(inputArray)

# print the final result
print(inputArray)

# Output
# [7, 5, 1, 2, 2, 6, 0, 0, 0]

Complexity

With above algorithm approach, we solved our problem with Time complexity O(n) & Space complexity O(1) (with single traversal of the array)


It was fun solving!

samples GitHub Profile Readme
Learn Python – Beginners step by step – Basics and Examples
Sandeep Mewara Github
Sandeep Mewara Learn By Insight
Matplotlib plot samples
Sandeep Mewara Github Repositories

Dynamic Programming – What, How & When

There are many programming techniques, which when applied appropriately to specific situation, leads to efficient results (either time wise or space wise or both). Dynamic Programming (DP) is one such optimization concept.

dp-header

Not to be confused with Dynamic programming language or Dynamic type in C#

What is?

Dynamic programming is solving a complicated problem by breaking it down into simpler sub-problems and make use of past solved sub-problems.

Dynamic Programming is mainly an optimization over plain recursion.

Let’s use Fibonacci series as an example to understand this in detail.

Fibonacci Series is a sequence, such that each number is the sum of the two preceding ones, starting from 0 and 1.

0, 1, 1, 2, 3, 5, 8, 13, 21…

When to?

It is best applied to problems that exhibits Overlapping Subproblems & Optimal Substructure.

Following is how our Fibonacci problem fib(4) breakdown would look like:

Problem tree for fib(4)

We can see that fib(0), fib(1) & fib(2) are occurring multiple times in the graph. These are overlapping subproblems.

Binary search tree would not fall into same category. There would no common subproblem by binary search definition itself.

We can see for optimal solution of fib(4), we would need optimal solution of fib(3) & fib(2) and so on. This tells it has optimal substructure.

Longest path problem does not exhibit as such. Optimal solution for longest distance between A & C might not be optimal for B & C that involves A.

How to?

It’s not an unknown or a new technique. Our brain do it almost all the time. Let’s use Fibonacci series as an example to understand this and then we will apply DP to solve it programmatically.

Logically …

Let’s say, I ask, what is the 4th number in the Fibonacci series? You would work from 0, 1, 1, etc and come to fib(4) = 2. Now, if I ask, what is the 5th number? Would you restart from 0, 1, 1..? Or you would just make use of fib(4) you just solved and get fib(5)?

You see – it’s obvious, you will make use of previous solved result and right away come up with fib(5) instead of starting from 0. In programming language, this is tabulation using calculated fib(4) & fib(3).

Tabulation is a technique of starting from smallest sub-problem and storing the results of entries until target value is reached.

Let’s say, now I ask to calculate fib(8). Our natural way of working on it would be to first find fib(7) & fib(6). In order to know fib(7), we will figure out fib(6) & fib(5) and hence forth. With fib(7), we already traversed through fib(6). This is another approach where we calculated the next subproblem and keep on storing them in case of need later. In programming language, this is memoization.

Memoization is a technique of storing the results of expensive function calls and returning the cached result when the same inputs occur again.

DP breaks the problem into sub-problems and uses memoization or tabulation to optimize. We will understand about them with examples below.

Programmatically in action …

In order to compare the optimization cost, we will use recursion as another way to solve the problem.

static void Main(string[] args)
{
    Stopwatch stopwatch = new Stopwatch();

    Console.WriteLine($"Fibonacci Recursive:");
    for (int i = 30; i <= 50; i+=5)
    { 
        stopwatch.Reset();
        stopwatch.Start();
        var _ = FibonacciRecursive(i);
        stopwatch.Stop();
        Console.WriteLine($"{i}th took time: ({stopwatch.Elapsed})");
    }

    Console.WriteLine($"Dynamic Programming:");
    for (int i = 30; i <= 50; i+=5)
    { 
        stopwatch.Reset();
        stopwatch.Start();
        var _ = FibonacciDPTabulation(i);
        stopwatch.Stop();
        Console.WriteLine($"{i}th took time: ({stopwatch.Elapsed})");
    }
}

static long FibonacciRecursive(int number)
{
    if (number <= 1)
        return number;
    return FibonacciRecursive(number-2) + FibonacciRecursive(number - 1);
}

static long FibonacciDPTabulation(int number)
{
    long[] arr = new long[100];
    arr[0] = 1; arr[1] = 1; 
    for( int x=2; x <= number; x++)
    {
        arr[x] = arr[x-1] + arr[x-2];
    }
    return arr[number];
}

With above code, we got the following output:

 RecursiveDP (Tabulation)DP (Memoization)
30th took time(00:00:00.0090536)(00:00:00.0002756)(00:00:00.0183122)
35th took time(00:00:00.0908688)(00:00:00.0000037)(00:00:00.0000009)
40th took time(00:00:00.9856354)(00:00:00.0000006)(00:00:00.0000011)
45th took time(00:00:10.7981258)(00:00:00.0000006)(00:00:00.0000008)
50th took time(00:02:24.8694889)(00:00:00.0000006)(00:00:00.0000008)

Difference is astonishing! DP is too fast in comparison. Just look at the 50th time difference.

With simple iterative approach, Fibonacci Series can be calculated in the same ballpark time as of DP. Current example is just to keep things simple and to understand the DP concept. Please look at the end of post for common examples that would clarify where DP would be of real help over recursion. (& iterative approach would be difficult to code)

Above approach of DP is considered Bottom-Up approach as we started with bottom (lowest term) and then moved to the highest one. This is tabulation. We keep on filling on the cache here till we reach the target.

Alternatively, there is a Top-down approach. We start solving from highest term and store solutions of sub problems along the way. This is memoization. A code for this approach would look like:

private static Dictionary<int, long> memo 
                    = new Dictionary<int, long>();

static long FibonacciDPMemoization(int number)
{
    if (number == 0 || number == 1) {
        return number;
    }
    
    // see if we've already calculated this
    if (memo.ContainsKey(number)) {                
        return memo.GetValueOrDefault(number);
    }

    long result = FibonacciDPMemoization(number - 1) 
                + FibonacciDPMemoization(number - 2);
    memo.Add(number, result);

    return result;
}

memoization is sometimes simpler to understand and write code because of it’s natural way of solving.

Generally, tabulation outperformes memoization by a constant factor. This is because of no overhead for recursion and can be stored as a pre-allocated array. We first calculate and store the value of a sub problem repeated most number of times here.

Few Common Examples?

  • Tower of Hanoi
  • Checkerboard
  • Egg dropping puzzle
  • Matrix chain multiplication

Details of these above problem can be read here.

Beware?

Apart from avoiding problems that do not have Overlapping Subproblems & Optimal Substructure, one need to understand we are doing a trade here – space for time!

We can always discuss that though DP is using extra space to store the sub-problems data, but in turn helps save in memory calculation which could be expensive. Thus, in case of constrained space, we need to evaluate usage of DP.


Keep learning!

samples GitHub Profile Readme
Learn Python – Beginners step by step – Basics and Examples
Sandeep Mewara Github
Sandeep Mewara Learn By Insight
Matplotlib plot samples

Any anagram of a string, a palindrome?

Often in our group we discuss about puzzles or problems related to data structure and algorithms. One such day, we discussed about:

how will we find if any anagram of a string is palindrome or not?

Our first thought went in the direction to start from first character and then traverse till end to see if there could be matching pair. Keep track of it, move to next character till middle and then stitch all to figure if so. It solves, but the query was – could it be solved better though?

Of course! With putting some stress on the brain, it turned out that in a single read, we will have info enough, to tell, if any anagram formed from the input string can be a palindrome or not.

Thought converted to Code

static void CheckIfStringAnagramHasPalindrome()
{
    Console.WriteLine($"Please enter a string:");

    // Ignore casing
    var inputString = Console.ReadLine().ToLower();

    // Just need to keep track of unique characters
    var characterSet = new HashSet<char>();

    // Single traversal of input string 
    for(int i=0; i<inputString.Length; i++)
    {
        char currentCharacter = inputString[i];
        if(characterSet.Contains(currentCharacter))
            characterSet.Remove(currentCharacter);
        else
            characterSet.Add(currentCharacter);
    }

    // Character counts in set will help 
    // identify if palindrome possible 
    var leftChars = characterSet.Count;
    if(leftChars == 0 || leftChars == 1)
        Console.WriteLine($"YES - possible.");
    else
        Console.WriteLine($"NO - Not possible.");
}


Approach looked good, as with a single traversal and usage of HashSet, i.e. with overall Order of Time complexity O(n) & Space complexity O(1), we were able to solve it.

It was fun solving!

GitHub Profile Readme Samples
Learn Python – Beginners step by step – Basics and Examples
Sandeep Mewara Github
Sandeep Mewara Learn By Insight

Read from console in VS Code

I moved to Apple Mac late last year because of different set of technologies now I work in. As shared in one of my previous posts, I use Visual Studio Code for programming in Python exploring Machine Learning. Though, for anything in .NET, I switch to a Windows VM and use Visual Studio.

For quick console apps, it feels painful to switch to a VM and work. Thus, I looked and installed C# extension in VS Code to try of. Details are here.

While running a console app, I got stuck to read any value from Console. In debug mode, IDE would stop on the Console.ReadLine() but whatever I type in Console would not go through.

I looked around and found that there are few settings for Console in VS Code. The console setting controls what console (terminal) window the target app is launched into.

"internalConsole" (default) : This does NOT work for applications that want to read from the console (ex: Console.ReadLine).

How to Solve it?

Suggested way to take input is to set the console setting as integratedTerminal. This is a configuration setting in the launch.json file under .vscode folder.

"integratedTerminal" : the target process will run inside VS Code’s integrated terminal (Terminal tab in the tab group beneath the editor). Alternatively add "internalConsoleOptions": "neverOpen" to make it so that the default foreground tab is the terminal tab.

Change the default setting like below:

vscode-read-console

With above change, the input and output will happen through integrated terminal like:

vscode-read-console-2

So far, it looks good and seems I will stick to Visual Studio Code on Mac for quick console applications.

Reference here.


Keep learning!

LearnByInsight
GitHub Profile Readme Samples

New C# features I really liked!

There has been many new features added to C# over last few years. A recent survey in CodeProject community lead me to the thought of sharing what I find really helpful. It spreads from C# 6.0 to C# 8.0. Below few made writing code easy, more fun and have improved productivity.

Null Conditional Operator (?. & ?[])

They make null checks much easier and fluid. Add a ? just before the the member access . or indexer access [] that can be null. It short-circuits and returns null for assignment.

// safegaurd against NullReferenceException

Earlier
if(address != null)
{
   var street = address.StreetName;
}
Now
var street = address?.StreetName;

// safegaurd against IndexOutOfRangeException

Earlier
if(row != null && row[0] != null)
{
   int data = row[0].SomeCount;
}
Now
int? data = row?[0]?.SomeCount;

Null Coalescing Operator (?? & ??=)

Null-coalescing operator ?? helps to assign default value if the properties is null. Often, used along with null conditional operator.

Earlier
if(address == null)
{
   var street = "NA";
}
Now
var street = address?.StreetName ?? "NA"; 

Null-coalescing assignment operator ??= helps to assign the value of its right-hand operand to its left-hand operand only if the left-hand operand evaluates to null.

The left-hand operand of the ??= operator must be a variable, a property, or an indexer element

Earlier
int? i = null;

if(i == null)
   i = 0;

Console.WriteLine(i);  // output: 0
Now
int? i = null;

i ??= 0;
Console.WriteLine(i);  // output: 0

String Interpolation ($)

It enables to embed expressions in a string. With a special character $ to identify a string literal as an interpolated string. Interpolation expressions are replaced by the string representations of the expression results in the result string.

Earlier
string address = string.Format("{0},{1}", HouseNo, StreetName);

log.Write("Address: "+ HouseNo.ToLower() + "," + StreetName);
Now
string address = $"{HouseNo}, {StreetName}";

log.Write($"Address: {HouseNo.ToLower()}, {StreetName}");

Auto-Property Initializer

It helps declare the initial value for a property as part of the property declaration itself.

Earlier
Language _currentLanguage = Language.English;
public Language CurrentLanguage
{
   get { return _currentLanguage; }
   set { _currentLanguage = value; }
}   

// OR

// Improvement in C# 3.0
public Language CurrentLanguage { get; set; }

public MyClass()
{
    CurrentLanguage = Language.English;
}
Now
public Language CurrentLanguage { get; set; } = Language.English;

using static

It helps to import the enum or static methods of a single class.

Earlier
public class Enums
{
    public enum Language
    {
        English,
        Hindi,
        Spanish
    }
}

// Another file
public class MyClass
{
    public Enums.Language CurrentLanguage { get; set; };
}
Now
public class Enums
{
    public enum Language
    {
        English,
        Hindi,
        Spanish
    }
}

// Another file
using static mynamespace.Enums

public class MyClass
{
    public Language CurrentLanguage { get; set; };
}

Tuples

They are lightweight data structures that contain multiple fields to represent the data members.

# Initialize Way 1
(string First, string Second) ranks = ("1", "2");
Console.WriteLine($"{ranks.First}, {ranks.Second}");

# Initialize Way 2
var ranks = (First: "1", Second: "2");
Console.WriteLine($"{ranks.First}, {ranks.Second}");

It support == and !=

Expression bodied get/set accessors

With it, members can be implemented as expressions.

Earlier
public string Title
{
    get { return _title; }
    set 
    { 
       this._title = value ?? "Default - Hello";
    }
}
Now
public string Title
{
    get => _title;
    set => this._title = value ?? "Default - Hello";
}

Access modifier: private protected

A new compound access modifier: private protected to indicate a member accessible by containing class or by derived classes that are declared in the same assembly. One more level of abstraction compared to protected internal.

// Assembly1.cs
public class BaseClass
{
    private protected int myValue = 0;
}

public class DerivedClass1 : BaseClass
{
    void Access()
    {
        var baseObject = new BaseClass();

        // Error CS1540, because myValue can only be
        // accessed by classes derived from BaseClass
        // baseObject.myValue = 5;

        // OK, accessed through the current 
        // derived class instance
        myValue = 5;
    }
}

//
// Assembly2.cs
// Compile with: /reference:Assembly1.dll
class DerivedClass2 : BaseClass
{
    void Access()
    {
        // Error CS0122, because myValue can only
        // be accessed by types in Assembly1
        // myValue = 10;
    }
}

await

It helps suspend evaluation of the enclosing async method until the asynchronous operation represented by its operand completes. On completion, it returns result of the operation if any.

It does not blocks the thread that evaluates async method, instead suspends the enclosing async method and returns to the caller of the method.

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class AwaitOperatorDemo
{
    // async Main method allowed since C# 7.1 
    public static async Task Main()
    {
        Task<int> downloading = DownloadProfileAsync();
        Console.WriteLine($"{nameof(Main)}: Started download.");

        int bytesLoaded = await downloading;
        Console.WriteLine($"{nameof(Main)}: Downloaded {bytesLoaded} bytes.");
    }

    private static async Task<int> DownloadProfileAsync()
    {
        Console.WriteLine($"{nameof(DownloadProfileAsync)}: Starting download.");

        var client = new HttpClient();
        // time taking call - await and move on
        byte[] content = await client.GetByteArrayAsync("https://learnbyinsight.com/about/");

        Console.WriteLine($"{nameof(DownloadProfileAsync)}: Finished download.");
        return content.Length;
    }
}

// Output:
// DownloadProfileAsync: Starting download.
// Main: Started download.
// DownloadProfileAsync: Finished download.
// Main: Downloaded 27700 bytes.

Default Interface methods

Now, we can add members to interfaces and provide a default implementation for those members. It helps in supporting backward compatibility. There would be no breaking change to existing interface consumers. Existing implementations inherit the default implementation.

public interface ICustomer
{
    DateTime DateJoined { get; }
    string Name { get; }

    // Later added to interface:
    public string Contact()
    {
       return "contact not provided";
    }
}

Wrap up

There are many more additions to C#. Believe, above are few that one should know and use in their day to day coding right away (if not already doing it). Most of it helps us with being more concise and avoid convoluted code.

Reference: https://docs.microsoft.com/en-us/dotnet/csharp/whats-new

Keep learning!

LearnByInsight
GitHub Profile Readme Samples

Kubernetes – Evolution of application deployment

Kubernetes (K8s) is turning out as the cutting-edge of application deployment. It is becoming core to the creation and operation of modern software (few call it as modern SaaS). Thus, I planned to look into it and see what Kubernetes is and how/what application design will help adapt it in the application deployment evolution.

Kubernetes is a portable, extensible, open-source platform for automating deployment, scaling, and management of containerized applications.

History

Google originally designed and open-sourced the Kubernetes project in 2014. Kubernetes has inputs from over 15 years of Google’s experience to run production workloads at scale with best ideas and practices from the community. It is maintained by the Cloud Native Computing Foundation now. It’s current development repository is here.

First challenge …

With modern goal parameters like: recoverability, release cycle time & release frequency – applications need to be designed and deployed in a way that makes them improve year over year.

This leads to first step of breaking the monolith into microservices such that the changes and impact are compartmentalized for easy deployment and recovery.

monolith2microservice

A monolithic application puts all it’s functionality in a single process. In need of scaling, it replicates entire monolith on multiple servers. On the other hand, a microservice architecture separates out (keeps) each functionality into a separate service. Thus in case of scaling need, these services are distributed across servers as required.

Second challenge …

With multiple microservices in play, a variance of stack versions or deployment styles kicks in as trouble. Each team would have their own set of tools, versions to build the artifacts, store them and then deploy them. Thus, different applications/services can have different patterns and network topology. This in turn makes managing security and infrastructure more challenging.

This leads to the step of abstracting infrastructure out to ease maintenance and relieve from security and other infrastructure related concerns.

deployment-progression
Deployment scheme evolution
  • Traditional: Applications running on a physical server. No way to define resource boundaries for applications.
  • Virtualization: Allows to run multiple Virtual Machines (VMs) on a single physical server’s CPU. This leads to better utilization of resources and better scalability as an application can be added or updated easily. Also, if needed, applications can be isolated between different VMs to provide a level of security.
  • Containers: Like VM, it has its own filesystem, CPU, memory, process space, etc. Are environment consistent, easy to scale, portable across clouds and OS distributions. This leads to loosely coupled setup where application is totally decoupled from infrastructure and makes it easy to move towards smaller, modular microservices.

Containers are abstraction to next level. It does not matter on which OS you are on (although there could be different containers for different OS and how they work underlying), all we need is to package our code and needed libraries together, which then runs inside a container based on configured resource need. Docker is an example of container runtime, a packaging software.

Final challenge …

So, the packaging has been simplified and running the application on a single node has been simplified. When we move to enterprise, we need to scale up/down our containers on need basis automatically. Further, one would scale the application to be served from multiple servers instead of just one for better load distribution and easy recovery/fail safe. Now, while distributing the load, we would need to ensure the availability of nodes, resources like space on node for running a container, etc.

This is where Kubernetes pitch in. It acts as a container orchestrator that help provides with a framework to run distributed systems resiliently. It takes care of scaling and failover of containers having application, provides deployment patterns, and more.

kubernetes-architecture

Kubernetes has master-slave architecture where there is one master node and multiple worker nodes. A Pod is the smallest deployable unit in it. In order to run a single container, we would need to create a Pod for that container. A Pod can contain more than one container if those containers are relatively tightly coupled (like a container to download all secret configs related before application starts in other container).

API Server is the heart of the architecture. User interacts with Kubernetes via it and master node communicates to worker nodes through it. Number of containers requested is stored in the etcd (key-value store). Controller acts as a manager that keeps a constant check on the store, schedules the request for scheduler to pick and execute, spins of another worker node in case of need.

Wrap Up …

I have just touched the surface of both containerization and Kubernetes. They seem to have much more and can be explored in depth. Along with vast benefits, it can also bring new challenges on the table with moving to cloud like security and networking.

It was good to know how application design and deployment are evolving, getting abstracted and loosely coupled.

Keep learning!

Reference: https://kubernetes.io/docs/home/

GitHub Readme Samples

Beginner’s Guide to understand Kafka

It’s a digital age. Wherever there is data, we hear about Kafka these days. One of my projects I work, involves entire data system (Java backend) that leverages Kafka to achieve what deals with tonnes of data through various channels and departments. While working on it, I thought of exploring the setup in Windows. Thus, this guide helps learn Kafka and showcases the setup and test of data pipeline in Windows.

Introduction

<kafka-logo>
An OpenSource Project in Java & Scala

Apache Kafka is a distributed streaming platform with three key capabilities:

  • Messaging system – Publish-Subscribe to stream of records
  • Availability & Reliability – Store streams of records in a fault tolerant durable way
  • Scalable & Real time – Process streams of records as they occur

Data system components

Kafka is generally used to stream data into applications, data lakes and real-time stream analytics systems.

<kafka-highlevel-architecture>

Application inputs messages onto the Kafka server. These messages can be any defined information planned to capture. It is passed across in a reliable (due to distributed Kafka architecture) way to another application or service to process or re-process them.

Internally, Kafka uses a data structure to manage its messages. These messages have a retention policy applied at a unit level of this data structure. Retention is configurable – time based or size based. By default, the data sent is stored for 168 hours (7 days).

Kafka Architecture

Typically, there would be multiples of producers, consumers, clusters working with messages across. Horizontal scaling can be easily done by adding more brokers. Diagram below depicts the sample architecture:

kafka-internals

Kafka communicates between the clients and servers with TCP protocol. For more details, refer: Kafka Protocol Guide

Kafka ecosystem provides REST proxy that allows an easy integration via HTTP and JSON too.

Primarily it has four key APIs: Producer API, Consumer API, Streams API, Connector API

Key Components & related terminology
  • Messages/Records – byte arrays of an object. Consists of a key, value & timestamp
  • Topic – feeds of messages in categories
  • Producer – processes that publish messages to a Kafka topic
  • Consumer – processes that subscribe to topics and process the feed of published messages
  • Broker – It hosts topics. Also referred as Kafka Server or Kafka Node
  • Cluster – comprises one or more brokers
  • Zookeeper – keeps the state of the cluster (brokers, topics, consumers)
  • Connector – connect topics to existing applications or data systems
  • Stream Processor – consumes an input stream from a topic and produces an output stream to an output topic
  • ISR (In-Sync Replica) – replication to support failover.
  • Controller – broker in a cluster responsible for maintaining the leader/follower relationship for all the partitions
Zookeeper

Apache ZooKeeper is an open source that helps build distributed applications. It’s a centralized service for maintaining configuration information. It holds responsibilities like:

  • Broker state – maintains list of active brokers and which cluster they are part of
  • Topics configured – maintains list of all topics, number of partitions for each topic, location of all replicas, who is the preferred leader, list of ISR for partitions
  • Controller election – selects a new controller whenever a node shuts down. Also, makes sure that there is only one controller at any given time
  • ACL info – maintains Access control lists (ACLs) for all the topics

Kafka Internals

Brokers in a cluster are differentiated based on an ID which typically are unique numbers. Connecting to one broker bootstraps a client to the entire Kafka cluster. They receive messages from producers and allow consumers to fetch messages by topic, partition and offset.

A Topic is spread across a Kafka cluster as a logical group of one or more partitions. A partition is defined as an ordered sequence of messages that are distributed across multiple brokers. The number of partitions per topic are configurable during creation.

Producers write to Topics. Consumers read from Topics.

<kafka-partition>

Kafka uses Log data structure to manage its messages. Log data structure is an ordered set of Segments that are collection of messages. Each segment has files that help locate a message:

  1. Log file – stores message
  2. Index file – stores message offset and its starting position in the log file

Kafka appends records from a producer to the end of a topic log. Consumers can read from any committed offset and are allowed to read from any offset point they choose. The record is considered committed only when all ISRs for partition write to their log.

leader-follower

Among the multiple partitions, there is one leader and remaining are replicas/followers to serve as back up. If a leader fails, an ISR is chosen as a new leader. Leader performs all reads and writes to a particular topic partition. Followers passively replicate the leader. Consumers are allowed to read only from the leader partition.

A leader and follower of a partition can never reside on the same node.

leader-follower2

Kafka also supports log compaction for records. With it, Kafka will keep the latest version of a record and delete the older versions. This leads to a granular retention mechanism where the last update for each key is kept.

Offset manager is responsible for storing, fetching and maintaining consumer offsets. Every live broker has one instance of an offset manager. By default, consumer is configured to use an automatic commit policy of periodic interval. Alternatively, consumer can use a commit API for manual offset management.

Kafka uses a particular topic, __consumer_offsets, to save consumer offsets. This offset records the read location of each consumer in each group. This helps a consumer to trace back its last location in case of need. With committing offsets to the broker, consumer no longer depends on ZooKeeper.

Older versions of Kafka (pre 0.9) stored offsets in ZooKeeper only, while newer version of Kafka, by default stores offsets in an internal Kafka topic __consumer_offsets

consumer-groups

Kafka allows consumer groups to read data in parallel from a topic. All the consumers in a group has same group ID. At a time, only one consumer from a group can consume messages from a partition to guarantee the order of reading messages from a partition. A consumer can read from more than one partition.

Kafka Setup On Windows

setup-on-windows
Pre-Requisite
Setup files
  1. Install JRE – default settings should be fine
  2. Un-tar Kafka files at C:\Installs (could be any location by choice). All the required script files for Kafka data pipeline setup will be located at: C:\Installs\kafka_2.12-2.5.0\bin\windows
  3. Configuration changes as per Windows need
    • Setup for Kafka logs – Create a folder ‘logs’ at location C:\Installs\kafka_2.12-2.5.0
    • Set this logs folder location in Kafka config file: C:\Installs\kafka_2.12-2.5.0\config\server.properties as log.dirs=C:\Installs\kafka_2.12-2.5.0\logs
    • Setup for Zookeeper data – Create a folder ‘data’ at location C:\Installs\kafka_2.12-2.5.0
    • Set this data folder location in Zookeeper config file: C:\Installs\kafka_2.12-2.5.0\config\zookeeper.properties as dataDir=C:\Installs\kafka_2.12-2.5.0\data
Execute
  1. ZooKeeper – Get a quick-and-dirty single-node ZooKeeper instance using the convenience script already packaged along with Kafka files.
    • Open a command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • Execute script: zookeeper-server-start.bat C:\Installs\kafka_2.12-2.5.0\config\zookeeper.properties
    • ZooKeeper started at localhost:2181. Keep it running.
      demo-zookeeper
  2. Kafka Server – Get a single-node Kafka instance.
    • Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • ZooKeeper is already configured in the properties file as zookeeper.connect=localhost:2181
    • Execute script: kafka-server-start.bat C:\Installs\kafka_2.12-2.5.0\config\server.properties
    • Kafka server started at localhost: 9092. Keep it running.
      demo-kafka
      Now, topics can be created and messages can be stored. We can produce and consume data from any client. We will use command prompt for now.
  3. Topic – Create a topic named ‘testkafka’
    • Use replication factor as 1 & partitions as 1 given we have made a single instance node
    • Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • Execute script: kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic testkafka
    • Execute script to see created topic: kafka-topics.bat --list --bootstrap-server localhost:9092
      demo-topic
    • Keep the command prompt open just in case.
  4. Producer – setup to send messages to the server
    • Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • Execute script: kafka-console-producer.bat --bootstrap-server localhost:9092 --topic testkafka
    • It will show a ‘>’ as a prompt to type a message. Type: “Kafka demo – Message from server”
      demo-producer
    • Keep the command prompt open. We will come back to it to push more messages
  5. Consumer – setup to receive messages from the server
    • Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
    • Execute script: kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic testkafka --from-beginning
    • You would see the Producer sent message in this command prompt window – “Kafka demo – Message from server”
      demo-consumer
    • Go back to Producer command prompt and type any other message to see them appearing real time in Consumer command prompt
      kafka-demo
  6. Check/Observe – few key changes behind the scene
    • Files under topic created – they keep track of the messages pushed for a given topic
      topic-files
    • Data inside the log file – All the messages that are pushed by producer are stored here
      topic-log
    • Topics present in Kafka – once a consumer starts reading messages from topic, __consumer_offsets is automatically created as a topic
      topic-present

NOTE: In case you want to choose Zookeeper to store topics instead of Kafka server, it would require following script commands:

  • Topic create: kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
  • Topics view: kafka-topics.bat --list --zookeeper localhost:2181

With above, we are able to see messages sent by Producer and received by Consumer using a Kafka setup.

When I tried to setup Kafka, I faced few issues on the way. I have documented them for reference to learn. This should also help others if they face something similar: Troubleshoot: Kafka setup on Windows.

One should not encounter any issues with below shared files and the steps/commands shared above.

Download entire modified setup files for Windows from here: https://github.com/sandeep-mewara/kafka-demo-windows

hurray

References:
https://kafka.apache.org
https://cwiki.apache.org/confluence/display/KAFKA
https://docs.confluent.io/2.0.0/clients/consumer.html

Make browser a basic html editor

While working on one of my recent blogs, I stumbled upon an HTML DOM property that looked interesting.

In past,

  • I have to see how the text change looks in a webpage – make change, refresh page or run the application again
  • I have to inspect, find the related DOM to make any text change to it and then write code to make the change to see it
  • I downloaded HTML page and then made some change in its text to add/edit/remove some comments for clean print.
  • Have some logic to provide an editable HTML page to users

Well, no more. Seems we have a new property (surely it was not there few years back but introduced recently): document.designMode

I tried in Firefox, from menu items, go to: Tools -> Web Developer -> Web Console. Write:

document.designMode = "on"

Post this, you can edit the webpage text right in your browser!

documentModeEx

Sample real use case could be providing a portion of page editable to users. Add that in an iframe and then turn the designMode of that to ‘on’:

iframeNode.contentDocument.designMode = "on";

A string indicating whether designMode is (or should be) set to on or off. Valid values are on and off

In IE, it would be under Developer Tools, and so on for other browsers.

design-mode
Browser Compatibility

Nice to have something like it to to convert browser into a basic HTML editor! Keep learning.

Reference: https://developer.mozilla.org/en-US/docs/Web/API/Document/designMode

Quick look into SignalR

Last year, I was looking into various ways of communication between client and server for one of my project work. Evaluated SignalR too as one of the options. I found decent documentation online to put across a test app to see how it works.

Introduction

SignalR is a free and open-source software library for Microsoft ASP.NET that provides ability to server-side code push content to the connected clients in real-time.

Pushing data from the server to the client (not just browser clients) has always been a tough problem. SignalR makes it dead easy and handles all the heavy lifting for you.

https://github.com/SignalR/SignalR

Detailed documentation can be found at: https://dotnet.microsoft.com/apps/aspnet/signalr

Most of the shareout online considers it as ASP.NET/Web application solution only which is not true. As mentioned in quote above, client could be a non-browser too like a desktop application.

I gathered that behind the scenes, SignalR primarily tries to use Web Socket protocol to send data. WebSocket is a new HTML5 API (refer my detailed article on it) that enables bi-directional communication between the client and server. In case of any compatibility gap, SignalR uses other protocols like long polling.

P.S.: Since most of the code to make the test app was picked from web, all the credit to them. One specific source that I can recall: https://docs.microsoft.com/en-us/aspnet/signalr/overview/deployment/tutorial-signalr-self-host

Now, a quick peek into the SignalR test app
  • Using the SignalR Hub class to setup the server. Defined a hub that exposes a method like an end point for clients to send a message. Server can process the message and send back a response to all or few of the clients.
[HubName("TestHub")]
public class TestHub : Hub
{
    public void ProcessMessageFromClient(string message)
    {
        Console.WriteLine($"<Client sent:> {message}");

        // Do some processing with the client request and send the response back
        string newMessage = $"<Service sent>: Client message back in upper case: {message.ToUpper()}";
        Clients.All.ResponseToClientMessage(newMessage);
    }
}

  • Specify which “origins” we want to allow our application to accept. It is setup via CORS configuration. CORS is a security concept that allows end points from different domains to interact with each other.
public void Configuration(IAppBuilder app)
{
     app.UseCors(CorsOptions.AllowAll);
     app.MapSignalR();
}

  • Server is setup using OWIN (Open Web Interface for .NET). OWIN defines an abstraction between .NET web servers and web applications. This helps in self-hosting a web application in a process, outside of IIS.
static void Main(string[] args)
{
    string url = @"http://localhost:8080/";

    using (WebApp.Start<Startup>(url))
    {
        Console.WriteLine($"============ SERVER ============");
        Console.WriteLine($"Server running at {url}");
        Console.WriteLine("Wait for clients message requests for server to respond OR");
        Console.WriteLine("Type any message - it will be broadcast to all clients.");
        Console.WriteLine($"================================");

        // For server broadcast test
        // Get hub context 
        IHubContext ctx = GlobalHost.ConnectionManager.GetHubContext<TestHub>();

        string line = null;
        while ((line = Console.ReadLine()) != null)
        {
            string newMessage = $"<Server sent:> {line}";
            ctx.Clients.All.MessageFromServer(newMessage);
        }

        // pause to allow clients to receive
        Console.ReadLine();
    }
}

In above code, Using IHubContext:

  1. Server is setup to have earlier defined Hub as one of the broadcast end point.
  2. Server is also setup to broadcast any message to all clients by itself if needed be

  • Client is setup to communicate with the server (send and receive message) via hub using HubConnection & IHubProxy. Client can invoke the exposed end point in the hub to a send a message.
static void Main(string[] args)
{
    string url = @"http://localhost:8080/";

    var connection = new HubConnection(url);
    IHubProxy _hub = connection.CreateHubProxy("TestHub");
    connection.Start().Wait();

    // For server side initiation of messages
    _hub.On("MessageFromServer", x => Console.WriteLine(x));
    _hub.On("ResponseToClientMessage", x => Console.WriteLine(x));

    Console.WriteLine($"============ CLIENT ============");
    Console.WriteLine("Type any message - it will be sent as a request to server for a response.");
    Console.WriteLine($"================================");

    string line = null;
    while ((line = Console.ReadLine()) != null)
    {
        // Send message to Server
        _hub.Invoke("ProcessMessageFromClient", line).Wait();
    }

    Console.Read();
}

With above setup, we can see the communication between Client & Server realtime like below:

SignalR test

Things to consider while opting for SignalR

SignalR looks awesome. But, there are couple of things one should know while using it:

  • It’s a connected technology – each client connected through SignalR is using a persistent and dedicated connection on the Web Server. SignalR is shared as scalable but it never harms to look for any queries around connections, memory, cpu, etc with higher number of clients.
  • One would need to setup a valid host server with a PORT open on it that can be used for communication. This could depend on an organisation security protocols and approvals.

Hope this gives a quick overview about SignalR. Keep learning!

Download entire code for lookup from here: https://github.com/sandeep-mewara/SignalRTest

Beginner’s quick start to learn React.js

I recently experimented with React.js, so thought of sharing key points that I learnt. Though there are handful of material online, couldn’t find one that covers all in a concise way that can help learn key aspects of ReactJS. I believe this would resonate with few and help them learn, understand and get a jumpstart with ReactJS.

react-js
What is React.js?

React.js is an open source JavaScript based library for building frontend (user interface) of a web or mobile application.

Why React.js?

Every web application core is to have a fast rendering response for better user experience. Because of this ease, users come back often and it leads higher usage and adaptability.

Further, based on how it achieves speed, it is scalable and reusable.

How React.js does it?

React.js works at component level. It helps break an app into many small components with their own responsibilities. This makes things simpler and scalable. With this breakdown,

  • it’s easier to refresh/update a portion of view without reloading an entire page of the app.
  • it leads to build once and reuse across.

Another key part of React.js is being declarative. There is a an abstraction from details on how to do. This makes it easier to read and understand.

A declarative example would be telling my son to make a house craft from paper instead of guiding him with each step of how to get the paper, cut it, paste it to form a house craft. Of course the assumption here has to be true that my son knows how to make it.

A quick comparison with jQuery here (it’s imperative) – it would need details on how to build the house craft.

Translating above in Javascript langauge world:

  • With React – we define on how we want a particular component to be rendered and we never interact with DOM to reference later
  • With jQuery – we would tell browser exactly what needs to be done using DOM elements or events need basis
Key features

Following features help us achieve above:

  • Components – Simple or State

These are small reusable codes that returns a React element to render. This component can have state related aspect based on need.

// Simple component - a Function Component
// props - input to React component - data passed from parent caller
function ComponentExample(props) {
  return <h1>Hola! {props.name}</h1>;
}

// Simple component - a Class Component
class ComponentExample extends React.Component {
  render() {
    return <h2>Hola! {this.props.name}</h2>;
  }
}

// State based component
// Needed when data associated with component change over time
class ComponentExample extends React.Component {
  constructor(props) {
    super(props);
    this.state = {author: "Sandeep Mewara"};
  }
  render() {
    return (
      <div>
        <h2>Hola! {this.props.name}</h2>
        <p>Author: {this.state.author}</p>
      </div>
   );
  }
}

For above example component, use normal HTML syntax: <ComponentExample />

  • Virtual DOM

DOM (Document Object Model) is a structured representation of the HTML elements present on a web page. Traditionally, one would need to get elements out of DOM to make any change. In context of an area of a webpage, it would need a lot more work to refresh it with updated content when needed.

React helps here with its declarative API. A copy of actual DOM is kept in memory which is much faster to change. Once done, React uses its ReactDOM library to sync the virtual representation of UI in memory to the actual DOM.

ReactDOM library internally keeps two VDOMs – one before update and one after. With them, React knows exactly what all to be updated in actual DOM and does all of it on the fly leading much faster updates compared to traditional DOM updates.

React.js has a library ReactDOM to access and modify the actual DOM.

To render HTML on a webpage, use: ReactDOM.render()

  • JSX (JavaScript eXtension)

JSX is a syntax extension to JavaScript that follows XML rules. It’s more of a helpful tool than requirement in React as mentioned below in their website:

React doesn’t require using JSX, but most people find it helpful as a visual aid when working with UI inside the JavaScript code

JSX converts HTML tags into React elements that are placed in DOM without any commands like createElements(), etc.

// Example with JSX
const testHtml = <h2>Hola! Sandeep Mewara</h2>;
ReactDOM.render(testHtml, document.getElementById('root'));

// Same above example without JSX
const testHtml = React.createElement('h2', {}, 'Hola! Sandeep Mewara');
ReactDOM.render(testHtml, document.getElementById('root'));

Normally, we can’t assign an HTML tag to a JavaScript variable but we can with JSX!

  • Unidirectional data flow

React implements one way reactive data flow. It uses flux as a pattern to keep data unidirectional. Interpret it as you often nest child components within higher order parent components. Snapshot of state is passed across from parent to child components via props (readonly, cannot be updated) and updates from child to parent happen via callbacks bound to some control on child component.

  • ES6 compatible

React library is ES6 (ECMAScript 2015 or JavaScript 6) enabled and thus makes it easier to write code in React. Among all changes to standardize JavaScript in ES6Classes introduction is one of them which plays a critical role in React.

  • Lifecycle

Each React component has a lifecycle that helps write a code at a specific time during the flow as per need.

// Use class for any local state & lifecycle hooks
class TestClass extends Component 
{  
    // first call to component when it is initiated
    constructor(props) 
    {    
        // with it, 'this' would refer to this component
        super(props); 
        // some local state initialized 
        this.state = {currentdate: new Date()};
    };   
    
    // executes before the initial render
    componentWillMount() {    
     
    };  
    
    // executes after the initial render
    componentDidMount() {  

    };

    // executes when component gets new props
    componentWillReceiveProps() {   
          
    };

    // executes before rendering with new props or state
    shouldComponentUpdate() {   
          
    };
    
    // executes before rendering with new props or state
    componentWillUpdate() {   
        
    };

    // executes after rendering with new props or state
    componentDidUpdate() {   
          
    };
    
    // executes before component is removed from DOM
    componentWillUnmount() {   
          
    };

    // HTML to be displayed by the component rendering 
    render() 
    {    
        return (      
            <h1>Current Date: {this.state.currentdate.toString()}</h1>
        );  
    }; 
}

For entire React glossary, please refer: https://reactjs.org/docs/glossary.html

Sample application Setup

We will explore and understand more from React’s demo app. We will jump start our sample app bootstrapped with Create React App

I used yarn create react-app demo-react-app and opened the created directory in IDE that looked like:

default react project structure

With above, once I ran yarn start in root folder demo-react-app, app was up and running without any code change. We can see default app hosted in browser at following url: http://localhost:3000/

default home page

Quick look at few key files that connect dots that lead to above UI view:

  • public/index.html

Base file which we browse using url. We see the HTML defined in it. For now, the element to notice would be a div named root.

  • src/index.js

Located at root of app, is like an entry file (like main) for app that has code like below:

import React from 'react';
import ReactDOM from 'react-dom';
import './index.css';
import App from './App';
import * as serviceWorker from './serviceWorker';

ReactDOM.render(
  <React.StrictMode>
    <App />
  </React.StrictMode>,
  document.getElementById('root')
);

// If you want your app to work offline and load faster, you can change
// unregister() to register() below. Note this comes with some pitfalls.
// Learn more about service workers: https://bit.ly/CRA-PWA
serviceWorker.unregister();

It imported React and related library, CSS file for app, a component named App. After this, it defines a render method which displays whatever is defined in component App as page root element.

  • src/App.js

Defines a function component of React that returns an html with React logo and a link to render.

import React from 'react';
import logo from './logo.svg';
import './App.css';

function App() {
  return (
    <div className="App">
      <header className="App-header">
        <img src={logo} className="App-logo" alt="logo" />
        <p>
          Edit <code>src/App.js</code> and save to reload.
        </p>
        <a
          className="App-link"
          href="https://reactjs.org"
          target="_blank"
          rel="noopener noreferrer"
        >
          Learn React
        </a>
      </header>
    </div>
  );
}

export default App; 

How did index.js got connected with index.html?

Create React App uses Webpack with html-webpack-plugin underneath. This Webpack uses src/index.js as an entry point. Because of this, index.js comes into picture and all other modules referenced in it. With html-webpack-plugin configuration, it automatically adds the script tag in html page.

Let’s see with few modifications to the app now!

Specifically I will be changing flavour of above 3 files to play around.

  1. AppHola.js file for a HelloWorld kind of change – displays my name instead of other texts
  2. AppNavigation.js (has portion of pages updated)
    • Introduction – simple display of texts
    • Clock/counters auto updating
    • Random color generator that updates background color of defined area
demo-app-gif

Given this was for beginners, I have not added too much of complexity to the app. I have tried to keep it as simple possible with some variance of what all can be tried.

There are plenty of imports that can be used. For example, in our demo app, to have navigation, we have used a navigation router react-router-dom import (run npm i react-router-dom --save inside root folder).

Hope this short guide/tutorial gives a broad overview about React.JS and how to start development of the same. Keep learning!

Download entire code for lookup from here: https://github.com/sandeep-mewara/demo-react-app