Three A’s of the Next Gen World!

October 4, 2020October 11, 2020Sandeep Mewara Leave a comment

There are many new ideas and research going on in today’s tech world. I believe the potential of maximum disruption is by the three A’s – Analytics, Automation & Artificial Intelligence. Because of their scope of evolution and impact, they tend to be the foundation of the next generation tech world.
.

A natural progressive way of doing it would be to:

first gather data (analytics)
define what’s crucial or repetitive that can be replaced to ease the flow (automate)
collate such multiple replaced things to move into predictive analysis and actions to complete a complex task (artificial intelligence)

Day by day, it looks more feasible to achieve above because of technology innovations and the processing power.

Analytics

It is the first step towards anything more concrete and is a great supportive to theories and assumptions. Enables to have data backed decisions which are easy to relate and comprehend.

This would mean collecting data and making better use of it. This would need to have a data architecture setup. With it, it would be easier to know and decide what and how to collect. Post it, using any data visualization tool, it would be easier to deduce insights, patterns and plan accordingly.

It’s not about just any or more data, but the right, quality data that would matter

Automation

It is solving for complex processes using technology. Data analytics help a great deal identify and solve for such processes. These repetitive mundane tasks can be automated and the efforts saved from it can be put elsewhere. Thus helps in concentrating on tasks that needs more human attention.

Just like machines brought industrial revolution to factory floor, automation has similar potential to transform most of our day to day work. This could lead to enhanced productivity, thus better outcomes, leading to more accurate predictions and optimizations.

Only way to scale the massive amount of work in defined time without hiring an army of professionals

Artificial Intelligence

This is exploring the world that was considered impossible a decade ago by most of us. It is easier to relate and understand when comparing with human capabilities like recognizing handwriting or identifying a particular disease cell. Not just it, AI has great potential to automate non-routine tasks.

With AI, we can analyze data, understand concepts, make assumptions, learn behaviors and provide predictions at a scale with detail that would be impossible for individual humans. It can also have self-learning to improve with more and more interactions with data and human.

AI evolution is being greatly supported with advanced algorithms and improved computing power & storage.

Potential

There is a nice data backed assessment and shareout by McKinsey Global Institute on the three A’s. I would suggest a must read of it. In there, they shared AI/ML potential usecases across industries.

AI and Automation will provide a much-needed boost to global productivity and may help some ‘moonshot’ challenges.
McKinsey Global Institute

Wrap Up

AI combined with various Process Automation and powerful Data Analytics transforms into an intelligent automated solution. Potential of these solutions are huge.

It would be more appropriate to say that it’s no more a choice but compulsion for all sectors to go through this digital transformation. Those who are able to do it will be setup sooner for the next generation of the tech world. It will provide them with an edge over their competitors, putting them in a position to take advantage big time.

Keep exploring!

samples GitHub Profile Readme

Sandeep Mewara Github
Sandeep Mewara Learn By Insight
Matplotlib plot samples
Sandeep Mewara Github Repositories

Probability Distribution – An aid to know the data

September 27, 2020October 19, 2020Sandeep Mewara Leave a comment

A probability distribution helps understand the likelihood of possible values that a random variable can take. It is one of the must needed statistical knowledge for any data science aspirant.

Few consider, Probability distributions are fundamental to statistics, like data structures are to computer science

In Layman terms

Let’s say, you pick any 100 employees of an organization. Measure their heights (or weights). As you measure them, create a distribution of it on a graph. Keep height on X-Axis & frequency of a particular height on Y-Axis. With this, we will get a distribution for a range of heights.

This distribution will help know which outcomes are most likely, the spread of potential values, and the likelihood of different results.

Basic terminology

Random Sample

The set of 100 people selected above in our example will be termed as random sample.

Sample Space

The range of possible heights of the 100 people is our sample space. It’s the set of all possible values in the setup.

Random Variable

The height of the 100 people measured are termed as random variable. It’s a variable that takes different values of the sample space randomly.

Mean (Expected Value)

Let’s say most of the people in those 100 are of height 5 feet, 3 inches (making it an average height of those 100). This would be termed expected value. It’s an average value of a random variable.

Standard deviation & Variance

Let’s say most of the people in those 100 are of height 5 feet, 1 inches to 5 feet, 5 inches. This is variance for us. It’s an average spread of values around the expected value. Standard Deviation is the square root of the variance.

Types of data

Ordinal – They have a meaningful order. All numerical data fall in this bucket. They can be ordered in relative numerical strength.
Nominal – They cannot be ordered. All categorical data fall in this bucket. Like, colors – Red, Blue & Green – there cannot be an order or a sequence of high or low in them by itself.
Discrete – an ordinal data that can take only certain values (like soccer match score)
Continuous – an ordinal data that can take any real or fractional value (like height & weight)

In Continuous distribution, random variables can have an infinite range of possible outcomes

Probability Distribution Flowchart

Following diagram shares few of the common distributions used:

distribution-common — Credit: cloudera blog

Based on above diagram, will cover three distributions to have a broad understanding:

Uniform Distribution

It is the simplest form of distribution. Every outcome of the sample space has equal probability to happen. An example would be to roll a fair dice that would have an equal probability outcome of 1-6.

Normal (Gaussian) Distribution

The most common distribution. Few would recognize this by a ‘bell curve’. Most values are around the mean value making the distribution arrangement symmetric.

Central limit theorem suggests that sum of several independent random variables is normally distributed

The area under the distribution curve is equal to 1 (all the probabilities must sum up to 1)

A parameter Mew drives the distribution center (mean). It corresponds to the maximum height of the graph. A parameter Sigma corresponds to the range of variation (variance or standard deviation).

68–95–99.7 rule (empirical rule) – approximate percentage of the data covered by ranges defined by 1, 2, and 3 standard deviations from the mean

Exponential Distribution

It is where a few outcomes are most likely with a rapid decrease in probability to all other outcomes. An example of it would be a car battery life in months.

A parameter Beta deals with scale that defines the mean and standard deviation of the distribution. A parameter Lambda deals with rate of change in the distribution

Probability Distribution Choices

I came across an awesome representation of the probability distribution choices. It works as a cheat sheet to understand the provided data.

distibutional-choices — Credit: nyu.edu/adamodar

Wrap Up

Though above is just an introduction, believe it should be good enough to start, correlate and understand some basics of machine learning algorithms. There would be more to it while working on algorithms and problems while analyzing data to predict trends, etc.

Keep learning!

samples GitHub Profile Readme

Sandeep Mewara Github
Sandeep Mewara Learn By Insight
Matplotlib plot samples

Data Visualization – Insights with Matplotlib

September 13, 2020October 5, 2020Sandeep Mewara Leave a comment

While working on a machine learning problem, Matplotlib is the most popular python library used for visualization that helps in representing & analyzing the data and work through insights.

Generally, it’s difficult to interpret much about data, just by looking at it. But, a presentation of the data in any visual form, helps a great deal to peek into it. It becomes easy to deduce correlations, identify patterns & parameters of importance.

In data science world, data visualization plays an important role around data pre-processing stage. It helps in picking appropriate features and apply appropriate machine learning algorithm. Later, it helps in representing the data in a meaningful way.

Data Insights via various plots

If needed, we will use these dataset for plot examples and discussions. Based on the need, following are the common plots that are used:

Line Chart | `ax.plot(x,y)`

It helps in representing series of data points against a given range of defined parameter. Real benefit is to plot multiple line charts in a single plot to compare and track changes.

Points next to each other are related that helps to identify repeated or a defined pattern

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 1, 0.05)
y1 = x**2
y2 = x**3

plt.plot(x, y1,
    linewidth=0.5,
    linestyle='--',
    color='b',
    marker='o',
    markersize=10,
    markerfacecolor='red')

plt.plot(x, y2,
    linewidth=0.5,
    linestyle='dotted',
    color='g',
    marker='^',
    markersize=10,
    markerfacecolor='yellow')

plt.title('x Vs f(x)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend(['f(x)=x^2', 'f(x)=x^3'])
plt.xticks(np.arange(0, 1.1,0.2),
    ['0','0.2','0.4','0.6','0.8','1.0'])

plt.grid(True)
plt.show()

Real world example:

We will work with dataset created from collating historical data for few stocks downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

stocksdf1 = pd.read_csv('data-files/stock-INTU.csv') 
stocksdf2 = pd.read_csv('data-files/stock-AAPL.csv') 
stocksdf3 = pd.read_csv('data-files/stock-ADBE.csv') 

stocksdf = pd.DataFrame()
stocksdf['date'] = pd.to_datetime(stocksdf1['Date'])
stocksdf['INTU'] = stocksdf1['Open']
stocksdf['AAPL'] = stocksdf2['Open']
stocksdf['ADBE'] = stocksdf3['Open']

plt.plot(stocksdf['date'], stocksdf['INTU'])
plt.plot(stocksdf['date'], stocksdf['AAPL'])
plt.plot(stocksdf['date'], stocksdf['ADBE'])

plt.legend(labels=['INTU','AAPL','ADBE'])
plt.grid(True)

plt.show()

With the above, we have couple of quick assessments:
Q: How a particular stock fared over last year?
A: Stocks were roughly rising till Feb 2020 and then took a dip in April and then back up since then.

Q: How the three stocks behaved during the same period?
A: Stock price of ADBE was more sensitive and AAPL being least sensitive to the change during the same period.

Histogram | `ax.hist(data, n_bins)`

It helps in showing distributions of variables where it plots quantitative data with range of the data grouped into intervals.

We can use Log scale if the data range is across several orders of magnitude.

import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
cov = [[2,4], [5, 9]]
xn, yn = np.random.multivariate_normal(
                                mean, cov, 100).T

plt.hist(xn,bins=25,label="Distribution on x-axis"); 

plt.xlabel('x')
plt.ylabel('frequency')
plt.grid(True)
plt.legend()

Real world example

We will work with dataset of Indian Census data downloaded from here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
df = populationdf[mask1 & mask2]

plt.hist(df['TOT_P'], label='Distribution')

plt.xlabel('Total Population')
plt.ylabel('State Count')
plt.yticks(np.arange(0,20,2))

plt.grid(True)
plt.legend()

With the above, couple of quick assessments about population in states of India:
Q: What’s the general population distribution of states in India?
A: More than 50% of states have population less than 2 crores (20 million)

Q: How many states are having population more than 10 crores (100 million)?
A: Only 3 states have that high a population.

Bar Chart | `ax.bar(x_pos, heights)`

It helps in comparing two or more variables by displaying values associated with categorical data.

Most commonly used plot in Media sharing data around surveys displaying every data sample.

import numpy as np
import matplotlib.pyplot as plt

data = [[60, 45, 65, 35],
        [35, 25, 55, 40]]

x_pos = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_xticks(x_pos)

ax.bar(x_pos - 0.1, data[0], color='b', width=0.2)
ax.bar(x_pos + 0.1, data[1], color='g', width=0.2)

ax.yaxis.grid(True)

Real world example

We will work with dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
statesdf = populationdf.loc[mask1].loc[mask2]
statesdf = statesdf.sort_values('TOT_P')

plt.figure(figsize=(10,8))
plt.barh(range(len(statesdf)), 
    statesdf['TOT_P'], tick_label=statesdf['Name'])
plt.grid(True)
plt.title('Total Population')
plt.show()

With the above, couple of quick assessments about population in states of India:
– Uttar Pradesh has the highest total population and Lakshadeep has lowest
– Relative popluation across states with Uttar Pradesh almost double the second most populated state

Pie Chart | `ax.pie(sizes, labels=[labels])`

It helps in showing the percentage (or proportional) distribution of categories at a certain point of time. Usually, it works well if it’s limited to single digit categories.

A circular statistical graphic where the arc length of each slice is proportional to the quantity it represents.

import numpy as np
import matplotlib.pyplot as plt

# Slices will be ordered n plotted counter-clockwise
labels = ['Audi','BMW','LandRover','Tesla','Ferrari']
sizes = [90, 70, 35, 20, 25]

fig, ax = plt.subplots()
ax.pie(sizes,labels=labels, autopct='%1.1f%%')
ax.set_title('Car Sales')
plt.show()

Real world example

We will work with dataset of Alcohol Consumption downloaded from here.

import panda as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

labels = ['Beer', 'Spirit', 'Wine']
sizes = [drinksdf['beer'].sum(), 
         drinksdf['spirit'].sum(), 
         drinksdf['wine'].sum()]

fig, ax = plt.subplots()
explode = [0.05,0.05,0.2]
ax.pie(sizes,explode=explode,
    labels=labels, autopct='%1.1f%%')

ax.set_title('Alcohol Consumption')
plt.show()

With the above, we can have a quick assessment that alcohol consumption is distributed overall. This view helps if we have less number of slices (categories).

Scatter plot | `ax.scatter(x_points, y_points)`

It helps representing paired numerical data either to compare how one variable is affected by another or to see how multiple dependent variables value is spread for each value of independent variable.

Sometimes the data points in a scatter plot form distinct groups and are called as clusters.

import numpy as np
import matplotlib.pyplot as plt

# random but focused cluster data
x1 = np.random.randn(100) + 8
y1 = np.random.randn(100) + 8
x2 = np.random.randn(100) + 3
y2 = np.random.randn(100) + 3

x = np.append(x1,x2)
y = np.append(y1,y2)

plt.scatter(x,y, label="xy distribution")
plt.legend()

Real world example

We will work with dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

drinksdf['total'] = drinksdf['beer'] 
+ drinksdf['spirit'] 
+ drinksdf['wine'] 
+ drinksdf['alcohol']

# drinksdf.corr() tells beer and alcochol 
# are highly corelated
fig = plt.figure()

# Compare beet and alcohol consumption
# Use color to show a third variable.
# Can also use size (s) to show a third variable.
scat = plt.scatter(drinksdf['beer'], 
                   drinksdf['alcohol'], 
                   c=drinksdf['total'], 
                   cmap=plt.cm.rainbow)

# colorbar to explain the color scheme
fig.colorbar(scat, label='Total drinks')

plt.xlabel('Beer')
plt.ylabel('Alcohol')
plt.title('Comparing beer and alcohol consumption')
plt.grid(True)
plt.show()

With the above, we can have a quick assessment that beer and alcohol consumption have strong positive correlation which would suggest a large overlap of people who drink beer and alcohol.

2. We will work with dataset of Mall Customers downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

malldf = pd.read_csv('data-files/mall-customers.csv',
                skiprows=1, 
                names = ['customerid', 'genre', 
                         'age', 'annualincome', 
                         'spendingscore'])

plt.scatter(malldf['annualincome'], 
            malldf['spendingscore'], 
            marker='p', s=40, 
            facecolor='r', edgecolor='b', 
            linewidth=2, alpha=0.4)

plt.xlabel("Annual Income")
plt.ylabel("Spending Score (1-100)")
plt.grid(True)

With the above, we can have a quick assessment that there are five clusters there and thus five segments or types of customers one can make plan for.

Box Plot | `ax.boxplot([data list])`

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It only shows the summary statistics like mean, median and interquartile range.

Easy to identify if data is symmetrical, how tightly it is grouped, and if and how data is skewed

import numpy as np
import matplotlib.pyplot as plt

# some random data
data1 = np.random.normal(0, 2, 100)
data2 = np.random.normal(0, 4, 100)
data3 = np.random.normal(0, 3, 100)
data4 = np.random.normal(0, 5, 100)
data = list([data1, data2, data3, data4])

fig, ax = plt.subplots()
bx = ax.boxplot(data, patch_artist=True)

ax.set_title('Box Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category B', 
             'category D']

colors = ['pink','lightblue','lightgreen','yellow']
for patch, color in zip(bx['boxes'], colors):
    patch.set_facecolor(color)

ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()

Real world example

We will work with dataset of Tips downloaded from he r e.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.boxplot(x="time", y="tip", 
            hue='sex', data=tipsdf, 
            order=["Dinner", "Lunch"],
            palette='coolwarm')

With the above, we can have a quick couple of assessments:
– male gender gives more tip compared to females
– tips during dinner time can vary a lot (more) by males mean tip

Violen Plot | `ax.violinplot([data list])`

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It shows the full distribution of data.

A quick way to compare distributions across multiple variables

import numpy as np
import matplotlib.pyplot as plt

data = [np.random.normal(0, std, size=100) 
        for std in range(2, 6)]

fig, ax = plt.subplots()
bx = ax.violinplot(data)

ax.set_title('Violin Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category B', 
             'category D']

ax.set_xticks([1,2,3,4])
ax.set_xticklabels(xticklabels)

ax.yaxis.grid(True)
plt.show()

Real world example

We will work with dataset of Tips downloaded from he r e.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.violinplot(x="day", y="tip", 
               split="True", data=tipsdf)

With the above, we can have a quick assessment that the tips on Saturday has more relaxed distribution whereas Friday has much narrow distribution in comparison.

2. We will work with dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='DISTRICT'
mask2 = populationdf['TRU']!='Total'
statesdf = populationdf[mask1 & mask2]

maskUP = statesdf['State']==9
maskM = statesdf['State']==27
data = statesdf.loc[maskUP | maskM]

sns.violinplot( x='State', y='P_06', 
inner='quartile', hue='TRU',  
palette={'Rural':'green','Urban':'blue'}, 
scale='count', split=True, 
data=data, size=6)

plt.title('In districts of UP and Maharashtra')
plt.show()

With the above, we can have couple of quick assessments:
– Uttar Pradesh has high volume and distribution of rural child population.
– Maharashtra has almost equal spread of rural and urban child population

Heatmap

It helps in representing a 2-D matrix form of data using variation of color for different values. Variation of color maybe hue or intensity.

Generally used to visualize correlation matrix which in turn helps in features (variables) selection.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create 2D array
array_2d = np.random.rand(4, 6)
sns.heatmap(array_2d, annot=True)

Real world example

We will work with dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

sns.heatmap(drinksdf.corr(),annot=True,cmap='YlGnBu')

With the above, we can have a quick couple of assessments:
– there is a strong correlation between beer and alcohol and thus a strong overlap there.
– wine and spirit are almost not correlated and thus it would be rare to have a place where wine and spirit consumption equally high. One would be preferred over other.

If we notice, upper and lower halves along the diagonal are same. Correlation of A is to B is same as B is to A. Further, A correlation with A will always be 1. Such case, we can make a small tweak to make it more presentable and avoid any correlation confusion.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv(
    'data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# correlation and masks
drinks_cr = drinksdf.corr()
drinks_mask = np.triu(drinks_cr)

# remove the last ones on both axes
drinks_cr = drinks_cr.iloc[1:,:-1]
drinks_mask = drinks_mask[1:, :-1]

sns.heatmap(drinks_cr, 
        mask=drinks_mask,
        annot=True,
        cmap='coolwarm')

It is the same correlation data but just the needed one is represented.

Data Image

It helps in displaying data as an image, i.e. on a 2D regular raster.

Images are internally just arrays. Any 2D numpy array can be displayed as an image.

import pandas as pd
import matplotlib.pyplot as plt

M,N = 25,30
data = np.random.random((M,N)) 
plt.imshow(data)

Real world example

Let’s read an image and then try to display it back to see how it looks

import cv2
import matplotlib.pyplot as plt

img = cv2.imread('data-files/babygroot.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# print(img.shape)
# output => (500, 359, 3)

plt.imshow(img)

It read the image as an array of matrix and then drew it as plot that turned to be same as the image. Since, images are like any other plots, we can plot other objects (like annotations) on top of it.

SubPlots | `fig, (ax1,ax2,ax3, ax4) = plt.subplots(2,2)`

Generally, it is used in comparing multiple variables (in pairs) against each other. With multiple plots stacked against each other in the same figure, it helps in quick assessment for correlation and distribution for a pair.

Parameters are: number of rows, number of columns, the index of the subplot
(Index are counted row wise starting with 1)

The widths of the different subplots may be different with use of GridSpec.

import numpy as np
import matplotlib.pyplot as plt
import math

# data setup
x = np.arange(1, 100, 5)
y1 = x**2
y2 = 2*x+4
y3 = [ math.sqrt(i) for i in x]  
y4 = [ math.log(j) for j in x] 

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

ax1.plot(x, y1)
ax1.set_title('f(x) = quadratic')
ax1.grid()

ax2.plot(x, y2)
ax2.set_title('f(x) = linear')
ax2.grid()

ax3.plot(x, y3)
ax3.set_title('f(x) = sqareroot')
ax3.grid()

ax4.plot(x, y4)
ax4.set_title('f(x) = log')
ax4.grid()

fig.tight_layout()
plt.show()

We can stack up m x n view of the variables and have a quick look on how they are correlated. With the above, we can quickly assess that second graph parameters are linearly correlated.

Data Representation

Plot Anatomy

Below picture will help with plots terminology and representation:

Figure above is the base space where the entire plot happens. Most of the parameters can be customized for better representation. For specific details, look here.

Plot annotations

It helps in highlighting few key findings or indicators on a plot. For advanced annotations, look here.

import numpy as np
import matplotlib.pyplot as plt

# A simple parabolic data
x = np.arange(-4, 4, 0.02)
y = x**2

# Setup plot with data
fig, ax = plt.subplots()
ax.plot(x, y)

# Setup axes
ax.set_xlim(-4,4)
ax.set_ylim(-1,8)

# Visual titles
ax.set_title('Annotation Sample')
ax.set_xlabel('X-values')
ax.set_ylabel('Parabolic values')

# Annotation
# 1. Highlighting specific data on the x,y data
ax.annotate('local minima of \n the parabola',
            xy=(0, 0),
            xycoords='data',
            xytext=(2, 3),
            arrowprops=
                dict(facecolor='red', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='top')

# 2. Highlighting specific data on the x/y axis
bbox_yproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 70% of y-plot range',
            xy=(0, 0.7),
            xycoords='axes fraction',
            xytext=(0.2, 0.7),
            bbox=bbox_yproperties,
            arrowprops=
                dict(facecolor='green', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

bbox_xproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 40% of x-plot range',
            xy=(0.3, 0),
            xycoords='axes fraction',
            xytext=(0.1, 0.4),
            bbox=bbox_xproperties,
            arrowprops=
                dict(facecolor='blue', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

plt.show()

Plot style | `plt.style.use('style')`

It helps in customizing representation of a plot, like color, fonts, line thickness, etc. Default styles get applied if the customization is not defined. Apart from adhoc customization, we can also choose one of the already defined template styles and apply them.

# To know all existing styles with package
for style in plt.style.available:
    print(style)

Solarize_Light2, _classic_test_patch, bmh, classic, dark_background, fast, fivethirtyeight, ggplot, grayscale, seaborn, seaborn-bright, seaborn-colorblind, seaborn-dark, seaborn-dark-palette, seaborn-darkgrid, seaborn-deep, seaborn-muted, seaborn-notebook, seaborn-paper, seaborn-pastel, seaborn-poster, seaborn-talk, seaborn-ticks, seaborn-white, seaborn-whitegrid, tableau-colorblind10
pre-defined styles available for use

More details around customization are here.

# To use a defined style for plot
plt.style.use('seaborn')

# OR
with plt.style.context('Solarize_Light2'):
    plt.plot(np.sin(np.linspace(0, 2 * np.pi)), 'r-o')
plt.show()

Saving plots | `ax.savefig()`

It helps in saving figure with plot as an image file of defined parameters. Parameters details are here. It will save the image file to the current directory by default.

plt.savefig('plot.png', dpi=300, bbox_inches='tight')

Additional Usages of plots

Data Imputation

It helps in filling missing data with some reasonable data as many statistical or machine learning packages do not work with data containing null values.

Data interpolation can be defined to use pre-defined functions such as linear, quadratic or cubic

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(20,1))
df = df.where(df<0.5)

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.plot(df)
ax1.set_title('f(x) = data missing')
ax1.grid()

ax2.plot(df.interpolate())
ax2.set_title('f(x) = data interpolated')
ax2.grid()

fig.tight_layout()
plt.show()

With the above, we see all the missing data replaced with some probably interpolation supported by dataframe based on valid previous and next data.

Animation

At times, it helps in presenting the data as an animation. On a high level, it would need data to be plugged in a loop with delta changes translating into a moving view.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import animation

fig = plt.figure()

def f(x, y):
    return np.sin(x) + np.cos(y)

x = np.linspace(0, 2 * np.pi, 80)
y = np.linspace(0, 2 * np.pi, 70).reshape(-1, 1)

im = plt.imshow(f(x, y), animated=True)


def updatefig(*args):
    global x, y
    x += np.pi / 5.
    y += np.pi / 10.
    im.set_array(f(x, y))
    return im,

ani = animation.FuncAnimation(
    fig, updatefig, interval=100, blit=True)
plt.show()

3-D Plotting

If needed, we can also have an interactive 3-D plot though it might be slow with large datasets.

import numpy as np
import matplotlib.pyplot as plt

def randrange(n, vmin, vmax):
     return (vmax-vmin)*np.random.rand(n) + vmin

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
n = 200
for c, m, zl in [('g', 'o', +1), ('r', '^', -1)]:
    xs = randrange(n, 0, 50)
    ys = randrange(n, 0, 100)
    zs = xs+zl*ys  
    ax.scatter(xs, ys, zs, c=c, marker=m)

ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_zlabel('Z data')
plt.show()

Cheat Sheet

A page representation of the key features for quick lookup or revision:

Download the PDF version of cheatsheet from here.
Overall reference & for more details, look: https://matplotlib.org/

Entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look or play around: https://github.com/sandeep-mewara/data-visualization

Keep learning!

LearnByInsight C#
GitHub Profile Readme Samples
LearnByInsight Machine Learning

Beginner’s Guide to understand Kafka

July 26, 2020September 25, 2020Sandeep Mewara Leave a comment

It’s a digital age. Wherever there is data, we hear about Kafka these days. One of my projects I work, involves entire data system (Java backend) that leverages Kafka to achieve what deals with tonnes of data through various channels and departments. While working on it, I thought of exploring the setup in Windows. Thus, this guide helps learn Kafka and showcases the setup and test of data pipeline in Windows.

Introduction

An OpenSource Project in Java & Scala

Apache Kafka is a distributed streaming platform with three key capabilities:

Messaging system – Publish-Subscribe to stream of records
Availability & Reliability – Store streams of records in a fault tolerant durable way
Scalable & Real time – Process streams of records as they occur

Data system components

Kafka is generally used to stream data into applications, data lakes and real-time stream analytics systems.

Application inputs messages onto the Kafka server. These messages can be any defined information planned to capture. It is passed across in a reliable (due to distributed Kafka architecture) way to another application or service to process or re-process them.

Internally, Kafka uses a data structure to manage its messages. These messages have a retention policy applied at a unit level of this data structure. Retention is configurable – time based or size based. By default, the data sent is stored for 168 hours (7 days).

Kafka Architecture

Typically, there would be multiples of producers, consumers, clusters working with messages across. Horizontal scaling can be easily done by adding more brokers. Diagram below depicts the sample architecture:

Kafka communicates between the clients and servers with TCP protocol. For more details, refer: Kafka Protocol Guide

Kafka ecosystem provides REST proxy that allows an easy integration via HTTP and JSON too.

Primarily it has four key APIs: Producer API, Consumer API, Streams API, Connector API

Key Components & related terminology

Messages/Records – byte arrays of an object. Consists of a key, value & timestamp
Topic – feeds of messages in categories
Producer – processes that publish messages to a Kafka topic
Consumer – processes that subscribe to topics and process the feed of published messages
Broker – It hosts topics. Also referred as Kafka Server or Kafka Node
Cluster – comprises one or more brokers
Zookeeper – keeps the state of the cluster (brokers, topics, consumers)
Connector – connect topics to existing applications or data systems
Stream Processor – consumes an input stream from a topic and produces an output stream to an output topic
ISR (In-Sync Replica) – replication to support failover.
Controller – broker in a cluster responsible for maintaining the leader/follower relationship for all the partitions

Zookeeper

Apache ZooKeeper is an open source that helps build distributed applications. It’s a centralized service for maintaining configuration information. It holds responsibilities like:

Broker state – maintains list of active brokers and which cluster they are part of
Topics configured – maintains list of all topics, number of partitions for each topic, location of all replicas, who is the preferred leader, list of ISR for partitions
Controller election – selects a new controller whenever a node shuts down. Also, makes sure that there is only one controller at any given time
ACL info – maintains Access control lists (ACLs) for all the topics

Kafka Internals

Brokers in a cluster are differentiated based on an ID which typically are unique numbers. Connecting to one broker bootstraps a client to the entire Kafka cluster. They receive messages from producers and allow consumers to fetch messages by topic, partition and offset.

A Topic is spread across a Kafka cluster as a logical group of one or more partitions. A partition is defined as an ordered sequence of messages that are distributed across multiple brokers. The number of partitions per topic are configurable during creation.

Producers write to Topics. Consumers read from Topics.

Kafka uses Log data structure to manage its messages. Log data structure is an ordered set of Segments that are collection of messages. Each segment has files that help locate a message:

Log file – stores message
Index file – stores message offset and its starting position in the log file

Kafka appends records from a producer to the end of a topic log. Consumers can read from any committed offset and are allowed to read from any offset point they choose. The record is considered committed only when all ISRs for partition write to their log.

Among the multiple partitions, there is one leader and remaining are replicas/followers to serve as back up. If a leader fails, an ISR is chosen as a new leader. Leader performs all reads and writes to a particular topic partition. Followers passively replicate the leader. Consumers are allowed to read only from the leader partition.

A leader and follower of a partition can never reside on the same node.

Kafka also supports log compaction for records. With it, Kafka will keep the latest version of a record and delete the older versions. This leads to a granular retention mechanism where the last update for each key is kept.

Offset manager is responsible for storing, fetching and maintaining consumer offsets. Every live broker has one instance of an offset manager. By default, consumer is configured to use an automatic commit policy of periodic interval. Alternatively, consumer can use a commit API for manual offset management.

Kafka uses a particular topic, __consumer_offsets, to save consumer offsets. This offset records the read location of each consumer in each group. This helps a consumer to trace back its last location in case of need. With committing offsets to the broker, consumer no longer depends on ZooKeeper.

Older versions of Kafka (pre 0.9) stored offsets in ZooKeeper only, while newer version of Kafka, by default stores offsets in an internal Kafka topic __consumer_offsets

Kafka allows consumer groups to read data in parallel from a topic. All the consumers in a group has same group ID. At a time, only one consumer from a group can consume messages from a partition to guarantee the order of reading messages from a partition. A consumer can read from more than one partition.

Kafka Setup On Windows

Pre-Requisite

Java SE Runtime Environment
- System has: jre-8u261-windows-x64.exe
Kafka
- Sample app uses: Scala 2.12 – kafka_2.12-2.5.0.tgz
Any unzip tool to extract files out of *.tgz
- I have Mac that extracted it just by double click

Setup files

Install JRE – default settings should be fine
Un-tar Kafka files at C:\Installs (could be any location by choice). All the required script files for Kafka data pipeline setup will be located at: C:\Installs\kafka_2.12-2.5.0\bin\windows
Configuration changes as per Windows need
- Setup for Kafka logs – Create a folder ‘logs’ at location C:\Installs\kafka_2.12-2.5.0
- Set this logs folder location in Kafka config file: C:\Installs\kafka_2.12-2.5.0\config\server.properties as log.dirs=C:\Installs\kafka_2.12-2.5.0\logs
- Setup for Zookeeper data – Create a folder ‘data’ at location C:\Installs\kafka_2.12-2.5.0
- Set this data folder location in Zookeeper config file: C:\Installs\kafka_2.12-2.5.0\config\zookeeper.properties as dataDir=C:\Installs\kafka_2.12-2.5.0\data

Execute

ZooKeeper – Get a quick-and-dirty single-node ZooKeeper instance using the convenience script already packaged along with Kafka files.
- Open a command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- Execute script: zookeeper-server-start.bat C:\Installs\kafka_2.12-2.5.0\config\zookeeper.properties
- ZooKeeper started at localhost:2181. Keep it running.
Kafka Server – Get a single-node Kafka instance.
- Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- ZooKeeper is already configured in the properties file as zookeeper.connect=localhost:2181
- Execute script: kafka-server-start.bat C:\Installs\kafka_2.12-2.5.0\config\server.properties
- Kafka server started at localhost: 9092. Keep it running.
  Now, topics can be created and messages can be stored. We can produce and consume data from any client. We will use command prompt for now.
Topic – Create a topic named ‘testkafka’
- Use replication factor as 1 & partitions as 1 given we have made a single instance node
- Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- Execute script: kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic testkafka
- Execute script to see created topic: kafka-topics.bat --list --bootstrap-server localhost:9092
- Keep the command prompt open just in case.
Producer – setup to send messages to the server
- Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- Execute script: kafka-console-producer.bat --bootstrap-server localhost:9092 --topic testkafka
- It will show a ‘>’ as a prompt to type a message. Type: “Kafka demo – Message from server”
- Keep the command prompt open. We will come back to it to push more messages
Consumer – setup to receive messages from the server
- Open another command prompt and move to location: C:\Installs\kafka_2.12-2.5.0\bin\windows
- Execute script: kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic testkafka --from-beginning
- You would see the Producer sent message in this command prompt window – “Kafka demo – Message from server”
- Go back to Producer command prompt and type any other message to see them appearing real time in Consumer command prompt
Check/Observe – few key changes behind the scene
- Files under topic created – they keep track of the messages pushed for a given topic
- Data inside the log file – All the messages that are pushed by producer are stored here
- Topics present in Kafka – once a consumer starts reading messages from topic, __consumer_offsets is automatically created as a topic

NOTE: In case you want to choose Zookeeper to store topics instead of Kafka server, it would require following script commands:

Topic create: kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
Topics view: kafka-topics.bat --list --zookeeper localhost:2181

With above, we are able to see messages sent by Producer and received by Consumer using a Kafka setup.

When I tried to setup Kafka, I faced few issues on the way. I have documented them for reference to learn. This should also help others if they face something similar: Troubleshoot: Kafka setup on Windows.

One should not encounter any issues with below shared files and the steps/commands shared above.

Download entire modified setup files for Windows from here: https://github.com/sandeep-mewara/kafka-demo-windows

References:
https://kafka.apache.org
https://cwiki.apache.org/confluence/display/KAFKA
https://docs.confluent.io/2.0.0/clients/consumer.html

CodeProject

Analytics

Automation

Artificial Intelligence

Potential

Wrap Up

In Layman terms

Basic terminology

Random Sample

Sample Space

Random Variable

Mean (Expected Value)

Standard deviation & Variance

Types of data

Probability Distribution Flowchart

Uniform Distribution

Normal (Gaussian) Distribution

Exponential Distribution

Probability Distribution Choices

Wrap Up

Data Insights via various plots

Line Chart | ax.plot(x,y)

Real world example:

Histogram | ax.hist(data, n_bins)

Real world example

Bar Chart | ax.bar(x_pos, heights)

Real world example

Pie Chart | ax.pie(sizes, labels=[labels])

Real world example

Scatter plot | ax.scatter(x_points, y_points)

Real world example

Box Plot | ax.boxplot([data list])

Real world example

Violen Plot | ax.violinplot([data list])

Real world example

Heatmap

Real world example

Data Image

Real world example

SubPlots | fig, (ax1,ax2,ax3, ax4) = plt.subplots(2,2)

Data Representation

Plot Anatomy

Plot annotations

Plot style | plt.style.use('style')

Saving plots | ax.savefig()

Additional Usages of plots

Data Imputation

Animation

3-D Plotting

Cheat Sheet

Introduction

Data system components

Kafka Architecture

Key Components & related terminology

Zookeeper

Kafka Internals

Kafka Setup On Windows

Pre-Requisite

Setup files

Execute

Line Chart | `ax.plot(x,y)`

Histogram | `ax.hist(data, n_bins)`

Bar Chart | `ax.bar(x_pos, heights)`

Pie Chart | `ax.pie(sizes, labels=[labels])`

Scatter plot | `ax.scatter(x_points, y_points)`

Box Plot | `ax.boxplot([data list])`

Violen Plot | `ax.violinplot([data list])`

SubPlots | `fig, (ax1,ax2,ax3, ax4) = plt.subplots(2,2)`

Plot style | `plt.style.use('style')`

Saving plots | `ax.savefig()`