Flood forecasting – new tech way!

Posted September 20, 2020 (updated September 25, 2020) by Sandeep Mewara

Recently, Google opened up its Flood Forecasting Initiative, which uses Artificial Intelligence to predict when and where floods will occur in India and Bangladesh. They worked with governments to develop systems that predict floods and thus keep people safe and informed.

Google now covers 200 million people living in more than 250,000 square kilometers in India.

This topic was also touched upon in the Decode with Google event last week.

Initiative Plan

Google started this initiative back in 2018.

Floods are devastating natural disasters worldwide—it’s estimated that every year, 250 million people around the world are affected by floods, causing around $10 billion in damages.

The plan was to use AI and create forecasting models based on:

historical events
river level readings
terrain and elevation of an area

An inside look at the flood forecasting was published here that covers:
1. The Inundation Model
2. Real time water level measurements
3. Elevation Map creation
4. Hydraulic modeling

Recent Improvements

The new approach devised for inundation modeling is called a morphological inundation model. It combines physics-based modeling with machine learning to create more accurate and scalable inundation models in real-world settings.

This new forecasting system covers:
1. Forecasting Water Levels
2. Morphological Inundation Modeling
3. Alert targeting
4. Improved Water Levels Forecasting

Have a read of the following blog for full details.

Current State

As shared here, they partnered with the Indian Central Water Commission to expand forecasting models and services. For research, they have collaborated with Yale to visit flood-affected areas. This helps them understand how to provide information and what information people would need to protect themselves.

We’re providing people with information about flood depth: when and how much flood waters are likely to rise. And in areas where we can produce depth maps throughout the floodplain, we’re sharing information about depth in the user’s village or area.

To increase the reach of its alerts, Google.org has started a collaboration with the International Federation of Red Cross and Red Crescent Societies.

My Thoughts

It’s a great use of technology to help mankind. Floods are life-changing events, and an early prediction that is shared widely would be a big help to everyone.

Awesome initiative, breakthroughs and progress!


Any anagram of a string, a palindrome?

Posted September 19, 2020 (updated September 25, 2020) by Sandeep Mewara

Often in our group we discuss puzzles or problems related to data structures and algorithms. One such day, we discussed:

How will we find if any anagram of a string is a palindrome or not?

Our first thought was to start from the first character and traverse till the end to see if there is a matching pair, keep track of it, move to the next character till the middle, and then stitch it all together to figure out the answer. It solves the problem, but the question was: could it be solved better?

Of course! After putting some stress on the brain, it turned out that a single read gives enough information to tell whether any anagram of the input string can be a palindrome.

Thought converted to Code

static void CheckIfStringAnagramHasPalindrome()
{
    Console.WriteLine($"Please enter a string:");

    // Ignore casing
    var inputString = Console.ReadLine().ToLower();

    // Just need to keep track of unique characters
    var characterSet = new HashSet<char>();

    // Single traversal of input string 
    for(int i=0; i<inputString.Length; i++)
    {
        char currentCharacter = inputString[i];
        if(characterSet.Contains(currentCharacter))
            characterSet.Remove(currentCharacter);
        else
            characterSet.Add(currentCharacter);
    }

    // Character counts in set will help 
    // identify if palindrome possible 
    var leftChars = characterSet.Count;
    if(leftChars == 0 || leftChars == 1)
        Console.WriteLine($"YES - possible.");
    else
        Console.WriteLine($"NO - Not possible.");
}

The approach looked good: with a single traversal and a HashSet, the time complexity is O(n). The set holds at most one entry per distinct character, so the extra space is bounded by the size of the character set, which is O(1) for a fixed alphabet.
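For readers who work in Python, the same single-pass idea can be sketched as below. This is only an illustration of the parity trick used in the C# code above; the function name `can_form_palindrome` is made up for this sketch.

```python
def can_form_palindrome(text: str) -> bool:
    """Return True if some rearrangement of `text` reads the same both ways."""
    odd_chars = set()                 # characters seen an odd number of times so far
    for ch in text.lower():           # ignore casing, as in the C# version
        if ch in odd_chars:
            odd_chars.remove(ch)      # second occurrence completes a pair
        else:
            odd_chars.add(ch)
    # A palindrome allows at most one unpaired character (the middle one).
    return len(odd_chars) <= 1


print(can_form_palindrome("carrace"))   # True  ("racecar")
print(can_form_palindrome("hello"))     # False
```

The set never grows beyond the number of distinct characters, which is why the extra space stays small no matter how long the input is.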

It was fun solving!


Data Visualization – Insights with Matplotlib

Posted September 13, 2020 (updated October 5, 2020) by Sandeep Mewara

Matplotlib is the most popular Python library used for visualization while working on a machine learning problem. It helps in representing and analyzing the data and working through insights.

Generally, it’s difficult to interpret much about data just by looking at it. But a presentation of the data in a visual form helps a great deal in peeking into it. It becomes easy to deduce correlations and identify patterns and parameters of importance.

In the data science world, data visualization plays an important role around the data pre-processing stage. It helps in picking appropriate features and applying an appropriate machine learning algorithm. Later, it helps in representing the data in a meaningful way.

Data Insights via various plots

Where needed, we will use a few datasets for plot examples and discussion. Depending on the need, the following are the commonly used plots:

Line Chart | `ax.plot(x,y)`

It helps in representing a series of data points against a given range of a defined parameter. The real benefit is plotting multiple line charts in a single plot to compare and track changes.

Points next to each other are related, which helps to identify a repeated or defined pattern.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 1, 0.05)
y1 = x**2
y2 = x**3

plt.plot(x, y1,
    linewidth=0.5,
    linestyle='--',
    color='b',
    marker='o',
    markersize=10,
    markerfacecolor='red')

plt.plot(x, y2,
    linewidth=0.5,
    linestyle='dotted',
    color='g',
    marker='^',
    markersize=10,
    markerfacecolor='yellow')

plt.title('x Vs f(x)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend(['f(x)=x^2', 'f(x)=x^3'])
plt.xticks(np.arange(0, 1.1,0.2),
    ['0','0.2','0.4','0.6','0.8','1.0'])

plt.grid(True)
plt.show()

Real world example:

We will work with a dataset created by collating historical data for a few stocks downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

stocksdf1 = pd.read_csv('data-files/stock-INTU.csv') 
stocksdf2 = pd.read_csv('data-files/stock-AAPL.csv') 
stocksdf3 = pd.read_csv('data-files/stock-ADBE.csv') 

stocksdf = pd.DataFrame()
stocksdf['date'] = pd.to_datetime(stocksdf1['Date'])
stocksdf['INTU'] = stocksdf1['Open']
stocksdf['AAPL'] = stocksdf2['Open']
stocksdf['ADBE'] = stocksdf3['Open']

plt.plot(stocksdf['date'], stocksdf['INTU'])
plt.plot(stocksdf['date'], stocksdf['AAPL'])
plt.plot(stocksdf['date'], stocksdf['ADBE'])

plt.legend(labels=['INTU','AAPL','ADBE'])
plt.grid(True)

plt.show()

With the above, we have a couple of quick assessments:
Q: How did a particular stock fare over the last year?
A: Stocks were roughly rising till Feb 2020, then took a dip in April, and have been back up since then.

Q: How did the three stocks behave during the same period?
A: The stock price of ADBE was the most sensitive and AAPL the least sensitive to the change during the same period.

Histogram | `ax.hist(data, n_bins)`

It helps in showing the distribution of a variable: it plots quantitative data with the range of the data grouped into intervals (bins).

We can use a log scale if the data range spans several orders of magnitude.

import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
cov = [[2, 4], [4, 9]]  # a covariance matrix must be symmetric
xn, yn = np.random.multivariate_normal(
                                mean, cov, 100).T

plt.hist(xn,bins=25,label="Distribution on x-axis"); 

plt.xlabel('x')
plt.ylabel('frequency')
plt.grid(True)
plt.legend()
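The note above about using a log scale can be sketched as follows. The data here is a made-up log-normal sample, chosen only because it spans several orders of magnitude. The bin edges are spaced logarithmically so the bars look evenly wide once the x-axis is switched to a log scale.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
# Hypothetical sample spanning several orders of magnitude
values = rng.lognormal(mean=3.0, sigma=2.0, size=1000)

# Logarithmically spaced bin edges covering whole decades around the data
low = np.floor(np.log10(values.min()))
high = np.ceil(np.log10(values.max()))
bins = np.logspace(low, high, 30)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=bins)
ax.set_xscale('log')                 # log-scaled x-axis to match the bin spacing
ax.set_xlabel('value (log scale)')
ax.set_ylabel('frequency')
ax.grid(True)
```

Call `plt.show()` afterwards to display the figure. If the counts themselves vary widely, `ax.set_yscale('log')` can be applied in the same way.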

Real world example

We will work with a dataset of Indian Census data downloaded from here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
df = populationdf[mask1 & mask2]

plt.hist(df['TOT_P'], label='Distribution')

plt.xlabel('Total Population')
plt.ylabel('State Count')
plt.yticks(np.arange(0,20,2))

plt.grid(True)
plt.legend()

With the above, a couple of quick assessments about the population of states in India:
Q: What’s the general population distribution of states in India?
A: More than 50% of states have a population of less than 2 crores (20 million).

Q: How many states have a population of more than 10 crores (100 million)?
A: Only 3 states have that high a population.

Bar Chart | `ax.bar(x_pos, heights)`

It helps in comparing two or more variables by displaying values associated with categorical data.

It is the plot most commonly used in media to share survey data, displaying every data sample.

import numpy as np
import matplotlib.pyplot as plt

data = [[60, 45, 65, 35],
        [35, 25, 55, 40]]

x_pos = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_xticks(x_pos)

ax.bar(x_pos - 0.1, data[0], color='b', width=0.2)
ax.bar(x_pos + 0.1, data[1], color='g', width=0.2)

ax.yaxis.grid(True)

Real world example

We will work with a dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
statesdf = populationdf[mask1 & mask2]  # combine both filters in one step
statesdf = statesdf.sort_values('TOT_P')

plt.figure(figsize=(10,8))
plt.barh(range(len(statesdf)), 
    statesdf['TOT_P'], tick_label=statesdf['Name'])
plt.grid(True)
plt.title('Total Population')
plt.show()

With the above, a couple of quick assessments about the population of states in India:
– Uttar Pradesh has the highest total population and Lakshadweep has the lowest
– Relative population across states, with Uttar Pradesh having almost double the population of the second most populated state

Pie Chart | `ax.pie(sizes, labels=[labels])`

It helps in showing the percentage (or proportional) distribution of categories at a certain point in time. Usually, it works well if it’s limited to a single-digit number of categories.

A circular statistical graphic where the arc length of each slice is proportional to the quantity it represents.

import numpy as np
import matplotlib.pyplot as plt

# Slices will be ordered and plotted counter-clockwise
labels = ['Audi','BMW','LandRover','Tesla','Ferrari']
sizes = [90, 70, 35, 20, 25]

fig, ax = plt.subplots()
ax.pie(sizes,labels=labels, autopct='%1.1f%%')
ax.set_title('Car Sales')
plt.show()
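To make the "proportional to the quantity" idea concrete, the share and central angle of each slice can be worked out by hand. The sketch below reuses the sample values from the code above. The variable names `shares` and `angles` are only for illustration.

```python
labels = ['Audi', 'BMW', 'LandRover', 'Tesla', 'Ferrari']
sizes = [90, 70, 35, 20, 25]              # same values as the sample above

total = sum(sizes)
shares = [size / total for size in sizes]     # fraction of the whole per slice
angles = [share * 360 for share in shares]    # central angle of each slice in degrees

for label, share, angle in zip(labels, shares, angles):
    print(f'{label:10s} {share:6.1%} {angle:6.1f} deg')
```

The percentages printed here are the ones `autopct='%1.1f%%'` writes on the slices.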

Real world example

We will work with a dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

labels = ['Beer', 'Spirit', 'Wine']
sizes = [drinksdf['beer'].sum(), 
         drinksdf['spirit'].sum(), 
         drinksdf['wine'].sum()]

fig, ax = plt.subplots()
explode = [0.05,0.05,0.2]
ax.pie(sizes,explode=explode,
    labels=labels, autopct='%1.1f%%')

ax.set_title('Alcohol Consumption')
plt.show()

With the above, we can quickly assess how overall alcohol consumption is split across beer, spirit and wine. This view works well when there are only a few slices (categories).

Scatter plot | `ax.scatter(x_points, y_points)`

It helps in representing paired numerical data, either to compare how one variable is affected by another or to see how the values of multiple dependent variables are spread for each value of the independent variable.

Sometimes the data points in a scatter plot form distinct groups, which are called clusters.

import numpy as np
import matplotlib.pyplot as plt

# random but focused cluster data
x1 = np.random.randn(100) + 8
y1 = np.random.randn(100) + 8
x2 = np.random.randn(100) + 3
y2 = np.random.randn(100) + 3

x = np.append(x1,x2)
y = np.append(y1,y2)

plt.scatter(x,y, label="xy distribution")
plt.legend()

Real world example

We will work with a dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# Parentheses are needed so the sum continues across lines
drinksdf['total'] = (drinksdf['beer']
                     + drinksdf['spirit']
                     + drinksdf['wine']
                     + drinksdf['alcohol'])

# drinksdf.corr() tells beer and alcohol
# are highly correlated
fig = plt.figure()

# Compare beer and alcohol consumption
# Use color to show a third variable.
# Can also use size (s) to show a third variable.
scat = plt.scatter(drinksdf['beer'], 
                   drinksdf['alcohol'], 
                   c=drinksdf['total'], 
                   cmap=plt.cm.rainbow)

# colorbar to explain the color scheme
fig.colorbar(scat, label='Total drinks')

plt.xlabel('Beer')
plt.ylabel('Alcohol')
plt.title('Comparing beer and alcohol consumption')
plt.grid(True)
plt.show()
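The code comment above mentions that marker size (`s`) can also carry a third variable. A minimal sketch of that variant follows. It uses randomly generated stand-in data rather than the drinks file, so the numbers and their relationships are assumptions made only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
# Hypothetical stand-in data; not the real drinks dataset
beer = rng.uniform(0, 350, size=50)
alcohol = beer / 40 + rng.normal(0, 1, size=50)
total = beer + rng.uniform(0, 300, size=50)

# s is the marker area in points^2, so rescale the third variable
# into a readable range (here 20 to 200)
sizes = 20 + 180 * (total - total.min()) / (total.max() - total.min())

fig, ax = plt.subplots()
scat = ax.scatter(beer, alcohol, s=sizes, alpha=0.5)
ax.set_xlabel('Beer')
ax.set_ylabel('Alcohol')
ax.grid(True)
```

Size is harder to read precisely than color, so it tends to work best for showing rough magnitude rather than exact values.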

With the above, we can quickly assess that beer and alcohol consumption have a strong positive correlation, which suggests a large overlap between people who drink beer and people who drink alcohol.

2. We will work with a dataset of Mall Customers downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

malldf = pd.read_csv('data-files/mall-customers.csv',
                skiprows=1, 
                names = ['customerid', 'genre', 
                         'age', 'annualincome', 
                         'spendingscore'])

plt.scatter(malldf['annualincome'], 
            malldf['spendingscore'], 
            marker='p', s=40, 
            facecolor='r', edgecolor='b', 
            linewidth=2, alpha=0.4)

plt.xlabel("Annual Income")
plt.ylabel("Spending Score (1-100)")
plt.grid(True)

With the above, we can quickly assess that there are five clusters, and thus five segments or types of customers one can plan for.

Box Plot | `ax.boxplot([data list])`

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It only shows summary statistics such as the median, quartiles and interquartile range.

It is easy to identify whether data is symmetrical, how tightly it is grouped, and if and how it is skewed.

import numpy as np
import matplotlib.pyplot as plt

# some random data
data1 = np.random.normal(0, 2, 100)
data2 = np.random.normal(0, 4, 100)
data3 = np.random.normal(0, 3, 100)
data4 = np.random.normal(0, 5, 100)
data = list([data1, data2, data3, data4])

fig, ax = plt.subplots()
bx = ax.boxplot(data, patch_artist=True)

ax.set_title('Box Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

colors = ['pink','lightblue','lightgreen','yellow']
for patch, color in zip(bx['boxes'], colors):
    patch.set_facecolor(color)

ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()
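The summary statistics that a box plot draws can also be computed directly, which helps when reading the plot. The sketch below assumes Matplotlib's default whisker rule (whiskers reach the furthest data points within 1.5 times the interquartile range of the box) and uses made-up random data.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
data = rng.normal(0, 2, 100)            # hypothetical sample

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                           # interquartile range = height of the box

# Whiskers end at the furthest data points that lie within 1.5 * IQR of the box
lower_whisker = data[data >= q1 - 1.5 * iqr].min()
upper_whisker = data[data <= q3 + 1.5 * iqr].max()

# Anything beyond the whiskers is drawn as an individual outlier point
outliers = data[(data < lower_whisker) | (data > upper_whisker)]

print(f'median={median:.2f}, box=({q1:.2f}, {q3:.2f}), IQR={iqr:.2f}')
print(f'whiskers=({lower_whisker:.2f}, {upper_whisker:.2f}), outliers={len(outliers)}')
```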

Real world example

We will work with a dataset of Tips downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.boxplot(x="time", y="tip", 
            hue='sex', data=tipsdf, 
            order=["Dinner", "Lunch"],
            palette='coolwarm')

With the above, we can have a couple of quick assessments:
– males tip more than females
– tips during dinner time vary a lot more, particularly among males

Violin Plot | `ax.violinplot([data list])`

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It shows the full distribution of data.

A quick way to compare distributions across multiple variables

import numpy as np
import matplotlib.pyplot as plt

data = [np.random.normal(0, std, size=100) 
        for std in range(2, 6)]

fig, ax = plt.subplots()
bx = ax.violinplot(data)

ax.set_title('Violin Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

ax.set_xticks([1,2,3,4])
ax.set_xticklabels(xticklabels)

ax.yaxis.grid(True)
plt.show()

Real world example

We will work with a dataset of Tips downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
# split only applies when a hue variable is given, so it is left out here
sns.violinplot(x="day", y="tip", 
               data=tipsdf)

With the above, we can quickly assess that tips on Saturday have a wider distribution, whereas Friday has a much narrower distribution in comparison.

2. We will work with a dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='DISTRICT'
mask2 = populationdf['TRU']!='Total'
statesdf = populationdf[mask1 & mask2]

maskUP = statesdf['State']==9
maskM = statesdf['State']==27
data = statesdf.loc[maskUP | maskM]

sns.violinplot(x='State', y='P_06', 
               inner='quartile', hue='TRU',  
               palette={'Rural':'green','Urban':'blue'}, 
               scale='count', split=True, 
               data=data, size=6)

plt.title('In districts of UP and Maharashtra')
plt.show()

With the above, we can have a couple of quick assessments:
– Uttar Pradesh has a high volume and wide distribution of rural child population.
– Maharashtra has an almost equal spread of rural and urban child population

Heatmap

It helps in representing data in a 2-D matrix form, using variation of color for different values. The variation of color may be in hue or intensity.

Generally used to visualize a correlation matrix, which in turn helps in feature (variable) selection.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create 2D array
array_2d = np.random.rand(4, 6)
sns.heatmap(array_2d, annot=True)

Real world example

We will work with a dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# numeric_only skips the text columns (country, continent); needed on pandas 2.0+
sns.heatmap(drinksdf.corr(numeric_only=True),annot=True,cmap='YlGnBu')

With the above, we can have a couple of quick assessments:
– there is a strong correlation between beer and alcohol and thus a strong overlap there.
– wine and spirit are almost uncorrelated, so it would be rare to find a place where wine and spirit consumption are equally high. One would be preferred over the other.

If we notice, the upper and lower halves along the diagonal are the same: the correlation of A with B is the same as that of B with A. Further, the correlation of A with A will always be 1. In such a case, we can make a small tweak to make the plot more presentable and avoid any confusion.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv(
    'data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# correlation (numeric columns only) and masks
drinks_cr = drinksdf.corr(numeric_only=True)
drinks_mask = np.triu(drinks_cr)

# remove the last ones on both axes
drinks_cr = drinks_cr.iloc[1:,:-1]
drinks_mask = drinks_mask[1:, :-1]

sns.heatmap(drinks_cr, 
        mask=drinks_mask,
        annot=True,
        cmap='coolwarm')

It is the same correlation data, but only the needed part is shown.
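The two properties that make the trimmed heatmap safe, symmetry and a diagonal of ones, can be checked on any correlation matrix. The sketch below uses a small made-up numeric DataFrame with placeholder column names. It also builds the mask from an all-True matrix, an alternative that does not depend on the correlation values themselves being non-zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
# Hypothetical numeric data; column names are placeholders
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])
corr = df.corr()

is_symmetric = np.allclose(corr, corr.T)          # corr(a, b) equals corr(b, a)
unit_diagonal = np.allclose(np.diag(corr), 1.0)   # corr(a, a) equals 1

# Boolean mask that hides the diagonal and everything above it
mask = np.triu(np.ones_like(corr, dtype=bool))

print(is_symmetric, unit_diagonal)
print(mask)
```

Passing this `mask` to `sns.heatmap(corr, mask=mask)` leaves only the cells below the diagonal visible.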

Data Image

It helps in displaying data as an image, i.e. on a 2D regular raster.

Images are internally just arrays. Any 2D numpy array can be displayed as an image.

import numpy as np
import matplotlib.pyplot as plt

M,N = 25,30
data = np.random.random((M,N)) 
plt.imshow(data)

Real world example

Let’s read an image and then try to display it back to see how it looks

import cv2
import matplotlib.pyplot as plt

img = cv2.imread('data-files/babygroot.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# print(img.shape)
# output => (500, 359, 3)

plt.imshow(img)

It read the image as an array and then drew it as a plot, which turned out to be the same as the image. Since images are like any other plot, we can draw other objects (like annotations) on top of them.
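Drawing on top of an image can be sketched as below. A synthetic array stands in for the image file here (an assumption made so the example is self-contained); the same `annotate` call applies to an array loaded from disk.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 2D array standing in for an image
data = np.zeros((50, 60))
data[20:30, 25:35] = 1.0             # bright square to point at

fig, ax = plt.subplots()
ax.imshow(data, cmap='gray')

# For an image, x is the column index and y is the row index
ax.annotate('bright region',
            xy=(30, 25),             # (column, row) near the centre of the square
            xytext=(5, 5),
            color='yellow',
            arrowprops=dict(facecolor='yellow', shrink=0.05))
```

Note that `imshow` puts row 0 at the top by default, so the y-axis runs downwards compared with a regular plot.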

SubPlots | `fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)`

Generally, it is used for comparing multiple variables (in pairs) against each other. With multiple plots placed next to each other in the same figure, it helps in quickly assessing correlation and distribution for a pair.

`plt.subplots` takes the number of rows and the number of columns and returns the figure along with a grid of axes. The older `plt.subplot` call takes the number of rows, the number of columns and the index of the subplot
(indexes are counted row-wise, starting with 1).

The widths of the different subplots can differ with the use of GridSpec.

import numpy as np
import matplotlib.pyplot as plt
import math

# data setup
x = np.arange(1, 100, 5)
y1 = x**2
y2 = 2*x+4
y3 = [ math.sqrt(i) for i in x]  
y4 = [ math.log(j) for j in x] 

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

ax1.plot(x, y1)
ax1.set_title('f(x) = quadratic')
ax1.grid()

ax2.plot(x, y2)
ax2.set_title('f(x) = linear')
ax2.grid()

ax3.plot(x, y3)
ax3.set_title('f(x) = square root')
ax3.grid()

ax4.plot(x, y4)
ax4.set_title('f(x) = log')
ax4.grid()

fig.tight_layout()
plt.show()

We can stack up an m x n view of the variables and have a quick look at how they are correlated. With the above, we can quickly assess that the parameters in the second graph are linearly correlated.
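The GridSpec note above can be sketched as follows: a one-row layout where the first panel is three times as wide as the second. The 3:1 ratio and the panel contents are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

x = np.arange(1, 100, 5)

fig = plt.figure(figsize=(8, 3))
# One row, two columns; the first column gets three times the width of the second
gs = GridSpec(1, 2, figure=fig, width_ratios=[3, 1])

ax_wide = fig.add_subplot(gs[0, 0])
ax_wide.plot(x, x**2)
ax_wide.set_title('wide panel')

ax_narrow = fig.add_subplot(gs[0, 1])
ax_narrow.plot(x, np.log(x))
ax_narrow.set_title('narrow panel')
```

`height_ratios` works the same way for rows, and a slice such as `gs[0, :]` lets one panel span several columns.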

Data Representation

Plot Anatomy

The picture below helps with plot terminology and representation:

In the picture above, the Figure is the base space where the entire plot happens. Most of the parameters can be customized for better representation. For specific details, look here.

Plot annotations

It helps in highlighting a few key findings or indicators on a plot. For advanced annotations, look here.

import numpy as np
import matplotlib.pyplot as plt

# A simple parabolic data
x = np.arange(-4, 4, 0.02)
y = x**2

# Setup plot with data
fig, ax = plt.subplots()
ax.plot(x, y)

# Setup axes
ax.set_xlim(-4,4)
ax.set_ylim(-1,8)

# Visual titles
ax.set_title('Annotation Sample')
ax.set_xlabel('X-values')
ax.set_ylabel('Parabolic values')

# Annotation
# 1. Highlighting specific data on the x,y data
ax.annotate('local minima of \n the parabola',
            xy=(0, 0),
            xycoords='data',
            xytext=(2, 3),
            arrowprops=
                dict(facecolor='red', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='top')

# 2. Highlighting specific data on the x/y axis
bbox_yproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 70% of y-plot range',
            xy=(0, 0.7),
            xycoords='axes fraction',
            xytext=(0.2, 0.7),
            bbox=bbox_yproperties,
            arrowprops=
                dict(facecolor='green', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

bbox_xproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 40% of x-plot range',
            xy=(0.3, 0),
            xycoords='axes fraction',
            xytext=(0.1, 0.4),
            bbox=bbox_xproperties,
            arrowprops=
                dict(facecolor='blue', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

plt.show()

Plot style | `plt.style.use('style')`

It helps in customizing the representation of a plot, like color, fonts, line thickness, etc. Default styles get applied if the customization is not defined. Apart from ad hoc customization, we can also choose one of the already defined template styles and apply it.

# To know all existing styles with package
for style in plt.style.available:
    print(style)

Solarize_Light2, _classic_test_patch, bmh, classic, dark_background, fast, fivethirtyeight, ggplot, grayscale, seaborn, seaborn-bright, seaborn-colorblind, seaborn-dark, seaborn-dark-palette, seaborn-darkgrid, seaborn-deep, seaborn-muted, seaborn-notebook, seaborn-paper, seaborn-pastel, seaborn-poster, seaborn-talk, seaborn-ticks, seaborn-white, seaborn-whitegrid, tableau-colorblind10
pre-defined styles available for use

More details around customization are here.

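import numpy as np
import matplotlib.pyplot as plt

# To use a defined style for plot
# (from Matplotlib 3.6 the seaborn styles are named 'seaborn-v0_8', 'seaborn-v0_8-dark', ...)
plt.style.use('seaborn')

# OR
with plt.style.context('Solarize_Light2'):
    plt.plot(np.sin(np.linspace(0, 2 * np.pi)), 'r-o')
plt.show()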

Saving plots | `plt.savefig()`

It helps in saving the figure with the plot as an image file with the defined parameters. Parameter details are here. A relative file name is saved in the current working directory.

plt.savefig('plot.png', dpi=300, bbox_inches='tight')
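A slightly fuller sketch of saving is shown below, using the figure object's own `savefig` method. The sine curve and file names are placeholders, and a temporary folder is used only so the example leaves nothing behind; replace it with your own path.

```python
import os
import tempfile
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# Temporary folder for this sketch; savefig does not create missing folders itself
out_dir = tempfile.mkdtemp()

png_path = os.path.join(out_dir, 'sine.png')
pdf_path = os.path.join(out_dir, 'sine.pdf')
fig.savefig(png_path, dpi=300, bbox_inches='tight')   # raster image at 300 dots per inch
fig.savefig(pdf_path, bbox_inches='tight')            # vector output, format taken from the extension
plt.close(fig)                                        # release the figure once it is written
```

When using `plt.savefig`, call it before `plt.show()`; depending on the backend, the figure may already be closed once the window is dismissed, and the saved file then comes out blank.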

Additional Usages of plots

Data Imputation

It helps in filling missing data with some reasonable data, as many statistical or machine learning packages do not work with data containing null values.

Interpolation can be configured to use predefined methods such as linear, quadratic or cubic.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(20,1))
df = df.where(df<0.5)

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.plot(df)
ax1.set_title('f(x) = data missing')
ax1.grid()

ax2.plot(df.interpolate())
ax2.set_title('f(x) = data interpolated')
ax2.grid()

fig.tight_layout()
plt.show()

With the above, we see the missing data replaced by interpolated values that the dataframe computes (linearly by default) from the valid previous and next data points.
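To see what the default linear method does to the actual numbers, here is a small sketch on a hand-made series (the values are arbitrary and chosen so the result is easy to verify by eye).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 9.0])

# Default method: straight line between the valid neighbours, treating points as equally spaced
linear = s.interpolate(method='linear')
print(linear.tolist())        # [1.0, 2.0, 3.0, 4.0, 6.5, 9.0]

# 'quadratic' and 'cubic' are accepted too, but they rely on SciPy being installed:
# smooth = s.interpolate(method='quadratic')
```

The gap between 1 and 4 is filled with 2 and 3, and the single gap between 4 and 9 with their midpoint 6.5.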

Animation

At times, it helps to present the data as an animation. At a high level, it needs the data to be plugged into a loop, with delta changes translating into a moving view.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import animation

fig = plt.figure()

def f(x, y):
    return np.sin(x) + np.cos(y)

x = np.linspace(0, 2 * np.pi, 80)
y = np.linspace(0, 2 * np.pi, 70).reshape(-1, 1)

im = plt.imshow(f(x, y), animated=True)


def updatefig(*args):
    global x, y
    x += np.pi / 5.
    y += np.pi / 10.
    im.set_array(f(x, y))
    return im,

ani = animation.FuncAnimation(
    fig, updatefig, interval=100, blit=True)
plt.show()

3-D Plotting

If needed, we can also have an interactive 3-D plot though it might be slow with large datasets.

import numpy as np
import matplotlib.pyplot as plt

def randrange(n, vmin, vmax):
     return (vmax-vmin)*np.random.rand(n) + vmin

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
n = 200
for c, m, zl in [('g', 'o', +1), ('r', '^', -1)]:
    xs = randrange(n, 0, 50)
    ys = randrange(n, 0, 100)
    zs = xs+zl*ys  
    ax.scatter(xs, ys, zs, c=c, marker=m)

ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_zlabel('Z data')
plt.show()

Cheat Sheet

A page representation of the key features for quick lookup or revision:

Download the PDF version of cheatsheet from here.
Overall reference & for more details, look: https://matplotlib.org/

Entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look or play around: https://github.com/sandeep-mewara/data-visualization

Keep learning!


Microsoft .NET Conference 2020

Posted September 13, 2020 (updated September 25, 2020) by Sandeep Mewara

.NET Conf 2020 is a free virtual developer event organized by the .NET Community and Microsoft. I came across a discussion asking whether I would be interested in speaking at the event, as the call for content is now open.

The conference showcases the .NET platform for developers focusing on desktop, mobile, web, IoT, games, cloud and open source projects. Read details about the event here: .NET Conf 2020

If you are interested, you don’t need to register for it. Just keep track of the dates and attend the event sessions live here.

This year .NET 5.0 will launch at .NET Conf 2020! Come celebrate and learn about the new release. We’re also celebrating our 10th anniversary and we’re working on a few more surprises.
Event highlight shared by .NET foundation

There will be live sessions by speakers from the community and .NET team members. One can ask questions live on Twitter, join the fun on Twitch and attend the virtual attendee parties. It would help to know what’s happening and upcoming in the .NET world.

Happy connecting!


Read from console in VS Code

Posted September 12, 2020 (updated September 25, 2020) by Sandeep Mewara

I moved to an Apple Mac late last year because of the different set of technologies I now work with. As shared in one of my previous posts, I use Visual Studio Code for programming in Python while exploring Machine Learning. For anything in .NET, though, I switch to a Windows VM and use Visual Studio.

For quick console apps, it feels painful to switch to a VM and work. Thus, I looked for and installed the C# extension in VS Code to try it out. Details are here.

While running a console app, I got stuck trying to read a value from the console. In debug mode, the IDE would stop on Console.ReadLine(), but whatever I typed in the console would not go through.

I looked around and found that there are a few settings for the console in VS Code. The console setting controls what console (terminal) window the target app is launched into.

"internalConsole" (default) : This does NOT work for applications that want to read from the console (ex: Console.ReadLine).

How to Solve it?

The suggested way to take input is to set the console setting to integratedTerminal. This is a configuration setting in the launch.json file under the .vscode folder.

"integratedTerminal" : the target process will run inside VS Code’s integrated terminal (Terminal tab in the tab group beneath the editor). Alternatively add "internalConsoleOptions": "neverOpen" to make it so that the default foreground tab is the terminal tab.

Change the default setting by setting "console" to "integratedTerminal" in the launch configuration.

With the above change, input and output happen through the integrated terminal.

So far, it looks good and seems I will stick to Visual Studio Code on Mac for quick console applications.

Reference here.

Keep learning!


New C# features I really liked!

Posted September 5, 2020 (updated September 25, 2020) by Sandeep Mewara

There have been many new features added to C# over the last few years. A recent survey in the CodeProject community led me to the thought of sharing what I find really helpful. They span from C# 6.0 to C# 8.0. The few below made writing code easier and more fun, and have improved productivity.

Null Conditional Operator (?. & ?[])

They make null checks much easier and more fluid. Add a ? just before the member access . or indexer access [] of an operand that can be null. It short-circuits and returns null for the assignment.

// safeguard against NullReferenceException

Earlier

if(address != null)
{
   var street = address.StreetName;
}

Now

var street = address?.StreetName;

// safeguard against NullReferenceException on indexer access

Earlier

if(row != null && row[0] != null)
{
   int data = row[0].SomeCount;
}

Now

int? data = row?[0]?.SomeCount;

Null Coalescing Operator (?? & ??=)

Null-coalescing operator ?? helps to assign a default value if the property is null. It is often used along with the null conditional operator.

Earlier

var street = "NA";
if(address != null && address.StreetName != null)
{
   street = address.StreetName;
}

Now

var street = address?.StreetName ?? "NA";

Null-coalescing assignment operator ??= helps to assign the value of its right-hand operand to its left-hand operand only if the left-hand operand evaluates to null.

The left-hand operand of the ??= operator must be a variable, a property, or an indexer element

Earlier

int? i = null;

if(i == null)
   i = 0;

Console.WriteLine(i);  // output: 0

Now

int? i = null;

i ??= 0;
Console.WriteLine(i);  // output: 0

String Interpolation ($)

It enables embedding expressions in a string. The special character $ identifies a string literal as an interpolated string. Interpolation expressions are replaced by the string representations of the expression results in the result string.

Earlier

string address = string.Format("{0}, {1}", HouseNo, StreetName);

log.Write("Address: " + HouseNo.ToLower() + ", " + StreetName);

Now

string address = $"{HouseNo}, {StreetName}";

log.Write($"Address: {HouseNo.ToLower()}, {StreetName}");

Auto-Property Initializer

It helps declare the initial value for a property as part of the property declaration itself.

Earlier

Language _currentLanguage = Language.English;
public Language CurrentLanguage
{
   get { return _currentLanguage; }
   set { _currentLanguage = value; }
}   

// OR

// Improvement in C# 3.0
public Language CurrentLanguage { get; set; }

public MyClass()
{
    CurrentLanguage = Language.English;
}

Now

public Language CurrentLanguage { get; set; } = Language.English;

using static

It helps to import the static members and nested types (such as an enum) of a single class, so they can be used without qualifying them with the class name.

Earlier

public class Enums
{
    public enum Language
    {
        English,
        Hindi,
        Spanish
    }
}

// Another file
public class MyClass
{
    public Enums.Language CurrentLanguage { get; set; }
}

Now

public class Enums
{
    public enum Language
    {
        English,
        Hindi,
        Spanish
    }
}

// Another file
using static mynamespace.Enums;

public class MyClass
{
    public Language CurrentLanguage { get; set; }
}

Tuples

They are lightweight data structures that contain multiple fields to represent the data members.

// Initialize Way 1
(string First, string Second) ranks = ("1", "2");
Console.WriteLine($"{ranks.First}, {ranks.Second}");

// Initialize Way 2
var ranks = (First: "1", Second: "2");
Console.WriteLine($"{ranks.First}, {ranks.Second}");

Tuples support == and != (since C# 7.3).

Expression bodied get/set accessors

With it, members can be implemented as expressions.

Earlier

public string Title
{
    get { return _title; }
    set 
    { 
       this._title = value ?? "Default - Hello";
    }
}

Now

public string Title
{
    get => _title;
    set => this._title = value ?? "Default - Hello";
}

Access modifier: private protected

A new compound access modifier, private protected, indicates that a member is accessible by the containing class or by derived classes that are declared in the same assembly. It is more restrictive than protected internal, which allows access from derived classes or from any code in the same assembly.

// Assembly1.cs
public class BaseClass
{
    private protected int myValue = 0;
}

public class DerivedClass1 : BaseClass
{
    void Access()
    {
        var baseObject = new BaseClass();

        // Error CS1540, because myValue can only be
        // accessed by classes derived from BaseClass
        // baseObject.myValue = 5;

        // OK, accessed through the current 
        // derived class instance
        myValue = 5;
    }
}

//
// Assembly2.cs
// Compile with: /reference:Assembly1.dll
class DerivedClass2 : BaseClass
{
    void Access()
    {
        // Error CS0122, because myValue can only
        // be accessed by types in Assembly1
        // myValue = 10;
    }
}

await

It suspends evaluation of the enclosing async method until the asynchronous operation represented by its operand completes. On completion, it returns the result of the operation, if any.

It does not block the thread that evaluates the async method. Instead, it suspends the enclosing async method and returns control to the caller of the method.

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class AwaitOperatorDemo
{
    // async Main method allowed since C# 7.1 
    public static async Task Main()
    {
        Task<int> downloading = DownloadProfileAsync();
        Console.WriteLine($"{nameof(Main)}: Started download.");

        int bytesLoaded = await downloading;
        Console.WriteLine($"{nameof(Main)}: Downloaded {bytesLoaded} bytes.");
    }

    private static async Task<int> DownloadProfileAsync()
    {
        Console.WriteLine($"{nameof(DownloadProfileAsync)}: Starting download.");

        var client = new HttpClient();
        // time taking call - await and move on
        byte[] content = await client.GetByteArrayAsync("https://learnbyinsight.com/about/");

        Console.WriteLine($"{nameof(DownloadProfileAsync)}: Finished download.");
        return content.Length;
    }
}

// Output:
// DownloadProfileAsync: Starting download.
// Main: Started download.
// DownloadProfileAsync: Finished download.
// Main: Downloaded 27700 bytes.

Default Interface methods

Now, we can add members to interfaces and provide a default implementation for those members. It helps in supporting backward compatibility. There would be no breaking change to existing interface consumers. Existing implementations inherit the default implementation.

public interface ICustomer
{
    DateTime DateJoined { get; }
    string Name { get; }

    // Later added to interface:
    public string Contact()
    {
       return "contact not provided";
    }
}

Wrap up

There are many more additions to C#. I believe the ones above are a few that one should know and use in day-to-day coding right away (if not already doing so). Most of them help us be more concise and avoid convoluted code.

Reference: https://docs.microsoft.com/en-us/dotnet/csharp/whats-new

Keep learning!


Harness your voice using Transcribe in Word

September 5, 2020 | September 25, 2020 | Sandeep Mewara

Transcribe in Word is a new enhancement in Microsoft Office 365’s Word for the web. It leverages the Azure Cognitive Services AI platform.

Transcribe converts speech (recorded directly in Word or from an uploaded audio file) to a text transcript with each speaker individually separated.

We can record our conversations directly in Word for the web and it transcribes them automatically with each speaker identified separately. Transcript will appear alongside the Word document, along with the recording.

For now, English (EN-US) is the only language supported for transcribing audio.

Once the recording is finished, we can:

easily follow the flow of the transcript
revisit parts of the recording by playing back the time-stamped audio
edit the transcript for any corrections or if we see something amiss
save the full transcript as a Word document

How to use it?

Transcribe in Word is already available in Word for the web for all Microsoft 365 subscribers. There is no limit on recording and transcribing directly within Word for the web.

There is a five-hour limit per month for uploaded recordings, and each uploaded recording is limited to 200 MB.

Real life applications …

It adds value in several ways:

makes it easier to concentrate in meetings & discussions when multitasking (such as taking notes during the discussion) gets in the way
share important quotes with others quickly
summarize the meeting based on key topics identified
- Minutes of meetings
- Key notes
opens up potential for the NLP world (AI) in future
- assess patterns of particular speakers: how they speak, which specific words they use, and provide feedback
- assess questions and their responses, and act on them specifically
- improve auto corrections

Wrap Up

Seems like a nice move by Microsoft, covering more than one aspect where it can help. It is a feature worth trying out to see how it works and helps.

Reference: https://www.microsoft.com/en-us/microsoft-365/blog/2020/08/25/microsoft-365-transcription-voice-commands-word/

Keep exploring!


pandas – get started with examples

August 30, 2020 | September 25, 2020 | Sandeep Mewara

This post is a way to get started with pandas and try a few concrete examples. pandas is a Python-based library that helps in reading, transforming, cleaning and analyzing data. It is built on the NumPy package.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
https://pandas.pydata.org

The key data structure in pandas is called DataFrame – it helps to work with tabular data, represented as rows of observations and columns of features.

Download or fork the entire Jupyter notebook from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

pandas basics includes:

Series
Dataframes
- Create
  - from list of tuples
  - from a dictionary
  - from a CSV
  - from built-in dataset (eg: from sklearn.datasets)
- Data retrieval
- Modifying data
- Group by operation
- Custom Functions – apply method
- Pre-Processing
  - drop, mean, mode
  - ordinal feature
  - nominal feature
- Reshaping
  - CrossTab
  - Merge
  - Melt
  - Pivot
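As a rough illustration of the create and reshaping items above, here is a minimal sketch. The student names, subjects and scores are made-up sample data (they are not taken from the notebook), and the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only
rows = [("Asha", "Math", 82), ("Asha", "Physics", 74),
        ("Ravi", "Math", 65), ("Ravi", "Physics", 91)]

# Create a DataFrame from a list of tuples
marks = pd.DataFrame(rows, columns=["student", "subject", "score"])

# Create a DataFrame from a dictionary (keys become column names)
cities = pd.DataFrame({"student": ["Asha", "Ravi"],
                       "city": ["Pune", "Delhi"]})

# Merge: SQL-style join on a shared column
merged = marks.merge(cities, on="student", how="left")

# Pivot: long to wide, one row per student and one column per subject
wide = marks.pivot(index="student", columns="subject", values="score")

# Melt: wide back to long
long_again = wide.reset_index().melt(id_vars="student",
                                     var_name="subject",
                                     value_name="score")

# CrossTab: frequency table of two columns
counts = pd.crosstab(marks["student"], marks["subject"])

print(wide)
```

Pivot and melt are roughly inverse operations here: the melted frame has the same four student/subject/score rows that the original list of tuples started with.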

Key learnings:

# .info(), .head() and .sample() are handy methods to use first with a dataframe to get high-level details
# an index may not be unique – a lookup can return multiple values
# boolean indexing (masking) can help select a certain set of rows
# .isin() is useful when building a boolean index
# .where() is useful to retain the shape of the original table
# column names & indexes can be set if needed
# to modify the table right away, use inplace=True
# aggregate operations can be applied on a groupby object
# dropna(), mean() or mode() are handy ways for pre-processing missing data
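A few of these learnings can be sketched in a handful of lines. The trip data below is invented for illustration (it is not the Uber dataset from the notebook), and filling the missing fare with the column mean is just one possible pre-processing choice.

```python
import pandas as pd

# Hypothetical trip data; one fare is missing on purpose
trips = pd.DataFrame({
    "driver": ["A", "B", "A", "C", "B"],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
    "fare": [120.0, 80.0, None, 200.0, 100.0],
})

# Boolean indexing (masking): keep only rows where the condition is True
expensive = trips[trips["fare"] > 90]

# .isin() builds a boolean mask from a list of allowed values
in_metros = trips[trips["city"].isin(["Delhi", "Mumbai"])]

# .where() keeps the original shape; non-matching cells become NaN
masked = trips["fare"].where(trips["fare"] > 90)

# Pre-processing missing data: either drop the incomplete rows ...
complete_only = trips.dropna()

# ... or fill the gap with the column mean (NaN is skipped when averaging)
filled = trips.assign(fare=trips["fare"].fillna(trips["fare"].mean()))

# Aggregate operations can be applied on a groupby object
total_by_driver = filled.groupby("driver")["fare"].sum()

print(total_by_driver)
```

Note the difference between the mask and .where(): the masked selection shrinks to the matching rows, while .where() returns a series of the original length with NaN in the non-matching positions.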

Examples notebook includes:

Uber taxi drivers
Apple stock price
Day or Night
Students marks
Balance Calculator

Key learnings:

# .describe() is a handy method to get the statistical summary of numerical columns
# one-hot encoding is really helpful for nominal features (that cannot be ordered)
# converting the columns into the right datatype helps
# converting data into meaningful numbers helps with analysis
# groupby is a powerful tool with dataframes for analysis
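To make these points concrete, here is a small sketch on invented student records (the names, streams and marks are assumptions for illustration, not the data used in the notebook). It converts a string column to integers, summarizes it, one-hot encodes a nominal column with pd.get_dummies, and groups by that column.

```python
import pandas as pd

# Hypothetical student records; marks arrive as strings, as they often do from a CSV
students = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "John"],
    "stream": ["Science", "Arts", "Science", "Commerce"],
    "marks": ["82", "65", "91", "74"],
})

# Convert the column to the right datatype before doing any maths on it
students["marks"] = students["marks"].astype(int)

# Statistical summary (count, mean, std, min, quartiles, max) of a numerical column
summary = students["marks"].describe()

# One-hot encoding for a nominal feature: one indicator column per category
encoded = pd.get_dummies(students, columns=["stream"])

# groupby: average marks per stream
mean_by_stream = students.groupby("stream")["marks"].mean()

print(summary)
print(encoded.columns.tolist())
print(mean_by_stream)
```

One-hot encoding suits the stream column because the categories have no natural order; an ordinal feature (for example small/medium/large) would usually be mapped to ordered numbers instead.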

Cheat sheet

Download cheat sheet pdf from here
For more details about pandas, look at the documentation reference.

Keep learning!

“vshost32.exe has stopped working”

August 29, 2020 | September 25, 2020 | Sandeep Mewara

This is another one of the common errors developers get and ask about: vshost32.exe has stopped working.

Problem Statement

When I run my project (or a particular use case), it displays an error: vshost32.exe has stopped working

Assessment

vshost was introduced in Visual Studio 2005 (only for use in VS). The vshost files are those that contain vshost in the file name and are placed under the output (default bin) folder of the application. It is the “hosting process” created when we build a project in Visual Studio.

It has following core responsibilities:

Support for improved F5 performance: To run a managed application in debug mode using the F5 command, Visual Studio needs an AppDomain to provide a place for the runtime environment within which the application can run. It takes quite a bit of time to create an AppDomain and initialize the debugger along with it. The hosting process speeds this up by doing all of this work in the background before we hit F5, and keeps the state around between multiple runs of the application.
Partial trust debugging: To simulate a partial trust environment within Visual Studio under the debugger requires special initialization of the AppDomain. This is handled by the hosting process.
Design time expression evaluation: To test code in the application from the immediate window, without actually having to run the application. The hosting process is used to execute code under design time expression evaluation.

Possible Resolutions

Generally, the first step is to figure out whether the issue is caused specifically by the Visual Studio hosting process or whether other issues are at play interacting with vshost.

Scenario 1:

It’s a 64 bit OS, the app is configured to build as AnyCPU, yet we get an error

Try:
32 bit/64 bit issues usually play a role where OS features and locations differ between the two. There is a setting in the Build configuration that drives the debugger behavior when it is set up for AnyCPU. You need to turn off (un-tick the checkbox for) the Prefer 32 bit flag to run in 64 bit mode.

Even with the above change, we can still face issues that fall into the 32/64 bit region. This is where vshost still plays a role. Irrespective of the above flag, vshost continues to work in 32 bit mode (platform config AnyCPU). Calls to certain APIs can be affected when the hosting process is enabled. In these cases, it is necessary to disable the hosting process to get the correct results. Details about how to turn it off in the Debug tab: How to: Disable the Hosting Process

With the above changes, the AnyCPU configuration would behave the same as an app built with platform target x64.

Scenario 2:

Application is configured to build as x86 (or AnyCPU)

Try:
If the workflow involves a third-party component, use the 32 bit runtime of that component for a 32 bit application, irrespective of the OS being 32 bit or 64 bit.

Scenario 3:

Application is throwing an error for a specific code work flow that involves unmanaged assembly

Try:
If the workflow includes an interop call to an external assembly (unmanaged code that is executed outside the control of the CLR), there might be incorrect usage of the function call. I have seen examples where a wrong return type caused a vshost error. In those cases, the return type of the external function could not be declared as string; it had to be IntPtr.

[DllImport("Some.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern IntPtr SomeMethod();

Scenario 4:

Application is throwing an error for a specific code work flow that is in realms of managed code (by CLR)

Try:
It could be that the process is taking time while executing that particular workflow. If the process is busy for a long time, it can throw an error. One solution would be to run the entire long operation on a BackgroundWorker thread and free up the UI thread.

Conclusion

We can turn off vshost as long as we are okay without it. It always helps to have the same debugging environment (32/64 bit) as the one the app is expected to run in. We should be cognizant of the operations done with third-party or unmanaged assemblies and have the right set of code/files interacting with the application.

Happy troubleshooting!


Decode with Google

August 29, 2020 | August 10, 2021 | Sandeep Mewara

For details on Decode with Google 2021, please go here: Decode with Google 2021

A couple of days back, I received an email from Google about an upcoming virtual event on 11th & 12th September 2020. All the details of the event are here: Decode with Google

It looks like the theme of the event is: Innovations contributing to Technology in India

At Google, we’re always excited by the potential of technology to solve large-scale, real world problems. Decode with Google is an opportunity for techmakers, entrepreneurs, and academia to get a sneak peek at some of the toughest and most interesting challenges that Googlers in India are solving for.
Event highlight shared by Google

If the topics resonate, feel free to register and join the virtual edition of Decode with Google. It seems Google tech leaders will share highlights of what they are working on, the state of AI and the opportunities it presents.

Happy connecting!

