Find missing number from 1 to N?

Last week, my team discussed the problem of finding missing number(s) in a sequence. We came up with different approaches, so I thought I would share them.

find-missing-number

Problem statement was something like:

– An array of size (n) has distinct numbers from 1 to (n+1). Find the one missing number.

– An array of size (n) has distinct numbers from 1 to (n+2). Find the two missing numbers.

First thought …

Keep track of the numbers found while traversing. At the end, use that record to find the missing number. It is a kind of brute force approach.

We can maintain a hash or a boolean array sized to the full range (n+1 or n+2 entries) and keep updating the hash or the array index location based on the number found while traversing. Then use it to find the missing number. It would cover both the one and the two missing numbers cases.

This would have two traversals of n (one for filling in the structure and another to find the missing one). Thus overall, a time complexity of O(n). This would need extra space to keep track of all numbers found and thus a space complexity of O(n).
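The tracking approach described above can be sketched as follows. The function name `find_missing` and its `upper` parameter (the top of the number range) are made up for this illustration.

```python
def find_missing(nos, upper):
    """Return the numbers in 1..upper that do not appear in nos."""
    # One flag per possible number; index 0 is unused
    seen = [False] * (upper + 1)

    # First traversal: mark every number that is present
    for value in nos:
        seen[value] = True

    # Second traversal: collect the numbers that were never marked
    return [i for i in range(1, upper + 1) if not seen[i]]


print(find_missing([4, 2, 1, 6, 5, 7], 7))  # one missing   -> [3]
print(find_missing([4, 2, 1, 6, 7, 5], 8))  # two missing   -> [3, 8]
```

The same function covers both cases because it only depends on the size of the range, not on how many numbers are missing.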

Q: Now, can we avoid extra space or two times traversal?

Second thought …

We know how to calculate the sum of n natural numbers, i.e.: n*(n+1)/2. With it, we can traverse the given array and keep a sum of all numbers. Difference of the sum from formula to sum found would give us the missing number. Nice!

# Keep track of sum
def sumOfGivenNumbers(nos, n):
    sum = 0
    # calculate sum
    for i in range(0, n):
        sum += nos[i]
    return sum

# Input
numbers = [4, 2, 1, 6, 5, 7] 

# number range 
n = len(numbers) + 1
# integer division keeps the result an int
expectedSum = n*(n+1)//2
numbersSum = sumOfGivenNumbers(numbers,len(numbers))

print('Missing number:', expectedSum - numbersSum)

# Output
# Missing number: 3

This would help us solve for one missing number in a single traversal, thus a time complexity of O(n). No extra space was used and thus a space complexity of O(1).

Q: Can we extend this to two missing numbers now?

Yes, we can extend it. Along with the sum, we can also use the product of the first n natural numbers. With it, we will have two equations and two numbers to find:

Missing1 = x1
Missing2 = x2
Sum of provided numbers = N1
Sum of n Natural numbers = N
Product of provided numbers = P1
Product of n Natural numbers = P

x1 + x2 + N1 = N
x1 * x2 * P1 = P

We can solve these to find the two missing numbers. It does mean solving a quadratic equation, since x1 and x2 are the two roots of x^2 - (N - N1)*x + (P / P1) = 0. It maintains the time complexity as O(n) and space complexity as O(1). Nice!
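A minimal sketch of solving the two equations, assuming the input holds distinct numbers from 1 to n with exactly two missing. The function name is hypothetical; `math.prod` and `math.isqrt` need Python 3.8 or later.

```python
import math


def two_missing_by_sum_and_product(nos):
    n = len(nos) + 2                           # numbers run from 1 to n
    s = n * (n + 1) // 2 - sum(nos)            # x1 + x2 = N - N1
    p = math.factorial(n) // math.prod(nos)    # x1 * x2 = P / P1

    # x1 and x2 are the roots of x^2 - s*x + p = 0
    root = math.isqrt(s * s - 4 * p)
    return (s - root) // 2, (s + root) // 2


print(two_missing_by_sum_and_product([4, 2, 1, 6, 7, 5]))
# (3, 8)
```

Python integers grow as needed, so the product is safe here. In a language with fixed-width integers the product grows factorially and would overflow very quickly.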

Q: Does the solution help with large integers? Think of possible overflow?

Third thought …

Let’s look at possible way for 1 missing number first.

We will traverse through all the numbers of the array. While doing so, we maintain a running total: for every position, add the index (+1 if the index starts from 0) and subtract the number found there. Because each step adds one natural number and removes one array value, the running total stays small instead of growing towards the full sum. It still makes use of the n natural numbers (in the form of indexes) to keep the sum within a defined limit.

# Keep track of sum
def getMissingNumber(nos, n):
    sum = 0
    # calculate sum
    for i in range(0, n):
        sum += (i+1)
        sum -= nos[i]

    # last number to add from n+1 natural nos.
    return sum+n+1

# Input
numbers = [4, 2, 1, 6, 5, 7] 

missingNumber = getMissingNumber(numbers,len(numbers))

print('Missing number:', missingNumber)

# Output
# Missing number: 3

This looks good and we maintain the same complexities along with solving for overflow.

We can probably try a similar thing for two missing numbers, where we keep multiplying and dividing the running value by the traversed number and the index, but it could still have overflow issues in the worst case. Further, there could be round-off issues.

Fourth thought …

Looking further, we can make use of the XOR operation to find the missing numbers. XOR has the property that a value XORed with itself is 0, so duplicate pairs cancel out. We take the XOR of the provided numbers and the XOR of the natural numbers in the range. Combining both with XOR again cancels every number that is present and leaves only the XOR of the missing number(s).

For one missing number, this would be easy and covers all the hurdles discussed earlier keeping same performance.

# Keep track of XOR data
def getMissingNumber(nos, n):
    x1 = nos[0]
    xn = 1

    # start from second
    for i in range(1, n):
        x1 = x1 ^ nos[i]
        xn = xn ^ (i+1)
    
    # last number to XOR
    xn = xn ^ (n+1)

    # find the missing number
    return x1 ^ xn

# Input
numbers = [4, 2, 1, 6, 5, 7] 

missingNumber = getMissingNumber(numbers,len(numbers))

print('Missing number:', missingNumber)

# Output
# Missing number: 3

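For two missing numbers, the same XOR logic gives us the XOR of the two missing numbers. Since the two missing numbers are different, this XOR value cannot be zero, so it has at least one set bit. At any set bit position, the two missing numbers must differ: one has that bit set and the other does not. We can use that bit to split all the numbers into two groups, each holding exactly one missing number.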

# Keep track of XOR data
def getTwoMissingNumber(nos, n):
    x1 = nos[0]
    xn = 1

    # start from second
    for i in range(1, n-2):
        x1 = x1 ^ nos[i]
        xn = xn ^ (i+1)
    
    # last numbers to XOR
    xn = xn ^ (n-1) ^ (n)

    # XOR of two missing numbers
    # Any set bit in it must be 
    # set in one missing and 
    # unset in other missing number 
    XOR = x1 ^ xn

    # Get a rightmost set bit of XOR  
    set_bit_no = XOR & ~(XOR-1) 
  
    # Divide elements in two sets 
    # by comparing rightmost set bit of XOR 
    # with bit at same position in each element. 
    x = 0
    y = 0 
    for i in range(0,n-2): 
        if nos[i] & set_bit_no:    
            # XOR of first set in nos[]  
            x = x ^ nos[i]   
        else: 
            # XOR of second set in nos[]  
            y = y ^ nos[i]   

    # Fold the full range 1..n into the same two sets
    for i in range(1,n+1): 
        if i & set_bit_no: 
            # XOR of first set in range 1..n  
            x = x ^ i        
        else: 
            # XOR of second set in range 1..n  
            y = y ^ i
    
    print ("Missing Numbers: %d %d"%(x,y)) 
    return

# Input
numbers = [4, 2, 1, 6, 7, 5] 

# total length will be provided count+2 missing ones
getTwoMissingNumber(numbers, len(numbers) + 2)

# Output
# Missing Numbers: 3 8

This overcomes the overflow issue and was easier to solve (compared to solving a quadratic equation). Though it took more than one traversal, overall it maintains the time complexity as O(n) and space complexity as O(1). Nice!

Closure …

There could be multiple ways to solve for one or more missing numbers. One can look at it based on ease and need.


Keep solving!

Sandeep Mewara Github

How to solve Word Ladder Problem?

Sometime back, a colleague of mine asked me about the word ladder problem. She was looking for a change. So, I believe she stumbled across this while preparing for data structures and algorithms.

graph-header

Problem Statement

Typically, the puzzle shared is a flavor of below:

Find the smallest number of transformations needed to change an initial word to a target word of the same length. In every transformation, change only one character and make sure the resulting word exists in the given dictionary.

Explanation

Assuming all these 4 letter words are there in the dictionary provided, it takes minimum 4 transitions to convert word from SAIL to RUIN, i.e.
SAIL -> MAIL -> MAIN -> RAIN -> RUIN

The intent here is to learn about graph algorithms. So, what are graphs in the context of algorithms and how do we apply them to solve such problems?

Graph Data Structure

Graphs are structures that represent entities and their connections with each other. Visually, they are represented with the help of a Node (Vertex) & an Edge (Connector).

graph-general

A tree is an undirected graph in which any two nodes are connected by exactly one path. In a rooted tree, each node (except the root node) has exactly one parent node.

A common way to represent a graph is an adjacency matrix. In it, element A[i][j] is 1 if there is an edge from node i to node j, or else it is 0. For example, the adjacency matrix of the above undirected graph is:

  | 1 2 3 4
------------
1 | 0 1 0 1
2 | 1 0 1 0
3 | 0 1 0 1
4 | 1 0 1 0

Another common way is via Adjacency list. (List format of the data instead of a matrix.)
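To make the two representations concrete, here is a small sketch that converts the 4-node matrix shown above into an adjacency list. The helper name `matrix_to_adjacency_list` is made up for this example.

```python
# Adjacency matrix of the 4-node undirected graph from the table above
matrix = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]


def matrix_to_adjacency_list(m):
    """Map each node (numbered from 1) to the list of its neighbours."""
    return {
        i + 1: [j + 1 for j, has_edge in enumerate(row) if has_edge]
        for i, row in enumerate(m)
    }


adjacency_list = matrix_to_adjacency_list(matrix)
print(adjacency_list)
# {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
```

The matrix always takes space for every pair of nodes, while the list only stores the edges that exist, which usually suits sparse graphs better.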

Related Algorithms

Graphs are applied in search algorithms. Traversing the nodes and edges in a defined order helps in optimizing search. There are two specific approaches to traverse graph:

Breadth First Search (BFS)

Given a graph G and a starting node s, search proceeds by exploring edges in the graph to find all the nodes in G for which there is a path from s. With this approach, it finds all the nodes that are at a distance k from s before it finds any nodes that are at a distance k+1.

For easy visualization, think of it as, in a tree, finding all the child nodes of a parent node as the first step. After that, find all the grandchildren, and so on.

Depth First Search (DFS)

Given a graph G and a starting node s, search proceeds by exploring edges in the graph to find all the nodes in G reachable from s through its edges. With this approach, we go as deep in the graph as possible along one path, and backtrack to branch where necessary.

For easy visualization, think of it as, in a tree, finding all the family nodes of a parent node. With this, for a given node, we visit its children, grandchildren, great-grandchildren and so on before moving to the next node of the same level.

Thus, with DFS approach, we can have multiple deduced trees.

Knight’s tour is a classic example that leverages Depth First Search algorithm.
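The difference in visiting order between the two traversals is easiest to see on a small example. The tree below and the function names are hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical tree: A is the root, B and C are its children, and so on
tree = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
    'D': [], 'E': [], 'F': [],
}


def bfs_order(graph, start):
    """Visit level by level using a queue."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


def dfs_order(graph, start, seen=None):
    """Go as deep as possible first, using recursion."""
    seen = seen if seen is not None else set()
    seen.add(start)
    order = [start]
    for child in graph[start]:
        if child not in seen:
            order.extend(dfs_order(graph, child, seen))
    return order


print(bfs_order(tree, 'A'))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(dfs_order(tree, 'A'))  # ['A', 'B', 'D', 'E', 'C', 'F']
```

BFS finishes both children of A before touching any grandchild, while DFS finishes the whole B branch before it reaches C.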

Shortest Path First OR Dijkstra’s Algorithm (SPF)

Given a weighted graph G and a starting node s, find the shortest path to reach node d. Each edge carries a weight (cost), and the weights must be non-negative. It is an iterative algorithm that explores the graph much like BFS does.

Many real world examples fit in here, e.g. what would be the shortest path from home to office.

With BFS (a simple queue), we visit one level at a time, whereas in SPF (a priority queue), we visit the node with the lowest cost so far, at whatever level it sits. In a sense, BFS is Dijkstra's algorithm with all edge weights equal to 1. The process for exploring the graph is structurally the same in both cases. For graphs with equal weights, BFS is preferred because operations on a priority queue are O(log n), compared to O(1) for operations on a regular queue.
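As a rough sketch of the priority-queue idea, the snippet below uses Python's `heapq` module. The road network, its weights and the function name `shortest_distances` are invented for the example.

```python
import heapq


def shortest_distances(graph, source):
    """Return the lowest total cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]              # (cost so far, node)
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float('inf')):
            continue                  # stale entry, a cheaper route was found
        for neighbour, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < dist.get(neighbour, float('inf')):
                dist[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return dist


# Hypothetical weighted graph: node -> list of (neighbour, travel cost)
roads = {
    'home':   [('market', 4), ('park', 1)],
    'park':   [('market', 2), ('office', 6)],
    'market': [('office', 1)],
    'office': [],
}

print(shortest_distances(roads, 'home'))
# {'home': 0, 'market': 3, 'park': 1, 'office': 4}
```

Replacing the heap with a plain queue and treating every weight as 1 turns this into BFS, which is the structural similarity mentioned above.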

Code

I will be using a breadth first graph algorithm here based on the problem need:

from collections import deque 

class Solution(object):
    # method that will help find the path
    def ladderLength(self, beginWord, 
                        endWord, wordList):
        """
        :type beginWord: str
        :type endWord: str
        :type wordList: Set[str]
        :returntype: int
        """

        # Queue for BFS
        queue = deque()

        # start by adding begin word
        queue.append((beginWord, [beginWord]))

        while queue:
            # let's keep a watch at active queue
            print('Current queue:',queue)

            # get the current node and 
            # path how it came
            node, path = queue.popleft()

            # let's keep track of path length 
            # traversed so far
            print('Current transformation count:',
                                        len(path))

            # find possible next set of 
            # child nodes, 1 diff
            for next in self.next_nodes(node, 
                            wordList) - set(path):
                # traversing through all child nodes
                # if any of the child matches, 
                # we are good               
                if next == endWord:
                    print('found endword at path:',
                                            path)
                    return len(path)
                else:
                    # keep record of next 
                    # possible paths
                    queue.append((next, 
                                path + [next]))
        return 0

    def next_nodes(self, word, word_list):
        # start with empty collection
        possiblenodes = set()

        # all the words are of fixed length
        wl_word_length = len(word)

        # loop through all the words in 
        # the word list
        for wl_word in word_list:
            mismatch_count = 0

            # find all the words that are 
            # only a letter different from 
            # current word those are the 
            # possible next child nodes
            for i in range(wl_word_length):
                if wl_word[i] != word[i]:
                    mismatch_count += 1
            if mismatch_count == 1:
                # only one alphabet different-yes
                possiblenodes.add(wl_word)
        
        # lets see the set of next possible nodes 
        print('possible next nodes:',possiblenodes)
        return possiblenodes

# Setup
beginWord = "SAIL"
endWord = "RUIN"
wordList = ["SAIL","RAIN","REST","BAIL","MAIL",
                                    "MAIN","RUIN"]

# Call
print('Transformations needed: ',
    Solution().ladderLength(beginWord, 
                            endWord, wordList))

# Transformations expected == 4
# One possible shortest path with 4 transformations:
# SAIL -> MAIL -> MAIN -> RAIN -> RUIN

Used deque (double-ended queue) of Python

deque helps with quicker append and pop operations from both the ends. It has O(1) time complexity for append and pop operations at either end. In comparison, a list is O(1) only at the right end; inserting or popping at the left end of a list (e.g. pop(0)) takes O(n) time.

A quick look at the code workflow to validate that all nodes at a particular distance were traversed first before moving to the next level:

Current queue: deque([('SAIL', ['SAIL'])])

Current transformation count: 1
possible next nodes: {'BAIL', 'MAIL'}
Current queue: deque([('BAIL', ['SAIL', 'BAIL']), 
                      ('MAIL', ['SAIL', 'MAIL'])])

Current transformation count: 2
possible next nodes: {'SAIL', 'MAIL'}
Current queue: deque([('MAIL', ['SAIL', 'MAIL']), 
                      ('MAIL', ['SAIL', 'BAIL', 
                       'MAIL'])])

Current transformation count: 2
possible next nodes: {'BAIL', 'MAIN', 'SAIL'}
Current queue: deque([('MAIL', ['SAIL', 'BAIL', 
                                'MAIL']), 
                      ('BAIL', ['SAIL', 'MAIL', 
                                'BAIL']), 
                      ('MAIN', ['SAIL', 'MAIL', 
                                'MAIN'])])

Current transformation count: 3
possible next nodes: {'BAIL', 'MAIN', 'SAIL'}
Current queue: deque([('BAIL', ['SAIL', 'MAIL', 
                                'BAIL']), 
                      ('MAIN', ['SAIL', 'MAIL', 
                                'MAIN']), 
                      ('MAIN', ['SAIL', 'BAIL', 
                                'MAIL', 'MAIN'])])

Current transformation count: 3
possible next nodes: {'SAIL', 'MAIL'}
Current queue: deque([('MAIN', ['SAIL', 'MAIL', 
                                'MAIN']), 
                      ('MAIN', ['SAIL', 'BAIL', 
                                'MAIL', 'MAIN'])])

Current transformation count: 3
possible next nodes: {'RAIN', 'MAIL'}
Current queue: deque([('MAIN', ['SAIL', 'BAIL', 
                                'MAIL', 'MAIN']), 
                      ('RAIN', ['SAIL', 'MAIL', 
                                'MAIN', 'RAIN'])])

Current transformation count: 4
possible next nodes: {'RAIN', 'MAIL'}
Current queue: deque([('RAIN', ['SAIL', 'MAIL', 
                                'MAIN', 'RAIN']), 
                      ('RAIN', ['SAIL', 'BAIL', 
                        'MAIL', 'MAIN', 'RAIN'])])

Current transformation count: 4
possible next nodes: {'MAIN', 'RUIN'}
found endword at path: ['SAIL', 'MAIL', 'MAIN', 
                                        'RAIN']

Transformations needed:  4
Overall path: ['SAIL', 'MAIL', 'MAIN', 
                               'RAIN', 'RUIN']
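The trace shows the same word (MAIL, MAIN, RAIN) being queued more than once through different paths. One possible variation, not the code used above, is to keep a visited set so that each word is queued at most once. The names `ladder_length` and `differs_by_one` are made up for this sketch.

```python
from collections import deque


def ladder_length(begin_word, end_word, word_list):
    """BFS over words; returns the number of transformations, 0 if none."""
    def differs_by_one(a, b):
        return sum(x != y for x, y in zip(a, b)) == 1

    visited = {begin_word}
    queue = deque([(begin_word, 0)])   # (word, transformations so far)
    while queue:
        word, steps = queue.popleft()
        if word == end_word:
            return steps
        for candidate in word_list:
            if candidate not in visited and differs_by_one(word, candidate):
                visited.add(candidate)
                queue.append((candidate, steps + 1))
    return 0


words = ["SAIL", "RAIN", "REST", "BAIL", "MAIL", "MAIN", "RUIN"]
print(ladder_length("SAIL", "RUIN", words))
# 4
```

Because BFS reaches every word first by a shortest route, skipping already visited words does not change the answer; it only trims the queue.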

Complexity

For above code that I used to find the shortest path for transformation:

Time

In next_nodes, for each word in the word list, we iterated over its length to find all the intermediate words corresponding to it. Thus we did M×N iterations, where M is the length of each word and N is the total number of words in the input word list. Further, to form an intermediate word, it takes O(M) time. This adds up to O(M²×N).

In ladderLength, BFS can go to each of the N words and for each word, we need to examine M possible intermediate words. This adds up to O(M²×N).

Overall, it adds up to 2×O(M²×N), which would be called O(M²×N).

Space

In next_nodes, each word in the word list would have M intermediate combinations. For every word we need a space of M² to save all the transformations corresponding to it. Thus, it would need a total space of O(M²×N).

In ladderLength, BFS queue would need a space of O(M×N).

Overall, it adds up to O(M²×N) + O(M×N), which would be called O(M²×N).

Wrap Up

It could be a little tricky, and thus would need some practice to visualize the graph as well as to write code for it.

Great, so now we know how to solve problems like the word ladder problem. It also touched upon other related common graph algorithms that we can refer to.

I had a read of the following reference and it has much more details if needed.


Keep problem solving!


Linear time partition – a three way split

Linear-time partition is a divide & conquer based selection algorithm. With it, data is split into three groups using a pivot.

linear-time-partioning

This partitioning logic is an integral part of the Quick Sort algorithm, which applies it recursively. All the elements smaller than the pivot are put on one side and all the larger ones on the other side of the pivot.

Similar to the discussion of Dynamic Programming, this algorithm relies on solving sub-problems to solve a complex problem.

Algorithm

After selecting the pivot, the Linear-time partition routine separates the data into three groups with values:

  • less than the pivot
  • equal to the pivot
  • greater than the pivot

Generally, this algorithm is done in place. This results in partially sorting the data. There are a handful of problems that make use of this fact, like:

  • Sort an array that contains only 0’s, 1’s & 2’s
  • Dutch national flag problem
  • Print all negative integers followed by positive for an array full of them
  • Print all 0’s first and then 1’s or vice-versa for an array with only 0’s & 1’s
  • Move all the 0’s to the end maintaining relative order of other elements for an array of integers

If done out of place (i.e. not changing the original data), it would cost O(n) additional space.
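For contrast with the in-place routine, an out-of-place split could look like the sketch below. The function name `three_way_split` is hypothetical, and the version favours readability: it builds three new lists (the O(n) extra space) and passes over the data three times.

```python
def three_way_split(data, pivot):
    """Return (less, equal, greater) without modifying data."""
    less = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    greater = [x for x in data if x > pivot]
    return less, equal, greater


original = [5, 2, 9, 5, 1, 7, 5]
print(three_way_split(original, 5))
# ([2, 1], [5, 5, 5], [9, 7])
print(original)
# [5, 2, 9, 5, 1, 7, 5]  (unchanged)
```

Each group keeps the relative order of its elements, which the in-place version does not guarantee.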

Example

Let’s take an example: sort an array that contains only 0’s, 1’s & 2’s

First thought for such a problem is to perform a count of 0’s, 1’s and 2’s. Once we have the counts, reset the array with them. Though it has time complexity O(n), it takes two traversals of the array or uses an extra array.

Below is an attempt to solve using Linear-time partition algorithm to avoid that extra traversal/space.

def threeWayPartition(A):
    start = mid = 0
    end = len(A)-1
    
    # define a Pivot
    pivot = 1
    
    while (mid <= end):
        # mid element is less than pivot
        # current element is 0
        
        # so lets move it to start
        # current start is good. 
        # move start to next element
        # move mid to next element to move forward
        if (A[mid] < pivot) :
            swap(A, start, mid)
            start = start + 1
            mid = mid + 1
            
        # mid element is more than pivot
        # current element is 2
        
        # so lets move it to end
        # current end is good. 
        # move end to previous element
        elif (A[mid] > pivot) :
            swap(A, mid, end)
            end = end - 1
        
        # mid element is same as pivot
        # current element is 1
        
        # just move forward: 
        # mid to next element
        else :
            mid = mid + 1
            
# Swap two elements A[i] and A[j] in the list
def swap(A, i, j):
    A[i], A[j] = A[j], A[i]


# Define an array
inputArray = [0, 1, 2, 2, 1, 0, 0, 2]

# Call the Linear-time partition routine
threeWayPartition(inputArray)

# print the final result
print(inputArray)

# Outputs
# [0, 0, 0, 1, 1, 2, 2, 2]

With a defined pivot, we segregated the data on either side, which resulted in the desired output. The Dutch national flag problem, printing all negatives first and then positives, or printing all 0s first follow the same code.
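To show how the same routine carries over, here is a sketch with the pivot passed in as a parameter, applied to the negatives-before-positives case with a pivot of 0. The name `partition_around` is made up for this example.

```python
def partition_around(A, pivot):
    """In-place three-way partition of A around an arbitrary pivot."""
    start = mid = 0
    end = len(A) - 1
    while mid <= end:
        if A[mid] < pivot:
            # smaller than pivot: move to the front region
            A[start], A[mid] = A[mid], A[start]
            start += 1
            mid += 1
        elif A[mid] > pivot:
            # larger than pivot: move to the back region
            A[mid], A[end] = A[end], A[mid]
            end -= 1
        else:
            # equal to pivot: leave in the middle region
            mid += 1


values = [3, -1, 0, -7, 5, 0, -2]
partition_around(values, 0)
print(values)
# [-2, -1, -7, 0, 0, 5, 3]
```

Note that the groups come out in the right order, but the order inside each group is not preserved.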

For moving all 0’s to the end while maintaining the order of other elements, we tweak the swap index to maintain order:

def threeWayPartition(A):
    current = 0
    nonzero = 0
    end = len(A)-1
    
    # define a Pivot
    pivot = 0
    
    while (current <= end):
        if (A[current] != pivot) :
            swap(A, current, nonzero)
            nonzero = nonzero + 1
        current = current + 1
            
# Swap two elements A[i] and A[j] in the list
def swap(A, i, j):
    A[i], A[j] = A[j], A[i]


# Define an array
inputArray = [7,0,5,1,2,0,2,0,6]

# Call the Linear-time partition routine
threeWayPartition(inputArray)

# print the final result
print(inputArray)

# Output
# [7, 5, 1, 2, 2, 6, 0, 0, 0]

Complexity

With the above approach, we solved our problem with time complexity O(n) & space complexity O(1), using a single traversal of the array.


It was fun solving!


Data Visualization – Insights with Matplotlib

Matplotlib is the most popular Python library for visualization. While working on a machine learning problem, it helps in representing & analyzing the data and working through insights.

matplotlib-machine-learning

Generally, it’s difficult to interpret much about data, just by looking at it. But, a presentation of the data in any visual form, helps a great deal to peek into it. It becomes easy to deduce correlations, identify patterns & parameters of importance.

In the data science world, data visualization plays an important role around the data pre-processing stage. It helps in picking appropriate features and applying an appropriate machine learning algorithm. Later, it helps in representing the data in a meaningful way.

Data Insights via various plots

Where needed, we will use sample datasets for plot examples and discussions. Based on the need, the following are the common plots that are used:

Line Chart | ax.plot(x,y)

It helps in representing a series of data points against a given range of a defined parameter. The real benefit is plotting multiple line charts in a single plot to compare and track changes.

Points next to each other are related, which helps in identifying a repeated or defined pattern.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 1, 0.05)
y1 = x**2
y2 = x**3

plt.plot(x, y1,
    linewidth=0.5,
    linestyle='--',
    color='b',
    marker='o',
    markersize=10,
    markerfacecolor='red')

plt.plot(x, y2,
    linewidth=0.5,
    linestyle='dotted',
    color='g',
    marker='^',
    markersize=10,
    markerfacecolor='yellow')

plt.title('x Vs f(x)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend(['f(x)=x^2', 'f(x)=x^3'])
plt.xticks(np.arange(0, 1.1,0.2),
    ['0','0.2','0.4','0.6','0.8','1.0'])

plt.grid(True)
plt.show()
line-chart
Real world example:

We will work with dataset created from collating historical data for few stocks downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

stocksdf1 = pd.read_csv('data-files/stock-INTU.csv') 
stocksdf2 = pd.read_csv('data-files/stock-AAPL.csv') 
stocksdf3 = pd.read_csv('data-files/stock-ADBE.csv') 

stocksdf = pd.DataFrame()
stocksdf['date'] = pd.to_datetime(stocksdf1['Date'])
stocksdf['INTU'] = stocksdf1['Open']
stocksdf['AAPL'] = stocksdf2['Open']
stocksdf['ADBE'] = stocksdf3['Open']

plt.plot(stocksdf['date'], stocksdf['INTU'])
plt.plot(stocksdf['date'], stocksdf['AAPL'])
plt.plot(stocksdf['date'], stocksdf['ADBE'])

plt.legend(labels=['INTU','AAPL','ADBE'])
plt.grid(True)

plt.show()
line-chart-stocks

With the above, we can make a couple of quick assessments:
Q: How did a particular stock fare over the last year?
A: Stocks were roughly rising till Feb 2020, then took a dip in April, and have been back up since then.

Q: How did the three stocks behave during the same period?
A: The stock price of ADBE was the most sensitive and that of AAPL the least sensitive to the change during the same period.

Histogram | ax.hist(data, n_bins)

It helps in showing distributions of variables where it plots quantitative data with range of the data grouped into intervals.

We can use Log scale if the data range is across several orders of magnitude.
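To see what grouping into intervals means, and why a log scale helps with wide ranges, here is a plain Python sketch with no plotting. The helper `histogram_counts` is hypothetical; it uses equal-width bins with the maximum value placed in the last bin, and it assumes the values are not all identical.

```python
import math


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # clamp so that the maximum value lands in the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges


counts, edges = histogram_counts([1, 2, 2, 3, 5, 8, 9, 10], 3)
print(counts, edges)
# [4, 1, 3] [1.0, 4.0, 7.0, 10.0]

# Data spanning several orders of magnitude
wide = [1, 10, 100, 1000, 10000, 100000]
linear_counts, _ = histogram_counts(wide, 3)
log_counts, _ = histogram_counts([math.log10(v) for v in wide], 3)
print(linear_counts)  # [5, 0, 1]
print(log_counts)     # [2, 2, 2]
```

On the linear scale almost everything piles into the first bin, while binning the logarithms spreads the same data evenly.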

import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
# covariance matrix must be symmetric and
# positive semi-definite
cov = [[2, 4], [4, 9]]
xn, yn = np.random.multivariate_normal(
                                mean, cov, 100).T

plt.hist(xn,bins=25,label="Distribution on x-axis")

plt.xlabel('x')
plt.ylabel('frequency')
plt.grid(True)
plt.legend()
plt.show()
Real world example

We will work with dataset of Indian Census data downloaded from here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
df = populationdf[mask1 & mask2]

plt.hist(df['TOT_P'], label='Distribution')

plt.xlabel('Total Population')
plt.ylabel('State Count')
plt.yticks(np.arange(0,20,2))

plt.grid(True)
plt.legend()
histogram-state-pop

With the above, a couple of quick assessments about population in states of India:
Q: What’s the general population distribution of states in India?
A: More than 50% of states have a population of less than 2 crores (20 million).

Q: How many states have a population of more than 10 crores (100 million)?
A: Only 3 states have that high a population.

Bar Chart | ax.bar(x_pos, heights)

It helps in comparing two or more variables by displaying values associated with categorical data.

It is the plot most commonly used in media to share survey data, with one bar for every data sample.

import numpy as np
import matplotlib.pyplot as plt

data = [[60, 45, 65, 35],
        [35, 25, 55, 40]]

x_pos = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_xticks(x_pos)

ax.bar(x_pos - 0.1, data[0], color='b', width=0.2)
ax.bar(x_pos + 0.1, data[1], color='g', width=0.2)

ax.yaxis.grid(True)
bar-chart
Real world example

We will work with dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
statesdf = populationdf[mask1 & mask2]
statesdf = statesdf.sort_values('TOT_P')

plt.figure(figsize=(10,8))
plt.barh(range(len(statesdf)), 
    statesdf['TOT_P'], tick_label=statesdf['Name'])
plt.grid(True)
plt.title('Total Population')
plt.show()
bar-chart-state-pop

With the above, a couple of quick assessments about population in states of India:
– Uttar Pradesh has the highest total population and Lakshadweep has the lowest
– Relative population across states is visible at a glance, with Uttar Pradesh having almost double the population of the second most populated state

Pie Chart | ax.pie(sizes, labels=[labels])

It helps in showing the percentage (or proportional) distribution of categories at a certain point of time. Usually, it works well if it’s limited to single digit categories.

A circular statistical graphic where the arc length of each slice is proportional to the quantity it represents.
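The proportionality can be worked out by hand. The sketch below uses hypothetical slice sizes and a made-up helper, `slice_shares`, to compute each slice's percentage and its angle out of 360 degrees.

```python
def slice_shares(sizes):
    """Return (percentage, angle in degrees) for every slice."""
    total = sum(sizes)
    return [(100 * s / total, 360 * s / total) for s in sizes]


# Hypothetical quantities for five categories
sizes = [90, 70, 35, 20, 25]
for size, (percent, angle) in zip(sizes, slice_shares(sizes)):
    print(f"{size:>3} -> {percent:.1f}% of the pie, {angle:.1f} degrees")
```

The first slice, 90 out of a total of 240, takes 37.5% of the circle, which is an angle of 135 degrees.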

import numpy as np
import matplotlib.pyplot as plt

# Slices will be ordered and plotted counter-clockwise
labels = ['Audi','BMW','LandRover','Tesla','Ferrari']
sizes = [90, 70, 35, 20, 25]

fig, ax = plt.subplots()
ax.pie(sizes,labels=labels, autopct='%1.1f%%')
ax.set_title('Car Sales')
plt.show()
pie-chart
Real world example

We will work with dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

labels = ['Beer', 'Spirit', 'Wine']
sizes = [drinksdf['beer'].sum(), 
         drinksdf['spirit'].sum(), 
         drinksdf['wine'].sum()]

fig, ax = plt.subplots()
explode = [0.05,0.05,0.2]
ax.pie(sizes,explode=explode,
    labels=labels, autopct='%1.1f%%')

ax.set_title('Alcohol Consumption')
plt.show()
pie-chart-drinks

With the above, we can have a quick assessment of how overall alcohol consumption is distributed across beer, spirit and wine. This view helps if we have a small number of slices (categories).

Scatter plot | ax.scatter(x_points, y_points)

It helps in representing paired numerical data, either to compare how one variable is affected by another or to see how the values of multiple dependent variables are spread for each value of an independent variable.

Sometimes the data points in a scatter plot form distinct groups, which are called clusters.

import numpy as np
import matplotlib.pyplot as plt

# random but focused cluster data
x1 = np.random.randn(100) + 8
y1 = np.random.randn(100) + 8
x2 = np.random.randn(100) + 3
y2 = np.random.randn(100) + 3

x = np.append(x1,x2)
y = np.append(y1,y2)

plt.scatter(x,y, label="xy distribution")
plt.legend()
scatter-plot
Real world example

1. We will work with dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# parentheses are needed for the sum to 
# continue across lines
drinksdf['total'] = (drinksdf['beer'] 
    + drinksdf['spirit'] 
    + drinksdf['wine'] 
    + drinksdf['alcohol'])

# drinksdf.corr() tells beer and alcohol 
# are highly correlated
fig = plt.figure()

# Compare beer and alcohol consumption
# Use color to show a third variable.
# Can also use size (s) to show a third variable.
scat = plt.scatter(drinksdf['beer'], 
                   drinksdf['alcohol'], 
                   c=drinksdf['total'], 
                   cmap=plt.cm.rainbow)

# colorbar to explain the color scheme
fig.colorbar(scat, label='Total drinks')

plt.xlabel('Beer')
plt.ylabel('Alcohol')
plt.title('Comparing beer and alcohol consumption')
plt.grid(True)
plt.show()
scatter-plot-drinks

With the above, we can have a quick assessment that beer and alcohol consumption have a strong positive correlation, which would suggest a large overlap of people who drink beer and alcohol.

2. We will work with dataset of Mall Customers downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

malldf = pd.read_csv('data-files/mall-customers.csv',
                skiprows=1, 
                names = ['customerid', 'genre', 
                         'age', 'annualincome', 
                         'spendingscore'])

plt.scatter(malldf['annualincome'], 
            malldf['spendingscore'], 
            marker='p', s=40, 
            facecolor='r', edgecolor='b', 
            linewidth=2, alpha=0.4)

plt.xlabel("Annual Income")
plt.ylabel("Spending Score (1-100)")
plt.grid(True)
scatter-plot-mall

With the above, we can have a quick assessment that there are five clusters, and thus five segments or types of customers one can plan for.

Box Plot | ax.boxplot([data list])

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It shows only summary statistics: the median, the quartiles (interquartile range), the whiskers and any outliers. The mean is not drawn by default.

It makes it easy to identify whether data is symmetrical, how tightly it is grouped, and whether and how it is skewed.
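As a rough sketch of what a box encodes, the quartiles and whisker limits can be computed by hand with NumPy. The variable names and the seeded sample are illustrative only; the 1.5 x IQR whisker rule is matplotlib's default (whis=1.5).

```python
import numpy as np

# Illustrative sample; the seed only makes the run repeatable
rng = np.random.default_rng(0)
sample = rng.normal(0, 2, 100)

# The box spans Q1..Q3 and the line inside it is the median
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the last data points inside these fences;
# anything beyond them is drawn as an individual outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print('Q1, median, Q3:', q1, median, q3)
print('IQR:', iqr)
print('Outliers:', outliers)
```

A wide IQR means a loosely grouped variable, and a median sitting off-center in the box hints at skew.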

import numpy as np
import matplotlib.pyplot as plt

# some random data
data1 = np.random.normal(0, 2, 100)
data2 = np.random.normal(0, 4, 100)
data3 = np.random.normal(0, 3, 100)
data4 = np.random.normal(0, 5, 100)
data = list([data1, data2, data3, data4])

fig, ax = plt.subplots()
bx = ax.boxplot(data, patch_artist=True)

ax.set_title('Box Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

colors = ['pink','lightblue','lightgreen','yellow']
for patch, color in zip(bx['boxes'], colors):
    patch.set_facecolor(color)

ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()
box-plot
Real world example

We will work with dataset of Tips downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.boxplot(x="time", y="tip", 
            hue='sex', data=tipsdf, 
            order=["Dinner", "Lunch"],
            palette='coolwarm')
box-plot-tips

With the above, we can make a couple of quick assessments:
– males tip more than females
– tips at dinner time vary a lot more, especially for males

Violin Plot | ax.violinplot([data list])

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It shows the full distribution of data.

A quick way to compare distributions across multiple variables

import numpy as np
import matplotlib.pyplot as plt

data = [np.random.normal(0, std, size=100) 
        for std in range(2, 6)]

fig, ax = plt.subplots()
bx = ax.violinplot(data)

ax.set_title('Violin Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

ax.set_xticks([1,2,3,4])
ax.set_xticklabels(xticklabels)

ax.yaxis.grid(True)
plt.show()
violin-plot
Real world example
  1. We will work with dataset of Tips downloaded from here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.violinplot(x="day", y="tip", 
               split=True, data=tipsdf)
violin-plot-tips

With the above, we can quickly assess that tips on Saturday have a wider distribution, whereas tips on Friday have a much narrower distribution in comparison.

2. We will work with dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='DISTRICT'
mask2 = populationdf['TRU']!='Total'
statesdf = populationdf[mask1 & mask2]

maskUP = statesdf['State']==9
maskM = statesdf['State']==27
data = statesdf.loc[maskUP | maskM]

sns.violinplot(x='State', y='P_06', 
               inner='quartile', hue='TRU',  
               palette={'Rural':'green','Urban':'blue'}, 
               scale='count', split=True, 
               data=data, size=6)

plt.title('In districts of UP and Maharashtra')
plt.show()
violin-plot-child

With the above, we can make a couple of quick assessments:
– Uttar Pradesh has a high volume and a wide distribution of rural child population.
– Maharashtra has an almost equal spread of rural and urban child population.

Heatmap

It helps in representing data in 2-D matrix form, using variation of color for different values. The variation may be in hue or intensity.

Generally used to visualize correlation matrix which in turn helps in features (variables) selection.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create 2D array
array_2d = np.random.rand(4, 6)
sns.heatmap(array_2d, annot=True)
heatmap
Real world example
  1. We will work with dataset of Alcohol Consumption downloaded from here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

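# corr() needs numeric columns only;
# newer pandas raises an error on text columns
drinks_cr = drinksdf.select_dtypes(include='number').corr()
sns.heatmap(drinks_cr, annot=True, cmap='YlGnBu')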
heatmap-drinks

With the above, we can make a couple of quick assessments:
– there is a strong correlation between beer and alcohol and thus a strong overlap there.
– wine and spirit are almost uncorrelated, so it would be rare to find a place where wine and spirit consumption are equally high. One would be preferred over the other.

If we notice, the upper and lower halves along the diagonal are the same: the correlation of A with B is the same as that of B with A. Further, the correlation of A with itself is always 1. In such a case, we can make a small tweak to make it more presentable and avoid any confusion.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv(
    'data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

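# correlation (numeric columns only) and mask
drinks_cr = drinksdf.select_dtypes(include='number').corr()
# True on and above the diagonal => those cells get hidden
drinks_mask = np.triu(np.ones_like(drinks_cr, dtype=bool))

# drop the first row and the last column,
# which would otherwise be fully masked
drinks_cr = drinks_cr.iloc[1:,:-1]
drinks_mask = drinks_mask[1:, :-1]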

sns.heatmap(drinks_cr, 
        mask=drinks_mask,
        annot=True,
        cmap='coolwarm')
heatmap-masked

It is the same correlation data, but only the needed part is shown.
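To see why the trimmed mask leaves only the lower triangle, here is a minimal sketch on a made-up 3 x 3 correlation matrix. The numbers are illustrative and not taken from the drinks dataset.

```python
import numpy as np

# Hypothetical symmetric correlation matrix with a unit diagonal
corr = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.3],
                 [0.2, 0.3, 1.0]])

# True on and above the diagonal: the duplicated half plus the 1s
mask = np.triu(np.ones_like(corr, dtype=bool))

# The first row and the last column are fully masked, so drop them
trimmed_corr = corr[1:, :-1]
trimmed_mask = mask[1:, :-1]

# Values that would still be drawn by a heatmap using this mask
visible = trimmed_corr[~trimmed_mask]

print(trimmed_mask)
print(visible)
```

Only the three unique off-diagonal correlations remain visible, which is the part worth reading.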

Data Image

It helps in displaying data as an image, i.e. on a 2D regular raster.

Images are internally just arrays. Any 2D numpy array can be displayed as an image.

import numpy as np
import matplotlib.pyplot as plt

M,N = 25,30
data = np.random.random((M,N)) 
plt.imshow(data)
data-image
Real world example
  1. Let’s read an image and then try to display it back to see how it looks
import cv2
import matplotlib.pyplot as plt

img = cv2.imread('data-files/babygroot.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# print(img.shape)
# output => (500, 359, 3)

plt.imshow(img)
baby-groot

It read the image as an array (a matrix of pixel values) and then drew it as a plot that turned out to be the same as the image. Since images are like any other plot, we can draw other objects (like annotations) on top of them.

SubPlots | fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

Generally, it is used in comparing multiple variables (in pairs) against each other. With multiple plots stacked against each other in the same figure, it helps in quick assessment for correlation and distribution for a pair.

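For plt.subplot(), the parameters are: number of rows, number of columns, the index of the subplot 

(Indexes are counted row-wise, starting with 1) 

plt.subplots() takes just the number of rows and columns and returns all the axes at once. The widths of the different subplots can differ with the use of GridSpec.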
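A minimal sketch of unequal subplot widths, assuming the gridspec_kw argument of plt.subplots is used to pass width_ratios through to GridSpec. The axes names and the plotted curves are only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)

# gridspec_kw is forwarded to GridSpec; width_ratios makes the
# first column twice as wide as the second
fig, (ax_wide, ax_narrow) = plt.subplots(
    1, 2, gridspec_kw={'width_ratios': [2, 1]})

ax_wide.plot(x, np.sin(x))
ax_wide.set_title('wide')

ax_narrow.plot(x, np.cos(x))
ax_narrow.set_title('narrow')

fig.tight_layout()
```

Calling plt.show() or fig.savefig() afterwards displays or stores the result.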

import numpy as np
import matplotlib.pyplot as plt
import math

# data setup
x = np.arange(1, 100, 5)
y1 = x**2
y2 = 2*x+4
y3 = [ math.sqrt(i) for i in x]  
y4 = [ math.log(j) for j in x] 

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

ax1.plot(x, y1)
ax1.set_title('f(x) = quadratic')
ax1.grid()

ax2.plot(x, y2)
ax2.set_title('f(x) = linear')
ax2.grid()

ax3.plot(x, y3)
ax3.set_title('f(x) = square root')
ax3.grid()

ax4.plot(x, y4)
ax4.set_title('f(x) = log')
ax4.grid()

fig.tight_layout()
plt.show()
sub-plot

We can stack up an m x n view of the variables and take a quick look at how they are correlated. With the above, we can quickly assess that the parameters in the second graph are linearly correlated.

Data Representation

Plot Anatomy

The picture below helps with plot terminology and representation:

matplotlib-plot-anatomy
Credit: matplotlib.org

Figure above is the base space where the entire plot happens. Most of the parameters can be customized for better representation. For specific details, look here.

Plot annotations

It helps in highlighting a few key findings or indicators on a plot. For advanced annotations, look here.

import numpy as np
import matplotlib.pyplot as plt

# A simple parabolic data
x = np.arange(-4, 4, 0.02)
y = x**2

# Setup plot with data
fig, ax = plt.subplots()
ax.plot(x, y)

# Setup axes
ax.set_xlim(-4,4)
ax.set_ylim(-1,8)

# Visual titles
ax.set_title('Annotation Sample')
ax.set_xlabel('X-values')
ax.set_ylabel('Parabolic values')

# Annotation
# 1. Highlighting specific data on the x,y data
ax.annotate('local minima of \n the parabola',
            xy=(0, 0),
            xycoords='data',
            xytext=(2, 3),
            arrowprops=
                dict(facecolor='red', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='top')

# 2. Highlighting specific data on the x/y axis
bbox_yproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 70% of y-plot range',
            xy=(0, 0.7),
            xycoords='axes fraction',
            xytext=(0.2, 0.7),
            bbox=bbox_yproperties,
            arrowprops=
                dict(facecolor='green', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

bbox_xproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 30% of x-plot range',
            xy=(0.3, 0),
            xycoords='axes fraction',
            xytext=(0.1, 0.4),
            bbox=bbox_xproperties,
            arrowprops=
                dict(facecolor='blue', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

plt.show()
matplotlib-annotation

Plot style | plt.style.use('style')

It helps in customizing representation of a plot, like color, fonts, line thickness, etc. Default styles get applied if the customization is not defined. Apart from adhoc customization, we can also choose one of the already defined template styles and apply them.

# To know all existing styles with package
for style in plt.style.available:
    print(style)

Solarize_Light2, _classic_test_patch, bmh, classic, dark_background, fast, fivethirtyeight, ggplot, grayscale, seaborn, seaborn-bright, seaborn-colorblind, seaborn-dark, seaborn-dark-palette, seaborn-darkgrid, seaborn-deep, seaborn-muted, seaborn-notebook, seaborn-paper, seaborn-pastel, seaborn-poster, seaborn-talk, seaborn-ticks, seaborn-white, seaborn-whitegrid, tableau-colorblind10

pre-defined styles available for use

More details around customization are here.

# To use a defined style for plot
# (matplotlib 3.6+ renamed the seaborn styles, e.g. 'seaborn-v0_8')
plt.style.use('seaborn')

# OR
with plt.style.context('Solarize_Light2'):
    plt.plot(np.sin(np.linspace(0, 2 * np.pi)), 'r-o')
plt.show()
matplotlib-style-ex

Saving plots | fig.savefig()

It helps in saving a figure with its plot as an image file with the defined parameters. Parameters details are here. It will save the image file to the current directory by default.

plt.savefig('plot.png', dpi=300, bbox_inches='tight')

Additional Usages of plots

Data Imputation

It helps in filling missing data with reasonable values, as many statistical or machine learning packages do not work with data containing null values.

Data interpolation can use pre-defined methods such as linear, quadratic or cubic.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(20,1))
df = df.where(df<0.5)

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.plot(df)
ax1.set_title('f(x) = data missing')
ax1.grid()

ax2.plot(df.interpolate())
ax2.set_title('f(x) = data interpolated')
ax2.grid()

fig.tight_layout()
plt.show()
data-interpolate

With the above, we see the missing data filled in by interpolation (linear by default for a dataframe), based on the valid previous and next data points.
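The effect is easier to check on a tiny series. This is a sketch with made-up readings, relying on the default linear method of interpolate():

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps
readings = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Default method is linear: each gap is filled on a straight line
# between the valid neighbours before and after it
filled = readings.interpolate()

print(filled.tolist())
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Other methods such as quadratic or cubic are selected through the method parameter and need SciPy to be installed.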

Animation

At times, it helps in presenting the data as an animation. On a high level, it would need data to be plugged in a loop with delta changes translating into a moving view.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import animation

fig = plt.figure()

def f(x, y):
    return np.sin(x) + np.cos(y)

x = np.linspace(0, 2 * np.pi, 80)
y = np.linspace(0, 2 * np.pi, 70).reshape(-1, 1)

im = plt.imshow(f(x, y), animated=True)


def updatefig(*args):
    global x, y
    x += np.pi / 5.
    y += np.pi / 10.
    im.set_array(f(x, y))
    return im,

ani = animation.FuncAnimation(
    fig, updatefig, interval=100, blit=True)
plt.show()
animation

3-D Plotting

If needed, we can also have an interactive 3-D plot though it might be slow with large datasets.

import numpy as np
import matplotlib.pyplot as plt

def randrange(n, vmin, vmax):
     return (vmax-vmin)*np.random.rand(n) + vmin

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
n = 200
for c, m, zl in [('g', 'o', +1), ('r', '^', -1)]:
    xs = randrange(n, 0, 50)
    ys = randrange(n, 0, 100)
    zs = xs+zl*ys  
    ax.scatter(xs, ys, zs, c=c, marker=m)

ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_zlabel('Z data')
plt.show()
3d-plot

Cheat Sheet

A page representation of the key features for quick lookup or revision:

matplotlib-cheatsheet
Credit: DataCamp

Download the PDF version of cheatsheet from here.
Overall reference & for more details, look: https://matplotlib.org/

Entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look or play around: https://github.com/sandeep-mewara/data-visualization


Keep learning!


pandas – get started with examples

This is to get started with pandas and try a few concrete examples. pandas is a Python-based library that helps in reading, transforming, cleaning and analyzing data. It is built on the NumPy package.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

https://pandas.pydata.org


The key data structure in pandas is the DataFrame – it helps to work with tabular data, represented as rows of observations and columns of features.

Download or fork the entire Jupyter notebook from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

pandas basics includes:

  • Series
  • Dataframes
    • Create
      • from list of tuples
      • from a dictionary
      • from a CSV
      • from built-in dataset (eg: from sklearn.datasets)
    • Data retrieval
    • Modifying data
    • Group by operation
    • Custom Functions – apply method
    • Pre-Processing
      • drop, mean, mode
      • ordinal feature
      • nominal feature
    • Reshaping
      • CrossTab
      • Merge
      • Melt
      • Pivot

# .info(), .head(), .sample() are handy methods to use first with a dataframe to get high-level details

# an index may not be unique – a lookup can return multiple values

# boolean indexing (masking) can help select a certain set of rows

# .isin() is useful when building a boolean index

# .where() is useful to retain shape of the original table

# Column names & Indexes can be set if needed

# to modify the table right away, use inplace=True

# aggregate operations can be applied on a groupby object

# dropna(), mean() or mode() are handy ways for pre-processing missing data

Key learnings …
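A few of these points can be tried on a small table. The sketch below uses a made-up dataframe (the city and fare columns are hypothetical) to show .isin() masking, .where() keeping the original shape, and an aggregate on a groupby object.

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({
    'city': ['Pune', 'Delhi', 'Pune', 'Agra'],
    'fare': [120.0, 80.0, 200.0, 50.0],
})

# .isin() builds a boolean index; masking keeps the matching rows
mask = df['city'].isin(['Pune', 'Agra'])
selected = df[mask]

# .where() keeps the original shape and puts NaN where the mask is False
fare_same_shape = df['fare'].where(mask)

# aggregate operation applied on a groupby object
totals = df.groupby('city')['fare'].sum()

print(selected)
print(fare_same_shape)
print(totals)
```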

Examples notebook includes:

  • Uber taxi drivers
  • Apple stock price
  • Day or Night
  • Students marks
  • Balance Calculator

# .describe() is a handy method to get the statistical summary of numerical columns

# one-hot-encoding is really helpful for nominal features (that cannot be ordered)

# converting the columns into right datatype helps

# converting data into meaningful numbers helps with analysis

# groupby is a powerful tool with dataframes for analysis

Key learnings …
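As a small sketch of two of these points, the made-up dataframe below (column names are hypothetical) converts a text column to the right datatype and one-hot encodes a nominal feature with pd.get_dummies.

```python
import pandas as pd

# Hypothetical data: marks arrive as text, shift is a nominal feature
df = pd.DataFrame({
    'shift': ['day', 'night', 'day'],
    'marks': ['71', '85', '64'],
})

# converting the column into the right datatype enables arithmetic
df['marks'] = df['marks'].astype(int)

# one-hot encoding: one indicator column per category, no implied order
encoded = pd.get_dummies(df, columns=['shift'])

print(df['marks'].describe())
print(encoded)
```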

Cheat sheet

Credit: Pandas website

Download cheat sheet pdf from here
For more details about pandas, look at the documentation reference.

Keep learning!

Python as statistics workbench

While reading up on AI/ML (Artificial Intelligence/Machine Learning), I came across a discussion on whether Python can be used as a “statistics workbench” to replace R, SPSS, etc. It was a nice exchange among several knowledgeable folks about the languages used for statistics problems, specifically R (read about R here).

Discussion here: https://stats.stackexchange.com/questions/1595/python-as-a-statistics-workbench

For quick reference, I will quote a few of the latest thoughts from there that are in favor of Python and how it has evolved. I too concur with most of them:

1. Python is easily the most intuitive syntax of any programming language. This makes for extremely fast development time.

2. Python is performant. It opens large datasets reliably.

3. The packages in Python are fast catching up to R’s packages. Python usage has increased tremendously last few years.

4. Readability is one of the most important qualities good code can possess, and Python is one of the most readable language.

5. Python has an extremely well-thought-out IDE now: PyCharm & Visual Studio Code.

https://stats.stackexchange.com/a/457753

Overall, Python is a general purpose language with an easy to understand syntax which would be relatively easier for usual programmers to learn/adopt. R is developed keeping statisticians in mind. Thus it has many features around data visualization and is a tad ahead currently.

A little research …

Recently, DataCamp too published an article comparing R and Python for data analysis. There is a nice comparison in it on various parameters, picking just a couple of them here:

Final analysis in the paper shares R being ahead in comparison for data analysis but Python having potential to catch up quickly and easily.

My thoughts …

My intent was to understand which programming language serves as an essential tool to demonstrate AI/ML capabilities. Looking at them, Python seems good enough for me to serve as an AI/ML tool to start with and probably conquer it.

Ammunition needed …

There are many Python-based libraries and packages that are generally used for statistical work. Below are a few of them that would help in our data analysis exploration going ahead:

  • scipy – python-based ecosystem of open-source software for mathematics, science, and engineering.
    • cookbook – many statistical facilities, a collection of various user-contributed recipes already available
    • numpy – base N-dimensional array package. Handful of example lists here
    • pandas – a fast, powerful, flexible and easy to use data analysis and manipulation tool
    • matplotlib – a comprehensive library for creating static, animated, and interactive visualizations
  • scikit-learn – simple and efficient machine learning tools for predictive data analysis
  • keras – API for deep learning
  • tensorflow – API to develop and train ML models

Since I am a programmer, I may be biased here. But it seems Python can do everything needed to start the AI/ML journey.

Happy learning!

NumPy – Basics & Examples

This is to get started with NumPy and try a few concrete examples. NumPy (Numerical Python) is a package for numerical computation designed for efficient work on large data sets.

The entire Jupyter notebook can be downloaded or forked from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

numpy-icon

Reference: https://numpy.org/learn/

NumPy basics includes:

  • Initialize Matrix via
    • List
    • NULL Matrix
    • IDENTITY Matrix
    • ONES Matrix
  • Matrix Transpose
  • Matrix Indexing
  • Simulation
  • Basic CSV file operations
  • Matrix Broadcasting
  • Basic Image Processing

# a matrix in Python is a list of lists

# arrays are compatible for broadcasting when the trailing dimensions match or either of them is of length 1

# when an image is read as numbers, the values are between 0 & 1

Key learnings …
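The broadcasting rule is easiest to see with shapes side by side. A minimal sketch, with array names chosen only for illustration:

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)   trailing dimension matches
col = np.array([[100], [200]])        # shape (2, 1) length-1 dimension stretches

added_row = matrix + row   # row is applied to every row of matrix
added_col = matrix + col   # col is applied to every column of matrix

# shape (2,) against (2, 3): trailing dimensions 2 and 3 differ
try:
    matrix + np.array([1, 2])
    incompatible = False
except ValueError:
    incompatible = True

print(added_row)
# [[10 21 32]
#  [13 24 35]]
print(added_col)
# [[100 101 102]
#  [203 204 205]]
print(incompatible)
# True
```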

Examples notebook includes:

  • Random walk simulation
  • Triangle simulation
  • Random Number
  • Correlation co-efficient
  • Mean/Variance of crude oil

# masking helps get all the values back that satisfy the mask

# cumsum() is a handy function for cumulative sum

# there are handy methods for random number generation

Key learnings …
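These three points combine naturally in a random walk. The sketch below is not the notebook's code; the seed, step count and names are assumptions made for illustration.

```python
import numpy as np

# Seeded generator so the run is repeatable
rng = np.random.default_rng(42)

# random number generation: 1000 steps of -1 or +1
steps = rng.choice([-1, 1], size=1000)

# cumsum() turns the steps into the position after each step
walk = np.cumsum(steps)

# masking returns all positions that satisfy the condition
above_start = walk[walk > 0]

print('Final position:', walk[-1])
print('Steps spent above the start:', len(above_start))
```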

For learning more about NumPy, look here: https://numpy.org/doc/stable/

Keep learning!

Python – Basics & Examples

This is to get started with Python and try a few concrete examples. It should help beginners to learn or others to do a quick revision without getting too deep.

Entire Jupyter notebook can be downloaded or forked from my GitHub to look or play around: https://github.com/sandeep-mewara/python-examples

I started Python programming using the Jupyter notebook web application. Later, I moved to Visual Studio Code, which looked much more user-friendly.

A guide on how to setup VS Code for Python is here.

Python basics includes:

  • Variables
  • Conditional statements
  • String manipulations
  • Type conversion
  • Formatting strings
  • Data Structure – List, Tuple
  • Functions
  • List comprehension
  • Zip & Pack

# items are indexed by integers, starting from 0.

# % is a format operator and %d, %s, %f are special format sequences

# negative index is used to access list elements from the end

# [start:end:step] Returns a new list from start to end-1 with default step 1

# zip can merge two lists into a list of tuples (wrap it in list() in Python 3)

Key learnings …
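A short sketch of these points in plain Python, using made-up names and marks:

```python
# Hypothetical data, only for illustration
names = ['Asha', 'Ben', 'Chen', 'Dev', 'Eli']
marks = [71, 85, 64, 92, 58]

first = marks[0]             # indexing starts from 0
last = marks[-1]             # negative index counts from the end
middle = marks[1:3]          # start to end-1
every_other = marks[0:5:2]   # step of 2

# zip returns an iterator in Python 3, so wrap it in list()
pairs = list(zip(names, marks))

# % format operator with %s, %d and %f sequences
line = '%s scored %d, average %.2f' % (
    names[0], marks[0], sum(marks) / len(marks))

print(first, last, middle, every_other)
# 71 58 [85, 64] [71, 64, 58]
print(pairs[0])
# ('Asha', 71)
print(line)
# Asha scored 71, average 74.00
```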

Examples notebook includes:

  • Palindrome
  • Sum of Squares
  • Sort students marks list
  • Format students marks list
  • Word Frequency

# sometimes anonymous functions are enough

# storing data in dictionary as key-value pair helps

Key learnings …
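Both points show up in a word frequency count. This is a minimal sketch with a made-up sentence, not the notebook's own solution:

```python
text = 'the quick brown fox jumps over the lazy dog the end'

# dictionary as key-value store: word -> number of occurrences
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

# anonymous function as sort key: highest count first, then alphabetical
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(ranked[:3])
# [('the', 3), ('brown', 1), ('dog', 1)]
```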

Keep learning!