Mastering the SKILL.md File in Agentic AI: A Complete Guide

In modern Agentic AI architectures, the primary engineering challenge is no longer generating language, but bridging the gap between conversational intent and reliable, repeatable and unambiguous execution. To achieve this, we must treat agent capabilities not as conversational shortcuts, but as well-defined engineering assets.

skill-md-agentic-ai.png


This requires a standardized contract for capability execution. That’s where SKILL.md comes in: a formal, machine-parsable definition file that acts as a Standard Interoperability Definition (SID) contract for systematic task execution within an agentic framework.

In this blog, I’ll dive deep into SKILL.md and share how it serves as a single source of truth for both conceptual planning (roles) and procedural execution (workflows) that power an automated, engineering-grade SDLC.

The Architectural Blueprint: The SKILL.md

SKILL.md is structured as an engineering specification, designed for zero-ambiguity parsing by an LLM like Claude. It defines the contract for interoperability, forcing teams to move from conversational requests to precise capability definitions.

Anatomy of an Engineering Contract

The specification consists of five required metadata fields that are immutable and machine-parsable:

  • Name: An immutable, unique, system-wide identifier for the capability (e.g., internal-token-manager-v1, exec-raise-github-pr-v1, or sdlc-pm-v1). This is the system’s handle for the skill.


  • Description: Critically, this is not a summary. It is the definitive Trigger Event Definition. It must be written from the perspective of an event, user query or internal signal that activates this capability, allowing the framework to perform accurate skill matching. Example: “Triggers automatically after a successful code analysis scan…”


  • Commands: A list of executable operations or prompts defined by the contract. For procedural skills, these map to API endpoints or internal function calls. For conceptual skills, these map to defined prompt sequences. Example: get-linter-report(timestamp) or refresh-token(service_id).


  • Constraints: A critical safety and resource management section. It defines the limits, rules and error conditions of the contract. Example: “Internal authentication tokens must expire after 1 hour.”


  • Examples: These are not suggestions but are the gold standard of Expected Behavior. They define the intended output for specific input scenarios, providing the LLM with a definitive blueprint for successful execution and reducing non-deterministic output.
# Code Snippet 1: Sample Procedural SKILL.md (Raise GitHub PR)
---
# REQUIRED METADATA FIELDS (SID CONTRACT)

name: exec-raise-github-pr-v1
description: Triggers automatically after a successful 'exec-linter-code-analyzer-v1' scan or upon user request to systematically raise a new pull request on GitHub for reviewed code.
commands:
  - create-pr(repository_url, head_branch, base_branch, title, body)
constraints:
  - Must use a valid GitHub API token with 'repo' scope.
  - Head branch must differ from the base branch.
---

### Expected Behavior (Examples)

When this skill is matched against a standard JavaScript repository:
  - Input: create-pr("https://github.com/org/repo.git", "feat/new-api", "main", "Feat: Add API v2", "This PR introduces...")
  - Execution: Loads 'scripts/create_pr.py'.
  - Output: New PR URL.
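
To make the "machine-parsable" claim concrete, here is a minimal sketch of how a framework might read the metadata block of Code Snippet 1 and check that the contract fields are present. The function names (parse_skill_front_matter, missing_fields) are hypothetical, the parser only understands the simple "key: value" and "- item" layout used in these snippets (it is not a general YAML parser), and treating name, description, commands and constraints as the required metadata keys is an assumption based on the samples in this post.

```python
REQUIRED_FRONT_MATTER = ("name", "description", "commands", "constraints")


def parse_skill_front_matter(text):
    """Parse the simple metadata layout used in the snippets above.

    Handles only 'key: value' lines and '- item' list entries between
    the two '---' markers. It is not a general YAML parser.
    """
    lines = [line.strip() for line in text.splitlines()]
    try:
        start = lines.index("---")
        end = lines.index("---", start + 1)
    except ValueError:
        raise ValueError("metadata must sit between two '---' lines") from None

    fields = {}
    current_list_key = None  # key that is currently collecting list items
    for line in lines[start + 1:end]:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("- "):
            if current_list_key is None:
                raise ValueError("list item without a list key: " + line)
            fields[current_list_key].append(line[2:].strip())
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            fields[key] = value          # scalar field
            current_list_key = None
        else:
            fields[key] = []             # list field, items follow
            current_list_key = key
    return fields


def missing_fields(fields):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_FRONT_MATTER if not fields.get(key)]


SAMPLE = """---
# REQUIRED METADATA FIELDS (SID CONTRACT)

name: exec-raise-github-pr-v1
description: Triggers automatically after a successful scan or upon user request.
commands:
  - create-pr(repository_url, head_branch, base_branch, title, body)
constraints:
  - Must use a valid GitHub API token with 'repo' scope.
  - Head branch must differ from the base branch.
---

### Expected Behavior (Examples)
"""

if __name__ == "__main__":
    parsed = parse_skill_front_matter(SAMPLE)
    print(parsed["name"], "missing:", missing_fields(parsed))
```

A check like this could run in CI so that a skill with an incomplete contract never reaches the agent.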

Directory Structure & Progressive Disclosure

The SKILL.md is packaged within a defined directory structure, ensuring all supporting assets are decoupled and version-controlled alongside the specification.

skill-folder-structure.jpg

.Sandeep Mewara Github

  • 📄 SKILL.md (The only required asset, containing the definitions and contract).
  • 📁 scripts/ (Optional: Decoupled logic – Python, Bash, Node.js, etc. The implementation details of the contract).
  • 📁 references/ (Optional: Docs, checklists, design patterns or standards the skill must adhere to).
  • 📁 assets/ (Optional: Templates or sample data).

This decoupled architecture enables the Progressive Disclosure Pattern, which is critical for system efficiency and managing token constraints. A high-performance agentic system should not load every asset for every skill simultaneously. Progressive disclosure ensures assets are loaded only when necessary.

skill-md-activation-flow.jpg


Agents don’t load everything at once. They discover and expand context only when needed.
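
As a rough illustration of progressive disclosure, the sketch below reads only the name and description at discovery time and defers the full SKILL.md body and any reference files until the skill is activated. The class LazySkill, the helper make_demo_skill and the demo folder contents are hypothetical; a real framework will have its own loading mechanism.

```python
import tempfile
from pathlib import Path


class LazySkill:
    """Holds lightweight metadata up front; reads heavier assets on demand."""

    def __init__(self, folder):
        self.folder = Path(folder)
        self.name, self.description = self._read_header()
        self._body = None  # full SKILL.md text, loaded only on activation

    def _read_header(self):
        name = description = ""
        with open(self.folder / "SKILL.md", encoding="utf-8") as handle:
            for line in handle:
                if line.startswith("name:"):
                    name = line.split(":", 1)[1].strip()
                elif line.startswith("description:"):
                    description = line.split(":", 1)[1].strip()
                if name and description:
                    break  # stop as soon as the discovery data is known
        return name, description

    def activate(self):
        """Load the full specification only when the skill is matched."""
        if self._body is None:
            self._body = (self.folder / "SKILL.md").read_text(encoding="utf-8")
        return self._body

    def load_reference(self, relative_path):
        """Load a supporting file (e.g. under references/) only when asked."""
        return (self.folder / relative_path).read_text(encoding="utf-8")


def make_demo_skill(root):
    """Create a tiny, made-up skill folder for the demonstration."""
    folder = Path(root) / "sdlc-pm-v1"
    (folder / "references").mkdir(parents=True)
    (folder / "SKILL.md").write_text(
        "---\nname: sdlc-pm-v1\n"
        "description: Triggers during project initiation.\n---\n\n"
        "### Expected Behavior (Examples)\n",
        encoding="utf-8",
    )
    (folder / "references" / "agile-standards.md").write_text(
        "# Agile standards\n", encoding="utf-8"
    )
    return folder


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        skill = LazySkill(make_demo_skill(tmp))
        print("discovered:", skill.name)          # metadata only so far
        print("body chars:", len(skill.activate()))  # full file read here
```

The point of the design is that discovery cost stays proportional to the number of skills, while the larger assets are paid for only by the skills that actually run.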

Architecting the Automated SDLC

The standardization offered by SKILL.md allows us to architect and separate the dynamic pillars of an automated SDLC, managing all capabilities via this single specification. In a professional lifecycle, conceptual setup (Defining Roles) always precedes procedural execution (Executing Workflows).

Conceptual Role-Based Skills: Defining the Contract for a Persona (Planning & Setup)

To initiate any SDLC phase (e.g., Requirements), we must first define the conceptual frameworks, knowledge bases and systematic planning workflows of specific roles that help organise content by domain (behaviour-driven). We apply the identical SKILL.md standard to define a persona’s “mindset”.

  • WHAT: SKILL.md definitions for Product Manager Persona or Lead Developer Persona.


  • APPLICATION: During the “Requirements” and “Design” phases of the SDLC.


  • ARCHITECTURAL FLOW: During planning, you activate the Product Manager Persona (Code Snippet 2). Claude adopts this mindset and leverages knowledge references (e.g., Agile standards) and the command contract (draft-prd(user_stories)) to provide focused, high-quality requirements.
Code Snippet 2: Sample Conceptual SKILL.md (Product Manager)
---
# REQUIRED METADATA FIELDS (SID CONTRACT)

name: sdlc-pm-v1
description: Triggers during project initiation to define the persona, responsibilities, knowledge base and systematic planning workflows of a senior Product Manager.
commands:
  - draft-prd(user_stories, acceptance_criteria)
  - run-feature-prioritization(prd_document)
constraints:
  - Must reference files in the optional 'references/' directory (e.g., 'references/agile-standards.md') for all Agile terminology.
---

### Expected Behavior (Examples)

When this skill is matched to a new project request:
  - Input: draft-prd(user_stories, acceptance_criteria)
  - Execution: Loads 'references/agile-standards.md' to define terminology.
  - Output: A structured PRD document based on the internal persona.

External Workflow Execution Skills: Defining the Contract for the Workflow to ‘Do’

Once the groundwork is established and the build begins, the agent’s focus shifts to user-triggered workflows (e.g., after a commit). These skills are guides that help perform specific, measurable steps in the automated pipeline, providing the user with domain-specific results (task-driven).

  • WHAT: SKILL.md definitions for exec-linter-code-analyzer, exec-raise-github-pr, or jira-ticket-update.


  • APPLICATION: During the “Build,” “Test” and “Deploy” phases of the SDLC, typically automated by CI/CD events.


  • ARCHITECTURAL FLOW: After a successful code implementation event, the framework activates exec-linter-code-analyzer-v1 (Code Snippet 3). Claude reads the inputs and expected behavior. The framework executes the decoupled logic (scripts/) to run the analysis and return the linter report. A successful scan then triggers exec-raise-github-pr-v1 (Code Snippet 1), which systematically creates the pull request, ensuring a reliable result (the PR URL) is provided back to the user’s workflow or CI/CD pipeline.
Code Snippet 3: Sample Procedural SKILL.md (Code Analyzer Workflow)
---
# REQUIRED METADATA FIELDS (SID CONTRACT)
name: exec-linter-code-analyzer-v1
description: Triggers automatically after a code commit event to execute a static analysis and linter scan on the modified files in a specific repository, providing a systematic JSON report.
commands:
  - run-analysis(repository_url, branch)
constraints:
  - Must use a valid GitHub API token with 'repo' scope.
---

### Expected Behavior (Examples)
When this skill is matched following a code commit:
  - Input: run-analysis("https://github.com/org/repo.git", "main")
  - Execution: Loads 'scripts/run_analysis.py'.
  - Output: Linter report JSON.

Internal Agent Operational Skills: Defining the Contract for the Software to ‘Be’

To ensure system stability, the agent software itself requires precise, standardized contracts for core operational tasks (like authentication, state, error handling, api-call, etc). These skills are operational and invisible to the SDLC workflow itself. They focus on the agent’s internal robustness and platform integrity.

  • WHAT: SKILL.md definitions for internal-token-manager or agent-state-historian.


  • APPLICATION: Triggered automatically by the agent’s orchestration layer during defined lifecycle events (e.g., establishing a session state, refreshing a token after a 401 response).


  • ARCHITECTURAL FLOW: When any skill requires access to a restricted API, it activates the internal-token-manager (Code Snippet 4). Claude reads the command contract (refresh-token(service_id)). The framework executes the decoupled logic (scripts/) to refresh the secure token, ensuring the agent software can authenticate without creating brittle, direct credential dependencies in the domain-level skills. This internal complexity is hidden from the user but critical for security and robustness.
Code Snippet 4: Sample Procedural SKILL.md (Token Manager)
---
# REQUIRED METADATA FIELDS (SID CONTRACT)
name: internal-token-manager-v1
description: An internal operational skill that triggers throughout a workflow when the agent detects it requires a secure token to authenticate against an external service (e.g., GitHub, Slack, Splunk).
commands:
  - refresh-token(service_id)
constraints:
  - Must use a valid agent credential secret (e.g., 'agent_platform_secret').
  - Tokens must expire after 1 hour.
---

### Expected Behavior (Examples)

When this skill is matched because a GitHub operation requires auth:
  - Input: refresh-token("github_api")
  - Execution: Loads 'scripts/refresh_token.py'.
  - Output: New OAuth token JSON.

The Boundary of Autonomy and the Expertise Gap

While standardizing capabilities via SKILL.md is essential, I believe it is critical for architects to also define where SKILL.md is not the right tool. My own perspective, based on recent project implementation, is that a common architectural failure is expecting SKILL.md to easily encode true Domain Expertise and Heuristic Judgment.

Offloading Heuristics vs. Offloading Wisdom

A well-defined SKILL.md is designed to be precise, measurable and standardized. It excels at offloading common known items, standard checklists and systematic patterns into reliable workflows (as seen in our Code Snippets 3 & 4). In my recent project, this precision made the skills function as excellent fixed checklists, significantly reducing operational ambiguity.

This same precision, however, means a skill can act only as a checklist. A procedural skill like exec-linter-code-analyzer can identify a syntax error based on a rule, but I found it often lacked the domain wisdom to understand the conceptual design decision that led to that error.

Assisting Expertise, Not Replacing It

Based on the experience so far, I believe that you cannot easily encode a senior engineer’s years of nuanced design thinking into a SKILL.md description. The true architectural value of a standardized specification is that it offloads the reliable execution complexity, allowing the Human Expert (or a high-level Agentic Persona) to focus entirely on core domain and design reasoning.

For now, I believe a model that defines three distinct pillars of knowledge will work well:

  1. Systematic Workflows (Procedural Skills): Handled perfectly by SKILL.md. (The “What to Do”)
  2. Conceptual Frameworks (Persona Mindsets): Setup by SKILL.md. (How Claude “Thinks”)
  3. Domain Wisdom & Design Reasoning: Passed as the problem context in the main prompt. (Why Claude “Decides”)

Engineering Best Practices for SKILL.md Mastery

Achieving systematic capability definition requires adhering to these foundational best practices:

  1. Strict Decoupling: Never place the execution logic (e.g., Python code) directly within the SKILL.md file. The SKILL.md is the specification & the scripts/ directory is the implementation.


  2. Immutability: Once a skill is deployed, treat its metadata (Name, Description, Commands) as immutable. Any significant change requires a new version (e.g., exec-raise-github-pr-v2). Brittleness often stems from changing definitions in place.


  3. Description as a Trigger: Never write a summary description (e.g., “This skill runs a linter”). It must be written as a trigger definition (e.g., “Triggers automatically after a context save event…”). Skill matching depends entirely on this accuracy.


  4. Token Economy: Adhere to strict size constraints: < 500 lines and < 5k tokens for the SKILL.md. The Progressive Disclosure pattern will handle heavier assets, keeping the SID itself focused and parseable.


  5. Git-Managed Context: Treat SKILL.md files as code. They must be version-controlled in Git, promoting discoverability, reuse and providing a traceable history of how capabilities have evolved throughout the lifecycle.
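
The size limits from practice 4 are easy to enforce automatically. Below is a minimal sketch of such a check. The function name check_skill_size is hypothetical, and the token count is only a rough estimate that assumes about four characters per token; a real pipeline would use the tokenizer of the model it targets.

```python
def check_skill_size(text, max_lines=500, max_tokens=5000, chars_per_token=4):
    """Return a list of size problems for a SKILL.md body (empty if none).

    The token figure is a heuristic (characters divided by chars_per_token),
    not an exact count from any real tokenizer.
    """
    line_count = len(text.splitlines())
    estimated_tokens = len(text) // chars_per_token

    problems = []
    if line_count >= max_lines:
        problems.append(f"{line_count} lines (limit: under {max_lines})")
    if estimated_tokens >= max_tokens:
        problems.append(
            f"about {estimated_tokens} tokens (limit: under {max_tokens})"
        )
    return problems


if __name__ == "__main__":
    small = "name: demo-skill-v1\ndescription: Triggers on request.\n"
    print(check_skill_size(small))          # no problems expected
    print(check_skill_size("x\n" * 600))    # too many lines
```

Run as a pre-commit hook or CI step, a check like this keeps oversized content moving into references/ or assets/ instead of the specification itself.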

Final Thought: A Standard for Scaling Autonomy

By adopting the SKILL.md specification, we move from fuzzy conversational AI to a structured engineering discipline, where all agent capabilities, whether internal operational requirements, external user workflows or conceptual role frameworks, are defined by precise, version-controlled contracts.

This foundation standardizes reliable execution complexity, not only making your automated SDLC predictable and robust but also ensuring that precious domain expertise remains focused on the main design decisions, not common patterns. Mastering the SKILL.md standard lays a definitive, interoperable foundation for building scalable, maintainable and engineering-grade Agentic AI architectures.


Find missing number from 1 to N?

Last week, there was a discussion in my team on the problem of finding missing number(s). We had different thoughts and approaches and thus I thought to share it across.

find-missing-number

Problem statement was something like:

– An array of size (n) has numbers from 1 to (n+1). Find the missing one number.

– An array of size (n) has numbers from 1 to (n+2). Find the missing two numbers.

First thought …

Keep track of numbers found while traversing. At the end, use it to find the missing number. So kind of brute force approach.

We can maintain a hash or a boolean array covering the full number range and keep updating the hash or the array index location based on the number found while traversing. We then use it to find the missing number. It covers both the one and the two missing numbers cases.

This would have two traversals of n (one for filling in the structure and another to find the missing one). Thus overall, time complexity of O(n). This would need an extra space to keep track of all numbers found and thus a space complexity of O(n).
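
A minimal sketch of this first approach, using a boolean list as the tracking structure. The function name find_missing_with_flags and the missing_count parameter are my own choices for illustration.

```python
def find_missing_with_flags(nos, missing_count):
    """Return the numbers in 1..len(nos)+missing_count that are not in nos."""
    upper = len(nos) + missing_count
    seen = [False] * (upper + 1)  # index 0 is unused

    # first traversal: mark every number we come across
    for value in nos:
        seen[value] = True

    # second traversal: collect the positions never marked
    return [i for i in range(1, upper + 1) if not seen[i]]


print(find_missing_with_flags([4, 2, 1, 6, 5, 7], 1))  # one missing number
print(find_missing_with_flags([4, 2, 1, 6, 7, 5], 2))  # two missing numbers
```

The same function covers both problem statements, at the cost of the O(n) tracking list.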

Q: Now, can we avoid extra space or two times traversal?

Second thought …

We know how to calculate the sum of n natural numbers, i.e.: n*(n+1)/2. With it, we can traverse the given array and keep a sum of all numbers. Difference of the sum from formula to sum found would give us the missing number. Nice!

# Keep track of sum
def sumOfGivenNumbers(nos, n):
    sum = 0
    # calculate sum
    for i in range(0, n):
        sum += nos[i]
    return sum

# Input
numbers = [4, 2, 1, 6, 5, 7] 

# number range 
n = len(numbers) + 1
# integer division keeps the result an int
expectedSum = n*(n+1)//2
numbersSum = sumOfGivenNumbers(numbers,len(numbers))

print('Missing number:', expectedSum - numbersSum)

# Output
# Missing number: 3

This would help us solve for one missing number in a single traversal, thus a time complexity of O(n). No extra space was used and thus a space complexity of O(1).

Q: Can we extend this to two missing numbers now?

Yes, we can extend it. Along with the sum, we can also use the product of n natural numbers as a second expression. With it, we will have two equations and two numbers to find:

Missing1 = x1
Missing2 = x2
Sum of provided numbers = N1
Sum of n Natural numbers = N
Product of provided numbers = P1
Product of n Natural numbers = P

x1 + x2 + N1 = N
x1 * x2 * P1 = P

We can solve it to find the two missing numbers. It does have the quadratic flavor associated though. It maintains the time complexity as O(n) and space complexity as O(1). Nice!
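
One way to solve the two equations, as a sketch: with s = x1 + x2 and p = x1 * x2, the missing numbers are the roots of x² - s·x + p = 0. The function name below is hypothetical. Python integers have arbitrary precision, so this version does not overflow, but the product grows factorially, which is exactly the concern raised in the next question for fixed-width integers.

```python
import math


def two_missing_by_sum_and_product(nos):
    """Find the two numbers missing from 1..len(nos)+2 via sum and product."""
    n = len(nos) + 2
    s = n * (n + 1) // 2 - sum(nos)           # x1 + x2
    p = math.factorial(n) // math.prod(nos)   # x1 * x2

    # roots of x^2 - s*x + p = 0; the discriminant equals (x2 - x1)^2
    root = math.isqrt(s * s - 4 * p)
    return (s - root) // 2, (s + root) // 2


print(two_missing_by_sum_and_product([4, 2, 1, 6, 7, 5]))
```

For the sample input, s is 11 and p is 24, so the discriminant is 25 and the roots are 3 and 8.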

Q: Does the solution help with large integers? Think of possible overflow?

Third thought …

Let’s look at possible way for 1 missing number first.

We will traverse through all the numbers of the array. While doing so, maintain a number that would be sum of all numbers traversed so far reduced by sum of all the indexes traversed (+1 if index starts from 0). It is still making use of n natural numbers (in form of indexes) to keep a check on sum to a defined limit.

# Keep track of sum
def getMissingNumber(nos, n):
    sum = 0
    # calculate sum
    for i in range(0, n):
        sum += (i+1)
        sum -= nos[i]

    # last number to add from n+1 natural nos.
    return sum+n+1

# Input
numbers = [4, 2, 1, 6, 5, 7] 

missingNumber = getMissingNumber(numbers,len(numbers))

print('Missing number:', missingNumber)

# Output
# Missing number: 3

This looks good and we maintain the same complexities along with solving for overflow.

We can probably try a similar thing for two missing numbers, where we keep multiplying and dividing the traversed number by the index, but it could still have overflow issues in the worst case. Further, there could be round-off issues.

Fourth thought …

Looking further, it seems we can make use of the XOR operation to find the missing numbers. XOR nullifies duplicate pairs (a ^ a = 0). We will take the XOR of the provided numbers and the XOR of the natural numbers. Combining both again with XOR leaves us with the XOR of the missing numbers.

For one missing number, this is easy and clears all the hurdles discussed earlier while keeping the same performance.

# Keep track of XOR data
def getMissingNumber(nos, n):
    x1 = nos[0]
    xn = 1

    # start from second
    for i in range(1, n):
        x1 = x1 ^ nos[i]
        xn = xn ^ (i+1)
    
    # last number to XOR
    xn = xn ^ (n+1)

    # find the missing number
    return x1 ^ xn

# Input
numbers = [4, 2, 1, 6, 5, 7] 

missingNumber = getMissingNumber(numbers,len(numbers))

print('Missing number:', missingNumber)

# Output
# Missing number: 3

For two missing numbers, using the same XOR logic as above, we get the XOR value of both missing numbers. Since the two missing numbers are different, this XOR value is not zero, and any bit set to “1” in it must be set in exactly one of missing1 and missing2.

# Keep track of XOR data
def getTwoMissingNumber(nos, n):
    x1 = nos[0]
    xn = 1

    # start from second
    for i in range(1, n-2):
        x1 = x1 ^ nos[i]
        xn = xn ^ (i+1)
    
    # last numbers to XOR
    xn = xn ^ (n-1) ^ (n)

    # XOR of two missing numbers
    # Any set bit in it must be 
    # set in one missing and 
    # unset in other missing number 
    XOR = x1 ^ xn

    # Get a rightmost set bit of XOR  
    set_bit_no = XOR & ~(XOR-1) 
  
    # Divide elements in two sets 
    # by comparing rightmost set bit of XOR 
    # with bit at same position in each element. 
    x = 0
    y = 0 
    for i in range(0,n-2): 
        if nos[i] & set_bit_no:    
            # XOR of first set in nos[]  
            x = x ^ nos[i]   
        else: 
            # XOR of second set in nos[]  
            y = y ^ nos[i]   

    for i in range(1,n+1): 
        if i & set_bit_no: 
            # XOR of first set in 1..n  
            x = x ^ i        
        else: 
            # XOR of second set in 1..n  
            y = y ^ i
    
    print ("Missing Numbers: %d %d"%(x,y)) 
    return

# Input
numbers = [4, 2, 1, 6, 7, 5] 

# total length will be provided count+2 missing ones
getTwoMissingNumber(numbers, len(numbers) + 2)

# Output
# Missing Numbers: 3 8

This overcomes the overflow issue and was easier to solve (compared to solving a quadratic equation). Though it took more than one traversal, overall it maintains the time complexity as O(n) and space complexity as O(1). Nice!

Closure …

There could be multiple ways to solve for one or more missing numbers. One can look at it based on ease and need.


Keep solving!


How to solve Word Ladder Problem?

Sometime back, a colleague of mine asked me about the word ladder problem. She was looking for a change. So, I believe she stumbled across this while preparing for data structures and algorithms.

graph-header

Problem Statement

Typically, the puzzle shared is a flavor of below:

Find the smallest number of transformations needed to change an initial word to a target word of same length. In every transformation, change only one character and make sure word exists in the given dictionary.

Explanation

Assuming all these 4-letter words are in the dictionary provided, it takes a minimum of 4 transformations to convert the word SAIL to RUIN, i.e.
SAIL -> MAIL -> MAIN -> RAIN -> RUIN

The intent here is to learn about graph algorithms. So, what are graphs in the context of algorithms and how do we apply them to solve such problems?

Graph Data Structure

Graphs are structures that represent how entities are connected with each other. Visually, they are represented with the help of a Node (Vertex) & an Edge (Connector).

graph-general

A tree is an undirected graph in which any two nodes are connected by exactly one path. In it, each node (except the root node) has exactly one parent node.

A common way to represent a graph is using an Adjacency matrix. In it, element A[i][j] is 1 if there is an edge from node i to node j, or else it is 0. For example, the adjacency matrix of the above undirected graph is:

  | 1 2 3 4
------------
1 | 0 1 0 1
2 | 1 0 1 0
3 | 0 1 0 1
4 | 1 0 1 0

Another common way is via Adjacency list. (List format of the data instead of a matrix.)
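
As a small illustration, the adjacency matrix above can be converted into an adjacency list, here held in a Python dictionary that maps each node to its neighbours. The function name matrix_to_adjacency_list is my own.

```python
def matrix_to_adjacency_list(matrix, labels):
    """Convert an adjacency matrix into a dict of node -> list of neighbours."""
    return {
        labels[i]: [labels[j] for j, edge in enumerate(row) if edge]
        for i, row in enumerate(matrix)
    }


# The 4-node undirected graph from the matrix above
MATRIX = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]

adjacency = matrix_to_adjacency_list(MATRIX, [1, 2, 3, 4])
print(adjacency)
# {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
```

The list form stores only the edges that exist, which saves space on sparse graphs, while the matrix answers "is there an edge between i and j" in constant time.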

Related Algorithms

Graphs are applied in search algorithms. Traversing the nodes and edges in a defined order helps in optimizing search. Common approaches to traversing a graph are:

Breadth First Search (BFS)

Given a graph G and a starting node s, search proceeds by exploring edges in the graph to find all the nodes in G for which there is a path from s. With this approach, it finds all the nodes that are at a distance k from s before it finds any nodes that are at a distance k+1.

For easy visualization, think of it as, in a tree, finding all the child nodes of a parent node as the first step. After that, find all the grandchildren and so forth.
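
The level-by-level behaviour can be seen in a short sketch that records the distance of every reachable node from the start. The graph used is the 4-node example from the adjacency matrix section, written as an adjacency list; the function name bfs_distances is my own.

```python
from collections import deque


def bfs_distances(graph, start):
    """Return a dict of node -> number of edges from start, using BFS."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distances:  # first visit is the shortest
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


GRAPH = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
print(bfs_distances(GRAPH, 1))
# {1: 0, 2: 1, 4: 1, 3: 2}
```

Nodes 2 and 4 (distance 1) are both recorded before node 3 (distance 2), which is the k before k+1 property described above.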

Depth First Search (DFS)

Given a graph G and a starting node s, search proceeds by exploring edges in the graph to find all the nodes in G reachable from s through its edges. With this approach, we go deep into the graph, connecting as many nodes as possible along one path, and branch where necessary.

For easy visualization, think of it as, in a tree, finding all the family nodes for a parent node. With this, for a given node, we connect its children, grandchildren, great-grandchildren and so on before moving to the next node of the same level.

Thus, with DFS approach, we can have multiple deduced trees.

Knight’s tour is a classic example that leverages Depth First Search algorithm.
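
The depth-first order described above can be sketched with a short recursive function. The tree used here is a made-up example, and dfs_order is a name chosen for illustration.

```python
def dfs_order(graph, start, visited=None):
    """Return the nodes reachable from start in depth-first visiting order."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbour in graph[start]:
        if neighbour not in visited:
            # go as deep as possible before trying the next neighbour
            dfs_order(graph, neighbour, visited)
    return visited


# A small example tree: A has children B and C, B has D and E, C has F
TREE = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}

print(dfs_order(TREE, "A"))
# ['A', 'B', 'D', 'E', 'C', 'F']
```

The whole family under B (D and E) is visited before C, whereas BFS on the same tree would visit B and C first.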

Shortest Path First OR Dijkstra’s Algorithm (SPF)

Given a graph G and a starting node s, find the shortest path to reach node d. It uses the concept of edge weights. It’s an iterative algorithm that explores the graph much like BFS.

Many real-world examples fit in here, e.g. what would be the shortest path from home to office.

With BFS (a simple queue), we visit one node at a time, whereas in SPF (a priority queue), we visit the node with the lowest cost at any level. In a sense, BFS is Dijkstra's algorithm taken a step at a time with all edge weights equal to 1. The process for exploring the graph is structurally the same in both cases. BFS is preferred for graphs with equal weights, because operations on a priority queue are O(log n) compared to O(1) for operations on a regular queue.
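
To show how the priority queue changes the traversal, here is a minimal sketch of Dijkstra's algorithm using Python's heapq module. The road network and its weights are invented for the home-to-office example, and the function name dijkstra is my own.

```python
import heapq


def dijkstra(graph, source):
    """Return a dict of node -> lowest total cost from source.

    graph maps each node to a list of (neighbour, weight) pairs
    with non-negative weights.
    """
    distances = {source: 0}
    heap = [(0, source)]  # priority queue ordered by cost so far
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > distances.get(node, float("inf")):
            continue  # stale entry: a cheaper route was already found
        for neighbour, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < distances.get(neighbour, float("inf")):
                distances[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return distances


# Hypothetical road network with travel costs
ROADS = {
    "home": [("cafe", 2), ("park", 5)],
    "cafe": [("park", 1), ("office", 7)],
    "park": [("office", 3)],
    "office": [],
}

print(dijkstra(ROADS, "home"))
# {'home': 0, 'cafe': 2, 'park': 3, 'office': 6}
```

If every weight were 1, the heap would hand nodes back in the same level order as a plain queue, which is why BFS is the simpler choice for equal-weight graphs such as the word ladder.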

Code

I will be using a breadth first graph algorithm here based on the problem need:

from collections import deque

class Solution(object):
    # method that will help find the path
    def ladderLength(self, beginWord, 
                        endWord, wordList):
        """
        :type beginWord: str
        :type endWord: str
        :type wordList: List[str]
        :rtype: int
        """

        # Queue for BFS
        queue = deque()

        # start by adding begin word
        queue.append((beginWord, [beginWord]))

        while queue:
            # let's keep a watch at active queue
            print('Current queue:',queue)

            # get the current node and 
            # path how it came
            node, path = queue.popleft()

            # let's keep track of path length 
            # traversed so far
            print('Current transformation count:',
                                        len(path))

            # find possible next set of 
            # child nodes, 1 diff
            for next in self.next_nodes(node, 
                            wordList) - set(path):
                # traversing through all child nodes
                # if any of the child matches, 
                # we are good               
                if next == endWord:
                    print('found endword at path:',
                                            path)
                    return len(path)
                else:
                    # keep record of next 
                    # possible paths
                    queue.append((next, 
                                path + [next]))
        return 0

    def next_nodes(self, word, word_list):
        # start with empty collection
        possiblenodes = set()

        # all the words are of fixed length
        wl_word_length = len(word)

        # loop through all the words in 
        # the word list
        for wl_word in word_list:
            mismatch_count = 0

            # find all the words that are 
            # only a letter different from 
            # current word those are the 
            # possible next child nodes
            for i in range(wl_word_length):
                if wl_word[i] != word[i]:
                    mismatch_count += 1
            if mismatch_count == 1:
                # only one alphabet different-yes
                possiblenodes.add(wl_word)
        
        # lets see the set of next possible nodes 
        print('possible next nodes:',possiblenodes)
        return possiblenodes

# Setup
beginWord = "SAIL"
endWord = "RUIN"
wordList = ["SAIL","RAIN","REST","BAIL","MAIL",
                                    "MAIN","RUIN"]

# Call
print('Transformations needed: ',
    Solution().ladderLength(beginWord, 
                            endWord, wordList))

# Transformations expected == 4
# One possible shortest path with 4 transformations:
# SAIL -> MAIL -> MAIN -> RAIN -> RUIN

Used deque (double-ended queue) of Python

deque helps with quicker append and pop operations from both ends. It has O(1) time complexity for append and pop operations at either end. In comparison, a list is O(1) only at its right end; popping from or inserting at the front of a list is O(n).

A quick look at the code workflow to validate that all nodes at a particular distance were traversed first before moving to the next level:

Current queue: deque([('SAIL', ['SAIL'])])

Current transformation count: 1
possible next nodes: {'BAIL', 'MAIL'}
Current queue: deque([('BAIL', ['SAIL', 'BAIL']), 
                      ('MAIL', ['SAIL', 'MAIL'])])

Current transformation count: 2
possible next nodes: {'SAIL', 'MAIL'}
Current queue: deque([('MAIL', ['SAIL', 'MAIL']), 
                      ('MAIL', ['SAIL', 'BAIL', 
                       'MAIL'])])

Current transformation count: 2
possible next nodes: {'BAIL', 'MAIN', 'SAIL'}
Current queue: deque([('MAIL', ['SAIL', 'BAIL', 
                                'MAIL']), 
                      ('BAIL', ['SAIL', 'MAIL', 
                                'BAIL']), 
                      ('MAIN', ['SAIL', 'MAIL', 
                                'MAIN'])])

Current transformation count: 3
possible next nodes: {'BAIL', 'MAIN', 'SAIL'}
Current queue: deque([('BAIL', ['SAIL', 'MAIL', 
                                'BAIL']), 
                      ('MAIN', ['SAIL', 'MAIL', 
                                'MAIN']), 
                      ('MAIN', ['SAIL', 'BAIL', 
                                'MAIL', 'MAIN'])])

Current transformation count: 3
possible next nodes: {'SAIL', 'MAIL'}
Current queue: deque([('MAIN', ['SAIL', 'MAIL', 
                                'MAIN']), 
                      ('MAIN', ['SAIL', 'BAIL', 
                                'MAIL', 'MAIN'])])

Current transformation count: 3
possible next nodes: {'RAIN', 'MAIL'}
Current queue: deque([('MAIN', ['SAIL', 'BAIL', 
                                'MAIL', 'MAIN']), 
                      ('RAIN', ['SAIL', 'MAIL', 
                                'MAIN', 'RAIN'])])

Current transformation count: 4
possible next nodes: {'RAIN', 'MAIL'}
Current queue: deque([('RAIN', ['SAIL', 'MAIL', 
                                'MAIN', 'RAIN']), 
                      ('RAIN', ['SAIL', 'BAIL', 
                        'MAIL', 'MAIN', 'RAIN'])])

Current transformation count: 4
possible next nodes: {'MAIN', 'RUIN'}
found endword at path: ['SAIL', 'MAIL', 'MAIN', 
                                        'RAIN']

Transformations needed:  4
Overall path: ['SAIL', 'MAIL', 'MAIN', 
                               'RAIN', 'RUIN']

Complexity

For above code that I used to find the shortest path for transformation:

Time

In next_nodes, for each word in the word list, we iterated over its length to find all the intermediate words corresponding to it. Thus we did M×N iterations, where M is the length of each word and N is the total number of words in the input word list. Further, to form an intermediate word, it takes O(M) time. This adds up to O(M²×N).

In ladderLength, BFS can go to each of the N words and for each word, we need to examine M possible intermediate words. This adds up to O(M²×N).

Overall, it adds up to 2×O(M²×N), which is simply O(M²×N).

Space

In next_nodes, each word in the word list would have M intermediate combinations. For every word we need a space of M² to save all the transformations corresponding to it. Thus, it would need a total space of O(M²×N).

In ladderLength, the BFS queue would need a space of O(M×N).

Overall, it adds up to O(M²×N) + O(M×N), which is simply O(M²×N).
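
The analysis above reasons in terms of intermediate words. As a sketch of that idea (a variant, not the code shown earlier), each word can be bucketed under its wildcard patterns such as S*IL, and a visited set ensures each word enters the queue only once. The function name and the "*" wildcard convention are my own choices.

```python
from collections import defaultdict, deque


def ladder_length_with_patterns(begin_word, end_word, word_list):
    """Return the minimum number of one-letter transformations, or 0."""
    words = set(word_list)
    if end_word not in words:
        return 0

    # Map each intermediate pattern (one letter replaced by '*')
    # to the words that match it: O(M) patterns of length M per word.
    buckets = defaultdict(list)
    for word in words | {begin_word}:
        for i in range(len(word)):
            buckets[word[:i] + "*" + word[i + 1:]].append(word)

    visited = {begin_word}
    queue = deque([(begin_word, 0)])
    while queue:
        word, steps = queue.popleft()
        if word == end_word:
            return steps
        for i in range(len(word)):
            for neighbour in buckets[word[:i] + "*" + word[i + 1:]]:
                if neighbour not in visited:
                    visited.add(neighbour)  # enqueue each word only once
                    queue.append((neighbour, steps + 1))
    return 0


WORDS = ["SAIL", "RAIN", "REST", "BAIL", "MAIL", "MAIN", "RUIN"]
print(ladder_length_with_patterns("SAIL", "RUIN", WORDS))
```

Because every word is enqueued at most once, the queue no longer holds duplicate paths such as the repeated MAIL and MAIN entries visible in the trace above.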

Wrap Up

It could be a little tricky and thus would need some practice to visualize the graph as well as to write code for it.

Great, so now we know how to solve problems like the word ladder problem. We also touched on other related common graph algorithms that we can refer to.

I had a read of the following reference and it has much more details if needed.


Keep problem solving!


Linear time partition – a three way split

Linear-time partition is a divide & conquer based selection algorithm. With it, data is split into three groups using a pivot.

linear-time-partioning

It is an integral part of the Quick Sort algorithm, which uses this partitioning logic recursively. All the elements smaller than the pivot are put on one side and all the larger ones on the other side of the pivot.

Similar to the discussion of Dynamic Programming, this algorithm relies on solving sub-problems to solve a complex problem.

Algorithm

After selecting the pivot, the linear-time partition routine separates the data into three groups with values:

  • less than the pivot
  • equal to the pivot
  • greater than the pivot

Generally, this algorithm is done in place, which results in partially sorting the data. There are a handful of problems that make use of this fact, like:

  • Sort an array that contains only 0’s, 1’s & 2’s
  • Dutch national flag problem
  • Print all negative integers followed by positive for an array full of them
  • Print all 0’s first and then 1’s or vice-versa for an array with only 0’s & 1’s
  • Move all the 0’s to the end maintaining relative order of other elements for an array of integers

If done out of place (i.e. without changing the original data), it would cost O(n) additional space.
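
As a point of comparison, an out-of-place version can be sketched as below. The helper name three_way_split is an assumption for illustration. It leaves the input untouched but builds new lists, which is where the O(n) extra space goes.

```python
def three_way_split(data, pivot):
    """Return a new partitioned list; the input list is not modified."""
    less = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    greater = [x for x in data if x > pivot]
    return less + equal + greater

original = [2, 0, 1, 2, 0, 1]
result = three_way_split(original, pivot=1)
print(result)    # [0, 0, 1, 1, 2, 2]
print(original)  # [2, 0, 1, 2, 0, 1]
```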

Example

Let’s take an example: sort an array that contains only 0’s, 1’s & 2’s.

The first thought for such a problem is to count the 0’s, 1’s and 2’s and then, once we have the counts, reset the array with them. Though it has time complexity O(n), it takes two traversals of the array or uses an extra array.
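
For reference, that counting idea could look roughly like this (sort_by_counting is a hypothetical name, and the sketch assumes the list holds only 0, 1 and 2):

```python
def sort_by_counting(A):
    """Sort a list of 0's, 1's and 2's in place by counting."""
    counts = [0, 0, 0]
    for value in A:              # first traversal: count each value
        counts[value] += 1
    index = 0
    for value in range(3):       # second traversal: rewrite the array
        for _ in range(counts[value]):
            A[index] = value
            index += 1

numbers = [0, 1, 2, 2, 1, 0, 0, 2]
sort_by_counting(numbers)
print(numbers)  # [0, 0, 0, 1, 1, 2, 2, 2]
```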

Below is an attempt to solve it using the linear-time partition algorithm to avoid that extra traversal/space.

def threeWayPartition(A):
    start = mid = 0
    end = len(A)-1
    
    # define a Pivot
    pivot = 1
    
    while (mid <= end):
        # mid element is less than pivot
        # current element is 0
        
        # so lets move it to start
        # current start is good. 
        # move start to next element
        # move mid to next element to move forward
        if (A[mid] < pivot) :
            swap(A, start, mid)
            start = start + 1
            mid = mid + 1
            
        # mid element is more than pivot
        # current element is 2
        
        # so lets move it to end
        # current end is good. 
        # move end to previous element
        elif (A[mid] > pivot) :
            swap(A, mid, end)
            end = end - 1
        
        # mid element is same as pivot
        # current element is 1
        
        # just move forward: 
        # mid to next element
        else :
            mid = mid + 1
            
# Swap two elements A[i] and A[j] in the list
def swap(A, i, j):
    A[i], A[j] = A[j], A[i]


# Define an array
inputArray = [0, 1, 2, 2, 1, 0, 0, 2]

# Call the Linear-time partition routine
threeWayPartition(inputArray)

# print the final result
print(inputArray)

# Outputs
# [0, 0, 0, 1, 1, 2, 2, 2]

With a defined pivot, we segregated the data on either side of it, which resulted in the desired output. The Dutch national flag problem, printing all negatives first and then positives, or printing all 0’s first all follow the same code.
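
As a sketch of that reuse, the same loop can take the pivot as a parameter. Here partition_around is a hypothetical variant of the routine above, used with pivot 0 to put negatives first and positives last:

```python
def partition_around(A, pivot):
    """In-place three-way partition of A around the given pivot."""
    start, mid, end = 0, 0, len(A) - 1
    while mid <= end:
        if A[mid] < pivot:
            # smaller than pivot: move to the front section
            A[start], A[mid] = A[mid], A[start]
            start += 1
            mid += 1
        elif A[mid] > pivot:
            # larger than pivot: move to the back section
            A[mid], A[end] = A[end], A[mid]
            end -= 1
        else:
            # equal to pivot: leave it in the middle
            mid += 1

values = [3, -1, 0, -7, 5, 0, -2]
partition_around(values, pivot=0)
print(values)  # [-2, -1, -7, 0, 0, 5, 3]
```

Note that the groups end up in the right order, but the relative order inside each group is not preserved.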

For moving all 0’s to the end while maintaining the order of the other elements, we tweak the swap index to maintain order:

def threeWayPartition(A):
    current = 0
    nonzero = 0
    end = len(A)-1
    
    # define a Pivot
    pivot = 0
    
    while (current <= end):
        if (A[current] != pivot) :
            swap(A, current, nonzero)
            nonzero = nonzero + 1
        current = current + 1
            
# Swap two elements A[i] and A[j] in the list
def swap(A, i, j):
    A[i], A[j] = A[j], A[i]


# Define an array
inputArray = [7,0,5,1,2,0,2,0,6]

# Call the Linear-time partition routine
threeWayPartition(inputArray)

# print the final result
print(inputArray)

# Output
# [7, 5, 1, 2, 2, 6, 0, 0, 0]

Complexity

With the above approach, we solved our problem with time complexity O(n) and space complexity O(1), in a single traversal of the array.


It was fun solving!


Data Visualization – Insights with Matplotlib

When working on a machine learning problem, Matplotlib is the most popular Python library for visualization; it helps in representing and analyzing the data and working through insights.

matplotlib-machine-learning

Generally, it’s difficult to interpret much about data just by looking at it. But presenting the data in a visual form helps a great deal to peek into it. It becomes easy to deduce correlations and identify patterns & parameters of importance.

In the data science world, data visualization plays an important role around the data pre-processing stage. It helps in picking appropriate features and applying an appropriate machine learning algorithm. Later, it helps in representing the data in a meaningful way.

Data Insights via various plots

Where needed, we will use real datasets for plot examples and discussions. Based on the need, the following are the common plots that are used:

Line Chart | ax.plot(x,y)

It helps in representing a series of data points against a given range of a defined parameter. The real benefit is plotting multiple line charts in a single plot to compare and track changes.

Points next to each other are related, which helps to identify a repeated or defined pattern.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 1, 0.05)
y1 = x**2
y2 = x**3

plt.plot(x, y1,
    linewidth=0.5,
    linestyle='--',
    color='b',
    marker='o',
    markersize=10,
    markerfacecolor='red')

plt.plot(x, y2,
    linewidth=0.5,
    linestyle='dotted',
    color='g',
    marker='^',
    markersize=10,
    markerfacecolor='yellow')

plt.title('x Vs f(x)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend(['f(x)=x^2', 'f(x)=x^3'])
plt.xticks(np.arange(0, 1.1,0.2),
    ['0','0.2','0.4','0.6','0.8','1.0'])

plt.grid(True)
plt.show()
line-chart
Real world example:

We will work with a dataset created by collating historical data for a few stocks downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

stocksdf1 = pd.read_csv('data-files/stock-INTU.csv') 
stocksdf2 = pd.read_csv('data-files/stock-AAPL.csv') 
stocksdf3 = pd.read_csv('data-files/stock-ADBE.csv') 

stocksdf = pd.DataFrame()
stocksdf['date'] = pd.to_datetime(stocksdf1['Date'])
stocksdf['INTU'] = stocksdf1['Open']
stocksdf['AAPL'] = stocksdf2['Open']
stocksdf['ADBE'] = stocksdf3['Open']

plt.plot(stocksdf['date'], stocksdf['INTU'])
plt.plot(stocksdf['date'], stocksdf['AAPL'])
plt.plot(stocksdf['date'], stocksdf['ADBE'])

plt.legend(labels=['INTU','AAPL','ADBE'])
plt.grid(True)

plt.show()
line-chart-stocks

With the above, we have a couple of quick assessments:
Q: How did a particular stock fare over the last year?
A: Stocks were roughly rising till Feb 2020, then took a dip in April, and have been back up since then.

Q: How did the three stocks behave during the same period?
A: The stock price of ADBE was the most sensitive, and AAPL the least sensitive, to the change during the same period.

Histogram | ax.hist(data, n_bins)

It helps in showing distributions of variables: it plots quantitative data with the range of the data grouped into intervals (bins).

We can use a log scale if the data range spans several orders of magnitude.

import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
# a covariance matrix must be symmetric
cov = [[2, 4], [4, 9]]
xn, yn = np.random.multivariate_normal(
                                mean, cov, 100).T

plt.hist(xn, bins=25, label="Distribution on x-axis")

plt.xlabel('x')
plt.ylabel('frequency')
plt.grid(True)
plt.legend()
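
For the log-scale case mentioned earlier, a minimal sketch could look like the following. The data is synthetic (drawn from a log-normal distribution purely for illustration), and the bin edges are spaced evenly on a log scale so that the bars have equal visual width.

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic data spread over several orders of magnitude
rng = np.random.default_rng(0)
data = rng.lognormal(mean=3, sigma=2, size=1000)

# bin edges evenly spaced on a log scale, from min to max
bins = np.logspace(np.log10(data.min()), np.log10(data.max()), 25)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(data, bins=bins)
ax.set_xscale('log')
ax.set_xlabel('x (log scale)')
ax.set_ylabel('frequency')
ax.grid(True)
```

Calling plt.show() afterwards displays the figure when running outside a notebook.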
Real world example

We will work with dataset of Indian Census data downloaded from here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
df = populationdf[mask1 & mask2]

plt.hist(df['TOT_P'], label='Distribution')

plt.xlabel('Total Population')
plt.ylabel('State Count')
plt.yticks(np.arange(0,20,2))

plt.grid(True)
plt.legend()
histogram-state-pop

With the above, a couple of quick assessments about the population in states of India:
Q: What’s the general population distribution of states in India?
A: More than 50% of states have a population of less than 2 crores (20 million).

Q: How many states have a population of more than 10 crores (100 million)?
A: Only 3 states have that high a population.

Bar Chart | ax.bar(x_pos, heights)

It helps in comparing two or more variables by displaying values associated with categorical data.

It is the plot most commonly used in the media to share survey data, displaying every data sample.

import numpy as np
import matplotlib.pyplot as plt

data = [[60, 45, 65, 35],
        [35, 25, 55, 40]]

x_pos = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_xticks(x_pos)

ax.bar(x_pos - 0.1, data[0], color='b', width=0.2)
ax.bar(x_pos + 0.1, data[1], color='g', width=0.2)

ax.yaxis.grid(True)
bar-chart
Real world example

We will work with dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
statesdf = populationdf[mask1 & mask2]
statesdf = statesdf.sort_values('TOT_P')

plt.figure(figsize=(10,8))
plt.barh(range(len(statesdf)), 
    statesdf['TOT_P'], tick_label=statesdf['Name'])
plt.grid(True)
plt.title('Total Population')
plt.show()
bar-chart-state-pop

With the above, a couple of quick assessments about the population in states of India:
– Uttar Pradesh has the highest total population and Lakshadweep has the lowest
– The relative population across states is visible, with Uttar Pradesh almost double the second most populated state

Pie Chart | ax.pie(sizes, labels=[labels])

It helps in showing the percentage (or proportional) distribution of categories at a certain point of time. Usually, it works well if it’s limited to a single-digit number of categories.

A circular statistical graphic where the arc length of each slice is proportional to the quantity it represents.

import numpy as np
import matplotlib.pyplot as plt

# Slices will be ordered and plotted counter-clockwise
labels = ['Audi','BMW','LandRover','Tesla','Ferrari']
sizes = [90, 70, 35, 20, 25]

fig, ax = plt.subplots()
ax.pie(sizes,labels=labels, autopct='%1.1f%%')
ax.set_title('Car Sales')
plt.show()
pie-chart
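
To see what the slices represent, the proportions can be worked out by hand. This small sketch reuses the sample car sales numbers above and computes each slice's share and its angle in the circle:

```python
labels = ['Audi', 'BMW', 'LandRover', 'Tesla', 'Ferrari']
sizes = [90, 70, 35, 20, 25]

total = sum(sizes)
# percentage of the whole, as printed by autopct
shares = [100 * size / total for size in sizes]
# degrees of the full circle taken by each slice
angles = [360 * size / total for size in sizes]

for label, share, angle in zip(labels, shares, angles):
    print(f'{label:10s} {share:5.1f}%  {angle:6.1f} deg')
```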
Real world example

We will work with dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

labels = ['Beer', 'Spirit', 'Wine']
sizes = [drinksdf['beer'].sum(), 
         drinksdf['spirit'].sum(), 
         drinksdf['wine'].sum()]

fig, ax = plt.subplots()
explode = [0.05,0.05,0.2]
ax.pie(sizes,explode=explode,
    labels=labels, autopct='%1.1f%%')

ax.set_title('Alcohol Consumption')
plt.show()
pie-chart-drinks

With the above, we can quickly assess how overall alcohol consumption is distributed across beer, spirit and wine. This view helps if we have a small number of slices (categories).

Scatter plot | ax.scatter(x_points, y_points)

It helps in representing paired numerical data, either to compare how one variable is affected by another or to see how the values of multiple dependent variables are spread for each value of the independent variable.

Sometimes the data points in a scatter plot form distinct groups, which are called clusters.

import numpy as np
import matplotlib.pyplot as plt

# random but focused cluster data
x1 = np.random.randn(100) + 8
y1 = np.random.randn(100) + 8
x2 = np.random.randn(100) + 3
y2 = np.random.randn(100) + 3

x = np.append(x1,x2)
y = np.append(y1,y2)

plt.scatter(x,y, label="xy distribution")
plt.legend()
scatter-plot
Real world example
  1. We will work with dataset of Alcohol Consumption downloaded from here.
import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# parentheses are needed for the sum to
# continue across lines
drinksdf['total'] = (drinksdf['beer'] 
    + drinksdf['spirit'] 
    + drinksdf['wine'] 
    + drinksdf['alcohol'])

# drinksdf.corr(numeric_only=True) tells beer 
# and alcohol are highly correlated
fig = plt.figure()

# Compare beer and alcohol consumption
# Use color to show a third variable.
# Can also use size (s) to show a third variable.
scat = plt.scatter(drinksdf['beer'], 
                   drinksdf['alcohol'], 
                   c=drinksdf['total'], 
                   cmap=plt.cm.rainbow)

# colorbar to explain the color scheme
fig.colorbar(scat, label='Total drinks')

plt.xlabel('Beer')
plt.ylabel('Alcohol')
plt.title('Comparing beer and alcohol consumption')
plt.grid(True)
plt.show()
scatter-plot-drinks

With the above, we can quickly assess that beer and alcohol consumption have a strong positive correlation, which would suggest a large overlap of people who drink beer and alcohol.

2. We will work with dataset of Mall Customers downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

malldf = pd.read_csv('data-files/mall-customers.csv',
                skiprows=1, 
                names = ['customerid', 'genre', 
                         'age', 'annualincome', 
                         'spendingscore'])

plt.scatter(malldf['annualincome'], 
            malldf['spendingscore'], 
            marker='p', s=40, 
            facecolor='r', edgecolor='b', 
            linewidth=2, alpha=0.4)

plt.xlabel("Annual Income")
plt.ylabel("Spending Score (1-100)")
plt.grid(True)
scatter-plot-mall

With the above, we can quickly assess that there are five clusters, and thus five segments or types of customers one can plan for.

Box Plot | ax.boxplot([data list])

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It only shows summary statistics such as the median, the quartiles and the interquartile range (and the mean, if requested).

Easy to identify if data is symmetrical, how tightly it is grouped, and if and how data is skewed

import numpy as np
import matplotlib.pyplot as plt

# some random data
data1 = np.random.normal(0, 2, 100)
data2 = np.random.normal(0, 4, 100)
data3 = np.random.normal(0, 3, 100)
data4 = np.random.normal(0, 5, 100)
data = list([data1, data2, data3, data4])

fig, ax = plt.subplots()
bx = ax.boxplot(data, patch_artist=True)

ax.set_title('Box Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

colors = ['pink','lightblue','lightgreen','yellow']
for patch, color in zip(bx['boxes'], colors):
    patch.set_facecolor(color)

ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()
box-plot
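
The numbers behind a box can be computed directly. This is a small sketch on made-up data, assuming Matplotlib's default whisker setting of 1.5 times the interquartile range:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# the box spans q1 to q3, with a line at the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# with the default setting, whiskers reach the furthest data points
# within 1.5 * IQR of the box; points beyond that are drawn as fliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q1, median, q3, iqr)  # 3.0 5.0 7.0 4.0
print(outliers)             # [30]
```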
Real world example

We will work with dataset of Tips downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.boxplot(x="time", y="tip", 
            hue='sex', data=tipsdf, 
            order=["Dinner", "Lunch"],
            palette='coolwarm')
box-plot-tips

With the above, we can make a couple of quick assessments:
– males give bigger tips compared to females
– tips during dinner time vary a lot more, especially for males

Violin Plot | ax.violinplot([data list])

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It shows the full distribution of data.

A quick way to compare distributions across multiple variables

import numpy as np
import matplotlib.pyplot as plt

data = [np.random.normal(0, std, size=100) 
        for std in range(2, 6)]

fig, ax = plt.subplots()
bx = ax.violinplot(data)

ax.set_title('Violin Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

ax.set_xticks([1,2,3,4])
ax.set_xticklabels(xticklabels)

ax.yaxis.grid(True)
plt.show()
violin-plot
Real world example
  1. We will work with dataset of Tips downloaded from here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.violinplot(x="day", y="tip", 
               split=True, data=tipsdf)
violin-plot-tips

With the above, we can quickly assess that the tips on Saturday have a more spread-out distribution, whereas Friday has a much narrower distribution in comparison.

2. We will work with dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='DISTRICT'
mask2 = populationdf['TRU']!='Total'
statesdf = populationdf[mask1 & mask2]

maskUP = statesdf['State']==9
maskM = statesdf['State']==27
data = statesdf.loc[maskUP | maskM]

sns.violinplot(x='State', y='P_06', 
               inner='quartile', hue='TRU',  
               palette={'Rural':'green','Urban':'blue'}, 
               scale='count', split=True, 
               data=data, size=6)

plt.title('In districts of UP and Maharashtra')
plt.show()
violin-plot-child

With the above, we can make a couple of quick assessments:
– Uttar Pradesh has high volume and distribution of rural child population.
– Maharashtra has almost equal spread of rural and urban child population

Heatmap

It helps in representing a 2-D matrix form of data using variation of color for different values. The variation of color may be by hue or intensity.

Generally used to visualize a correlation matrix, which in turn helps in feature (variable) selection.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create 2D array
array_2d = np.random.rand(4, 6)
sns.heatmap(array_2d, annot=True)
heatmap
Real world example
  1. We will work with dataset of Alcohol Consumption downloaded from here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# numeric_only skips text columns (pandas 1.5+)
sns.heatmap(drinksdf.corr(numeric_only=True),
            annot=True, cmap='YlGnBu')
heatmap-drinks

With the above, we can make a couple of quick assessments:
– there is a strong correlation between beer and alcohol and thus a strong overlap there.
– wine and spirit are almost uncorrelated, and thus it would be rare to have a place where wine and spirit consumption are equally high. One would be preferred over the other.

If we notice, the upper and lower halves along the diagonal are the same: the correlation of A with B is the same as that of B with A. Further, the correlation of A with itself will always be 1. In such a case, we can make a small tweak to make the plot more presentable and avoid any confusion.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv(
    'data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

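# correlation and masks
# numeric_only skips text columns (pandas 1.5+)
drinks_cr = drinksdf.corr(numeric_only=True)
drinks_mask = np.triu(drinks_cr)

# drop the first row and the last column, which
# would otherwise be fully masked (empty)
drinks_cr = drinks_cr.iloc[1:,:-1]
drinks_mask = drinks_mask[1:, :-1]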

sns.heatmap(drinks_cr, 
        mask=drinks_mask,
        annot=True,
        cmap='coolwarm')
heatmap-masked

It is the same correlation data, but only the needed part is shown.
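
To see how such a mask works in isolation, here is a small sketch on a hand-written 3x3 correlation matrix (the values are illustrative only). It builds an explicit boolean mask, a slight variation on passing the np.triu values directly, where any non-zero cell acts as True:

```python
import numpy as np

# a small symmetric correlation matrix (illustrative values)
corr = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.1],
                 [0.2, 0.1, 1.0]])

# True on and above the diagonal: the cells to hide
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# the cells left visible hold each pair exactly once
print(corr[~mask])  # [0.5 0.2 0.1]
```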

Data Image

It helps in displaying data as an image, i.e. on a 2D regular raster.

Images are internally just arrays. Any 2D numpy array can be displayed as an image.

import numpy as np
import matplotlib.pyplot as plt

M,N = 25,30
data = np.random.random((M,N)) 
plt.imshow(data)
data-image
Real world example
  1. Let’s read an image and then try to display it back to see how it looks
import cv2
import matplotlib.pyplot as plt

img = cv2.imread('data-files/babygroot.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# print(img.shape)
# output => (500, 359, 3)

plt.imshow(img)
baby-groot

It read the image as an array (a matrix of pixel values) and then drew it as a plot, which turned out to be the same as the image. Since images are like any other plot, we can plot other objects (like annotations) on top of them.

SubPlots | fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2)

Generally, it is used in comparing multiple variables (in pairs) against each other. With multiple plots stacked against each other in the same figure, it helps in quick assessment of correlation and distribution for a pair.

For plt.subplot(nrows, ncols, index), the parameters are: number of rows, number of columns, and the index of the subplot.

(Indexes are counted row-wise, starting with 1.)

The widths of the different subplots can differ with the use of GridSpec.

import numpy as np
import matplotlib.pyplot as plt
import math

# data setup
x = np.arange(1, 100, 5)
y1 = x**2
y2 = 2*x+4
y3 = [ math.sqrt(i) for i in x]  
y4 = [ math.log(j) for j in x] 

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

ax1.plot(x, y1)
ax1.set_title('f(x) = quadratic')
ax1.grid()

ax2.plot(x, y2)
ax2.set_title('f(x) = linear')
ax2.grid()

ax3.plot(x, y3)
ax3.set_title('f(x) = square root')
ax3.grid()

ax4.plot(x, y4)
ax4.set_title('f(x) = log')
ax4.grid()

fig.tight_layout()
plt.show()
sub-plot

We can stack up an m x n view of the variables and have a quick look at how they are correlated. With the above, we can quickly assess that the second graph's parameters are linearly correlated.
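
For subplots of different widths, a minimal GridSpec sketch could look like this. The 3:1 width ratio and the axes names are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

x = np.arange(1, 100, 5)

fig = plt.figure()
# one row, two columns; the first column is three times as wide
gs = GridSpec(1, 2, figure=fig, width_ratios=[3, 1])

ax_wide = fig.add_subplot(gs[0, 0])
ax_wide.plot(x, x**2)
ax_wide.set_title('wide')

ax_narrow = fig.add_subplot(gs[0, 1])
ax_narrow.plot(x, 2*x + 4)
ax_narrow.set_title('narrow')
```

The same effect is available through plt.subplots(1, 2, gridspec_kw={'width_ratios': [3, 1]}) when all the subplots sit on one regular grid.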

Data Representation

Plot Anatomy

The picture below will help with plot terminology and representation:

matplotlib-plot-anatomy
Credit: matplotlib.org

The Figure above is the base space where the entire plot happens. Most of the parameters can be customized for better representation. For specific details, look here.

Plot annotations

It helps in highlighting a few key findings or indicators on a plot. For advanced annotations, look here.

import numpy as np
import matplotlib.pyplot as plt

# A simple parabolic data
x = np.arange(-4, 4, 0.02)
y = x**2

# Setup plot with data
fig, ax = plt.subplots()
ax.plot(x, y)

# Setup axes
ax.set_xlim(-4,4)
ax.set_ylim(-1,8)

# Visual titles
ax.set_title('Annotation Sample')
ax.set_xlabel('X-values')
ax.set_ylabel('Parabolic values')

# Annotation
# 1. Highlighting specific data on the x,y data
ax.annotate('local minima of \n the parabola',
            xy=(0, 0),
            xycoords='data',
            xytext=(2, 3),
            arrowprops=
                dict(facecolor='red', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='top')

# 2. Highlighting specific data on the x/y axis
bbox_yproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 70% of y-plot range',
            xy=(0, 0.7),
            xycoords='axes fraction',
            xytext=(0.2, 0.7),
            bbox=bbox_yproperties,
            arrowprops=
                dict(facecolor='green', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

bbox_xproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 40% of x-plot range',
            xy=(0.3, 0),
            xycoords='axes fraction',
            xytext=(0.1, 0.4),
            bbox=bbox_xproperties,
            arrowprops=
                dict(facecolor='blue', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

plt.show()
matplotlib-annotation

Plot style | plt.style.use('style')

It helps in customizing representation of a plot, like color, fonts, line thickness, etc. Default styles get applied if the customization is not defined. Apart from adhoc customization, we can also choose one of the already defined template styles and apply them.

# To know all existing styles with package
for style in plt.style.available:
    print(style)

Solarize_Light2, _classic_test_patch, bmh, classic, dark_background, fast, fivethirtyeight, ggplot, grayscale, seaborn, seaborn-bright, seaborn-colorblind, seaborn-dark, seaborn-dark-palette, seaborn-darkgrid, seaborn-deep, seaborn-muted, seaborn-notebook, seaborn-paper, seaborn-pastel, seaborn-poster, seaborn-talk, seaborn-ticks, seaborn-white, seaborn-whitegrid, tableau-colorblind10

pre-defined styles available for use

More details around customization are here.

# To use a defined style for plot
# (Matplotlib 3.6+ renamed it to 'seaborn-v0_8')
plt.style.use('seaborn')

# OR
with plt.style.context('Solarize_Light2'):
    plt.plot(np.sin(np.linspace(0, 2 * np.pi)), 'r-o')
plt.show()
matplotlib-style-ex

Saving plots | fig.savefig()

It helps in saving the figure with its plot as an image file with the defined parameters. Parameter details are here. By default, a relative file name saves the image file to the current working directory.

plt.savefig('plot.png', dpi=300, bbox_inches='tight')

Additional Usages of plots

Data Imputation

It helps in filling missing data with some reasonable values, as many statistical or machine learning packages do not work with data containing null values.

Data interpolation can be set to use pre-defined functions such as linear, quadratic or cubic.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(20,1))
# keep values below 0.5; the rest become NaN
df = df.where(df < 0.5)

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.plot(df)
ax1.set_title('f(x) = data missing')
ax1.grid()

ax2.plot(df.interpolate())
ax2.set_title('f(x) = data interpolated')
ax2.grid()

fig.tight_layout()
plt.show()
data-interpolate

With the above, we see the missing data replaced with plausible values, interpolated by the dataframe from the valid previous and next data points.
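
On a tiny series the effect is easy to verify by hand. This sketch uses the default linear method; as far as I know, the quadratic and cubic methods additionally need SciPy installed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# default method is 'linear': gaps are filled on a straight line
# between the nearest valid neighbours
filled = s.interpolate(method='linear')
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```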

Animation

At times, it helps in presenting the data as an animation. On a high level, it would need data to be plugged in a loop with delta changes translating into a moving view.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import animation

fig = plt.figure()

def f(x, y):
    return np.sin(x) + np.cos(y)

x = np.linspace(0, 2 * np.pi, 80)
y = np.linspace(0, 2 * np.pi, 70).reshape(-1, 1)

im = plt.imshow(f(x, y), animated=True)


def updatefig(*args):
    global x, y
    x += np.pi / 5.
    y += np.pi / 10.
    im.set_array(f(x, y))
    return im,

ani = animation.FuncAnimation(
    fig, updatefig, interval=100, blit=True)
plt.show()
animation

3-D Plotting

If needed, we can also have an interactive 3-D plot though it might be slow with large datasets.

import numpy as np
import matplotlib.pyplot as plt

def randrange(n, vmin, vmax):
     return (vmax-vmin)*np.random.rand(n) + vmin

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
n = 200
for c, m, zl in [('g', 'o', +1), ('r', '^', -1)]:
    xs = randrange(n, 0, 50)
    ys = randrange(n, 0, 100)
    zs = xs+zl*ys  
    ax.scatter(xs, ys, zs, c=c, marker=m)

ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_zlabel('Z data')
plt.show()
3d-plot

Cheat Sheet

A page representation of the key features for quick lookup or revision:

matplotlib-cheatsheet
Credit: DataCamp

Download the PDF version of cheatsheet from here.
Overall reference & for more details, look: https://matplotlib.org/

Entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look or play around: https://github.com/sandeep-mewara/data-visualization


Keep learning!


pandas – get started with examples

This is to get started with pandas and try a few concrete examples. pandas is a Python-based library that helps in reading, transforming, cleaning and analyzing data. It is built on the NumPy package.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

https://pandas.pydata.org


Key data structure in pandas is called DataFrame – it helps to work with tabular data translated as rows of observations and columns of features.

Download or fork the entire Jupyter notebook from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

pandas basics includes:

  • Series
  • Dataframes
    • Create
      • from list of tuples
      • from a dictionary
      • from a CSV
      • from built-in dataset (eg: from sklearn.datasets)
    • Data retrieval
    • Modifying data
    • Group by operation
    • Custom Functions – apply method
    • Pre-Processing
      • drop, mean, mode
      • ordinal feature
      • nominal feature
    • Reshaping
      • CrossTab
      • Merge
      • Melt
      • Pivot

# .info(), .head() and .sample() are handy methods to use first off with a dataframe to get high-level details

# the index may not be unique – it can return multiple values

# boolean indexing (masking) can help select a certain set of rows

# .isin() is useful when building a boolean index

# .where() is useful to retain shape of the original table

# Column names & Indexes can be set if needed

# to modify the table right away, use inplace=True

# aggregate operations can be applied on a groupby object

# dropna(), mean() or mode() are handy ways for pre-processing missing data

Key learnings …
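
A few of these learnings can be tried on a tiny dataframe. The data below is made up for illustration and is not from the notebook:

```python
import pandas as pd

# illustrative data, not from the notebook
df = pd.DataFrame({'city': ['Pune', 'Delhi', 'Goa', 'Delhi'],
                   'sales': [10, 25, 5, 40]})

# boolean indexing (masking): keep only the matching rows
high = df[df['sales'] > 8]

# .isin() builds a boolean index from a list of allowed values
metros = df[df['city'].isin(['Pune', 'Delhi'])]

# .where() keeps the original shape; non-matching cells become NaN
shaped = df['sales'].where(df['sales'] > 8)

print(len(high), len(metros), shaped.isna().sum())  # 3 3 1
```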

Examples notebook includes:

  • Uber taxi drivers
  • Apple stock price
  • Day or Night
  • Students marks
  • Balance Calculator

# .describe() is a handy method to get the statistical summary of numerical columns

# one-hot-encoding is really helpful for nominal features (that cannot be ordered)

# converting the columns into the right datatype helps

# converting data into meaningful numbers helps with analysis

# groupby is a powerful tool with dataframes for analysis

Key learnings …
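
As a quick sketch of these points, again on made-up data rather than the notebook's datasets:

```python
import pandas as pd

# illustrative data, not from the notebook
df = pd.DataFrame({'shift': ['Day', 'Night', 'Day', 'Night'],
                   'trips': [12, 7, 9, 11]})

# statistical summary of the numerical columns
print(df.describe())

# one-hot encoding of a nominal feature: one column per category
encoded = pd.get_dummies(df, columns=['shift'])
print(list(encoded.columns))  # ['trips', 'shift_Day', 'shift_Night']

# aggregate over a groupby object
mean_trips = df.groupby('shift')['trips'].mean()
print(mean_trips['Day'], mean_trips['Night'])  # 10.5 9.0
```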

Cheat sheet

Credit: Pandas website

Download cheat sheet pdf from here
For more details about pandas, look at the documentation reference.

Keep learning!

Python as statistics workbench

While reading about AI/ML (Artificial Intelligence/Machine Learning), I came across a discussion on whether Python can be used as a “statistics workbench” to replace R, SPSS, etc. It was a nice share-out by multiple knowledgeable folks about the languages used for statistics problems, specifically R (read about R here).

Discussion here: https://stats.stackexchange.com/questions/1595/python-as-a-statistics-workbench

For quick reference, I will quote a few of the latest thoughts from there that are in favor of Python and how it has evolved. I too concur with most of them:

1. Python is easily the most intuitive syntax of any programming language. This makes for extremely fast development time.

2. Python is performant. It opens large datasets reliably.

3. The packages in Python are fast catching up to R’s packages. Python usage has increased tremendously last few years.

4. Readability is one of the most important qualities good code can possess, and Python is one of the most readable language.

5. Python has an extremely well-thought-out IDE now: PyCharm & Visual Studio Code.

https://stats.stackexchange.com/a/457753

Overall, Python is a general purpose language with an easy to understand syntax which would be relatively easier for usual programmers to learn/adopt. R is developed keeping statisticians in mind. Thus it has many features around data visualization and is a tad ahead currently.

A little research …

Recently DataCamp too published an article comparing R and Python for data analysis. There is a nice comparison in it on various parameters, picking just a couple of them here:

The final analysis in the article has R ahead for data analysis, but Python having the potential to catch up quickly and easily.

My thoughts …

My intent was to understand which programming language serves as an essential tool to demonstrate AI/ML capabilities. Looking at them, Python seems good enough for me to serve as an AI/ML tool to start with and probably conquer it.

Ammunition needed …

There are many Python-based libraries and packages that are generally used for statistical work. Below are a few of them that would help in our data analysis exploration going ahead:

  • scipy – Python-based ecosystem of open-source software for mathematics, science, and engineering.
    • cookbook – a collection of user-contributed recipes covering many statistical facilities
    • numpy – base N-dimensional array package. A handful of example lists are here
    • pandas – a fast, powerful, flexible and easy to use data analysis and manipulation tool
    • matplotlib – a comprehensive library for creating static, animated, and interactive visualizations
  • scikit-learn – simple and efficient machine learning tools for predictive data analysis
  • keras – API for deep learning
  • tensorflow – API to develop and train ML models

Since I am a programmer, I may be biased here. But it seems Python can do, and does, everything needed to start the AI/ML journey.

Happy learning!

NumPy – Basics & Examples

This is to get started with NumPy and try a few concrete examples. NumPy (Numerical Python) is a package for numerical computation designed for efficient work on large data sets.

The entire Jupyter notebook can be downloaded or forked from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

Reference: https://numpy.org/learn/

NumPy basics include:

  • Initialize Matrix via
    • List
    • NULL Matrix
    • IDENTITY Matrix
    • ONES Matrix
  • Matrix Transpose
  • Matrix Indexing
  • Simulation
  • Basic CSV file operations
  • Matrix Broadcasting
  • Basic Image Processing
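As a minimal sketch of the initialization, transpose and indexing items above (the array values and variable names are illustrative, not taken from the notebook):

```python
import numpy as np

# Matrix built from a list of lists
from_list = np.array([[1, 2, 3], [4, 5, 6]])

# Null (all-zeros), identity and all-ones matrices
null_matrix = np.zeros((2, 3))
identity_matrix = np.eye(3)
ones_matrix = np.ones((2, 3))

# Transpose swaps rows and columns: shape (2, 3) becomes (3, 2)
transposed = from_list.T

# Indexing is [row, column], both starting from 0
first_row = from_list[0]
last_element = from_list[1, 2]
second_column = from_list[:, 1]

print(transposed.shape, last_element, second_column)
```

The shape tuple passed to `np.zeros` and `np.ones` is (rows, columns), while `np.eye` takes a single size because an identity matrix is square.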

# a matrix in Python is a list of lists

# arrays are compatible for broadcasting when the trailing dimensions match or either of them is of length 1

# when an image is read as numbers, the values are between 0 & 1

Key learnings …
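The broadcasting rule noted above can be tried out with a small sketch. The shapes and values here are assumptions chosen for illustration: one array whose trailing dimension matches, one whose trailing dimension is 1, and one that is incompatible.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])          # shape (3,): trailing dimension matches
column = np.array([[100], [200]])     # shape (2, 1): trailing dimension is 1

row_sum = matrix + row        # row is added to every row of matrix
column_sum = matrix + column  # column is stretched across the 3 columns

# Shapes (2, 3) and (2,) are not compatible: trailing dimensions 3 and 2 differ
try:
    matrix + np.array([1, 2])
    compatible = True
except ValueError:
    compatible = False

print(row_sum)
print(column_sum)
print(compatible)
```

NumPy compares shapes from the last dimension backwards, which is why the `(3,)` array lines up with the columns of the `(2, 3)` matrix.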

Examples notebook includes:

  • Random walk simulation
  • Triangle simulation
  • Random Number
  • Correlation coefficient
  • Mean/Variance of crude oil
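For the correlation and mean/variance items, a small sketch with made-up numbers (these are hypothetical values, not the crude oil data used in the notebook):

```python
import numpy as np

# Hypothetical values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1  # perfectly linear in x

# corrcoef returns a 2x2 matrix; the off-diagonal entry is corr(x, y)
correlation = np.corrcoef(x, y)[0, 1]

mean_x = x.mean()
variance_x = x.var()               # population variance (ddof=0) by default
sample_variance_x = x.var(ddof=1)  # sample variance divides by n - 1

print(correlation, mean_x, variance_x, sample_variance_x)
```

Note that `var()` defaults to the population variance; pass `ddof=1` when the data is a sample.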

# masking helps get all the values back that satisfy the mask

# cumsum() is a handy function for cumulative sum

# there are handy methods for random number generation

Key learnings …
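The three learnings above fit together in a short random walk sketch. The step count, seed and variable names are assumptions for illustration, and the notebook's own version may differ.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded so the run is repeatable

# Random walk: each step is -1 or +1, the position is the running total
steps = rng.choice([-1, 1], size=1000)
positions = np.cumsum(steps)

# Boolean mask: True where the walk is above the starting point
above_start = positions > 0
positive_positions = positions[above_start]  # only the values that satisfy the mask

print(positions[-1], positive_positions.size)
```

Indexing with the boolean array returns just the matching values, so no explicit loop is needed to filter the walk.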

For learning more about NumPy, look here: https://numpy.org/doc/stable/

Keep learning!

Python – Basics & Examples

This is to get started with Python and try a few concrete examples. It should help beginners learn, and others do a quick revision, without getting too deep.

The entire Jupyter notebook can be downloaded or forked from my GitHub to look at or play around with: https://github.com/sandeep-mewara/python-examples

I started Python programming using the Jupyter Notebook web application. Later, I moved to Visual Studio Code, which looked much more user-friendly.

A guide on how to set up VS Code for Python is here.

Python basics include:

  • Variables
  • Conditional statements
  • String manipulations
  • Type conversion
  • Formatting strings
  • Data Structure – List, Tuple
  • Functions
  • List comprehension
  • Zip & Pack

# items are indexed by integers, starting from 0.

# % is a format operator and %d, %s, %f are special format sequences

# negative index is used to access list elements from the end

# [start:end:step] Returns a new list from start to end-1 with default step 1

# zip can merge two lists into tuples (in Python 3, wrap it in list() to get a list of tuples)

Key learnings …
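The learnings above can be checked with a few lines of plain Python. The names and marks are hypothetical sample data.

```python
marks = [72, 85, 90, 64, 58]
names = ["Asha", "Ben", "Chen", "Dev", "Eli"]

first = marks[0]              # indexing starts from 0
last = marks[-1]              # negative index counts from the end
middle = marks[1:4]           # items at index 1, 2, 3 (end is excluded)
every_other = marks[::2]      # step of 2
reversed_marks = marks[::-1]  # step of -1 walks the list backwards

# % format operator: %s for strings, %d for integers, %f for floats
average = sum(marks) / len(marks)
line = "%s scored %d, class average %.1f" % (names[0], marks[0], average)

# zip pairs items up; list() materializes the pairs in Python 3
pairs = list(zip(names, marks))

# zip(*pairs) reverses the pairing and gives back two tuples
unzipped_names, unzipped_marks = zip(*pairs)

print(line)
print(pairs)
```

Slicing always returns a new list, so `marks` itself is unchanged by any of these operations.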

Examples notebook includes:

  • Palindrome
  • Sum of Squares
  • Sort students' marks list
  • Format students' marks list
  • Word Frequency
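As a rough sketch of the first two examples in this list (one possible approach, with hypothetical function names; the notebook may solve them differently):

```python
def is_palindrome(text):
    """Return True if text reads the same both ways, ignoring case and punctuation."""
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]


def sum_of_squares(n):
    """Return 1^2 + 2^2 + ... + n^2."""
    return sum(i * i for i in range(1, n + 1))


print(is_palindrome("Madam"), is_palindrome("Python"))
print(sum_of_squares(10))
```

The palindrome check reuses the `[::-1]` slice to reverse the cleaned characters, and the sum uses a generator expression instead of an explicit loop.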

# sometimes anonymous functions are enough

# storing data in dictionary as key-value pair helps

Key learnings …
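Both learnings can be seen in a small sketch: an anonymous function as a sort key, and a dictionary as a word counter. The student data, sentence and function name are assumptions for illustration.

```python
# Hypothetical (name, marks) records
students = [("Asha", 72), ("Ben", 85), ("Chen", 64)]

# A lambda is enough to tell sorted() which field to sort by
by_marks = sorted(students, key=lambda student: student[1], reverse=True)

for name, mark in by_marks:
    print("%-6s %3d" % (name, mark))


def word_frequency(text):
    """Count how often each word occurs, using a dict of word -> count."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


frequency = word_frequency("the quick fox jumps over the lazy dog the end")
print(frequency)
```

Using `counts.get(word, 0)` avoids a separate check for whether the word has been seen before.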

Keep learning!