Mastering the SKILL.md File in Agentic AI: A Complete Guide

In modern Agentic AI architectures, the primary engineering challenge is no longer generating language, but bridging the gap between conversational intent and reliable, repeatable and unambiguous execution. To achieve this, we must treat agent capabilities not as conversational shortcuts, but as well-defined engineering assets.

skill-md-agentic-ai.png


This requires a standardized contract for capability execution. That’s where SKILL.md comes in: a formal, machine-parsable definition file that acts as a Standard Interoperability Definition (SID) contract for systematic task execution within an agentic framework.

In this blog, I’ll dive deep into SKILL.md and share how it serves as a single source of truth for both conceptual planning (roles) and procedural execution (workflows) that power an automated, engineering-grade SDLC.

The Architectural Blueprint: The SKILL.md

SKILL.md is structured as an engineering specification, designed for zero-ambiguity parsing by an LLM like Claude. It defines the contract for interoperability, forcing teams to move from conversational requests to precise capability definitions.

Anatomy of an Engineering Contract

The specification consists of five required metadata fields that are immutable and machine-parsable:

  • Name: An immutable, unique, system-wide identifier for the capability (e.g., internal-token-manager-v1, exec-raise-github-pr-v1, or sdlc-pm-v1). This is the system’s handle for the skill.


  • Description: Critically, this is not a summary. It is the definitive Trigger Event Definition. It must be written from the perspective of an event, user query or internal signal that activates this capability, allowing the framework to perform accurate skill matching. Example: “Triggers automatically after a successful code analysis scan…”


  • Commands: A list of executable operations or prompts defined by the contract. For procedural skills, these map to API endpoints or internal function calls. For conceptual skills, these map to defined prompt sequences. Example: get-linter-report(timestamp) or refresh-token(service_id).


  • Constraints: A critical safety and resource management section. It defines the limits, rules and error conditions of the contract. Example: “Internal authentication tokens must expire after 1 hour.”


  • Examples: These are not suggestions but are the gold standard of Expected Behavior. They define the intended output for specific input scenarios, providing the LLM with a definitive blueprint for successful execution and reducing non-deterministic output.
Code Snippet 1: Sample Procedural SKILL.md (Raise GitHub PR)
---
# REQUIRED METADATA FIELDS (SID CONTRACT)

name: exec-raise-github-pr-v1
description: Triggers automatically after a successful 'exec-linter-code-analyzer-v1' scan or upon user request to systematically raise a new pull request on GitHub for reviewed code.
commands:
  - create-pr(repository_url, head_branch, base_branch, title, body)
constraints:
  - Must use a valid GitHub API token with 'repo' scope.
  - Head branch must differ from the base branch.
---

### Expected Behavior (Examples)

When this skill is matched against a standard JavaScript repository:
  - Input: create-pr("https://github.com/org/repo.git", "feat/new-api", "main", "Feat: Add API v2", "This PR introduces...")
  - Execution: Loads 'scripts/create_pr.py'.
  - Output: New PR URL.

Directory Structure & Progressive Disclosure

The SKILL.md is packaged within a defined directory structure, ensuring all supporting assets are decoupled and version-controlled alongside the specification.

skill-folder-structure.jpg

.Sandeep Mewara Github

  • 📄 SKILL.md (The only required asset, containing the definitions and contract).
  • 📁 scripts/ (Optional: Decoupled logic – Python, Bash, Node.js, etc. The implementation details of the contract).
  • 📁 references/ (Optional: Docs, checklists, design patterns or standards the skill must adhere to).
  • 📁 assets/ (Optional: Templates or sample data).

This decoupled architecture enables the Progressive Disclosure Pattern, which is critical for system efficiency and managing token constraints. A high-performance agentic system should not load every asset for every skill simultaneously. Progressive disclosure ensures assets are loaded only when necessary.

skill-md-activation-flow.jpg


Agents don’t load everything at once. They discover and expand context only when needed.
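
The idea can be sketched in a few lines of Python. This is an illustration only: the `LazySkill` class and the in-memory `FAKE_FILES` dictionary are hypothetical stand-ins for whatever discovery and file-loading mechanism a real framework uses. The point is that names and descriptions stay resident for matching, while the full body is fetched only on activation.

```python
# Hypothetical sketch of progressive disclosure; LazySkill and the
# in-memory "files" below are illustrative stand-ins, not a real framework API.
class LazySkill:
    """Keeps name and description resident; loads the body on first use."""

    def __init__(self, name, description, load_body):
        self.name = name
        self.description = description  # always in context, used for matching
        self._load_body = load_body     # callable that fetches the full SKILL.md
        self._body = None

    @property
    def is_loaded(self):
        return self._body is not None

    def activate(self):
        """Load the full body only when the skill is actually matched."""
        if self._body is None:
            self._body = self._load_body()
        return self._body


# Stand-in for reading files from disk.
FAKE_FILES = {
    "exec-raise-github-pr-v1": "...full SKILL.md body for the PR skill...",
    "internal-token-manager-v1": "...full SKILL.md body for the token skill...",
}

skills = [
    LazySkill(name, f"Trigger description for {name}",
              load_body=lambda n=name: FAKE_FILES[n])
    for name in FAKE_FILES
]

loaded_before = [s.name for s in skills if s.is_loaded]
skills[0].activate()
loaded_after = [s.name for s in skills if s.is_loaded]
print(loaded_before, loaded_after)
```

Only the activated skill pays the token cost of its body. The other skill contributes just its short description.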

Architecting the Automated SDLC

The standardization offered by SKILL.md allows us to architect and separate the dynamic pillars of an automated SDLC, managing all capabilities via this single specification. In a professional lifecycle, conceptual setup (Defining Roles) always precedes procedural execution (Executing Workflows).

Conceptual Role-Based Skills: Defining the Contract for a Persona (Planning & Setup)

To initiate any SDLC phase (e.g., Requirements), we must first define the conceptual frameworks, knowledge bases and systematic planning workflows of specific roles that help organise content by domain (behaviour-driven). We apply the identical SKILL.md standard to define a persona’s “mindset”.

  • WHAT: SKILL.md definitions for Product Manager Persona or Lead Developer Persona.


  • APPLICATION: During the “Requirements” and “Design” phases of the SDLC.


  • ARCHITECTURAL FLOW: During planning, you activate the Product Manager Persona (Code Snippet 2). Claude adopts this mindset and leverages knowledge references (e.g., Agile standards) and the command contract (draft-prd(user_stories)) to provide focused, high-quality requirements.
Code Snippet 2: Sample Conceptual SKILL.md (Product Manager)
---
# REQUIRED METADATA FIELDS (SID CONTRACT)

name: sdlc-pm-v1
description: Triggers during project initiation to define the persona, responsibilities, knowledge base and systematic planning workflows of a senior Product Manager.
commands:
  - draft-prd(user_stories, acceptance_criteria)
  - run-feature-prioritization(prd_document)
constraints:
  - Must reference files in the optional 'references/' directory (e.g., 'references/agile-standards.md') for all Agile terminology.
---

### Expected Behavior (Examples)

When this skill is matched to a new project request:
  - Input: draft-prd(user_stories, acceptance_criteria)
  - Execution: Loads 'references/agile-standards.md' to define terminology.
  - Output: A structured PRD document based on the internal persona.

External Workflow Execution Skills: Defining the Contract for the Workflow to ‘Do’

Once the groundwork is established and the build begins, the agent’s focus shifts to user-triggered workflows (e.g., after a commit). These skills are guides that help perform specific, measurable steps in the automated pipeline, providing the user with domain-specific results (task-driven).

  • WHAT: SKILL.md definitions for exec-linter-code-analyzer, exec-raise-github-pr, or jira-ticket-update.


  • APPLICATION: During the “Build,” “Test” and “Deploy” phases of the SDLC, typically automated by CI/CD events.


  • ARCHITECTURAL FLOW: After a successful code implementation event, the framework activates exec-linter-code-analyzer-v1 (Code Snippet 3). Claude reads the inputs and expected behavior, and the framework executes the decoupled logic (scripts/) to run the analysis and return the linter report. A successful scan in turn triggers exec-raise-github-pr-v1 (Code Snippet 1), which systematically creates the pull request, ensuring a reliable result (the PR URL) is provided back to the user’s workflow or CI/CD pipeline.
Code Snippet 3: Sample Procedural SKILL.md (Code Analyzer Workflow)
---
# REQUIRED METADATA FIELDS (SID CONTRACT)
name: exec-linter-code-analyzer-v1
description: Triggers automatically after a code commit event to execute a static analysis and linter scan on the modified files in a specific repository, providing a systematic JSON report.
commands:
  - run-analysis(repository_url, branch)
constraints:
  - Must use a valid GitHub API token with 'repo' scope.
---

### Expected Behavior (Examples)
When this skill is matched following a code commit:
  - Input: run-analysis("https://github.com/org/repo.git", "main")
  - Execution: Loads 'scripts/run_analysis.py'.
  - Output: Linter report JSON.

Internal Agent Operational Skills: Defining the Contract for the Software to ‘Be’

To ensure system stability, the agent software itself requires precise, standardized contracts for core operational tasks (like authentication, state, error handling, api-call, etc). These skills are operational and invisible to the SDLC workflow itself. They focus on the agent’s internal robustness and platform integrity.

  • WHAT: SKILL.md definitions for internal-token-manager or agent-state-historian.


  • APPLICATION: Triggered automatically by the agent’s orchestration layer during defined lifecycle events (e.g., establishing a session state, refreshing an expired token after a 401 response).


  • ARCHITECTURAL FLOW: When any skill requires access to a restricted API, it activates the internal-token-manager (Code Snippet 4). Claude reads the command contract (refresh-token(service_id)). The framework executes the decoupled logic (scripts/) to refresh the secure token, ensuring the agent software can authenticate without creating brittle, direct credential dependencies in the domain-level skills. This internal complexity is hidden from the user but critical for security and robustness.
Code Snippet 4: Sample Procedural SKILL.md (Token Manager)
---
# REQUIRED METADATA FIELDS (SID CONTRACT)
name: internal-token-manager-v1
description: An internal operational skill that triggers throughout a workflow when the agent detects it requires a secure token to authenticate against an external service (e.g., GitHub, Slack, Splunk).
commands:
  - refresh-token(service_id)
constraints:
  - Must use a valid agent credential secret (e.g., 'agent_platform_secret').
  - Tokens must expire after 1 hour.
---

### Expected Behavior (Examples)

When this skill is matched because a GitHub operation requires auth:
  - Input: refresh-token("github_api")
  - Execution: Loads 'scripts/refresh_token.py'.
  - Output: New OAuth token JSON.
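
As an illustration of what the decoupled logic behind refresh-token(service_id) might look like, here is a minimal Python sketch of a token cache that honours the one-hour expiry constraint. Everything in it is an assumption made for the example: the `TokenManager` class, the injected `fetch_token` callable and the fake clock. A real implementation would call the actual identity provider.

```python
import time

TOKEN_TTL_SECONDS = 60 * 60  # constraint: tokens expire after 1 hour


class TokenManager:
    """Hypothetical cache that refreshes a token once it is older than the TTL."""

    def __init__(self, fetch_token, clock=time.time):
        self._fetch_token = fetch_token  # callable(service_id) -> token string
        self._clock = clock              # injectable so expiry can be tested
        self._cache = {}                 # service_id -> (token, issued_at)

    def refresh_token(self, service_id):
        """Return a cached token, fetching a new one if missing or expired."""
        cached = self._cache.get(service_id)
        now = self._clock()
        if cached is None or now - cached[1] >= TOKEN_TTL_SECONDS:
            cached = (self._fetch_token(service_id), now)
            self._cache[service_id] = cached
        return cached[0]


# Usage with a fake clock and a fake token source.
fake_now = [0.0]
issued = []


def fake_fetch(service_id):
    issued.append(service_id)
    return f"{service_id}-token-{len(issued)}"


manager = TokenManager(fake_fetch, clock=lambda: fake_now[0])
first = manager.refresh_token("github_api")
second = manager.refresh_token("github_api")   # still fresh, served from cache
fake_now[0] = TOKEN_TTL_SECONDS + 1
third = manager.refresh_token("github_api")    # expired, fetched again
print(first, second, third)
```

Because domain-level skills only ever call `refresh_token`, none of them needs to hold a credential directly.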

The Boundary of Autonomy and the Expertise Gap

While standardizing capabilities via SKILL.md is essential, I believe it is critical for architects to also define where SKILL.md is not the right tool. My own perspective, based on recent project implementation, is that a common architectural failure is expecting SKILL.md to easily encode true Domain Expertise and Heuristic Judgment.

Offloading Heuristics vs. Offloading Wisdom

A well-defined SKILL.md is designed to be precise, measurable and standardized. It excels at offloading common known items, standard checklists and systematic patterns into reliable workflows (as seen in our Code Snippets 3 & 4). In my recent project, this precision made the skills function as excellent fixed checklists, significantly reducing operational ambiguity.

This same precision, however, means a skill can act only as a checklist. A procedural skill like exec-linter-code-analyzer can identify a syntax error based on a rule, but I found it often lacked the domain wisdom to understand the conceptual design decision that led to that error.

Assisting Expertise, Not Replacing It

Based on the experience so far, I believe that you cannot easily encode a senior engineer’s years of nuanced design thinking into a SKILL.md description. The true architectural value of a standardized specification is that it offloads the reliable execution complexity, allowing the Human Expert (or a high-level Agentic Persona) to focus entirely on core domain and design reasoning.

For now, I believe a model that defines three distinct pillars of knowledge will work out:

  1. Systematic Workflows (Procedural Skills): Handled perfectly by SKILL.md. (The “What to Do”)
  2. Conceptual Frameworks (Persona Mindsets): Set up by SKILL.md. (How Claude “Thinks”)
  3. Domain Wisdom & Design Reasoning: Passed as the problem context in the main prompt. (Why Claude “Decides”)

Engineering Best Practices for SKILL.md Mastery

Achieving systematic capability definition requires adhering to these foundational best practices:

  1. Strict Decoupling: Never place the execution logic (e.g., Python code) directly within the SKILL.md file. The SKILL.md is the specification; the scripts/ directory is the implementation.


  2. Immutability: Once a skill is deployed, treat its metadata (Name, Description, Commands) as immutable. Any significant change requires a new version (e.g., exec-raise-github-pr-v2). Brittleness often stems from changing definitions in place.


  3. Description as a Trigger: Never write a summary description (e.g., “This skill runs a linter”). It must be written as a trigger definition (e.g., “Triggers automatically after a context save event…”). Skill matching depends entirely on this accuracy.


  4. Token Economy: Adhere to strict size constraints: < 500 lines and < 5k tokens for the SKILL.md. The Progressive Disclosure pattern will handle heavier assets, keeping the SID itself focused and parseable.


  5. Git-Managed Context: Treat SKILL.md files as code. They must be version-controlled in Git, promoting discoverability, reuse and providing a traceable history of how capabilities have evolved throughout the lifecycle.
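
The Token Economy limits lend themselves to an automated check. Below is a minimal sketch, assuming a rough rule of thumb of about four characters per token. That ratio is only an approximation I picked for illustration, so a real check should use the tokenizer of the target model. The function name `check_skill_size` is hypothetical.

```python
MAX_LINES = 500
MAX_TOKENS = 5_000
CHARS_PER_TOKEN = 4  # rough assumption, not a real tokenizer


def check_skill_size(text):
    """Return (line_count, estimated_tokens, within_limits) for a SKILL.md string."""
    line_count = len(text.splitlines())
    estimated_tokens = len(text) // CHARS_PER_TOKEN
    within_limits = line_count < MAX_LINES and estimated_tokens < MAX_TOKENS
    return line_count, estimated_tokens, within_limits


small = "name: sdlc-pm-v1\n" * 10
large = "x" * 40_000
print(check_skill_size(small))  # (10, 42, True)
print(check_skill_size(large))  # (1, 10000, False)
```

Since SKILL.md files are Git-managed, a check like this fits naturally into a pre-commit hook or CI step.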

Final Thought: A Standard for Scaling Autonomy

By adopting the SKILL.md specification, we move from fuzzy conversational AI to a structured engineering discipline, where all agent capabilities, whether internal operational requirements, external user workflows or conceptual role frameworks, are defined by precise, version-controlled contracts.

This foundation standardizes reliable execution complexity, not only making your automated SDLC predictable and robust but also ensuring that precious domain expertise remains focused on main design decisions, not common patterns. Mastering the SKILL.md standard provides a definitive, interoperable foundation for building scalable, maintainable and engineering-grade Agentic AI architectures.


Probability Distribution – An aid to know the data

A probability distribution helps us understand the likelihood of the possible values that a random variable can take. It is a must-have piece of statistical knowledge for any data science aspirant.

probability-header

Some consider probability distributions to be as fundamental to statistics as data structures are to computer science.

In Layman’s Terms

Let’s say, you pick any 100 employees of an organization. Measure their heights (or weights). As you measure them, create a distribution of it on a graph. Keep height on X-Axis & frequency of a particular height on Y-Axis. With this, we will get a distribution for a range of heights.

This distribution will help know which outcomes are most likely, the spread of potential values, and the likelihood of different results.

Basic terminology

Random Sample

The set of 100 people selected in our example above is termed a random sample.

Sample Space

The range of possible heights of the 100 people is our sample space. It’s the set of all possible values in the setup.

Random Variable

The height of each person measured is termed a random variable. It’s a variable that randomly takes different values from the sample space.

Mean (Expected Value)

Let’s say the heights of those 100 people average out to 5 feet, 3 inches. This would be termed the expected value. It’s the average value of a random variable.

Standard deviation & Variance

Let’s say most of the people in those 100 have heights between 5 feet, 1 inch and 5 feet, 5 inches. This spread is what variance measures: the average squared deviation of the values from the expected value. Standard deviation is the square root of the variance.
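
As a quick illustration, the three quantities can be computed with NumPy. The five heights below are made-up numbers in inches, not real measurements. Note that `var()` and `std()` use the population formulas by default (dividing by n).

```python
import numpy as np

# Made-up heights in inches for a tiny sample (illustrative data only).
heights = np.array([61, 63, 63, 65, 63])

mean = heights.mean()        # expected value
variance = heights.var()     # average squared deviation from the mean
std_dev = heights.std()      # square root of the variance

print(mean, variance, std_dev)
```

Here the mean is 63 inches and the variance is 1.6, so the standard deviation is roughly 1.26 inches.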

Types of data
  • Ordinal – They have a meaningful order and can be ranked by relative strength (like ratings of low, medium & high). Numerical data can also be ordered this way.
  • Nominal – They cannot be ordered. Purely categorical data fall in this bucket. Like, colors – Red, Blue & Green – there cannot be an order or a sequence of high or low in them by itself.
  • Discrete – numerical data that can take only certain values (like soccer match score)
  • Continuous – numerical data that can take any real or fractional value (like height & weight)

In Continuous distribution, random variables can have an infinite range of possible outcomes

Probability Distribution Flowchart

The following diagram shares a few of the commonly used distributions:

Based on the above diagram, we will cover three distributions to get a broad understanding:

Uniform Distribution

It is the simplest form of distribution. Every outcome of the sample space has an equal probability of occurring. An example would be rolling a fair die, where each outcome from 1 to 6 is equally likely.

uniform-distribution
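
A small simulation makes the fair-die example tangible. This is a sketch with an arbitrary seed and sample size. With enough rolls, each face should show up close to one sixth of the time.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
rolls = rng.integers(1, 7, size=60_000)   # upper bound is exclusive

# Relative frequency of each face 1..6.
frequencies = np.bincount(rolls, minlength=7)[1:] / rolls.size
for face, freq in enumerate(frequencies, start=1):
    print(face, round(freq, 3))
```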
Normal (Gaussian) Distribution

The most common distribution. Many would recognize it as the ‘bell curve’. Most values lie around the mean, making the distribution symmetric.

The central limit theorem suggests that the sum of many independent random variables is approximately normally distributed.
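
The central limit theorem can be checked with a short simulation. In this sketch (seed and sizes are arbitrary choices), each experiment averages 30 draws from a uniform variable, which is far from bell-shaped on its own. Averaging is just a rescaled sum, so the same theorem applies. The averages should cluster symmetrically around 0.5, with about 68% of them within one standard deviation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 20,000 experiments, each averaging 30 draws from a uniform(0, 1) variable.
sample_means = rng.uniform(0, 1, size=(20_000, 30)).mean(axis=1)

mu = sample_means.mean()
sigma = sample_means.std()
within_one_sigma = np.mean(np.abs(sample_means - mu) < sigma)

print(round(mu, 3), round(sigma, 3), round(within_one_sigma, 3))
```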

normal-distribution

The area under the distribution curve is equal to 1 (all the probabilities must sum up to 1)

A parameter Mu (μ) drives the center of the distribution (mean). It corresponds to the position of the peak of the graph. A parameter Sigma (σ) corresponds to the range of variation (the standard deviation; its square is the variance).

standard-normal
Credit: Wikipedia

68–95–99.7 rule (empirical rule) – approximate percentage of the data covered by ranges defined by 1, 2, and 3 standard deviations from the mean
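
The three percentages follow directly from the normal distribution. For a normal variable, the probability of falling within k standard deviations of the mean is erf(k/√2). The sketch below evaluates this with Python’s standard library. The helper name `coverage` is my own.

```python
import math


def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(coverage(k) * 100, 2))
# 1 68.27
# 2 95.45
# 3 99.73
```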

Exponential Distribution

It is where a few outcomes are most likely with a rapid decrease in probability to all other outcomes. An example of it would be a car battery life in months.

exponential-graph

A parameter Beta (β) is the scale, which defines both the mean and the standard deviation of the distribution. A parameter Lambda (λ) is the rate, which controls how quickly the probability decays. The two are reciprocals of each other: β = 1/λ.
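
To see how scale and rate relate, the sketch below draws samples from an exponential distribution with NumPy, which is parameterised by the scale. The rate value is an arbitrary choice for illustration. Both the sample mean and the sample standard deviation should land near 1/rate.

```python
import numpy as np

rate = 0.5            # lambda: events per unit time (illustrative value)
scale = 1 / rate      # beta: mean time between events

rng = np.random.default_rng(seed=0)
samples = rng.exponential(scale=scale, size=100_000)

print(round(samples.mean(), 2), round(samples.std(), 2))
```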

Probability Distribution Choices

I came across an awesome representation of the probability distribution choices. It works as a cheat sheet to understand the provided data.

Wrap Up

Though the above is just an introduction, I believe it should be good enough to start with, and to correlate and understand some basics of machine learning algorithms. There will be more to it while working on algorithms and problems, analyzing data to predict trends, etc.


Keep learning!


Data Visualization – Insights with Matplotlib

Matplotlib is the most popular Python library for visualization when working on a machine learning problem. It helps in representing and analyzing the data and in working through insights.

matplotlib-machine-learning

Generally, it’s difficult to interpret much about data just by looking at it. But presenting the data in a visual form helps a great deal to peek into it. It becomes easy to deduce correlations and identify patterns & parameters of importance.

In the data science world, data visualization plays an important role around the data pre-processing stage. It helps in picking appropriate features and applying an appropriate machine learning algorithm. Later, it helps in representing the data in a meaningful way.

Data Insights via various plots

Where needed, we will use these datasets for plot examples and discussions. Based on the need, the following are the commonly used plots:

Line Chart | ax.plot(x,y)

It helps in representing a series of data points against a given range of a defined parameter. The real benefit comes from plotting multiple line charts in a single plot to compare and track changes.

Points next to each other are related, which helps to identify a repeated or defined pattern

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 1, 0.05)
y1 = x**2
y2 = x**3

plt.plot(x, y1,
    linewidth=0.5,
    linestyle='--',
    color='b',
    marker='o',
    markersize=10,
    markerfacecolor='red')

plt.plot(x, y2,
    linewidth=0.5,
    linestyle='dotted',
    color='g',
    marker='^',
    markersize=10,
    markerfacecolor='yellow')

plt.title('x Vs f(x)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend(['f(x)=x^2', 'f(x)=x^3'])
plt.xticks(np.arange(0, 1.1,0.2),
    ['0','0.2','0.4','0.6','0.8','1.0'])

plt.grid(True)
plt.show()
line-chart
Real world example:

We will work with dataset created from collating historical data for few stocks downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

stocksdf1 = pd.read_csv('data-files/stock-INTU.csv') 
stocksdf2 = pd.read_csv('data-files/stock-AAPL.csv') 
stocksdf3 = pd.read_csv('data-files/stock-ADBE.csv') 

stocksdf = pd.DataFrame()
stocksdf['date'] = pd.to_datetime(stocksdf1['Date'])
stocksdf['INTU'] = stocksdf1['Open']
stocksdf['AAPL'] = stocksdf2['Open']
stocksdf['ADBE'] = stocksdf3['Open']

plt.plot(stocksdf['date'], stocksdf['INTU'])
plt.plot(stocksdf['date'], stocksdf['AAPL'])
plt.plot(stocksdf['date'], stocksdf['ADBE'])

plt.legend(labels=['INTU','AAPL','ADBE'])
plt.grid(True)

plt.show()
line-chart-stocks

With the above, we have a couple of quick assessments:
Q: How did a particular stock fare over the last year?
A: Stocks were roughly rising till Feb 2020, then took a dip in April and have been back up since then.

Q: How did the three stocks behave during the same period?
A: The stock price of ADBE was the most sensitive and that of AAPL the least sensitive to the change during the same period.

Histogram | ax.hist(data, n_bins)

It helps in showing distributions of variables where it plots quantitative data with range of the data grouped into intervals.

We can use Log scale if the data range is across several orders of magnitude.

import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
# covariance matrix must be symmetric and positive semi-definite
cov = [[2, 4], [4, 9]]
xn, yn = np.random.multivariate_normal(
                                mean, cov, 100).T

plt.hist(xn, bins=25, label="Distribution on x-axis")

plt.xlabel('x')
plt.ylabel('frequency')
plt.grid(True)
plt.legend()
plt.show()
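
To illustrate the log-scale remark, here is a small sketch using synthetic data that spans several orders of magnitude. The lognormal parameters and the number of bins are arbitrary choices. The bin edges are spaced logarithmically so that the bars have equal width on the log axis.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic data spanning several orders of magnitude.
data = rng.lognormal(mean=3, sigma=2, size=1_000)

# Log-spaced bin edges so each bar covers an equal width on a log axis.
bins = np.logspace(np.log10(data.min()), np.log10(data.max()), 25)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(data, bins=bins)
ax.set_xscale('log')
ax.set_xlabel('x (log scale)')
ax.set_ylabel('frequency')
ax.grid(True)
```

In an interactive session, drop the backend line and call plt.show() to display the figure.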
Real world example

We will work with dataset of Indian Census data downloaded from here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
df = populationdf[mask1 & mask2]

plt.hist(df['TOT_P'], label='Distribution')

plt.xlabel('Total Population')
plt.ylabel('State Count')
plt.yticks(np.arange(0,20,2))

plt.grid(True)
plt.legend()
histogram-state-pop

With the above, couple of quick assessments about population in states of India:
Q: What’s the general population distribution of states in India?
A: More than 50% of states have population less than 2 crores (20 million)

Q: How many states are having population more than 10 crores (100 million)?
A: Only 3 states have that high a population.

Bar Chart | ax.bar(x_pos, heights)

It helps in comparing two or more variables by displaying values associated with categorical data.

It is the plot most commonly used in media to share survey data, with a bar for every data sample.

import numpy as np
import matplotlib.pyplot as plt

data = [[60, 45, 65, 35],
        [35, 25, 55, 40]]

x_pos = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_xticks(x_pos)

ax.bar(x_pos - 0.1, data[0], color='b', width=0.2)
ax.bar(x_pos + 0.1, data[1], color='g', width=0.2)

ax.yaxis.grid(True)
bar-chart
Real world example

We will work with dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
statesdf = populationdf[mask1 & mask2]
statesdf = statesdf.sort_values('TOT_P')

plt.figure(figsize=(10,8))
plt.barh(range(len(statesdf)), 
    statesdf['TOT_P'], tick_label=statesdf['Name'])
plt.grid(True)
plt.title('Total Population')
plt.show()
bar-chart-state-pop

With the above, couple of quick assessments about population in states of India:
– Uttar Pradesh has the highest total population and Lakshadweep has the lowest
– The relative population across states, with Uttar Pradesh having almost double the population of the second most populated state

Pie Chart | ax.pie(sizes, labels=[labels])

It helps in showing the percentage (or proportional) distribution of categories at a certain point of time. Usually, it works well if it’s limited to single digit categories.

A circular statistical graphic where the arc length of each slice is proportional to the quantity it represents.

import numpy as np
import matplotlib.pyplot as plt

# Slices will be ordered and plotted counter-clockwise
labels = ['Audi','BMW','LandRover','Tesla','Ferrari']
sizes = [90, 70, 35, 20, 25]

fig, ax = plt.subplots()
ax.pie(sizes,labels=labels, autopct='%1.1f%%')
ax.set_title('Car Sales')
plt.show()
pie-chart
Real world example

We will work with dataset of Alcohol Consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

labels = ['Beer', 'Spirit', 'Wine']
sizes = [drinksdf['beer'].sum(), 
         drinksdf['spirit'].sum(), 
         drinksdf['wine'].sum()]

fig, ax = plt.subplots()
explode = [0.05,0.05,0.2]
ax.pie(sizes,explode=explode,
    labels=labels, autopct='%1.1f%%')

ax.set_title('Alcohol Consumption')
plt.show()
pie-chart-drinks

With the above, we can have a quick assessment of how overall alcohol consumption is split across beer, spirit and wine. This view helps if we have a small number of slices (categories).

Scatter plot | ax.scatter(x_points, y_points)

It helps in representing paired numerical data, either to compare how one variable is affected by another or to see how the values of multiple dependent variables are spread for each value of an independent variable.

Sometimes the data points in a scatter plot form distinct groups, which are called clusters.

import numpy as np
import matplotlib.pyplot as plt

# random but focused cluster data
x1 = np.random.randn(100) + 8
y1 = np.random.randn(100) + 8
x2 = np.random.randn(100) + 3
y2 = np.random.randn(100) + 3

x = np.append(x1,x2)
y = np.append(y1,y2)

plt.scatter(x,y, label="xy distribution")
plt.legend()
scatter-plot
Real world example
  1. We will work with dataset of Alcohol Consumption downloaded from here.
import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# parentheses keep the multi-line sum as a single expression
drinksdf['total'] = (drinksdf['beer']
                     + drinksdf['spirit']
                     + drinksdf['wine']
                     + drinksdf['alcohol'])

# drinksdf.corr() tells beer and alcohol 
# are highly correlated
fig = plt.figure()

# Compare beer and alcohol consumption
# Use color to show a third variable.
# Can also use size (s) to show a third variable.
scat = plt.scatter(drinksdf['beer'], 
                   drinksdf['alcohol'], 
                   c=drinksdf['total'], 
                   cmap=plt.cm.rainbow)

# colorbar to explain the color scheme
fig.colorbar(scat, label='Total drinks')

plt.xlabel('Beer')
plt.ylabel('Alcohol')
plt.title('Comparing beer and alcohol consumption')
plt.grid(True)
plt.show()
scatter-plot-drinks

With the above, we can have a quick assessment that beer and alcohol consumption have a strong positive correlation, which would suggest a large overlap of people who drink beer and alcohol.
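
What the eye reads as a “strong positive correlation” can be quantified with the Pearson correlation coefficient. The sketch below uses synthetic data as a stand-in. It is not the drinks dataset, and the slope and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for two related columns (not the drinks data).
x = rng.uniform(0, 10, size=200)
y = 2 * x + rng.normal(0, 1, size=200)

# Pearson correlation coefficient: +1 is a perfect positive linear relation.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```

A value close to +1 matches a scatter plot in which the points hug an upward-sloping line.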

2. We will work with dataset of Mall Customers downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

malldf = pd.read_csv('data-files/mall-customers.csv',
                skiprows=1, 
                names = ['customerid', 'genre', 
                         'age', 'annualincome', 
                         'spendingscore'])

plt.scatter(malldf['annualincome'], 
            malldf['spendingscore'], 
            marker='p', s=40, 
            facecolor='r', edgecolor='b', 
            linewidth=2, alpha=0.4)

plt.xlabel("Annual Income")
plt.ylabel("Spending Score (1-100)")
plt.grid(True)
scatter-plot-mall

With the above, we can have a quick assessment that there are five clusters, and thus five segments or types of customers one can plan for.

Box Plot | ax.boxplot([data list])

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It only shows summary statistics such as the median, the interquartile range and outliers.

It is easy to identify whether data is symmetrical, how tightly it is grouped, and if and how data is skewed.

import numpy as np
import matplotlib.pyplot as plt

# some random data
data1 = np.random.normal(0, 2, 100)
data2 = np.random.normal(0, 4, 100)
data3 = np.random.normal(0, 3, 100)
data4 = np.random.normal(0, 5, 100)
data = list([data1, data2, data3, data4])

fig, ax = plt.subplots()
bx = ax.boxplot(data, patch_artist=True)

ax.set_title('Box Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

colors = ['pink','lightblue','lightgreen','yellow']
for patch, color in zip(bx['boxes'], colors):
    patch.set_facecolor(color)

ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()
box-plot
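
To show what a box plot actually summarises, the sketch below computes the same statistics by hand with NumPy for a small made-up dataset. The 1.5 × IQR whisker rule used here is the common convention and Matplotlib’s default.

```python
import numpy as np

# Made-up values with one obvious outlier (illustrative data only).
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1   # height of the box

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr, outliers)
```

For this data the box spans 3.25 to 7.75 with the median at 5.5, and the value 50 is flagged as an outlier.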
Real world example

We will work with dataset of Tips downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.boxplot(x="time", y="tip", 
            hue='sex', data=tipsdf, 
            order=["Dinner", "Lunch"],
            palette='coolwarm')
box-plot-tips

With the above, we can have a quick couple of assessments:
– males tend to tip more than females
– tips at dinner time vary a lot more, especially for males

Violin Plot | ax.violinplot([data list])

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It shows the full distribution of data.

A quick way to compare distributions across multiple variables

import numpy as np
import matplotlib.pyplot as plt

data = [np.random.normal(0, std, size=100) 
        for std in range(2, 6)]

fig, ax = plt.subplots()
bx = ax.violinplot(data)

ax.set_title('Violin Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

ax.set_xticks([1,2,3,4])
ax.set_xticklabels(xticklabels)

ax.yaxis.grid(True)
plt.show()
violin-plot
Real world example
  1. We will work with dataset of Tips downloaded from here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
# no hue variable here, so the split option is not needed
sns.violinplot(x="day", y="tip", data=tipsdf)
violin-plot-tips

With the above, we can have a quick assessment that tips on Saturday have a wider distribution, whereas Friday has a much narrower distribution in comparison.

2. We will work with dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='DISTRICT'
mask2 = populationdf['TRU']!='Total'
statesdf = populationdf[mask1 & mask2]

maskUP = statesdf['State']==9
maskM = statesdf['State']==27
data = statesdf.loc[maskUP | maskM]

sns.violinplot(x='State', y='P_06', 
               inner='quartile', hue='TRU',  
               palette={'Rural':'green','Urban':'blue'}, 
               scale='count', split=True, 
               data=data, size=6)

plt.title('In districts of UP and Maharashtra')
plt.show()
violin-plot-child

With the above, we can make a couple of quick assessments:
– Uttar Pradesh has a high volume and a wide distribution of rural child population.
– Maharashtra has an almost equal spread of rural and urban child population.

Heatmap

It helps in representing data in a 2-D matrix form, using variation of color for different values. The variation of color may be in hue or in intensity.

It is generally used to visualize a correlation matrix, which in turn helps in feature (variable) selection.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create 2D array
array_2d = np.random.rand(4, 6)
sns.heatmap(array_2d, annot=True)
heatmap
Real world example
  1. We will work with the Alcohol Consumption dataset downloaded from here.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

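# correlate the numeric columns only (country and continent are text)
drinks_cr = drinksdf.select_dtypes('number').corr()
sns.heatmap(drinks_cr, annot=True, cmap='YlGnBu')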
heatmap-drinks

With the above, we can make a couple of quick assessments:
– there is a strong correlation between beer and alcohol, and thus a strong overlap between the two.
– wine and spirit are almost uncorrelated, so it would be rare to find a place where wine and spirit consumption are equally high. One would be preferred over the other.

Notice that the upper and lower halves along the diagonal are the same: the correlation of A with B is the same as that of B with A. Further, the correlation of A with itself is always 1. In such a case, we can make a small tweak to show only one half, which is more presentable and avoids confusion.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv(
    'data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

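# correlation (numeric columns only) and mask for the upper triangle
drinks_cr = drinksdf.select_dtypes('number').corr()
drinks_mask = np.triu(drinks_cr)

# drop the first row and the last column, which would be fully masked
drinks_cr = drinks_cr.iloc[1:,:-1]
drinks_mask = drinks_mask[1:, :-1]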

sns.heatmap(drinks_cr, 
        mask=drinks_mask,
        annot=True,
        cmap='coolwarm')
heatmap-masked

It is the same correlation data, but only the needed half is shown.

Data Image

It helps in displaying data as an image, i.e. on a 2D regular raster.

Images are internally just arrays. Any 2D numpy array can be displayed as an image.

import numpy as np
import matplotlib.pyplot as plt

M,N = 25,30
data = np.random.random((M,N)) 
plt.imshow(data)
data-image
Real world example
  1. Let’s read an image and then try to display it back to see how it looks
import cv2
import matplotlib.pyplot as plt

img = cv2.imread('data-files/babygroot.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# print(img.shape)
# output => (500, 359, 3)

plt.imshow(img)
baby-groot

It read the image as an array and then drew it as a plot, which turned out to be the same as the image. Since images are like any other plot, we can draw other objects (like annotations) on top of them.
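As a minimal sketch of drawing on top of an image, the snippet below uses a random array as a stand-in for a real picture (an assumption, so that it runs without any image file) and points an annotation at its brightest pixel.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a real image: a 25x30 array of random intensities
rng = np.random.default_rng(0)
img = rng.random((25, 30))

# Locate the brightest pixel as (row, column)
row, col = np.unravel_index(np.argmax(img), img.shape)

fig, ax = plt.subplots()
ax.imshow(img, cmap='gray')

# imshow puts columns on the x-axis and rows on the y-axis
note = ax.annotate('brightest pixel',
                   xy=(col, row),
                   xytext=(5, 5),
                   color='yellow',
                   arrowprops=dict(color='yellow', arrowstyle='->'))
```

With a real image, the same annotate call can be used after plt.imshow(img); call plt.show() to display the result outside a notebook.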

SubPlots | fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

Generally, it is used in comparing multiple variables (in pairs) against each other. With multiple plots placed next to each other in the same figure, it helps in a quick assessment of correlation and distribution for a pair.

plt.subplots(nrows, ncols) takes the number of rows and the number of columns, and returns the figure along with a grid of axes. The related plt.subplot(nrows, ncols, index) adds one subplot at a time and takes the index of the subplot as its third parameter.

(Indexes are counted row-wise, starting with 1.)

The widths of the different subplots may differ with the use of GridSpec.
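A small sketch of the GridSpec idea, assuming a layout of one row with two panels where the first is three times wider than the second. The data and panel titles are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

x = np.linspace(0, 10, 50)

fig = plt.figure(figsize=(8, 3))
# One row, two columns; the first column is three times wider
gs = GridSpec(1, 2, figure=fig, width_ratios=[3, 1])

ax_wide = fig.add_subplot(gs[0, 0])
ax_narrow = fig.add_subplot(gs[0, 1])

ax_wide.plot(x, np.sin(x))
ax_wide.set_title('wide panel')

ax_narrow.hist(np.sin(x), orientation='horizontal')
ax_narrow.set_title('narrow panel')

fig.tight_layout()
```

A height_ratios argument works the same way for rows.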

import numpy as np
import matplotlib.pyplot as plt
import math

# data setup
x = np.arange(1, 100, 5)
y1 = x**2
y2 = 2*x+4
y3 = [ math.sqrt(i) for i in x]  
y4 = [ math.log(j) for j in x] 

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

ax1.plot(x, y1)
ax1.set_title('f(x) = quadratic')
ax1.grid()

ax2.plot(x, y2)
ax2.set_title('f(x) = linear')
ax2.grid()

ax3.plot(x, y3)
ax3.set_title('f(x) = square root')
ax3.grid()

ax4.plot(x, y4)
ax4.set_title('f(x) = log')
ax4.grid()

fig.tight_layout()
plt.show()
sub-plot

We can stack up an m x n view of the variables and have a quick look at how they are related. With the above, we can quickly assess that the variables in the second graph are linearly related.

Data Representation

Plot Anatomy

The picture below helps with plot terminology and representation:

matplotlib-plot-anatomy
Credit: matplotlib.org

The Figure shown above is the base space where the entire plot is drawn. Most of its elements can be customized for better representation. For specific details, look here.

Plot annotations

It helps in highlighting a few key findings or indicators on a plot. For advanced annotations, look here.

import numpy as np
import matplotlib.pyplot as plt

# A simple parabolic data
x = np.arange(-4, 4, 0.02)
y = x**2

# Setup plot with data
fig, ax = plt.subplots()
ax.plot(x, y)

# Setup axes
ax.set_xlim(-4,4)
ax.set_ylim(-1,8)

# Visual titles
ax.set_title('Annotation Sample')
ax.set_xlabel('X-values')
ax.set_ylabel('Parabolic values')

# Annotation
# 1. Highlighting specific data on the x,y data
ax.annotate('minimum of \n the parabola',
            xy=(0, 0),
            xycoords='data',
            xytext=(2, 3),
            arrowprops=dict(facecolor='red', shrink=0.04),
            horizontalalignment='left',
            verticalalignment='top')

# 2. Highlighting specific data on the x/y axis
bbox_yproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 70% of y-plot range',
            xy=(0, 0.7),
            xycoords='axes fraction',
            xytext=(0.2, 0.7),
            bbox=bbox_yproperties,
            arrowprops=dict(facecolor='green', shrink=0.04),
            horizontalalignment='left',
            verticalalignment='center')

bbox_xproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 40% of x-plot range',
            xy=(0.4, 0),
            xycoords='axes fraction',
            xytext=(0.1, 0.4),
            bbox=bbox_xproperties,
            arrowprops=dict(facecolor='blue', shrink=0.04),
            horizontalalignment='left',
            verticalalignment='center')

plt.show()
matplotlib-annotation

Plot style | plt.style.use('style')

It helps in customizing the representation of a plot, like colors, fonts, line thickness, etc. Default styles get applied if no customization is defined. Apart from ad hoc customization, we can also choose one of the pre-defined template styles and apply it.

# To list all the styles that ship with the package
for style in plt.style.available:
    print(style)

Solarize_Light2, _classic_test_patch, bmh, classic, dark_background, fast, fivethirtyeight, ggplot, grayscale, seaborn, seaborn-bright, seaborn-colorblind, seaborn-dark, seaborn-dark-palette, seaborn-darkgrid, seaborn-deep, seaborn-muted, seaborn-notebook, seaborn-paper, seaborn-pastel, seaborn-poster, seaborn-talk, seaborn-ticks, seaborn-white, seaborn-whitegrid, tableau-colorblind10

pre-defined styles available for use

More details around customization are here.

# To use a defined style for the plot
# (Matplotlib 3.6 renamed the seaborn styles with a 'seaborn-v0_8' prefix)
plt.style.use('seaborn')

# OR
with plt.style.context('Solarize_Light2'):
    plt.plot(np.sin(np.linspace(0, 2 * np.pi)), 'r-o')
plt.show()
matplotlib-style-ex
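To illustrate the difference between the two approaches, here is a minimal sketch showing that plt.style.context applies a style only inside the with block, whereas plt.style.use changes it for the rest of the session. The 'ggplot' style and the axes.facecolor setting are picked only as examples.

```python
import matplotlib.pyplot as plt

before = plt.rcParams['axes.facecolor']

# Inside the block, the 'ggplot' style overrides the current settings
with plt.style.context('ggplot'):
    inside = plt.rcParams['axes.facecolor']
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4], 'o-')

# Outside the block, the earlier settings are back
after = plt.rcParams['axes.facecolor']
print(before, inside, after)
```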

Saving plots | plt.savefig()

It helps in saving the figure with the plot as an image file with the defined parameters. Parameter details are here. A relative file name is saved to the current working directory.

plt.savefig('plot.png', dpi=300, bbox_inches='tight')
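A short sketch of saving the same figure in two formats. The temporary folder and the file names are assumptions made for illustration; any writable path works.

```python
import os
import tempfile
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_title('Sine wave')

# Hypothetical output location: a temporary folder
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'sine.png')
pdf_path = os.path.join(out_dir, 'sine.pdf')

# The file extension selects the output format
fig.savefig(png_path, dpi=150, bbox_inches='tight')
fig.savefig(pdf_path, bbox_inches='tight')

print(sorted(os.listdir(out_dir)))
```

Calling savefig on the figure object, rather than on plt, makes it explicit which figure is written when several are open.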

Additional Usages of plots

Data Imputation

It helps in filling missing data with reasonable values, as many statistical or machine learning packages do not work with data containing null values.

The interpolation can use pre-defined methods such as linear, quadratic or cubic.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# random data; values of 0.5 or more are blanked out as missing
df = pd.DataFrame(np.random.randn(20,1))
df = df.where(df<0.5)

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.plot(df)
ax1.set_title('f(x) = data missing')
ax1.grid()

ax2.plot(df.interpolate())
ax2.set_title('f(x) = data interpolated')
ax2.grid()

fig.tight_layout()
plt.show()
data-interpolate

With the above, we see the missing data replaced by interpolated values, which the DataFrame computes from the valid previous and next data points.
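To make the filled values easy to check by hand, here is a small sketch on a made-up series with known gaps. It compares the default linear interpolation with two simpler alternatives, forward fill and mean fill.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# Linear (the default): gaps are filled along a straight line
linear = s.interpolate(method='linear')
print(linear.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Simpler alternatives for comparison
forward = s.ffill()               # repeat the last valid value
mean_filled = s.fillna(s.mean())  # replace gaps with the mean
```

The 'quadratic' and 'cubic' methods are passed on to SciPy, so they need that package to be installed.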

Animation

At times, it helps to present the data as an animation. At a high level, the data is updated in a loop, with the small changes at each step translating into a moving view.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import animation

fig = plt.figure()

def f(x, y):
    return np.sin(x) + np.cos(y)

x = np.linspace(0, 2 * np.pi, 80)
y = np.linspace(0, 2 * np.pi, 70).reshape(-1, 1)

im = plt.imshow(f(x, y), animated=True)


def updatefig(*args):
    global x, y
    x += np.pi / 5.
    y += np.pi / 10.
    im.set_array(f(x, y))
    return im,

ani = animation.FuncAnimation(
    fig, updatefig, interval=100, blit=True)
plt.show()
animation

3-D Plotting

If needed, we can also have an interactive 3-D plot though it might be slow with large datasets.

import numpy as np
import matplotlib.pyplot as plt

def randrange(n, vmin, vmax):
    return (vmax-vmin)*np.random.rand(n) + vmin

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
n = 200
for c, m, zl in [('g', 'o', +1), ('r', '^', -1)]:
    xs = randrange(n, 0, 50)
    ys = randrange(n, 0, 100)
    zs = xs+zl*ys  
    ax.scatter(xs, ys, zs, c=c, marker=m)

ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_zlabel('Z data')
plt.show()
3d-plot

Cheat Sheet

A one-page summary of the key features for quick lookup or revision:

matplotlib-cheatsheet
Credit: DataCamp

Download the PDF version of cheatsheet from here.
For an overall reference and more details, see: https://matplotlib.org/

Entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look or play around: https://github.com/sandeep-mewara/data-visualization


Keep learning!


pandas – get started with examples

This post helps you get started with pandas and try a few concrete examples. pandas is a Python-based library that helps in reading, transforming, cleaning and analyzing data. It is built on the NumPy package.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

https://pandas.pydata.org


The key data structure in pandas is called DataFrame – it helps to work with tabular data, represented as rows of observations and columns of features.

Download or fork the entire Jupyter notebook from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

pandas basics includes:

  • Series
  • Dataframes
    • Create
      • from list of tuples
      • from a dictionary
      • from a CSV
      • from built-in dataset (eg: from sklearn.datasets)
    • Data retrieval
    • Modifying data
    • Group by operation
    • Custom Functions – apply method
    • Pre-Processing
      • drop, mean, mode
      • ordinal feature
      • nominal feature
    • Reshaping
      • CrossTab
      • Merge
      • Melt
      • Pivot
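As a quick taste of the create and reshape items listed above, here is a minimal sketch on made-up data. The student names, subjects and cities are hypothetical and are not taken from the notebook.

```python
import pandas as pd

# From a list of tuples (column names are supplied separately)
rows = [('Asha', 'Math', 82), ('Asha', 'Physics', 74),
        ('Ravi', 'Math', 65), ('Ravi', 'Physics', 91)]
marks = pd.DataFrame(rows, columns=['student', 'subject', 'score'])

# From a dictionary (keys become the column names)
cities = pd.DataFrame({'student': ['Asha', 'Ravi'],
                       'city': ['Pune', 'Delhi']})

# Pivot: long -> wide, one column per subject
wide = marks.pivot(index='student', columns='subject', values='score')

# Melt: wide -> long again
long = wide.reset_index().melt(id_vars='student',
                               var_name='subject',
                               value_name='score')

# Merge: join two tables on a shared column
merged = marks.merge(cities, on='student')

# CrossTab: frequency table of two columns
counts = pd.crosstab(merged['city'], merged['subject'])

print(wide)
```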

# .info(), .head() and .sample() are handy methods to use first with a dataframe to get high-level details

# index may not be unique – a lookup can return multiple values

# boolean indexing (masking) can help select a certain set of rows

# .isin() is useful when building a boolean index

# .where() is useful to retain the shape of the original table

# column names & indexes can be set if needed

# to modify the table right away, use inplace=True

# aggregate operations can be applied on a groupby object

# dropna(), mean() or mode() are handy ways for pre-processing missing data

Key learnings …
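A few of these learnings can be sketched on a tiny, made-up table of taxi fares. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['Pune', 'Delhi', 'Pune', 'Goa', 'Delhi'],
    'fare': [120.0, 300.0, np.nan, 80.0, 150.0],
})

# Boolean indexing: keep rows where the condition is True
costly = df[df['fare'] > 100]

# .isin() builds a boolean index from a list of allowed values
metros = df[df['city'].isin(['Pune', 'Delhi'])]

# .where() keeps the original shape; failing cells become NaN
shaped = df['fare'].where(df['fare'] > 100)

# Missing data: drop the rows, or fill with the mean
dropped = df.dropna()
filled = df['fare'].fillna(df['fare'].mean())

# Aggregation on a groupby object (NaN values are skipped)
avg_fare = df.groupby('city')['fare'].mean()
print(avg_fare)
```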

Examples notebook includes:

  • Uber taxi drivers
  • Apple stock price
  • Day or Night
  • Students marks
  • Balance Calculator

# .describe() is a handy method to get the statistical summary of numerical columns

# one-hot encoding is really helpful for nominal features (that cannot be ordered)

# converting the columns into the right datatype helps

# converting data into meaningful numbers helps in analysis

# groupby is a powerful tool for analysis with dataframes

Key learnings …
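The datatype conversion, summary and one-hot encoding points can be sketched as below. The table is made up for illustration and is not one of the notebook datasets.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'marks': ['72', '85', '91', '64'],        # numbers stored as text
    'shift': ['Day', 'Night', 'Day', 'Day'],  # nominal feature
})

# Convert the text column into a numeric datatype
df['marks'] = pd.to_numeric(df['marks'])

# Statistical summary of the numerical columns
summary = df.describe()

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=['shift'])

print(summary.loc['mean', 'marks'])   # 78.0
print(list(encoded.columns))
```

Without the conversion, describe() would not treat marks as a number, so the mean would not be reported.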

Cheat sheet

Credit: Pandas website

Download cheat sheet pdf from here
For more details about pandas, look at the documentation reference.

Keep learning!