Quick look into Machine Learning workflow

December 6, 2020 | December 13, 2020 | Sandeep Mewara

Before we jump into various ML concepts and algorithms, let’s take a quick look at the basic workflow for applying Machine Learning to a problem.

A short brief on Machine Learning and its association with the AI and Data Science world is here.

How does Machine Learning help?

Machine Learning is about training an algorithm that predicts an output based on past data. The input data can keep changing, and the algorithm can fine-tune itself accordingly to provide better output.

It has applications across a vast range of fields. For example, Google is using it to predict natural disasters like floods. A use we commonly hear about in the news these days is in politics, where it helps campaigns target specific voter demographics.

How does Machine Learning work?

Data is the key here. The more data there is, the better the algorithm can learn and fine-tune. For any problem output, there will be multiple factors at play, and some of them will have more effect than others. Analyzing and applying all such findings is part of a machine learning problem. Mathematically, ML expresses a problem output as a function of multiple input factors.

Y = f(x)
Y = predicted output
x = multiple factors as an input
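
As a minimal sketch of this idea (not part of the original notebook), the snippet below "learns" f from a handful of made-up (x, Y) pairs by fitting a straight line with ordinary least squares. The data and the helper name `fit_line` are hypothetical and only meant to illustrate what "output as a function of inputs" looks like in code.

```python
# Toy illustration of Y = f(x): learn a straight line from examples.
# The numbers are made up for illustration only.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]   # roughly Y = 2x + 1 with some noise

slope, intercept = fit_line(xs, ys)

def f(x):
    """The learned function: predicts Y for a new input x."""
    return slope * x + intercept

print(f"f(x) = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for x = 6: {f(6):.2f}")
```

With more input factors, x becomes a list of features per record, but the principle stays the same: the algorithm estimates the parameters of f from past data.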

How does a typical ML workflow look?

There is a structured way to apply ML to a problem. I tried to put the workflow in a pictorial view to make it easy to visualize and understand:

It goes in a cycle: once we have some insights from the published model, they go back into the funnel as learning to make the output better.

Roughly, data scientists spend 60% of their time on cleaning and organizing data.

Walk through an example?

Let’s use the dataset of Titanic survivors found here and run through the basic workflow to see how features like traveling class, sex, age and fare help us assess survival probability.

Load data from file

Data can come in various formats. The easiest is to have it in a CSV file and load it using pandas. More details on how to play with data are discussed here.

# let's load the dataset and look at its info
import pandas as pd

titanicdf = pd.read_csv("data-files/titanic.csv")
print(titanicdf.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB
None

# A quick sample view
titanicdf.head(3)

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	1	Allen, Miss. Elisabeth Walton	female	29	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.55	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON
2	1	0	Allison, Miss. Helen Loraine	female	2	1	2	113781	151.55	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON

Data cleanup – Drop the irrelevant columns

The data captured is rarely limited to what you need. You will usually have a superset with additional information that is not relevant to your problem statement. Such data acts as noise, so it’s better to clean the dataset before feeding it to any ML algorithm.

# there seems to be handful of columns which 
# we might not need for our analysis 
# lets drop all those column that we know 
# are irrelevant
titanicdf.drop(['embarked','body','boat','name',
'cabin','home.dest','ticket', 'sibsp', 'parch'],
axis='columns', inplace=True)

titanicdf.head(2)

	pclass	survived	sex	age	fare
0	1	1	female	29	211.3375
1	1	1	male	0.9167	151.55

Data analysis

There are various ways to analyze data. Based on the problem statement, we need to know the general trend of the data under discussion. Knowledge of statistics and probability distributions helps here. For insights into correlations and patterns, data visualization helps.

# let's see if there is any highly correlated data
# if we observe something, we will remove
# one of the features to avoid bias

import seaborn as sns
sns.pairplot(titanicdf)

# looking at graphs, seems we don't have 
# anything right away to remove.
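
The pair plot is a visual check. As a rough numeric counterpart, here is a small sketch that computes the Pearson correlation between feature columns in plain Python. The columns and the 0.9 threshold are assumptions made for illustration, not values from the Titanic file; in pandas, `DataFrame.corr()` gives the same kind of number for every pair of numeric columns.

```python
# Pearson correlation between two feature columns, in plain Python.
# The columns below are hypothetical values, not taken from the Titanic file.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

fare_a = [10.0, 20.0, 30.0, 40.0, 50.0]
fare_b = [21.0, 41.0, 61.0, 81.0, 101.0]   # exactly 2 * fare_a + 1
age = [30.0, 22.0, 45.0, 28.0, 35.0]

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"

for name, column in [("fare_b", fare_b), ("age", age)]:
    r = pearson(fare_a, column)
    flag = "drop one of the pair" if abs(r) > THRESHOLD else "keep both"
    print(f"fare_a vs {name}: r = {r:+.2f} -> {flag}")
```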

Data transform – Ordinal/Nominal/Datatype, etc

Data is easier to work with once it is converted into numbers (from strings) where possible. This lets us feed it into various statistical formulas to get more insights. More details on how to apply numerical modifications to data are discussed here.

# There seems to be 3 class of people, 
# lets represent class as numbers
# We don't have info of relative relation of 
# class so we will one-hot-encode it
titanicdf = pd.get_dummies(titanicdf,
                              columns=['pclass'])


# Lets Convert sex to a number 
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
titanicdf["sex"] = le.fit_transform(titanicdf.sex) 

titanicdf.head(2)

	survived	sex	age	fare	pclass_1	pclass_2	pclass_3
0	1	0	29	211.3375	1	0	0
1	1	1	0.9167	151.55	1	0	0

Data imputation: Fill the missing values

There is always some missing data or an outlier. Running algorithms with missing data can lead to inconsistent results or algorithm failure. Based on the context, we can choose to remove such records or fill/replace the values with an appropriate one.

# When we saw the info, lots of ages were missing
# Missing age values are filled with the mean age

titanicdf.loc[titanicdf["age"].isnull(), "age"] = (
    titanicdf["age"].mean())

# titanicdf.loc[ titanicdf["fare"].isnull(), "fare"] 
# = titanicdf["fare"].mean() 
# => can do but we will use another way 

titanicdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  1309 non-null   int64  
 1   sex       1309 non-null   int64  
 2   age       1309 non-null   float64
 3   fare      1308 non-null   float64
 4   pclass_1  1309 non-null   uint8  
 5   pclass_2  1309 non-null   uint8  
 6   pclass_3  1309 non-null   uint8  
dtypes: float64(2), int64(2), uint8(3)
memory usage: 44.9 KB

# When we saw the info,
# 1 fare was missing
# Lets drop that one record
titanicdf.dropna(inplace=True)
titanicdf.info()

#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 1308 entries, 0 to 1308
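
The mean is only one possible fill value. As a small sketch of the choice involved, the snippet below fills gaps in a made-up list of ages with both the mean and the median, to show how an outlier shifts the mean. The values are hypothetical; in pandas the same idea would be `titanicdf["age"].fillna(titanicdf["age"].median())`.

```python
# Filling missing values (None) with the mean or the median of the known ones.
# The ages are made-up values for illustration.
from statistics import mean, median

ages = [22.0, None, 38.0, 26.0, None, 80.0]

known = [a for a in ages if a is not None]
mean_age = mean(known)      # pulled upwards by the outlier 80
median_age = median(known)  # more robust to the outlier

filled_with_mean = [a if a is not None else mean_age for a in ages]
filled_with_median = [a if a is not None else median_age for a in ages]

print("mean fill:  ", filled_with_mean)
print("median fill:", filled_with_median)
```

Which one fits better depends on the context: for a skewed column the median is often the safer default.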

Normalize training data

At times, features in the data are on different scales. In such cases, if the data is not normalized, the algorithm can be biased towards the feature with the higher magnitude. E.g., feature A ranges over 0-10 and feature B over 0-10000. Even though a small change in A can make a real difference, without normalization feature B will influence the results more (which may not reflect reality).

X = titanicdf
y = X['survived']
X = X.drop(['survived'], axis=1)

# Scales each column to have 0 mean and 1 std.dev
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
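
To see what this scaling does, here is a minimal sketch in plain Python that standardizes two made-up features on very different scales. The helper `standardize` is hypothetical; it uses the population standard deviation, which, as far as I know, is also what `preprocessing.scale` uses.

```python
# Standardizing two features that live on very different scales.
# Values are hypothetical; pstdev is the population standard deviation.
from statistics import mean, pstdev

def standardize(values):
    """Shift to zero mean and rescale to unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

feature_a = [2.0, 4.0, 6.0, 8.0]              # range roughly 0-10
feature_b = [2000.0, 4000.0, 6000.0, 8000.0]  # range roughly 0-10000

scaled_a = standardize(feature_a)
scaled_b = standardize(feature_b)

print([round(v, 3) for v in scaled_a])
print([round(v, 3) for v in scaled_b])
```

After scaling, both features end up with the same values, so neither dominates a distance-based algorithm just because of its units.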

Split data – Train/Test dataset

It’s always best to split the dataset into two unequal parts: the bigger one to train the algorithm and the smaller one to test the trained algorithm. This way, the algorithm is evaluated on data it has not seen, and the results on the test data give a more honest picture.

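from sklearn.model_selection import train_test_split

# preprocessing.scale returns a NumPy array; wrap it in a DataFrame
# again so that column names and .info() stay available
# (the scaled columns are all float64; the listing below shows the original dtypes)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
X_train.info()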

<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 545 to 864
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sex       981 non-null    int64  
 1   age       981 non-null    float64
 2   fare      981 non-null    float64
 3   pclass_1  981 non-null    uint8  
 4   pclass_2  981 non-null    uint8  
 5   pclass_3  981 non-null    uint8  
dtypes: float64(2), int64(1), uint8(3)
memory usage: 33.5 KB
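
The 981 training rows above are 75% of the 1308 records, which matches the default 25% test share of `train_test_split`. As a sketch of what happens behind that call, here is a hand-rolled split in plain Python. `split_rows` is a hypothetical helper, not a scikit-learn function, and the seed plays the role that `random_state` plays in scikit-learn.

```python
# A hand-rolled train/test split to show the idea behind train_test_split.
# `split_rows` is a hypothetical helper, not a scikit-learn function.
import random

def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))          # stand-in for 20 records
train, test = split_rows(rows)

print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed makes the split repeatable, which helps when comparing different models on the same data.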

Run ML algorithm on data

Once we have our training dataset ready as per our need, we can apply machine learning algorithms and find which model fits best.

# for now, picking any one of the classifiers - KNN
# Ignore details or syntax for now
from sklearn.neighbors import KNeighborsClassifier

dtc = KNeighborsClassifier(n_neighbors=5)
dtc.fit(X_train,y_train)

Check the accuracy of model

To validate the model, we use the test dataset: comparing the values predicted by the model with the actual data tells us how accurate the model is.

import sklearn.metrics as met

pred_knc = dtc.predict(X_test)
print( "Nearest neighbors: %.3f" 
      % (met.accuracy_score(y_test, pred_knc)))

Nearest neighbors: 0.817

Voila! With this basic workflow, we have a model that predicts survival with more than 80% accuracy on the test data.
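
For clarity on what that accuracy number means, here is a small sketch that computes it by hand on made-up labels and also breaks the result into the four kinds of outcomes. The label lists are hypothetical and not taken from the Titanic run.

```python
# Accuracy = share of predictions that match the true labels.
# Labels below are made up: 1 = survived, 0 = did not survive.
y_true = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Break the result down further: which kind of mistakes were made?
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(f"accuracy: {accuracy:.3f}")
print(f"TP={true_pos} TN={true_neg} FP={false_pos} FN={false_neg}")
```

Accuracy alone can hide which type of error dominates, so looking at these four counts is a useful habit.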

Download

Entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look or play around: https://github.com/sandeep-mewara/machine-learning

Currently, it covers examples on following datasets:

Titanic Survivors
Sci-kit Iris
Sci-kit Digits
Bread Basket Bakery

Over time, I will continue building on the same repository, adding more samples with different algorithms.

Closure

I believe it’s now pretty clear how we can attack any problem for an insight using Machine Learning. Try it out and see for yourself.

We can apply Machine Learning to multiple problems in multiple fields. I have shared a pictorial view of sectors in the AI section that are already leveraging its benefits.

Keep learning!

samples GitHub Profile Readme

Sandeep Mewara Github
Sandeep Mewara Learn By Insight
Matplotlib plot samples
Sandeep Mewara Github Repositories

Machine Learning workflow Microsoft .NET5

Learn AI ML with Netflix’s Fei Fei!

November 1, 2020 | November 1, 2020 | Sandeep Mewara

Of late, Microsoft has been working on providing more and more learning materials. This is part of their Global Skills Initiative, aimed at helping 25 million people worldwide acquire new digital skills.

Our goal is to ignite the passion to solve important problems relevant to their lives, families and communities.
Microsoft

Last week, they partnered with Netflix to release a new learning experience featuring a young female hero Fei Fei, who has a passion for science and explores space.

Newly launched

Microsoft has launched three new modules under the Explore Space with “Over the Moon” learning path. These modules help learn basic concepts of data science, artificial intelligence and machine learning:

Plan a Moon Mission using the Python Pandas Library
Predict Meteor Showers using Python and VS Code
Use AI to Recognize Objects in Images using Azure Custom Vision

The movie’s story takes place in a beautifully animated universe and tackles problems real-life space engineers face.

How is it?

They cover the basic workflow of a machine learning problem, similar to what I shared in my previous article here.

The exercises also provide a professional experience as they are built on Visual Studio Code and use Azure Cognitive Services.

Looking at it, it seems a fun way to learn and explore the data world. Microsoft is really trying to make things simple and available to all. It’s an effort worth trying out, and I would recommend giving it a shot.

Keep learning!

Machine Learning workflow

Dwell in Data Science world

October 10, 2020 | October 12, 2020 | Sandeep Mewara

Let’s walk into the Data Science world and see what it means and how it connects with other terms that we often hear in this context: Artificial Intelligence & Machine Learning.

What is Data Science?

In the context of a particular domain, data science is about gaining insights into data through statistics, computation and visualization. The diagram below represents this multi-focal field well:

For a real world problem, based on statistics, we make certain assumptions. Based on assumptions, we make a learning model using mathematics. With this model, we make software that can help validate and solve the problem. This leads to solving complex problems faster and more accurately.

How to use Data Science?

Clearly, it can be defined as a process. When followed, it helps move towards the destination with proper next steps:

There are multiple iterative stages, each solving specific queries. Based on the answers to those queries, we might have to circle back and reassess the steps of the entire process.

Various stages involved?

Once we have a defined process, it’s easier to break it down into different functional groups. It helps us interpret how to visualize and connect them, and know who can help us at each of those stages:

what-is-data-science — Credit: Rubens Zimbres article

There is a strong correlation between business intelligence and data science here. Current advancements in algorithms and tools have helped us improve accuracy at each of the stages above.

Where does AI or ML fit in?

Data Science, Artificial Intelligence & Machine Learning are different, but the terms are often used interchangeably. There are overlaps, and one part is common to all of them:

At a high level, part of data science provides processed data. AI, ML or DL helps process that data to get the needed output.

Artificial Intelligence (AI)

It is a program that enables a machine to mimic human behavior. As such, the goal here is to understand human intelligence, learn to imitate it and act accordingly. I came across a good AI exhibit by BCG:

ai-bcg-analysis — Credit: BCG Group tweet

Self-driving cars and route change suggestions are a few common AI solutions.

Machine Learning (ML)

These are AI techniques that enable machines to learn from examples, without being explicitly programmed to do so. They incorporate mathematics and statistics to learn from data and improve with experience.

Recommendation engines & spam email classification are a few common ML solutions.

I will cover Machine Learning in much more detail in later posts.

Deep Learning (DL)

This is a subset of ML built on multi-layer neural networks. It typically needs far less manual feature selection/engineering, since the network learns useful features from the data. Such models classify information in a way loosely inspired by the human brain, and they often continue to improve as the size of your data increases.

Face detection and number recognition are a few common DL solutions.

Moving On …

Overall, Data Science is more about extracting insights to build a robust strategy for a business solution and is different from AI or ML.

To read more about the differences between Data Science and the other terms, see what Vincent Granville has shared in detail here.

With the above, we have entered the Data Science world. Going forward, I will concentrate more on the Machine Learning aspect for now.

Keep exploring!

Data Science Virtual Conference 2020

October 2, 2020 | October 2, 2020 | Sandeep Mewara

Open Data Science Conference (ODSC) is one of the biggest specialized data science events coming up, scheduled for December 8-9, 2020. Read details about the event here. The theme for the year is RETHINK AI.

The virtual conference is spread across two days:
– December 8th – ODSC India
– December 9th – ODSC Asia & Pacific

The Largest Applied Data Science & AI Conference Returns to India!

The call for speakers is open for submission. Registration is open to book a seat.

Learn the latest AI & data science topics, tools, and languages from some of the best and brightest minds in the field.
Event highlight shared by ODSC team

The conference promises to accelerate your knowledge of data science and related disciplines with insightful sessions, workshops and speakers from various fields. Speakers include core contributors to many open source libraries and languages.

Happy connecting!

Probability Distribution – An aid to know the data

September 27, 2020 | October 19, 2020 | Sandeep Mewara

A probability distribution helps us understand the likelihood of the possible values that a random variable can take. It is a must-have piece of statistical knowledge for any data science aspirant.

Some consider probability distributions to be as fundamental to statistics as data structures are to computer science.

In layman’s terms

Let’s say you pick any 100 employees of an organization and measure their heights (or weights). As you measure them, plot the distribution on a graph: keep height on the X-axis and the frequency of a particular height on the Y-axis. With this, we get a distribution over a range of heights.

This distribution will help us know which outcomes are most likely, the spread of potential values, and the likelihood of different results.

Basic terminology

Random Sample

The set of 100 people selected in our example above is termed a random sample.

Sample Space

The range of possible heights of the 100 people is our sample space. It’s the set of all possible values in the setup.

Random Variable

The height measured for each of the 100 people is termed a random variable. It’s a variable that randomly takes different values from the sample space.

Mean (Expected Value)

Let’s say the heights of those 100 people average out to 5 feet, 3 inches. This would be termed the expected value. It’s the average value of a random variable.

Standard deviation & Variance

Let’s say most of those 100 people have a height between 5 feet, 1 inch and 5 feet, 5 inches. This spread is what variance captures: it’s the average squared distance of the values from the expected value. Standard deviation is the square root of the variance.
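
As a quick sketch of these three terms, the snippet below computes them for a small made-up sample of heights in inches (63 inches being 5 feet, 3 inches). The sample is hypothetical and treated as the whole population, hence `pvariance` and `pstdev`.

```python
# Expected value, variance and standard deviation of a small sample.
# Heights (in inches) are made-up values for illustration.
from statistics import mean, pvariance, pstdev

heights = [61, 62, 63, 63, 63, 64, 65]

expected_value = mean(heights)
variance = pvariance(heights)     # average squared distance from the mean
std_dev = pstdev(heights)         # square root of the variance

print(f"mean     = {expected_value}")
print(f"variance = {variance:.3f}")
print(f"std dev  = {std_dev:.3f}")
```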

Types of data

Ordinal – They have a meaningful order. All numerical data falls in this bucket, as it can be ordered by relative numerical strength.
Nominal – They cannot be ordered. All categorical data falls in this bucket. For example, the colors Red, Blue & Green have no inherent order or sequence of high or low.
Discrete – ordinal data that can take only certain values (like a soccer match score)
Continuous – ordinal data that can take any real or fractional value (like height & weight)

In a continuous distribution, random variables can have an infinite range of possible outcomes

Probability Distribution Flowchart

The following diagram shares a few of the common distributions used:

distribution-common — Credit: cloudera blog

Based on the above diagram, I will cover three distributions to give a broad understanding:

Uniform Distribution

It is the simplest form of distribution. Every outcome in the sample space has an equal probability of happening. An example would be rolling a fair die, where each outcome from 1 to 6 is equally likely.
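
A quick way to see this is to simulate it. The sketch below rolls a simulated fair die many times and prints the share of each face; the seed and the number of rolls are arbitrary choices for illustration.

```python
# Simulating a fair die: every face should show up about 1/6 of the time.
import random
from collections import Counter

rng = random.Random(0)           # fixed seed so the run is repeatable
n_rolls = 60_000
rolls = [rng.randint(1, 6) for _ in range(n_rolls)]

counts = Counter(rolls)
for face in range(1, 7):
    share = counts[face] / n_rolls
    print(f"face {face}: {share:.3f}")
```

Each share should land close to 1/6 (about 0.167), and the match gets tighter as the number of rolls grows.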

Normal (Gaussian) Distribution

The most common distribution. Many would recognize it as the ‘bell curve’. Most values lie around the mean, and the distribution is symmetric about it.

The central limit theorem suggests that the sum of many independent random variables is approximately normally distributed

The area under the distribution curve is equal to 1 (all the probabilities must sum up to 1)

A parameter mu (μ) sets the center of the distribution (the mean). It corresponds to the position of the peak of the graph. A parameter sigma (σ) corresponds to the range of variation (the standard deviation; the variance is σ²).

68–95–99.7 rule (empirical rule) – approximate percentage of the data covered by ranges defined by 1, 2, and 3 standard deviations from the mean
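
The three percentages in the rule can be checked directly. For a normal variable, the probability of falling within k standard deviations of the mean is erf(k / √2), so a short sketch with the standard library is enough:

```python
# Share of a normal distribution within k standard deviations of the mean.
# For a normal variable, P(|X - mu| < k * sigma) = erf(k / sqrt(2)).
from math import erf, sqrt

def share_within(k):
    """Probability mass within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k) * 100:.1f}%")
```

Note that the result does not depend on the actual values of the mean and standard deviation, which is why the rule applies to every normal distribution.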

Exponential Distribution

It is where a few outcomes are most likely with a rapid decrease in probability to all other outcomes. An example of it would be a car battery life in months.

A parameter beta (β) is the scale, which equals both the mean and the standard deviation of the distribution. A parameter lambda (λ) is the rate, with λ = 1/β.
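
To make the relation between the two parameters concrete, here is a small sketch for the battery example. The mean life of 40 months is an assumed number for illustration only; the sample mean should come out close to the scale, and the chance of lasting beyond t months is exp(-λt).

```python
# Exponential distribution: rate (lambda) and scale (beta) are reciprocals.
# The battery-life numbers are hypothetical.
import random
from math import exp

beta = 40.0            # assumed mean battery life in months (scale)
lam = 1.0 / beta       # rate parameter

rng = random.Random(1)
samples = [rng.expovariate(lam) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

def survival(t):
    """Probability that a battery lasts longer than t months."""
    return exp(-lam * t)

print(f"sample mean ~ {sample_mean:.1f} months (theory: {beta})")
print(f"P(life > 60 months) = {survival(60):.3f}")
```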

Probability Distribution Choices

I came across an awesome representation of the probability distribution choices. It works as a cheat sheet to understand the provided data.

distibutional-choices — Credit: nyu.edu/adamodar

Wrap Up

Though the above is just an introduction, I believe it should be good enough to start with, and to correlate and understand some basics of machine learning algorithms. There will be more to it while working on algorithms and problems, for example when analyzing data to predict trends.

Keep learning!

Flood forecasting – new tech way!

September 20, 2020 | September 25, 2020 | Sandeep Mewara

Recently, Google opened up its Flood Forecasting Initiative that uses Artificial Intelligence to predict when and where floods will occur in India and Bangladesh. They worked with governments to develop systems that predict floods and thus keep people safe and informed.

Google now covers 200 million people living in more than 250,000 square kilometers in India.

This topic was also touched upon in the Decode with Google event last week.

Initiative Plan

Google started this initiative back in 2018.

Floods are devastating natural disasters worldwide—it’s estimated that every year, 250 million people around the world are affected by floods, causing around $10 billion in damages.

The plan was to use AI and create forecasting models based on:

historical events
river level readings
terrain and elevation of an area

An inside look at the flood forecasting was published here that covers:
1. The Inundation Model
2. Real time water level measurements
3. Elevation Map creation
4. Hydraulic modeling

Recent Improvements

The new approach devised for inundation modeling is called a morphological inundation model. It combines physics-based modeling with machine learning to create more accurate and scalable inundation models in real-world settings.

This new forecasting system covers:
1. Forecasting Water Levels
2. Morphological Inundation Modeling
3. Alert targeting
4. Improved Water Levels Forecasting

Have a read of the following blog for full details.

Current State

As shared here, they partnered with the Indian Central Water Commission to expand forecasting models and services. For research, they have collaborated with Yale to visit flood-affected areas. This helps them understand how to provide information and what information people need to protect themselves.

We’re providing people with information about flood depth: when and how much flood waters are likely to rise. And in areas where we can produce depth maps throughout the floodplain, we’re sharing information about depth in the user’s village or area.

To increase the reach of its alerts, Google.org has started a collaboration with the International Federation of Red Cross and Red Crescent Societies.

My Thoughts

It’s a great use of technology to help mankind. Floods are life-changing events, and early prediction and sharing of alerts would help everyone in a big way.

Awesome initiative, breakthroughs and progress!

Data Visualization – Insights with Matplotlib

September 13, 2020 | October 5, 2020 | Sandeep Mewara

While working on a machine learning problem, Matplotlib is the most popular Python library used for visualization. It helps in representing & analyzing the data and working through insights.

Generally, it’s difficult to interpret much about data just by looking at it. But presenting the data in a visual form helps a great deal to peek into it. It becomes easy to deduce correlations and identify patterns & parameters of importance.

In the data science world, data visualization plays an important role around the data pre-processing stage. It helps in picking appropriate features and applying an appropriate machine learning algorithm. Later, it helps in representing the data in a meaningful way.

Data Insights via various plots

If needed, we will use these datasets for plot examples and discussions. Based on the need, the following are the common plots used:

Line Chart | `ax.plot(x,y)`

It helps in representing a series of data points against a given range of a defined parameter. The real benefit is plotting multiple line charts in a single plot to compare and track changes.

Points next to each other are related, which helps to identify a repeated or defined pattern.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 1, 0.05)
y1 = x**2
y2 = x**3

plt.plot(x, y1,
    linewidth=0.5,
    linestyle='--',
    color='b',
    marker='o',
    markersize=10,
    markerfacecolor='red')

plt.plot(x, y2,
    linewidth=0.5,
    linestyle='dotted',
    color='g',
    marker='^',
    markersize=10,
    markerfacecolor='yellow')

plt.title('x Vs f(x)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend(['f(x)=x^2', 'f(x)=x^3'])
plt.xticks(np.arange(0, 1.1,0.2),
    ['0','0.2','0.4','0.6','0.8','1.0'])

plt.grid(True)
plt.show()

Real world example:

We will work with a dataset created by collating historical data for a few stocks downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

stocksdf1 = pd.read_csv('data-files/stock-INTU.csv') 
stocksdf2 = pd.read_csv('data-files/stock-AAPL.csv') 
stocksdf3 = pd.read_csv('data-files/stock-ADBE.csv') 

stocksdf = pd.DataFrame()
stocksdf['date'] = pd.to_datetime(stocksdf1['Date'])
stocksdf['INTU'] = stocksdf1['Open']
stocksdf['AAPL'] = stocksdf2['Open']
stocksdf['ADBE'] = stocksdf3['Open']

plt.plot(stocksdf['date'], stocksdf['INTU'])
plt.plot(stocksdf['date'], stocksdf['AAPL'])
plt.plot(stocksdf['date'], stocksdf['ADBE'])

plt.legend(labels=['INTU','AAPL','ADBE'])
plt.grid(True)

plt.show()

With the above, we have a couple of quick assessments:
Q: How did a particular stock fare over the last year?
A: Stocks were roughly rising till Feb 2020, took a dip in April, and have been back up since then.

Q: How did the three stocks behave during the same period?
A: The stock price of ADBE was the most sensitive and AAPL the least sensitive to the change during the same period.

Histogram | `ax.hist(data, n_bins)`

It helps in showing the distribution of a variable: it plots quantitative data with the range of the data grouped into intervals.

We can use a log scale if the data range spans several orders of magnitude.

import numpy as np
import matplotlib.pyplot as plt

mean = [0, 0]
cov = [[2, 4], [4, 9]]  # must be symmetric and positive semi-definite
xn, yn = np.random.multivariate_normal(
                                mean, cov, 100).T

plt.hist(xn,bins=25,label="Distribution on x-axis"); 

plt.xlabel('x')
plt.ylabel('frequency')
plt.grid(True)
plt.legend()
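
To make the "grouped into intervals" part concrete, here is a plain-Python sketch of the counting a histogram performs before anything is drawn. `bin_counts` is a hypothetical helper written for this illustration, and the data is made up.

```python
# What a histogram does under the hood: count values per equal-width bin.
# `bin_counts` is a hypothetical helper written for this illustration.
def bin_counts(values, n_bins):
    """Return (edges, counts) for n_bins equal-width bins over the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # the maximum value belongs to the last bin
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return edges, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = bin_counts(data, 4)

for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {'#' * count}")
```

The single value 9 ends up alone in the last bin, which is the kind of long tail where a log scale or more bins starts to help.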

Real world example

We will work with a dataset of Indian Census data downloaded from here.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
df = populationdf[mask1 & mask2]

plt.hist(df['TOT_P'], label='Distribution')

plt.xlabel('Total Population')
plt.ylabel('State Count')
plt.yticks(np.arange(0,20,2))

plt.grid(True)
plt.legend()

With the above, a couple of quick assessments about the population of states in India:
Q: What’s the general population distribution of states in India?
A: More than 50% of states have a population of less than 2 crores (20 million)

Q: How many states have a population of more than 10 crores (100 million)?
A: Only 3 states have that high a population.

Bar Chart | `ax.bar(x_pos, heights)`

It helps in comparing two or more variables by displaying values associated with categorical data.

It is the plot most commonly used in the media to share survey data, displaying every data sample.

import numpy as np
import matplotlib.pyplot as plt

data = [[60, 45, 65, 35],
        [35, 25, 55, 40]]

x_pos = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.set_xticks(x_pos)

ax.bar(x_pos - 0.1, data[0], color='b', width=0.2)
ax.bar(x_pos + 0.1, data[1], color='g', width=0.2)

ax.yaxis.grid(True)

Real world example

We will work with a dataset of Indian Census data downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='STATE'
mask2 = populationdf['TRU']=='Total'
statesdf = populationdf[mask1 & mask2]
statesdf = statesdf.sort_values('TOT_P')

plt.figure(figsize=(10,8))
plt.barh(range(len(statesdf)), 
    statesdf['TOT_P'], tick_label=statesdf['Name'])
plt.grid(True)
plt.title('Total Population')
plt.show()

With the above, a couple of quick assessments about the population of states in India:
– Uttar Pradesh has the highest total population and Lakshadweep has the lowest
– Relative population across states is visible, with Uttar Pradesh having almost double the population of the second most populated state

Pie Chart | `ax.pie(sizes, labels=[labels])`

It helps in showing the percentage (or proportional) distribution of categories at a certain point in time. Usually, it works well if it’s limited to a single-digit number of categories.

A circular statistical graphic where the arc length of each slice is proportional to the quantity it represents.

import numpy as np
import matplotlib.pyplot as plt

# Slices will be ordered and plotted counter-clockwise
labels = ['Audi','BMW','LandRover','Tesla','Ferrari']
sizes = [90, 70, 35, 20, 25]

fig, ax = plt.subplots()
ax.pie(sizes,labels=labels, autopct='%1.1f%%')
ax.set_title('Car Sales')
plt.show()

Real world example

We will work with a dataset of alcohol consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

labels = ['Beer', 'Spirit', 'Wine']
sizes = [drinksdf['beer'].sum(), 
         drinksdf['spirit'].sum(), 
         drinksdf['wine'].sum()]

fig, ax = plt.subplots()
explode = [0.05,0.05,0.2]
ax.pie(sizes,explode=explode,
    labels=labels, autopct='%1.1f%%')

ax.set_title('Alcohol Consumption')
plt.show()

With the above, we can quickly assess how overall alcohol consumption is split across beer, spirit and wine. This view helps when we have a small number of slices (categories).

Scatter plot | `ax.scatter(x_points, y_points)`

It helps in representing paired numerical data, either to compare how one variable is affected by another or to see how the values of multiple dependent variables are spread for each value of the independent variable.

Sometimes the data points in a scatter plot form distinct groups, which are called clusters.

import numpy as np
import matplotlib.pyplot as plt

# random but focused cluster data
x1 = np.random.randn(100) + 8
y1 = np.random.randn(100) + 8
x2 = np.random.randn(100) + 3
y2 = np.random.randn(100) + 3

x = np.append(x1,x2)
y = np.append(y1,y2)

plt.scatter(x,y, label="xy distribution")
plt.legend()

Real world example

We will work with a dataset of alcohol consumption downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# parentheses are needed so that the sum continues across lines
drinksdf['total'] = (drinksdf['beer']
                     + drinksdf['spirit']
                     + drinksdf['wine']
                     + drinksdf['alcohol'])

# drinksdf.corr() tells beer and alcohol
# are highly correlated
fig = plt.figure()

# Compare beer and alcohol consumption
# Use color to show a third variable.
# Can also use size (s) to show a third variable.
scat = plt.scatter(drinksdf['beer'], 
                   drinksdf['alcohol'], 
                   c=drinksdf['total'], 
                   cmap=plt.cm.rainbow)

# colorbar to explain the color scheme
fig.colorbar(scat, label='Total drinks')

plt.xlabel('Beer')
plt.ylabel('Alcohol')
plt.title('Comparing beer and alcohol consumption')
plt.grid(True)
plt.show()

With the above, we can quickly assess that beer and alcohol consumption have a strong positive correlation, which suggests a large overlap between beer drinking and overall alcohol consumption.
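What a scatter plot suggests visually can be put into a number with the Pearson correlation coefficient. The sketch below uses synthetic data rather than the drinks dataset, and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic stand-ins for two columns (not the drinks data)
x = rng.normal(loc=100, scale=30, size=500)
y_related = 0.05 * x + rng.normal(scale=0.5, size=500)   # depends on x
y_unrelated = rng.normal(scale=0.5, size=500)            # independent of x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_related = np.corrcoef(x, y_related)[0, 1]
r_unrelated = np.corrcoef(x, y_unrelated)[0, 1]

print(f"related:   r = {r_related:.2f}")
print(f"unrelated: r = {r_unrelated:.2f}")
```

A value of r close to 1 matches a tight upward-sloping cloud of points, while a value near 0 matches a shapeless one.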

2. We will work with the Mall Customers dataset downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt

malldf = pd.read_csv('data-files/mall-customers.csv',
                skiprows=1, 
                names = ['customerid', 'genre', 
                         'age', 'annualincome', 
                         'spendingscore'])

plt.scatter(malldf['annualincome'], 
            malldf['spendingscore'], 
            marker='p', s=40, 
            facecolor='r', edgecolor='b', 
            linewidth=2, alpha=0.4)

plt.xlabel("Annual Income")
plt.ylabel("Spending Score (1-100)")
plt.grid(True)

With the above, we can quickly assess that there are five clusters, and thus five segments or types of customers one can plan for.

Box Plot | `ax.boxplot([data list])`

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. It shows only summary statistics such as the median, the quartiles (interquartile range) and the whiskers; the mean is not drawn unless requested with `showmeans=True`.

It makes it easy to identify if data is symmetrical, how tightly it is grouped, and if and how data is skewed.

import numpy as np
import matplotlib.pyplot as plt

# some random data
data1 = np.random.normal(0, 2, 100)
data2 = np.random.normal(0, 4, 100)
data3 = np.random.normal(0, 3, 100)
data4 = np.random.normal(0, 5, 100)
data = [data1, data2, data3, data4]

fig, ax = plt.subplots()
bx = ax.boxplot(data, patch_artist=True)

ax.set_title('Box Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

colors = ['pink','lightblue','lightgreen','yellow']
for patch, color in zip(bx['boxes'], colors):
    patch.set_facecolor(color)

ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()
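The summary statistics that a box plot draws can also be computed directly, which helps when reading the plot. This is a small sketch on a made-up sample; it assumes Matplotlib's default whisker setting of 1.5 times the interquartile range.

```python
import numpy as np

# Small made-up sample with one obvious outlier
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# With the default setting, whiskers reach the most extreme points
# within 1.5 * IQR of the box; anything beyond is drawn as a flier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```

Here the box spans from q1 to q3, the line inside it is the median, and the value 30 falls outside the upper fence, so it would appear as a separate point.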

Real world example

We will work with the Tips dataset downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
sns.boxplot(x="time", y="tip", 
            hue='sex', data=tipsdf, 
            order=["Dinner", "Lunch"],
            palette='coolwarm')

With the above, we can make a couple of quick assessments:
– males tip more than females
– tips at dinner time vary a lot more, particularly those given by males

Violin Plot | `ax.violinplot([data list])`

A statistical plot that helps in comparing distributions of variables because the center, spread and range are immediately visible. Unlike a box plot, it shows the full distribution of the data.

It is a quick way to compare distributions across multiple variables.

import numpy as np
import matplotlib.pyplot as plt

data = [np.random.normal(0, std, size=100) 
        for std in range(2, 6)]

fig, ax = plt.subplots()
bx = ax.violinplot(data)

ax.set_title('Violin Plot Sample')
ax.set_ylabel('Spread')
xticklabels=['category A', 
             'category B', 
             'category C', 
             'category D']

ax.set_xticks([1,2,3,4])
ax.set_xticklabels(xticklabels)

ax.yaxis.grid(True)
plt.show()

Real world example

We will work with the Tips dataset downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

tipsdf = pd.read_csv('data-files/tips.csv') 
# split= is intended for use together with hue=, so it is left out here
sns.violinplot(x="day", y="tip", data=tipsdf)

With the above, we can quickly assess that tips on Saturday have a wider distribution, whereas tips on Friday have a much narrower one in comparison.

2. We will work with the Indian Census dataset downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

populationdf = pd.read_csv(
    "./data-files/census-population.csv")

mask1 = populationdf['Level']=='DISTRICT'
mask2 = populationdf['TRU']!='Total'
statesdf = populationdf[mask1 & mask2]

maskUP = statesdf['State']==9
maskM = statesdf['State']==27
data = statesdf.loc[maskUP | maskM]

sns.violinplot(x='State', y='P_06',
               inner='quartile', hue='TRU',
               palette={'Rural':'green','Urban':'blue'},
               scale='count', split=True,
               data=data)

plt.title('In districts of UP and Maharashtra')
plt.show()

With the above, we can make a couple of quick assessments:
– Uttar Pradesh has a high volume and a wide distribution of rural child population.
– Maharashtra has an almost equal spread of rural and urban child population.

Heatmap

It helps in representing data in a 2-D matrix form, using variation of color for different values. The variation of color may be in hue or intensity.

It is generally used to visualize a correlation matrix, which in turn helps in feature (variable) selection.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create 2D array
array_2d = np.random.rand(4, 6)
sns.heatmap(array_2d, annot=True)

Real world example

We will work with the Alcohol Consumption dataset downloaded from here.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv('data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

# numeric_only=True skips the text columns (required on pandas 2.0+)
sns.heatmap(drinksdf.corr(numeric_only=True), annot=True, cmap='YlGnBu')

With the above, we can make a couple of quick assessments:
– there is a strong correlation between beer and alcohol, and thus a strong overlap there.
– wine and spirit are almost uncorrelated, so it would be rare to find a place where wine and spirit consumption are equally high. One would be preferred over the other.

Notice that the upper and lower halves along the diagonal are the same: the correlation of A with B is the same as that of B with A. Further, the correlation of A with A is always 1. In such a case, we can make a small tweak to make the plot more presentable and avoid any confusion.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

drinksdf = pd.read_csv(
    'data-files/drinks.csv', 
    skiprows=1, 
    names = ['country', 'beer', 'spirit', 
             'wine', 'alcohol', 'continent']) 

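# correlation and masks
# (numeric_only=True skips the text columns; required on pandas 2.0+)
drinks_cr = drinksdf.corr(numeric_only=True)
drinks_mask = np.triu(drinks_cr)

# drop the first row and the last column, which would be fully masked
drinks_cr = drinks_cr.iloc[1:,:-1]
drinks_mask = drinks_mask[1:, :-1]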

sns.heatmap(drinks_cr, 
        mask=drinks_mask,
        annot=True,
        cmap='coolwarm')

It is the same correlation data, but only the needed part is shown.
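To see why hiding one half loses nothing, the symmetry of a correlation matrix and the shape of an upper-triangle mask can be checked on a small example. This sketch uses a synthetic table instead of the drinks data, and it builds an explicit boolean mask, which is a slight variation on the `np.triu(drinks_cr)` call used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic numeric table (not the drinks data)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr = df.corr()

# A correlation matrix is symmetric with ones on the diagonal
is_symmetric = np.allclose(corr, corr.T)
diagonal_is_one = np.allclose(np.diag(corr), 1.0)

# Boolean mask: True marks the cells the heatmap should hide
# (the diagonal and everything above it)
mask = np.triu(np.ones(corr.shape, dtype=bool))

print(is_symmetric, diagonal_is_one)
print(mask)
```

Passing such a mask as `mask=` to `sns.heatmap` leaves only the cells below the diagonal visible.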

Data Image

It helps in displaying data as an image, i.e. on a 2D regular raster.

Images are internally just arrays. Any 2D numpy array can be displayed as an image.

import numpy as np
import matplotlib.pyplot as plt

M,N = 25,30
data = np.random.random((M,N)) 
plt.imshow(data)

Real world example

Let’s read an image and then display it back to see how it looks.

import cv2
import matplotlib.pyplot as plt

img = cv2.imread('data-files/babygroot.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# print(img.shape)
# output => (500, 359, 3)

plt.imshow(img)

It read the image as an array and then drew it as a plot, which turned out to be the same as the image. Since images are like any other plots, we can plot other objects (like annotations) on top of them.
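Drawing on top of an image can be sketched without any image file by building a small array by hand. The array contents, label text and coordinates below are made up for illustration, and the `Agg` backend line is only an assumption so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic 40x60 RGB image: a dark canvas with one bright square
img = np.zeros((40, 60, 3), dtype=np.uint8)
img[10:20, 30:40] = [255, 200, 0]

fig, ax = plt.subplots()
ax.imshow(img)

# Image axes use pixel coordinates: x is the column, y is the row
ax.annotate('bright patch',
            xy=(35, 15), xytext=(5, 35),
            color='white',
            arrowprops=dict(facecolor='white', shrink=0.05))
```

The same `ax.annotate` call works on an image loaded with `cv2.imread`, since that is also just an array.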

SubPlots | `fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)`

Generally, it is used to compare multiple variables (in pairs) against each other. With multiple plots placed next to each other in the same figure, it helps in quickly assessing the correlation and distribution for a pair.

Parameters of `plt.subplots` are: number of rows, number of columns. The related `plt.subplot` additionally takes the index of the subplot
(indexes are counted row-wise, starting with 1).

The widths of the different subplots may differ with the use of GridSpec.

import numpy as np
import matplotlib.pyplot as plt
import math

# data setup
x = np.arange(1, 100, 5)
y1 = x**2
y2 = 2*x+4
y3 = [ math.sqrt(i) for i in x]  
y4 = [ math.log(j) for j in x] 

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

ax1.plot(x, y1)
ax1.set_title('f(x) = quadratic')
ax1.grid()

ax2.plot(x, y2)
ax2.set_title('f(x) = linear')
ax2.grid()

ax3.plot(x, y3)
ax3.set_title('f(x) = square root')
ax3.grid()

ax4.plot(x, y4)
ax4.set_title('f(x) = log')
ax4.grid()

fig.tight_layout()
plt.show()

We can stack up an m x n view of the variables and have a quick look at how they are related. With the above, we can quickly assess that the parameters in the second graph are linearly related.
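The remark about GridSpec and unequal widths can be sketched with the `gridspec_kw` argument of `plt.subplots`. The 3:1 ratio and the axes names are arbitrary choices for illustration, and the `Agg` backend line is only an assumption so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 100, 5)

# gridspec_kw is passed through to GridSpec; width_ratios makes the
# left column three times as wide as the right one
fig, (ax_wide, ax_narrow) = plt.subplots(
    1, 2, gridspec_kw={'width_ratios': [3, 1]})

ax_wide.plot(x, x**2)
ax_wide.set_title('wide')
ax_narrow.plot(x, 2*x + 4)
ax_narrow.set_title('narrow')

fig.tight_layout()
```

`height_ratios` works the same way for rows.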

Data Representation

Plot Anatomy

The picture below will help with plot terminology and representation:

The Figure shown above is the base space where the entire plot happens. Most of the parameters can be customized for better representation. For specific details, look here.

Plot annotations

It helps in highlighting a few key findings or indicators on a plot. For advanced annotations, look here.

import numpy as np
import matplotlib.pyplot as plt

# A simple parabolic data
x = np.arange(-4, 4, 0.02)
y = x**2

# Setup plot with data
fig, ax = plt.subplots()
ax.plot(x, y)

# Setup axes
ax.set_xlim(-4,4)
ax.set_ylim(-1,8)

# Visual titles
ax.set_title('Annotation Sample')
ax.set_xlabel('X-values')
ax.set_ylabel('Parabolic values')

# Annotation
# 1. Highlighting specific data on the x,y data
ax.annotate('local minima of \n the parabola',
            xy=(0, 0),
            xycoords='data',
            xytext=(2, 3),
            arrowprops=
                dict(facecolor='red', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='top')

# 2. Highlighting specific data on the x/y axis
bbox_yproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 70% of y-plot range',
            xy=(0, 0.7),
            xycoords='axes fraction',
            xytext=(0.2, 0.7),
            bbox=bbox_yproperties,
            arrowprops=
                dict(facecolor='green', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

bbox_xproperties = dict(
    boxstyle="round,pad=0.4", fc="w", ec="k", lw=2)
ax.annotate('Covers 30% of x-plot range',
            xy=(0.3, 0),
            xycoords='axes fraction',
            xytext=(0.1, 0.4),
            bbox=bbox_xproperties,
            arrowprops=
                dict(facecolor='blue', shrink=0.04),
                horizontalalignment='left',
                verticalalignment='center')

plt.show()

Plot style | `plt.style.use('style')`

It helps in customizing the representation of a plot, like colors, fonts, line thickness, etc. Default styles get applied if no customization is defined. Apart from ad hoc customization, we can also choose one of the already defined template styles and apply it.

import matplotlib.pyplot as plt

# List all styles that ship with the installed Matplotlib
for style in plt.style.available:
    print(style)

Pre-defined styles available for use (the exact list depends on the Matplotlib version):
Solarize_Light2, _classic_test_patch, bmh, classic, dark_background, fast, fivethirtyeight, ggplot, grayscale, seaborn, seaborn-bright, seaborn-colorblind, seaborn-dark, seaborn-dark-palette, seaborn-darkgrid, seaborn-deep, seaborn-muted, seaborn-notebook, seaborn-paper, seaborn-pastel, seaborn-poster, seaborn-talk, seaborn-ticks, seaborn-white, seaborn-whitegrid, tableau-colorblind10

More details around customization are here.

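import numpy as np
import matplotlib.pyplot as plt

# To use a defined style for all following plots
# (on Matplotlib 3.6 and later this style is named 'seaborn-v0_8')
plt.style.use('seaborn')

# OR apply a style only inside a with-block
with plt.style.context('Solarize_Light2'):
    plt.plot(np.sin(np.linspace(0, 2 * np.pi)), 'r-o')
plt.show()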

Saving plots | `plt.savefig()`

It helps in saving the figure with the plot as an image file with the defined parameters. Parameter details are here. A relative file name is saved to the current working directory.

plt.savefig('plot.png', dpi=300, bbox_inches='tight')
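A slightly fuller sketch of saving through the Figure object is given below. The temporary directory and the file name are arbitrary choices for illustration, and the `Agg` backend line is only an assumption so the sketch also runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

# Save through the Figure object; the file extension selects the format
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'plot.png')
fig.savefig(out_path, dpi=300, bbox_inches='tight')

saved_ok = os.path.exists(out_path)
print(out_path, saved_ok)
```

In a script, it is usually safer to call `savefig` before `plt.show()`, because the figure may already be closed once the window is dismissed.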

Additional Usages of plots

Data Imputation

It helps in filling missing data with reasonable values, as many statistical or machine learning packages do not work with data containing null values.

Data interpolation can be set to use pre-defined methods such as linear, quadratic or cubic.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(20,1))
# keep values below 0.5; the rest become NaN (missing)
df = df.where(df < 0.5)

fig, (ax1, ax2) = plt.subplots(1, 2)

ax1.plot(df)
ax1.set_title('f(x) = data missing')
ax1.grid()

ax2.plot(df.interpolate())
ax2.set_title('f(x) = data interpolated')
ax2.grid()

fig.tight_layout()
plt.show()

With the above, we see the missing data replaced by interpolated values (linear by default), which the dataframe computes from the valid previous and next data points.
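What the interpolation does is easier to see on a tiny series with known values. The numbers below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up series with two gaps in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

# The default method is 'linear': gaps are filled on a straight line
# between the nearest valid neighbours
filled = s.interpolate()
print(filled.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Other methods such as `method='quadratic'` or `method='cubic'` can be passed to `interpolate()`, but they need SciPy to be installed.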

Animation

At times, it helps in presenting the data as an animation. On a high level, it would need data to be plugged in a loop with delta changes translating into a moving view.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import animation

fig = plt.figure()

def f(x, y):
    return np.sin(x) + np.cos(y)

x = np.linspace(0, 2 * np.pi, 80)
y = np.linspace(0, 2 * np.pi, 70).reshape(-1, 1)

im = plt.imshow(f(x, y), animated=True)


def updatefig(*args):
    global x, y
    x += np.pi / 5.
    y += np.pi / 10.
    im.set_array(f(x, y))
    return im,

ani = animation.FuncAnimation(
    fig, updatefig, interval=100, blit=True)
plt.show()

3-D Plotting

If needed, we can also have an interactive 3-D plot though it might be slow with large datasets.

import numpy as np
import matplotlib.pyplot as plt

def randrange(n, vmin, vmax):
     return (vmax-vmin)*np.random.rand(n) + vmin

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
n = 200
for c, m, zl in [('g', 'o', +1), ('r', '^', -1)]:
    xs = randrange(n, 0, 50)
    ys = randrange(n, 0, 100)
    zs = xs+zl*ys  
    ax.scatter(xs, ys, zs, c=c, marker=m)

ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_zlabel('Z data')
plt.show()

Cheat Sheet

A page representation of the key features for quick lookup or revision:

Download the PDF version of cheatsheet from here.
For the overall reference and more details, look at: https://matplotlib.org/

The entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look at or play around with: https://github.com/sandeep-mewara/data-visualization

Keep learning!


pandas – get started with examples

August 30, 2020 | September 25, 2020 | Sandeep Mewara

This is to get started with pandas and try a few concrete examples. pandas is a Python-based library that helps in reading, transforming, cleaning and analyzing data. It is built on the NumPy package.

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
https://pandas.pydata.org

The key data structure in pandas is called DataFrame – it helps to work with tabular data represented as rows of observations and columns of features.

Download or fork the entire Jupyter notebook from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

pandas basics include:

Series
Dataframes
- Create
  - from list of tuples
  - from a dictionary
  - from a CSV
  - from built-in dataset (eg: from sklearn.datasets)
- Data retrieval
- Modifying data
- Group by operation
- Custom Functions – apply method
- Pre-Processing
  - drop, mean, mode
  - ordinal feature
  - nominal feature
- Reshaping
  - CrossTab
  - Merge
  - Melt
  - Pivot

Key learnings:
- .info(), .head() and .sample() are handy methods to use first with a dataframe to get high-level details
- an index may not be unique, so a lookup can return multiple values
- boolean indexing (masking) helps select a certain set of rows
- .isin() is useful when building a boolean index
- .where() is useful to retain the shape of the original table
- column names and indexes can be set if needed
- to modify the table right away, use inplace=True
- aggregate operations can be applied on a groupby object
- dropna(), mean() or mode() are handy for pre-processing missing data
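A few of these learnings can be tried on a tiny table. The data and column names below are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    'city':  ['Pune', 'Delhi', 'Pune', 'Mumbai', 'Delhi'],
    'sales': [10, 25, 30, 5, 15],
})

# Boolean indexing (masking) keeps only the matching rows
big = df[df['sales'] > 12]

# .isin() builds a boolean index from a list of allowed values
metros = df[df['city'].isin(['Delhi', 'Mumbai'])]

# .where() keeps the original shape; non-matching cells become NaN
same_shape = df['sales'].where(df['sales'] > 12)

# Aggregates can be applied on a groupby object
totals = df.groupby('city')['sales'].sum()
print(totals)
```

Note the difference between the first and third result: masking drops the non-matching rows, while `.where()` keeps all five positions and blanks out two of them.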

Examples notebook includes:

Uber taxi drivers
Apple stock price
Day or Night
Students marks
Balance Calculator

Key learnings:
- .describe() is a handy method to get the statistical summary of numerical columns
- one-hot encoding is really helpful for nominal features (that cannot be ordered)
- converting the columns into the right datatype helps
- converting data into meaningful numbers helps analysis
- groupby is a powerful tool for analysis with dataframes
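The datatype and encoding points can be sketched on a small made-up table. The column names and the size ordering are assumptions for illustration only.

```python
import pandas as pd

# Made-up data: 'size' is ordinal (has an order), 'color' is nominal
df = pd.DataFrame({
    'size':  ['S', 'M', 'L', 'M'],
    'color': ['red', 'blue', 'red', 'green'],
    'price': ['10', '12', '15', '12'],   # numbers stored as text
})

# Convert the column to the right datatype before analysis
df['price'] = df['price'].astype(int)

# Ordinal feature: map categories to ordered numbers
df['size_code'] = df['size'].map({'S': 0, 'M': 1, 'L': 2})

# Nominal feature: one-hot encode, one column per category
encoded = pd.get_dummies(df, columns=['color'])

print(encoded.columns.tolist())
print(df['price'].describe())
```

One-hot encoding avoids implying an order between colors, which a plain number mapping would do.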

Cheat sheet

Download cheat sheet pdf from here
For more details about pandas, look at the documentation reference.

Keep learning!

Python as statistics workbench

August 22, 2020 | August 30, 2020 | Sandeep Mewara

While reading about AI/ML (Artificial Intelligence/Machine Learning), I came across a discussion on whether Python can be used as a “statistics workbench” to replace R, SPSS, etc. It was a nice exchange among multiple knowledgeable folks about the languages used for statistics problems, specifically R (read about R here).

Discussion here: https://stats.stackexchange.com/questions/1595/python-as-a-statistics-workbench

For quick reference, I will quote a few of the latest thoughts from there that are in favor of Python and how it has evolved. I too concur with most of them:

1. Python is easily the most intuitive syntax of any programming language. This makes for extremely fast development time.
2. Python is performant. It opens large datasets reliably.
3. The packages in Python are fast catching up to R’s packages. Python usage has increased tremendously last few years.
4. Readability is one of the most important qualities good code can possess, and Python is one of the most readable language.
5. Python has an extremely well-thought-out IDE now: PyCharm & Visual Studio Code.
https://stats.stackexchange.com/a/457753

Overall, Python is a general purpose language with an easy to understand syntax which would be relatively easier for usual programmers to learn/adopt. R is developed keeping statisticians in mind. Thus it has many features around data visualization and is a tad ahead currently.

A little research …

Recently, DataCamp too published an article comparing R and Python for data analysis. There is a nice comparison in it on various parameters; picking just a couple of them here:

The final analysis in the article shows R being ahead for data analysis, but Python having the potential to catch up quickly and easily.

My thoughts …

My intent was to understand which programming language serves as an essential tool to demonstrate AI/ML capabilities. Looking at them, Python seems good enough for me to serve as an AI/ML tool to start with and probably conquer it.

Ammunition needed …

There are many Python-based libraries and packages that are generally used for statistical work. Below are a few of them that would help in our data analysis exploration going ahead:

scipy – python-based ecosystem of open-source software for mathematics, science, and engineering.
- cookbook – many statistical facilities, a collection of various user-contributed recipes already available
- numpy – base N-dimensional array package. A handful of examples are listed here
- pandas – a fast, powerful, flexible and easy to use data analysis and manipulation tool
- matplotlib – a comprehensive library for creating static, animated, and interactive visualizations
scikit-learn – simple and efficient machine learning tools for predictive data analysis
keras – API for deep learning
tensorflow – API to develop and train ML models

Since I am a programmer, I may be biased here. But it seems Python can do all that is needed to start the AI/ML journey.

Happy learning!

NumPy – Basics & Examples

August 15, 2020 | September 27, 2020 | Sandeep Mewara

This is to get started with NumPy and try a few concrete examples. NumPy (Numerical Python) is a package for numerical computation designed for efficient work on large data sets.

The entire Jupyter notebook can be downloaded or forked from my GitHub to play around: https://github.com/sandeep-mewara/python-examples

Reference: https://numpy.org/learn/

NumPy basics include:

Initialize Matrix via
- List
- NULL Matrix
- IDENTITY Matrix
- ONES Matrix
Matrix Transpose
Matrix Indexing
Simulation
Basic CSV file operations
Matrix Broadcasting
Basic Image Processing

Key learnings:
- a matrix in Python can be written as a list of lists
- arrays are compatible for broadcasting when their trailing dimensions match or either of them is of length 1
- when an image is read as numbers, the values are between 0 & 1
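The broadcasting rule can be checked with three small arrays. The values below are made up for illustration.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,): trailing dims match
col = np.array([[100], [200]])        # shape (2, 1): length-1 axis stretches

print(matrix + row)
print(matrix + col)

# Shapes (2, 3) and (2,) are not compatible: trailing dims 3 and 2 differ
try:
    matrix + np.array([1, 2])
except ValueError as err:
    print('broadcast failed:', err)
```

In the first sum the row is added to every row of the matrix, and in the second the column is added to every column.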

Examples notebook includes:

Random walk simulation
Triangle simulation
Random Number
Correlation co-efficient
Mean/Variance of crude oil

Key learnings:
- masking returns all the values that satisfy the mask
- cumsum() is a handy function for a cumulative sum
- there are handy methods for random number generation
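These three points come together in a short random walk. This is only a sketch and not the notebook's own simulation; the seed and the number of steps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Random walk: each step is -1 or +1, the position is the running sum
steps = rng.choice([-1, 1], size=1000)
walk = np.cumsum(steps)

# Masking returns only the values that satisfy the condition
above_start = walk[walk > 0]

print(walk[:10])
print(len(above_start), 'of', len(walk), 'positions are above the start')
```

The last element of `walk` equals the sum of all steps, which is a quick sanity check on `cumsum()`.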

For learning more about NumPy, look here: https://numpy.org/doc/stable/

Keep learning!
