Quick look into the Machine Learning workflow

Before we jump into various ML concepts and algorithms, let's have a quick look at the basic workflow when we apply Machine Learning to a problem.


A short brief about Machine Learning and its association with the AI and Data Science world is here.

How does Machine Learning help?

Machine Learning is about training an algorithm to predict an output based on past data. This input data can keep changing, and the algorithm can fine-tune itself accordingly to provide better output.

It has vast applications across industries. For example, Google uses it to predict natural disasters like floods. Another common use we hear about in the news these days is in politics, where campaigns target specific voter demographics.

How does Machine Learning work?

Data is the key here. The more data there is, the better the algorithm can learn and fine-tune. For any problem's output, there would be multiple factors at play, some with more effect than others. Analyzing and applying all such findings is part of a machine learning problem. Mathematically, ML expresses a problem's output as a function of multiple input factors.

Y = f(x)

Y = predicted output
x = multiple input factors
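
As a tiny illustration of learning such a function f, here is a minimal sketch (with made-up numbers, not real data) that fits a linear model mapping two input factors to an output:

# minimal sketch: learn Y = f(x) from made-up past data
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]])  # two input factors per sample
Y = np.array([5, 4, 11, 10])                    # past outputs

model = LinearRegression().fit(X, Y)  # learn f from the past data
print(model.predict([[5, 5]]))        # predict output for an unseen input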

How does a typical ML workflow look?

There is a structured way to apply ML to a problem. I tried to put the workflow in a pictorial view to make it easy to visualize and understand:

[Image: ML workflow]

It goes in a cycle: once we have some insights from the published model, they go back into the funnel as learning to make the output better.

[Image: Data Science process]

Roughly, data scientists spend 60% of their time on cleaning and organizing data.

Walk through an example?

Let's use the Titanic survivors dataset found here and run through the basic workflow to see how features like traveling class, sex, age, and fare help us assess survival probability.

Load data from file

Data could be in various formats. The easiest is to have it in a CSV and then load it using pandas. More details around how to play with data are discussed here.

# let's load and see the info of the dataset
import pandas as pd

titanicdf = pd.read_csv("data-files/titanic.csv")
print(titanicdf.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB
None
# A quick sample view
titanicdf.head(3)
   pclass  survived                            name     sex      age  sibsp  parch  ticket      fare    cabin embarked boat  body                        home.dest
0       1         1   Allen, Miss. Elisabeth Walton  female  29.0000      0      0   24160  211.3375       B5        S    2   NaN                     St Louis, MO
1       1         1  Allison, Master. Hudson Trevor    male   0.9167      1      2  113781  151.5500  C22 C26        S   11   NaN  Montreal, PQ / Chesterville, ON
2       1         0    Allison, Miss. Helen Loraine  female   2.0000      1      2  113781  151.5500  C22 C26        S  NaN   NaN  Montreal, PQ / Chesterville, ON
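
CSV is only one option; pandas ships loaders for other common formats as well. A short sketch (these file names are hypothetical, and read_excel additionally needs an engine such as openpyxl installed):

# each loader returns a DataFrame; file names are hypothetical
df_json = pd.read_json("data-files/titanic.json")
df_excel = pd.read_excel("data-files/titanic.xlsx")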
Data cleanup – Drop the irrelevant columns

The data captured is rarely exactly the data you need. You will usually have a superset with additional information that is not relevant to your problem statement. This extra data acts as noise, so it's better to clean the dataset before running any ML algorithm on it.

# there seem to be a handful of columns that
# we don't need for our analysis;
# let's drop all the columns that we know
# are irrelevant
titanicdf.drop(['embarked', 'body', 'boat', 'name',
                'cabin', 'home.dest', 'ticket',
                'sibsp', 'parch'],
               axis='columns', inplace=True)

titanicdf.head(2)
   pclass  survived     sex      age      fare
0       1         1  female  29.0000  211.3375
1       1         1    male   0.9167  151.5500
Data analysis

There could be various ways to analyze data. Based on the problem statement, we need to know the general trend of the data in question; knowledge of statistics and probability distributions helps here. To gain insight into correlations and patterns, data visualization helps.

# let's see if there is any highly correlated data;
# if we observe something, we will remove one of
# the two features to avoid bias

import seaborn as sns
sns.pairplot(titanicdf)

# looking at the graphs, it seems we don't have
# anything to remove right away
[Image: pairplot of the Titanic dataset features]
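
To complement the visual check, a numeric correlation matrix can flag highly correlated pairs: any off-diagonal value close to +1 or -1 would make one of the two features a candidate to drop. A minimal sketch (sex is still a string at this point, so we keep to the numeric columns; the numeric_only flag assumes a reasonably recent pandas):

# pairwise correlation of the numeric columns
print(titanicdf.corr(numeric_only=True))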
Data transform – Ordinal/Nominal/Datatype, etc

Data is easier to interpret once it is converted into numbers (from strings) where possible. This makes it possible to feed the data into various statistical formulas for more insight. More details on how to apply numerical transformations to data are discussed here.

# There seem to be 3 classes of people;
# let's represent class as numbers.
# We don't have info on the relative relation of
# the classes, so we will one-hot encode them
titanicdf = pd.get_dummies(titanicdf, columns=['pclass'])

# Let's convert sex to a number
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
titanicdf["sex"] = le.fit_transform(titanicdf.sex)

titanicdf.head(2)
   survived  sex      age      fare  pclass_1  pclass_2  pclass_3
0         1    0  29.0000  211.3375         1         0         0
1         1    1   0.9167  151.5500         1         0         0
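
On the ordinal side of this step (which the heading mentions): when a category has a natural order, a plain mapping preserves that order instead of one-hot encoding. A hypothetical sketch, since this dataset has no such column:

# hypothetical ordered category: small < medium < large
sizes = pd.DataFrame({"size": ["small", "large", "medium"]})
sizes["size"] = sizes["size"].map({"small": 0, "medium": 1, "large": 2})
print(sizes)  # numeric codes preserve the category order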
Data imputation: Fill the missing values

There is always some missing data or an outlier. Running algorithms with missing data could lead to inconsistent results or algorithm failure. Based on the context, we can choose to remove such records or fill/replace the gaps with an appropriate value.

# When we saw the info, lots of age values were missing.
# Fill the missing age values with the mean age.
titanicdf.loc[titanicdf["age"].isnull(), "age"] = titanicdf["age"].mean()

# titanicdf.loc[titanicdf["fare"].isnull(), "fare"] = titanicdf["fare"].mean()
# => could do the same for fare, but we will use another way below

titanicdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  1309 non-null   int64  
 1   sex       1309 non-null   int64  
 2   age       1309 non-null   float64
 3   fare      1308 non-null   float64
 4   pclass_1  1309 non-null   uint8  
 5   pclass_2  1309 non-null   uint8  
 6   pclass_3  1309 non-null   uint8  
dtypes: float64(2), int64(2), uint8(3)
memory usage: 44.9 KB
# When we saw the info,
# 1 fare was missing
# Let's drop that one record
titanicdf.dropna(inplace=True)
titanicdf.info()

#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 1308 entries, 0 to 1308
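
As an aside, the pandas-based fill above is one approach; scikit-learn's SimpleImputer does the same job and can be reused on new data later. A minimal sketch of the same mean strategy (an alternative to the steps above, not an extra step):

# alternative sketch: mean imputation with scikit-learn
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
titanicdf[["age", "fare"]] = imputer.fit_transform(titanicdf[["age", "fare"]])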
Normalize training data

At times, the various data in context are on different scales. If the data is not normalized, the algorithm can become biased towards the data with higher magnitude. E.g., feature A ranges 0-10 and feature B ranges 0-10000. In that case, even a small change in A can make a difference, but without normalization, feature B will influence the results more (which may not reflect reality).

X = titanicdf
y = X['survived']
X = X.drop(['survived'], axis=1)

# Scale each column to have 0 mean and 1 std.dev.
# preprocessing.scale returns a NumPy array, so wrap it
# back into a DataFrame to keep the column names
from sklearn import preprocessing
X_scaled = pd.DataFrame(preprocessing.scale(X),
                        columns=X.columns, index=X.index)
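
To see what scaling does to the toy example above (feature A in 0-10, feature B in 0-10000), a minimal sketch with made-up numbers:

import numpy as np

# feature A in column 0, feature B in column 1
AB = np.array([[1.0, 1000.0],
               [5.0, 5000.0],
               [9.0, 9000.0]])
print(preprocessing.scale(AB))
# both columns now have mean 0 and std 1,
# so neither dominates purely by magnitude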
Split data – Train/Test dataset

It's always best to split the dataset into two unequal parts: a bigger one to train the algorithm and a smaller one to test the trained algorithm. This way, the algorithm is not biased to just the training data, and results on the test data give a better picture.

from sklearn.model_selection import train_test_split

# by default, 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 545 to 864
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sex       981 non-null    int64  
 1   age       981 non-null    float64
 2   fare      981 non-null    float64
 3   pclass_1  981 non-null    uint8  
 4   pclass_2  981 non-null    uint8  
 5   pclass_3  981 non-null    uint8  
dtypes: float64(2), int64(1), uint8(3)
memory usage: 33.5 KB
Run an ML algorithm

Once our training dataset is ready, we can apply machine learning algorithms and find which model fits best.

# for now, picking one of the classifiers - k-nearest neighbors;
# ignore the details or syntax for now
from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier(n_neighbors=5)
knc.fit(X_train, y_train)
Check the accuracy of the model

To validate the model, we use the test dataset: comparing the model's predicted values against the actual data tells us the model's accuracy.

import sklearn.metrics as met

pred_knc = knc.predict(X_test)
print("Nearest neighbors: %.3f"
      % (met.accuracy_score(y_test, pred_knc)))
Nearest neighbors: 0.817
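
To "find which model fits best", as the workflow suggests, we could try a couple of other scikit-learn classifiers on the same split. A quick sketch for comparison (the model choices here are arbitrary):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

for name, clf in [("Logistic regression", LogisticRegression(max_iter=1000)),
                  ("Decision tree", DecisionTreeClassifier(random_state=0))]:
    clf.fit(X_train, y_train)
    print("%s: %.3f" % (name, met.accuracy_score(y_test, clf.predict(X_test))))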


Voila! With the basic workflow, we have a model that can predict survival with more than 80% accuracy.

Download

The entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look at or play around with: https://github.com/sandeep-mewara/machine-learning

Currently, it covers examples on following datasets:

  • Titanic Survivors
  • Sci-kit Iris
  • Sci-kit Digits
  • Bread Basket Bakery

Over time, I will continue building on the same repository, adding more samples with different algorithms.

Closure

I believe it's now pretty clear how we can attack a problem for insight using Machine Learning. Try it out and see for yourself.

We can apply Machine Learning to many problems across many fields. I have shared a pictorial view of the sectors already leveraging its benefits in the AI section.


Keep learning!
