Before we jump into various ML concepts and algorithms, let’s take a quick look at the basic workflow we follow when applying Machine Learning to a problem.
A short brief about Machine Learning and its association with the AI and Data Science world is here.
How does Machine Learning help?
Machine Learning is about having a training algorithm that learns to predict an output from past data. As new input data keeps coming in, the algorithm can fine-tune itself to produce better output.
It has vast applications across domains. For example, Google uses it to predict natural disasters like floods. Another use we often hear about in the news is in politics, where it is used to target specific demographics of voters.
How does Machine Learning work?
Data is the key here. The more data there is, the better the algorithm can learn and fine-tune. For any problem’s output, there are multiple factors at play, and some of them have more effect than others. Analyzing and applying all such findings is part of solving a machine learning problem. Mathematically, ML expresses a problem’s output as a function of multiple input factors:
Y = f(x)
Y = predicted output
x = the multiple input factors
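To make that concrete, here is a tiny, purely illustrative sketch where the weights are made up by hand; in real ML, the training algorithm learns such weights from past data:

# purely illustrative: a hand-written f(x) with made-up weights
# a real ML algorithm would learn these weights from past data
def f(x1, x2, x3):
    return 0.4 * x1 + 1.5 * x2 - 0.2 * x3

# predicted output Y for one set of input factors
print(f(2.0, 1.0, 3.0))   # => 1.7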
How does a typical ML workflow look?
There is a structured way to apply ML to a problem. I have put the workflow in a pictorial view to make it easy to visualize and understand:
It goes in a cycle: once we have insights from the published model, they go back into the funnel as learnings to make the output better.
Roughly, data scientists spend 60% of their time on cleaning and organizing data.
Walk through an example?
Let’s use the Titanic survivors dataset found here and run through the basic workflow to see how features like travel class, sex, age and fare help us assess survival probability.
Load data from file
Data can come in various formats. The easiest is to have it in a CSV and load it using pandas. More details around how to play with data are discussed here.
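For reference, pandas also has loaders for other common formats; a quick sketch with placeholder file names (these files are not part of this example):

import pandas as pd

# placeholder file names, assuming such files exist
df_from_excel = pd.read_excel("data-files/titanic.xlsx")  # needs openpyxl installed
df_from_json = pd.read_json("data-files/titanic.json")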
# lets load and see the info of the dataset
import pandas as pd

titanicdf = pd.read_csv("data-files/titanic.csv")
print(titanicdf.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB
None
# A quick sample view
titanicdf.head(3)
|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
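Before cleaning up, a quick way to see how much data is missing per column (a small sketch on top of the same dataframe):

# count missing values per column to decide what to drop or impute later
print(titanicdf.isnull().sum().sort_values(ascending=False))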
Data cleanup – Drop the irrelevant columns
The data you capture is not always exactly the data you need. Often you have a superset with additional information that is not relevant to your problem statement. Such data acts as noise, so it’s better to clean the dataset before feeding it to any ML algorithm.
# there seem to be a handful of columns that
# we might not need for our analysis
# lets drop all the columns that we know
# are irrelevant
titanicdf.drop(['embarked','body','boat','name',
                'cabin','home.dest','ticket', 'sibsp', 'parch'],
               axis='columns', inplace=True)
titanicdf.head(2)
|   | pclass | survived | sex | age | fare |
|---|--------|----------|-----|-----|------|
| 0 | 1 | 1 | female | 29 | 211.3375 |
| 1 | 1 | 1 | male | 0.9167 | 151.55 |
Data analysis
There are various ways to analyze data. Based on the problem statement, we need to understand the general trends in the data in question. Knowledge of statistics and probability distributions helps here. For understanding correlations and patterns, data-visualization-based insights help.
# let's see if there is any highly correlated data
# if we observe something, we will remove one
# of those features to avoid bias
import seaborn as sns
sns.pairplot(titanicdf)

# looking at the graphs, it seems we don't have
# anything to remove right away
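If the pairplot is hard to read, a numeric view of the same idea is a correlation matrix; a minimal sketch (the numeric_only flag needs a reasonably recent pandas):

import seaborn as sns
import matplotlib.pyplot as plt

# correlation between numeric columns, shown as a heatmap
corr = titanicdf.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()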
Data transform – Ordinal/Nominal/Datatype, etc
Data is easier to work with once it is converted into numbers (from strings) wherever possible. That makes it possible to feed it into various statistical formulas to get more insights. More details on how to apply numerical modifications to data are discussed here.
# There seem to be 3 classes of people,
# lets represent class as numbers
# We don't have info about the relative relation of
# the classes, so we will one-hot-encode them
titanicdf = pd.get_dummies(titanicdf,
                           columns=['pclass'])

# Lets convert sex to a number
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
titanicdf["sex"] = le.fit_transform(titanicdf.sex)
titanicdf.head(2)
|   | survived | sex | age | fare | pclass_1 | pclass_2 | pclass_3 |
|---|----------|-----|-----|------|----------|----------|----------|
| 0 | 1 | 0 | 29 | 211.3375 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0.9167 | 151.55 | 1 | 0 | 0 |
Data imputation: Fill the missing values
There is almost always some missing data or an outlier. Running algorithms with missing data can lead to inconsistent results or algorithm failure. Based on the context, we can choose to remove such records or fill/replace the missing values with appropriate ones.
# When we saw the info, lots of age values were missing
# Fill the missing age values with the mean age
titanicdf.loc[titanicdf["age"].isnull(), "age"] = titanicdf["age"].mean()

# titanicdf.loc[titanicdf["fare"].isnull(), "fare"] = titanicdf["fare"].mean()
# => we could do the same for fare, but we will use another way

titanicdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 1309 non-null int64
1 sex 1309 non-null int64
2 age 1309 non-null float64
3 fare 1308 non-null float64
4 pclass_1 1309 non-null uint8
5 pclass_2 1309 non-null uint8
6 pclass_3 1309 non-null uint8
dtypes: float64(2), int64(2), uint8(3)
memory usage: 44.9 KB
# When we saw the info,
# 1 fare was missing
# Lets drop that one record
titanicdf.dropna(inplace=True)
titanicdf.info()
#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 1308 entries, 0 to 1308
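As a side note, scikit-learn has an imputer that can do this kind of filling for us (and can later be part of a pipeline); a minimal sketch of that alternative:

from sklearn.impute import SimpleImputer

# alternative to filling by hand: mean-impute the numeric columns
imputer = SimpleImputer(strategy="mean")
titanicdf[["age", "fare"]] = imputer.fit_transform(titanicdf[["age", "fare"]])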
Normalize training data
At times, the various features in context are on different scales. In such cases, if the data is not normalized, the algorithm can become biased towards the data with the higher magnitude. E.g., feature A’s range is 0-10 and feature B’s range is 0-10000. Here, even a small change in A’s magnitude could make a difference, but if the data is not normalized, feature B will influence the results more (which may not reflect reality).
X = titanicdf
y = X['survived']
X = X.drop(['survived'], axis=1)

# Scale each column to have 0 mean and 1 std dev
# (wrapped back into a DataFrame so column names and
#  info are preserved for the next steps)
from sklearn import preprocessing
X_scaled = pd.DataFrame(preprocessing.scale(X),
                        columns=X.columns, index=X.index)
Split data – Train/Test dataset
It’s always best to split the dataset into two unequal parts: a bigger one to train the algorithm and a smaller one to test the trained algorithm. This way, the algorithm is not biased to just the training data, and the results on test data give a better picture.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 545 to 864
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sex 981 non-null int64
1 age 981 non-null float64
2 fare 981 non-null float64
3 pclass_1 981 non-null uint8
4 pclass_2 981 non-null uint8
5 pclass_3 981 non-null uint8
dtypes: float64(2), int64(1), uint8(3)
memory usage: 33.5 KB
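One practical note: a common refinement is to split first and fit the scaler only on the training split, so no information from the test set leaks into the scaling; a minimal sketch of that variant:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# split first, then fit the scaler on the training data only
# (random_state just makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)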
Run an ML algorithm on the data
Once we have our training dataset ready, we can apply machine learning algorithms and find which model fits best.
# for now, picking one classifier - K-Nearest Neighbors
# ignore the details of the syntax for now
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier(n_neighbors=5)
knc.fit(X_train, y_train)
Check the accuracy of the model
To validate the model, we use the test dataset: comparing the values predicted by the model to the actual data tells us how accurate the model is.
import sklearn.metrics as met

pred_knc = knc.predict(X_test)
print("Nearest neighbors: %.3f"
      % (met.accuracy_score(y_test, pred_knc)))
Nearest neighbors: 0.817
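Accuracy is just one number; for a classification problem, the confusion matrix and classification report from the same sklearn.metrics module give a fuller picture. A small sketch building on the predictions above:

# beyond plain accuracy: the confusion matrix and per-class precision/recall
print(met.confusion_matrix(y_test, pred_knc))
print(met.classification_report(y_test, pred_knc))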
Voila! With the basic workflow, we have a model that can predict survival with more than 80% accuracy.
Download
The entire Jupyter notebook with more samples can be downloaded or forked from my GitHub to look at or play around with: https://github.com/sandeep-mewara/machine-learning
Currently, it covers examples on following datasets:
- Titanic Survivors
- Sci-kit Iris
- Sci-kit Digits
- Bread Basket Bakery
Over time, I will keep building on the same repository with more samples using different algorithms.
Closure
I believe it’s now pretty clear how we can attack a problem for insights using Machine Learning. Try it out and see for yourself.
We can apply Machine Learning to many problems in many fields. I have shared a pictorial view of the sectors already leveraging its benefits in the AI section.
Keep learning!