A probability distribution describes how likely each possible value of a random variable is. It is essential statistical knowledge for any data science aspirant.
Some consider probability distributions as fundamental to statistics as data structures are to computer science.
In layman's terms
Let’s say you pick any 100 employees of an organization and measure their heights (or weights). As you measure them, plot the results on a graph, with height on the X-axis and the frequency of each height on the Y-axis. This gives us a distribution over a range of heights.
This distribution shows which outcomes are most likely, how spread out the potential values are, and how likely different results are.
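To make the height example concrete, here is a minimal sketch that builds such a frequency distribution as a text histogram. The heights are simulated, not real measurements, and the center of 63 inches and spread of 2 inches are assumptions chosen only for illustration.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical data: 100 heights in whole inches, simulated around 63 in (5 ft 3 in)
heights = [round(random.gauss(63, 2)) for _ in range(100)]

# Count how often each height occurs: height is the X-axis, count is the Y-axis
frequency = Counter(heights)

# Print a simple text histogram, one row per observed height
for height in sorted(frequency):
    print(f"{height:>3} in | {'#' * frequency[height]}")
```

The longest rows sit near the center, which is exactly the "most likely outcomes" reading of the graph.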
Basic terminology
Random Sample
The set of 100 people selected in our example above is called a random sample.
Sample Space
The range of possible heights of the 100 people is our sample space. It’s the set of all possible values in the setup.
Random Variable
The height of a person picked from those 100 is a random variable. It’s a variable that takes different values from the sample space at random.
Mean (Expected Value)
Let’s say the heights of those 100 people average out to 5 feet 3 inches. This would be termed the expected value. It’s the average value of a random variable.
Standard deviation & Variance
Let’s say most of the people in those 100 are between 5 feet 1 inch and 5 feet 5 inches tall. This spread is what the variance measures: it’s the average squared distance of the values from the expected value. Standard deviation is the square root of the variance, so it is expressed in the same units as the data.
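As a small worked example, the sketch below computes the mean, variance, and standard deviation with Python's `statistics` module. The seven heights are made-up values, and the population formulas (`pvariance`, `pstdev`) are used here as an assumption; for a sample drawn from a larger group, `variance` and `stdev` would be the usual choice.

```python
import math
import statistics

# Hypothetical heights in inches (illustrative values only)
heights = [61, 62, 63, 63, 63, 64, 65]

mean = statistics.fmean(heights)          # average value
variance = statistics.pvariance(heights)  # average squared distance from the mean
std_dev = statistics.pstdev(heights)      # square root of the variance

print(f"mean={mean:.2f} in, variance={variance:.2f} in^2, std dev={std_dev:.2f} in")
```

Note that the variance is in squared inches, while the standard deviation is back in inches, which is why the latter is easier to interpret.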
Types of data
- Ordinal – Data with a meaningful order. Numerical data fall in this bucket, since they can be ordered by relative numerical strength.
- Nominal – Data that cannot be ordered. Categorical data such as colors (red, blue, green) fall in this bucket: there is no inherent order or sequence of high or low among them.
- Discrete – Ordered data that can take only certain values (like a soccer match score).
- Continuous – Ordered data that can take any real or fractional value (like height and weight).
In a continuous distribution, the random variable can take infinitely many possible values.
Probability Distribution Flowchart
The following diagram shows a few of the commonly used distributions:
Based on the above diagram, we will cover three distributions to get a broad understanding:
Uniform Distribution
It is the simplest form of distribution. Every outcome in the sample space is equally likely. An example is rolling a fair die, where each of the outcomes 1 to 6 has the same probability of 1/6.
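A quick simulation can illustrate this. The sketch below rolls a simulated fair die many times and checks that each face turns up about one sixth of the time; the number of rolls and the seed are arbitrary choices.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulate 60,000 rolls of a fair six-sided die
rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)

# Share of rolls that landed on each face; each should be close to 1/6
proportions = {face: counts[face] / len(rolls) for face in range(1, 7)}

for face, share in proportions.items():
    print(f"face {face}: {share:.3f}")
```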
Normal (Gaussian) Distribution
The most common distribution. Many would recognize it as the ‘bell curve’. Most values lie around the mean, and the distribution is symmetric about it.
The central limit theorem suggests that the sum of a large number of independent, identically distributed random variables with finite variance is approximately normally distributed.
The area under the distribution curve is equal to 1 (all the probabilities must sum up to 1).
The parameter mu (μ) sets the center of the distribution (the mean). It corresponds to the position of the peak of the graph. The parameter sigma (σ) is the standard deviation and controls the spread of the curve; the variance is σ².
68–95–99.7 rule (empirical rule) – approximately 68%, 95%, and 99.7% of the data fall within 1, 2, and 3 standard deviations of the mean, respectively.
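The empirical rule and the central limit theorem can be checked together in a short sketch. The helper names `within_k_sigma` and `fraction_within` are hypothetical, and summing 30 uniform draws per sample is an arbitrary choice. The exact normal probabilities come from the standard identity P(|Z| ≤ k) = erf(k/√2).

```python
import math
import random
import statistics

def within_k_sigma(k):
    """Exact probability that a normal value lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

random.seed(1)  # fixed seed so the sketch is reproducible

# Central limit theorem: each sample is the sum of 30 independent uniform draws
sums = [sum(random.random() for _ in range(30)) for _ in range(20_000)]
mu = statistics.fmean(sums)
sigma = statistics.pstdev(sums)

def fraction_within(k):
    """Share of the simulated sums that fall within k standard deviations of their mean."""
    return sum(abs(s - mu) <= k * sigma for s in sums) / len(sums)

for k in (1, 2, 3):
    print(f"{k} sigma: exact normal {within_k_sigma(k):.4f}, simulated sums {fraction_within(k):.4f}")
```

Although each individual draw is uniform, the sums should land close to the 68%, 95%, and 99.7% marks, which is the bell curve emerging from the addition.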
Exponential Distribution
It is where a few outcomes are most likely, with a rapid decrease in probability for all other outcomes. An example would be the life of a car battery in months.
The parameter beta (β) is the scale; it equals both the mean and the standard deviation of the distribution. The parameter lambda (λ) is the rate, which is the reciprocal of the scale (λ = 1/β); a larger λ means the probability decays faster.
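The relationship between the scale and the rate (λ = 1/β) can be checked with a simulation. In the sketch below, the mean battery life of 48 months is a made-up figure for illustration; `random.expovariate` takes the rate λ as its argument.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

beta = 48.0     # hypothetical mean battery life in months (the scale)
lam = 1 / beta  # rate parameter, the reciprocal of the scale

# Simulate 50,000 battery lifetimes from an exponential distribution
lifetimes = [random.expovariate(lam) for _ in range(50_000)]

sample_mean = statistics.fmean(lifetimes)
sample_std = statistics.pstdev(lifetimes)

# For an exponential distribution both values should be close to beta
print(f"mean={sample_mean:.1f} months, std dev={sample_std:.1f} months")
```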
Probability Distribution Choices
I came across an awesome representation of the probability distribution choices. It works as a cheat sheet to understand the provided data.
Wrap Up
Though the above is just an introduction, I believe it should be good enough to get started and to relate to some basics of machine learning algorithms. There is more to learn while working on algorithms and problems, and while analyzing data to predict trends.
Keep learning!