# All Roads Lead to Probability

Probability and statistics are the basis of how we deal with the uncertainties of our world.

A little over a year ago, I began my data science and analytics studies at GWU. Although it wasn't my first time studying in the nation's capital, it was my first graduate-level endeavor, and looking back today, that MSBA program certainly equipped me with many of the skills I now use daily at SMT.

Like The Karate Kid, GWU's curriculum started by covering the foundational topics early before throwing in the more complex material. Except, instead of learning how to "wax on and wax off," we were learning Bayes' Theorem and how to navigate working directories in R.

But for one reason or another, my class schedule skipped over the probability and modeling courses and dropped me straight into machine learning. Honestly, I didn't think much of it at the time and just scheduled those courses near the end of my program a year later, but in hindsight, I realize that I mistakenly learned data science backward (... and turned out fine).

And much like the featured image of this blog post, this post is about what it was like to be introduced to machine learning before studying probability. I'll also touch on the foundational connection between the two disciplines.

Enjoy.

## The Enlightenment

Picture this. You're working on your first regression model. You gathered up 5,000 rows of data and 50 features (... you think you're doing great). You run the model and the error rate is low on the training data, but high on the validation data. You're confused, so you visit your professor during office hours because more features means better predictions, right? (Wrong.) Then, your professor starts throwing words like "overfitting," "bias-variance trade-off," "probability density," and "model complexity" at you. Then and there, you realize you have a lot to learn.

The above scenario was me quite some time ago before I knew better. Nowadays, I shoot for simplicity over complexity whenever possible for a few reasons:

- **Explainability**. Simpler models are not only easier to explain, but also easier to debug and to assess for fairness and causality.
- **Efficiency**. Complex models, if not handled elegantly, can suffer from long, intolerable compute times, while simple models run many times quicker.
- **Execution/Sustainability**. Simple models are easier to deploy and maintain over time. Especially in DevSecOps, the ability to execute and sustain software is paramount.

I would not have understood why the above points are true without a grasp of statistics and probability, plus some industry experience. Why, you may ask? Well, machine learning is like driving a car. You can start the engine, steer the wheel, and read the dashboard, but not until you pop the hood will you understand the engine. In other words, until you learn statistics and probability, you likely will not fully understand machine learning models.
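To make the overfitting scenario from the start of this section concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data are random draws, only the first three features actually drive the response, and the row count is far smaller than the 5,000 in the story so that the effect is easy to see.

```r
set.seed(42)

n <- 200   # rows (assumed; kept small so the effect is visible)
p <- 50    # candidate features

# Simulated data: only x1, x2 and x3 influence y; the other 47 are pure noise
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("x", 1:p)
y <- 2 * X[, 1] - 1 * X[, 2] + 0.5 * X[, 3] + rnorm(n)
df <- data.frame(y = y, X)

# Split into training and validation halves
train <- df[1:100, ]
valid <- df[101:200, ]

# Root mean squared error of a fitted model on a given data set
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

simple_fit  <- lm(y ~ x1 + x2 + x3, data = train)  # 3 features
complex_fit <- lm(y ~ ., data = train)             # all 50 features

results <- rbind(
  simple  = c(train = rmse(simple_fit, train),  valid = rmse(simple_fit, valid)),
  complex = c(train = rmse(complex_fit, train), valid = rmse(complex_fit, valid))
)
print(round(results, 3))
```

The 50-feature model can only match or beat the 3-feature model on the training rows, because least squares never fits worse when given extra regressors. The validation column is where the extra noise features tend to hurt.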

## From Statistics and Probability to Machine Learning

The idea that simpler models have a better use-case than complex models hits home when you consider that probability and statistics are the building blocks for most models and methodologies in machine learning.

"Using fancy tools like neural nets, boosting and support vector machines without understanding basic probability and statistics is like doing brain surgery before knowing how to use a band-aid." — Wasserman (2004)

Regardless of what you are trying to simulate or predict, 99 percent of the time, your model will follow the logic loop depicted above. The one percent exception is when your model does not have any prior observable data.

Let's take a look at the four phases of the loop:

- Phase 1: The **data generating process** is our perception of the ground truth, like a random experiment, for instance.
- Phase 2: **Probability** is how we quantify uncertainty, which is the likelihood that an event will occur. Whether you should lean toward the frequentist or Bayesian approach depends on the circumstances of your problem. Sometimes it is more logical to assign probabilities to data and not hypotheses, and vice versa. When we don't have data, we should rely on subjective expert opinion.
- Phase 3: **Observed data** is our collection of all outcomes (sample space), with events being a subset of the sample space.
- Phase 4: **Inference and data mining** is our statistical inference. Think of it as how we estimate uncertainty or sample-to-sample variance.
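The four phases can be walked through with a toy experiment. The sketch below assumes a coin-flip process; the true probability, the number of flips, and the normal-approximation interval are all choices made for illustration, not part of the original material.

```r
set.seed(1)

# Phase 1: data generating process. A coin whose true P(heads) we pretend
# not to know (the value 0.5 is an assumption for this sketch).
true_p <- 0.5
n_flips <- 100

# Phase 2: probability. Before flipping, quantify the uncertainty,
# e.g. the chance of seeing 60 or more heads in 100 flips.
p_at_least_60 <- 1 - pbinom(59, size = n_flips, prob = true_p)

# Phase 3: observed data. Run the experiment and record the outcomes.
flips <- rbinom(n_flips, size = 1, prob = true_p)

# Phase 4: inference. Estimate the unknown probability from the sample,
# with an approximate 95% interval to express sample-to-sample variation.
p_hat <- mean(flips)
se <- sqrt(p_hat * (1 - p_hat) / n_flips)
ci <- c(lower = p_hat - 1.96 * se, upper = p_hat + 1.96 * se)

print(p_at_least_60)
print(p_hat)
print(ci)
```

Phases 2 and 4 run in opposite directions: probability reasons from the process to the data, while inference reasons from the data back to the process.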

Misconstruing statistics and machine learning as different disciplines is a common error. I think the confusion stems from conflicting terminology used in each field.

Look at these examples for instance:

- Data (Statistics) = Training Sample (ML)
- Estimation = Learning
- Classification = Supervised Learning
- Clustering = Unsupervised Learning

On the left side of the equals sign is the language favored by statisticians. The right is preferred by non-traditional statisticians (data scientists and machine learning engineers). The benefit of seeing how the two fields fit together shows up in your understanding of uncertainty and how we deal with it.

## Dealing With Uncertainty

What types of uncertainties do we deal with in our world? Let's name a few examples:

- Who will win the opening coin toss in the next World Cup game?
- Will the NASDAQ go up tomorrow?
- How many people will catch the flu this winter?
- What will be the number of sales at a grocery store this week?
- How many times will Elon tweet tomorrow?

In all the above examples, we will eventually know the outcome, so data is being generated as we collect observations. That means the data generating process is, at its core, just a series of random experiments.

Why is that? An experiment is any data generating process whose outcome is not known in advance with certainty. Because we can list the possible outcomes before the experiment, but cannot know which one will actually be observed until it happens, the data generating process is a (random) experiment.

So all that is left is to quantify (or express) our uncertainty about the actual outcome of an event before the experiment, which can be referred to as probability modeling of uncertainty. If we continue this line of thought, we can claim that modeling is based on data from previous experiments. Therefore, machine learning is largely rooted in statistics and probability.

Although society has used probability to measure uncertainty for decades, and there is complete agreement on the mathematical theory behind it, many still do not agree on its meaning and interpretation.

The battle over the interpretation is largely fought between the (Relative) Frequentists and the (Subjective) Bayesians. In my opinion, the interpretation you use should depend on the problem at hand. If you have data with lots of examples, go Frequentist; if not, go Bayesian (... but then again, I do not care too much about the definitions).
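As a rough illustration of that rule of thumb, the sketch below estimates the probability of heads both ways. The coin-flip counts and the Beta(2, 2) prior are hypothetical, and `estimate` is a helper name made up for this example.

```r
# Frequentist estimate: the relative frequency of heads.
# Bayesian estimate: the posterior mean under a Beta(a_prior, b_prior) prior,
# which updates to Beta(a_prior + heads, b_prior + tails).
estimate <- function(heads, n, a_prior = 2, b_prior = 2) {
  a_post <- a_prior + heads
  b_post <- b_prior + (n - heads)
  c(frequentist = heads / n,
    bayesian    = a_post / (a_post + b_post))
}

small <- estimate(heads = 7,   n = 10)    # little data: the prior matters
large <- estimate(heads = 700, n = 1000)  # lots of data: the prior washes out

print(round(rbind(small, large), 4))
```

With only 10 flips, the prior pulls the Bayesian estimate noticeably toward 0.5. With 1,000 flips, the two estimates are nearly identical, which is one way to read the advice above: the choice of interpretation matters most when data is scarce.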

Anyhow, let's wrap things up with a code example.

## R Code Example

In this example, we'll analyze daily returns from Ford and the S&P 500 and answer some probability questions one might have when picking stocks.

If you would like to follow along, feel free to download the dataset embedded below.

First, let's load the data.

```
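# Read the whitespace-delimited price file; the first row holds the column names
data = read.table('prices.txt', header=TRUE)
sp500 = data[,3]  # S&P 500 prices (third column)
ford = data[,5]   # Ford prices (fifth column)
head(data)        # preview the first few rows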
```

Next, let's calculate daily price movement and return.

```
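n = length(sp500)
# Pre-allocate daily price changes (d_) and daily returns (r_);
# day 1 stays 0 because it has no previous day to compare against
d_sp500 = rep(0,n)
r_sp500 = rep(0,n)
d_ford = rep(0,n)
r_ford = rep(0,n)
for (t in 2:n) {
  d_sp500[t] = sp500[t] - sp500[t-1]    # price change from the previous day
  r_sp500[t] = d_sp500[t] / sp500[t-1]  # change relative to the previous day's price
  d_ford[t] = ford[t] - ford[t-1]
  r_ford[t] = d_ford[t] / ford[t-1]
}
head(cbind(r_sp500, r_ford))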
```
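For reference, the same return calculation can be written without a loop. This is only a sketch on a made-up price vector (`prices` is hypothetical, used so the snippet runs on its own); it checks that `diff()` reproduces the loop's result.

```r
# Hypothetical closing prices, used only to keep the sketch self-contained
prices <- c(100, 102, 101, 105, 104)

# Loop version, mirroring the code above: element 1 stays 0
n <- length(prices)
r_loop <- rep(0, n)
for (t in 2:n) {
  r_loop[t] <- (prices[t] - prices[t - 1]) / prices[t - 1]
}

# Vectorized version: diff() returns prices[t] - prices[t-1] for t = 2..n,
# and head(prices, -1) drops the last price to give the previous-day prices
r_vec <- c(0, diff(prices) / head(prices, -1))

print(all.equal(r_loop, r_vec))
```

The vectorized form is shorter and usually faster on long series, while the loop makes the day-by-day logic explicit.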

Now, let's plot the data.

```
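# Two panels side by side: a time series of returns and their histogram
par(mfrow=c(1,2))
plot(r_sp500, type='l', main='Daily SP500 Returns', xlab='Day', ylab='Returns', col='blue')
abline(h=0, col='red', lwd=2)  # zero-return reference line
hist(r_sp500, col='blue', xlab='SP500 Returns', prob=TRUE, nclass=20)  # density scale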
```

```
# Same pair of plots for Ford (the par(mfrow) layout set above still applies)
plot(r_ford, type='l', main='Daily Ford Returns', xlab='Day', ylab='Returns', col='blue')
abline(h=0, col='red', lwd=2)
hist(r_ford, col='blue', xlab='Ford Returns', prob=TRUE, nclass=20)
```

It looks like the returns for both Ford and the S&P 500 roughly resemble a normal distribution; however, Ford has slightly higher volatility.

Moving on, one might be interested in how often returns are positive. So, for example, let's find the probability that Ford's stock price moves up by more than 2.5% and that the S&P 500 moves up by more than 1%.

```
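# Flag days whose movement exceeds each threshold. Note: d_* are raw price
# changes (price units); use r_sp500 and r_ford to threshold on percentage returns.
id_sp500 = (d_sp500 > .01)
id_ford = (d_ford > .025)
# The mean of a logical vector is the share of TRUE days
cbind(mean(id_sp500), mean(id_ford))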
```

Now, let `F` stand for upward Ford stock movement and `S` stand for upward S&P 500 stock movement. Let's figure out the following probabilities:

- $P(S \cap F)$ = probability that both the S&P 500 and Ford go up on a given day.
- $P(S \cap F^{c})$ = probability that the S&P 500 goes up and Ford does not on a given day.
- $P(S^{c} \cap F)$ = probability that Ford goes up and the S&P 500 does not on a given day.
- $P(S^{c} \cap F^{c})$ = probability that neither Ford nor the S&P 500 goes up on a given day.

```
library(descr)
# 2x2 contingency table of raw counts only (all proportion outputs switched off)
CrossTable(id_sp500, id_ford, prop.chisq = FALSE, prop.r = FALSE, prop.c = FALSE, prop.t = FALSE)
```

Then, we divide each cell count in the table by the total number of days (2,363) to get the four probabilities, respectively.

- 757/2363 = 0.32
- 476/2363 = 0.20
- 318/2363 = 0.13
- 812/2363 = 0.34
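The same four counts also give marginal and conditional probabilities. The sketch below assumes the counts map to the four events in the order they are listed above (757 for both up, 476 for S&P 500 up only, 318 for Ford up only, 812 for neither); the variable names are made up for illustration.

```r
# Cell counts from the calculations above, rows = S&P 500, columns = Ford
# (the mapping of counts to cells is an assumption based on the listed order)
counts <- matrix(c(757, 476,
                   318, 812),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(SP500 = c("up", "not_up"),
                                 Ford  = c("up", "not_up")))

joint <- counts / sum(counts)   # joint probabilities, e.g. P(S and F)

p_S <- sum(joint["up", ])       # marginal P(S): S&P 500 goes up
p_F <- sum(joint[, "up"])       # marginal P(F): Ford goes up

# Conditional probability: P(F | S) = P(S and F) / P(S)
p_F_given_S <- joint["up", "up"] / p_S

# If S and F were independent, P(S and F) would equal P(S) * P(F)
p_joint_if_independent <- p_S * p_F

print(round(c(P_S = p_S, P_F = p_F, P_F_given_S = p_F_given_S), 2))
print(round(c(observed = joint["up", "up"], if_independent = p_joint_if_independent), 2))
```

Under that mapping, Ford goes up on roughly 61% of the days the S&P 500 goes up, compared with about 45% of all days, so the two movements do not look independent.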

Okay, I think that just about does it for this post. You got some personal stories, some theory, and some R code, so if you made it this far, I hope you got some value out of it. Also, I need to give credit to my GWU DNSC 6311 lectures for inspiring a lot of this content.
