Machine Learning (ML) is turning heads in the human resources (HR) field and I couldn’t be more excited. As an IO psychologist, I expected to stretch far outside my comfort zone into the data science world to understand ML. However, I’ve learned that, with a decent grasp on statistics, ML is a manageable stretch.
This raises the question: what is the difference between ML and statistics, anyway? I'll start with an example of ML doing its part to improve recruitment processes.
Faliagka, Ramantas, Tsakalidis, and Tzimas (2012) from Patras, Greece created an ML algorithm to evaluate and rank job applicants for recruitment. First, they collected resumes and blog posts from applicants' LinkedIn profiles and social media presence. Recruiters assessed the information and gave each applicant a score that reflected how well the person fit the job requirements for three positions: sales engineer, junior programmer, and senior programmer. The score was based on four characteristics: 1) years of formal academic training, 2) years of relevant work experience, 3) average number of years spent at previous jobs, and 4) extraversion. Using the resume data, blog data, and recruiter scores, the researchers trained a model to accurately infer level of extraversion and rank applicants by job fit. The researchers then used this learning-to-rank model to quickly and effectively identify the best-fitting candidates without having to slog through hundreds of resumes.
Could HR departments accomplish the same task with traditional statistics? Probably. They could create a mathematical formula to score applicants. But it would likely take more time and effort than training a model. And what if the job criteria changed (as they typically do every few years) and the old formula became obsolete? What if the recruiters missed a key variable that was a strong predictor of job fit? ML models can provide a quick and accurate resolution to these types of roadblocks.
Before I discuss the differences between ML and statistics (more precisely, statistical modeling), I want to point out a few key similarities:
Both ML and statistical modeling reveal patterns in data. They help us make inferences and predictions from a set of information that we turn into meaningful insights.
ML is built on statistics. It is not the same as statistics; it is not glorified statistics, but it does share a lot of the same concepts. So, if you understand statistics and statistical modeling, you will probably catch on to ML quicker than someone who does not.
ML and statistical modeling share data analysis techniques, like linear regression. If you run the same data through the same technique, you should come to a similar conclusion. For example, a regression model trained in ML should produce estimates very close to those of a traditional regression model fit to the same data.
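To make that concrete, here is a minimal sketch (using made-up data) that fits the same linear regression two ways: the closed-form least-squares solution familiar from statistics, and gradient descent, the iterative "training" style used in ML. Both routes land on essentially the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: an intercept column plus one predictor (say, an engagement score).
X = np.c_[np.ones(200), rng.normal(size=200)]
true_beta = np.array([2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=200)

# "Statistics" route: closed-form least-squares solution.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# "ML" route: the same model fit by iterating over the data (gradient descent on MSE).
beta_gd = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = 2 * X.T @ (X @ beta_gd - y) / len(y)
    beta_gd -= lr * grad

print(np.round(beta_ols, 3))  # near-identical to beta_gd
print(np.round(beta_gd, 3))
```

Different machinery, same conclusion: the "trained" coefficients converge to the least-squares estimates.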
Moving on to the differences:
Statistical modeling is a method for describing relationships between variables in the form of a mathematical equation. ML is a type of artificial intelligence that learns from data then produces an outcome without relying on rules-based programming or predetermined equations.
With a statistical model, I as the researcher have to indicate what variables to test and build the relational model. ML is a little more flexible and forgiving. I still need to select "features" to train on, but certain algorithms help find relationships I may not know exist (particularly in deep learning).
Both statistics and ML methods can be used for prediction and inference. Historically, however, statistics focuses on inference while ML concentrates on prediction.
Statistical models are great for quantifying and explaining how variables are associated with one another. For instance, if I wanted to understand if and how burnout, pay satisfaction, and engagement were related to turnover, I might use regression to infer the relationships. Then, I’d use the resulting equation to predict employee turnover.
ML models are excellent at making repeatable predictions by learning from experience (i.e., iterating over the data). If I wanted to predict whether an employee is likely to leave the company or not, I might collect loads of data over time on burnout, pay satisfaction, engagement, and other factors; train a model on the data; test the model; and then use it to predict employee turnover.
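The workflow above can be sketched in a few lines. This is an illustration only: the data is synthetic, the feature names (burnout, pay satisfaction, engagement) and the effect directions are assumptions of mine, and since turnover here is a yes/no outcome I use a classifier (logistic regression) rather than a regressor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Synthetic HR data: columns stand in for burnout, pay satisfaction, engagement.
X = rng.normal(size=(n, 3))
# Assumed pattern: turnover risk rises with burnout, falls with satisfaction and engagement.
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 1.2 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)  # 1 = left the company

# Train on one slice of employees, test on a held-out slice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)

print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```

The held-out test set is the key ML habit: the model is judged on employees it never saw during training, which is what makes its predictions repeatable on new data.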
In other words, statistical models characterize relationships which can be used to make predictions. ML models make predictions by extracting relationships from data.
Statistical and ML models differ on their level of interpretability. Good statistical models tend to be simple and easy to understand. They are often theory driven and follow a logical pattern of thought. ML models can also be simple, easy to understand, and logical. However, they can be incredibly complex and convoluted, too, with excellent accuracy (i.e., deep learning layers). ML models might find patterns within data that don’t make much sense, but that’s okay if the goal is predictive power and not necessarily interpretability.
Statistical models can make decent inferences and predictions with a few hundred participants or observations. They work well when the data is "long" (number of observations exceeds the number of input variables) and standard assumptions are met (e.g., normality, homogeneity). ML is great for "wide" data (number of input variables exceeds the number of observations). It makes minimal assumptions about the data and can handle huge datasets that are wild and unruly.
Terminology is a "same-but-different" issue. Many of the same concepts go by different names depending on the approach. For example, in statistics, covariates are characteristics like number of absences or performance ratings; a covariate is also called an independent variable when used to predict some outcome, like promotability. In ML, these same characteristics are called features: the values that models learn, or "train," on to predict an outcome.
| Statistical Models | Machine Learning |
| --- | --- |
| Covariate / Independent Variable | Feature |
| Sum of Squared Residuals | Cost Function |
We traditionally evaluate statistical models by testing their significance, the robustness of the parameters, goodness-of-fit, and the general strength of the relationships. In ML, we evaluate how well the model performs on the test set and use cross-validation techniques (e.g., k-fold) to assess the effectiveness of the model. One common evaluation metric in ML is model accuracy: the ratio of the number of correct predictions to the total number of predictions made. Other metrics include the area under the curve (AUC), F1 score, mean absolute error, and loss.
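Here is a short sketch of those two evaluation ideas, accuracy and k-fold cross-validation, on a synthetic binary outcome (the "promoted / not promoted" framing is just an illustrative label I'm attaching to generated data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary outcome (e.g., promoted / not promoted) for illustration only.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

model = LogisticRegression()

# 5-fold cross-validation: fit on 4 folds, score accuracy on the held-out fold,
# and rotate so every observation is held out exactly once.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Fold accuracies: {np.round(scores, 2)}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

Averaging accuracy across folds gives a steadier estimate of how the model will perform on data it has never seen than a single train/test split would.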
Given all the similarities and differences between statistical modeling and ML, you might be wondering, how do I know which approach to use? Here are a few questions I start with:
- What do I need to understand (e.g., relationships, outcomes)?
- What are my resource constraints (e.g., data, expertise)?
- What am I going to do with the information (e.g., make inferences, predictions)?
If you have answers to all of the above questions, then it might be ideal to take a hybrid approach. Try both. Try mixing methods. Try something new. You won’t hurt the data.
For more information about ML and statistics, check out the articles below and keep an eye out for future blog posts from PSI Services.
- Faliagka, E., Ramantas, K., Tsakalidis, A., & Tzimas, G. (2012, May). Application of machine learning algorithms to an online recruitment system. In Proceedings of the International Conference on Internet and Web Applications and Services.