Data, data, everywhere data. In today's highly digitized and connected world, data are collected and recorded every second of the day. In the past, social scientists - like myself - spent unfathomable hours finding subjects and begging for and collecting data for scientific study. The mountains of data readily available now can make you downright giddy (well, they make me giddy anyway). New software programs and statistical techniques have emerged to handle the extremely large data sets now available to statisticians.
Big Data, a term to describe the mining of large data sets for patterns, trends, and relationships, is poised to help solve problems and offer insight into complicated situations for organizations across industries. The information gleaned from the data could inform healthcare decisions, organizational outcomes, and process improvements. I personally look forward to seeing where this focus on data will lead us in the future.
While I definitely embrace the future of data science, I tend to worry about the science part of data science. Big Data is generally an approach that sifts through the available data looking for empirical relationships, without any preconceived notions as to what the data might show. Science, or at least good science, implies that there is a theory or reasoning behind the relationships that are found. In a scientifically sound research study, there is typically a theoretical foundation for a particular relationship, hypotheses are set forth, and analyses are conducted to test those hypotheses. In many cases, extraneous variables are considered and built into the study (to control for them) to help focus on the relationship of interest. When the scientific process is used, the results and conclusions are well-supported and credible. In Big Data, the process is empirical, and the relationships identified are often taken as truth when in actuality they could be spurious or nonsensical.
I came across an article with examples of spurious correlations: very strong - almost perfect - correlations that could lead some people to draw nonsensical conclusions. For example, there is a .998 correlation (almost perfect; 1.0 is perfect) between U.S. spending on science, space, and technology and suicides by hanging, strangulation, and suffocation. Someone examining the data without a scientific basis might conclude that spending more on science, space, and technology somehow unleashes a chain of events that leads to an increase in suicides.
Correlation ≠ Causation
It's important to note that correlation does not imply causation. This is a prime example of how two variables can vary together yet have no link to one another. Suicides don't cause an increase in space spending, and vice versa. See the article for more examples of spurious relationships (e.g., drownings from fishing boats and the marriage rate in Kentucky). These examples are so ridiculous that it's easy to see that the two variables are not theoretically linked.
With Big Data, while the variables available in the data set may be more sensible in their linkage, the relationship could be just as spurious. This is where the scientific process fits into the mix. Before mining a data set, there should be some thought about what is in it and why certain relationships may or may not exist. Good scientific principles like cross-validation (using a holdout sample and checking that similar relationships hold across samples) should continue to be used to establish credibility for the results.
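The holdout check described above can be sketched in a few lines of Python. This is a minimal illustration on simulated data (the variables, sample size, and seed are hypothetical, not from any study discussed here): split the observations into a training sample and a holdout sample, then compare the correlation in each. A real relationship should appear with similar strength in both samples, while a spurious one will often shrink or vanish in the holdout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: 200 observations of two unrelated variables,
# so any correlation we find is spurious by construction.
n = 200
x = rng.normal(size=n)
y = rng.normal(size=n)

# Randomly split the observations into training and holdout samples.
idx = rng.permutation(n)
train, holdout = idx[: n // 2], idx[n // 2 :]

def pearson_r(a, b):
    """Pearson correlation coefficient between two arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_train = pearson_r(x[train], y[train])
r_holdout = pearson_r(x[holdout], y[holdout])

print(f"training correlation: {r_train:+.3f}")
print(f"holdout correlation:  {r_holdout:+.3f}")
```

A relationship "discovered" in the training half that fails to replicate in the holdout half is a warning sign that the pattern is noise rather than a credible finding worth theorizing about.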
Sometimes Big Data studies are exploratory in nature. When they are, this should be well-communicated and used as a stepping stone to building scientific theory and to inform future scientific research. Alternatively, when there is reason to test for particular relationships, the scientific foundation should be established and models should be tested. The results should be discussed within the context of theory and other scientific literature.
Big Data has the potential to answer questions and explore problems previously inaccessible to scientists. In the excitement of data accessibility, it's important to be careful in interpreting the results, continue to apply sound scientific principles, and use the data to build theory and scientific models that can continue to be tested now and in the future.