What is data science, really?
In all the research I’ve done, I still don’t really have what I would consider a satisfactory answer. It obviously has something to do with the science of data, but drawing a line between a “data scientist” and “statistician” is more difficult than it seems. This is made even more convoluted by the fact that a statistician may very well be able to argue that they’re also a data scientist, while a data scientist may argue that they’re also a statistician.
To me, the difference is more about the setting and the purpose of each role. Being a data scientist is, in some sense, the same as being a statistician in industry. That makes it sound like there’s hardly any difference at all, but that’s not exactly the case either. The implication of “in industry” is much larger than it first appears.
So, what are these implications?
Well, for one, data scientists need to have a lot more software knowledge than a typical statistician. They need to be able to manage and utilize databases, deal with absolutely massive datasets, and handle much “messier” data overall. The data that data scientists use is often unstructured, which makes the wrangling and cleaning process much more involved.
By nature of working in industry, data scientists also need to deal with the typically non-technical people in their industry (shocking, I know). This means that the questions given to a data scientist aren’t technical either. Business decision makers don’t want to know if something is statistically significant, the effect size of a treatment, or the statistical power of some experiment. They have data, and they want their data scientists to use that data to give them immediate, actionable answers to the problems their companies face every day.
This means that data scientists have to take a practical business question, translate it into a statistical question, find a statistical answer to that question, and translate that back into a practical business answer. Thus, the presentation and framing of results are a huge part of a data scientist’s job. It doesn’t matter how great a model is if you can’t make it easy for other people in the business to digest and use.
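To make that round trip a bit more concrete, here’s a rough sketch in Python. Everything in it is made up for illustration: the “did the new checkout page help?” scenario, the conversion counts, the function names `two_proportion_z_test` and `business_answer`, and the 0.05 cutoff. A one-sided two-proportion z-test is just one reasonable way to phrase the statistical question, not the only one.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Statistical question: is page B's conversion rate higher than page A's?

    One-sided z-test of H0: p_b <= p_a against H1: p_b > p_a,
    using the pooled proportion for the standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return p_a, p_b, z, p_value


def business_answer(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Translate the statistical answer back into a plain recommendation."""
    p_a, p_b, _, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    if p_value < alpha:
        return (f"Roll out the new page: conversion rose from {p_a:.1%} "
                f"to {p_b:.1%}, and that gain is unlikely to be chance.")
    return (f"Keep the current page for now: {p_a:.1%} vs. {p_b:.1%} "
            f"is too close to call with this much data.")


# Hypothetical counts: 200 of 5,000 visitors converted on the old page,
# 260 of 5,000 on the new one.
print(business_answer(200, 5000, 260, 5000))
```

With these invented numbers the test comes out in favor of the new page, but the point is the shape of the output: the decision maker gets a sentence they can act on, while the z-statistic and p-value stay behind the scenes.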
Similarly, the methods that statisticians and data scientists employ will also differ quite a bit. Statisticians typically want to draw inferences about a population, fully understanding and interpreting the many components of an analysis. Data scientists, however, don’t generally care about inference. It just doesn’t really matter why something is the way it is; it only matters how well we can make predictions and, by extension, make practical decisions. That’s not to say that a data scientist couldn’t perform the statistical tests that a statistician would typically run, or that a statistician couldn’t build the machine learning models that a data scientist would often be building. It just speaks to the general perspective each role brings to a problem.
What… am I?
Personally, I’d definitely say I’m more on the statistician side of the spectrum at the moment. My software skills are limited to SAS, R, and a little bit of Python/SQL, and I know precious little about the machine learning methods that data scientists often use. However, I think (and hope!) that will change after this summer of Python and R courses, and that I’ll be more prepared to take on data scientist roles in my career.