This paper, written by David Donoho, a Professor of Statistics at Stanford, is the best overview I’ve seen of the field of data science and its relationship to the field of statistics. My apologies if you’ve already come across it.

In this article, Dr. Donoho discusses:

- A history of data science and the prime movers in the field, and the pros and cons of the field’s current direction.
- Key differences between statistics and data science, including traditional statistics’ focus on inference versus the more application-oriented focus of data science on prediction. He also goes into some detail about the many similarities between what is taught in statistics departments and emerging data science curricula.
- The need for more formalized training on data exploration and cleaning in curricular programs, and his extensive thoughts on what an appropriate curriculum should be, citing others in the field.

He presents six “key divisions” of data science for use in a proposed curriculum:

- Data exploration and preparation
- Data representation and transformation
- Computing with data
- Data modeling
- Data visualization and presentation
- Science about data science (adopting a scientific approach to developing tools and workflows in the field, with rigorous performance metrics, and ensuring there are standards for documenting both analyses and data for reproducibility and long-term preservation)

My favorite quote in this article, which comes from or paraphrases John Tukey, concerns new tools or methods of analyzing data:

“The true effectiveness of a tool is related to the probability of deployment times the probability of effective results once deployed.”

So very true.
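
To make the arithmetic in that quote concrete, here is a minimal Python sketch. It assumes the quote can be read literally as a product of two probabilities. The function name and the numbers are my own, made up purely for illustration; none of them come from the paper.

```python
def expected_effectiveness(p_deployed: float, p_effective_if_deployed: float) -> float:
    """Chance a tool gets used, times the chance it works once it is used."""
    for p in (p_deployed, p_effective_if_deployed):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_deployed * p_effective_if_deployed


# Hypothetical tools with invented probabilities (not data from the paper).
# A sophisticated method that almost nobody ends up using:
elegant_but_obscure = expected_effectiveness(0.05, 0.90)
# A cruder method that many analysts actually reach for:
simple_and_popular = expected_effectiveness(0.60, 0.50)

print(f"elegant but obscure: {elegant_but_obscure:.3f}")
print(f"simple and popular:  {simple_and_popular:.3f}")
```

With these invented numbers, the cruder but widely used tool comes out well ahead, which is how I read the point of the quote: a method that rarely gets deployed contributes little, however good it is once deployed.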