Data Science Pathways

Data science is a synthesis of several well-established fields, so the skills that make for a successful data scientist come from a variety of disciplines, including statistics, mathematics, and computer science. Navigating a pathway through developing skills in all of these fields can be challenging. To help provide pathways through data science skill development, I've compiled a list of resources for building and expanding data science knowledge:


Computational Tools


There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are useful packages for solving specific problems or for communicating and visualizing your results.

Programming Languages

Almost any programming language can be used to solve computational problems, although a few stand out in terms of built-in packages and user support communities. Most notably, Python and R excel in these respects and are also freely available. MATLAB may have the most detailed documentation of any of the options available, but it is commercial software.

Python. Python is a powerful, general-purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, Pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, using an IDE such as Rodeo or Spyder may speed up the development of analyses.

R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and OSX. With the RStudio integrated development environment (IDE), the language can be powerfully wielded for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.

MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.

Julia. A newer programming language designed to meet the needs of mathematical computing.

Version Control

Almost any data science project worth doing requires many revisions and collaboration. These tools provide Git-based version control, typically paired with a web-based repository. GitHub is the most popular hosting service, but others offer similar web-based repository services.

Git. Open source distributed version control system. Git is often used with a web-based Git repository hosting service such as GitHub.

Code Sharing and Dissemination

Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

GitHub Pages / GitHub.io. GitHub Pages allows you to create a web page from a GitHub repository and convert plain text into a formatted web document.

Visualization

D3.js. D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since this is based in JavaScript, visualizations are entirely customizable, but do require significant skill to use effectively.

Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.

Database Management (for big and little data)

MySQL. An open source relational database management system using SQL.

Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce algorithm for data processing.
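To make the MapReduce approach mentioned above more concrete, here is a minimal single-machine sketch of a word count in plain Python. It is only an illustration of the map, shuffle, and reduce steps; it does not use Hadoop or HDFS, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on a list of text documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data science uses data"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines, with the framework handling the grouping step; here everything runs in one process so the data flow is easy to follow.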

MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
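The difference between the relational and document-oriented models described above can be sketched with Python's standard library alone. This is an assumption-laden illustration: `sqlite3` stands in for a SQL database such as MySQL, and plain dictionaries serialized with `json` stand in for MongoDB documents; neither MySQL nor MongoDB is actually used, and the table and field names are made up for the example.

```python
import json
import sqlite3

# Relational model: a fixed schema, one row per record.
# sqlite3 is used here as a stand-in for a SQL database such as MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books (title, year) VALUES (?, ?)",
    [("Deep Learning", 2016), ("Think Stats", 2014)],
)
recent_rows = conn.execute(
    "SELECT title FROM books WHERE year >= ? ORDER BY title", (2015,)
).fetchall()
conn.close()

# Document model: each record is a self-describing document, and
# documents in the same collection need not share the same fields.
# Plain dicts stand in for documents in a store such as MongoDB.
documents = [
    {"title": "Deep Learning", "year": 2016,
     "authors": ["Goodfellow", "Bengio", "Courville"]},
    {"title": "Think Stats", "year": 2014,
     "tags": {"language": "Python"}},
]
recent_docs = [doc["title"] for doc in documents if doc["year"] >= 2015]

# Documents are typically exchanged in a JSON-like form.
serialized = json.dumps(documents[0])

print(recent_rows)
print(recent_docs)
```

The relational version requires every row to fit the `books` table's columns, while the document version lets one record carry a list of authors and another a nested set of tags without any schema change.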


Resources and References

Recommendations are indicated with a star.


Author Name Topic Year
Donoho, David 50 Years of Data Science Data Science as a Field 2015
Kriesel, David A Brief Introduction to Neural Networks Machine Learning - Deep Learning and Neural Networks 2007
Daumé III, Hal A Course in Machine Learning Machine Learning 2015
Beezer, Robert Arnold A First Course in Linear Algebra Mathematics - Linear Algebra 2008
Ross, Sheldon A First Course in Probability Probability and Statistics 2014
Yee, Stephanie, and Tony Chu A Visual Introduction to Machine Learning Machine Learning Unknown
Shalizi, Cosma Advanced Data Analysis from an Elementary Point of View Machine Learning Unknown
Wickham, Hadley Advanced R Programming - R Unknown
Smilkov, Daniel and Shan Carter An Interactive Neural Network Playground Machine Learning - Deep Learning and Neural Networks Unknown
Venables, W., and D. Smith An Introduction to R Programming - R 2017
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani An Introduction to Statistical Learning Machine Learning 2013
Sweigart, Al Automate the Boring Stuff with Python: Practical Programming for Total Beginners Programming - Python 2016
Ash, Robert B. Basic Probability Theory Probability and Statistics 1970
Davidson-Pilon, Cameron Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference Machine Learning - Bayesian Methods 2015
Barber, David Bayesian Reasoning and Machine Learning Machine Learning - Bayesian Methods 2012
Marsden, Jerrold and Alan Weinstein Calculus 1, 2, and 3. 2nd Edition Mathematics - Calculus 1985
Hartman, Gregory Calculus 1, 2, and 3. 3rd Edition Mathematics - Calculus 2015
Stewart, James Calculus: Early Transcendentals. 8th Edition Mathematics - Calculus 2015
Strang, Gilbert Calculus: MIT Open Courseware Mathematics - Calculus 1991
Guichard, David Community Calculus Mathematics - Calculus 2016
Maclean, Malcolm D3 Tips and Tricks Visualization - D3 2013
Swanstrom, Ryan Data Science University Programs Data Science as a Field 2015
Lin, Jimmy, and Chris Dyer Data-Intensive Text Processing with MapReduce Programming - MapReduce 2010
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville Deep Learning Machine Learning - Deep Learning and Neural Networks 2016
Ng, Andrew Deep Learning Tutorial Machine Learning - Deep Learning and Neural Networks Unknown
Trench, William Elementary Differential Equations Mathematics - Differential Equations 2013
Rougier, Nicholas From Python to Numpy Programming - Python 2017
Rasmussen, Carl Edward, and Christopher Williams Gaussian Processes for Machine Learning Probability and Statistics - Gaussian Processes 2006
Chacon, Scott Git Internals Programming - Version Control 2008
Atlassian Git Tutorial Programming - Version Control Unknown
Grinstead, Charles and James Snell Grinstead and Snell’s Introduction to Probability Probability and Statistics 2006
Géron, Aurélien Hands-On Machine Learning with Scikit-Learn and TensorFlow Machine Learning 2017
Ström, J., K. Åström, and T. Akenine-Möller Immersive Linear Algebra Mathematics - Linear Algebra 2016
MacKay, David Information Theory, Inference and Learning Algorithms Probability and Statistics - Information Theory 2003
Murray, Scott Interactive Data Visualization for the Web Visualization - D3 2013
Vandenberghe, L. Introduction to Applied Linear Algebra Mathematics - Linear Algebra 2017
Shaw, Zed Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code Programming - Python 2013
Navarro, Daniel Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners (version 0.5) Programming - R 2015
Hefferon, Jim Linear Algebra Mathematics - Linear Algebra 2006
Lay, David Linear Algebra and Its Applications Mathematics - Linear Algebra 2006
Treil, Sergei Linear Algebra Done Wrong Mathematics - Linear Algebra 2004
Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman Mining of Massive Datasets Programming - MapReduce 2014
MySQL MySQL Tutorial Programming - SQL 1997
Nielsen, Michael Neural Networks and Deep Learning Machine Learning - Deep Learning and Neural Networks 2016
Lebl, Jiří Notes on Diffy Qs: Differential Equations for Engineering Mathematics - Differential Equations 2014
Few, Stephen Now You See It: Simple Visualization Techniques for Quantitative Analysis Visualization - Design 2009
Corthell, Clare Open Data Science Masters Curriculum Data Science as a Field 2015
Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel OpenIntro Statistics Probability and Statistics 2015
Gertz, M. Oracle/SQL Tutorial Programming - SQL 2000
Duda, Richard O., Peter E. Hart, and David G. Stork Pattern Classification Machine Learning 2012
Bishop, Christopher Pattern Recognition Machine Learning 2006
VanderPlas, Jake Python Data Science Handbook Programming - Python 2016
Severance, Charles Python for Informatics Programming - Python Unknown
Raschka, Sebastian Python Machine Learning, Second Edition Machine Learning 2017
Grolemund, Garrett, and Hadley Wickham R for Data Science Programming - R 2017
Skinner, Grant RegExr Programming - Regular Expressions Unknown
Sutton, Richard, and Andrew Barto Reinforcement Learning: An Introduction Machine Learning - Reinforcement Learning 2010
Varoquaux et al. Scipy Lecture Notes Programming - Python 2017
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani The Elements of Statistical Learning (2nd Edition) Machine Learning 2009
Cairo, Alberto The Functional Art: An Introduction to Information Graphics and Visualization Visualization - Design 2012
Reitz, Kenneth and Tanya Schlusser The Hitchhiker’s Guide to Python Programming - Python 2016
Oetiker, Tobias The Not So Short Introduction to LaTeX 2ε Programming - Typesetting 2016
Python Software Foundation The Python Tutorial Programming - Python 2017
Nilsson, Nils The Quest for Artificial Intelligence: A History of Ideas and Achievements Data Science as a Field 2010
Stitch The State of Data Science Data Science as a Field 2016
Tufte, Edward The Visual Display of Quantitative Information Visualization - Design 2001
Downey, Allen Think Bayes: Bayesian Statistics Made Simple Machine Learning - Bayesian Methods 2013
Downey, Allen Think Complexity: Complexity Science and Computational Modeling Programming - Python 2012
Downey, Allen Think Python: How to Think Like a Computer Scientist Programming - Python 2015
Downey, Allen Think Stats: Probability and Statistics for Programmers Probability and Statistics 2014
Shalev-Shwartz, Shai and Shai Ben-David Understanding Machine Learning: From Theory to Algorithms Machine Learning 2014

Online Course Materials

Recommendations are indicated with a star.

Instructor Title Designation University Year
Siegel A Mathematics Course for Political and Social Researchers None Duke University 2014
Adams Advanced Machine Learning CS 281 Harvard University 2013
Gogate Advanced Machine Learning CS 7301 University of Texas at Dallas 2017
Krause Advanced Topics in Machine Learning CS 253 California Institute of Technology 2010
Mueller Applied Machine Learning COMS W4995 Columbia 2017
Berenson Artificial Intelligence CS 534 Worcester Polytechnic Institute 2015
Winston Artificial Intelligence OpenCourseware Massachusetts Institute of Technology 2010
Chan Computational Statistics in Python STA 663 Duke University 2015
Chan Computational Statistics in Python STA 663 Duke University 2017
Li Convolutional Neural Networks for Visual Recognition CS 231n Stanford University 2017
Walther Data Mining and Analysis Stats 202 Stanford University 2017
Arnold Data Mining and Machine Learning STAT 365/665 Yale University 2016
Irizarry Data Science CS 109 Harvard University 2014
Klein Introduction to Artificial Intelligence CS 188 University of California, Berkeley 2014
Konidaris Introduction to Artificial Intelligence CPS 270 Duke University 2016
Lex Introduction to Data Science CS 5963 University of Utah 2016
Chen Introduction to Data Science for Public Policy PPOL 670 Georgetown University 2018
Shewchuk Introduction to Machine Learning CS 189 University of California, Berkeley 2017
Srihari Introduction to Machine Learning CSE 574 University at Buffalo 2017
Vishwanathan Introduction to Machine Learning CS 590 Purdue University 2010
Harrington Introduction to Machine Learning and Data Mining COMP 135 Tufts University 2016
Abu-Mostafa Learning From Data MOOC California Institute of Technology 2010
Dietterich Machine Learning CS 534 Oregon State University 2005
Domingos Machine Learning CSE 446 University of Washington 2014
Domingos Machine Learning CSE 546 University of Washington 2014
Fern Machine Learning CS 534 Oregon State University 2015
Guestrin Machine Learning 10-601 Carnegie Mellon University 2007
Jamieson Machine Learning CSE 546 University of Washington 2017
Kakade Machine Learning CSE 546 University of Washington 2016
Mitchell Machine Learning 10-601 Carnegie Mellon University 2015
Ng Machine Learning CS 229 Stanford University Unknown
Shavlik Machine Learning CS 760 University of Wisconsin 2010
Wang Machine Learning CS 6140 Northeastern University 2017
Weinberger Machine Learning CS 4780 Cornell University 2017
Zisserman Machine Learning C19 Oxford University 2015
Ihler Machine Learning and Data Mining CS 178 University of California, Irvine 2011
Paisley Machine Learning for Data Science COMS W4721 Columbia 2017
Salleb-Aouissi Machine Learning for Data Science COMS 4721 Columbia 2014
Ullman Mining of Massive Data Sets CS 246 Stanford University 2017
Huyen Tensorflow for Deep Learning Research CS 20SI Stanford University 2017
Donoho Theories of Deep Learning STATS 385 Stanford University 2017

Tools

Name Topic Description
Anaconda Python Distribution Python Distribution for Python with package manager
Authorea Collaborative Writing Online scientific document collaboration
Bokeh Python Interactive plotting tools
cmder Command Line Console emulator for Windows
Colorgorical Color Palette Generator Online color palette generator
CommonMark Markdown Language Markdown Language
D3.js Interactive visualization D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since this is based in JavaScript, visualizations are entirely customizable, but do require significant skill to use effectively
Draw.io Graphics Online graphics platform
Explain Shell Command Line Search command lines to see the help text that matches each argument
Fabric Python Command line automation tool
Git Version Control Open source distributed version control system - the de facto standard
GitHub Version Control Web hosting for Git repositories
GitHub Pages Web Publishing Host web pages from GitHub repositories
Google Style Guide Programming Style guide for Python, R, Shell, HTML, CSS, JavaScript, Java, and C++
Jupyter Notebook Programming This application allows you to create and share documents that contain live code, equations, visualizations and explanatory text
Open source license guide License A guide to choosing an open source license
OpenAI Gym Reinforcement Learning A toolkit for developing and comparing reinforcement learning algorithms
OpenAI Universe Reinforcement Learning A toolkit for developing and comparing reinforcement learning algorithms, particularly video games
Overleaf Collaborative Writing Online LaTeX collaboration
Plot.ly for Python Python Interactive plotting tools
PyFormat Python Explanation of formatting in Python
RegExr Regular Expressions Interactive regular expression playground
Rodeo Python A Python integrated development environment
Scrapy Web Scraping Scrape data from the web
Scrollama Interactive visualization Scrollers for interactive web visualizations
ShareLaTeX Collaborative Writing Online LaTeX collaboration
So You Want to Build A Scroller Interactive visualization Scrollers for interactive web visualizations
Style Guide for Python Code Python Programming style guide
Tableau Data visualization Graphical user interface-based data visualization tool
Tabula Data Scraping Extract data from tables
Tensorflow Playground Neural Networks Interactive neural network playground
The Neural Network Zoo Neural Networks A graphical cheat sheet for neural network architectures and acronyms
Tmux Programming Terminal multiplexer
Zotero Reference Management Reference and citation management system for research

Videos

Author Organization Name Description
Jurafsky, Dan Stanford Natural Language Processing Video series on natural language processing (text analysis)
Sanderson, Grant 3Blue1Brown Essence of Linear Algebra Video series on a geometric interpretation of linear algebra concepts
Sanderson, Grant 3Blue1Brown Neural Networks Introductory video series on neural networks
Ng, Andrew Deep Learning School Nuts and Bolts of Applying Deep Learning Andrew Ng speaks on advice for those looking to enter the field of machine learning
Klein, Dan and Pieter Abbeel Berkeley Machine Learning Artificial Intelligence and Reinforcement Learning lectures
Welch, Stephen Welch Labs Neural Networks Demystified Visual introduction to neural networks
Welch, Stephen Welch Labs Learning to See Intuitive, visual explanation of machine learning
Winston, Patrick MIT Open Courseware Support Vector Machines An exceedingly lucid explanation of support vector machines, both intuitively and mathematically
Abu-Mostafa, Yaser Caltech Kernel Functions Description of kernel functions and how they are used
Sanderson, Grant 3Blue1Brown Taylor Series Clear description of Taylor Series
Hastie, Trevor H2O.ai Gradient Boosting and Machine Learning Discussion of ensemble learning including random forests and gradient boosting