Data Science Pathways

Data science is a synthesis of several well-established fields, so the skills that make for a successful data scientist come from a variety of disciplines, including statistics, mathematics, and computer science. Navigating a pathway through developing skills in all of these fields can be challenging. To help, I've compiled a list of computational tools, (mostly free) resources and educational references, online machine learning courses, and a checklist of data science concepts. They are meant for anyone who wants structure in a personal data science curriculum, or who wants to brush up on topics and expand their knowledge.


Tools

There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are packages for solving specific problems or for communicating and visualizing your results.

Programming Languages

Almost any programming language can be used to solve computational problems, although a few stand out in terms of built-in packages and user support communities. Most notably, Python and R excel in these respects and are also freely available. MATLAB may have the most detailed documentation of any of the options available, but it is commercial software.

Python. Python is a powerful, general-purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, Pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, using an IDE such as Rodeo or Spyder may speed up the development of analyses.

R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and OSX. With the RStudio integrated development environment (IDE), the language can be powerfully wielded for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.

MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.

Julia. A newer programming language designed to meet the needs of mathematical computing.

Version Control

Almost any data science project worth doing requires a significant number of revisions and collaboration. Git-based version control, paired with a web-based repository hosting service, supports both. GitHub is the most popular hosting service, but the alternatives offer similar web-based repository features.

Git. Open source distributed version control system. Git is often used with a web-based Git repository hosting service such as GitHub.

Communicating and Visualizing Analyses on the Web

Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

GitHub Pages / Github.io. GitHub Pages allows you to create a web page from a GitHub repository and convert plain text into a formatted web document.

D3.js. D3 (or Data-Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Because it is based in JavaScript, visualizations are entirely customizable, but they do require significant skill to use effectively.

Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.

Database Management (for big and little data)

MySQL. An open source relational database management system using SQL.

Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for data processing.
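To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are my own illustrative choices. It only shows the three conceptual steps (map, group by key, reduce) that a framework like Hadoop would distribute across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    # In a real cluster each document (or file block) would be mapped
    # on a different node; here everything runs in one process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: two tiny "documents".
    docs = ["big data is big", "little data is still data"]
    print(word_count(docs))
```

Because each map call looks at only one document and each reduce call looks at only one key, both steps can in principle be run in parallel, which is what makes the model attractive for large data sets.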

MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
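The difference between the relational and document-oriented approaches can be sketched with Python's standard library alone. This is only an illustration under stated assumptions: the built-in sqlite3 module stands in for MySQL, a nested dictionary stands in for a MongoDB document, and the table names, field names, and the "topics" field are hypothetical. No MySQL or MongoDB API is used.

```python
import json
import sqlite3


def relational_titles(author_name):
    """Relational style: fixed-schema tables linked by a key, queried with SQL."""
    conn = sqlite3.connect(":memory:")  # in-memory SQLite as a stand-in for MySQL
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE books (title TEXT, year INTEGER, author_id INTEGER)")
    conn.execute("INSERT INTO authors VALUES (1, 'Allen Downey')")
    conn.executemany(
        "INSERT INTO books VALUES (?, ?, ?)",
        [("Think Stats", 2014, 1), ("Think Bayes", 2013, 1)],
    )
    # The author's name lives in one table and the books in another,
    # so answering the question requires a JOIN on the shared key.
    rows = conn.execute(
        "SELECT b.title FROM books AS b "
        "JOIN authors AS a ON a.id = b.author_id "
        "WHERE a.name = ? ORDER BY b.year",
        (author_name,),
    ).fetchall()
    conn.close()
    return [title for (title,) in rows]


# Document style: one self-contained, nested record per author.
# Records need not share the same fields ("topics" appears on one book only).
AUTHOR_DOCUMENT = {
    "name": "Allen Downey",
    "books": [
        {"title": "Think Stats", "year": 2014},
        {"title": "Think Bayes", "year": 2013, "topics": ["Bayesian statistics"]},
    ],
}


def document_titles(document):
    """Read titles straight out of the nested document, ordered by year."""
    books = sorted(document["books"], key=lambda book: book["year"])
    return [book["title"] for book in books]


if __name__ == "__main__":
    print(relational_titles("Allen Downey"))
    print(document_titles(AUTHOR_DOCUMENT))
    print(json.dumps(AUTHOR_DOCUMENT, indent=2))  # how the document might be stored
```

Both functions return the same titles. The relational version enforces a schema and assembles the answer with a join, while the document version keeps related data together and tolerates records with different fields.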


Resources

Here are some (primarily free) resources for data science. Some of these are personal favorites or recommendations, and many come from the list of data science books in the GitHub awesome-machine-learning repository.

Author | Title | Topic | Year
Donoho, David | 50 Years of Data Science | Data Science as a Field | 2015
Nilsson, Nils | The Quest for Artificial Intelligence: A History of Ideas and Achievements | Data Science as a Field | 2010
Stitch | The State of Data Science | Data Science as a Field | 2016
Swanstrom, Ryan | Data Science University Programs | Data Science as a Field | 2015
Corthell, Clare | Open Data Science Masters Curriculum | Data Science as a Field | 2015
Guichard, David | Community Calculus | Mathematics - Calculus | 2016
Hartman, Gregory | Calculus 1, 2, and 3. 3rd Edition | Mathematics - Calculus | 2015
Marsden, Jerrold and Alan Weinstein | Calculus 1, 2, and 3. 2nd Edition | Mathematics - Calculus | 1985
Strang, Gilbert | Calculus: MIT OpenCourseWare | Mathematics - Calculus | 1991
Stewart, James | Calculus: Early Transcendentals. 8th Edition | Mathematics - Calculus | 2015
Beezer, Robert Arnold | A First Course in Linear Algebra | Mathematics - Linear Algebra | 2008
Hefferon, Jim | Linear Algebra | Mathematics - Linear Algebra | 2006
Treil, Sergei | Linear Algebra Done Wrong | Mathematics - Linear Algebra | 2004
Vandenberghe, L. | Applied Numerical Computing. Lecture Notes | Mathematics - Linear Algebra | 2007
Lay, David | Linear Algebra and Its Applications | Mathematics - Linear Algebra | 2006
Lebl, Jiří | Notes on Diffy Qs: Differential Equations for Engineering | Mathematics - Differential Equations | 2014
Trench, William | Elementary Differential Equations | Mathematics - Differential Equations | 2013
Ash, Robert B. | Basic Probability Theory | Probability and Statistics | 1970
Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel | OpenIntro Statistics | Probability and Statistics | 2015
Downey, Allen | Think Stats: Probability and Statistics for Programmers | Probability and Statistics | 2014
Grinstead, Charles and James Snell | Grinstead and Snell’s Introduction to Probability | Probability and Statistics | 2006
Ross, Sheldon | A First Course in Probability | Probability and Statistics | 2014
Barber, David | Bayesian Reasoning and Machine Learning | Machine Learning - Bayesian Methods | 2012
Daumé III, Hal | A Course in Machine Learning | Machine Learning | 2015
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani | An Introduction to Statistical Learning | Machine Learning | 2013
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman | The Elements of Statistical Learning (2nd Edition) | Machine Learning | 2009
Duda, Richard O., Peter E. Hart, and David G. Stork | Pattern Classification | Machine Learning | 2012
Yee, Stephanie, and Tony Chu | A Visual Introduction to Machine Learning | Machine Learning | Unknown
Shalizi, Cosma | Advanced Data Analysis from an Elementary Point of View | Machine Learning | Unknown
Bishop, Christopher | Pattern Recognition and Machine Learning | Machine Learning | 2006
Davidson-Pilon, Cameron | Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference | Machine Learning - Bayesian Methods | 2015
Downey, Allen | Think Bayes: Bayesian Statistics Made Simple | Machine Learning - Bayesian Methods | 2013
Kriesel, David | A Brief Introduction to Neural Networks | Machine Learning - Deep Learning and Neural Networks | 2007
Smilkov, Daniel and Shan Carter | An Interactive Neural Network Playground | Machine Learning - Deep Learning and Neural Networks | Unknown
Nielsen, Michael | Neural Networks and Deep Learning | Machine Learning - Deep Learning and Neural Networks | 2016
Ng, Andrew | Deep Learning Tutorial | Machine Learning - Deep Learning and Neural Networks | Unknown
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville | Deep Learning | Machine Learning - Deep Learning and Neural Networks | 2016
MacKay, David | Information Theory, Inference and Learning Algorithms | Probability and Statistics - Information Theory | 2003
Rasmussen, Carl Edward, and Christopher Williams | Gaussian Processes for Machine Learning | Probability and Statistics - Gaussian Processes | 2006
Sutton, Richard, and Andrew Barto | Reinforcement Learning: An Introduction | Machine Learning - Reinforcement Learning | 2010
Shaw, Zed | Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code | Programming - Python | 2013
Sweigart, Al | Automate the Boring Stuff with Python: Practical Programming for Total Beginners | Programming - Python | 2016
Downey, Allen | Think Python: How to Think Like a Computer Scientist | Programming - Python | 2015
Downey, Allen | Think Complexity: Complexity Science and Computational Modeling | Programming - Python | 2012
Severance, Charles | Python for Informatics | Programming - Python | Unknown
Wickham, Hadley | Advanced R | Programming - R | Unknown
Navarro, Daniel | Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners (version 0.5) | Programming - R | 2015
Tufte, Edward | The Visual Display of Quantitative Information | Visualization - Design | 2001
Few, Stephen | Now You See It: Simple Visualization Techniques for Quantitative Analysis | Visualization - Design | 2009
Cairo, Alberto | The Functional Art: An Introduction to Information Graphics and Visualization | Visualization - Design | 2012
Murray, Scott | Interactive Data Visualization for the Web | Visualization - D3 | 2013
Maclean, Malcolm | D3 Tips and Tricks | Visualization - D3 | 2013
Atlassian | Git Tutorial | Programming - Version Control | Unknown
Skinner, Grant | RegExr | Programming - Regular Expressions | Unknown
Gertz, M. | Oracle/SQL Tutorial | Programming - SQL | 2000
MySQL | MySQL Tutorial | Programming - SQL | 1997
Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman | Mining of Massive Datasets | Programming - MapReduce | 2014
Lin, Jimmy, and Chris Dyer | Data-Intensive Text Processing with MapReduce | Programming - MapReduce | 2010

Online Course Materials

Here are a number of machine learning courses whose materials can be found online. Personal recommendations are indicated with a star.

Instructor | Title | Designation | University | Year
Abu-Mostafa | Learning From Data | MOOC | California Institute of Technology | 2010
Winston | Artificial Intelligence | OpenCourseWare | Massachusetts Institute of Technology | 2010
Ng | Machine Learning | CS 229 | Stanford University | Unknown
Walther | Data Mining and Analysis | Stats 202 | Stanford University | 2017
Adams | Advanced Machine Learning | CS 281 | Harvard University | 2013
Mitchell | Machine Learning | 10-601 | Carnegie Mellon University | 2015
Domingos | Machine Learning | CSE 446 | University of Washington | 2014
Gogate | Advanced Machine Learning | CS 7301 | University of Texas at Dallas | 2017
Krause | Advanced Topics in Machine Learning | CS 253 | California Institute of Technology | 2010
Zisserman | Machine Learning | C19 | Oxford University | 2015
Berenson | Artificial Intelligence | CS 534 | Worcester Polytechnic Institute | 2015
Konidaris | Introduction to Artificial Intelligence | CPS 270 | Duke University | 2016
Irizarry | Data Science | CS 109 | Harvard University | 2014
Lex | Introduction to Data Science | CS 5963 | University of Utah | 2016
Paisley | Machine Learning for Data Science | COMS W4721 | Columbia | 2017
Mueller | Applied Machine Learning | COMS W4995 | Columbia | 2017
Srihari | Introduction to Machine Learning | CSE 574 | University at Buffalo | 2017
Wang | Machine Learning | CS 6140 | Northeastern University | 2017
Guestrin | Machine Learning | 10-601 | Carnegie Mellon University | 2007
Kakade | Machine Learning | CSE 546 | University of Washington | 2016
Jamieson | Machine Learning | CSE 546 | University of Washington | 2017
Domingos | Machine Learning | CSE 546 | University of Washington | 2014
Ihler | Machine Learning and Data Mining | CS 178 | University of California, Irvine | 2011
Shavlik | Machine Learning | CS 760 | University of Wisconsin | 2010
Dietterich | Machine Learning | CS 534 | Oregon State University | 2005
Fern | Machine Learning | CS 534 | Oregon State University | 2015
Vishwanathan | Introduction to Machine Learning | CS 590 | Purdue University | 2010
Shewchuk | Introduction to Machine Learning | CS 189 | University of California, Berkeley | 2017
Weinberger | Machine Learning | CS 4780 | Cornell University | 2017
Harrington | Introduction to Machine Learning and Data Mining | COMP 135 | Tufts University | 2016
Arnold | Data Mining and Machine Learning | STAT 365/665 | Yale University | 2016