Data science is a synthesis of several well-established fields, so the skills that make for a successful data scientist come from a variety of disciplines, including statistics, mathematics, and computer science. Navigating a pathway through developing skills in all of these fields can be challenging. To help, I've compiled a list of resources for building and expanding data science knowledge:

**Computational Tools**. Programming tools and useful packages.

**Resources and References**. A curated collection of educational resources on a wide variety of core data science concepts.

**Online Courses**. Course content on machine learning made available on the web from dozens of universities.

**Tools**. Data science tools on a variety of topics including visualization, markdown, and technical writing.

**Videos**. Data science videos.

**Data Science Concept Checklist**. Checklist of core and advanced concepts in data science across the three primary disciplines (mathematics, statistics & machine learning, and computer science) organized by topical areas. This can act as a roadmap through which concepts to explore or as a tool for evaluating opportunities for expanding your existing skillset.

There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are useful packages for solving specific problems or for communicating and visualizing your results.

Almost any programming language can be used to solve computational problems, although a few stand out for their built-in packages and user support communities. Most notably, Python and R excel in these respects and are also freely available. MATLAB may have the most detailed documentation of any of the options available, but it is commercial software.

**Python**. Python is a powerful, general purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, using an IDE such as Rodeo or Spyder may speed up the development of analyses.

**R**. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS. With the RStudio integrated development environment (IDE), the language can be used for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.

**MATLAB**. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.

**Julia**. A newer programming language designed to meet the needs of mathematical computing.

Almost any data science project worth doing requires many revisions and a good deal of collaboration. These tools allow for comprehensive Git-based version control with a web-based repository. GitHub is the most popular, but all offer similar web-based repository services.

**Git**. Open source distributed version control system. Git is often used with a web-based Git repository hosting service such as GitHub.

**Jupyter Notebook**. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

**GitHub Pages / github.io**. GitHub Pages allows you to create a web page from a GitHub repository and convert plain text into a formatted web document.

**D3.js**. D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Because it is based on JavaScript, visualizations are entirely customizable but require significant skill to use effectively.

**Tableau**. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.

**MySQL**. An open source relational database management system using SQL.
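To give a feel for what querying a relational database with SQL looks like, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for a MySQL server, which is an assumption made only so the example runs without any setup. The `courses` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")

# A relational table has a fixed set of typed columns (hypothetical schema).
conn.execute("CREATE TABLE courses (title TEXT, university TEXT, year INTEGER)")

# Hypothetical rows, for illustration only.
rows = [
    ("Course A", "University X", 2015),
    ("Course B", "University X", 2017),
    ("Course C", "University Y", 2016),
]
conn.executemany("INSERT INTO courses VALUES (?, ?, ?)", rows)


def courses_per_university(connection):
    """Count the courses offered by each university with a SQL aggregate query."""
    cursor = connection.execute(
        "SELECT university, COUNT(*) FROM courses "
        "GROUP BY university ORDER BY university"
    )
    return cursor.fetchall()


result = courses_per_university(conn)
print(result)
```

The same `SELECT ... GROUP BY` statement would typically work unchanged against MySQL. Only the connection code would differ.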

**Apache Hadoop**. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for data processing.
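The MapReduce idea can be illustrated with the classic word-count example. The sketch below runs in a single Python process and does not use Hadoop or its APIs. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are hypothetical and chosen only to mirror the stages of the model.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data science"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines, with the framework handling the shuffle step and reading input from distributed storage.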

**MongoDB**. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
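To make the "document-oriented" idea concrete, the sketch below models a collection as a list of JSON-like Python dictionaries. It does not use MongoDB or its driver. The records and the `find_by_skill` helper are hypothetical and only illustrate that documents in one collection can have different fields and nested values.

```python
import json

# A "collection" of documents. Unlike rows in a relational table, the
# documents need not share the same fields (hypothetical data).
people = [
    {"name": "Ada", "skills": ["python", "statistics"]},
    {"name": "Grace", "skills": ["sql"], "address": {"city": "Arlington"}},
    {"name": "Alan"},  # no "skills" field at all
]


def find_by_skill(collection, skill):
    """Return the documents whose optional 'skills' list contains `skill`."""
    return [doc for doc in collection if skill in doc.get("skills", [])]


matches = find_by_skill(people, "python")
names = [doc["name"] for doc in matches]

# Documents serialize naturally to JSON, nested fields included.
as_json = json.dumps(people[1], sort_keys=True)
print(names, as_json)
```

A document database adds indexing, persistence, and a query language on top of this basic model.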



Author | Name | Topic | Year |
---|---|---|---|
Donoho, David | 50 Years of Data Science | Data Science as a Field | 2015 |
Kriesel, David | A Brief Introduction to Neural Networks | Machine Learning - Deep Learning and Neural Networks | 2007 |
Daumé III, Hal | A Course in Machine Learning | Machine Learning | 2015 |
Beezer, Robert Arnold | A First Course in Linear Algebra | Mathematics - Linear Algebra | 2008 |
Ross, Sheldon | A First Course in Probability | Probability and Statistics | 2014 |
Yee, Stephanie, and Tony Chu | A Visual Introduction to Machine Learning | Machine Learning | Unknown |
Shalizi, Cosma | Advanced Data Analysis from an Elementary Point of View | Machine Learning | Unknown |
Wickham, Hadley | Advanced R | Programming - R | Unknown |
Smilkov, Daniel and Shan Carter | An Interactive Neural Network Playground | Machine Learning - Deep Learning and Neural Networks | Unknown |
Venables, W., and D. Smith | An Introduction to R | Programming - R | 2017 |
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani | An Introduction to Statistical Learning | Machine Learning | 2013 |
Sweigart, Al | Automate the Boring Stuff with Python: Practical Programming for Total Beginners | Programming - Python | 2016 |
Ash, Robert B. | Basic Probability Theory | Probability and Statistics | 1970 |
Davidson-Pilon, Cameron | Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference | Machine Learning - Bayesian Methods | 2015 |
Barber, David | Bayesian Reasoning and Machine Learning | Machine Learning - Bayesian Methods | 2012 |
Marsden, Jerrold and Alan Weinstein | Calculus 1, 2, and 3. 2nd Edition | Mathematics - Calculus | 1985 |
Hartman, Gregory | Calculus 1, 2, and 3. 3rd Edition | Mathematics - Calculus | 2015 |
Stewart, James | Calculus: Early Transcendentals. 8th Edition | Mathematics - Calculus | 2015 |
Strang, Gilbert | Calculus: MIT OpenCourseWare | Mathematics - Calculus | 1991 |
Guichard, David | Community Calculus | Mathematics - Calculus | 2016 |
Maclean, Malcolm | D3 Tips and Tricks | Visualization - D3 | 2013 |
Swanstrom, Ryan | Data Science University Programs | Data Science as a Field | 2015 |
Lin, Jimmy, and Chris Dyer | Data-Intensive Text Processing with MapReduce | Programming - MapReduce | 2010 |
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville | Deep Learning | Machine Learning - Deep Learning and Neural Networks | 2016 |
Ng, Andrew | Deep Learning Tutorial | Machine Learning - Deep Learning and Neural Networks | Unknown |
Trench, William | Elementary Differential Equations | Mathematics - Differential Equations | 2013 |
Rougier, Nicholas | From Python to Numpy | Programming - Python | 2017 |
Rasmussen, Carl Edward, and Christopher Williams | Gaussian Processes for Machine Learning | Probability and Statistics - Gaussian Processes | 2006 |
Chacon, Scott | Git Internals | Programming - Version Control | 2008 |
Atlassian | Git Tutorial | Programming - Version Control | Unknown |
Grinstead, Charles and James Snell | Grinstead and Snell’s Introduction to Probability | Probability and Statistics | 2006 |
Géron, Aurélien | Hands-On Machine Learning with Scikit-Learn and TensorFlow | Machine Learning | 2017 |
Ström, J., K. Åström, and T. Akenine-Möller | Immersive Linear Algebra | Mathematics - Linear Algebra | 2016 |
MacKay, David | Information Theory, Inference and Learning Algorithms | Probability and Statistics - Information Theory | 2003 |
Murray, Scott | Interactive Data Visualization for the Web | Visualization - D3 | 2013 |
Vandenberghe, L. | Introduction to Applied Linear Algebra | Mathematics - Linear Algebra | 2017 |
Shaw, Zed | Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code | Programming - Python | 2013 |
Navarro, Daniel | Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners (version 0.5) | Programming - R | 2015 |
Hefferon, Jim | Linear Algebra | Mathematics - Linear Algebra | 2006 |
Lay, David | Linear Algebra and Its Applications | Mathematics - Linear Algebra | 2006 |
Treil, Sergei | Linear Algebra Done Wrong | Mathematics - Linear Algebra | 2004 |
Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman | Mining of Massive Datasets | Programming - MapReduce | 2014 |
MySQL | MySQL Tutorial | Programming - SQL | 1997 |
Nielsen, Michael | Neural Networks and Deep Learning | Machine Learning - Deep Learning and Neural Networks | 2016 |
Lebl, Jiří | Notes on Diffy Qs: Differential Equations for Engineering | Mathematics - Differential Equations | 2014 |
Few, Stephen | Now You See It: Simple Visualization Techniques for Quantitative Analysis | Visualization - Design | 2009 |
Corthell, Clare | Open Data Science Masters Curriculum | Data Science as a Field | 2015 |
Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel | OpenIntro Statistics | Probability and Statistics | 2015 |
Gertz, M. | Oracle/SQL Tutorial | Programming - SQL | 2000 |
Duda, Richard O., Peter E. Hart, and David G. Stork | Pattern Classification | Machine Learning | 2012 |
Bishop, Christopher | Pattern Recognition | Machine Learning | 2006 |
VanderPlas, Jake | Python Data Science Handbook | Programming - Python | 2016 |
Severance, Charles | Python for Informatics | Programming - Python | Unknown |
Raschka, Sebastian | Python Machine Learning, Second Edition | Machine Learning | 2017 |
Grolemund, Garrett, and Hadley Wickham | R for Data Science | Programming - R | 2017 |
Skinner, Grant | RegExr | Programming - Regular Expressions | Unknown |
Sutton, Richard, and Andrew Barto | Reinforcement Learning: An Introduction | Machine Learning - Reinforcement Learning | 2010 |
Varoquaux et al. | Scipy Lecture Notes | Programming - Python | 2017 |
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani | The Elements of Statistical Learning (2nd Edition) | Machine Learning | 2009 |
Cairo, Alberto | The Functional Art: An Introduction to Information Graphics and Visualization | Visualization - Design | 2012 |
Reitz, Kenneth and Tanya Schlusser | The Hitchhiker’s Guide to Python | Programming - Python | 2016 |
Oetiker, Tobias | The Not So Short Introduction to LaTeX 2ε | Programming - Typesetting | 2016 |
Python Software Foundation | The Python Tutorial | Programming - Python | 2017 |
Nilsson, Nils | The Quest for Artificial Intelligence: A History of Ideas and Achievements | Data Science as a Field | 2010 |
Stitch | The State of Data Science | Data Science as a Field | 2016 |
Tufte, Edward | The Visual Display of Quantitative Information | Visualization - Design | 2001 |
Downey, Allen | Think Bayes: Bayesian Statistics Made Simple | Machine Learning - Bayesian Methods | 2013 |
Downey, Allen | Think Complexity: Complexity Science and Computational Modeling | Programming - Python | 2012 |
Downey, Allen | Think Python: How to Think Like a Computer Scientist | Programming - Python | 2015 |
Downey, Allen | Think Stats: Probability and Statistics for Programmers | Probability and Statistics | 2014 |
Shalev-Shwartz, Shai and Shai Ben-David | Understanding Machine Learning: From Theory to Algorithms | Machine Learning | 2014 |



Instructor | Title | Designation | University | Year |
---|---|---|---|---|
Siegel | A Mathematics Course for Political and Social Researchers | None | Duke University | 2014 |
Adams | Advanced Machine Learning | CS 281 | Harvard University | 2013 |
Gogate | Advanced Machine Learning | CS 7301 | University of Texas at Dallas | 2017 |
Krause | Advanced Topics in Machine Learning | CS 253 | California Institute of Technology | 2010 |
Mueller | Applied Machine Learning | COMS W4995 | Columbia | 2017 |
Berenson | Artificial Intelligence | CS 534 | Worcester Polytechnic Institute | 2015 |
Winston | Artificial Intelligence | OpenCourseWare | Massachusetts Institute of Technology | 2010 |
Chan | Computational Statistics in Python | STA 663 | Duke University | 2015 |
Chan | Computational Statistics in Python | STA 663 | Duke University | 2017 |
Li | Convolutional Neural Networks for Visual Recognition | CS 231n | Stanford University | 2017 |
Walther | Data Mining and Analysis | Stats 202 | Stanford University | 2017 |
Arnold | Data Mining and Machine Learning | STAT 365/665 | Yale University | 2016 |
Irizarry | Data Science | CS 109 | Harvard University | 2014 |
Klein | Introduction to Artificial Intelligence | CS 188 | University of California, Berkeley | 2014 |
Konidaris | Introduction to Artificial Intelligence | CPS 270 | Duke University | 2016 |
Lex | Introduction to Data Science | CS 5963 | University of Utah | 2016 |
Chen | Introduction to Data Science for Public Policy | PPOL 670 | Georgetown University | 2018 |
Shewchuk | Introduction to Machine Learning | CS 189 | University of California, Berkeley | 2017 |
Srihari | Introduction to Machine Learning | CSE 574 | University at Buffalo | 2017 |
Vishwanathan | Introduction to Machine Learning | CS 590 | Purdue University | 2010 |
Harrington | Introduction to Machine Learning and Data Mining | COMP 135 | Tufts University | 2016 |
Abu-Mostafa | Learning From Data | MOOC | California Institute of Technology | 2010 |
Dietterich | Machine Learning | CS 534 | Oregon State University | 2005 |
Domingos | Machine Learning | CSE 446 | University of Washington | 2014 |
Domingos | Machine Learning | CSE 546 | University of Washington | 2014 |
Fern | Machine Learning | CS 534 | Oregon State University | 2015 |
Guestrin | Machine Learning | 10-601 | Carnegie Mellon University | 2007 |
Jamieson | Machine Learning | CSE 546 | University of Washington | 2017 |
Kakade | Machine Learning | CSE 546 | University of Washington | 2016 |
Mitchell | Machine Learning | 10-601 | Carnegie Mellon University | 2015 |
Ng | Machine Learning | CS 229 | Stanford University | Unknown |
Shavlik | Machine Learning | CS 760 | University of Wisconsin | 2010 |
Wang | Machine Learning | CS 6140 | Northeastern University | 2017 |
Weinberger | Machine Learning | CS 4780 | Cornell University | 2017 |
Zisserman | Machine Learning | C19 | Oxford University | 2015 |
Ihler | Machine Learning and Data Mining | CS 178 | University of California, Irvine | 2011 |
Paisley | Machine Learning for Data Science | COMS W4721 | Columbia | 2017 |
Salleb-Aouissi | Machine Learning for Data Science | COMS 4721 | Columbia | 2014 |
Ullman | Mining of Massive Data Sets | CS 246 | Stanford University | 2017 |
Huyen | TensorFlow for Deep Learning Research | CS 20SI | Stanford University | 2017 |
Donoho | Theories of Deep Learning | STATS 385 | Stanford University | 2017 |


Name | Topic | Description |
---|---|---|
Anaconda Python Distribution | Python | Distribution for Python with package manager |
Authorea | Collaborative Writing | Online scientific document collaboration |
Bokeh | Python | Interactive plotting tools |
cmder | Command Line | Console emulator for Windows |
Colorgorical | Color Palette Generator | Online color palette generator |
CommonMark | Markdown Language | Markdown Language |
D3.js | Interactive visualization | D3 (or Data Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Because it is based on JavaScript, visualizations are entirely customizable but require significant skill to use effectively |
Draw.io | Graphics | Online graphics platform |
Explain Shell | Command Line | Search command lines to see the help text that matches each argument |
Fabric | Python | Command line automation tool |
Git | Version Control | Open source distributed version control system - the de facto standard |
GitHub | Version Control | Web hosting for Git repositories |
GitHub Pages | Web Publishing | Host web pages from GitHub repositories |
Google Style Guide | Programming | Style guide for Python, R, Shell, HTML, CSS, JavaScript, Java, and C++ |
Jupyter Notebook | Programming | This application allows you to create and share documents that contain live code, equations, visualizations and explanatory text |
Open source license guide | License | A guide to choosing an open source license |
OpenAI Gym | Reinforcement Learning | A toolkit for developing and comparing reinforcement learning algorithms |
OpenAI Universe | Reinforcement Learning | A toolkit for developing and comparing reinforcement learning algorithms, particularly video games |
Overleaf | Collaborative Writing | Online LaTeX collaboration |
Plot.ly for Python | Python | Interactive plotting tools |
PyFormat | Python | Explanation of formatting in Python |
RegExr | Regular Expressions | Interactive regular expression playground |
Rodeo | Python | A Python integrated development environment |
Scrapy | Web Scraping | Scrape data from the web |
Scrollama | Interactive visualization | Scrollers for interactive web visualizations |
ShareLaTeX | Collaborative Writing | Online LaTeX collaboration |
So You Want to Build A Scroller | Interactive visualization | Scrollers for interactive web visualizations |
Style Guide for Python Code | Python | Programming style guide |
Tableau | Data visualization | Graphical user interface-based data visualization tool |
Tabula | Data Scraping | Extract data from tables |
TensorFlow Playground | Neural Networks | Interactive neural network playground |
The Neural Network Zoo | Neural Networks | A graphical cheat sheet for neural network architectures and acronyms |
Tmux | Programming | Terminal multiplexer |
Zotero | Reference Management | Reference and citation management system for research |


Author | Organization | Name | Description |
---|---|---|---|
Jurafsky, Dan | Stanford | Natural Language Processing | Video series on natural language processing (text analysis) |
Sanderson, Grant | 3Blue1Brown | Essence of Linear Algebra | Video series on a geometric interpretation of linear algebra concepts |
Sanderson, Grant | 3Blue1Brown | Neural Networks | Introductory video series on neural networks |
Ng, Andrew | Deep Learning School | Nuts and Bolts of Applying Deep Learning | Andrew Ng offers advice for those looking to enter the field of machine learning |
Klein, Dan and Pieter Abbeel | Berkeley | Machine Learning | Artificial Intelligence and Reinforcement Learning lectures |
Welch, Stephen | Welch Labs | Neural Networks Demystified | Visual introduction to neural networks |
Welch, Stephen | Welch Labs | Learning to See | Intuitive, visual explanation of machine learning |
Winston, Patrick | MIT OpenCourseWare | Support Vector Machines | An exceedingly lucid explanation of support vector machines, both intuitive and mathematical |
Abu-Mostafa, Yaser | Caltech | Kernel Functions | Description of kernel functions and how they are used |
Sanderson, Grant | 3Blue1Brown | Taylor Series | Clear description of Taylor Series |
Hastie, Trevor | H2O.ai | Gradient Boosting and Machine Learning | Discussion of ensemble learning including random forests and gradient boosting |