Data science is a synthesis of several well-established fields, so the skills that make for a successful data scientist come from a variety of disciplines, including statistics, mathematics, and computer science. Navigating a path through all of these fields can be challenging. To help, I've compiled a list of computational tools, (mostly free) resources and educational references, online machine learning courses, and a checklist of data science concepts. These are meant for anyone who wants structure in a personal data science curriculum, or who wants to brush up on topics and expand their knowledge.
Computational Tools. Descriptions and links to powerful computational tools and useful packages.
Resources and References. A curated collection of educational resources on a wide variety of core data science concepts and some special topics.
Online Courses. Course content on machine learning made available on the web from dozens of universities.
Data Science Concept Checklist. Checklist of core and advanced concepts in data science across the three primary disciplines (mathematics, statistics & machine learning, and computer science), organized by topical area. This can act as a roadmap of concepts to explore or as a tool for identifying opportunities to expand your existing skillset.
There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are useful packages for solving specific problems or for communicating and visualizing your results.
Almost any programming language can be used to solve computational problems, although a few stand out for their built-in packages and user support communities. Most notably, Python and R have excelled in these respects and are also freely available. MATLAB may have the most detailed documentation of any of the options available, but it is commercial software.
Python. Python is a powerful, general-purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, using an IDE such as Rodeo or Spyder may speed up the development of analyses.
R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and OSX. With the RStudio integrated development environment (IDE), the language can be powerfully wielded for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.
MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.
Julia. A newer programming language designed to meet the needs of mathematical computing.
Almost any data science project worth doing requires many revisions and significant collaboration. These tools allow for comprehensive Git-based version control with a web-based repository. GitHub is the most popular, but all offer similar web-based repository services.
Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
GitHub Pages / github.io. GitHub Pages allows you to create a web page from a GitHub repository and convert plain text into a formatted web document.
Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.
MySQL. An open source relational database management system using SQL.
Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for data processing.
MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
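To make the MapReduce model mentioned under Apache Hadoop more concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a real cluster the map and reduce steps would run in parallel across many machines, with the framework handling the grouping step between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data science uses data"]
    print(word_count(docs))
```

Because each call to `map_phase` depends on only one document and each call to `reduce_phase` depends on only one key, both steps can be distributed without coordination, which is the property that makes the model attractive for large datasets.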
Here are some (primarily free) resources for data science. Some of these are personal favorites or recommendations, and many come from the GitHub awesome-machine-learning repository's list of data science books.
|Author||Title||Topic||Year|
|Donoho, David||50 Years of Data Science||Data Science as a Field||2015|
|Nilsson, Nils||The Quest for Artificial Intelligence: A History of Ideas and Achievements||Data Science as a Field||2010|
|Stitch||The State of Data Science||Data Science as a Field||2016|
|Swanstrom, Ryan||Data Science University Programs||Data Science as a Field||2015|
|Corthell, Clare||Open Data Science Masters Curriculum||Data Science as a Field||2015|
|Guichard, David||Community Calculus||Mathematics - Calculus||2016|
|Hartman, Gregory||Calculus 1, 2, and 3. 3rd Edition||Mathematics - Calculus||2015|
|Marsden, Jerrold and Alan Weinstein||Calculus 1, 2, and 3. 2nd Edition||Mathematics - Calculus||1985|
|Strang, Gilbert||Calculus: MIT Open Courseware||Mathematics - Calculus||1991|
|Stewart, James||Calculus: Early Transcendentals. 8th Edition||Mathematics - Calculus||2015|
|Beezer, Robert Arnold||A First Course in Linear Algebra||Mathematics - Linear Algebra||2008|
|Hefferon, Jim||Linear Algebra||Mathematics - Linear Algebra||2006|
|Treil, Sergei||Linear Algebra Done Wrong||Mathematics - Linear Algebra||2004|
|Vandenberghe, L.||Applied Numerical Computing. Lecture Notes||Mathematics - Linear Algebra||2007|
|Lay, David||Linear Algebra and Its Applications||Mathematics - Linear Algebra||2006|
|Lebl, Jiří||Notes on Diffy Qs: Differential Equations for Engineering||Mathematics - Differential Equations||2014|
|Trench, William||Elementary Differential Equations||Mathematics - Differential Equations||2013|
|Ash, Robert B.||Basic Probability Theory||Probability and Statistics||1970|
|Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel||OpenIntro Statistics||Probability and Statistics||2015|
|Downey, Allen||Think Stats: Probability and Statistics for Programmers||Probability and Statistics||2014|
|Grinstead, Charles and James Snell||Grinstead and Snell’s Introduction to Probability||Probability and Statistics||2006|
|Ross, Sheldon||A First Course in Probability||Probability and Statistics||2014|
|Barber, David||Bayesian Reasoning and Machine Learning||Machine Learning - Bayesian Methods||2012|
|Daumé III, Hal||A Course in Machine Learning||Machine Learning||2015|
|James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani||An Introduction to Statistical Learning||Machine Learning||2013|
|Friedman, Jerome, Trevor Hastie, and Robert Tibshirani||The Elements of Statistical Learning (2nd Edition)||Machine Learning||2009|
|Duda, Richard O., Peter E. Hart, and David G. Stork||Pattern Classification||Machine Learning||2012|
|Yee, Stephanie, and Tony Chu||A Visual Introduction to Machine Learning||Machine Learning||Unknown|
|Shalizi, Cosma||Advanced Data Analysis from an Elementary Point of View||Machine Learning||Unknown|
|Bishop, Christopher||Pattern Recognition and Machine Learning||Machine Learning||2006|
|Davidson-Pilon, Cameron||Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference||Machine Learning - Bayesian Methods||2015|
|Downey, Allen||Think Bayes: Bayesian Statistics Made Simple||Machine Learning - Bayesian Methods||2013|
|Kriesel, David||A Brief Introduction to Neural Networks||Machine Learning - Deep Learning and Neural Networks||2007|
|Smilkov, Daniel and Shan Carter||An Interactive Neural Network Playground||Machine Learning - Deep Learning and Neural Networks||Unknown|
|Nielsen, Michael||Neural Networks and Deep Learning||Machine Learning - Deep Learning and Neural Networks||2016|
|Ng, Andrew||Deep Learning Tutorial||Machine Learning - Deep Learning and Neural Networks||Unknown|
|Goodfellow, Ian, Yoshua Bengio, and Aaron Courville||Deep Learning||Machine Learning - Deep Learning and Neural Networks||2016|
|MacKay, David||Information Theory, Inference and Learning Algorithms||Probability and Statistics - Information Theory||2003|
|Rasmussen, Carl Edward, and Christopher Williams||Gaussian Processes for Machine Learning||Probability and Statistics - Gaussian Processes||2006|
|Sutton, Richard, and Andrew Barto||Reinforcement Learning: An Introduction||Machine Learning - Reinforcement Learning||2010|
|Shaw, Zed||Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code||Programming - Python||2013|
|Sweigart, Al||Automate the Boring Stuff with Python: Practical Programming for Total Beginners||Programming - Python||2016|
|Downey, Allen||Think Python: How to Think Like a Computer Scientist||Programming - Python||2015|
|Downey, Allen||Think Complexity: Complexity Science and Computational Modeling||Programming - Python||2012|
|Severance, Charles||Python for Informatics||Programming - Python||Unknown|
|Wickham, Hadley||Advanced R||Programming - R||Unknown|
|Navarro, Daniel||Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners (version 0.5)||Programming - R||2015|
|Tufte, Edward||The Visual Display of Quantitative Information||Visualization - Design||2001|
|Few, Stephen||Now You See It: Simple Visualization Techniques for Quantitative Analysis||Visualization - Design||2009|
|Cairo, Alberto||The Functional Art: An Introduction to Information Graphics and Visualization||Visualization - Design||2012|
|Murray, Scott||Interactive Data Visualization for the Web||Visualization - D3||2013|
|Maclean, Malcolm||D3 Tips and Tricks||Visualization - D3||2013|
|Atlassian||Git Tutorial||Programming - Version Control||Unknown|
|Skinner, Grant||RegExr||Programming - Regular Expressions||Unknown|
|Gertz, M.||Oracle/SQL Tutorial||Programming - SQL||2000|
|MySQL||MySQL Tutorial||Programming - SQL||1997|
|Leskovec, Jure, Anand Rajaraman, and Jeffrey Ullman||Mining of Massive Datasets||Programming - MapReduce||2014|
|Lin, Jimmy, and Chris Dyer||Data-Intensive Text Processing with MapReduce||Programming - MapReduce||2010|
Here are a number of machine learning courses whose materials can be found online. Personal recommendations are indicated with a star.
|Instructor||Course||Course Number / Format||Institution||Year|
|Abu-Mostafa||Learning From Data||MOOC||California Institute of Technology||2010|
|Winston||Artificial Intelligence||OpenCourseware||Massachusetts Institute of Technology||2010|
|Ng||Machine Learning||CS 229||Stanford University||Unknown|
|Walther||Data Mining and Analysis||Stats 202||Stanford University||2017|
|Adams||Advanced Machine Learning||CS 281||Harvard University||2013|
|Mitchell||Machine Learning||10-601||Carnegie Mellon University||2015|
|Domingos||Machine Learning||CSE 446||University of Washington||2014|
|Gogate||Advanced Machine Learning||CS 7301||University of Texas at Dallas||2017|
|Krause||Advanced Topics in Machine Learning||CS 253||California Institute of Technology||2010|
|Zisserman||Machine Learning||C19||Oxford University||2015|
|Berenson||Artificial Intelligence||CS 534||Worcester Polytechnic Institute||2015|
|Konidaris||Introduction to Artificial Intelligence||CPS 270||Duke University||2016|
|Irizarry||Data Science||CS 109||Harvard University||2014|
|Lex||Introduction to Data Science||CS 5963||University of Utah||2016|
|Paisley||Machine Learning for Data Science||COMS W4721||Columbia||2017|
|Mueller||Applied Machine Learning||COMS W4995||Columbia||2017|
|Srihari||Introduction to Machine Learning||CSE 574||University at Buffalo||2017|
|Wang||Machine Learning||CS 6140||Northeastern University||2017|
|Guestrin||Machine Learning||10-601||Carnegie Mellon University||2007|
|Kakade||Machine Learning||CSE 546||University of Washington||2016|
|Jamieson||Machine Learning||CSE 546||University of Washington||2017|
|Domingos||Machine Learning||CSE 546||University of Washington||2014|
|Ihler||Machine Learning and Data Mining||CS 178||University of California, Irvine||2011|
|Shavlik||Machine Learning||CS 760||University of Wisconsin||2010|
|Dietterich||Machine Learning||CS 534||Oregon State University||2005|
|Fern||Machine Learning||CS 534||Oregon State University||2015|
|Vishwanathan||Introduction to Machine Learning||CS 590||Purdue University||2010|
|Shewchuk||Introduction to Machine Learning||CS 189||University of California, Berkeley||2017|
|Weinberger||Machine Learning||CS 4780||Cornell University||2017|
|Harrington||Introduction to Machine Learning and Data Mining||COMP 135||Tufts University||2016|
|Arnold||Data Mining and Machine Learning||STAT 365/665||Yale University||2016|