Data Science Pathways

Data science is an emerging field that draws on several well-established ones, so the skills that make for a successful data scientist come from a variety of disciplines, including mathematics, statistics & machine learning, and computer science. Navigating a pathway through these fields can be challenging, since no single resource can (by necessity) provide guidance on all of the tools of a modern data scientist. I've compiled a list of computational tools, (mostly free) educational resources and references, and a checklist of data science concepts, many of which have been personally helpful in developing my own skills in this area. They are intended for those who want some structure on their pathway through a personal data science curriculum, or who want to brush up on topics and expand their knowledge.

Data Science Concept Checklist. A checklist of core and advanced concepts in data science across the three primary disciplines (mathematics, statistics & machine learning, and computer science), organized by topical area. It can act as a roadmap of which concepts to explore or as a tool for finding opportunities to expand your existing skillset.

Tools. Descriptions and links to powerful computational tools and useful packages.

Resources and References. A curated collection of educational resources on a wide variety of core data science concepts and some special topics.


Tools

There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are useful packages for solving specific problems or for communicating and visualizing your results.

Programming Languages

Almost any programming language can be used to solve computational problems, although a few stand out for their built-in packages and user support communities. Most notably, R and Python excel in both respects and are also freely available. MATLAB may have the most detailed documentation of any of the options available, but it is commercial software.

R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and OSX. With the RStudio integrated development environment (IDE), the language can be powerfully wielded for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.

Python. Python is a powerful, general-purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, Pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, using an IDE such as Rodeo or Spyder may speed up the development of analyses.

MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.

Julia. A newer programming language designed to meet the needs of mathematical computing.

Version Control

Almost any data science project worth doing involves many revisions and a good deal of collaboration. These tools provide version control, usually paired with a web-based repository. GitHub is the most popular hosting service for Git repositories, though others offer similar web-based repository services.

Git. Open source distributed version control system. Git is often used with a web-based Git repository hosting service such as GitHub.

Apache Subversion (SVN). A free software versioning and revision control system, based on a centralized concurrent versioning model.

Communicating and Visualizing Analyses on the Web

Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

GitHub Pages / Github.io. GitHub Pages allows you to create a web page from a GitHub repository and convert plain text into a formatted web document.

D3.js. D3 (or Data-Driven Documents) is an open-source JavaScript library for producing dynamic, interactive data visualizations in web browsers. Since it is based in JavaScript, visualizations are entirely customizable, but they do require significant skill to use effectively.

Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.

Database Management (for big and little data)

MySQL. An open source relational database management system using SQL.

Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for data processing.
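
To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical and chosen only for illustration. It shows the three conceptual steps of a word count: map each input to key/value pairs, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    mapped = []
    for doc in documents:
        mapped.extend(map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input, standing in for files stored on a cluster.
documents = ["big data is big", "little data is still data"]
counts = word_count(documents)
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on many machines, with the framework handling the grouping between them; this sketch only mirrors the logical flow.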

MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
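
The difference between a relational table and a document store can be sketched with Python's standard library alone. This is an illustration under stated assumptions, not MySQL or MongoDB code: the sqlite3 module stands in for a SQL database, plain dictionaries stand in for stored documents, and the find helper and all sample records are hypothetical.

```python
import json
import sqlite3

# Relational side: every row must fit the columns declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)
rows = conn.execute(
    "SELECT name FROM users WHERE city = ?", ("London",)
).fetchall()
conn.close()

# Document side: each record carries its own structure, so fields can
# differ between records and may hold lists or nested objects.
documents = [
    {"_id": 1, "name": "Ada", "city": "London", "languages": ["Python", "R"]},
    {"_id": 2, "name": "Grace", "address": {"city": "New York", "zip": "10001"}},
]


def find(docs, **criteria):
    """Hypothetical helper: return documents whose top-level fields match."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


matches = find(documents, city="London")
print(rows)
print(json.dumps(matches))
```

The second record has no top-level city field at all, which a fixed table schema would not allow without adding or leaving empty a column; that flexibility is the trade-off the description above refers to.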


Resources

Here are some (primarily free) resources for data science. Some of these are personal favorites or recommendations, and many come from the list of data science books in the GitHub awesome-machine-learning repository.

Data Science as a Field

Donoho, David. 2015. 50 Years of Data Science.

Nilsson, Nils. 2010. The Quest for Artificial Intelligence: A History of Ideas and Achievements. A history of machine learning and data science.

Stitch. 2016. The State of Data Science.

Swanstrom, Ryan. 2015. Data Science University Programs.

Corthell, Clare. 2015. Open Data Science Masters Curriculum.

Mathematics

Calculus

Guichard, David. 2016. Community Calculus.

Hartman, Gregory. 2015. Calculus 1, 2, and 3. 3rd edition.

Marsden, Jerrold, and Alan Weinstein. 1985. Calculus I, II, and III. 2nd edition. New York: Springer.

Strang, Gilbert. 1991. Calculus. MIT Open Courseware.

Stewart, James. 2015. Calculus: Early Transcendentals. 8th edition. Boston, MA, USA: Brooks Cole.

Linear Algebra

Beezer, Robert Arnold. 2008. A First Course in Linear Algebra.

Hefferon, Jim. 2006. Linear Algebra.

Treil, Sergei. 2004. Linear Algebra Done Wrong.

Vandenberghe, L. 2007. Applied Numerical Computing. Lecture Notes.

Lay, David C. 2006. Linear Algebra and Its Applications. Pearson/Addison-Wesley.

Special Topics

Differential Equations

Lebl, Jiří. 2014. Notes on Diffy Qs: Differential Equations for Engineering.

Trench, William. 2013. Elementary Differential Equations.

Machine Learning, Probabilistic Modeling, & Statistics

Probability & Statistics

Ash, Robert B. 1970. Basic Probability Theory. John Wiley and Sons.

Diez, David, Christopher Barr, and Mine Cetinkaya-Rundel. 2015. OpenIntro Statistics. 3rd edition.

Downey, Allen B. 2014. Think Stats: Probability and Statistics for Programmers. O’Reilly Media, Inc.

Grinstead, Charles Miller, and James Laurie Snell. 2006. Grinstead and Snell’s Introduction to Probability. Chance Project.

Ross, Sheldon. 2014. A First Course in Probability.

Machine Learning / Statistical Learning

Barber, David. 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press.

Daumé III, Hal. 2015. A Course in Machine Learning.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Berlin: Springer.

Duda, Richard O., Peter E. Hart, and David G. Stork. 2012. Pattern Classification. John Wiley & Sons.

Yee, Stephanie, and Tony Chu. A Visual Introduction to Machine Learning. Data visualizations that guide the reader through core machine learning concepts.

Shalizi, Cosma. Advanced Data Analysis from an Elementary Point of View. A pre-publication pdf draft textbook made available by the author.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.

Special Topics

Bayesian Methods

Davidson-Pilon, Cameron. 2015. Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference. 1st edition. New York: Addison-Wesley Professional.

Downey, Allen. 2013. Think Bayes: Bayesian Statistics Made Simple. O’Reilly Media, Inc.

Neural Networks & Deep Learning

Kriesel, David. 2007. A Brief Introduction to Neural Networks.

Nielsen, Michael. 2016. Neural Networks and Deep Learning. Free online book.

Smilkov, Daniel and Shan Carter. An Interactive Neural Network Playground. Interactive neural network simulator.

Deep Learning Tutorial. (Stanford)

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. An MIT Press book on deep learning (and basic machine learning).

Information Theory

MacKay, David J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Gaussian Processes

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. University Press Group Limited.

Reinforcement Learning

Sutton, Richard S., and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press.

Programming

Basic and Statistical Programming

Python

Shaw, Zed A. 2013. Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code. Addison-Wesley.

Sweigart, Al. 2016. Automate the Boring Stuff with Python: Practical Programming for Total Beginners.

Downey, Allen. 2015. Think Python: How to Think Like a Computer Scientist. 2nd Edition. O’Reilly Media, Inc.

Downey, Allen B. 2012. Think Complexity: Complexity Science and Computational Modeling. O’Reilly Media, Inc.

Severance, Charles. Python for Informatics. A free pdf book on a data-analysis-centered approach to Python coding.

R

Wickham, Hadley. Advanced R. An online textbook based on a popular print book on R.

Navarro, Daniel. 2015. Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners (version 0.5).

MATLAB

Hamilton, Antonia. 2004. Matlab for Psychologists. A MATLAB beginner's pdf tutorial.

Mathworks MATLAB Statistics and Machine Learning Toolbox Tutorial.

Visualization

Best Practices

Tufte, Edward R. 2001. The Visual Display of Quantitative Information. 2nd edition. Cheshire, Conn: Graphics Press.

Few, Stephen. 2009. Now You See It: Simple Visualization Techniques for Quantitative Analysis. Analytics Press.

Cairo, Alberto. 2012. The Functional Art: An Introduction to Information Graphics and Visualization. New Riders.

D3.js

Murray, Scott. 2013. Interactive Data Visualization for the Web. O’Reilly Media, Inc.

Maclean, Malcolm. 2013. D3 Tips and Tricks. Leanpub.

Special Topics

Git

Atlassian. Git Tutorial.

Regular Expressions

Skinner, Grant. RegExr. An online tool to learn, build, & test Regular Expressions.

SQL

Gertz, M. 2000. Oracle/SQL Tutorial.

MySQL Tutorial. 1997. MySQL 5.1 Reference Manual.

MapReduce and Big Data

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.

Lin, Jimmy, and Chris Dyer. 2010. Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers.