Data science is an emerging field built from a number of well-established ones, so the skills that make for a successful data scientist come from a variety of disciplines, including mathematics, statistics & machine learning, and computer science. Navigating a pathway through these fields can be challenging, since no single resource can (by necessity) provide guidance on all of the tools of a modern data scientist. I've compiled a list of computational tools, (mostly free) resources and educational references, many of which have been personally helpful to me in developing skills in this area, and a checklist of data science concepts. They are intended for anyone who wants structure on the pathway through a personal data science curriculum, or who wants to brush up on topics and expand their knowledge.
Data Science Concept Checklist. A checklist of core and advanced concepts in data science across the three primary disciplines (mathematics, statistics & machine learning, and computer science), organized by topical area. This can act as a roadmap of concepts to explore or as a tool for finding opportunities to expand your existing skillset.
Tools. Descriptions and links to powerful computational tools and useful packages.
Resources and References. A curated collection of educational resources on a wide variety of core data science concepts and some special topics.
There is a vast array of tools that can be used for solving problems in data science. Some are programming languages or environments; others are useful packages for solving specific problems or for communicating and visualizing your results.
Almost any programming language can be used to solve computational problems, although a few stand out for their built-in packages and user support communities. Most notably, R and Python excel in these respects and are also freely available. MATLAB may have the most detailed documentation of any of the options available, but it is commercial software.
R. R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and OSX. With the RStudio integrated development environment (IDE), the language can be powerfully wielded for rapid analyses. Additionally, R Shiny can turn R analyses into interactive web applications.
Python. Python is a powerful, general-purpose, dynamic programming language that has extensive packages for scientific computation (NumPy, SciPy, pandas), advanced plotting (matplotlib), and machine learning (scikit-learn). For this sort of scientific computing, using an IDE such as Rodeo or Spyder may speed up the development of analyses.
MATLAB. A numerical computing environment and programming language with a wide set of standard toolboxes including those for statistics and machine learning.
Julia. A newer programming language designed to meet the needs of mathematical computing.
Almost any data science project worth doing requires many revisions and a good deal of collaboration. These tools allow for comprehensive Git-based version control with a web-based repository. GitHub is the most popular, but all offer similar web-based repository services.
Apache Subversion (SVN). A free software versioning and revision control system, based on a centralized concurrent versioning model.
Jupyter Notebook. This web application allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
GitHub Pages / GitHub.io. GitHub Pages allows you to create a web page from a GitHub repository and convert plain text into a formatted web document.
Tableau. Proprietary desktop and web-based visualization tools that include many data visualization techniques for the rapid development of professional visualizations.
MySQL. An open source relational database management system using SQL.
Apache Hadoop. An open source framework for distributed file storage and processing (often associated with “big data”) that uses the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for data processing.
MongoDB. A document-oriented NoSQL database (non-relational database, which does not rely on tables for storing data) capable of handling a wider variety of data types than traditional SQL relational databases.
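The MapReduce approach mentioned in the Apache Hadoop entry above can be illustrated with a small word count. This is only a single-machine sketch in plain Python, not Hadoop's actual API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen for illustration. In a real cluster, the map and reduce steps would run in parallel across many machines, with the framework handling the grouping step in between.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    A distributed framework does this between map and reduce;
    here it is just a dictionary of lists.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of strings."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data science uses data"]
    print(word_count(docs))
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` depends only on one key, both steps can be split across workers without coordination, which is what makes the pattern suitable for very large datasets.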
Here are some (primarily free) resources for data science. Some of these are personal favorites or recommendations, and many come from the list of data science books in the GitHub awesome-machine-learning repository.
Nilsson, Nils. 2010. The Quest for Artificial Intelligence: A History of Ideas and Achievements. A history of machine learning and data science.
Stewart, James. 2015. Calculus: Early Transcendentals. 8th edition. Boston, MA, USA: Brooks Cole.
Lay, David C. 2006. Linear Algebra and Its Applications. Pearson/Addison-Wesley.
Ross, Sheldon. 2014. A First Course in Probability.
Duda, Richard O., Peter E. Hart, and David G. Stork. 2012. Pattern Classification. John Wiley & Sons.
Yee, Stephanie, and Tony Chu. A Visual Introduction to Machine Learning. Data visualizations that guide the reader through core machine learning concepts.
Shalizi, Cosma. Advanced Data Analysis from an Elementary Point of View. A pre-publication pdf draft textbook made available by the author.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning.
Nielsen, Michael. 2016. Neural Networks and Deep Learning. Free online book.
Smilkov, Daniel and Shan Carter. An Interactive Neural Network Playground. Interactive neural network simulator.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. An MIT Press book on deep learning (and basic machine learning).
Severance, Charles. Python for Informatics. A free pdf book on a data-analysis-centered approach to Python coding.
Wickham, Hadley. Advanced R. An online textbook based on a popular print book on R.
Hamilton, Antonia. 2004. Matlab for Psychologists. A MATLAB beginner's pdf tutorial.
Tufte, Edward R. 2001. The Visual Display of Quantitative Information. 2nd edition. Cheshire, Conn: Graphics Pr.
Few, Stephen. 2009. Now You See It: Simple Visualization Techniques for Quantitative Analysis. Analytics Press.
Cairo, Alberto. 2012. The Functional Art: An Introduction to Information Graphics and Visualization. New Riders.
Skinner, Grant. RegExr. An online tool to learn, build, & test Regular Expressions.
MySQL Tutorial. 1997. MySQL 5.1 Reference Manual.