The Rise of the Citizen Data Scientist
Data scientists are inquisitive and often seek out new tools that help them find answers. They also need to be proficient with the tools of the trade, even though there are dozens upon dozens of them. Overall, data scientists should have a working knowledge of statistical programming languages for constructing data processing systems, databases, and visualization tools. Many in the field also consider programming knowledge an integral part of data science. Not all data science students study programming, however, so it helps to be aware of tools that sidestep programming with a user-friendly graphical interface, so that a data scientist’s knowledge of algorithms is enough to build predictive models.
With everything on your plate as a data scientist, you don’t have time to search for the tools of the trade that can help you do your work. That’s why we have rounded up tools that aid in data visualization, algorithms, statistical programming languages, and databases. We have chosen tools based on their ease of use, popularity, reputation, and features. We have listed our top tools for data scientists in alphabetical order to simplify your search; they are not ranked or rated.
1. Algorithms.io
@algorithms_io
Algorithms.io is a LumenData Company providing machine learning as a service for streaming data from connected devices. This tool turns raw data into real-time insights and actionable events so that companies are in a better position to deploy machine learning for streaming data.
Key Features:
Cost: Contact for a quote
2. Apache Giraph
An iterative graph processing system designed for high scalability, Apache Giraph began as an open source counterpart to Pregel but adds multiple features beyond the basic Pregel model. Giraph is used by data scientists to “unleash the potential of structured datasets at a massive scale.”
Key Features:
Cost: FREE
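To make the Pregel model mentioned above more concrete, here is a minimal pure-Python sketch of vertex-centric processing: in each superstep, a vertex reads the messages sent to it, updates its value, and messages its neighbours only if the value changed. This is not Giraph's API (Giraph jobs are written against its own Java classes); the function name `run_supersteps` and the toy graph are assumptions made for illustration.

```python
def run_supersteps(values, edges, max_supersteps=30):
    """Propagate the maximum value through a graph, Pregel-style.

    values: dict mapping vertex -> initial number
    edges:  dict mapping vertex -> list of neighbouring vertices
    (Hypothetical helper for illustration, not part of Giraph.)
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)

    for _ in range(max_supersteps):
        outbox = {v: [] for v in values}
        any_change = False
        for v, messages in inbox.items():
            # A vertex only acts when a message improves its value.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                any_change = True
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        if not any_change:
            break  # every vertex is idle, so the computation halts
        inbox = outbox
    return values


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = run_supersteps({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
print(result)
```

In a real deployment the vertices would be partitioned across workers and the messages exchanged over the network between supersteps; the loop above only mimics that flow on one machine.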
3. Apache Hadoop
Apache Hadoop is open source software for reliable, scalable, distributed computing. A framework for the distributed processing of large datasets across clusters of computers, the software library uses simple programming models. Hadoop is appropriate for both research and production.
Key Features:
Cost: FREE
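The best-known of Hadoop's simple programming models is MapReduce. The sketch below imitates its three stages (map, shuffle, reduce) in plain Python on a single machine. It does not use Hadoop itself, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data science"])
print(counts)
```

On a cluster, the map and reduce functions would run in parallel on different nodes, which is what lets such a simple model scale to large datasets.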
4. Apache HBase
The Hadoop database, Apache HBase is a distributed, scalable, big data store. Data scientists use this open source tool when they need random, real-time read/write access to Big Data. Apache HBase also provides capabilities similar to Bigtable on top of Hadoop and HDFS.
Key Features:
Cost: FREE
5. Apache Hive
An Apache Software Foundation project, Apache Hive began as a subproject of Apache Hadoop and is now a top-level project itself. Hive is data warehouse software that uses SQL to help read, write, and manage large datasets that reside in distributed storage.
Key Features:
Cost: FREE
6. Apache Kafka
A distributed streaming platform, Apache Kafka efficiently processes streams of data in real time. Data scientists use this tool to build real-time data pipelines and streaming apps because it empowers you to publish and subscribe to streams of records, store streams of records in a fault-tolerant way, and process streams of records as they occur.
Key Features:
Cost: FREE
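The publish, store, and process capabilities described above rest on one idea: records are appended to a log, and each consumer keeps track of how far it has read. The toy class below sketches that idea in plain Python. It is not the Kafka client API, and `TopicLog`, `publish`, and `poll` are hypothetical names used only for this illustration.

```python
class TopicLog:
    """A toy, in-memory stand-in for a single append-only topic."""

    def __init__(self):
        self._records = []   # stored records, in arrival order
        self._offsets = {}   # consumer name -> next offset to read

    def publish(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return unread records for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for reading in ["temp=20", "temp=21", "temp=23"]:
    log.publish(reading)

first_batch = log.poll("dashboard", max_records=2)
all_for_audit = log.poll("audit")
second_batch = log.poll("dashboard")
print(first_batch, second_batch, all_for_audit)
```

Because reading never removes records from the log, the two consumers see the same stream independently, each at its own pace.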
7. Apache Mahout
@ApacheMahout
An open source Apache Foundation project for machine learning, Apache Mahout aims to enable scalable machine learning and data mining. Specifically, the project’s goal is to “build an environment for quickly creating scalable performant machine learning applications.”
Key Features:
Cost: FREE
8. Apache Mesos
A cluster manager, Apache Mesos provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos abstracts CPU, memory, storage, and other resources away from physical or virtual machines to enable fault-tolerant, elastic distributed systems to be built easily and run effectively.
Key Features:
Cost: FREE
9. Apache Pig
A platform designed for analyzing large datasets, Apache Pig consists of a high-level language for expressing data analysis programs that is coupled with infrastructure for evaluating such programs. Because Pig programs’ structures can handle significant parallelization, they can tackle large datasets.
Key Features:
Cost: FREE
10. Apache Spark
Apache Spark delivers “lightning-fast cluster computing.” A wide range of organizations use Spark to process large datasets, and this data scientist tool can access diverse data sources such as HDFS, Cassandra, HBase, and S3.
Key Features:
Cost: FREE
11. Apache Storm
@ApacheStorm
@stormprocessor
Apache Storm is a tool for data scientists that handles distributed and fault-tolerant real-time computation. It also tackles stream processing, continuous computation, distributed RPC, and more.
Key Features:
Cost: FREE
12. BigML
BigML makes machine learning simple. This company-wide platform runs in the cloud or on premises for operationalizing machine learning in organizations. BigML makes it simple to solve and automate classification, regression, cluster analysis, anomaly detection, association discovery, and topic modeling tasks.
Key Features:
Cost: Contact for a quote
13. Bokeh
@BokehPlots
A Python interactive visualization library, Bokeh targets modern web browsers for presentation and helps users create interactive plots, dashboards, and data apps easily.
Key Features:
Cost: FREE
14. Cascading
@cascading
Cascading is an application development platform for data scientists building Big Data applications on Apache Hadoop. Users can solve simple and complex data problems with Cascading because it boasts computation engine, systems integration framework, data processing, and scheduling capabilities.
Key Features:
Cost: FREE
15. Clojure
A robust and fast programming language, Clojure is a practical tool that marries the interactive development of a scripting language with an efficient infrastructure for multithreaded programming. Clojure is unique in that it is a compiled language but remains dynamic, with every feature supported at runtime.
Key Features:
Cost: FREE
16. D3.js
Committed to “code and data for humans,” Mike Bostock created D3.js. Data scientists use this tool, a JavaScript library for manipulating documents based on data, to add life to their data with SVG, Canvas, and HTML.
Key Features:
Cost: FREE
17. DataRobot
@DataRobot
An advanced machine learning automation platform, DataRobot helps data scientists build better predictive models faster. You can keep up with the ever-expanding ecosystem of machine learning algorithms easily when you use DataRobot.
Key Features:
Cost: Contact for a quote
18. DataRPM
DataRPM is the “industry’s first and only cognitive predictive maintenance platform for industrial IoT.” DataRPM is also the recipient of the 2017 Technology Leadership Award for Cognitive Predictive Maintenance in Automotive Manufacturing from Frost & Sullivan.
Key Features:
Cost: Contact for a quote
19. Excel
Many data scientists view Excel as a secret weapon. It is a familiar tool that scientists can rely on to quickly sort, filter, and work with their data. It’s also on nearly every computer you come across, so data scientists can work from just about anywhere with Excel.
Key Features:
Cost: FREE trial available
20. Feature Labs
An end-to-end data science solution, Feature Labs develops and deploys intelligent products and services for your data. They also work with data scientists to help you develop and deploy intelligent products, features, and services.
Key Features:
Cost: Contact for a quote
21. ForecastThis
@forecastthis
ForecastThis is a tool for data scientists that automates predictive model selection. The company strives to make deep learning relevant for finance and economics by enabling investment managers, quantitative analysts, and data scientists to use their own data to generate robust forecasts and optimize complex future objectives.
Key Features:
Cost: Contact for a quote
22. Google Fusion Tables
Google Fusion Tables is a cloud-based data management service that focuses on collaboration, ease-of-use, and visualizations. An experimental app, Fusion Tables is a data visualization web application tool for data scientists that empowers you to gather, visualize, and share data tables.
Key Features:
Cost: FREE
23. Gawk
GNU is an operating system that enables you to use a computer without software “that would trample your freedom.” The GNU Project created Gawk, an awk utility that interprets a special-purpose programming language. Gawk empowers users to handle simple data-reformatting jobs using only a few lines of code.
Key Features:
Cost: FREE
24. ggplot2
@hadleywickham
@winston_chang
Hadley Wickham and Winston Chang developed ggplot2, a plotting system for R that is based on the grammar of graphics. With ggplot2, data scientists can avoid many of the hassles of plotting while maintaining the attractive parts of base and lattice graphics and producing complex multi-layered graphics easily.
Key Features:
Cost: FREE
25. GraphLab Create
Data scientists and developers use GraphLab Create to build state-of-the-art data products via machine learning. This machine learning modeling tool helps users build intelligent applications end-to-end in Python.
Key Features:
Cost:
26. IPython
@IPythonDev
Interactive Python tools, or IPython, is a growing project with expanding language-agnostic components and provides a rich architecture for interactive computing. An open source tool for data scientists, IPython supports Python 2.7 and 3.3 or newer.
Key Features:
Cost: FREE
27. Java
Java is a language with a broad user base that serves as a tool for data scientists creating products and frameworks involving distributed systems, data analysis, and machine learning. Java now is recognized as being just as important to data science as R and Python because it is robust, convenient, and scalable for data science applications.
Key Features:
Cost: FREE trial available; Contact for commercial license cost
28. Jupyter
Jupyter provides multi-language interactive computing environments. Its Notebook, an open source web application, allows data scientists to create and share documents containing live code, equations, visualizations, and explanatory text.
Key Features:
Cost: FREE
29. KNIME Analytics Platform
@knime
Thanks to its open platform, KNIME is a tool for navigating complex data freely. The KNIME Analytics Platform is a leading open solution for data-driven innovation to help data scientists uncover data’s hidden potential, mine for insights, and predict futures.
Key Features:
Cost: FREE
30. Logical Glue
An award-winning white-box machine learning and artificial intelligence platform, Logical Glue increases productivity and profit for organizations. Data scientists choose this tool because it brings your insights to life for your audience.
Key Features:
Cost: Contact for a quote
31. MATLAB
A high-level language and interactive environment for numerical computation, visualization, and programming, MATLAB is a powerful tool for data scientists. MATLAB serves as the language of technical computing and is useful for math, graphics, and programming.
Key Features:
Cost:
32. Matplotlib
Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Data scientists use this tool in Python scripts, the Python and IPython shell, the Jupyter Notebook, web application servers, and four graphical user interface toolkits.
Key Features:
Cost: FREE
33. MLBase
UC Berkeley’s AMPLab integrates algorithms, machines, and people to make sense of Big Data. They also developed MLBase, an open source project that makes distributed machine learning easier for data scientists.
Key Features:
Cost: FREE
34. MySQL
MySQL is one of today’s most popular open source databases. It’s also a popular tool for data scientists who need to pull data out of a database with SQL. Even though MySQL is most often found behind web applications, it can be used in a variety of settings.
Key Features:
Cost: FREE
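As a rough illustration of pulling data out of a relational database with SQL, the sketch below runs an aggregate query from Python. It uses the standard library's `sqlite3` module as a stand-in so that it runs without a database server. The `orders` table and its rows are made up for this example, and a MySQL driver would need its own connection call and possibly a different placeholder style.

```python
import sqlite3

# In-memory SQLite database used here in place of a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 5.5), ("ana", 12.5)],  # hypothetical sample rows
)

# Total spend per customer, sorted by customer name.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
conn.close()
```

The `SELECT ... GROUP BY` statement itself is standard SQL, so the same query text should carry over to MySQL largely unchanged.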
35. Narrative Science
@narrativesci
Narrative Science helps enterprises maximize the impact of their data with automated, intelligent narratives generated by advanced narrative language generation (NLG). Data scientists humanize data with Narrative Science’s technology that interprets and then transforms data at unparalleled speed and scale.
Key Features:
Cost: Contact for a quote
36. Natural Language Toolkit (NLTK)
@NLTK_org
A leading platform for building Python programs that work with human language data, Natural Language Toolkit (NLTK) is a helpful tool for inexperienced data scientists and data science students working in computational linguistics using Python.
Key Features:
Cost: FREE
37. NetworkX
NetworkX is a Python package for data scientists. Create, manipulate, and study the structure, dynamics, and functions of complex networks with NetworkX.
Key Features:
Cost: FREE
38. NumPy
A fundamental package for scientific computing with Python, NumPy is well-suited to scientific uses. NumPy also serves as a multi-dimensional container of generic data.
Key Features:
Cost: FREE
39. Octave
@GnuOctave
GNU Octave is a scientific programming language that is a useful tool for data scientists looking to solve systems of equations or visualize data with high-level plot commands. This tool’s syntax is compatible with MATLAB, and its interpreter can be run in GUI mode, as a console, or invoked as part of a shell script.
Key Features:
Cost: FREE
40. OpenRefine
OpenRefine is a powerful tool for data scientists who want to clean up, transform, and extend data with web services and then link it to databases. Formerly Google Refine, OpenRefine now is an open source project fully supported by volunteers.
Key Features:
Cost: FREE
41. pandas
pandas is an open source library that delivers high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Data scientists use this tool when they need a Python data analysis library.
Key Features:
Cost: FREE
42. RapidMiner
Data scientists are more productive when they use RapidMiner, a unified platform for data prep, machine learning, and model deployment. A tool for making data science fast and simple, RapidMiner is a leader in the 2017 Gartner Magic Quadrant for Data Science Platforms, a leader in 2017 Forrester Wave for predictive analytics and machine learning, and a high performer in the G2 Crowd predictive analytics grid.
Key Features:
Cost:
43. Redis
@redisfeed
Redis is a data structure server that data scientists use as a database, cache, and message broker. This open source, in-memory data structure store supports strings, hashes, lists, and more.
Key Features:
Cost: FREE
44. RStudio
RStudio is a tool for data scientists that is open source and enterprise-ready. This professional software for the R community makes R easier to use.
Key Features:
Cost:
45. Scala
@scala_lang
The Scala programming language is a tool for data scientists looking to construct elegant class hierarchies to maximize code reuse and extensibility. The tool also empowers users to implement class hierarchies’ behavior using higher-order functions.
Key Features:
Cost: FREE
46. scikit-learn
@scikit_learn
scikit-learn is an easy-to-use, general-purpose machine learning library for Python. Data scientists prefer scikit-learn because it features simple, efficient tools for data mining and data analysis.
Key Features:
Cost: FREE
47. SciPy
SciPy, a Python-based ecosystem of open source software, is intended for math, science, and engineering applications. The SciPy Stack includes Python, NumPy, Matplotlib, the SciPy Library, and more.
Key Features:
Cost: FREE
48. Shiny
A web application framework for R by RStudio, Shiny is a tool data scientists use to turn analyses into interactive web applications. Shiny is an ideal tool for data scientists who are inexperienced in web development.
Key Features:
Cost: Contact for a quote
49. TensorFlow
TensorFlow is a fast, flexible, scalable open source machine learning library for research and production. Data scientists use TensorFlow for numerical computation using data flow graphs.
Key Features:
Cost: FREE
50. TIBCO Spotfire
@TIBCO
TIBCO drives digital business by enabling better decisions and faster, smarter actions. Their Spotfire solution is a tool for data scientists that addresses data discovery, data wrangling, predictive analytics, and more.
Key Features:
Cost: FREE trial available
This blog features a comprehensive list of tools for working with Python and Excel. It covers writing Excel Add-Ins in Python, reading and writing Excel files, and interacting with Excel. It’s a great resource for understanding the differences between all the different Python/Excel tools out there, and all in one place.