July 7th, 2023

Tools for Data Science

Open-Source Data Management Tools

- Relational databases:

  - MySQL

  - PostgreSQL

- NoSQL Databases:

  - MongoDB

  - Apache CouchDB

  - Apache Cassandra

- File-based tools:

  - Hadoop Distributed File System (HDFS)

  - Ceph

- Elasticsearch


Open-Source Data Integration and Transformation Tools

- Extract, Transform, and Load (ETL) process

- Extract, Load, Transform (ELT) process

- Data Refinery and Cleansing

- Open-source tools:

  - Apache Airflow

  - Kubeflow

  - Apache Kafka

  - Apache NiFi

  - Apache Spark SQL

  - Node-RED
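
The difference between the ETL and ELT processes listed above is where the transform step runs. The sketch below illustrates this with Python's built-in `sqlite3` module standing in for a data store. The sample rows, table names, and helper functions (`clean`, `run_etl`, `run_elt`) are hypothetical and chosen only for illustration; none of the tools in the list are used here.

```python
import sqlite3

# Hypothetical raw input: untrimmed names and scores stored as text.
RAW_ROWS = [
    ("  Alice ", "42"),
    ("BOB", "17"),
    ("carol", "not-a-number"),
]


def clean(name, score):
    """Transform step: normalize the name and parse the score (None if invalid)."""
    name = name.strip().title()
    try:
        return name, int(score)
    except ValueError:
        return name, None


def run_etl(rows):
    """ETL: transform in application code, then load only the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
    cleaned = [clean(name, score) for name, score in rows]
    conn.executemany(
        "INSERT INTO scores VALUES (?, ?)",
        [row for row in cleaned if row[1] is not None],
    )
    return conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()


def run_elt(rows):
    """ELT: load the raw rows unchanged, then transform inside the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_scores (name TEXT, score TEXT)")
    conn.executemany("INSERT INTO raw_scores VALUES (?, ?)", rows)
    # The transform is expressed in SQL and runs where the data already lives.
    conn.execute(
        """
        CREATE TABLE scores AS
        SELECT upper(substr(trim(name), 1, 1)) || lower(substr(trim(name), 2)) AS name,
               CAST(score AS INTEGER) AS score
        FROM raw_scores
        WHERE score GLOB '[0-9]*' AND score NOT GLOB '*[^0-9]*'
        """
    )
    return conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()


if __name__ == "__main__":
    print(run_etl(RAW_ROWS))
    print(run_elt(RAW_ROWS))
```

Both functions end with the same cleaned table. In the ELT variant the raw rows remain available in `raw_scores`, so the transform can be revised and rerun without extracting the data again.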


Data Visualization Tools

- Programming libraries:

  - PixieDust

- Tools with user interface:

  - Hue

  - Kibana

  - Apache Superset


Tools for Model Building, Deployment, Monitoring, and Assessment

- Model deployment tools:

  - Apache PredictionIO

  - Seldon

  - Kubernetes

  - Red Hat OpenShift

  - MLeap

  - TensorFlow Serving

  - TensorFlow Lite

  - TensorFlow.js

- Model monitoring tools:

  - ModelDB

  - Prometheus

  - IBM AI Fairness 360

  - IBM Adversarial Robustness 360 Toolbox

  - IBM AI Explainability 360


Tools for Code and Data Asset Management

- Code asset management tools:

  - Git

  - GitHub

  - GitLab

  - Bitbucket

- Data asset management tools:

  - Apache Atlas

  - ODPi Egeria

  - Kylo


Comparison of Open-Source Development Environments

- Jupyter:

  - Interactive Python programming

  - Support for multiple programming languages through kernels

  - Jupyter Notebooks and JupyterLab

- Apache Zeppelin:

  - Similar to Jupyter Notebooks

  - Integrated plotting capability

- RStudio:

  - Development environment for statistics and data science using R

  - Integration with Jupyter

- Spyder:

  - Python alternative to RStudio


Cluster Execution Environments

- Apache Spark:

  - Batch data processing engine

  - Linear scalability

- Apache Flink:

  - Stream-processing engine for real-time data

- Ray:

  - Large-scale deep learning model training


Fully Integrated and Visual Tools for Data Scientists

- KNIME:

  - Visual user interface for data integration, transformation, visualization, and model building