July 7th, 2023
Tools for Data Science
Open-Source Data Management Tools
- Relational databases:
- MySQL
- PostgreSQL
- NoSQL Databases:
- MongoDB
- Apache CouchDB
- Apache Cassandra
- File-based tools:
- Hadoop Distributed File System (HDFS)
- Ceph
- Elasticsearch
Open-Source Data Integration and Transformation Tools
- Extract, Transform, and Load (ETL) process
- Extract, Load, Transform (ELT) process
- Data Refinery and Cleansing
- Open-source tools:
- Apache Airflow
- Kubeflow
- Apache Kafka
- Apache NiFi
- Apache Spark SQL
- Node-RED
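
The difference between the ETL and ELT processes listed above can be sketched in a few lines. This is only an illustration using Python's built-in `csv` and `sqlite3` modules, not any of the tools named in this section; the sample data, table names (`sales_etl`, `sales_raw`, `sales_elt`), and helper functions are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; column names and values are made up for illustration.
RAW_CSV = """name,amount
alice, 10.5
BOB,3
carol,
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names, cast amounts, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((row["name"].strip().lower(), float(amount)))
    return cleaned


def run_etl(conn, text):
    """ETL: clean the data in application code, then load only cleaned rows."""
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales_etl VALUES (?, ?)", transform(extract(text))
    )


def run_elt(conn, text):
    """ELT: load the raw rows unchanged, then transform inside the database."""
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany(
        "INSERT INTO sales_raw VALUES (?, ?)",
        [(r["name"], r["amount"]) for r in extract(text)],
    )
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT lower(trim(name)) AS name,
               CAST(trim(amount) AS REAL) AS amount
        FROM sales_raw
        WHERE trim(amount) <> ''
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn, RAW_CSV)
run_elt(conn, RAW_CSV)

etl_rows = conn.execute("SELECT name, amount FROM sales_etl ORDER BY name").fetchall()
elt_rows = conn.execute("SELECT name, amount FROM sales_elt ORDER BY name").fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths end with the same cleaned rows. What differs is where the transformation runs: before loading (ETL) or inside the target store after loading (ELT), which also keeps the raw data available for later reprocessing.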
Data Visualization Tools
- Programming libraries:
- PixieDust
- Tools with user interface:
- Hue
- Kibana
- Apache Superset
Tools for Model Building, Deployment, Monitoring, and Assessment
- Model deployment tools:
- Apache PredictionIO
- Seldon
- Kubernetes
- Red Hat OpenShift
- MLeap
- TensorFlow Serving
- TensorFlow Lite
- TensorFlow.js
- Model monitoring tools:
- ModelDB
- Prometheus
- IBM AI Fairness 360
- IBM Adversarial Robustness 360 Toolbox
- IBM AI Explainability 360
Tools for Code and Data Asset Management
- Code asset management tools:
- Git
- GitHub
- GitLab
- Bitbucket
- Data asset management tools:
- Apache Atlas
- ODPi Egeria
- Kylo
Open-Source Development Environments
- Jupyter:
- Interactive Python programming
- Support for multiple programming languages through kernels
- Jupyter Notebooks and JupyterLab
- Apache Zeppelin:
- Similar to Jupyter Notebooks
- Integrated plotting capability
- RStudio:
- Development environment for statistics and data science using R
- Integration with Jupyter
- Spyder:
- Python alternative to RStudio
Cluster Execution Environments
- Apache Spark:
- Batch data processing engine
- Near-linear scalability (doubling the servers in a cluster roughly doubles performance)
- Apache Flink:
- Stream-processing engine for real-time data
- Ray:
- Large-scale deep learning model training
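
The batch versus stream distinction between the engines above can be illustrated with a minimal plain-Python sketch. It does not use the Spark or Flink APIs; the function names and the sample readings are hypothetical and only show the two processing styles.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Batch style: the whole dataset is available, so compute once over all of it."""
    return sum(values) / len(values)


def streaming_mean(events: Iterable[float]) -> Iterator[float]:
    """Stream style: keep running state and emit an updated result per event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor readings used for both styles.
readings = [2.0, 4.0, 6.0, 8.0]

batch_result = batch_mean(readings)
stream_results = list(streaming_mean(iter(readings)))

print(batch_result)
print(stream_results)
```

The batch function gives one answer after all data has been collected, while the streaming generator gives an up-to-date answer after every event without waiting for the input to end.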
Fully Integrated and Visual Tools for Data Scientists
- KNIME:
- Visual user interface for data integration, transformation, visualization, and model building