Awesome Python Data Science


Probably the best curated list of data science software in Python

Machine Learning

General Purpose Machine Learning

  • scikit-learn - Machine learning in Python. sklearn
  • Shogun - Machine learning toolbox.
  • xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package.
  • cuML - RAPIDS Machine Learning Library. sklearn GPU accelerated
  • modAL - Modular active learning framework for Python3. sklearn
  • Sparkit-learn - PySpark + scikit-learn = Sparkit-learn. sklearn Apache Spark based
  • mlpack - A scalable C++ machine learning library (Python bindings).
  • dlib - Toolkit for making real world machine learning and data analysis applications in C++ (Python bindings).
  • MLxtend - Extension and helper modules for Python's data analysis and machine learning libraries. sklearn
  • Reproducible Experiment Platform (REP) - Machine Learning toolbox for Humans. sklearn
  • scikit-multilearn - Multi-label classification for python. sklearn
  • seqlearn - Sequence classification toolkit for Python. sklearn
  • pystruct - Simple structured learning framework for Python. sklearn
  • sklearn-expertsys - Highly interpretable classifiers for scikit learn, producing easily understood decision rules instead of black box models. sklearn
  • RuleFit - Implementation of the RuleFit algorithm. sklearn
  • metric-learn - Metric learning algorithms in Python. sklearn
  • pyGAM - Generalized Additive Models in Python.

Time Series

  • tslearn - Machine learning toolkit dedicated to time-series data. sklearn
  • tick - Module for statistical learning, with a particular emphasis on time-dependent modelling. sklearn
  • Prophet - Automatic Forecasting Procedure.
  • PyFlux - Open source time series library for Python.
  • bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
  • luminol - Anomaly Detection and Correlation library.

Automated Machine Learning

  • TPOT - Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. sklearn
  • auto-sklearn - An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. sklearn
  • MLBox - A powerful Automated Machine Learning python library.

Ensemble Methods

  • ML-Ensemble - High performance ensemble learning. sklearn
  • Stacking - Simple and useful stacking library, written in Python. sklearn
  • stacked_generalization - Library for machine learning stacking generalization. sklearn
  • vecstack - Python package for stacking (machine learning technique). sklearn

Imbalanced Datasets

  • imbalanced-learn - Module to perform under sampling and over sampling with various techniques. sklearn
  • imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data. sklearn
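
As a rough illustration of what "over sampling" means in the entries above, here is a minimal standard-library sketch of random oversampling. It is not the API of imbalanced-learn or any other listed package; the function name `random_oversample` and the toy data are hypothetical, chosen only for this example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest one.
    target = max(len(xs) for xs in by_class.values())

    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw extra samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data: four samples of class "a", one of class "b".
X = [[0.1], [0.2], [0.3], [0.4], [5.0]]
y = ["a", "a", "a", "a", "b"]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

The listed libraries offer considerably more than this (for example synthetic sample generation and under-sampling strategies); the sketch only shows the basic idea of rebalancing class counts.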

Random Forests

Extreme Learning Machine

  • Python-ELM - Extreme Learning Machine implementation in Python. sklearn
  • Python Extreme Learning Machine (ELM) - A machine learning technique used for classification/regression tasks.
  • hpelm - High performance implementation of Extreme Learning Machines (fast randomized neural networks). GPU accelerated

Kernel Methods

  • pyFM - Factorization machines in python. sklearn
  • fastFM - A library for Factorization Machines. sklearn
  • tffm - TensorFlow implementation of an arbitrary order Factorization Machine. sklearn TensorFlow based/compatible
  • liquidSVM - An implementation of SVMs.
  • scikit-rvm - Relevance Vector Machine implementation using the scikit-learn API. sklearn
  • ThunderSVM - A fast SVM Library on GPUs and CPUs. sklearn GPU accelerated

Gradient Boosting

  • XGBoost - Scalable, Portable and Distributed Gradient Boosting. sklearn GPU accelerated
  • LightGBM - A fast, distributed, high performance gradient boosting. sklearn GPU accelerated
  • CatBoost - An open-source gradient boosting on decision trees library. sklearn GPU accelerated
  • ThunderGBM - Fast GBDTs and Random Forests on GPUs. sklearn GPU accelerated

Deep Learning

PyTorch

  • PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch based/compatible
  • torchvision - Datasets, Transforms and Models specific to Computer Vision. PyTorch based/compatible
  • torchtext - Data loaders and abstractions for text and NLP. PyTorch based/compatible
  • torchaudio - An audio library for PyTorch. PyTorch based/compatible
  • ignite - High-level library to help with training neural networks in PyTorch. PyTorch based/compatible
  • PyToune - A Keras-like framework and utilities for PyTorch.
  • skorch - A scikit-learn compatible neural network library that wraps pytorch. sklearn PyTorch based/compatible
  • PyTorchNet - An abstraction to train neural networks. PyTorch based/compatible
  • Aorun - Intend to implement an API similar to Keras with PyTorch as backend. PyTorch based/compatible
  • pytorch_geometric - Geometric Deep Learning Extension Library for PyTorch. PyTorch based/compatible
  • Catalyst - High-level utils for PyTorch DL & RL research. PyTorch based/compatible

TensorFlow

  • TensorFlow - Computation using data flow graphs for scalable machine learning by Google. TensorFlow based/compatible
  • TensorLayer - Deep Learning and Reinforcement Learning Library for Researchers and Engineers. TensorFlow based/compatible
  • TFLearn - Deep learning library featuring a higher-level API for TensorFlow. TensorFlow based/compatible
  • Sonnet - TensorFlow-based neural network library. TensorFlow based/compatible
  • tensorpack - A Neural Net Training Interface on TensorFlow. TensorFlow based/compatible
  • Polyaxon - A platform that helps you build, manage and monitor deep learning models. TensorFlow based/compatible
  • NeuPy - NeuPy is a Python library for Artificial Neural Networks and Deep Learning (previously: Theano compatible). TensorFlow based/compatible
  • tfdeploy - Deploy TensorFlow graphs for fast evaluation and export to TensorFlow-less environments running numpy. TensorFlow based/compatible
  • tensorflow-upstream - TensorFlow ROCm port. TensorFlow based/compatible Possible to run on AMD GPU
  • TensorFlow Fold - Deep learning with dynamic computation graphs in TensorFlow. TensorFlow based/compatible
  • tensorlm - Wrapper library for text generation / language models at char and word level with RNN. TensorFlow based/compatible
  • TensorLight - A high-level framework for TensorFlow. TensorFlow based/compatible
  • Mesh TensorFlow - Model Parallelism Made Easier. TensorFlow based/compatible
  • Ludwig - A toolbox that lets you train and test deep learning models without writing code. TensorFlow based/compatible

Keras

  • Keras - A high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. Keras compatible
  • keras-contrib - Keras community contributions. Keras compatible
  • Hyperas - Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization. Keras compatible
  • Elephas - Distributed Deep learning with Keras & Spark. Keras compatible
  • Hera - Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser. Keras compatible
  • Spektral - Deep learning on graphs. Keras compatible
  • qkeras - A quantization deep learning library. Keras compatible

MXNet

  • MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler. MXNet based
  • Gluon - A clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet). MXNet based
  • MXbox - Simple, efficient and flexible vision toolbox for mxnet framework. MXNet based
  • gluon-cv - Provides implementations of the state-of-the-art deep learning models in computer vision. MXNet based
  • gluon-nlp - NLP made easy. MXNet based
  • Xfer - Transfer Learning library for Deep Neural Networks. MXNet based
  • MXNet - HIP Port of MXNet. MXNet based Possible to run on AMD GPU

Chainer

  • Chainer - A flexible framework for neural networks.
  • ChainerCV - A Library for Deep Learning in Computer Vision.
  • ChainerMN - Scalable distributed deep learning with Chainer.

Theano

WARNING: Theano development has been stopped

  • Theano - A Python library that allows you to define, optimize, and evaluate mathematical expressions. Theano compatible
  • Lasagne - Lightweight library to build and train neural networks in Theano. Theano compatible
  • nolearn - A scikit-learn compatible neural network library (mainly for Lasagne). Theano compatible sklearn
  • Blocks - A Theano framework for building and training neural networks. Theano compatible
  • scikit-neuralnetwork - Deep neural networks without the learning cliff. sklearn Theano compatible
  • platoon - Multi-GPU mini-framework for Theano. Theano compatible
  • Theano-MPI - MPI Parallel framework for training deep learning models built in Theano. Theano compatible

Others

  • CNTK - Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit.
  • Neon - Intel® Nervana™ reference deep learning framework committed to best performance on all hardware.
  • Tangent - Source-to-Source Debuggable Derivatives in Pure Python.
  • autograd - Efficiently computes derivatives of numpy code.
  • Myia - Deep Learning framework (pre-alpha).
  • nnabla - Neural Network Libraries by Sony.
  • Caffe - A fast open framework for deep learning.
  • Caffe2 - A lightweight, modular, and scalable deep learning framework (now a part of PyTorch).
  • hipCaffe - The HIP port of Caffe. Possible to run on AMD GPU

Data Manipulation

Data Containers

  • pandas - Powerful Python data analysis toolkit.
  • cuDF - GPU DataFrame Library. pandas compatible GPU accelerated
  • blaze - NumPy and pandas interface to Big Data. pandas compatible
  • pandasql - Allows you to query pandas DataFrames using SQL syntax. pandas compatible
  • pandas-gbq - pandas Google Big Query. pandas compatible
  • xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
  • pysparkling - A pure Python implementation of Apache Spark's RDD and DStream interfaces. Apache Spark based
  • Arctic - High performance datastore for time series and tick data.
  • datatable - Data.table for Python. R inspired/ported lib
  • koalas - pandas API on Apache Spark. pandas compatible
  • modin - Speed up your pandas workflows by changing a single line of code. pandas compatible
  • swifter - A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.

Pipelines

  • pdpipe - Easy pipelines for pandas DataFrames.
  • SSPipe - Python pipe (|) operator with support for DataFrames and Numpy and Pytorch.
  • pandas-ply - Functional data manipulation for pandas. pandas compatible
  • Dplython - Dplyr for Python. R inspired/ported lib
  • sklearn-pandas - pandas integration with sklearn. sklearn pandas compatible
  • Dataset - Helps you conveniently work with random or sequential batches of your data and define data processing.
  • pyjanitor - Clean APIs for data cleaning. pandas compatible
  • meza - A Python toolkit for processing tabular data.
  • Prodmodel - Build system for data science pipelines.
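
Several of the entries above revolve around chaining data transformations with a pipe (`|`) operator. The following standard-library sketch shows how such an operator can be built with Python's `__ror__` hook. The `Pipe` class and the toy records are hypothetical and do not reproduce the API of SSPipe, pdpipe, or any other listed package.

```python
class Pipe:
    """Wrap a one-argument function so that `value | Pipe(f)` calls f(value).

    Hypothetical helper for illustration only.
    """

    def __init__(self, func):
        self.func = func

    def __ror__(self, value):
        # Called for `value | self` when `value` does not know how to
        # handle the `|` operator with a Pipe on the right-hand side.
        return self.func(value)


# Toy records standing in for rows of a table.
rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 7},
    {"name": "c", "score": 5},
]

# Filter, sort, and project in a left-to-right chain.
result = (
    rows
    | Pipe(lambda rs: [r for r in rs if r["score"] > 4])
    | Pipe(lambda rs: sorted(rs, key=lambda r: r["score"]))
    | Pipe(lambda rs: [r["name"] for r in rs])
)
print(result)  # ['c', 'b']
```

The design choice here is readability: each step reads in the order it is applied, instead of nesting function calls inside one another.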

Feature Engineering

General

  • Featuretools - Automated feature engineering.
  • skl-groups - A scikit-learn addon to operate on set/"group"-based features. sklearn
  • Feature Forge - A set of tools for creating and testing machine learning features. sklearn
  • few - A feature engineering wrapper for sklearn. sklearn
  • scikit-mdr - A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction. sklearn
  • tsfresh - Automatic extraction of relevant features from time series. sklearn

Feature Selection

  • scikit-feature - Feature selection repository in python.
  • boruta_py - Implementations of the Boruta all-relevant feature selection method. sklearn
  • BoostARoota - A fast xgboost feature selection algorithm. sklearn
  • scikit-rebate - A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning. sklearn

Visualization

  • Matplotlib - Plotting with Python.
  • seaborn - Statistical data visualization using matplotlib.
  • Bokeh - Interactive Web Plotting for Python.
  • HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
  • prettyplotlib - Painlessly create beautiful matplotlib plots.
  • python-ternary - Ternary plotting library for python with matplotlib.
  • missingno - Missing data visualization module for Python.
  • chartify - Python library that makes it easy for data scientists to create charts.
  • physt - Improved histograms.
  • animatplot - A Python package for animating plots built on matplotlib.

Model Explanation

  • Alibi - Algorithms for monitoring and explaining machine learning models.
  • anchor - Code for "High-Precision Model-Agnostic Explanations" paper.
  • aequitas - Bias and Fairness Audit Toolkit.
  • Contrastive Explanation - Contrastive Explanation (Foil Trees). sklearn
  • yellowbrick - Visual analysis and diagnostic tools to facilitate machine learning model selection. sklearn
  • scikit-plot - An intuitive library to add plotting functionality to scikit-learn objects. sklearn
  • shap - A unified approach to explain the output of any machine learning model. sklearn
  • ELI5 - A library for debugging/inspecting machine learning classifiers and explaining their predictions.
  • Lime - Explaining the predictions of any machine learning classifier. sklearn
  • FairML - FairML is a python toolbox auditing the machine learning models for bias. sklearn
  • L2X - Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
  • PDPbox - Partial dependence plot toolbox.
  • pyBreakDown - Python implementation of R package breakDown. sklearn R inspired/ported lib
  • PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
  • Skater - Python Library for Model Interpretation.
  • model-analysis - Model analysis tools for TensorFlow. TensorFlow based/compatible
  • themis-ml - A library that implements fairness-aware machine learning algorithms. sklearn
  • treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions. sklearn
  • AI Explainability 360 - Interpretability and explainability of data and machine learning models.
  • Auralisation - Auralisation of learned features in CNN (for audio).
  • CapsNet-Visualization - A visualization of the CapsNet layers to better understand how it works.
  • lucid - A collection of infrastructure and tools for research in neural network interpretability.
  • Netron - Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
  • FlashLight - Visualization Tool for your NeuralNetwork.
  • tensorboard-pytorch - Tensorboard for pytorch (and chainer, mxnet, numpy, ...).
  • mxboard - Logging MXNet data for visualization in TensorBoard. MXNet based

Reinforcement Learning

  • OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.
  • Coach - Easy experimentation with state of the art Reinforcement Learning algorithms.
  • garage - A toolkit for reproducible reinforcement learning research.
  • OpenAI Baselines - High-quality implementations of reinforcement learning algorithms.
  • Stable Baselines - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
  • RLlib - Scalable Reinforcement Learning.
  • Horizon - A platform for Applied Reinforcement Learning.
  • TF-Agents - A library for Reinforcement Learning in TensorFlow. TensorFlow based/compatible
  • TensorForce - A TensorFlow library for applied reinforcement learning. TensorFlow based/compatible
  • TRFL - TensorFlow Reinforcement Learning. TensorFlow based/compatible
  • Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
  • keras-rl - Deep Reinforcement Learning for Keras. Keras compatible
  • ChainerRL - A deep reinforcement learning library built on top of Chainer.

Distributed Computing

  • Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. sklearn
  • PySpark - Exposes the Spark programming model to Python. Apache Spark based
  • Veles - Distributed machine learning platform.
  • Jubatus - Framework and Library for Distributed Online Machine Learning.
  • DMTK - Microsoft Distributed Machine Learning Toolkit.
  • PaddlePaddle - PArallel Distributed Deep LEarning.
  • dask-ml - Distributed and parallel machine learning. sklearn
  • Distributed - Distributed computation in Python.

Probabilistic Methods

  • pomegranate - Probabilistic and graphical models for Python. GPU accelerated
  • pyro - A flexible, scalable deep probabilistic programming library built on PyTorch. PyTorch based/compatible
  • ZhuSuan - Bayesian Deep Learning. sklearn
  • PyMC - Bayesian Stochastic Modelling in Python.
  • PyMC3 - Python package for Bayesian statistical modeling and Probabilistic Machine Learning. Theano compatible
  • sampled - Decorator for reusable models in PyMC3.
  • Edward - A library for probabilistic modeling, inference, and criticism. sklearn
  • InferPy - Deep Probabilistic Modelling Made Easy. sklearn
  • GPflow - Gaussian processes in TensorFlow. TensorFlow based/compatible
  • PyStan - Bayesian inference using the No-U-Turn sampler (Python interface).
  • gelato - Bayesian dessert for Lasagne. Theano compatible
  • sklearn-bayes - Python package for Bayesian Machine Learning with scikit-learn API. sklearn
  • skggm - Estimation of general graphical models. sklearn
  • pgmpy - A python library for working with Probabilistic Graphical Models.
  • skpro - Supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute. sklearn
  • Aboleth - A bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation. TensorFlow based/compatible
  • PtStat - Probabilistic Programming and Statistical Inference in PyTorch. PyTorch based/compatible
  • PyVarInf - Bayesian Deep Learning methods with Variational Inference for PyTorch. PyTorch based/compatible
  • emcee - The Python ensemble sampling toolkit for affine-invariant MCMC.
  • hsmmlearn - A library for hidden semi-Markov models with explicit durations.
  • pyhsmm - Bayesian inference in HSMMs and HMMs.
  • GPyTorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch. PyTorch based/compatible
  • MXFusion - Modular Probabilistic Programming on MXNet. MXNet based
  • sklearn-crfsuite - A scikit-learn inspired API for CRFsuite. sklearn

Genetic Programming

  • gplearn - Genetic Programming in Python. sklearn
  • DEAP - Distributed Evolutionary Algorithms in Python.
  • karoo_gp - A Genetic Programming platform for Python with GPU support. sklearn
  • monkeys - A strongly-typed genetic programming framework for Python.
  • sklearn-genetic - Genetic feature selection module for scikit-learn. sklearn

Optimization

  • Spearmint - Bayesian optimization.
  • BoTorch - Bayesian optimization in PyTorch. PyTorch based/compatible
  • SMAC3 - Sequential Model-based Algorithm Configuration.
  • Optunity - A library containing various optimizers for hyperparameter tuning.
  • hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python.
  • hyperopt-sklearn - Hyper-parameter optimization for sklearn. sklearn
  • sklearn-deap - Use evolutionary algorithms instead of gridsearch in scikit-learn. sklearn
  • sigopt_sklearn - SigOpt wrappers for scikit-learn methods. sklearn
  • Bayesian Optimization - A Python implementation of global optimization with gaussian processes.
  • SafeOpt - Safe Bayesian Optimization.
  • scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
  • Solid - A comprehensive gradient-free optimization framework written in Python.
  • PySwarms - A research toolkit for particle swarm optimization in Python.
  • Platypus - A Free and Open Source Python Library for Multiobjective Optimization.
  • GPflowOpt - Bayesian Optimization using GPflow. sklearn
  • POT - Python Optimal Transport library.
  • Talos - Hyperparameter Optimization for Keras Models.
  • nlopt - Library for nonlinear optimization (global and local, constrained or unconstrained).

Natural Language Processing

  • NLTK - Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
  • CLTK - The Classical Language Toolkit.
  • gensim - Topic Modelling for Humans.
  • PSI-Toolkit - A natural language processing toolkit.
  • pyMorfologik - Python binding for Morfologik.
  • skift - Scikit-learn wrappers for Python fastText. sklearn
  • Phonemizer - Simple text to phonemes converter for multiple languages.
  • flair - Very simple framework for state-of-the-art NLP.

Computer Audition

  • librosa - Python library for audio and music analysis.
  • Yaafe - Audio features extraction.
  • aubio - A library for audio and music analysis.
  • Essentia - Library for audio and music analysis, description and synthesis.
  • LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
  • Marsyas - Music Analysis, Retrieval and Synthesis for Audio Signals.
  • muda - A library for augmenting annotated audio data.
  • madmom - Python audio and music signal processing library.

Computer Vision

  • OpenCV - Open Source Computer Vision Library.
  • scikit-image - Image Processing SciKit (Toolbox for SciPy).
  • imgaug - Image augmentation for machine learning experiments.
  • imgaug_extension - Additional augmentations for imgaug.
  • Augmentor - Image augmentation library in Python for machine learning.
  • albumentations - Fast image augmentation library and easy to use wrapper around other libraries.

Statistics

  • pandas_summary - Extension to pandas dataframes describe function. pandas compatible
  • Pandas Profiling - Create HTML profiling reports from pandas DataFrame objects. pandas compatible
  • statsmodels - Statistical modeling and econometrics in Python.
  • stockstats - Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
  • weightedcalcs - pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
  • scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests.
  • Alphalens - Performance analysis of predictive (alpha) stock factors.

Experimentation

  • Sacred - A tool to help you configure, organize, log and reproduce experiments.
  • Xcessiv - A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling.
  • Persimmon - A visual dataflow programming language for sklearn.
  • Ax - Adaptive Experimentation Platform. sklearn

Evaluation

  • recmetrics - Library of useful metrics and plots for evaluating recommender systems.
  • Metrics - Machine learning evaluation metrics.
  • sklearn-evaluation - scikit-learn model evaluation made easy: plots, tables and markdown reports.
  • AI Fairness 360 - Fairness metrics for datasets and ML models, explanations and algorithms to mitigate bias in datasets and models.

Computations

  • numpy - The fundamental package needed for scientific computing with Python.
  • Dask - Parallel computing with task scheduling. pandas compatible
  • bottleneck - Fast NumPy array functions written in C.
  • CuPy - NumPy-like API accelerated with CUDA.
  • scikit-tensor - Python library for multilinear algebra and tensor factorizations.
  • numdifftools - Solve automatic numerical differentiation problems in one or more variables.
  • quaternion - Add built-in support for quaternions to numpy.
  • adaptive - Tools for adaptive and parallel sampling of mathematical functions.

Spatial Analysis

  • GeoPandas - Python tools for geographic data. pandas compatible
  • PySal - Python Spatial Analysis Library.

Quantum Computing

  • PennyLane - Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
  • QML - A Python Toolkit for Quantum Machine Learning.

Conversion

  • sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
  • ONNX - Open Neural Network Exchange.
  • MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.

Contributing

Contributions are welcome! 😎
Read the contribution guideline.

License

This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0