Tuesday 16 August, 1:30pm-4:30pm, Room 102-b
Authors: Kiran Kate, Martin Hirzel, Parikshit Ram, Avraham Shinnar, and Jason Tsay (IBM Research)
https://kdd.org/kdd2022/handsOnTutorial.html
Lale is a sklearn-compatible library for automated machine learning (AutoML). It is open-source (https://github.com/ibm/lale) and addresses the need for gradual automation of machine learning, as opposed to offering a black-box AutoML tool. Black-box AutoML tools are difficult to customize and thus keep data scientists from bringing their knowledge and intuition to bear on the automation process. Lale is built on three principles: progressive disclosure, orthogonality, and least surprise. These enable a gradual approach, offering a spectrum of usage patterns that ranges from total automation to control over almost every aspect of AutoML. Lale provides compositional constructs that let data scientists control some aspects of their pipelines while leaving other aspects free to be searched automatically. This tutorial demonstrates the use of Lale for various machine-learning tasks, showing how to progressively exercise more customization. It also covers AutoML for advanced scenarios such as class imbalance correction, bias detection and mitigation, multi-objective optimization, and working with multi-table datasets. While Lale comes with hyperparameter specifications for 216 operators out of the box, users can also add operators of their own, and this tutorial covers how to do that. Overall, this tutorial teaches how to exercise fine-grained control over AutoML without having to be an AutoML expert.
This tutorial targets data scientists who want to leverage AutoML. It expects some familiarity with Python libraries such as numpy, pandas, and sklearn (our user study showed that data scientists with moderate sklearn experience can successfully use Lale to solve advanced tasks). No AutoML experience is required.
This notebook will provide an overview of AutoML in general and introduce concepts such as hyperparameter optimization (HPO) and combined algorithm selection and hyperparameter tuning (CASH). We will also explain the concept of gradual AutoML.
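To make CASH concrete before introducing Lale's own constructs, here is a tiny, generic illustration in plain scikit-learn (not Lale syntax): the grid jointly searches over the choice of classifier and over each classifier's own hyperparameters.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, y = load_iris(return_X_y=True)
    pipe = Pipeline([("clf", LogisticRegression())])  # placeholder step
    # each dict pairs one algorithm with its own hyperparameter grid,
    # so the search is over (algorithm, hyperparameters) jointly
    param_grid = [
        {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
        {"clf": [RandomForestClassifier()], "clf__n_estimators": [10, 100]},
    ]
    search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)
    print(search.best_params_)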
Lale provides a simple sklearn-style operator called AutoPipeline to achieve total automation for standard machine learning tasks such as classification and regression on tabular data. We will demonstrate how to use AutoPipeline in just three lines of code, and how to inspect the output models of AutoPipeline.
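As a taste, the following sketch shows the intended usage. It follows the Lale documentation, but treat the exact parameter names (prediction_type, max_opt_time) and the inspection accessor as assumptions rather than a definitive reference.

    from sklearn.datasets import load_iris
    from lale.lib.lale import AutoPipeline

    X, y = load_iris(return_X_y=True)
    # the three essential lines: configure, fit, predict
    auto = AutoPipeline(prediction_type="classification", max_opt_time=120)
    trained = auto.fit(X, y)
    predictions = trained.predict(X)
    # inspecting the best pipeline found; treat the exact accessor
    # as an assumption:
    # print(trained.get_pipeline().pretty_print())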
This notebook will focus on using Lale to compose a pipeline with algorithm choices, such as between different categorical encoders and between different classifiers. Users can then perform CASH on this pipeline using auto_configure. We will also demonstrate how to refine the outputs of AutoML and perform iterative AutoML.
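For illustration, here is a sketch of such a pipeline and its tuning, following the documented Lale combinators: `|` expresses an algorithm choice and `>>` a dataflow step. The optimizer settings shown are just one plausible configuration, and iris merely stands in for a tabular dataset.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from lale.lib.lale import Hyperopt
    from lale.lib.sklearn import (
        LogisticRegression,
        OneHotEncoder,
        OrdinalEncoder,
        RandomForestClassifier,
    )

    # with real data, the encoders would target categorical columns
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # `|` = algorithm choice, `>>` = dataflow step
    planned = (OneHotEncoder | OrdinalEncoder) >> (
        LogisticRegression | RandomForestClassifier
    )
    trained = planned.auto_configure(
        X_train, y_train, optimizer=Hyperopt, cv=3, max_evals=25
    )
    print(trained.predict(X_test))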
Lale includes operators for class imbalance correction from imbalanced-learn. This notebook introduces the concept of higher-order operators and shows how to use them to create pipelines involving down-sampling and up-sampling.
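A minimal sketch of the higher-order pattern, assuming Lale's imbalanced-learn wrappers take the inner estimator via an `operator` argument as in the Lale documentation:

    from lale.lib.imblearn import SMOTE
    from lale.lib.sklearn import LogisticRegression, RandomForestClassifier

    # the resampler is a higher-order operator: it takes the downstream
    # (still-planned) estimator as its `operator` argument, so CASH can
    # search inside it
    planned = SMOTE(operator=LogisticRegression | RandomForestClassifier)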
To the best of our knowledge, Lale is the first AutoML library that allows easy inclusion of bias mitigators in the search space. This notebook introduces fairness metrics and bias mitigators from AIF360 and demonstrates CASH over them.
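A hedged sketch of what such a search space can look like; the fairness_info keys and operator names follow the Lale AIF360 integration, but should be treated as assumptions, and the feature and label values are hypothetical:

    from lale.lib.aif360 import DisparateImpactRemover, disparate_impact
    from lale.lib.sklearn import LogisticRegression

    # declare which label values and feature groups count as favorable /
    # reference (keys per the Lale AIF360 docs; treat as assumptions)
    fairness_info = {
        "favorable_labels": ["good"],
        "protected_attributes": [
            {"feature": "sex", "reference_group": ["male"]},
        ],
    }
    planned = DisparateImpactRemover(**fairness_info) >> LogisticRegression
    scorer = disparate_impact(**fairness_info)  # fairness metric as a scorer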
Most commonly used AutoML tools optimize for a single metric. In practice, there is often a need to take more than one metric into account while searching for the best model. For example, we may want good predictive performance as well as model fairness. This notebook will demonstrate the use of Lale for such multi-objective optimization to find Pareto frontiers.
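Independent of any particular library API, the Pareto frontier itself is easy to state in code: keep every candidate that no other candidate beats on all metrics simultaneously. A plain numpy illustration (not Lale API):

    import numpy as np

    # rows = candidate models, columns = two metrics to maximize
    # (e.g., accuracy and a fairness score)
    scores = np.array([[0.90, 0.60], [0.85, 0.80], [0.80, 0.85], [0.70, 0.70]])
    dominated = np.array([
        any(np.all(other >= s) and np.any(other > s) for other in scores)
        for s in scores
    ])
    pareto = scores[~dominated]  # first three rows survive; [0.70, 0.70] is dominated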
The preprocessing steps in a data science workflow often involve more than one table. Data scientists usually write independent scripts for such data preparation as there is no easy way to include it in a machine learning pipeline. Lale introduced preprocessing operators for performing join, filter, map, groupby, aggregate, etc. as part of end-to-end sklearn-style pipelines. We will demonstrate these operators in the context of a standard machine learning task.
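A sketch of a multi-table preprocessing prefix; the operator names and the `it` expression syntax follow the Lale documentation, while the table and column names (orders, customers, etc.) are hypothetical:

    from lale.lib.lale import Scan, Join, Map
    from lale.expressions import it

    # `it` refers to the incoming data in Lale expressions
    prep = (
        (Scan(table=it.orders) & Scan(table=it.customers))
        >> Join(pred=[it.orders.customer_id == it.customers.customer_id])
        >> Map(columns={"region": it.region, "amount": it.amount})
    )
    # `prep` can then be composed with an estimator via >> as usual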
An operator in Lale is a lightweight wrapper over an existing algorithmic implementation. For example, Lale has wrappers for 216 existing implementations from sklearn, imbalanced-learn, AIF360, etc. Adding a new operator to Lale is relatively straightforward and well documented. This notebook walks through the steps to add a new operator wrapper and to define a search space for that operator's hyperparameters.
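A condensed sketch of the pattern, assuming lale.operators.make_operator and a JSON-Schema hyperparameter specification as in the Lale developer documentation; the trivial scaling transformer is purely illustrative:

    import lale.operators

    class _MyScalerImpl:
        # minimal sklearn-style implementation: fit is a no-op,
        # transform multiplies features by a constant factor
        def __init__(self, factor=1.0):
            self.factor = factor
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return X * self.factor

    _hyperparams_schema = {
        "allOf": [{
            "type": "object",
            "properties": {
                "factor": {
                    "description": "Multiplicative scaling factor.",
                    "type": "number",
                    "minimum": 0.0,
                    "default": 1.0,
                }
            },
        }]
    }
    _combined_schemas = {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "type": "object",
        "properties": {"hyperparams": _hyperparams_schema},
    }
    MyScaler = lale.operators.make_operator(_MyScalerImpl, _combined_schemas)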
This notebook will discuss how Lale provides compatibility with scikit-learn-based code.
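For example, lale.wrap_imported_operators() (documented in Lale) retrofits already-imported sklearn estimators so that Lale's combinators apply to them:

    import lale
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    lale.wrap_imported_operators()  # the imported classes now act as Lale operators
    planned = PCA >> LogisticRegression  # Lale combinators on sklearn classes
    # the reverse direction also exists: a trained Lale pipeline can be
    # exported with trained.export_to_sklearn_pipeline() (per the Lale docs)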
This notebook will discuss how Lale takes advantage of schemas, and employs compiler techniques to generate search spaces for different optimizer backends, for early error reporting, and for automatic documentation generation.
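One user-visible payoff is early error reporting: constructing an operator with an out-of-range hyperparameter fails at configuration time, before any training run. A sketch, assuming Lale surfaces these as jsonschema validation errors:

    import jsonschema
    from lale.lib.sklearn import LogisticRegression

    try:
        _ = LogisticRegression(C=-1.0)  # the schema requires C to be positive
    except jsonschema.ValidationError as e:
        print("caught before any training:", e.message)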
We will wrap up the tutorial with examples of some research directions: (1) batch-wise training of pipelines for datasets that do not fit in main memory, and (2) grammars to define recursive search spaces. We will cover examples of when and how these capabilities can be used.
We would like to thank Guillaume Baudart (DI ENS, École normale supérieure, PSL University, CNRS, INRIA, France), Michael Feffer (Carnegie Mellon University, USA), Louis Mandel (IBM Research, USA), Chirag Sahni (Rensselaer Polytechnic Institute, USA), and Vaibhav Saxena (IBM Research, India) for their valuable contributions to Lale which will be included in the tutorial.
@InProceedings{kate_et_al_2022,
author = "Kate, Kiran and Hirzel, Martin and Ram, Parikshit and Shinnar, Avraham and Tsay, Jason",
title = "Gradual {AutoML} using {Lale}",
booktitle = "Tutorial at the Conference on Knowledge Discovery and Data Mining (KDD-Tutorial)",
year = 2022,
month = aug,
url = "https://doi.org/10.1145/3534678.3542630" }