EBM feature binarizer

A risk-score model can only be built from binary features, yet most datasets mix continuous, categorical, and binary features.
The binary features must therefore either be engineered manually or constructed automatically.
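As a minimal illustration of the manual route (assuming a small pandas DataFrame with hypothetical column names):

```python
import pandas as pd

# Hypothetical dataset mixing continuous, categorical and binary features
df = pd.DataFrame({
    "age": [23, 47, 65, 34],               # continuous
    "smoker": ["yes", "no", "no", "yes"],  # categorical
    "diabetic": [0, 1, 0, 1],              # already binary
})

# Manually engineered binary features (thresholds chosen by hand)
binary = pd.DataFrame({
    "age>=45": (df["age"] >= 45).astype(int),
    "age>=60": (df["age"] >= 60).astype(int),
    "smoker": (df["smoker"] == "yes").astype(int),
    "diabetic": df["diabetic"],
})
print(binary)
```

Choosing the thresholds by hand quickly becomes tedious and arbitrary, which motivates the automatic construction described next.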

Generalized Additive Model (GAM)

A risk-score model is an additive model: each binary feature contributes to the prediction through its coefficient, and the contributions are summed. Binary features therefore need to be crafted so that they are predictive under this additive structure. This is exactly the structure of a Generalized Additive Model (GAM), which is why the AutoBinarizer class is based on a GAM fitted to predict the binary target.
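Written out, the correspondence looks as follows (a sketch using my own notation, not the package's):

```latex
% Risk score: a weighted sum of binary features b_j(x) \in \{0, 1\}
\mathrm{score}(x) = \beta_0 + \sum_{j} \beta_j \, b_j(x)

% GAM for binary classification: additive shape functions on the log-odds scale
\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = f_0 + \sum_{j} f_j(x_j)
```

If each shape function $f_j$ is piecewise constant, it can be rewritten as a weighted sum of interval indicators, which is exactly a set of binary features with coefficients.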

Explainable Boosting Machine (EBM)

The GAM used is an Explainable Boosting Machine (EBM), from the fantastic interpretML package.

EBMs are perfect candidates for such a task, because each individual feature function in an EBM is a single-feature tree. Below is an example from the interpretML package for the feature Age.

We can observe that each single-feature tree defines intervals on the feature domain, with a constant value on each interval. In a binary classification setting, that constant is the log-odds contribution added for a sample whose feature value falls in that interval.

[Figure: EBM shape function for the Age feature, from the interpretML package]
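A minimal sketch of how such a shape function can be inspected on a fitted EBM. The attribute names (`bins_`, `term_scores_`) follow interpret >= 0.3 and may differ in other versions:

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

# Toy data: one continuous feature ("age") and a binary target
rng = np.random.default_rng(0)
age = rng.uniform(18, 90, size=500)
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-(age - 50) / 10))).astype(int)

ebm = ExplainableBoostingClassifier(feature_names=["age"])
ebm.fit(age.reshape(-1, 1), y)

# The fitted shape function of a main-effect term is exposed as cut points
# (ebm.bins_) and per-interval scores (ebm.term_scores_).
cuts = ebm.bins_[0][0]        # interval boundaries on the feature domain
scores = ebm.term_scores_[0]  # log-odds per bin (includes missing/unknown bins)
print("cut points:", cuts)
print("per-interval log-odds:", scores)
```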

The EBMBinarizer class extracts every plateau of the single-feature trees of the fitted EBM. Plateaus with a negative log-odds value can be filtered out with the keep_negative parameter. The number of plateaus per single-feature tree, and thus the number of binary features extracted per continuous feature, is bounded by the max_number_binaries_by_features parameter.
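To make the idea concrete, here is a simplified, self-contained sketch of the extraction logic — my own illustration, not the package's implementation. It reuses the keep_negative and max_number_binaries_by_features parameter names described above; bounding the plateau count by keeping the largest-magnitude plateaus is one plausible reading (the package may instead bound it when fitting the EBM):

```python
import numpy as np

def extract_binary_features(cuts, scores, feature_name,
                            keep_negative=True,
                            max_number_binaries_by_features=None):
    """Turn the plateaus of one EBM shape function into binary features.

    cuts   : interval boundaries on the feature domain (n - 1 values for n plateaus)
    scores : log-odds of each plateau (missing/unknown bins already removed)
    Returns a list of (feature description, log-odds) pairs.
    """
    edges = np.concatenate([[-np.inf], cuts, [np.inf]])
    features = []
    for lo, hi, log_odds in zip(edges[:-1], edges[1:], scores):
        if not keep_negative and log_odds < 0:
            continue  # drop plateaus that decrease the predicted risk
        features.append((f"{lo:.3g} <= {feature_name} < {hi:.3g}", log_odds))
    # One way to bound the count: keep the plateaus with the largest
    # absolute contribution (an assumption, not the package's confirmed rule)
    if max_number_binaries_by_features is not None:
        features.sort(key=lambda f: abs(f[1]), reverse=True)
        features = features[:max_number_binaries_by_features]
    return features
```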

Info

The log-odds value associated with each binary feature created is also extracted. It can be used as a filter to select features afterwards.
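For instance, a hypothetical post-selection step on the extracted (description, log-odds) pairs:

```python
# Suppose `features` holds (description, log_odds) pairs as in the sketch above
features = [("age >= 60", 1.4), ("age < 35", -0.9), ("45 <= age < 60", 0.2)]

# Keep the two features with the strongest contribution, in either direction
top = sorted(features, key=lambda f: abs(f[1]), reverse=True)[:2]
print(top)  # [('age >= 60', 1.4), ('age < 35', -0.9)]
```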