Scorepyo risk-score model#

Primer on risk-score model#

Let’s start with the introduction of risk-score model from Wikipedia:

Risk score (or risk scoring) is the name given to a general practice in applied statistics, bio-statistics, econometrics and other related disciplines, of creating an easily calculated number (the score) that reflects the level of risk in the presence of some risk factors (e.g. risk of mortality or disease in the presence of symptoms or genetic profile, risk of financial loss considering credit and financial history, etc.).

A risk-score model is composed of:

  • a feature-point card, indicating points for each feature

  • a score card, associating a risk to a score

Here is CHADS2, an example of a risk-score model for the stroke risk of people with a specific condition (Image source):



What motivates the use of such model?

Motivation#

Risk-score models sacrifice some classification performance in favor of the ability to:

  • understand and trust the model's decisions

  • investigate fairness issues

  • satisfy legal requirements

  • memorize the model

  • use it with pen and paper

The extreme interpretability comes from 3 properties: binary features, additivity, and sparse small-integer coefficients.
The binary features are created independently of the model. The additive aspect, however, can be found in logistic regression, and the sparsity of the coefficients can be tuned via L1 regularization. A first simple approach could therefore be to fit a logistic regression with L1 regularization and round the coefficients in order to obtain small integers.

Logistic regression approach

For further explanation of the drawbacks of this coefficient-rounding approach, see the introduction of the NeurIPS 2022 paper FasterRisk: Fast and Accurate Interpretable Risk Scores, which addresses this issue.
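
For illustration, here is a minimal sketch of that naive baseline: binarize the features, fit an L1-penalized logistic regression, and round the coefficients to small integers. The binarizer, dataset and regularization strength below are arbitrary choices for the example, not the Scorepyo approach.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import KBinsDiscretizer

# Toy data and a crude binarization (2 quantile bins per feature, one-hot encoded)
X, y = load_breast_cancer(return_X_y=True)
X_bin = KBinsDiscretizer(n_bins=2, encode="onehot-dense", strategy="quantile").fit_transform(X)

# L1 regularization encourages sparse coefficients
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_bin, y)

# Naive rounding of the coefficients to obtain integer "points"
rounded_points = np.round(clf.coef_[0]).astype(int)
print(rounded_points)
```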

Architecture#

The problem can be decomposed into finding the subset of binary features, defining the points for each of them, and defining the probability associated with each possible score.

These 3 steps can be done at once or separately. The specificity of the Scorepyo package is that it does not rely on the logistic function to link scores and their associated probabilities. This allows a wider model search space, with more freedom on the model probabilities.

Scorepyo first combines binary feature selection and points definition, then defines the probability for each possible score.

We can represent the different steps as follows:

Binary feature selection and points definition#

The binary feature selection and points definition can be decomposed into 2 steps:

  • Ranking of binary features

  • Enumeration and maximization


Ranking of binary features#

Once the binary features are created, a binarizer such as EBMBinarizer stores information about the process, such as the associated log-odds, the density (number of samples positive on this binary feature) and the name of the original feature it comes from. All this information can be used to rank the features.

Suppose you have the following set of binary features with associated log-odds and density:

| binary_feature             | log_odds | density |
|----------------------------|----------|---------|
| worst concave points < 0.1 | 0.58     | 213     |
| mean texture < 18.81       | 0.55     | 130     |
| worst concavity >= 0.23    | -0.53    | 210     |
| area error < 23.93         | 0.45     | 200     |
| worst area < 684.55        | 0.44     | 213     |

A possible ranking would be to multiply the absolute value of the log-odds associated with the binary feature by the number of samples positive for this binary feature (its density). This would give the following table:

| binary_feature             | log_odds | density | sum_log_odds |
|----------------------------|----------|---------|--------------|
| worst concave points < 0.1 | 0.58     | 213     | 123.54       |
| mean texture < 18.81       | 0.55     | 130     | 71.5         |
| worst concavity >= 0.23    | -0.53    | 210     | 111.3        |
| area error < 23.93         | 0.45     | 200     | 90           |
| worst area < 684.55        | 0.44     | 213     | 93.72        |

The ranking of binary features would then be:

| binary_feature             | log_odds | density | sum_log_odds | rank |
|----------------------------|----------|---------|--------------|------|
| worst concave points < 0.1 | 0.58     | 213     | 123.54       | 1    |
| worst concavity >= 0.23    | -0.53    | 210     | 111.3        | 2    |
| worst area < 684.55        | 0.44     | 213     | 93.72        | 3    |
| area error < 23.93         | 0.45     | 200     | 90           | 4    |
| mean texture < 18.81       | 0.55     | 130     | 71.5         | 5    |

See here for more information on available rankers.
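
For illustration, the ranking above can be reproduced with a few lines of pandas. The dataframe below simply mimics the example table; its column names are chosen for the example and are not the Scorepyo binarizer API.

```python
import pandas as pd

# Example binarizer output (values taken from the table above)
df = pd.DataFrame(
    {
        "binary_feature": [
            "worst concave points < 0.1",
            "mean texture < 18.81",
            "worst concavity >= 0.23",
            "area error < 23.93",
            "worst area < 684.55",
        ],
        "log_odds": [0.58, 0.55, -0.53, 0.45, 0.44],
        "density": [213, 130, 210, 200, 213],
    }
)

# Rank by |log-odds| * density, largest value first
df["sum_log_odds"] = df["log_odds"].abs() * df["density"]
df["rank"] = df["sum_log_odds"].rank(ascending=False).astype(int)
print(df.sort_values("rank"))
```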

Once the features are ranked, we keep the top-\(k\) features according to the ranking, with \(k\) being a hyperparameter of the risk-score model. This hyperparameter controls a tradeoff between computation time and the size of the exploration.

Feel free to design any ranker you think would be more appropriate, and compare it to the existing ones. The package is designed to evolve by extending the possibilities of each component, and adding new rankers is part of that effort.


Binary features combination and points enumeration#

Now that the features are ranked, the next step is to enumerate all possible combinations of \(s\) binary features (the model size) within the top \(k\) binary features. This means that the enumeration will yield \(\binom{k}{s}\) combinations, which explains why \(k\) controls the tradeoff between exploration and computation time. Here is a figure illustrating the increase in the number of combinations tested when increasing \(k\), for the same model size \(s\).
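
As a quick numerical illustration of this growth, for a fixed model size \(s = 3\) (the values of \(k\) below are arbitrary):

```python
from math import comb

# Number of feature subsets of size s explored for increasing k
s = 3
for k in [5, 10, 15, 20, 30]:
    print(f"k={k:2d}: C(k, s) = {comb(k, s)} combinations")
```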

For each combination of \(s\) binary features among the top \(k\), Scorepyo will enumerate all possible point combinations. This corresponds to the Cartesian product of a range of integers for each binary feature.

In order to lower the number of possibilities as much as possible, the range of integers for each binary feature varies. If the log-odds value associated with a binary feature is negative, the range of possible integer values only goes from the minimum point value allowed by the model to -1. Conversely, if the associated log-odds value is positive, the range goes from 1 to the maximum point value allowed by the model.

You will find below an example of point ranges for 3 features, with a minimum point value of -2 and a maximum point value of 2.

| binary_feature             | log_odds | range_possible_points |
|----------------------------|----------|-----------------------|
| worst concave points < 0.1 | 0.58     | [1,2]                 |
| worst concavity >= 0.23    | -0.53    | [-2,-1]               |
| worst area < 684.55        | 0.44     | [1,2]                 |


This will lead to the following point combinations for these 3 features:

| worst concave points < 0.1 | worst concavity >= 0.23 | worst area < 684.55 |
|----------------------------|-------------------------|---------------------|
| 1                          | -2                      | 1                   |
| 1                          | -2                      | 2                   |
| 1                          | -1                      | 1                   |
| 1                          | -1                      | 2                   |
| 2                          | -2                      | 1                   |
| 2                          | -2                      | 2                   |
| 2                          | -1                      | 1                   |
| 2                          | -1                      | 2                   |

We can observe how the number of combinations explodes and why constructing a risk-score model is a combinatorial problem. This highlights the importance of the quality of the ranker, which should put the best features to work with at the top.
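
For illustration, here is a minimal sketch with itertools that reproduces the 8 point combinations from the example above and shows the feature-subset enumeration; Scorepyo's internals do not work exactly this way (they rely on numpy and Dask, see below), this only illustrates the combinatorics.

```python
from itertools import combinations, product

# Point ranges from the example table: positive log-odds -> [1, 2],
# negative log-odds -> [-2, -1] (minimum point -2, maximum point 2)
point_ranges = {
    "worst concave points < 0.1": [1, 2],
    "worst concavity >= 0.23": [-2, -1],
    "worst area < 684.55": [1, 2],
}

# Cartesian product of the per-feature ranges: 2 * 2 * 2 = 8 point combinations
for points in product(*point_ranges.values()):
    print(dict(zip(point_ranges.keys(), points)))

# Feature-subset enumeration: all subsets of size s among the top k features
top_k_features = ["f1", "f2", "f3", "f4", "f5"]  # placeholder feature names
s = 3
print(len(list(combinations(top_k_features, s))))  # C(5, 3) = 10
```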


We can now compute the score (i.e. the sum of points) of each sample for each point combination. The combination of features and points that is kept is the one with the best value on the defined metric, which is maximized.

The scores naturally rank the samples. As the scores are not probabilities yet, ranking metrics such as ROC-AUC or Average Precision are preferable at this stage. As with the ranking step, any custom metric based on a list of integer scores and a binary target can be used.

Fast numba ROC-AUC

Scorepyo includes a fast numba implementation of ROC-AUC taken from this repo.
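
For illustration, ranking metrics can be computed directly on the integer scores, for instance with scikit-learn (toy values below, not the numba implementation mentioned above):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy integer scores (sums of points) and binary labels
scores = np.array([3, -1, 2, 0, 4, 1])
y_true = np.array([1, 0, 1, 0, 1, 0])

# Ranking metrics accept raw scores, no probabilities needed at this stage
print(roc_auc_score(y_true, scores))
print(average_precision_score(y_true, scores))
```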


Enumerating all combinations one by one takes a lot of time but fits in memory, while computing all combinations at once with numpy is more efficient but can take too much memory. The Dask package was therefore chosen to select the best combination of features and points, as it can work out-of-core if needed and parallelizes the different computations as much as possible.

Probability calibration#

At this stage, the binary features and their associated points have been selected, so given a calibration set, each sample has a score. For each possible score \(i\), we know the number of positive-label samples (\(n^+_i\)) and negative-label samples (\(n^-_i\)). A perfectly calibrated model on the training set would output a probability of \(\frac{n^+_i}{n^+_i+n^-_i}\) for each score \(i\). This is also the solution that minimizes the log loss (see proof here).

However, a risk-score model must respect an ordering constraint: the probability must increase with the score. In packages like risk-slim or fasterrisk, this is enforced by applying the logistic function to the score, possibly with an additional multiplier term for fasterrisk.

Scorepyo performs probability calibration by finding, for each score, the probabilities that respect the ordering constraint and optimize the log loss. This problem is convex and is solved with the CVXPY package.
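
Here is a minimal sketch of this convex formulation with CVXPY, on made-up counts per score; it illustrates the constrained log-loss minimization, not the exact VanillaCalibrator implementation.

```python
import cvxpy as cp
import numpy as np

# Made-up counts per possible score, ordered from lowest to highest score:
# n_pos[i], n_neg[i] = number of positive / negative samples with score i
n_pos = np.array([1, 5, 3, 15, 20])
n_neg = np.array([20, 10, 14, 4, 1])

p = cp.Variable(len(n_pos))  # one probability per possible score

# Negative log-likelihood (log loss up to a constant factor)
nll = -(n_pos @ cp.log(p) + n_neg @ cp.log(1 - p))

# Ordering constraint: the probability must not decrease with the score
constraints = [cp.diff(p) >= 0]

cp.Problem(cp.Minimize(nll), constraints).solve()
print(np.round(n_pos / (n_pos + n_neg), 3))  # unconstrained empirical ratios
print(np.round(p.value, 3))                  # calibrated, ordered probabilities
```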

Calibration set

You can give the calibration set via arguments in the fit method. If none are provided, calibration will be done on the training set.

What is described here is done by the VanillaCalibrator class. As with the ranker or the enumeration-maximization metric, you can create your own calibrator class, as long as it respects the convention defined in the package.

As examples, Scorepyo provides 2 additional calibrators that are useful when the log loss obtained on the calibration set deteriorates too much on a test set, mostly due to the calibration set size. The BootstrappedCalibrator bootstraps the calibration set in order to obtain multiple datasets coming from the same distribution. After bootstrapping a given number of datasets, it finds the probabilities that either:

  • optimize the average log loss over all bootstrapped datasets

  • optimize the worst log loss among all bootstrapped datasets

This approach helps by accounting for uncertainty in the dataset used for calibration and reduces overfitting, especially if you are using the training set for calibration because of a small dataset. A sketch of the averaged variant is given below.
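
Here is a minimal sketch of that averaged bootstrapped variant, built on the same CVXPY formulation as above; the data, number of bootstraps and variable names are made up for the example and do not mirror the BootstrappedCalibrator code.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)

# Toy calibration set: 40 samples per score (0..4), positive rate increasing with the score
scores = np.repeat(np.arange(5), 40)
base_rate = np.array([0.15, 0.3, 0.5, 0.7, 0.85])
labels = rng.binomial(1, base_rate[scores])

n_boot, n_scores = 20, 5
p = cp.Variable(n_scores)  # one probability per possible score, shared across bootstraps
losses = []
for _ in range(n_boot):
    idx = rng.integers(0, len(scores), size=len(scores))  # bootstrap resample
    s_b, y_b = scores[idx], labels[idx]
    n_pos = np.array([np.sum(y_b[s_b == s]) for s in range(n_scores)])
    n_neg = np.array([np.sum(1 - y_b[s_b == s]) for s in range(n_scores)])
    losses.append(-(n_pos @ cp.log(p) + n_neg @ cp.log(1 - p)))

# Average log loss over the bootstrapped datasets
# (the worst-case variant would minimize cp.maximum(*losses) instead)
cp.Problem(cp.Minimize(sum(losses) / n_boot), [cp.diff(p) >= 0]).solve()
print(np.round(p.value, 3))
```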