Partial dependence plots: https://journal.r-project.org/archive/2017/RJ-2017-016/RJ-20...
Locally-interpretable model-agnostic explanations (LIME): https://www.oreilly.com/learning/introduction-to-local-inter...
"“Why Should I Trust You?” Explaining the Predictions of Any Classifier": https://arxiv.org/pdf/1602.04938.pdf
https://homes.cs.washington.edu/~marcotcr/blog/lime/
https://github.com/marcotcr/lime
Any time someone makes a snide HN comment like "oh, you can't understand why neural networks make predictions," the correct response should always be "why doesn't LIME work in your specific case?"
LIME is being used within the EU to explain credit decisions and fraud-detection flagging on neural-network-based models, which is quite a high bar of regulatory oversight to pass.
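For anyone who wants to try it, here is a minimal sketch using the lime package linked above (https://github.com/marcotcr/lime). The classifier, data set, and parameters are just illustrative placeholders, not anything from the linked projects:

```python
# Minimal sketch of explaining one prediction with the lime package.
# The model, data set, and numbers here are illustrative, not prescriptive.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification")

# Explain a single test instance: LIME perturbs the instance, queries the
# model, and fits a sparse local linear model to the model's responses.
exp = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(exp.as_list())  # top features and their local weights
```

Because it only needs a predict function, the same call works on a neural network or anything else that outputs class probabilities.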
There is a video here https://www.youtube.com/watch?v=hUnRCxnydCc
I think this has some better examples than the Panda vs. Gibbon example in the OP if you want to 'see' why a model may classify a tree-frog as a tree-frog vs. a billiard (for example). IMO this suggests some level of anthropomorphizing is useful for understanding and building models, since the pixels the model picks up on aren't really too dissimilar from what I imagine a naive, simple mind might use (i.e. the tree-frog's goofy face). We like to look at faces for lots of reasons, but one of them is probably that they're usually more distinct, which is roughly the same reason the model likes the face. This is interesting (to me at least) even if it's just matrix multiplication (or uncrumpling high-dimensional manifolds) under the hood.
LIME[1] is a nice start, though.
[1] https://www.oreilly.com/learning/introduction-to-local-inter...
Obviously if this were done sloppily it would be a huge problem and could produce a ton of false positives. But that's not actually what happens. The idea that ML practitioners just fit crazy complicated models to data and blindly believe whatever the model fits seems to be a common stereotype, but it is completely inaccurate. We are acutely aware that powerful models can overfit all too easily, and we spend perhaps the majority of our time understanding and fighting this exact phenomenon. Because we tend to work with models for which few closed-form analytic theorems exist, we tend to do this empirically, but no less rigorously. In fact, we tend to be more scientific and rely on fewer assumptions than classical statistics does.
The dominant paradigm is empirical risk minimization, sometimes called structural risk minimization[SRM] when model complexity is explicitly penalized. The idea is to acknowledge that models are always fit to one particular sample from the population, but that the goal is to generalize to the full population. We can never truly evaluate a model on the whole population, but we can form an empirical estimate of how well our model will do by taking a new sample from the population (one not used for fitting/training) and evaluating model performance on this new sample. Computational learning theories such as VC Theory[VC] and Probably Approximately Correct learning[PAC] provide theorems that bound how tight these empirical estimates are. For example, VC Theory and Hoeffding's Inequality[HI] can give us an upper bound on the gap between "true" performance and this empirical estimate for a binary classifier, in terms of the number of observations used to measure performance and the "VC dimension" (roughly, the number of parameters) of the model.
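To make "how tight" concrete, here's a back-of-the-envelope sketch of the single-hypothesis Hoeffding bound alluded to above: for a fixed classifier evaluated on n fresh i.i.d. observations, the chance that held-out error and true error differ by more than epsilon is at most 2*exp(-2*n*epsilon^2). (The full VC bound replaces the constant with a growth-function term; the numbers below are purely illustrative.)

```python
# Back-of-the-envelope Hoeffding bound for a *fixed* classifier evaluated on
# a held-out sample it was never fit on. Sample sizes are illustrative.
import math

def hoeffding_gap(n_test, delta=0.05):
    """Return epsilon such that P(|held-out error - true error| > epsilon) <= delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_test))

for n in (100, 1_000, 10_000):
    print(f"n_test={n:>6}: true error within +/- {hoeffding_gap(n):.3f} "
          f"of held-out error, with 95% confidence")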
A typical SRM workflow would be to divide a data set into "training," "validation," and "test" sets, fit a set of candidate models to the training set, estimate their performance on the validation set, select the best model based on validation-set performance[MS], then evaluate the final model's performance on the test set (a sketch of this workflow follows the links below). This procedure can be used on arbitrary models to demonstrate the validity of the fitted model. For example, a model which is just randomly picking 5 genes based on noise in the training set is extremely unlikely to perform better than chance on the final test set.
[SRM]: http://www.svms.org/srm/
[VC]: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_th...
[PAC]: https://en.wikipedia.org/wiki/Probably_approximately_correct...
[HI]: https://people.cs.umass.edu/~domke/courses/sml2010/10theory....
[MS]: https://en.wikipedia.org/wiki/Model_selection
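Here's a minimal sketch of that train/validation/test workflow using scikit-learn; the data set, candidate models, and split sizes are placeholders, not a recommendation:

```python
# Minimal sketch of the train/validation/test workflow described above.
# Data set, candidate models, and split proportions are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit every candidate on the training set, score each on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

best_name = max(val_scores, key=val_scores.get)

# The test set is touched exactly once, by the selected model only, so it
# remains an honest estimate of generalization performance.
test_score = accuracy_score(y_test, candidates[best_name].predict(X_test))
print(best_name, val_scores, test_score)
```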
Not every machine learning practitioner is familiar with VC Theory or PAC, but almost everyone uses the practical tools[CV] and language[BV] that arose from SRM (a cross-validation sketch follows the links below). If you're following Andrew Ng's or Max Kuhn's advice[NG][MK] on "best practices," you are in fact benefiting from VC Theory even if you have never heard of it.
[CV]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
[BV]: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
[NG]: https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599
[MK]: http://appliedpredictivemodeling.com/
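And the cross-validation tool itself, in one short sketch (again with a placeholder data set and model):

```python
# Minimal k-fold cross-validation sketch, the practical tool referenced above.
# Data and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
model = GradientBoostingClassifier(random_state=0)

# Each fold is held out once; the mean and spread of the fold scores give a
# cheap empirical window onto the bias/variance behavior of the model.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```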
So that's my answer to the question of validity: ML researchers use different techniques, but those techniques have equally good theoretical foundations, make very few assumptions, and are very robust in practice. If researchers aren't using these techniques, or are abusing them, it's not because ML is unsatisfactory or broken, but because of the same perverse incentives we see everywhere in academia.
There's another criticism floating around that ML models are "black boxes," useful only for prediction and totally opaque. This is only true because non-linear things are harder to understand, and to the extent that it is true, it is equally true of classical models. A linear model with lots of quadratic and interaction terms, or a model fit on stratified bands, or a hierarchical model, can be just as hard to interpret. A properly regularized ML model only fits a crazy non-linear boundary when the data themselves require it. A classical model fit to the same data will either have to exhibit the same non-linearity or will be badly wrong. A lot of research papers are wrong because someone fit a straight line to curved data!
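A toy illustration of that last point (the data and models below are made up for illustration, not taken from any paper): fit a straight line and a flexible model to the same curved data and compare held-out error.

```python
# Toy illustration of "a straight line fit to curved data": both models see
# the same sample; only the one flexible enough to capture the curvature
# generalizes. Data and models are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=1000)  # clearly non-linear

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mse, 3))
```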
I also think the "totally opaque black box" meme is overstated. We can often understand even very complex models to some degree with a little effort. A basic technique is to run k-means with a high k, say 100, to select a number of "representative" examples from your training set and look at the model's predictions for each. It's also incredibly instructive just to look at a sample of 100 examples the model got wrong. One way to understand a non-linear response surface is to focus on different regions where the behavior is locally linear and try perturbations[LIME]. There are also ML methods that fit easy-to-understand models[MARS]. And it's usually possible to visualize the low-level features[DFV]. (A sketch of the first two tricks follows the links below.)
[LIME]: https://www.oreilly.com/learning/introduction-to-local-inter...
[MARS]: https://en.wikipedia.org/wiki/Multivariate_adaptive_regressi...
[DFV]: https://distill.pub/2017/feature-visualization/
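A minimal sketch of the first two inspection tricks above, with placeholder data, model, and k:

```python
# Sketch of two of the inspection tricks above: (1) "representative" training
# examples via k-means, (2) a sample of cases the model got wrong.
# Data, model, and k are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import pairwise_distances_argmin
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# (1) 100 representative training examples: the points closest to each
# k-means centroid, together with the model's predictions for them.
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_train)
rep_idx = pairwise_distances_argmin(km.cluster_centers_, X_train)
rep_preds = model.predict(X_train[rep_idx])

# (2) A sample of up to 100 test examples the model got wrong.
wrong = np.flatnonzero(model.predict(X_test) != y_test)
sample_wrong = wrong[:100]

print(len(rep_idx), "representatives;", len(sample_wrong), "errors to inspect")
```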