olooney · 2019-02-17 · Original thread
I would guess both you and the author of the article have in mind something like gene expression[GE] in bioinformatics. Thousands of markers, but only hundreds of examples, and the researchers are using some automated feature selection approach[LB] to pick genes that predict some disease.
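
To make that concrete, here's a minimal sketch of the kind of Lasso-based selection I mean. The data are synthetic stand-ins (the sample sizes, penalty strength, and "10 true genes" are all just illustrative assumptions), but the shape of the problem is the same: far more markers than samples, and an L1 penalty that zeroes out most coefficients.

    # Sketch: L1-penalized (Lasso) feature selection on a wide, short dataset.
    # Synthetic stand-in for gene expression: 200 samples, 5000 markers.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n_samples, n_features = 200, 5000
    X = rng.normal(size=(n_samples, n_features))

    # Only 10 "genes" actually drive the (continuous) outcome.
    true_coef = np.zeros(n_features)
    true_coef[:10] = rng.normal(loc=3.0, size=10)
    y = X @ true_coef + rng.normal(scale=1.0, size=n_samples)

    # The L1 penalty drives most coefficients to exactly zero,
    # leaving a small set of "selected" markers.
    model = Lasso(alpha=0.5, max_iter=5000).fit(X, y)
    selected = np.flatnonzero(model.coef_)
    print(f"{len(selected)} markers selected out of {n_features}")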

[GE]: https://en.wikipedia.org/wiki/Machine_learning_in_bioinforma...

[LB]: https://www.quora.com/How-is-Lasso-method-used-in-bioinforma...

Obviously if this were done sloppily it would be a huge problem and could produce a ton of false positives. But that's not actually what happens. The idea that ML practitioners just fit crazy complicated models to data and blindly believe whatever the model says seems to be a common stereotype, but it's completely inaccurate. We are acutely aware that powerful models can overfit all too easily, and we spend perhaps the majority of our time understanding and fighting this exact phenomenon. Because we work with models for which few closed-form analytic theorems exist, we tend to do this empirically, but no less rigorously. In fact, we tend to be more scientific and rely on fewer assumptions than classical statistics.

The dominant paradigm is empirical risk minimization, sometimes called structural risk minimization[SRM] when complexity is explicitly penalized. The idea is to acknowledge that models are always fit to one particular sample from the population, but that the goal is to generalize to the full population. We can never truly evaluate a model on the whole population, but we can form an empirical estimate of how well our model will do by taking a new sample from the population (one not used for fitting/training) and evaluating model performance on it. Computational learning theories such as VC Theory[VC] and Probably Approximately Correct learning[PAC] provide theorems that bound how tight these empirical estimates are. For example, VC Theory and Hoeffding's Inequality[HI] give an upper bound on the gap between "true" performance and this empirical estimate for a binary classifier, in terms of the number of observations used to measure performance and the "VC dimension" (roughly, the number of parameters) of the model.
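
As a back-of-the-envelope illustration (my numbers, not the article's): for a single fixed classifier evaluated on a held-out sample of size n, Hoeffding's inequality says that with probability at least 1 - delta, the gap between measured accuracy and true accuracy is at most sqrt(ln(2/delta) / (2n)).

    # Hoeffding bound on the gap between held-out accuracy and true accuracy
    # for one fixed binary classifier: with probability >= 1 - delta,
    #   |true_error - test_error| <= sqrt(ln(2/delta) / (2*n))
    import math

    def hoeffding_gap(n, delta=0.05):
        return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    for n in (100, 1000, 10000):
        print(n, round(hoeffding_gap(n), 3))
    # e.g. with 1000 test examples the gap is at most ~0.043 at 95% confidence.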

A typical SRM workflow is to divide a data set into "training," "validation," and "test" sets, fit a set of candidate models to the training set, estimate their performance on the validation set, select the best model based on that validation performance[MS], and then evaluate the final model on the test set. This procedure can be applied to arbitrary models and demonstrates the validity of the resulting fit. For example, a model which is just randomly picking 5 genes based on noise in the training set is extremely unlikely to perform better than chance on the final test set.
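
Here's a minimal sketch of that workflow in scikit-learn. The data, the candidate models, and the split ratios are all just placeholders to show the shape of the procedure:

    # Sketch of the train / validation / test workflow described above.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=600, n_features=200,
                               n_informative=10, random_state=0)

    # 60% train, 20% validation, 20% test.
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4,
                                                      random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                    random_state=0)

    # Candidate models: L1-penalized logistic regression at several strengths.
    candidates = {C: LogisticRegression(penalty="l1", solver="liblinear", C=C)
                  for C in (0.01, 0.1, 1.0, 10.0)}
    for model in candidates.values():
        model.fit(X_train, y_train)

    # Select on the validation set...
    best_C = max(candidates,
                 key=lambda C: accuracy_score(y_val, candidates[C].predict(X_val)))

    # ...and report performance once, on the untouched test set.
    test_acc = accuracy_score(y_test, candidates[best_C].predict(X_test))
    print(f"chose C={best_C}, test accuracy={test_acc:.3f}")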

[SRM]: http://www.svms.org/srm/

[VC]: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_th...

[PAC]: https://en.wikipedia.org/wiki/Probably_approximately_correct...

[HI]: https://people.cs.umass.edu/~domke/courses/sml2010/10theory....

[MS]: https://en.wikipedia.org/wiki/Model_selection

Not every machine learning practitioner is familiar with VC Theory or PAC, but almost everyone uses the practical tools[CV] and language[BV] that arose from SRM. If you're following Andrew Ng's or Max Kuhn's advice[NG][MK] on "best practices" you are in fact benefiting from VC Theory although you may never have heard of it.
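
For example, plain k-fold cross-validation, the workhorse from that toolbox, is a few lines in scikit-learn (again just a sketch on synthetic data):

    # 5-fold cross-validation: the everyday tool that the SRM/VC framework justifies.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                               random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    # Each fold is scored on data the model never saw during fitting.
    print(scores.mean(), scores.std())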

[CV]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)

[BV]: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

[NG]: https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599

[MK]: http://appliedpredictivemodeling.com/

So that's my answer to the question of validity: ML researchers use different techniques, but those techniques have equally good theoretical foundations, make very few assumptions, and are very robust in practice. If researchers aren't using these techniques, or are abusing them, it's not because ML is unsatisfactory or broken, but because of the same perverse incentives we see everywhere in academia.

There's another criticism floating around that ML models are "black boxes," useful only for prediction and totally opaque. This is only true because non-linear things are harder to understand, and to the extent that it is true, it is equally true of classical models. A linear model with lots of quadratic and interaction terms, a model fit on stratified bands, or a hierarchical model can be just as hard to interpret. A properly regularized ML model only fits a crazy non-linear boundary when the data themselves require it. A classical model fit to the same data will either have to exhibit the same non-linearity or will be badly wrong. A lot of research papers are wrong because someone fit a straight line to curved data!
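
A toy example of that last failure mode (made-up data, obviously): fit a straight line to a quadratic relationship and the linear model is badly wrong, while a flexible non-linear model (a random forest here) only bends because the data bend.

    # A straight line fit to curved data vs. a flexible non-linear model.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=300)   # genuinely curved

    X_test = rng.uniform(-3, 3, size=(300, 1))
    y_test = X_test[:, 0] ** 2 + rng.normal(scale=0.3, size=300)

    line = LinearRegression().fit(X, y)
    forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

    print("linear R^2:", round(r2_score(y_test, line.predict(X_test)), 2))
    print("forest R^2:", round(r2_score(y_test, forest.predict(X_test)), 2))
    # The straight line scores near zero; the non-linearity is in the data,
    # not an artifact of the model class.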

I also think the "totally opaque black box" meme is overstated. We can often understand even very complex models to some degree with a little effort. A basic technique is to run k-means with a high k, say 100, to select a number of "representative" examples from your training set and look at the model's predictions for each. It's also incredibly instructive just to look at a sample of 100 examples the model got wrong. One way to understand a non-linear response surface is to focus on different regions where the behavior is locally linear and try perturbations[LIME]. There are also ML algorithms which fit easy-to-understand models[MARS]. It's also usually possible to visualize the low-level features[DFV].
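
A sketch of the first two of those techniques (k-means representatives and inspecting errors), with placeholder data and a placeholder model:

    # Probing a fitted "black box": look at representative points and at mistakes.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.cluster import KMeans

    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.05,
                               random_state=0)
    X_train, X_heldout, y_train, y_heldout = train_test_split(X, y, random_state=0)
    model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

    # 1. k-means with a high k picks "representative" training examples
    #    (the point nearest each cluster center); inspect predictions there.
    km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_train)
    rep_idx = np.argmin(km.transform(X_train), axis=0)
    rep_probs = model.predict_proba(X_train[rep_idx])[:, 1]
    print("spread of predictions across representatives:",
          round(rep_probs.min(), 2), "to", round(rep_probs.max(), 2))

    # 2. Pull a sample of held-out examples the model got wrong and eyeball them.
    wrong = np.flatnonzero(model.predict(X_heldout) != y_heldout)
    sample = wrong[:100]
    print(len(wrong), "errors on held-out data; inspecting", len(sample), "of them")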

[LIME]: https://www.oreilly.com/learning/introduction-to-local-inter...

[MARS]: https://en.wikipedia.org/wiki/Multivariate_adaptive_regressi...

[DFV]: https://distill.pub/2017/feature-visualization/