Partial dependence plots: https://journal.r-project.org/archive/2017/RJ-2017-016/RJ-20...
Locally-interpretable model-agnostic explanations (LIME): https://www.oreilly.com/learning/introduction-to-local-inter...
"“Why Should I Trust You?” Explaining the Predictions of Any Classifier": https://arxiv.org/pdf/1602.04938.pdf
https://homes.cs.washington.edu/~marcotcr/blog/lime/
https://github.com/marcotcr/lime
Any time someone makes a snide HN comment like "oh, you can't understand why neural networks make predictions," the correct response should always be "why doesn't LIME work in your specific case?"
LIME is being used within the EU to explain credit decisions and fraud-detection flagging on neural-network-based models, which is quite a high bar of regulatory oversight to pass.
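For anyone who wants to try it, here is a minimal sketch using the lime package linked above (https://github.com/marcotcr/lime). The classifier, data set, and parameters are just illustrative placeholders, not anything from the linked projects:

```python
# Minimal sketch of explaining one prediction with the lime package.
# The model, data set, and numbers here are illustrative, not prescriptive.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification")

# Explain a single test instance: LIME perturbs the instance, queries the
# model, and fits a sparse local linear model to the model's responses.
exp = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(exp.as_list())  # top features and their local weights
```

Because it only needs a predict function, the same call works on a neural network or anything else that outputs class probabilities.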
There is a video here https://www.youtube.com/watch?v=hUnRCxnydCc
I think this has some better examples than the Panda vs. Gibbon example in the OP if you want to 'see' why a model may classify a tree-frog as a tree-frog vs. a billiard (for example). IMO this suggests some level of anthropomorphizing is useful for understanding and building models, since the pixels the model picks up on aren't really too dissimilar from what I imagine a naive, simple mind might use (i.e. the tree-frog's goofy face). We like to look at faces for lots of reasons, but one of them is probably that they're usually more distinct, which is roughly the same reason the model likes the face. This is interesting (to me at least) even if it's just matrix multiplication (or uncrumpling high-dimensional manifolds) under the hood.
LIME[1] is a nice start, though.
[1] https://www.oreilly.com/learning/introduction-to-local-inter...
Obviously if this were done sloppily it would be a huge problem and could produce a ton of false positives. But that's not actually what happens. The idea that ML practitioners just fit crazy complicated models to data and blindly believe whatever the model fits seems to be a common stereotype, but it is completely inaccurate. We are acutely aware that powerful models can overfit all too easily, and we spend perhaps the majority of our time understanding and fighting this exact phenomenon. Because we tend to work with models for which few closed-form analytic theorems exist, we tend to do this empirically, but no less rigorously. In fact, we tend to be more scientific and rely on fewer assumptions than classical statistics does.
The dominant paradigm is empirical risk minimization, sometimes called structural risk minimization[SRM] when model complexity is explicitly penalized. The idea is to acknowledge that models are always fit to one particular sample from the population, but that the goal is to generalize to the full population. We can never truly evaluate a model on the whole population, but we can form an empirical estimate of how well our model will do by taking a new sample from the population (one not used for fitting/training) and evaluating model performance on this new sample. Computational learning theories such as VC Theory[VC] and Probably Approximately Correct learning[PAC] provide theorems that bound how tight these empirical estimates are. For example, VC Theory and Hoeffding's Inequality[HI] can give us an upper bound on the gap between "true" performance and this empirical estimate for a binary classifier, in terms of the number of observations used to measure performance and the "VC dimension" (roughly, the number of parameters) of the model.
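To make "how tight" concrete, here's a back-of-the-envelope sketch of the single-hypothesis Hoeffding bound alluded to above: for a fixed classifier evaluated on n fresh i.i.d. observations, the chance that held-out error and true error differ by more than epsilon is at most 2*exp(-2*n*epsilon^2). (The full VC bound replaces the constant with a growth-function term; the numbers below are purely illustrative.)

```python
# Back-of-the-envelope Hoeffding bound for a *fixed* classifier evaluated on
# a held-out sample it was never fit on. Sample sizes are illustrative.
import math

def hoeffding_gap(n_test, delta=0.05):
    """Return epsilon such that P(|held-out error - true error| > epsilon) <= delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_test))

for n in (100, 1_000, 10_000):
    print(f"n_test={n:>6}: true error within +/- {hoeffding_gap(n):.3f} "
          f"of held-out error, with 95% confidence")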
A typical SRM workflow would be to divide a data set into "training," "validation," and "test" sets, fit a set of candidate models to the training set, estimate their performance on the validation set, select the best model based on validation-set performance[MS], then evaluate the final model's performance on the test set (a sketch of this workflow follows the links below). This procedure can be used on arbitrary models to demonstrate the validity of the fitted model. For example, a model which is just randomly picking 5 genes based on noise in the training set is extremely unlikely to perform better than chance on the final test set.
[SRM]: http://www.svms.org/srm/
[VC]: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_th...
[PAC]: https://en.wikipedia.org/wiki/Probably_approximately_correct...
[HI]: https://people.cs.umass.edu/~domke/courses/sml2010/10theory....
[MS]: https://en.wikipedia.org/wiki/Model_selection
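Here's a minimal sketch of that train/validation/test workflow using scikit-learn; the data set, candidate models, and split sizes are placeholders, not a recommendation:

```python
# Minimal sketch of the train/validation/test workflow described above.
# Data set, candidate models, and split proportions are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit every candidate on the training set, score each on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

best_name = max(val_scores, key=val_scores.get)

# The test set is touched exactly once, by the selected model only, so it
# remains an honest estimate of generalization performance.
test_score = accuracy_score(y_test, candidates[best_name].predict(X_test))
print(best_name, val_scores, test_score)
```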
Not every machine learning practitioner is familiar with VC Theory or PAC, but almost everyone uses the practical tools[CV] and language[BV] that arose from SRM (a cross-validation sketch follows the links below). If you're following Andrew Ng's or Max Kuhn's advice[NG][MK] on "best practices," you are in fact benefiting from VC Theory even if you have never heard of it.
[CV]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)
[BV]: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
[NG]: https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599
[MK]: http://appliedpredictivemodeling.com/
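And the cross-validation tool itself, in one short sketch (again with a placeholder data set and model):

```python
# Minimal k-fold cross-validation sketch, the practical tool referenced above.
# Data and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
model = GradientBoostingClassifier(random_state=0)

# Each fold is held out once; the mean and spread of the fold scores give a
# cheap empirical window onto the bias/variance behavior of the model.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```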
So that's my answer to the question of validity: ML researchers use different techniques, but those techniques have equally good theoretical foundations, make very few assumptions, and are very robust in practice. If researchers aren't using these techniques, or are abusing them, it's not because ML is unsatisfactory or broken, but because of the same perverse incentives we see everywhere in academia.
There's another criticism floating around that ML models are "black boxes," useful only for prediction and totally opaque. This is only true because non-linear things are harder to understand, and to the extent that it is true, it is equally true of classical models. A linear model with lots of quadratic and interaction terms, or a model fit on stratified bands, or a hierarchical model, can be just as hard to interpret. A properly regularized ML model only fits a crazy non-linear boundary when the data themselves require it. A classical model fit to the same data will either have to exhibit the same non-linearity or will be badly wrong. A lot of research papers are wrong because someone fit a straight line to curved data!
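A toy illustration of that last point (the data and models below are made up for illustration, not taken from any paper): fit a straight line and a flexible model to the same curved data and compare held-out error.

```python
# Toy illustration of "a straight line fit to curved data": both models see
# the same sample; only the one flexible enough to capture the curvature
# generalizes. Data and models are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=1000)  # clearly non-linear

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, round(mse, 3))
```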
I also think the "totally opaque black box" meme is overstated. We can often understand even very complex models to some degree with a little effort. A basic technique is to run k-means with a high k, say 100, to select a number of "representative" examples from your training set and look at the model's predictions for each. It's also incredibly instructive just to look at a sample of 100 examples the model got wrong. One way to understand a non-linear response surface is to focus on different regions where the behavior is locally linear and try perturbations[LIME]. There are also ML methods that fit easy-to-understand models[MARS]. And it's usually possible to visualize the low-level features[DFV]. (A sketch of the first two tricks follows the links below.)
[LIME]: https://www.oreilly.com/learning/introduction-to-local-inter...
[MARS]: https://en.wikipedia.org/wiki/Multivariate_adaptive_regressi...
[DFV]: https://distill.pub/2017/feature-visualization/
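A minimal sketch of the first two inspection tricks above, with placeholder data, model, and k:

```python
# Sketch of two of the inspection tricks above: (1) "representative" training
# examples via k-means, (2) a sample of cases the model got wrong.
# Data, model, and k are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import pairwise_distances_argmin
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# (1) 100 representative training examples: the points closest to each
# k-means centroid, together with the model's predictions for them.
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_train)
rep_idx = pairwise_distances_argmin(km.cluster_centers_, X_train)
rep_preds = model.predict(X_train[rep_idx])

# (2) A sample of up to 100 test examples the model got wrong.
wrong = np.flatnonzero(model.predict(X_test) != y_test)
sample_wrong = wrong[:100]

print(len(rep_idx), "representatives;", len(sample_wrong), "errors to inspect")
```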