Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Flach): http://www.amazon.com/Machine-Learning-Science-Algorithms-Se...
Machine Learning: A Probabilistic Perspective (Murphy): http://www.amazon.com/Machine-Learning-Probabilistic-Perspec...
Pattern Recognition and Machine Learning (Bishop): http://www.amazon.com/Pattern-Recognition-Learning-Informati...
There are some great resources/books for Bayesian statistics and graphical models. I've listed them in (approximate) order of increasing difficulty/mathematical complexity:
Think Bayes (Downey): http://www.amazon.com/Think-Bayes-Allen-B-Downey/dp/14493707...
Bayesian Methods for Hackers (Davidson-Pilon et al): https://github.com/CamDavidsonPilon/Probabilistic-Programmin...
Doing Bayesian Data Analysis (Kruschke), aka "the puppy book": http://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/dp...
Bayesian Data Analysis (Gelman): http://www.amazon.com/Bayesian-Analysis-Chapman-Statistical-...
Bayesian Reasoning and Machine Learning (Barber): http://www.amazon.com/Bayesian-Reasoning-Machine-Learning-Ba...
Probabilistic Graphical Models (Koller et al): https://www.coursera.org/course/pgm http://www.amazon.com/Probabilistic-Graphical-Models-Princip...
If you want a more mathematical/statistical take on Machine Learning, then the two books by Hastie/Tibshirani et al are definitely worth a read (plus, they're free to download from the authors' websites!):
Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/
The Elements of Statistical Learning: http://statweb.stanford.edu/~tibs/ElemStatLearn/
Obviously there is the whole field of "deep learning" as well! A good place to start is with: http://deeplearning.net/
Everyone hates picking priors in Bayesian analysis. If you pick an informative prior, you can always be criticized for it (in peer review, for a business decision, etc.). The usual dodge is to use a non-informative prior (like the Jeffreys prior [3]), but I interpret Gelman's point as saying this can also lead to bad decisions. Thus, Bayesian analysts must thread the needle between Scylla and Charybdis when picking priors. That's certainly a real pain point when using Bayesian methods.
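For reference, the Jeffreys prior [3] is the prior proportional to the square root of the determinant of the Fisher information (the standard definition, restated here only for context):

    \pi(\theta) \propto \sqrt{\det \mathcal{I}(\theta)}, \qquad
    \mathcal{I}(\theta)_{ij} = -\mathbb{E}\!\left[ \frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_i \, \partial \theta_j} \right]

For a Bernoulli likelihood this works out to Beta(1/2, 1/2): "non-informative" in the sense of being invariant under reparameterization, but it still piles mass near 0 and 1, so it is still a choice you can be criticized for.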
However, it's pretty much the same pain point as choosing regularization parameters (or choosing not to regularize at all) in frequentist statistics. For example, sklearn was recently criticized for turning on L2 regularization by default, which can be viewed as a violation of the principle of least surprise and causes practical problems when the inputs are not standardized. But leaving regularization turned off is equivalent to choosing a non-informative or even improper prior (informally in many cases, and formally identical for linear regression with normally distributed errors [4]). So Scylla and Charybdis still loom on either side.
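To spell out the formal case (the standard ridge/MAP correspondence, see [4]): for linear regression with Gaussian errors, the L2-regularized estimate is exactly the MAP estimate under a zero-mean Gaussian prior on the coefficients, and sending the penalty to zero corresponds to a flat (improper) prior:

    \hat{\beta}_{\text{ridge}}
      = \arg\min_{\beta} \, \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
      = \arg\max_{\beta} \, p(y \mid X, \beta)\, p(\beta),
    \quad \text{where } y \mid X, \beta \sim \mathcal{N}(X\beta, \sigma^2 I), \;\;
    \beta \sim \mathcal{N}\!\big(0, \tfrac{\sigma^2}{\lambda} I\big).

So "no regularization" isn't a way of avoiding the prior question; it just answers it implicitly with \lambda \to 0.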
My problem with Bayesian models, completely unrelated to Gelman's criticism, is that the normalizing constant (partition function) of the posterior is usually intractable, so fitting is really only amenable to sampling methods (MCMC with NUTS [5], for example). This makes Bayesian models computationally expensive to fit, which in turn limits them to (relatively) small data sets. But using a lot more data is the single best way to let a model get more accurate while avoiding over-fitting! That is why I live with the following contradiction: 1) I believe Bayesian models have better theoretical foundations, and 2) I almost always use non-Bayesian methods for practical problems.
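To make the cost concrete, here's a minimal sketch of what "fit it with MCMC" looks like in practice. It uses PyMC purely for illustration (Stan [1] would do equally well); the model, priors, and data are made up for the example:

    import numpy as np
    import pymc as pm  # illustrative choice; assumes PyMC >= 4

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)

    with pm.Model():
        beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=3)   # weakly informative prior
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        pm.Normal("obs", mu=pm.math.dot(X, beta), sigma=sigma, observed=y)
        # NUTS is the default sampler here; each posterior draw needs
        # gradient evaluations over the full data set, which is why this
        # costs far more than a closed-form ridge solve.
        idata = pm.sample(1000, tune=1000, chains=2)

Even on a toy problem like this, sampling takes noticeably longer than the corresponding frequentist fit, and the gap only grows with data size.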
[1]: https://mc-stan.org/
[2]: https://www.amazon.com/Bayesian-Analysis-Chapman-Statistical...
[3]: https://en.wikipedia.org/wiki/Jeffreys_prior
[4]: https://stats.stackexchange.com/questions/163388/l2-regulari...
[5]: http://www.stat.columbia.edu/~gelman/research/published/nuts...