https://www.amazon.com/Networks-Recognition-Advanced-Econome...
Another oldie-but-goodie is
https://www.amazon.com/Neural-Networks-Lecture-Computer-Scie...
which explains when to actually stop when you're doing "early stopping", something I've seen many modern deep learners fail to get right.
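I won't spoil the book, but one common early-stopping recipe (my own paraphrase, not necessarily the book's exact advice) is: track the best validation loss, stop after a fixed number of epochs without improvement, and keep the weights from the best epoch, not the last one. A minimal sketch:

```python
# Common early-stopping recipe: stop after `patience` epochs with no
# improvement in validation loss, and return the *best* epoch so you can
# restore that checkpoint instead of the final (overfit) one.
def early_stop_training(val_losses, patience=3):
    """Return the index of the epoch whose weights you should keep."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Validation loss dips, then rises as the model starts to overfit.
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8]
print(early_stop_training(losses))  # -> 3 (the epoch with loss 0.5)
```

The common mistake is stopping at the first uptick in validation loss (patience of 1), which bails out on noise; the loss curve is bumpy and often recovers.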
Off the top of my head, I would say the fundamental math behind deep networks is not really new. Most of the work in that field is pretty ad hoc and not a lot is proven. The people who are proving things are probably using difficult graduate-level math, but you don't need to go there.
https://machinelearningmastery.com/neural-networks-tricks-of...
There is something to be said for downloading models from Hugging Face and just going from there. The fact is you are never going to train a foundation model, but you can do useful tasks with one in minutes, and if the one you use isn't good for your application, try another one. See
https://sbert.net/
particularly the "usage" example that is right there. If you have 1000-10,000 short texts and put them through k-means clustering
https://scikit-learn.org/stable/modules/clustering.html#k-me...
your jaw will drop, or at least mine did. For years I have done clustering with bag of words, LDA, and methods like that, and when I applied this to my RSS feed, all the sports articles ended up in one place, the Ukraine articles in another place, deep learning here, reinforcement learning there... in about 30 minutes of work, and it ran faster than my LDA clustering engine. With DBSCAN clustering I get all the articles about the same news event clustered together... It's just amazing.
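The pipeline above can be sketched in a few lines, assuming the usage pattern from sbert.net: embed each short text with a sentence-transformers model, then hand the vectors to scikit-learn. To keep this runnable without downloading a model, I fake two well-separated "topics" as the embeddings; the commented-out lines show where the real encode step would go (the model name there is one common choice of mine, not a specific recommendation).

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# In practice the embeddings come from sentence-transformers, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   embeddings = model.encode(texts)   # one vector per short text
# Here: two synthetic, well-separated clusters standing in for topics.
rng = np.random.default_rng(0)
sports = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
ukraine = rng.normal(loc=5.0, scale=0.1, size=(20, 8))
embeddings = np.vstack([sports, ukraine])

# Broad topical buckets: k-means with a guessed number of topics.
topic_labels = KMeans(n_clusters=2, n_init=10,
                      random_state=0).fit_predict(embeddings)

# Tight event-level clusters: DBSCAN only groups very similar items and
# marks everything else as noise (-1), so you don't pick k up front.
event_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(embeddings)

print(topic_labels)
print(event_labels)
```

The k-means/DBSCAN split matches what I see in practice: k-means gives you broad topics at a k you choose, while DBSCAN's density threshold naturally isolates bursts of near-duplicate articles about the same event.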