Last time I looked (a few years ago now) everybody serious was still using C/C++. I suppose Haskell could be a good choice for you, or you could be hip and pioneer the use of Rust for this stuff. :)
For math, you need to understand complex analysis and Fourier transforms. If poles and residues don't faze you, you know enough to read a good DSP textbook like http://www.amazon.com/Discrete-Time-Signal-Processing-Editio...
The more interesting problem is converting a non-embarrassingly (impressively?) parallel algorithm into an embarrassing one.
For that there are some standard tricks. Most are based on some theory. The first and very effective one is based on some abstract algebra: given a set of values A, define an operation * that is associative ((x * y) * z = x * (y * z) for x, y, z in A) and commutative (x * y = y * x). Then define some unit value u in A such that u * x = x. A set with an associative binary operator and a unit value is called a "monoid".
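To make the definitions concrete, here is a minimal sketch in Rust (the trait and method names are made up for illustration, not from any particular crate):

    // A monoid: an associative binary operation `op` plus a `unit` value
    // that leaves every element unchanged.
    trait Monoid: Copy {
        fn unit() -> Self;                // u, with u.op(x) == x
        fn op(self, other: Self) -> Self; // associative (here also commutative)
    }

    // i64 under addition: 0 is the unit, and + is associative and commutative.
    impl Monoid for i64 {
        fn unit() -> Self { 0 }
        fn op(self, other: Self) -> Self { self + other }
    }

    fn main() {
        let xs = [1i64, 2, 3, 4];
        let total = xs.iter().fold(i64::unit(), |acc, &x| acc.op(x));
        assert_eq!(total, 10);
    }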
If your algorithm can be formulated as follows, with each output being the previous output combined (via *) with the next input:
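    y[0] = x[0]
    y[n] = y[n-1] * x[n]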
Then you can recognize this as a prefix sum (1) and convert it to SIMD over N lanes using any of the standard decompositions. This works by exploiting the algebraic properties of the operation * to reorder instances. There are some interesting applications (2,3) of using monoids this way to convert seemingly recursive operations, like parentheses matching, into SIMD. If you have heard of monoids in the context of ~~endofunctors~~ functional programming, this is why they're useful and interesting.
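As a rough sketch of the chunked decomposition (scalar Rust; the function and its signature are made up for illustration, and a real version would run the per-chunk passes across SIMD lanes or threads instead of sequentially):

    // Two-pass prefix sum over any monoid (unit, op). Legal only because
    // op is associative: pass 1 scans each chunk independently, pass 2
    // folds the chunk totals back in.
    fn chunked_prefix_sum<T: Copy>(
        xs: &[T],
        unit: T,
        op: impl Fn(T, T) -> T,
        chunk: usize,
    ) -> Vec<T> {
        assert!(chunk > 0);
        let mut out = xs.to_vec();
        // Pass 1: independent local prefix sums inside each chunk.
        for c in out.chunks_mut(chunk) {
            for i in 1..c.len() {
                c[i] = op(c[i - 1], c[i]);
            }
        }
        // Pass 2: carry each chunk's total into every later chunk.
        let mut carry = unit;
        for c in out.chunks_mut(chunk) {
            let total = op(carry, *c.last().unwrap());
            for v in c.iter_mut() {
                *v = op(carry, *v);
            }
            carry = total;
        }
        out
    }

    fn main() {
        let ys = chunked_prefix_sum(&[1i64, 2, 3, 4, 5, 6, 7, 8], 0, |a, b| a + b, 4);
        assert_eq!(ys, vec![1, 3, 6, 10, 15, 21, 28, 36]);
    }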
Something that I haven't played with much outside of numeric algorithms comes from control theory merged with abstract algebra. If we extend our monoid to an abelian group (4), we introduce two new operations, + and - (addition and negation/inversion). Define + so that it is also associative and commutative, with a new unit called zero (0) for convenience, and define negation so that for x in A, x + -x = 0 and x * -u = -x. Examples of groups where this is all well defined are the real and complex numbers (but not floats, without relaxing some mathematical pedantry). These properties give rise to another well-known expression that is not embarrassingly parallel but can be converted into one: the finite difference equation (5), which in its standard form looks something like

    y[n] = b[0] x[n] + b[1] x[n-1] + ... + b[N] x[n-N]
         - a[1] y[n-1] - ... - a[N] y[n-N]

This is a recursive formula, and any recursive linear operation built from the + and * operations can be represented this way. The a and b vectors are called the "coefficients", and there are some interesting proofs and results here: this form describes the evolution of an output from an input very well (further study could come from any book/course on discrete-time signals and systems, or on controls applied to real/complex-valued data (6), but I digress).

Why this is useful: it can be converted into a different form, called the state-space representation:
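    x[n+1] = A x[n] + B u[n]
    y[n]   = C x[n] + D u[n]

(following the usual controls notation: u is now the input that was called x above, y is the output, and x is the internal state vector)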
A, B, C, and D are called the "state-space matrices", and the reason this matters for SIMD is that matrix multiplication is a well-known SIMD exercise with some helpful built-ins (although you have to benchmark; see my opening paragraph). You can convert any finite difference equation into an equivalent state-space representation using a technique called "canonical forms" (7); there are four well-known, standard approaches.

So: once you have your operations, define your matrix multiplications using SIMD, then convert to canonical form and evaluate. The speed-up is not equal to the number of SIMD lanes available (and may not even beat sequential! You will pay a price for the extra work and the time to fill SIMD vectors!).
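To make that concrete, here is a scalar sketch (Rust, with made-up names) of one of the standard canonical forms, the controllable one, for a second-order difference equation. This only illustrates the algebra; a real implementation would batch many filters, or k-step blocks of one filter, so the per-sample products become SIMD-friendly matrix kernels:

    // Controllable canonical form for
    //   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    // expressed as w[n+1] = A w[n] + B u[n], y[n] = C w[n] + D u[n].
    fn filter(a: [f64; 2], b: [f64; 3], xs: &[f64]) -> Vec<f64> {
        let (a1, a2) = (a[0], a[1]);
        let (b0, b1, b2) = (b[0], b[1], b[2]);
        let a_mat = [[-a1, -a2], [1.0, 0.0]];     // A
        let c_vec = [b1 - a1 * b0, b2 - a2 * b0]; // C
        let d = b0;                               // D (direct feed-through)
        let mut w = [0.0f64; 2];                  // state vector, initially at rest
        xs.iter()
            .map(|&u| {
                let y = c_vec[0] * w[0] + c_vec[1] * w[1] + d * u;
                w = [
                    a_mat[0][0] * w[0] + a_mat[0][1] * w[1] + u, // B = [1, 0]^T
                    a_mat[1][0] * w[0] + a_mat[1][1] * w[1],
                ];
                y
            })
            .collect()
    }

    fn main() {
        // Two-tap moving average, y[n] = 0.5*x[n] + 0.5*x[n-1], as a smoke test.
        let ys = filter([0.0, 0.0], [0.5, 0.5, 0.0], &[1.0, 1.0, 1.0, 1.0]);
        assert_eq!(ys, vec![0.5, 1.0, 1.0, 1.0]);
    }

The loop above is still sequential; the win comes from evaluating k steps at once (x[n+k] = A^k x[n] + a sum of B-weighted inputs), which turns the recurrence into the dense matrix products SIMD is good at.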
This is all super theoretical and hand-wavy, but I hope it gets the general approach across: start with a complex problem and convert it to a simpler one with known solutions. SIMD excels when it is easy to divide work, and converting from stateful operations (state in the sense that the nth output depends on the (n-1)th output) to stateless SIMD realizations is non-trivial.
Outside the scope of this are non-linear SIMD realizations, which are their own can of worms. Most useful numeric code that can be SIMD-ified is linear; the stuff that isn't, you convert or live with.
(1) https://en.wikipedia.org/wiki/Prefix_sum
(2) https://raphlinus.github.io/gpu/2020/09/05/stack-monoid.html
(3) https://raphlinus.github.io/gpu/2021/05/13/stack-monoid-revi...
(4) https://en.wikipedia.org/wiki/Abelian_group
(5) https://en.wikipedia.org/wiki/Linear_recurrence_with_constan...
(6) https://www.amazon.com/Discrete-Time-Signal-Processing-3rd-P...
(7) https://lpsa.swarthmore.edu/Representations/SysRepTransforma...
PS/Aside. If you're like me and think, "you mean to tell me that by specifying a few straightforward algebraic properties of my algorithm, I can convert it, through boring and tedious methods, into very fast versions for the architectures I care about?", then you can also understand why folks like me get excited about modern compiler development and dependent typing in functional languages, which in the not-too-distant future will make this automated.