How could this be optimised for speed?
Constraints are: CPU only

mat4 mult(const mat4& a, const mat4& b) {
    float xx = a.x.x * b.x.x + a.x.y * b.y.x + a.x.z * b.z.x + a.x.w * b.w.x;
    float xy = a.x.x * b.x.y + a.x.y * b.y.y + a.x.z * b.z.y + a.x.w * b.w.y;
    float xz = a.x.x * b.x.z + a.x.y * b.y.z + a.x.z * b.z.z + a.x.w * b.w.z;
    float xw = a.x.x * b.x.w + a.x.y * b.y.w + a.x.z * b.z.w + a.x.w * b.w.w;

    float yx = a.y.x * b.x.x + a.y.y * b.y.x + a.y.z * b.z.x + a.y.w * b.w.x;
    float yy = a.y.x * b.x.y + a.y.y * b.y.y + a.y.z * b.z.y + a.y.w * b.w.y;
    float yz = a.y.x * b.x.z + a.y.y * b.y.z + a.y.z * b.z.z + a.y.w * b.w.z;
    float yw = a.y.x * b.x.w + a.y.y * b.y.w + a.y.z * b.z.w + a.y.w * b.w.w;

    float zx = a.z.x * b.x.x + a.z.y * b.y.x + a.z.z * b.z.x + a.z.w * b.w.x;
    float zy = a.z.x * b.x.y + a.z.y * b.y.y + a.z.z * b.z.y + a.z.w * b.w.y;
    float zz = a.z.x * b.x.z + a.z.y * b.y.z + a.z.z * b.z.z + a.z.w * b.w.z;
    float zw = a.z.x * b.x.w + a.z.y * b.y.w + a.z.z * b.z.w + a.z.w * b.w.w;

    float wx = a.w.x * b.x.x + a.w.y * b.y.x + a.w.z * b.z.x + a.w.w * b.w.x;
    float wy = a.w.x * b.x.y + a.w.y * b.y.y + a.w.z * b.z.y + a.w.w * b.w.y;
    float wz = a.w.x * b.x.z + a.w.y * b.y.z + a.w.z * b.z.z + a.w.w * b.w.z;
    float ww = a.w.x * b.x.w + a.w.y * b.y.w + a.w.z * b.z.w + a.w.w * b.w.w;

    return mat4{
        {xx, xy, xz, xw},
        {yx, yy, yz, yw},
        {zx, zy, zz, zw},
        {wx, wy, wz, ww}
    };
}

Attached: 1200px-LLVM_Logo.svg.png (1200x901, 336K)

Check the assembly and see if it's using SIMD instructions.

make use of ARM NEON or SSE SIMD. look up whether floats are supported
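a hedged sketch of what that could look like with SSE intrinsics, assuming OP's mat4 is four contiguous vec4 rows of packed floats (the struct definitions below are a guess at OP's layout). each result row is a linear combination of b's rows, which maps cleanly onto mulps/addps:

```cpp
#include <immintrin.h>

struct vec4 { float x, y, z, w; };
struct mat4 { vec4 x, y, z, w; };

// One output row: r.x*b.x + r.y*b.y + r.z*b.z + r.w*b.w, all four lanes at once.
static inline __m128 mul_row(const vec4& r, const mat4& b) {
    __m128 acc = _mm_mul_ps(_mm_set1_ps(r.x), _mm_loadu_ps(&b.x.x));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(r.y), _mm_loadu_ps(&b.y.x)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(r.z), _mm_loadu_ps(&b.z.x)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(r.w), _mm_loadu_ps(&b.w.x)));
    return acc;
}

mat4 mult_sse(const mat4& a, const mat4& b) {
    mat4 out;
    _mm_storeu_ps(&out.x.x, mul_row(a.x, b));
    _mm_storeu_ps(&out.y.x, mul_row(a.y, b));
    _mm_storeu_ps(&out.z.x, mul_row(a.z, b));
    _mm_storeu_ps(&out.w.x, mul_row(a.w, b));
    return out;
}
```

note this assumes no padding between the floats; if mat4 is 16-byte aligned you can swap the unaligned loads/stores for _mm_load_ps/_mm_store_ps.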

>How could this be optimised for speed?
by going to university, taking several algebra courses and not being braindead. specialize what you do. if you don't need numerical stability and your matrix is actually just a rotation*translation matrix, you can go with vector+quaternion, dual quaternions or whatever. or maybe your matrix isn't fully dense, so you do the multiplication on paper and see where the zeros propagate.

Attached: brainlet_bead_maze.jpg (812x1024, 61K)
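for the rotation*translation case, here's one possible specialization (a sketch, not OP's code): assuming a row-vector convention where the fourth column of both matrices is (0,0,0,1), the w terms drop out and you do 36 multiplies instead of 64. struct layout is a guess at OP's:

```cpp
struct vec4 { float x, y, z, w; };
struct mat4 { vec4 x, y, z, w; };

// Assumes both matrices are affine: rows x,y,z have w == 0 and row w has w == 1.
mat4 mult_affine(const mat4& a, const mat4& b) {
    // Linear (rotation/scale) rows: 3x3 product, fourth column stays zero.
    auto row = [&](const vec4& r) {
        return vec4{
            r.x * b.x.x + r.y * b.y.x + r.z * b.z.x,
            r.x * b.x.y + r.y * b.y.y + r.z * b.z.y,
            r.x * b.x.z + r.y * b.y.z + r.z * b.z.z,
            0.0f
        };
    };
    // Translation row: rotate a's translation by b, then add b's translation.
    vec4 rw{
        a.w.x * b.x.x + a.w.y * b.y.x + a.w.z * b.z.x + b.w.x,
        a.w.x * b.x.y + a.w.y * b.y.y + a.w.z * b.z.y + b.w.y,
        a.w.x * b.x.z + a.w.y * b.y.z + a.w.z * b.z.z + b.w.z,
        1.0f
    };
    return mat4{row(a.x), row(a.y), row(a.z), rw};
}
```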

you both are genuinely fucking retarded, using vector instructions for such short bursts will be a pessimization. vectorization only benefits you for long segments (or long running loops) of _vectorized_ code. in this case gcc and clang can autovectorize by themselves anyway if they need to. if you try to vectorize this shit by yourself you might be surprised it runs slower than longer scalar version generated by compiler.

Attached: braingrater.jpg (372x574, 46K)

-O3

also make sure mat4 constructor is inline

i'm pretty sure GCC (at -O3) will emit SSE instructions for this code.

SSE has both scalar (e.g. addss) and vector (addps) instructions. addss only operates on the first element of the xmm registers; addps operates on all 4. you can also use the AVX encoding and get 3-operand instructions like vaddss, same thing. the practical difference is that with *ss and *sd instructions the cpu runs your program fine, but with *ps and *pd it can slow down for various reasons (including downclocking on int*l cpus)

so, continuing: if your program jumps into a short burst of *ps or *pd instructions and then goes back to *ss, *sd or something else entirely (e.g. integer crunching), it will bottleneck at the transition points. with AVX on intel it's way worse. you also get big slowdowns if you mix SSE code with AVX, and when you actually use AVX the core has to do some work (firing up the AVX circuitry or whatever), plus it downclocks itself, and that's a waste of cpu time. so if your SSE or AVX vector code doesn't run long enough, your cpu will just spend its time downclocking, firing up AVX, and clocking back up.
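the scalar-vs-packed lane behaviour is easy to see with intrinsics (a minimal demo of the semantics, not a perf claim):

```cpp
#include <immintrin.h>

// Spill a register to memory so we can inspect individual lanes.
static float lane(__m128 v, int i) {
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[i];
}

// addss touches only the low lane; addps touches all four.
static bool scalar_vs_packed() {
    __m128 a = _mm_setr_ps(1.f, 2.f, 3.f, 4.f);
    __m128 b = _mm_setr_ps(10.f, 20.f, 30.f, 40.f);
    __m128 s = _mm_add_ss(a, b);  // low lane is a0+b0, upper lanes copied from a
    __m128 p = _mm_add_ps(a, b);  // every lane is ai+bi
    return lane(s, 0) == 11.f && lane(s, 1) == 2.f
        && lane(p, 0) == 11.f && lane(p, 3) == 44.f;
}
```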

vzeroupper before and after AVX code fixes this.

actually nevermind, vzeroupper is not required (or even harmful) since skylake/ryzen. the other points are still valid, short bursts of vector code will bring nothing but slowness.

Build the mat4 result up front and assign every field directly, instead of instantiating 16 floats and then copying them into a mat4.

-O3 -maltivec -mabi=altivec

Use compiler SIMD intrinsics?

> in this case gcc and clang can autovectorize by themselves anyway if they need to
>trusting a compiler
Compiler auto-optimization for SIMD related tasks is absolute trash and shouldn't be trusted at all.

Attached: goofy_is_angry.png (401x291, 144K)

Write it in assembly.

Just use SIMD intrinsics in your C/C++/Rust/D code. On my fx 8350 my SSE2 matrix multiplication routine runs 3-4 times faster than the scalar code -O3 generates. Just keep in mind that you MIGHT pay a penalty for transitioning from scalar code to SSE.

There are tons of articles about the AVX/SSE transition penalty, but I can't find anything about scalar/SSE penalties like what that anon is talking about. I don't doubt him though. CPUs use a pipelined architecture where any change to the pipeline state can fuck with things.

As always test your shit.

by using fortran

Attached: 1549891598923.jpg (764x713, 309K)

/thread

Shouldn't the compiler be able to optimize this perfectly?