performance improvement: rewrite of the matrix-matrix product following
Goto's paper => x1.4 speedup with more consistent perf results
1 file changed