performance improvement: rewrite of the matrix-matrix product following Goto's paper => 1.4x speedup with more consistent performance results