5. Matrix Multiplication

In Section 1 we saw that the NumPy dot operator nearly reaches the peak performance of our CPU (an Intel Xeon E5-2686 v4). In this section, we will investigate multiple scheduling strategies for this operator in TVM.

5.1. Setup

%matplotlib inline
import d2ltvm
import numpy as np
import timeit
import tvm
from tvm import te

target = 'llvm -mcpu=skylake-avx512'

As we did in Section 3, we first define a method to measure NumPy performance as our baseline.

# Save to the d2ltvm package.
def np_matmul_timer(n):
    # Return a timeit timer that measures np.dot on two n-by-n matrices,
    # writing into a preallocated output to avoid allocation overhead.
    timer = timeit.Timer(setup='import numpy as np\n'
                         'import d2ltvm\n'
                         'a, b, c = d2ltvm.get_abc(%s)' % str((n, n)),
                         stmt='np.dot(a, b, out=c)')
    return timer.timeit
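
The helper d2ltvm.get_abc used in the setup string was defined earlier in the book. For reference, here is a minimal sketch of its behavior, assuming only that it returns two random float32 operands plus an empty output, optionally wrapped by a constructor:

def get_abc_sketch(shape, constructor=None):
    # Random float32 operands a, b and an empty output c; a fixed seed
    # keeps benchmark inputs reproducible across runs.
    np.random.seed(0)
    a = np.random.normal(size=shape).astype(np.float32)
    b = np.random.normal(size=shape).astype(np.float32)
    c = np.empty_like(a)
    if constructor:
        a, b, c = [constructor(x) for x in (a, b, c)]
    return a, b, c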

sizes = 2**np.arange(5, 12, 1)
exe_times = [d2ltvm.bench_workload(np_matmul_timer(n)) for n in sizes]
# Each of the n^3 multiply-add pairs in an n-by-n matmul counts as 2 FLOPs.
np_gflops = 2 * sizes**3 / 1e9 / np.array(exe_times)
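
Here d2ltvm.bench_workload, also defined earlier in the book, repeats the workload enough times to obtain a stable measurement. A minimal sketch consistent with how it is used here:

def bench_workload_sketch(workload):
    # Warm up once, then pick a repeat count so that the timed run
    # lasts roughly at least one second.
    workload(1)
    time = workload(1)
    if time > 1:
        return time
    nrepeats = max(int(1.0 / time), 5)
    return workload(nrepeats) / nrepeats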

5.2. Default Schedule
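
Recall that d2ltvm.matmul declares the matrix multiplication in the tensor expression language. A sketch consistent with its definition in Section 3:

def matmul_sketch(n, m, l):
    # Declare C = A @ B, with A of shape (n, m) and B of shape (m, l).
    k = te.reduce_axis((0, m), name='k')
    A = te.placeholder((n, m), name='A')
    B = te.placeholder((m, l), name='B')
    C = te.compute((n, l),
                   lambda x, y: te.sum(A[x, k] * B[k, y], axis=k),
                   name='C')
    return A, B, C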

The default schedule consists of three nested for-loops.

def default(n):
    A, B, C = d2ltvm.matmul(n, n, n)
    return te.create_schedule(C.op), (A, B, C)

s, args = default(64)
print(tvm.lower(s, args, simple_mode=True))
produce C {
  for (x, 0, 64) {
    for (y, 0, 64) {
      C[((x*64) + y)] = 0f
      for (k, 0, 64) {
        C[((x*64) + y)] = (C[((x*64) + y)] + (A[((x*64) + k)]*B[((k*64) + y)]))
      }
    }
  }
}
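
Before timing this schedule we can optionally check it for correctness against NumPy; the tolerance below is an arbitrary choice:

# Sanity check: compile the default schedule and compare with np.dot.
mod = tvm.build(s, args, target)
ctx = tvm.context(target, 0)
a, b, c = d2ltvm.get_abc((64, 64), lambda x: tvm.nd.array(x, ctx=ctx))
mod(a, b, c)
np.testing.assert_allclose(np.dot(a.asnumpy(), b.asnumpy()),
                           c.asnumpy(), atol=1e-5)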

Note that TVM flattens C into a one-dimensional buffer, so element C[x][y] appears as C[((x*64) + y)] above. To benchmark the performance of this schedule, we again define a reusable method as we did in Section 3.

# Save to the d2ltvm package.
def bench_matmul_tvm(func, sizes, target):
    def workload(nrepeats):
        timer = mod.time_evaluator(mod.entry_name, ctx=ctx, number=nrepeats)
        return timer(a, b, c).mean * nrepeats
    times = []
    for n in sizes:
        s, (A, B, C) = func(int(n))
        mod = tvm.build(s, [A, B, C], target)
        ctx = tvm.context(target, 0)
        # workload closes over mod, ctx and the arrays of this iteration
        a, b, c = d2ltvm.get_abc((n, n), lambda x: tvm.nd.array(x, ctx=ctx))
        times.append(d2ltvm.bench_workload(workload))
    return 2 * sizes**3 / 1e9 / np.array(times)

The default schedule follows the computation illustrated in Fig. 3.2.1. It is not surprising that it performs poorly, especially on large matrices that cannot fit into the cache: at \(n=2^{11}\), for example, each float32 matrix occupies 16 MB, far beyond the L2 cache.

default_gflops = bench_matmul_tvm(default, sizes, target)
d2ltvm.plot_gflops(sizes, [np_gflops, default_gflops], ['numpy', 'default'])
[Plot: GFLOPS of the numpy baseline and the default schedule across matrix sizes.]

5.3. Reordering Axes

The first problem we can see from Fig. 3.2.1 is that matrix \(B\) is accessed column by column while its elements are stored row by row (i.e. matrix \(B\) is in row-major order). In other words, in the pseudo code above, we iterate over axis y before axis k. Simply swapping these two for-loops makes all elements be read and written sequentially. Fig. 5.3.1 illustrates the changed data access pattern.


Fig. 5.3.1 Reorder axes in matrix multiplication.
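
In plain Python, the reordered computation looks as follows; this sketch only illustrates the access pattern and is, of course, far too slow to use:

def matmul_reordered(a, b, c):
    # Loop order (x, k, y): with y innermost, b[k, :] and c[x, :] are
    # both traversed sequentially, matching their row-major layout.
    n = c.shape[0]
    c[:] = 0
    for x in range(n):
        for k in range(n):
            for y in range(n):
                c[x, y] += a[x, k] * b[k, y]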

To implement this in TVM, we change the axis order from (x, y, k) to (x, k, y) using the reorder primitive. The corresponding pseudo code verifies that we now process all three matrices row by row.

def reorder(n):
    s, (A, B, C) = default(n)
    (x, y), (k,) = C.op.axis, C.op.reduce_axis
    s[C].reorder(x, k, y)
    return s, (A, B, C)

s, args = reorder(64)
print(tvm.lower(s, args, simple_mode=True))
produce C {
  for (x, 0, 64) {
    for (y.init, 0, 64) {
      C[((x*64) + y.init)] = 0f
    }
    for (k, 0, 64) {
      for (y, 0, 64) {
        C[((x*64) + y)] = (C[((x*64) + y)] + (A[((x*64) + k)]*B[((k*64) + y)]))
      }
    }
  }
}

Note that the initialization of C is now split into a separate y.init loop: since the reduction axis k is no longer innermost, an element of C can no longer be zeroed immediately before its own reduction. Benchmarking shows that the reordering significantly improves the performance compared to the default schedule.

reorder_gflops = bench_matmul_tvm(reorder, sizes, target)
d2ltvm.plot_gflops(sizes, [np_gflops, default_gflops, reorder_gflops],
            ['numpy', 'default', 'reorder'])
[Plot: GFLOPS of the numpy, default and reorder schedules across matrix sizes.]

5.4. Parallelization

Let’s revisit the pseudo code above. Each iteration of the outermost for-loop for (x, 0, 64) computes one row of \(C\). Since these rows are independent of each other, we can make the schedule compute them in parallel by parallelizing axis x.

def parallel(n):
    s, (A, B, C) = reorder(n)
    s[C].parallel(C.op.axis[0])
    return s, (A, B, C)

s, args = parallel(64)
print(tvm.lower(s, args, simple_mode=True))
produce C {
  parallel (x, 0, 64) {
    for (y.init, 0, 64) {
      C[((x*64) + y.init)] = 0f
    }
    for (k, 0, 64) {
      for (y, 0, 64) {
        C[((x*64) + y)] = (C[((x*64) + y)] + (A[((x*64) + k)]*B[((k*64) + y)]))
      }
    }
  }
}
parallel_gflops = bench_matmul_tvm(parallel, sizes, target)
d2ltvm.plot_gflops(sizes, [np_gflops, default_gflops, reorder_gflops, parallel_gflops],
            ['numpy', 'default', 'reorder', 'parallel'])
[Plot: GFLOPS of the numpy, default, reorder and parallel schedules across matrix sizes.]

Parallelization improves the performance again. But we can see that there is still a gap compared to NumPy on large matrices, especially when they cannot fit into the L2 cache. We will try to close this gap in the next section.
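
As a starting point for the first exercise below, the size of TVM's thread pool can be capped with the TVM_NUM_THREADS environment variable; it needs to be set before the runtime launches its worker threads:

import os
# Cap TVM's thread pool at 4 workers. This must happen before the
# first parallel kernel runs, ideally before importing tvm.
os.environ['TVM_NUM_THREADS'] = '4'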

5.5. Summary

  • Reordering the for-loops in matrix multiplication properly improves the performance.

  • Proper thread-level parallelization also improves the performance.

5.6. Exercises

  1. Change the number of threads

  2. Try to order the axes in method parallel differently

  3. Benchmark matrix multiplication in larger sizes