3. Vector Add

In this section, we will optimize the vector add defined in Section 1.2 on CPU.

3.1. Setup

%matplotlib inline
import d2ltvm
import inspect
from IPython import display
import numpy as np
from matplotlib import pyplot as plt
import timeit
import tvm
from tvm import te

We first define reusable plot functions to draw multiple lines, which generalize the plot function defined in Section 2.

# Save to the d2ltvm package.
def plot(X, Y, xlabel=None, ylabel=None, legend=[], xlim=None,
         ylim=None, xscale='linear', yscale='linear', fmts=None,
         figsize=(4.5, 3)):
    """Plot multiple lines"""
    display.set_matplotlib_formats('svg')
    plt.rcParams['figure.figsize'] = figsize
    axes = plt.gca()
    X, Y = np.array(X), np.array(Y)
    if X.shape != Y.shape: X = [X] * len(Y)
    if not fmts: fmts = ['-'] * len(X)
    for x, y, fmt in zip(X, Y, fmts):
        axes.plot(x, y, fmt)
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend: axes.legend(legend)
    axes.grid()

# Save to the d2ltvm package.
def plot_gflops(sizes, gflops, legend, xlabel='Size'):
    d2ltvm.plot(sizes, gflops, xlabel=xlabel, ylabel='GFLOPS',
                xscale='log', yscale='log',
                legend=legend, fmts=['--']*(len(gflops)-1)+['-'])

Then we benchmark the performance of NumPy as our baseline. The following diagram shows vector size versus measured GFLOPS (giga floating-point operations per second).

sizes = 2**np.arange(5, 29, 4)
np_add = lambda n: timeit.Timer(setup='import numpy as np\n'
                                'import d2ltvm\n'
                                'a, b, c = d2ltvm.get_abc(%d)' % n,
                                stmt='np.add(a, b, out=c)')
exe_times = [d2ltvm.bench_workload(np_add(n).timeit) for n in sizes]
np_gflops = sizes / 1e9 / np.array(exe_times)
plot_gflops(sizes, [np_gflops], ['numpy'])
[Figure: vector size vs. GFLOPS for the NumPy baseline]

As we can see, the performance first increases with the vector length, because system overhead dominates when the workload is small. The performance then decreases once the data can no longer fit entirely into the cache.
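
To make the cache argument concrete, the following sketch estimates the working set for each benchmarked size: vector add touches three float32 arrays of length n, i.e. 3 * n * 4 bytes. The 32 MB last-level cache value is an assumption for illustration, not a measurement of the benchmark machine.

llc_bytes = 32 * 2**20  # assumed last-level cache size; substitute your CPU's
for n in sizes:
    working_set = 3 * n * 4  # bytes for a, b and c in float32
    print('n = 2^%2d: working set %8.2f MB, fits in LLC: %s' % (
        int(np.log2(n)), working_set / 2**20, working_set <= llc_bytes))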

3.2. Default Schedule

In the following code block, we define a reusable method to benchmark TVM performance. It accepts three arguments: 1) func, which returns the schedule and its corresponding symbolic tensor arguments; 2) sizes, a list of vector lengths to benchmark; and 3) the machine target, which is CPU-related in this chapter and will be GPU-related in the next chapter.

# Save to the d2ltvm package.
def bench_vector_add_tvm(func, sizes, target):
    def workload(nrepeats):
        timer = mod.time_evaluator(mod.entry_name, ctx=ctx, number=nrepeats)
        return timer(a, b, c).mean * nrepeats
    times = []
    for n in sizes:
        s, (A, B, C) = func(int(n))
        # compile the schedule into a module for the given target
        mod = tvm.build(s, [A, B, C], target)
        ctx = tvm.context(target, 0)
        # allocate the inputs and output on the target device
        a, b, c = d2ltvm.get_abc(n, lambda x: tvm.nd.array(x, ctx=ctx))
        times.append(d2ltvm.bench_workload(workload))
    return sizes / 1e9 / np.array(times)

The default schedule is a plain one-level for-loop program.

def default(n):
    A, B, C = d2ltvm.vector_add(n)
    s = te.create_schedule(C.op)
    return s, (A, B, C)

s, args = default(64)
print(tvm.lower(s, args, simple_mode=True))
produce c {
  for (i, 0, 64) {
    c[i] = (a[i] + b[i])
  }
}

Remember that in Section 1 we found our CPU supports AVX-512, so we pass -mcpu=skylake-avx512 to LLVM so that it can generate AVX-512 instructions when possible. In the following code, we print the first few lines of the generated LLVM code.

target = 'llvm -mcpu=skylake-avx512'
mod = tvm.build(s, args, target)
print(mod.get_source()[:500])
; ModuleID = 'TVMMod'
source_filename = "TVMMod"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

%0 = type { i8*, %1, i32, %2, i64*, i64*, i64 }
%1 = type { i32, i32 }
%2 = type { i8, i8, i16 }

@__TVMAPISetLastError = linkonce dllexport local_unnamed_addr global void (i8*)* null, align 8
@.str = private constant [69 x i8] c"Assert fail: (num_args == 3), default_function: num_args should be 3\00", align 1
@.str.1 = private constant [144 x i8] c"

You may find it hard to read if you are not familiar with LLVM IR, but you don't need to worry much: in most cases, reading the C-like pseudo code is sufficient to study the performance. Now let's benchmark the default schedule.

default_gflops = bench_vector_add_tvm(default, sizes, target)
plot_gflops(sizes, [np_gflops, default_gflops], ['numpy', 'default'])
[Figure: vector size vs. GFLOPS for NumPy and the default schedule]

When the vector size is small, the default schedule outperforms NumPy, which means that the function call overhead of TVM is smaller than that of NumPy. It is not surprising that the performance degrades as the vector size increases, since the data can no longer fit into the last-level cache.
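
To get a rough sense of how large this per-call overhead is on the NumPy side, the following sketch times np.add on a tiny vector, where the arithmetic itself is negligible; the vector length of 32 and the repeat count are arbitrary choices.

# At this size the measured time is dominated by the per-call overhead.
a, b = np.ones(32, dtype='float32'), np.ones(32, dtype='float32')
c = np.empty_like(a)
nrepeats = 100000
t = timeit.timeit(lambda: np.add(a, b, out=c), number=nrepeats) / nrepeats
print('approx. per-call overhead of np.add: %.0f ns' % (t * 1e9))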

3.3. Parallelization

One important optimization that is not enabled by default is thread-level parallelization. The vector add operator is an embarrassingly parallel workload, so we can simply change the for-loop into a parallel for-loop. In TVM, we first obtain the schedule stage for the output symbol C by s[C], and then parallelize the computation over its first axis, which is C.op.axis[0].

def parallel(n):
    s, (A, B, C) = default(n)
    # add thread-level parallelism
    s[C].parallel(C.op.axis[0])
    return s, (A, B, C)

s, args = parallel(64)
print(tvm.lower(s, args, simple_mode=True))
produce c {
  parallel (i, 0, 64) {
    c[i] = (a[i] + b[i])
  }
}

We can see that for is changed to parallel in the above pseudo code. It means that the iterations can be executed in parallel by multiple threads. A typical implementation on a system with \(t\) CPU cores is to first create \(t\) threads, one per core; thread \(i\) then executes block \(j\) if \(j \bmod t = i\). All these threads run simultaneously on their own cores, and each comes back to retrieve another block after finishing one. This is often called round-robin scheduling, which works well if each thread runs at the same speed and every iteration has roughly the same workload. TVM's thread-level parallelization is a special case of round-robin: it evenly divides the to-be-parallelized \(n\)-length loop into \(t\) contiguous blocks, so each thread only needs to process one block. Furthermore, the TVM runtime binds the working threads to disjoint physical cores to avoid resource contention and thread migration overhead. There are other thread-level parallelization schemes, such as the more dynamic consumer-producer scheduling.
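
The following is a minimal pure-Python sketch of this static partitioning, using the threading module rather than the TVM runtime; because of Python's global interpreter lock it only illustrates how the iterations are divided into blocks, not an actual speedup.

import threading

def static_parallel_for(n, t, body):
    """Evenly divide the n-length loop into t contiguous blocks,
    one block per thread, mimicking TVM's partitioning."""
    block = (n + t - 1) // t  # ceil(n / t) iterations per block
    def run(i):
        for j in range(i * block, min((i + 1) * block, n)):
            body(j)
    threads = [threading.Thread(target=run, args=(i,)) for i in range(t)]
    for th in threads: th.start()
    for th in threads: th.join()

a, b, c = d2ltvm.get_abc(64)
def add_one(i): c[i] = a[i] + b[i]
static_parallel_for(64, 4, add_one)
np.testing.assert_allclose(c, a + b)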

Then we benchmark the parallelized schedule and plot the comparison diagram via the following code block.

parallel_gflops = bench_vector_add_tvm(parallel, sizes, target)
plot_gflops(sizes, [np_gflops, default_gflops, parallel_gflops],
     ['numpy', 'default', 'parallel'])
[Figure: vector size vs. GFLOPS for NumPy, default, and parallel schedules]

Compared to the results we obtained before, parallelization significantly improves the performance when the workloads are large, e.g. vector lengths beyond \(10^4\). However, the parallelization overhead hurts the performance for small workloads, where a single thread is even faster. Also note that the performance drop now happens at a larger vector size, because the multiple cores in play provide a larger aggregate amount of L2 cache.
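
If you want to study how the number of threads affects this trade-off, the TVM runtime reads the TVM_NUM_THREADS environment variable when it creates its thread pool. A minimal sketch follows; the value 4 is arbitrary, and the variable must be set before the first parallel kernel runs (e.g. at the top of a fresh notebook), otherwise it has no effect.

import os
# Must be set before TVM's thread pool is created, i.e. before the first
# parallel kernel executes in this process.
os.environ['TVM_NUM_THREADS'] = '4'  # hypothetical thread count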

3.4. Vectorization

A single core may have SIMD units to run multiple arithmetic operations at the same time, as we saw in Section 1. Although one iteration in the above loop only has a single add operation, we can explicitly allocate more operations within an iteration, and ask the compiler to use SIMD instructions to process them.

The way to do it is to first split the one-level for-loop into a two-level nested for-loop using a factor. The inner loop consists of factor original iterations that are grouped together to be executed by SIMD instructions on a core, while the iterations in the outer loop still run in parallel.

def vectorized(n):
    s, (A, B, C) = default(n)
    outer, inner = s[C].split(C.op.axis[0], factor=8)
    s[C].parallel(outer)
    s[C].vectorize(inner)
    return s, (A, B, C)

s, args = vectorized(64)
print(tvm.lower(s, args, simple_mode=True))
produce c {
  parallel (i.outer, 0, 8) {
    c[ramp((i.outer*8), 1, 8)] = (a[ramp((i.outer*8), 1, 8)] + b[ramp((i.outer*8), 1, 8)])
  }
}

We can see that the outer for-loop is reduced to 8 iterations, while the inner for-loop is vectorized by ramp with a stride of 1 and width of 8. The definition of ramp is inherited from Halide.
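
To make the ramp semantics concrete, the following NumPy sketch spells out what each outer iteration computes; the index arithmetic mirrors ramp((i.outer*8), 1, 8).

n = 64
a, b, c = d2ltvm.get_abc(n)
for i_outer in range(n // 8):
    idx = i_outer * 8 + np.arange(8)  # base i_outer*8, stride 1, width 8
    c[idx] = a[idx] + b[idx]          # 8 adds issued as one block
np.testing.assert_allclose(c, a + b)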

Again, we check the performance of vectorization and plot the comparison diagram via the following code block.

vectorized_gflops = bench_vector_add_tvm(vectorized, sizes, target)
plot_gflops(sizes, [np_gflops, default_gflops, parallel_gflops, vectorized_gflops],
     ['numpy', 'default', 'parallel', 'vectorized'])
[Figure: vector size vs. GFLOPS for NumPy, default, parallel, and vectorized schedules]

The performance of the vectorized version is almost the same as the plain parallelization version. This is partially because the vector add is bottlenecked by memory bandwidth instead of computation, while SIMD only helps the latter. We will see that it helps more on computation-intensive workloads such as matrix multiplication later.
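
One way to see the bandwidth argument quantitatively is a roofline-style estimate; the 50 GB/s memory bandwidth below is a hypothetical number rather than a measurement of the benchmark machine.

# float32 vector add: 1 addition per element versus 12 bytes moved
# (read a, read b, write c).
intensity = 1 / (3 * 4)  # FLOPs per byte
bandwidth = 50           # hypothetical memory bandwidth in GB/s
print('arithmetic intensity: %.3f FLOP/byte' % intensity)
print('bandwidth-bound limit: ~%.1f GFLOPS' % (bandwidth * intensity))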

3.5. Summary

  • The default schedule generates a naive single-thread CPU program.

  • Parallelization improves performance for large workloads.

  • We can split a for-loop and then vectorize the inner loop if the system supports SIMD.