.. _ch_bcast_add_cpu:

Broadcast Add
=============

This section schedules the broadcast add computation defined in
:numref:`ch_bcast_add` on CPU. Its optimization is similar to what we
have done for vector add in :numref:`ch_vector_add_cpu`, as both are
essentially memory-bound element-wise calculations. Hence, besides
"what works", we will also talk about "what doesn't work" in this
section.

Setup
-----

.. code:: python

    %matplotlib inline
    import d2ltvm
    import inspect
    from IPython import display
    import numpy as np
    from matplotlib import pyplot as plt
    import timeit
    import tvm
    from tvm import te

Here we choose the broadcast add depicted in :numref:`fig_bcast_add`.
The other broadcast patterns make no essential difference. First, we
define the baseline in ``numpy``, using the actual data size consumed
by the operator as the x-axis.

.. code:: python

    sizes = 2**np.arange(5, 15, 2)
    np_bcast_add = lambda s1, s2: timeit.Timer(
        setup='import numpy as np\n'
              'import d2ltvm\n'
              'a, b, c = d2ltvm.get_bcast_data(%s, %s)' % (s1, s2),
        stmt='np.add(a, b, out=c)')
    exe_times = [d2ltvm.bench_workload(np_bcast_add((n, 1), (n, n)).timeit)
                 for n in sizes]
    np_gflops = sizes * sizes / 1e9 / np.array(exe_times)
    # data size in MB
    x_axis_sizes = (sizes * sizes * 2 + sizes * sizes) * 4 / 1e6
    d2ltvm.plot_gflops(x_axis_sizes, [np_gflops], ['numpy'], xlabel='Size (MB)')

.. figure:: output_broadcast_add_ba6441_3_0.svg

Note that the x-axis denotes the total data size consumed. The
performance drops when the data size is larger than the L2 cache of a
single core (1024 KB, i.e. 1 MB), which indirectly suggests that
``numpy`` may use a single thread to perform this operator.

Good Schedule
-------------

The following code block defines a benchmarking method for broadcast
add of a specific pattern in TVM. It follows the same format we
discussed in :numref:`ch_vector_add_cpu`.

.. code:: python

    # Save to the d2ltvm package.
    def bench_bcast_add_tvm(func, sizes, target):
        def workload(nrepeats):
            timer = mod.time_evaluator(mod.entry_name, ctx=ctx, number=nrepeats)
            return timer(a, b, c).mean * nrepeats
        times = []
        for n in sizes:
            n = int(n)
            s, (A, B, C) = func(n)
            mod = tvm.build(s, [A, B, C], target)
            ctx = tvm.context(target, 0)
            a, b, c = d2ltvm.get_bcast_data((n, 1), (n, n),
                                            lambda x: tvm.nd.array(x, ctx=ctx))
            times.append(d2ltvm.bench_workload(workload))
        return sizes * sizes / 1e9 / np.array(times)

The good schedule (i.e. what works) follows a scheme similar to the one
defined in :numref:`ch_vector_add_cpu`: parallelize the outer axis
across threads and vectorize the inner axis.

.. code:: python

    target = 'llvm -mcpu=skylake-avx512'

    def default(n):
        A, B, C = d2ltvm.broadcast_add((n, 1), (n, n))
        s = te.create_schedule(C.op)
        return s, (A, B, C)

    def good_schedule(n):
        s, (A, B, C) = default(n)
        x, y = C.op.axis
        s[C].parallel(x)
        s[C].vectorize(y)
        return s, (A, B, C)

    s, args = good_schedule(64)
    print(tvm.lower(s, args, simple_mode=True))

.. parsed-literal::
    :class: output

    produce C {
      parallel (x, 0, 64) {
        C[ramp((x*64), 1, 64)] = (x64(A[x]) + B[ramp((x*64), 1, 64)])
      }
    }

Now the C-like pseudo code should be familiar to you. One notable
difference from :numref:`ch_vector_add_cpu` is that we broadcast
``A[x]`` to a vectorized register (i.e. ``x64(A[x])``) for the
vectorized add.

Let's benchmark the good schedule.

.. code:: python

    good_gflops = bench_bcast_add_tvm(good_schedule, sizes, target)
    d2ltvm.plot_gflops(x_axis_sizes, [np_gflops, good_gflops],
                       ['numpy', 'good'], xlabel='Size (MB)')

.. figure:: output_broadcast_add_ba6441_9_0.svg

Like the case in :numref:`ch_vector_add_cpu`, the performance is better
than ``numpy`` and drops at a larger data size, as multi-core execution
comes into play and provides a larger amount of L2 cache in total.
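The numbers above only measure speed. It can also be worth a quick
sanity check that the compiled module matches ``numpy`` numerically.
The following is a minimal sketch that reuses the helpers and the
``tvm.context``-style API from above; the size ``64`` is an arbitrary
choice.

.. code:: python

    # Sanity check (a sketch, not part of the original benchmark):
    # build the good schedule for one size and compare against numpy.
    n = 64
    s, (A, B, C) = good_schedule(n)
    mod = tvm.build(s, [A, B, C], target)
    ctx = tvm.context(target, 0)
    a, b, c = d2ltvm.get_bcast_data((n, 1), (n, n),
                                    lambda x: tvm.nd.array(x, ctx=ctx))
    mod(a, b, c)  # run the compiled broadcast add
    # numpy broadcasts (n, 1) + (n, n) in the same way
    np.testing.assert_allclose(c.asnumpy(),
                               a.asnumpy() + b.asnumpy(), atol=1e-5)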
Bad Schedule
------------

Now let's see what a bad schedule (i.e. what doesn't work) looks like.
Note that our data are all stored in row-major order and our schedules
always make sure that the innermost loop manipulates data in
consecutive memory. However, if we think from the perspective of
vectorization, we may find that broadcasting ``A[x]`` to a vectorized
register brings overhead. A better way would be to add multiple values
of ``A`` to multiple values of ``B`` in one vectorized instruction,
which is illustrated in :numref:`fig_bcast_add_vectorize`.

.. _fig_bcast_add_vectorize:

.. figure:: ../img/bcast_add_vectorize.svg

   Different vectorization strategies of broadcast add.

To vectorize along the column, we need to reorder the data access
pattern as follows.

.. code:: python

    def bad_schedule(n):
        s, (A, B, C) = default(n)
        x, y = C.op.axis
        s[C].reorder(y, x)
        s[C].parallel(y)
        s[C].vectorize(x)
        return s, (A, B, C)

    s, args = bad_schedule(64)
    print(tvm.lower(s, args, simple_mode=True))

.. parsed-literal::
    :class: output

    produce C {
      parallel (y, 0, 64) {
        C[ramp(y, 64, 64)] = (A[ramp(0, 1, 64)] + B[ramp(y, 64, 64)])
      }
    }

Now we eliminate the broadcast of ``A[x]`` in vectorization. Let's run
the benchmark and plot the chart for this schedule.

.. code:: python

    sizes = 2**np.arange(5, 11, 2)
    bad_gflops = bench_bcast_add_tvm(bad_schedule, sizes, target)
    # pad with NaN so the three curves share the same x-axis
    diff = len(good_gflops) - len(bad_gflops)
    bad_gflops = np.append(bad_gflops, [np.nan] * diff)
    d2ltvm.plot_gflops(x_axis_sizes, [np_gflops, good_gflops, bad_gflops],
                       ['numpy', 'good', 'bad'], xlabel='Size (MB)')

.. figure:: output_broadcast_add_ba6441_13_0.svg

Note that in order to make it finish in a short period of time, we
reduce the number of tested sizes for this schedule. We can see that
the performance of the vectorization-favorable schedule is pretty bad.
The reason is that we access ``B`` with a stride to gather the data of
one column, which is much slower than accessing data in consecutive
memory. Thread-level parallelism makes it even worse, as different
threads compete for data in the same cache line.

The result shows that a good schedule needs to consider multiple
performance-related factors together. A scheduling scheme that favors
one aspect may lead to bad overall performance if it harms other
aspects.

Summary
-------

- Like vector add, broadcast add is a memory-bound operator.
- A good schedule needs to consider multiple performance-related
  factors together.

Exercise
--------

- Try to schedule other broadcast add patterns and observe the
  difference. A possible starting point is sketched below.
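As a hypothetical starting point, the sketch below schedules the
row-wise pattern ``(1, n) + (n, n)``, assuming ``d2ltvm.broadcast_add``
from :numref:`ch_bcast_add` accepts this shape pair. Since ``A`` is now
contiguous along the inner axis, vectorizing ``y`` should need neither
the ``x64(...)`` broadcast nor strided access.

.. code:: python

    # A sketch for the exercise: broadcast along rows instead of columns.
    # The schedule itself mirrors good_schedule above.
    def row_bcast_schedule(n):
        A, B, C = d2ltvm.broadcast_add((1, n), (n, n))
        s = te.create_schedule(C.op)
        x, y = C.op.axis
        s[C].parallel(x)   # parallelize over rows
        s[C].vectorize(y)  # both A[0, y] and B[x, y] are consecutive in y
        return s, (A, B, C)

    s, args = row_bcast_schedule(64)
    print(tvm.lower(s, args, simple_mode=True))

Note that ``bench_bcast_add_tvm`` above hard-codes the ``(n, 1)`` and
``(n, n)`` data shapes, so benchmarking this pattern requires adapting
the data generation accordingly.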