.. _ch_batch_norm_cpu:

Batch Normalization
===================

This section describes how to schedule the batch normalization computation
defined in :numref:`ch_batch_norm` on CPU.

Setup
-----

.. code:: python

    %matplotlib inline
    import d2ltvm
    import inspect
    from IPython import display
    import numpy as np
    from matplotlib import pyplot as plt
    import timeit
    import tvm
    from tvm import te
    import topi

    target = 'llvm -mcpu=skylake-avx512'

Schedule
--------

We first review the default schedule of batch normalization and its IR, as
shown in :numref:`ch_batch_norm`.

.. code:: python

    # a tuple of channel and input height/width
    size = (32, 28)

    def default_bn(size):
        c, n = size[:]
        X, Mean, Var, Gamma, Beta, Y = d2ltvm.batch_norm(c, n)
        sch = te.create_schedule(Y.op)
        return sch, (X, Mean, Var, Gamma, Beta, Y)

    sch, args = default_bn(size)
    print(tvm.lower(sch, args, simple_mode=True))


.. parsed-literal::
    :class: output

    // attr [T_subtract] storage_scope = "global"
    allocate T_subtract[float32 * 25088]
    // attr [T_add] storage_scope = "global"
    allocate T_add[float32 * 32]
    produce T_subtract {
      for (ax0, 0, 32) {
        for (ax1, 0, 28) {
          for (ax2, 0, 28) {
            T_subtract[(((ax0*784) + (ax1*28)) + ax2)] = (X[(((ax0*784) + (ax1*28)) + ax2)] - Mean[ax0])
          }
        }
      }
    }
    produce T_add {
      for (ax0, 0, 32) {
        T_add[ax0] = (Var[ax0] + 1e-05f)
      }
    }
    produce compute {
      for (i0, 0, 32) {
        T_add[i0] = sqrt(T_add[i0])
      }
    }
    produce T_divide {
      for (ax0, 0, 32) {
        for (ax1, 0, 28) {
          for (ax2, 0, 28) {
            T_subtract[(((ax0*784) + (ax1*28)) + ax2)] = (T_subtract[(((ax0*784) + (ax1*28)) + ax2)]/T_add[ax0])
          }
        }
      }
    }
    produce T_multiply {
      for (ax0, 0, 32) {
        for (ax1, 0, 28) {
          for (ax2, 0, 28) {
            T_subtract[(((ax0*784) + (ax1*28)) + ax2)] = (T_subtract[(((ax0*784) + (ax1*28)) + ax2)]*Gamma[ax0])
          }
        }
      }
    }
    produce T_add {
      for (ax0, 0, 32) {
        for (ax1, 0, 28) {
          for (ax2, 0, 28) {
            T_add[(((ax0*784) + (ax1*28)) + ax2)] = (T_subtract[(((ax0*784) + (ax1*28)) + ax2)] + Beta[ax0])
          }
        }
      }
    }


We can easily tell that the multiple stages of this computation can be fused
together injectively. As we did in :numref:`ch_pooling_cpu`, we use
``te.schedule.AutoInlineInjective(sch)`` for this. Essentially,
``AutoInlineInjective`` traverses the stages of the schedule and inlines all
stages that are injectively fusable, leaving a single stage with a three-level
nested loop over channel, input height, and input width. In addition, we fuse
the first two axes into one and parallelize it across multiple threads, and we
vectorize the innermost axis. The resulting scheduling scheme looks a lot like
the schedule of ``max_pooling`` in :numref:`ch_pooling_cpu`.

.. code:: python

    def optimized_bn(size):
        sch, (X, Mean, Var, Gamma, Beta, Y) = default_bn(size)
        te.schedule.AutoInlineInjective(sch)
        c, h, w = Y.op.axis[0:3]
        fused = sch[Y].fuse(c, h)
        sch[Y].parallel(fused)
        sch[Y].vectorize(w)
        return sch, (X, Mean, Var, Gamma, Beta, Y)

    sch, args = optimized_bn(size)
    print(tvm.lower(sch, args, simple_mode=True))


.. parsed-literal::
    :class: output

    produce T_add {
      parallel (ax0.ax1.fused, 0, 896) {
        T_add[ramp((ax0.ax1.fused*28), 1, 28)] = ((((X[ramp((ax0.ax1.fused*28), 1, 28)] - x28(Mean[floordiv(ax0.ax1.fused, 28)]))/x28(sqrt((Var[floordiv(ax0.ax1.fused, 28)] + 1e-05f))))*x28(Gamma[floordiv(ax0.ax1.fused, 28)])) + x28(Beta[floordiv(ax0.ax1.fused, 28)]))
      }
    }
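Before benchmarking, it is worth checking that the optimized schedule still
computes the correct result. The snippet below is a minimal sanity check: it
builds the optimized module, runs it on the data returned by
``d2ltvm.get_bn_data``, and compares the output against a plain NumPy
computation of batch normalization. It assumes the per-channel mean, variance,
gamma and beta tensors broadcast over the ``(channel, height, width)`` layout,
and uses the :math:`10^{-5}` epsilon that appears in the IR above; ``np_bn``
is just a helper defined here for the comparison.

.. code:: python

    # Sanity check (a sketch): run the optimized schedule and compare its
    # output against a NumPy reference of batch normalization. We assume the
    # mean/var/gamma/beta tensors from get_bn_data broadcast over (c, h, w).
    sch, args = optimized_bn(size)
    mod = tvm.build(sch, args, target)
    ctx = tvm.context(target, 0)
    data, mean, var, gamma, beta, out = d2ltvm.get_bn_data(
        size[0], size[1], lambda x: tvm.nd.array(x, ctx=ctx))
    mod(data, mean, var, gamma, beta, out)

    def np_bn(x, mean, var, gamma, beta, eps=1e-5):
        # reshape the per-channel statistics so they broadcast over (c, h, w)
        r = lambda a: a.asnumpy().reshape(-1, 1, 1)
        return (x.asnumpy() - r(mean)) / np.sqrt(r(var) + eps) * r(gamma) + r(beta)

    np.testing.assert_allclose(out.asnumpy(), np_bn(data, mean, var, gamma, beta),
                               rtol=1e-5, atol=1e-5)

If the assertion passes, stage fusion, parallelization and vectorization have
not changed the numerical result.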
Benchmarking
------------

As usual, we first define a method to benchmark batch normalization in TVM.

.. code:: python

    # Save to the d2ltvm package.
    def bench_bn_tvm(func, sizes, target):
        """Benchmark batch normalization in TVM

        func : the scheduling method
        sizes : the data size list, each of which is a (channel, input_hw) tuple
        target : the TVM target, e.g. llvm or cuda
        """
        def workload(nrepeats):
            timer = mod.time_evaluator(mod.entry_name, ctx=ctx, number=nrepeats)
            return timer(data, mean, var, gamma, beta, out).mean * nrepeats
        times = []
        for size in sizes:
            sch, args = func(size)
            mod = tvm.build(sch, args, target)
            ctx = tvm.context(target, 0)
            data, mean, var, gamma, beta, out = d2ltvm.get_bn_data(
                size[0], size[1], lambda x: tvm.nd.array(x, ctx=ctx))
            times.append(d2ltvm.bench_workload(workload))
        return np.array(times)

Then, we use MXNet as the baseline and define the method to benchmark its
performance.

.. code:: python

    # Save to the d2ltvm package.
    def bn_timer_mxnet(c, n, ctx):
        """Benchmark batch normalization in MXNet

        c : channels
        n : input width and height
        ctx : compute ctx, e.g., cpu or gpu
        """
        timer = timeit.Timer(
            setup='import d2ltvm\n'
            'import mxnet as mx\n'
            'c, n = %d, %d\n'
            'data, mean, var, gamma, beta, out = d2ltvm.get_bn_data_mxnet(\n'
            '    c, n, "%s")'%(c, n, ctx),
            stmt='d2ltvm.batch_norm_mxnet(data, mean, var, gamma, beta, out);'
            'out.wait_to_read()')
        return timer.timeit

    # Save to the d2ltvm package.
    def bench_bn_mxnet(sizes, ctx='cpu'):
        """Return the execution times of MXNet batch norm"""
        return [d2ltvm.bench_workload(bn_timer_mxnet(c, n, ctx)) for c, n in sizes]

Finally, we benchmark a range of channel numbers (with the input size fixed to
:math:`28 \times 28`) and plot the results of the MXNet baseline, the default
schedule, and the optimized schedule.

.. code:: python

    channels = 2**np.arange(3, 9, 1)
    # a list of (c, n)
    sizes = [(int(c), 28) for c in channels]
    mxnet_times = bench_bn_mxnet(sizes)
    default_times = bench_bn_tvm(default_bn, sizes, target)
    optimized_times = bench_bn_tvm(optimized_bn, sizes, target)
    times = [mxnet_times, default_times, optimized_times]
    d2ltvm.plot(channels, times, ylabel='s', xscale='log', yscale='log',
                legend=['mxnet', 'default', 'optimized'],
                fmts=['--']*(len(times)-1)+['-'])



.. figure:: output_batch_norm_3fb782_11_0.svg


The diagram shows several things. First, for small operators like this,
MXNet's execution time is dominated by the function invocation overhead
(:numref:`ch_call_overhead`). Second, TVM's invocation overhead is small, so
even the default schedule outperforms the MXNet baseline. Third, after fusing
the stages and applying proper parallelization and vectorization, we obtain
much better performance for batch normalization.

Summary
-------

-  Like pooling, optimizing batch normalization on CPU is all about stage
   fusion and parallelization/vectorization.