.. _ch_call_overhead:

Function Call Overhead
======================

Starting from this chapter, we will benchmark various schedules. Before diving into the execution time numbers, we need to be aware of the overhead of issuing a function call in Python. It is well known that Python is not the fastest language on earth. Prototyping in Python is fast, but we pay a price for the conveniences Python provides under the hood. In this section, we will investigate the overhead of calling a function in Python and demonstrate its impact on our later benchmarking results.

Execution Time Measurement
--------------------------

In Python, we often use the ``timeit`` module to benchmark a workload, especially when its execution time is well below 1 second. Note that Jupyter has a built-in magic command ``%timeit`` that is simpler to use, but it cannot be used inside a function that we save for later reuse, so we will use ``timeit`` directly.

.. code:: python

    import timeit
    import numpy as np
    from matplotlib import pyplot as plt
    from IPython import display

The following code returns the total execution time of creating a ``[4, 4]`` zero matrix 10 times.

.. code:: python

    timer = timeit.Timer(setup='import numpy as np',
                         stmt='np.zeros((4, 4))')
    timer.timeit(10)

.. parsed-literal::
    :class: output

    2.514570951461792e-05

We can see that the above workload finishes within tens of microseconds, so we need to increase the number of repeats to obtain a relatively accurate execution time. The following function determines the number of repeats needed to run a workload for at least 1 second, and then returns the average execution time.

.. code:: python

    # Save to the d2ltvm package.
    def bench_workload(workload):
        """Benchmark a workload

        workload: a method that accepts a num_repeat argument
        and returns its total execution time
        """
        workload(1)  # warmup
        time = workload(1)  # the time to run once
        if time > 1: return time
        # The number of repeats to measure at least 1 second
        num_repeats = max(int(1.0 / time), 5)
        return workload(num_repeats) / num_repeats

    print('time for np.zeros((4, 4)):', bench_workload(timer.timeit))

.. parsed-literal::
    :class: output

    time for np.zeros((4, 4)): 2.6995336035230484e-07

We can see that the newly measured execution time of ``np.zeros((4, 4))`` is much smaller than the one we measured first, even after dividing the latter by its 10 repeats, because the first measurement includes the warm-up time.

The Non-Negligible Overhead
---------------------------

Now we benchmark the ``copyto`` method, which copies the contents of one ndarray into another, on ndarrays of various sizes.

.. code:: python

    def np_setup(n):
        return 'import numpy as np\n' \
               'x = np.random.normal(size=%d).astype("float32")\n' \
               'y = np.empty_like(x)\n' % n

    def np_copy(n):
        return timeit.Timer(setup=np_setup(n),
                            stmt='np.copyto(y, x)')

    sizes = 2**np.arange(5, 20, 4).astype('int')
    exe_times = [bench_workload(np_copy(n).timeit) for n in sizes]

Let's plot the throughput versus the vector length.

.. code:: python

    display.set_matplotlib_formats('svg')
    # one float32 takes 4 bytes
    plt.loglog(sizes, sizes*4/exe_times/1e9)
    plt.xlabel('Vector length')
    plt.ylabel('Throughput (GB/sec)')
    plt.grid()

.. figure:: output_call_overhead_151f90_9_0.svg

We can see that the throughput first increases and then plateaus. The plateau reflects the saturation of the memory bandwidth. However, the behavior for small vectors is not what we would expect: when the vector is short, all data should fit in the cache, which should result in a much higher throughput.
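Before examining the reason, we can already get a rough feel for the two regimes from the measurements above. The following cell is only a back-of-the-envelope sketch that reuses the ``sizes`` and ``exe_times`` variables from the previous cell; the exact values depend on the machine. The time of the smallest copy is dominated by the fixed per-call cost, while the throughput of the largest copy approximates the attainable memory bandwidth.

.. code:: python

    # Rough estimates derived from the measurements above; the exact numbers
    # depend on the machine running this notebook.
    overhead = exe_times[0]                    # smallest copy: mostly per-call cost
    bandwidth = sizes[-1] * 4 / exe_times[-1]  # largest copy: bytes copied per second
    print('rough per-call overhead: %.2f microsecond' % (overhead * 1e6))
    print('rough peak bandwidth   : %.1f GB/sec' % (bandwidth / 1e9))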
To examine the reason, let's plot the execution time versus the vector length instead.

.. code:: python

    plt.loglog(sizes, exe_times)
    plt.xlabel('Vector length')
    plt.ylabel('Time (sec)')
    plt.grid()

.. figure:: output_call_overhead_151f90_11_0.svg

We can see that when the vector length is smaller than :math:`10^3`, the execution time barely decreases as the vector length shrinks. This is because when the vector is short, the actual ``copyto`` execution time is so small that the total execution time is dominated by the function call overhead. The overhead includes any argument preprocessing in the Python function, invoking the foreign function interface, and other Python backend overhead. Therefore, benchmarking workloads that are too small is not very meaningful.

Overhead of NumPy, TVM and MXNet
--------------------------------

Throughout the book, we will benchmark various operators in NumPy, TVM and MXNet. Let's examine their function call overheads. We can roughly estimate the overhead by executing small workloads whose actual execution time is negligible compared to the overhead. Let's first check NumPy.

.. code:: python

    sizes = 2**np.arange(1, 8).astype('int')
    exe_times = np.array([bench_workload(np_copy(n).timeit) for n in sizes])

    print('NumPy call overhead: %.1f microsecond' % (exe_times.mean()*1e6))

.. parsed-literal::
    :class: output

    NumPy call overhead: 0.6 microsecond

The overhead of TVM is in the same order of magnitude; in this run it is even slightly lower.

.. code:: python

    def tvm_copy(n):
        return timeit.Timer(setup=np_setup(n)+'import tvm\n'
                            'x, y = tvm.nd.array(x), tvm.nd.array(y)',
                            stmt='x.copyto(y)')

    tvm_times = np.array([bench_workload(tvm_copy(n).timeit) for n in sizes])
    print('TVM call overhead: %.1f microsecond' % (tvm_times.mean()*1e6,))

.. parsed-literal::
    :class: output

    TVM call overhead: 0.2 microsecond

Compared to NumPy and TVM, MXNet has a substantially higher overhead. One reason might be that MXNet uses ``ctypes`` for its foreign function interface while TVM is compiled with ``cython``; another is that MXNet uses lazy evaluation, which brings additional overhead.

.. code:: python

    def mx_copy(n):
        return timeit.Timer(setup=np_setup(n)+'import mxnet as mx\n'
                            'x, y = mx.nd.array(x), mx.nd.array(y)\n'
                            'mx.nd.waitall()',
                            stmt='x.copyto(y); y.wait_to_read()')

    mx_times = np.array([bench_workload(mx_copy(n).timeit) for n in sizes])
    print('MXNet call overhead: %.1f microsecond' % (mx_times.mean()*1e6,))

.. parsed-literal::
    :class: output

    MXNet call overhead: 37.4 microsecond

Summary
-------

- Warming up a method by running it a few times beforehand is good practice when measuring its execution time; the short sketch below illustrates the effect.
- The function call overhead may take several microseconds. Benchmarking functions that are too small in Python is not meaningful.
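To make the first point concrete, here is a minimal sketch that compares the very first call of a workload with the average over subsequent, warmed-up calls. It is only an illustration; the size of the gap depends on the workload, the machine, and what has already been executed in the session.

.. code:: python

    import timeit

    # Compare a cold first call with the average over warmed-up calls.
    # This is why bench_workload discards its first run as a warm-up.
    timer = timeit.Timer(setup='import numpy as np', stmt='np.zeros((4, 4))')
    cold = timer.timeit(1)          # first call may include warm-up effects
    warm = timer.timeit(100) / 100  # average over already warmed-up calls
    print('cold call: %.2f microsecond' % (cold * 1e6))
    print('warm call: %.2f microsecond' % (warm * 1e6))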