1.2. Vector Add
Now that you have installed all the libraries, let’s write our first program: summing two n-dimensional vectors a and b. It’s straightforward in NumPy, where we can do it by c = a + b.
1.2.1. Implementing with NumPy
import numpy as np
np.random.seed(0)
n = 100
a = np.random.normal(size=n).astype(np.float32)
b = np.random.normal(size=n).astype(np.float32)
c = a + b
Here we create two random vectors of length 100 and sum them element-wise. Note that NumPy uses 64-bit floating-point numbers or 64-bit integers by default, which differs from the 32-bit floating-point numbers typically used in deep learning, so we explicitly cast the data type to float32.
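To make the precision difference concrete, here is a small sketch that stores one value as a 32-bit and as a 64-bit float and compares the storage size and the round-tripped value. It deliberately uses Python’s standard struct module instead of NumPy so that it runs without extra dependencies; the helper to_float32 is a hypothetical name introduced only for this illustration.

```python
import struct

def to_float32(x: float) -> float:
    """Hypothetical helper: round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Storage size in bytes of a single element: 'f' is float32, 'd' is float64.
size32 = struct.calcsize('f')
size64 = struct.calcsize('d')

x = 0.1
x32 = to_float32(x)
print(size32, size64)  # 4 8
print(x32 == x)        # False: float32 keeps fewer mantissa bits than float64
print(x32)             # 0.10000000149011612
```

For a vector of n elements this means 4n instead of 8n bytes, at the cost of the small rounding error visible above.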
Although we can use the built-in + operator in NumPy to perform element-wise addition, let’s try to implement it using only scalar operators. This will help us understand the implementation with TVM. The following function uses a for-loop to iterate over every element of the vectors and adds the two elements together with the scalar + operator each time.
def vector_add(a, b, c):
    # n is the vector length defined above; add one pair of scalars per iteration
    for i in range(n):
        c[i] = a[i] + b[i]
d = np.empty(shape=n, dtype=np.float32)
vector_add(a, b, d)
np.testing.assert_array_equal(c, d)
Since we will frequently create two random ndarrays and an empty one to store the results in the following chapters, we save this routine for reuse.
# Save to the d2ltvm package.
def get_abc(shape, constructor=None):
    """Return random a, b and empty c with the same shape.
    """
    np.random.seed(0)  # fixed seed for reproducible comparisons
    a = np.random.normal(size=shape).astype(np.float32)
    b = np.random.normal(size=shape).astype(np.float32)
    c = np.empty_like(a)  # uninitialized buffer for the results
    if constructor:
        # Optionally convert all three arrays, e.g. into TVM ndarrays
        a, b, c = [constructor(x) for x in (a, b, c)]
    return a, b, c
Note that we fix the random seed so that we always get the same results, which makes it easier to compare NumPy, TVM and other frameworks. In addition, the function accepts an optional constructor to convert the data into a different format.
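The two properties of get_abc described above, reproducibility from a fixed seed and an optional constructor applied to every array, can be illustrated without NumPy. The sketch below is a hypothetical stdlib analogue named get_abc_lists; it is not part of the d2ltvm package, and it builds Python lists from random.Random(0) instead of NumPy arrays.

```python
import random

def get_abc_lists(n, constructor=None):
    """Hypothetical stdlib analogue of get_abc: random a, b and a zero-filled c of length n."""
    rng = random.Random(0)  # fixed seed, so every call yields the same a and b
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    c = [0.0] * n  # result buffer (np.empty_like would leave it uninitialized instead)
    if constructor:
        # Convert all three containers to another format, as get_abc does
        a, b, c = [constructor(x) for x in (a, b, c)]
    return a, b, c

a1, b1, _ = get_abc_lists(5)
a2, b2, _ = get_abc_lists(5)
print(a1 == a2 and b1 == b2)  # True: same seed, same data

a3, b3, c3 = get_abc_lists(5, constructor=tuple)
print(type(a3).__name__)  # tuple
```

Using a local random.Random instance keeps the sketch from disturbing the global random state; the original get_abc reseeds NumPy’s global generator instead.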
1.2.2. Defining the TVM Computation
Now let’s implement vector_add in TVM. The TVM implementation differs from the one above in two ways:
1. We don’t need to write the complete function; we only specify how each element of the output, i.e. c[i], is computed.
2. TVM is symbolic: we create symbolic variables by specifying their shapes, and define how the program will be computed.
In the following program, we first declare the placeholders A and B for both inputs by specifying their shapes, (n,), through tvm.te.placeholder. Both A and B are Tensor objects, into which we can feed data later. We assign names to them so that the printed program is easy to read.
Next we define how the output C is computed by te.compute. It accepts two arguments: the output shape, and a function that computes each element given its index. Since the output is a vector, its elements are indexed by integers. The lambda function passed to te.compute accepts a single argument i and returns the expression A[i] + B[i] for c[i], which corresponds to c[i] = a[i] + b[i] in the earlier NumPy-based vector_add. One difference is that we don’t write the for-loop; TVM fills it in later.
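To make the idea of supplying only a per-element rule concrete, here is a toy pure-Python model. The function toy_compute is hypothetical and far simpler than te.compute: it takes a 1-D output shape and a rule for element i, and generates the loop itself. Unlike TVM, it evaluates eagerly on concrete lists rather than building a symbolic expression.

```python
def toy_compute(shape, fcompute):
    """Hypothetical, eager stand-in for te.compute on 1-D shapes.

    The caller only says how element i is computed; the loop over
    range(shape[0]) is supplied here, not by the caller.
    """
    (n,) = shape
    return [fcompute(i) for i in range(n)]

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = toy_compute((len(a),), lambda i: a[i] + b[i])
print(c)  # [11.0, 22.0, 33.0]
```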
import tvm
from tvm import te # te stands for tensor expression
# Save to the d2ltvm package.
def vector_add(n):
    """TVM expression for vector add"""
    A = te.placeholder((n,), name='a')
    B = te.placeholder((n,), name='b')
    # Describe only how element i of the output is computed
    C = te.compute(A.shape, lambda i: A[i] + B[i], name='c')
    return A, B, C
A, B, C = vector_add(n)
type(A), type(C)
(tvm.te.tensor.Tensor, tvm.te.tensor.Tensor)
We can see that A, B, and C are all Tensor objects, which can be viewed as a symbolic version of NumPy’s ndarray. We can access their attributes, such as the data type and shape, but the tensors do not hold any concrete data yet.
(A.dtype, A.shape), (C.dtype, C.shape)
(('float32', [100]), ('float32', [100]))
The operation that generates the tensor object can be accessed by A.op.
type(A.op), type(C.op)
(tvm.te.tensor.PlaceholderOp, tvm.te.tensor.ComputeOp)
We can see that the types of the operations for A and C are different, but they share the same base class Operation, which represents an operation that generates a tensor object.
A.op.__class__.__bases__[0]
tvm.te.tensor.Operation
1.2.3. Creating a Schedule
To run the computation, we need to specify how to execute the program, for example, the order in which to access data and how to parallelize across multiple threads. Such an execution plan is called a schedule. Since C is the output tensor, let’s create a default schedule on its operation.
s = te.create_schedule(C.op)
A schedule consists of several stages. Each stage corresponds to an operation and describes how that operation is scheduled. We can access a particular stage by either s[C] or s[C.op].
type(s), type(s[C])
(tvm.te.schedule.Schedule, tvm.te.schedule.Stage)
Later on we will see how to change the execution plan to better utilize the hardware resources and improve efficiency. Here let’s look at the default execution plan by printing its C-like pseudocode.
tvm.lower(s, [A, B, C], simple_mode=True)
produce c {
for (i, 0, 100) {
c[i] = (a[i] + b[i])
}
}
The lower method accepts the schedule and the input and output tensors. Setting simple_mode=True prints the program in a simple and compact form. Note that the program has added proper for-loops according to the output shape. Overall, it’s quite similar to the previous function vector_add.
Now you see that TVM separates the computation from the schedule. The computation defines how the results are computed, which does not change regardless of the hardware platform you run the program on. An efficient schedule, on the other hand, is often hardware dependent, but changing a schedule does not affect correctness. TVM inherits the idea of separating computation from schedule from Halide [Ragan-Kelley et al., 2013].
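The separation can be sketched in plain Python. Below, the computation is a single per-element rule, and two hypothetical "schedules" execute it with different loop structures: a plain loop, and a loop split into outer and inner parts with a factor of 4 (similar in spirit to TVM’s split primitive). Both produce the same result, because a schedule changes how and in what order the work is done, not what is computed. The names compute_rule, schedule_default and schedule_split are illustrative only, not TVM APIs.

```python
def compute_rule(a, b, i):
    """The computation: what c[i] is, independent of any loop order."""
    return a[i] + b[i]

def schedule_default(a, b, c):
    # One loop over all elements, in order
    for i in range(len(c)):
        c[i] = compute_rule(a, b, i)

def schedule_split(a, b, c, factor=4):
    # Split the loop into an outer loop over blocks and an inner loop within a block
    n = len(c)
    for outer in range((n + factor - 1) // factor):  # ceil(n / factor) blocks
        for inner in range(factor):
            i = outer * factor + inner
            if i < n:  # guard the last, possibly partial, block
                c[i] = compute_rule(a, b, i)

a = [float(i) for i in range(10)]
b = [2.0 * i for i in range(10)]
c1 = [0.0] * 10
c2 = [0.0] * 10
schedule_default(a, b, c1)
schedule_split(a, b, c2)
print(c1 == c2)  # True
```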
1.2.4. Compilation and Execution
Once both the computation and the schedule are defined, we can compile them into an executable module with tvm.build. It accepts the same arguments as tvm.lower. In fact, it first calls tvm.lower to generate the program and then compiles it to machine code.
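As a rough analogy only, the two phases of tvm.build can be mimicked in pure Python: a hypothetical toy_lower emits loop source code specialized to a fixed length, and a hypothetical toy_build compiles that text with Python’s built-in compile and exec to obtain a callable. TVM generates real machine code through its own backends; this sketch only illustrates the "generate a program, then compile it" split.

```python
def toy_lower(n):
    """Hypothetical lowering step: emit the source of a loop specialized to length n."""
    return (
        "def vector_add(a, b, c):\n"
        f"    for i in range({n}):\n"
        "        c[i] = a[i] + b[i]\n"
    )

def toy_build(n):
    """Hypothetical build step: lower to source text, then compile it into a function."""
    source = toy_lower(n)
    code = compile(source, "<toy_lower>", "exec")
    namespace = {}
    exec(code, namespace)
    return namespace["vector_add"]

print(toy_lower(3))
fn = toy_build(3)
c = [0, 0, 0]
fn([1, 2, 3], [4, 5, 6], c)
print(c)  # [5, 7, 9]
```

Because the length is baked into the generated loop, the resulting function is specialized to n, much like the fixed input shapes of the compiled TVM module.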
mod = tvm.build(s, [A, B, C])
type(mod)
tvm.runtime.module.Module
It returns an executable module object. Now we can feed data for A, B and C to run it. The tensor data must be tvm.runtime.ndarray.NDArray objects. The easiest way is to create NumPy ndarray objects first and then convert them into TVM ndarrays with tvm.nd.array. We can convert them back to NumPy with the asnumpy method.
x = np.ones(2)
y = tvm.nd.array(x)
type(y), y.asnumpy()
(tvm.runtime.ndarray.NDArray, array([1., 1.]))
Now let’s construct data and return them as TVM ndarrays.
a, b, c = get_abc(100, tvm.nd.array)
Do the computation, and verify the results.
mod(a, b, c)
np.testing.assert_array_equal(a.asnumpy() + b.asnumpy(), c.asnumpy())
1.2.5. Argument Constraints
Remember that we specified both inputs to be vectors of length 100 when declaring A and B.
A.shape, B.shape, C.shape
([100], [100], [100])
The compiled module checks at runtime whether the input shapes satisfy this specification.
try:
    # Inputs of length 200 violate the declared length of 100
    a, b, c = get_abc(200, tvm.nd.array)
    mod(a, b, c)
except tvm.TVMError as e:
    print(e)
Traceback (most recent call last):
[bt] (1) /var/lib/jenkins/miniconda3/envs/d2l-tvm-0/lib/python3.7/site-packages/tvm/libtvm.so(TVMFuncCall+0x61) [0x7f8eec7a60f1]
[bt] (0) /var/lib/jenkins/miniconda3/envs/d2l-tvm-0/lib/python3.7/site-packages/tvm/libtvm.so(+0xcab5a1) [0x7f8eec78d5a1]
File "/home/ubuntu/tvm/src/runtime/library_module.cc", line 89
TVMError: Check failed: ret == 0 (-1 vs. 0) : Assert fail: (100 == int32(arg0.shape[0])), Argument arg0.shape[0] has an unsatisfied constraint
The default data type in TVM is float32.
A.dtype, B.dtype, C.dtype
('float32', 'float32', 'float32')
An error is raised if an input has a different data type.
try:
    a, b, c = get_abc(100, tvm.nd.array)
    # Replace a with a float64 copy, which violates the float32 declaration
    a = tvm.nd.array(a.asnumpy().astype('float64'))
    mod(a, b, c)
except tvm.TVMError as e:
    print(e)
Traceback (most recent call last):
[bt] (1) /var/lib/jenkins/miniconda3/envs/d2l-tvm-0/lib/python3.7/site-packages/tvm/libtvm.so(TVMFuncCall+0x61) [0x7f8eec7a60f1]
[bt] (0) /var/lib/jenkins/miniconda3/envs/d2l-tvm-0/lib/python3.7/site-packages/tvm/libtvm.so(+0xcab5a1) [0x7f8eec78d5a1]
File "/home/ubuntu/tvm/src/runtime/library_module.cc", line 89
TVMError: Check failed: ret == 0 (-1 vs. 0) : Assert fail: (((tvm_struct_get(arg0, 0, 5) == (uint8)2) && (tvm_struct_get(arg0, 0, 6) == (uint8)32)) && (tvm_struct_get(arg0, 0, 7) == (uint16)1)), arg0.dtype is expected to be float32
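Both errors above come from assertions in the compiled function that check each argument before computing. A hypothetical pure-Python wrapper enforcing the same two constraints might look like the sketch below. It uses array.array as a stand-in for NDArray, where typecode 'f' is a C float (32-bit on common platforms) and 'd' a C double; the function name and error messages are illustrative and only loosely modeled on TVM’s.

```python
from array import array

N = 100  # the length fixed when the computation was declared

def checked_vector_add(a, b, c):
    """Hypothetical stand-in for a compiled module: validate arguments, then compute."""
    for idx, arg in enumerate((a, b, c)):
        if len(arg) != N:
            raise ValueError(f"arg{idx}.shape[0] must be {N}, got {len(arg)}")
        if arg.typecode != 'f':
            raise TypeError(f"arg{idx}.dtype is expected to be float32 (typecode 'f')")
    for i in range(N):
        c[i] = a[i] + b[i]

a = array('f', [1.0] * N)
b = array('f', [2.0] * N)
c = array('f', [0.0] * N)
checked_vector_add(a, b, c)  # passes both checks

errors = []
for bad_a in (array('f', [1.0] * 200), array('d', [1.0] * N)):
    try:
        checked_vector_add(bad_a, b, c)
    except (ValueError, TypeError) as e:
        errors.append(type(e).__name__)
print(errors)  # ['ValueError', 'TypeError']
```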
1.2.6. Saving and Loading a Module
A compiled module can be saved to disk,
mod_fname = 'vector-add.tar'
mod.export_library(mod_fname)
and then loaded back later.
loaded_mod = tvm.runtime.load_module(mod_fname)
Verify the results.
a, b, c = get_abc(100, tvm.nd.array)
loaded_mod(a, b, c)
np.testing.assert_array_equal(a.asnumpy() + b.asnumpy(), c.asnumpy())
1.2.7. Summary
Implementing an operator using TVM has three steps:
Declare the computation by specifying input and output shapes and how each output element is computed.
Create a schedule to (hopefully) fully utilize the machine resources.
Compile to the hardware target.
In addition, we can save the compiled module to disk and load it back later.