1.3. Neural Network Inference

You have seen how to implement and compile a simple vector addition operator in Section 1.2. Now we will make a big jump: compiling a whole pre-trained neural network, which consists of a set of operators, to run inference.

import numpy as np
import mxnet as mx
from PIL import Image
import tvm
from tvm import relay

Here three additional modules are imported compared to Section 1.2. We will use PIL to read images, mxnet to obtain pre-trained neural networks, and the relay module [Roesch et al., 2019] in TVM to convert and optimize a neural network. Relay is TVM's high-level intermediate representation (IR) for representing a neural network.

1.3.1. Obtaining Pre-trained Models

A pre-trained model is a neural network whose parameters have been trained on a dataset. Here we download and load a ResNet-18 model by specifying pretrained=True from MXNet's model zoo [Chen et al., 2015]. If you want to know more about this model, please refer to Chapter 7.6 in D2L. You can find more models on the MXNet model zoo page, or refer to GluonCV and GluonNLP for more computer vision and natural language models.

model = mx.gluon.model_zoo.vision.resnet18_v2(pretrained=True)
len(model.features), model.output
(13, Dense(512 -> 1000, linear))

The loaded model is trained on the ImageNet 1K dataset, which contains around 1 million natural object images in 1000 classes. The model has two parts: the main body model.features contains 13 blocks, and the output layer is a dense layer with 1000 outputs.
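To see what these blocks are, we can iterate over model.features and print each block's type; a quick sketch (the block order follows MXNet's resnet18_v2 definition):

# Peek at the 13 blocks in the model body
for i, block in enumerate(model.features):
    print(i, type(block).__name__)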

The following code block loads the text labels for each class in the ImageNet dataset.

# The file contains a Python dict literal mapping class index to label
with open('../data/imagenet1k_labels.txt') as f:
    labels = eval(f.read())
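If you prefer not to call eval on file contents, ast.literal_eval from the standard library parses the same dict literal safely; a minimal sketch, assuming the file holds a plain literal:

# Safer alternative to eval for parsing the dict literal
import ast
with open('../data/imagenet1k_labels.txt') as f:
    labels = ast.literal_eval(f.read())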

1.3.2. Pre-processing Data

We first read a sample image and resize it to 224×224 pixels, the input size on which the neural network was trained.

image = Image.open('../data/cat.jpg').resize((224, 224))
image
(Output: the resized cat image.)

According to the model zoo page, image pixels are normalized on each color channel, and the data layout is (batch, RGB channels, height, width). The following function transforms the input image to satisfy these requirements.

# Save to the d2ltvm package
def image_preprocessing(image):
    # Subtract the per-channel mean and divide by the per-channel std
    image = np.array(image) - np.array([123., 117., 104.])
    image /= np.array([58.395, 57.12, 57.375])
    # Convert from (height, width, channel) to (channel, height, width)
    image = image.transpose((2, 0, 1))
    # Add the batch dimension
    image = image[np.newaxis, :]
    return image.astype('float32')

x = image_preprocessing(image)
x.shape
(1, 3, 224, 224)

1.3.3. Compiling Pre-trained Models

To compile a model, we first express the MXNet model in Relay IR, which the from_mxnet method does for us. When calling this method, we provide the model together with the input data shape. Some neural networks may require some dimension(s) of the data shape to be determined later. However, in the ResNet model the data shape is fixed, which makes it easier for the compiler to achieve high performance. We will mostly stick to fixed data shapes in this book, and only touch on dynamic data shapes (i.e. some dimension(s) determined at runtime) in much later chapters.

relay_mod, relay_params = relay.frontend.from_mxnet(model, {'data': x.shape})
type(relay_mod), type(relay_params)
(tvm.ir.module.IRModule, dict)

This method returns the program relay_mod, which is a Relay module, and a dictionary of parameters relay_params that maps string keys to TVM ndarrays. Next, we lower the module to a lower-level IR that can be consumed by the llvm backend. LLVM defines an IR that has been adopted by multiple programming languages; the LLVM compiler is then able to compile the generated programs into machine code for CPUs. We already used it to compile the vector addition operator in Section 1.2, even though we did not specify it explicitly.
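Before compiling, we can take a quick look at the converted module and its parameters; a minimal sketch (the parameter key names are generated by the converter):

# Print the beginning of the Relay IR text form and one parameter shape
print(str(relay_mod)[:300])
key = next(iter(relay_params))
print(key, relay_params[key].shape)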

In addition, we set the optimization level to the highest level, 3. You may get warning messages that not every operator is well optimized; you can ignore them for now, as we will get back to this later.

target = 'llvm'
with relay.build_config(opt_level=3):
    graph, mod, params = relay.build(relay_mod, target, params=relay_params)
Cannot find config for target=llvm, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.

The compiled module has three parts: graph is a JSON string describing the neural network, mod is a library that contains all compiled operators used to run the inference, and params is a dictionary mapping parameter names to weights.

type(graph), type(mod), type(params)
(str, tvm.runtime.module.Module, dict)
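Since graph is plain JSON, we can parse it to see how the network is described; a quick sketch (the key names follow the graph runtime's JSON format):

import json
g = json.loads(graph)
print(list(g.keys()))   # e.g. nodes, arg_nodes, heads, attrs, ...
print(len(g['nodes']))  # one entry per operator or parameter node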

You can view mod as a TVM module like the one we saw in Section 1.2.

1.3.4. Inference

Now we can create a runtime to run the model inference, namely the forward pass of the neural network. Creating the runtime needs the neural network definition in JSON (i.e. graph), the library containing the machine code of the compiled operators (i.e. mod), and a device context that can be constructed from the target. The device is CPU here, specified by llvm. Next we load the parameters with set_input and run the workload by feeding the input data. Since this network has a single output layer, we can obtain it, a matrix of shape (1, 1000), by get_output(0). The final output is a NumPy vector of length 1000.

ctx = tvm.context(target)
rt = tvm.contrib.graph_runtime.create(graph, mod, ctx)
rt.set_input(**params)
rt.run(data=tvm.nd.array(x))
scores = rt.get_output(0).asnumpy()[0]
scores.shape
(1000,)

The vector contains the predicted confidence score for each class. Note that the pre-trained model doesn’t have the softmax operator, so these scores are not mapped into probabilities in (0, 1). Now we can find the two largest scores and report their labels.

a = np.argsort(scores)[-1:-3:-1]  # indices of the two largest scores
labels[a[0]], labels[a[1]]
('tiger cat', 'Egyptian cat')
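If probabilities are needed, we can apply the softmax ourselves; a minimal, numerically stable sketch:

# Map the raw scores to probabilities in (0, 1)
exp = np.exp(scores - scores.max())
probs = exp / exp.sum()
print(probs[a[0]], probs.sum())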

1.3.5. Saving the Compiled Library

We can save the output of relay.build to disk to reuse it later. The following code block saves the JSON string, the library, and the parameters.

!rm -rf resnet18*

name = 'resnet18'
graph_fn, mod_fn, params_fn = [name+ext for ext in ('.json','.tar','.params')]
mod.export_library(mod_fn)
with open(graph_fn, 'w') as f:
    f.write(graph)
with open(params_fn, 'wb') as f:
    f.write(relay.save_param_dict(params))

!ls -alht resnet18*
-rw-r--r-- 1 jenkins jenkins  45M Oct 13 10:45 resnet18.params
-rw-r--r-- 1 jenkins jenkins  28K Oct 13 10:45 resnet18.json
-rw-r--r-- 1 jenkins jenkins 157K Oct 13 10:45 resnet18.tar

Now we load the saved module back.

loaded_graph = open(graph_fn).read()
loaded_mod = tvm.runtime.load_module(mod_fn)
loaded_params = open(params_fn, "rb").read()

Then we construct the runtime as before to verify the results.

loaded_rt = tvm.contrib.graph_runtime.create(loaded_graph, loaded_mod, ctx)
loaded_rt.load_params(loaded_params)
loaded_rt.run(data=tvm.nd.array(x))
loaded_scores = loaded_rt.get_output(0).asnumpy()[0]
tvm.testing.assert_allclose(loaded_scores, scores)
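With the runtime rebuilt, we can also get a rough idea of the inference latency; a sketch using Python's timeit (numbers will vary with hardware):

import timeit
data = tvm.nd.array(x)
t = timeit.timeit(lambda: loaded_rt.run(data=data), number=10) / 10
print('%.1f ms per inference' % (t * 1000))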

1.3.6. Summary

  • We can use TVM's relay module to convert and compile a neural network into a module for model inference.

  • We can save the compiled module to disk to facilitate future deployment.

1.3.7. Discussions