Optimization of Edge based Inference Pipeline for Weed Control

May 7, 2021

This article covers the optimizations explored for an AI-enabled, edge-based weed controller running on the NVIDIA Jetson AGX Xavier. The experiments include TensorRT quantization and calibration, benchmarking of inference pipelines based on YolactEdge and Bonnetal models, and enhancements to the non-model parts of the pipeline.

An Omdena team of 40 members from across the globe collaborated on the WeedBot challenge: optimizing a real-time computer vision application on the NVIDIA Jetson AGX Xavier that helps eliminate weeds in the field with a laser beam.

How did we achieve low latency and high precision targets in a span of 8 weeks?

U.S. Geological Survey (USGS) scientists report that glyphosate, known commercially by many trade names, and its degradation product AMPA (aminomethylphosphonic acid) are transported off-site from agricultural and urban sources and occur widely in the environment.

Chemical herbicides are commonly used in agriculture to remove weeds and increase crop yield. However, these chemicals have harmful effects on the environment: as much as 55% of the toxic residue is observed in soil, water, and the atmosphere. This affects life in water bodies and the surrounding plant and algal species that support agroecosystems, resulting in biodiversity loss.

Herbicides also have adverse effects on the people working on the crop. Traces of the herbicide remain in the crop and thereby enter the human body, causing toxicity.

Fig1: Effect of Chemical herbicides on the environment

To counter these effects, sustainable practices have gained prominence, and technology-enabled weed control can support organic farmers to a great extent by replacing manual work. This would facilitate pesticide-free food production and reduce the final price of such food, encouraging people to buy organic food and follow a healthy lifestyle.

For this task, the technical characteristics of the second laser weeding prototype developed by WeedBot were used.

The AI-based laser weeding machinery is built on the NVIDIA Jetson AGX Xavier, which enables high-performance edge AI applications. The system has a high-speed camera, and the module provides a 512-core GPU, an 8-core ARM CPU, and 32GB of memory. It runs Linux and provides 32 TeraOPS of compute performance in user-configurable 10/15/30W power profiles.

At just 100 x 87 mm, Jetson AGX Xavier offers big workstation performance, making it ideal for autonomous machines like delivery and logistics robots, factory systems, and large industrial UAVs.

Designed specifically for autonomous machines, Jetson AGX Xavier has the performance to handle obstacle detection algorithms critical to next-generation robots. It gives GPU workstation-class performance with up to an unparalleled 32 TeraOPS (TOPS) of peak compute and 750 Gbps of high-speed I/O in a compact form factor. NVIDIA’s rich set of AI tools and workflows enables developers to train and deploy neural networks quickly.

JetPack 4.4.1 is installed on the Jetson hardware. It includes the libraries needed for the pipeline: TensorRT 7.1.3, which supports INT8 calibration for quantized models; cuDNN, which provides high-performance primitives for deep learning frameworks; and CUDA 10.2.

NVIDIA’s TensorRT is a high-performance deep learning inference runtime library for image classification, segmentation, and object detection neural networks. TensorRT is built on CUDA, NVIDIA’s parallel programming model, and can optimize inference for models trained in all major deep learning frameworks. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.

A high-precision, low-latency inference pipeline is built using real-time instance and semantic segmentation frameworks, such as YolactEdge and Bonnetal, with a lightweight backbone such as MobileNet.

To learn more about YolactEdge and how it was used in this project, read here.

The camera on the hardware captures a high-frame-rate video stream from above the crop row. Images with a resolution of 3008x3008px are used for training the model, and each image contains multiple plant annotations. The images are labeled in the MS COCO annotation format using the image labeling tool CVAT, and Albumentations augmentations are applied to the image dataset to boost the performance of the network model.

The images were then annotated further to increase the number of plant classes. The dataset with the additional annotation classes gave a better mAP at inference than the initial annotations.
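When the 3008x3008px images are resized for the network (the validation results in this article use 550x550px), the COCO annotations have to be scaled by the same factors. The following is a minimal sketch, assuming plain COCO-style [x, y, width, height] boxes. The function name and the sample box are illustrative and not taken from the project; in practice an augmentation library such as Albumentations can apply the resize to images and annotations together.

```python
def scale_coco_bbox(bbox, src_size, dst_size):
    """Scale a COCO-style [x, y, width, height] box from src_size to dst_size.

    Sizes are (width, height) tuples in pixels.
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x, y, w, h = bbox
    return [x * sx, y * sy, w * sx, h * sy]


# Hypothetical plant annotation on a 3008x3008 px frame.
box = [1504.0, 752.0, 300.8, 601.6]
scaled = scale_coco_bbox(box, (3008, 3008), (550, 550))
print([round(v, 1) for v in scaled])
```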

mAP of validation Image 3008x3008px resized to 550x 550px

Let us dive into the pipeline optimizations and the benchmarking of the models explored. Broadly, enhancements were investigated in these areas:

  • TensorRT optimizations to the model
  • Partial code conversion to C++
  • Post-processing of images using CuPy

When optimizing deep learning models, there are different approaches that one can take such as pruning, quantization, and knowledge distillation to reduce the latency and size of the model with minimal loss in accuracy. These techniques are used to compress the size of the models and make the models faster.

Model pruning

The first approach, model pruning, reduces the size of the final neural network by reducing the number of parameters. The goal is to lower memory, latency, battery, and hardware consumption without sacrificing accuracy, so that lightweight models can be deployed on device. The well-known paper “The Lottery Ticket Hypothesis” by Jonathan Frankle and Michael Carbin shows that inside neural networks there exist sub-networks (“lottery tickets”) which, when trained in isolation, perform on par with the whole network. This indicates that not all connections in the network are important and that some can be omitted through iterative pruning. The resulting pruned model is lighter and faster, with little or no loss of accuracy. There are different types of pruning, such as iterative pruning, weight pruning, and neuron pruning.

Fig2: Removing one entire neuron. Left: the unpruned neural network with a neuron in red that is thought to be unnecessary. Right: the equivalent pruned neural network, whose computational complexity is 25% smaller
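As an illustration of weight pruning, the sketch below zeroes out the smallest-magnitude weights of a single layer with NumPy. It is a generic example under stated assumptions, not the project's method: the article does not report pruning experiments, and the function name, layer shape, and 50% sparsity are made up for the example.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out (roughly) the smallest-magnitude fraction `sparsity` of weights.

    Returns the pruned weights and the boolean keep-mask.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest absolute value is the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    # Weights at or below the threshold are pruned (ties may prune a few more).
    mask = np.abs(weights) > threshold
    return weights * mask, mask


rng = np.random.default_rng(0)
layer = rng.normal(size=(64, 32)).astype(np.float32)  # hypothetical layer
pruned, mask = magnitude_prune(layer, sparsity=0.5)
print(f"kept {mask.mean():.0%} of the weights")
```

In iterative pruning, a step like this alternates with retraining so the remaining weights can compensate for the removed ones.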

Model Quantization

By default, most deep learning libraries store the variables and weights of neural networks in FP32 (float32) precision. With INT8 precision, there is a 4x reduction in model size and a 4x reduction in memory bandwidth requirements, and INT8 models typically provide a 2x to 4x speedup compared to FP32 models. There are different ways to perform quantization, such as dynamic quantization, static quantization, and quantization-aware training (QAT). The post-training methods, dynamic and static quantization, are the simplest: a model is trained with FP32 precision, and its weights are then quantized to a lower precision such as FP16 or INT8 for prediction. There is some loss in accuracy associated with this type of quantization. In QAT, quantization is simulated during training, so there is little to no loss in accuracy with this approach.

Fig3: Model Quantization
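The 4x size reduction and the accuracy cost of post-training quantization can be seen in a small NumPy sketch. It assumes symmetric per-tensor quantization with a single scale factor, which is only one of several possible schemes; the function names and the random weight matrix are illustrative.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-tensor quantization of a float32 array to int8."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float32 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("size ratio:", weights.nbytes / q.nbytes)
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

The rounding error per weight is bounded by about half the scale factor, which is why a well-chosen scale matters for accuracy.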

Knowledge Distillation

The final approach for compressing neural networks is knowledge distillation, in which knowledge is transferred from a large teacher model to a small student model. The idea was introduced in the paper “Distilling the Knowledge in a Neural Network” by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. The basic approach is to first train a teacher, a large model that generalizes well on the dataset. A knowledge dataset is then created in which the predictions of the teacher model, rather than the ground truths, are used as targets. These targets are also called soft targets. Finally, a small student model is trained on this knowledge dataset so that its performance approaches that of the teacher model. This is another way to obtain small models, with little loss of accuracy, that are faster and more memory-efficient on edge devices.

Fig 4: The generic teacher-student framework for knowledge distillation.
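The soft-target idea can be sketched in NumPy as the cross-entropy between the teacher's temperature-softened predictions and the student's predictions. This is a simplified illustration of the soft-target term only; the temperature of 4.0 and the example logits are assumptions for the sketch, and the project did not report distillation experiments.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between teacher soft targets and student predictions."""
    soft_targets = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return float(-(soft_targets * student_log_probs).sum(axis=-1).mean())


teacher = np.array([[6.0, 2.0, 1.0]])
good_student = np.array([[5.5, 2.2, 0.9]])  # close to the teacher
poor_student = np.array([[1.0, 6.0, 2.0]])  # disagrees with the teacher

print("good student loss:", distillation_loss(good_student, teacher))
print("poor student loss:", distillation_loss(poor_student, teacher))
```

A student whose logits track the teacher's gets a lower loss, which is the signal used to train the smaller model.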

Nvidia TensorRT optimization

For this project, a quantization approach using the NVIDIA TensorRT library was explored. NVIDIA TensorRT is a C++ library optimized for high-performance inference. After a model is trained, TensorRT can use its ONNX, Caffe, or UFF parser to convert the saved model into a TensorRT network. The trained model therefore has to be saved in a supported format, such as ONNX or a TensorFlow model, before TensorRT can parse it. The parsed network is passed to the TensorRT build step, which builds an optimized inference engine based on various optimization parameters, including batch size, workspace size, mixed precision, and bounds on dynamic shapes. The supported precisions, depending on the device, are FP32, FP16, and INT8. If INT8 mode is used, TensorRT requires a calibration dataset, which is used to choose appropriate scaling factors for quantization. The optimized inference engine can then be saved in a serialized format.

Fig5: Onnx Workflow
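The parse, build, and serialize steps can be sketched with the TensorRT Python API roughly as follows. This is an outline under stated assumptions rather than the project's actual build script: it targets the TensorRT 7.x API that ships with JetPack 4.4.1 (later releases deprecate some of these calls), the function name and its parameters are hypothetical, and the INT8 calibrator is assumed to be supplied by the caller, for example as a subclass of trt.IInt8EntropyCalibrator2 that feeds batches from the calibration dataset.

```python
def build_engine(onnx_path, engine_path, use_fp16=True, calibrator=None,
                 workspace_bytes=1 << 30):
    """Parse an ONNX file, build a TensorRT engine, and serialize it to disk."""
    # Imported lazily: TensorRT is only available on machines where it is installed.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

    builder = trt.Builder(logger)
    network = builder.create_network(explicit_batch)
    parser = trt.OnnxParser(network, logger)

    # Step 1: parse the saved ONNX model into a TensorRT network definition.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parsing failed: " + "; ".join(errors))

    # Step 2: set the build options (workspace size and precision modes).
    config = builder.create_builder_config()
    config.max_workspace_size = workspace_bytes
    if use_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    if calibrator is not None:
        # INT8 mode needs a calibrator that iterates over the calibration dataset.
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calibrator

    # Step 3: build the optimized engine and save it in serialized form.
    engine = builder.build_engine(network, config)
    if engine is None:
        raise RuntimeError("TensorRT engine build failed")
    with open(engine_path, "wb") as f:
        f.write(engine.serialize())
    return engine
```

On the Jetson, a call such as build_engine("model.onnx", "model_fp16.engine") would produce an FP16 engine; both file names are placeholders.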

Experiments were conducted at image resolutions of 1920x1200px and 560x560px to observe the tradeoff between precision and latency for the trained networks. At the lower resolution, inference latency drops significantly because fewer computations are needed, while the higher resolution maintains higher precision.

Benchmarking of the trained models showed that the INT8-calibrated model gave the best results, with the lowest latency at 560x560px resolution.
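Latency figures of this kind are usually collected with a small timing harness that discards warm-up runs and reports summary statistics. The sketch below shows one way to do this with the Python standard library; the function names, run counts, and the dummy workload are assumptions for illustration and do not reproduce the project's benchmark setup.

```python
import statistics
import time


def benchmark(fn, warmup=10, runs=100):
    """Time fn() and return latency statistics in milliseconds."""
    for _ in range(warmup):  # warm-up runs are executed but not measured
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {
        "mean_ms": statistics.mean(latencies),
        "median_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
        "max_ms": latencies[-1],
    }


def dummy_inference():
    # Stand-in workload; replace with the real inference call.
    sum(i * i for i in range(10_000))


stats = benchmark(dummy_inference, warmup=5, runs=50)
print({k: round(v, 3) for k, v in stats.items()})
```

When the timed call launches GPU work asynchronously, it should block until that work has finished (for example by synchronizing the device) before the timer stops, otherwise the measured latency will be too low.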

Benchmark Results of Bonnetal Model


Let us move to the optimizations of the non-model parts of the pipeline.

Several tasks run on a CPU thread in parallel with the model inference. Converting these tasks to C++ routines gave a 3.25x speedup compared to the equivalent functionality in Python. A C++ library is the preferred choice for high-performance, low-latency applications such as ours, mainly because Python is an interpreted language, while C++ code is compiled to machine code and benefits from compiler optimizations.

Image Preprocessing

CUDA-enabled C++/Python libraries are used to preprocess images before they are fed into the inference network, for enhanced performance.

Part of the code was converted to CUDA C++ routines to make use of the GPU acceleration available on the hardware. However, GPU acceleration did not reduce the conversion time. This is mainly because the number of computations is small and moving the data to GPU memory adds overhead.

A pybind11 wrapper was developed so that the C++ routines can be called from Python scripts, since the rest of the pipeline is in Python, as is the case with most segmentation frameworks.

pybind11 is a lightweight header-only library that exposes C++ types in Python and vice versa, mainly to create Python bindings. pybind11 can map core C++ features to Python, such as functions accepting and returning custom data structures, instance methods and static methods, and overloaded functions. This compact library makes use of C++11 features (tuples, lambda functions), leading to simpler ways of writing Python bindings.

NumPy's ‘ndarray’ type is used to pass parameters between the Python and C++ modules. The pybind11 wrapper is built as an extension library and installed as a Python package, so the Python script can import it and use the routines seamlessly.

The pipeline also includes tasks that deal with post-processing of images. All of these operations were originally performed using the NumPy library. To reduce the pipeline latency further, the NumPy routines were converted to CuPy to use the CUDA operations available on the Jetson AGX Xavier.

However, because these calculations are small compared to mask generation, the CuPy conversion did not improve the latency.
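Because CuPy mirrors much of the NumPy array API, a post-processing routine can often be written once and run on either backend. The sketch below assumes a hypothetical post-processing step, computing the centroid of each thresholded mask, as a stand-in for the project's routines, which the article does not detail. It falls back to NumPy when CuPy or a CUDA device is not available.

```python
import numpy as np

try:
    import cupy as xp  # GPU arrays when CuPy is installed
    xp.zeros(1)        # fails early if no usable CUDA device is present
    ON_GPU = True
except Exception:
    xp = np            # CPU fallback with the same array API
    ON_GPU = False


def mask_centroids(masks, threshold=0.5):
    """Return the (row, col) centroid of each thresholded mask.

    masks: array of shape (N, H, W) with per-pixel scores in [0, 1].
    """
    m = xp.asarray(masks) > threshold  # copies to the GPU when xp is CuPy
    _, h, w = m.shape
    area = xp.maximum(m.sum(axis=(1, 2)), 1)  # avoid division by zero
    rows = xp.arange(h, dtype=xp.float32).reshape(1, h, 1)
    cols = xp.arange(w, dtype=xp.float32).reshape(1, 1, w)
    cy = (m * rows).sum(axis=(1, 2)) / area
    cx = (m * cols).sum(axis=(1, 2)) / area
    out = xp.stack([cy, cx], axis=1)
    return xp.asnumpy(out) if ON_GPU else out  # always return a NumPy array


masks = np.zeros((1, 8, 8), dtype=np.float32)
masks[0, 2:5, 4:7] = 0.9  # a 3x3 blob covering rows 2-4 and columns 4-6
print(mask_centroids(masks))
```

For arrays this small, the host-to-device copy in xp.asarray and the copy back in xp.asnumpy can cost more than the arithmetic itself, which is consistent with the observation above that the CuPy conversion did not reduce latency.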


The desired latency for the inference pipeline, close to 12ms, was achieved with good precision by adopting the TensorRT optimizations and the CUDA-enhanced libraries available for the NVIDIA Jetson AGX Xavier.

This article is written by Aruna Sri T.
