Detecting Weeds Using YOLACTEdge Instance Segmentation for Smart Farming
Explore how YOLACTEdge delivers fast, accurate weed detection for robotic and automated agriculture systems.
December 16, 2025
13 minute read

This case study shows how YOLACTEdge enables real-time, field-ready weed detection by delivering instance segmentation at edge-device speeds. Through hardware-aware optimisation and temporal feature reuse, the approach achieves fast, reliable inference suitable for robotic and automated farming systems. The result is more precise weed control, reduced chemical usage and a practical foundation for scalable, climate-resilient smart agriculture.
Introduction
Smart farming depends on accurately distinguishing crops from weeds to enable automated weeding and precision interventions. While modern instance segmentation techniques provide pixel-level accuracy, real-world agricultural systems also demand high-speed inference. Models deployed on robotic platforms must operate in real time to support continuous decision-making in dynamic field conditions.
This article examines how the YOLACT family of instance segmentation models, particularly YOLACTEdge, balances accuracy and speed for weed detection in smart farming. Through architectural efficiency, hardware-aware optimisation and the reuse of temporal information in video streams, YOLACTEdge delivers real-time performance on edge devices. These capabilities not only improve automated weed control but also support climate-resilient farming by enabling rapid, field-scale data collection for AI-driven agricultural intelligence.
Real‑time Instance Segmentation In Agriculture
Early real‑time object detection relied on one‑stage detectors like Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLO). These algorithms treat detection as a simple regression problem that predicts class probabilities and bounding‑box coordinates in a single forward pass through a convolutional network. Because they avoid region proposal and per‑region processing, one‑stage methods are faster than two‑stage detectors, though they often trade off some accuracy.
Two‑stage detectors, on the other hand, employ a Region Proposal Network to generate candidate regions before classifying them and refining their bounding boxes. This separation improves accuracy but incurs additional computation, making two‑stage pipelines too slow for real‑time agricultural robots.

Fig.1. Architecture of a convolutional neural network with a single-stage detector
The architecture of a single‑stage detector is illustrated in Fig. 1. The model takes an input image, processes it through convolutional layers, and outputs class predictions and bounding boxes directly without a separate proposal stage. This streamlined approach offers the speed needed for field robotics.
The Yolact Family Of Models
To bridge the gap between speed and accuracy, the Yolact framework introduced a real‑time instance segmentation model that generates a dictionary of prototype masks over the entire image and then predicts a set of linear combination coefficients for each detected instance. The predicted masks are assembled by linearly combining the prototypes and cropping them with the corresponding bounding boxes. This design yields competitive accuracy while maintaining fast inference times.
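To make the assembly step concrete, the sketch below combines a set of prototype masks with per-instance coefficients and crops the result with the predicted boxes. It is a minimal NumPy illustration of the idea rather than Yolact's actual implementation; the array shapes and the `assemble_masks` helper are assumptions.

```python
import numpy as np

def assemble_masks(prototypes, coeffs, boxes, threshold=0.5):
    """Assemble instance masks from prototypes and mask coefficients.

    prototypes: (H, W, k) prototype masks produced by the Protonet.
    coeffs:     (n, k) linear-combination coefficients, one row per instance.
    boxes:      (n, 4) boxes as (x1, y1, x2, y2) in prototype-map pixels.
    """
    # Linear combination of prototypes followed by a sigmoid.
    masks = 1.0 / (1.0 + np.exp(-(prototypes @ coeffs.T)))  # (H, W, n)

    # Crop each mask with its bounding box, zeroing everything outside.
    for i, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
        cropped = np.zeros_like(masks[..., i])
        cropped[y1:y2, x1:x2] = masks[y1:y2, x1:x2, i]
        masks[..., i] = cropped

    # Binarize to obtain the final per-instance masks.
    return masks > threshold
```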
The architecture of Yolact comprises four stages: feature backbone, feature pyramid, prediction head and Protonet, as shown in Fig. 2. The backbone (often a ResNet) extracts hierarchical features; the feature pyramid network (FPN) upsamples and downsamples these features to create a multi‑scale representation; the prediction head outputs class, bounding box and mask coefficients; and the Protonet generates prototypes used to assemble final masks.
![Yolact architecture consists of four stages: feature backbone, feature pyramid, prediction head, and Protonet [1].](https://cmsnew.omdena.com/wp-content/uploads/2021/03/yolact3.png)
Fig.2. Yolact architecture consists of four stages: feature backbone, feature pyramid, prediction head, and Protonet
Despite these advances, further optimization was needed for deployment on resource‑constrained edge devices. YolactEdge adapts Yolact for video inputs and introduces systematic and algorithmic optimizations that increase speed fivefold while retaining comparable accuracy. The next sections explore how these improvements are achieved.
YolactEdge Model Architecture
YolactEdge is designed for video segmentation, operating on successive frames rather than individual images. It shares Yolact’s overall layout (backbone, FPN, prediction head and Protonet) but introduces optimizations at two levels:
- Systematic optimization using the TensorRT inference engine: TensorRT is NVIDIA’s deep learning optimizer that converts floating‑point weights (FP32) to reduced‑precision formats such as INT8 and FP16. This quantization dramatically accelerates inference while preserving accuracy; a minimal conversion sketch is given after this list.
- Algorithmic optimization by exploiting temporal redundancy in video: successive frames often share high‑level features, especially at coarser scales. Instead of recomputing the full feature hierarchy for every frame, YolactEdge computes expensive features (C4 and C5) only on keyframes and reuses them for the following non‑keyframes. We detail this process below.
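As a rough illustration of the first point, the snippet below converts a trained PyTorch model to FP16 and INT8 TensorRT engines with the torch2trt wrapper, which the YolactEdge repository builds on. `load_trained_model` and `calibration_images` are placeholders, and the repository's own conversion entry points differ; treat this as a sketch of the quantization step, not the project's code.

```python
import torch
from torch2trt import torch2trt  # NVIDIA's PyTorch-to-TensorRT converter

# Placeholders: a trained segmentation model and a representative input
# at the 550x550 network resolution.
model = load_trained_model().cuda().eval()   # hypothetical loader
x = torch.randn(1, 3, 550, 550).cuda()

# FP16 engine: weights and activations run in half precision.
model_fp16 = torch2trt(model, [x], fp16_mode=True)

# INT8 engine: TensorRT additionally calibrates quantization ranges on a
# small set of real images (calibration_images is a hypothetical dataset).
model_int8 = torch2trt(model, [x], int8_mode=True,
                       int8_calib_dataset=calibration_images)

with torch.no_grad():
    outputs = model_fp16(x)  # same call signature as the original model
```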
Feature Extraction
The feature extraction component consists of a ResNet backbone and an FPN. ResNet progressively downsamples the input frame, producing feature maps C1–C5; the FPN then upsamples and downsamples these maps to generate multi‑scale feature maps P3–P7 that feed into the prediction head. ResNet layers C4 and C5 are the most computationally intensive, consuming over 40 % of the network’s resources.

Fig.3. Feature extraction part of Yolact network architecture.
Figure 3 illustrates the ResNet + FPN feature extractor: low‑level features are propagated through lateral connections to build the pyramid. The FPN feature maps serve as inputs for mask and bounding‑box prediction.
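The sketch below reproduces this C3–C5 to P3–P5 pattern with off-the-shelf torchvision components. YolactEdge uses its own ResNet/FPN implementation and additionally builds P6 and P7 by downsampling P5, so this is only an illustrative stand-in for the feature extractor.

```python
from collections import OrderedDict
import torch
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

# Backbone with random weights, used here only to trace feature shapes.
backbone = resnet50(weights=None).eval()

def extract_c3_c5(x):
    """Collect the C3-C5 feature maps from a standard ResNet-50."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c2 = backbone.layer1(x)
    c3 = backbone.layer2(c2)
    c4 = backbone.layer3(c3)
    c5 = backbone.layer4(c4)
    return OrderedDict(c3=c3, c4=c4, c5=c5)

# FPN: fuse C3-C5 into equal-width pyramid levels (P3-P5 here).
fpn = FeaturePyramidNetwork(in_channels_list=[512, 1024, 2048], out_channels=256)

with torch.no_grad():
    feats = extract_c3_c5(torch.randn(1, 3, 550, 550))
    pyramid = fpn(feats)  # dict of P3-P5, each with 256 channels
    print({name: tuple(p.shape) for name, p in pyramid.items()})
```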
Temporal Redundancy And Keyframes
Because adjacent video frames are highly similar, especially at higher feature levels, recomputing C4 and C5 for every frame is wasteful. YolactEdge designates certain frames as keyframes, on which it computes the full feature extraction. Between keyframes, it processes non‑keyframes by computing only the lighter C1–C3 layers and warping the high‑level feature maps P4 and P5 from the previous frame. This warping exploits the spatial transformation of objects across frames, enabling the reuse of expensive features. Keyframes are selected at fixed intervals (every k frames) rather than based on motion analysis, an area noted for future research.
Feature Warping
To transform features from a keyframe to a subsequent non‑keyframe, YolactEdge estimates the motion of objects using optical flow. A neural network inspired by FlowNet computes a 2‑D flow field between two frames. The flow field indicates how each pixel moves; using it, P4 and P5 from the keyframe are warped and interpolated to align with the non‑keyframe.
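The warping step itself can be expressed compactly with PyTorch's `grid_sample`. The function below is a generic backward-warping sketch; the flow convention and interpolation settings are assumptions, not the exact routine used in the YolactEdge code.

```python
import torch
import torch.nn.functional as F

def warp_features(feats, flow):
    """Warp a feature map with a dense 2-D flow field (backward warping).

    feats: (N, C, H, W) feature map from the previous keyframe.
    flow:  (N, 2, H, W) per-pixel displacement, channel 0 = dx, channel 1 = dy.
    """
    n, _, h, w = feats.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).to(dtype=feats.dtype, device=feats.device)
    # Shift each location by the predicted flow.
    coords = base.unsqueeze(0) + flow                      # (N, 2, H, W)
    # Normalise to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)       # (N, H, W, 2)
    return F.grid_sample(feats, grid, mode="bilinear", align_corners=True)
```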

Fig.4. YolactEdge network architecture.
Figure 4 depicts the overall YolactEdge network. On the left, the full feature extraction (blue) is computed for the previous keyframe. On the right, only low‑level features (light blue) are computed for the current non‑keyframe. The high‑level features (gray) are warped (yellow) using optical flow before they are combined with the current low‑level features and passed to the prediction head and Protonet for mask assembly.
An example of a flow field is shown in Fig. 5; the third image visualizes the magnitude and direction of movement between two input frames.

Fig.5. An example of a flow field
To reduce the overhead of computing optical flow, YolactEdge reuses the C1–C3 features produced by ResNet and feeds them through a small convolutional network to predict the flow field. The designers refer to this lightweight network as FeatFlowNet. Figure 6 contrasts the original FlowNetS architecture with FeatFlowNet: in FlowNetS the entire network processes the two images through a stack of convolutions, whereas FeatFlowNet leverages precomputed backbone features.

Fig.6. FlowNet structure consists of two parts: a) FlowNetS, b) FeatFlowNet
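The sketch below gives the flavour of such a lightweight flow head: a few convolutions over the concatenated C3 features of the keyframe and the current frame, producing a two-channel (dx, dy) field. The layer widths and depth are assumptions, not the exact FeatFlowNet configuration from the paper.

```python
import torch
import torch.nn as nn

class FeatFlowNetSketch(nn.Module):
    """Illustrative stand-in for FeatFlowNet.

    Predicts a 2-channel flow field from the concatenated C3 features of the
    previous keyframe and the current frame (C3 has 512 channels in ResNet).
    """
    def __init__(self, in_channels=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 2, 3, padding=1),  # (dx, dy) per location
        )

    def forward(self, c3_keyframe, c3_current):
        return self.net(torch.cat([c3_keyframe, c3_current], dim=1))
```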
By warping high‑level features and recomputing only the lightweight layers between keyframes, YolactEdge achieves real‑time performance on edge devices. On a Jetson AGX Xavier, it exceeds 30 frames per second, and on an RTX 2080 Ti it reaches 170 FPS.
Pseudocode Overview
Although the original article references pseudocode, it does not include a code listing. At a high level, the algorithm proceeds as follows (a minimal Python sketch is given after the list):
- For each input frame i, determine whether it is a keyframe.
- If it is a keyframe, compute all backbone (C1–C5) and FPN features (P3–P7).
- If it is a non‑keyframe, compute only the partial features (C1–C3), use optical flow to warp P4 and P5 from the previous keyframe, and then compute the remaining FPN layers using the warped features and the current C3.
- Pass the resulting feature maps to the prediction head and Protonet to generate mask coefficients and prototypes.
- Assemble the final instance masks by linearly combining prototypes with the mask coefficients and cropping them using the predicted bounding boxes.
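In the sketch below, all of the callables (`backbone_full`, `backbone_partial`, `fpn`, `fpn_partial`, `featflownet`, `warp`, `predict_head`, `protonet`, and `assemble_masks` from the earlier prototype sketch) are hypothetical stand-ins for the corresponding network components; only the control flow is meant to mirror the description above.

```python
def segment_video(frames, k=5):
    """YolactEdge-style inference loop over a video stream (illustrative only)."""
    prev_c3 = prev_p4 = prev_p5 = None
    for i, frame in enumerate(frames):
        if i % k == 0:
            # Keyframe: full backbone (C1-C5) and FPN (P3-P7).
            c3, c4, c5 = backbone_full(frame)
            p3, p4, p5, p6, p7 = fpn(c3, c4, c5)
            prev_c3, prev_p4, prev_p5 = c3, p4, p5
        else:
            # Non-keyframe: recompute only C1-C3, warp P4/P5 from the keyframe.
            c3 = backbone_partial(frame)
            flow = featflownet(prev_c3, c3)        # lightweight optical flow
            p4, p5 = warp(prev_p4, flow), warp(prev_p5, flow)
            p3, p6, p7 = fpn_partial(c3, p4, p5)   # remaining pyramid levels
        # Prediction head + Protonet, then linear mask assembly.
        classes, boxes, coeffs = predict_head(p3, p4, p5, p6, p7)
        prototypes = protonet(p3)
        yield classes, boxes, assemble_masks(prototypes, coeffs, boxes)
```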
Dataset Preparation
The authors trained YolactEdge on MS COCO and on a custom Weedbot dataset. The Weedbot dataset contained 750 images with a resolution of 3008 × 3008 pixels and corresponding JSON annotations. Because the data were collected in two batches with separate annotation files, the images and annotations were merged into a single dataset.
To prepare the images, the team rotated them using a simple image editor to ensure consistent orientation, then uploaded the adjusted images and corresponding COCO JSON annotations to verify that bounding boxes and masks were correctly rotated. They also increased the number of annotation classes using the CVAT tool, adding fine‑grained categories to better capture variation among plants. In total, 10 000 annotations were created within the 750 images.
For efficient training, the high‑resolution images were downsampled to the network input size of 550 × 550 pixels. A Python script resized and cropped the images, using carrot annotations as reference points; the resized coordinates were computed as new_coordinate = resize_ratio × old_coordinate. This preprocessing increased the number of training samples to 10 462 images and dramatically accelerated training.
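A simplified version of such a preprocessing step is sketched below: it resizes one image to the 550 × 550 input size and applies the same ratio to its COCO-style polygon and bounding-box coordinates. The cropping around carrot annotations used in the actual script is omitted, and `resize_image_and_annotations` is an illustrative helper, not the team's code.

```python
from PIL import Image

INPUT_SIZE = 550

def resize_image_and_annotations(img_path, anns, out_path):
    """Resize one image to INPUT_SIZE x INPUT_SIZE and rescale its annotations.

    anns is assumed to be a list of COCO annotation dicts with a 'segmentation'
    polygon list and a 'bbox' in [x, y, w, h] format.
    """
    img = Image.open(img_path)
    ratio_x = INPUT_SIZE / img.width
    ratio_y = INPUT_SIZE / img.height
    img.resize((INPUT_SIZE, INPUT_SIZE)).save(out_path)

    for ann in anns:
        # Polygon points alternate x, y; apply new_coordinate = ratio * old_coordinate.
        ann["segmentation"] = [
            [p * (ratio_x if i % 2 == 0 else ratio_y) for i, p in enumerate(poly)]
            for poly in ann["segmentation"]
        ]
        x, y, w, h = ann["bbox"]
        ann["bbox"] = [x * ratio_x, y * ratio_y, w * ratio_x, h * ratio_y]
    return anns
```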

Fig 7. Distribution of the different plant classes
Due to differences in annotation viewpoints, two additional classes (labelled Class 2–3 and Class 3–4) were introduced to handle border cases. Figure 7 shows the distribution of the five final classes, demonstrating that classes 3 and 4 are the most prevalent.
Results And Evaluation
Training Yolact++ And Baseline Results

Fig.8. The predicted results of Yolact++ using Weedbot data.
The Weedbot dataset was first used to train Yolact++. Installation, setup and configuration details are provided in the associated GitHub repository (https://github.com/dbolya/yolact). The model was trained with a ResNet50 backbone. The predicted instance segmentation masks on Weedbot data are illustrated in Fig. 8. Although Yolact++ achieved acceptable accuracy, its inference time was around 400 ms per image, far slower than the target of 12 ms.
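For reference, registering a custom dataset in the dbolya/yolact code base follows the pattern documented in its README: a dataset entry and a derived model configuration added to `data/config.py`. The paths and class names below are placeholders standing in for the Weedbot data, not the configuration actually used.

```python
# Added to data/config.py of the dbolya/yolact repository (paths and class
# names are placeholders for the Weedbot data).
weedbot_dataset = dataset_base.copy({
    'name': 'Weedbot',
    'train_images': './data/weedbot/train/images',
    'train_info':   './data/weedbot/train/annotations.json',
    'valid_images': './data/weedbot/valid/images',
    'valid_info':   './data/weedbot/valid/annotations.json',
    'has_gt': True,
    'class_names': ('carrot', 'class_2', 'class_2_3', 'class_3', 'class_3_4'),
})

yolact_resnet50_weedbot_config = yolact_resnet50_config.copy({
    'name': 'yolact_resnet50_weedbot',
    'dataset': weedbot_dataset,
    'num_classes': len(weedbot_dataset.class_names) + 1,  # +1 for background
})
```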
Transition to YolactEdge
To meet stringent speed requirements, the training was repeated using YolactEdge. The implementation, installation and documentation are available at https://github.com/haotian-liu/yolact_edge. YolactEdge exploits TensorRT for quantization into FP16 and INT8 precision modes and leverages the temporal redundancy optimizations described earlier.
Figure 9 compares the inference times of different precision modes. Converting the baseline PyTorch model (FP32) to FP16 yields a 1.5× speedup, while converting to INT8 achieves a 2.5–3× speedup, reducing latency from ~138 ms to about 59 ms. These improvements bring inference time closer to the 12 ms target and illustrate the benefits of hardware‑aware optimization.

Fig.9. Comparing different precision modes of TensorRT
Figure 10 presents an example of multi‑class YolactEdge training. Here the original 3008 × 3008 images were cropped to 1920 × 1200 and then resized to 550 × 550, enabling the network to process more examples per second. The figure shows dense weed and crop instances labelled with class‑specific colours and bounding boxes, demonstrating that YolactEdge maintains good segmentation quality even after aggressive downsampling.

Fig.10. YolactEdge training on multi-class — 3008 X 3008
Evaluation Metrics
To quantify performance, the authors used common metrics for object detection and segmentation: F1‑score, mean Intersection over Union (mIoU), precision, recall and True Detection Rate (TDR). These metrics are computed using the confusion matrix elements—true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN)—calculated between predictions and ground truth. The formulas are summarized in Fig. 11.

Fig.11. Equations for IoU, Recall, TDR and Precision
The F1‑score combines precision and recall into a single measure:

Equation for F1 Score
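Since the equation images are not reproduced here, the standard confusion-matrix forms of these metrics are given below. Note that the definition of TDR varies between papers; it is commonly taken as the fraction of ground-truth instances that are correctly detected, which coincides with recall.

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{IoU} = \frac{TP}{TP + FP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```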
Experimental Results
Evaluation on the Weedbot dataset showed that YolactEdge outperformed the baseline Yolact++ both in speed and (after quantization) in accuracy. Table 1 shows detailed evaluation results. Models executed on different inference engines (PyTorch, FP16, INT8) are compared in terms of mean IoU, F1‑score, precision and recall. Lower‑precision modes achieve similar or better accuracy while significantly reducing inference time.

Table 1. Validation results of YolactEdge with different precision modes and execution configurations
Another result table (Table 2) from the authors’ logs lists box and mask Average Precision (AP) at various IoU thresholds. The highest AP values occur around 0.55–0.65 IoU, illustrating the sensitivity of performance to the evaluation threshold.

Table 2. Box and mask average precision across IoU thresholds
Conclusion
This case study demonstrates how YolactEdge brings real‑time instance segmentation to smart farming by exploiting hardware‑aware quantization and temporal redundancy. In experiments on the Weedbot dataset, YolactEdge achieved a 5× speedup over the baseline Yolact architecture while maintaining competitive accuracy. Quantization to FP16 or INT8 precision using TensorRT further reduced inference time, reaching 58 ms per image. Such performance is crucial not only for automated weeding but also for time‑critical tasks like weather forecasting for agriculture and AI weather prediction, where rapid processing of field imagery enables farmers to adapt to changing conditions and improve climate‑resilient farming strategies.
Several avenues can enhance these results:
- Temporal redundancy experiments: exploring different keyframe intervals or adaptive keyframe selection based on motion blur could yield further speedups.
- Training with varied data: expanding the dataset and experimenting with different image resolutions may improve mAP scores.
- Low‑level implementation: rewriting inference in C++ instead of Python could deliver additional latency reductions, similar to improvements seen in other frameworks.
Ultimately, integrating fast, accurate weed detection into a broader smart‑agriculture pipeline, including precision spraying, crop health monitoring and weather‑aware decision support, will accelerate the adoption of climate‑resilient farming practices. By combining instance segmentation with AI weather prediction and weather forecasting for agriculture, farmers can make timely interventions that reduce pesticide use, improve yields and adapt to changing climate conditions.



