Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics
Foundation-model pipelines that combine open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings have advanced precision livestock farming (PLF). One major challenge remains: the GPU memory requirements of these models often exceed what standard edge accelerators provide. A new study addresses this by distilling the 446M-parameter Perception Encoder (PE-ViT-L+) of the SAM 3 framework into a 40.66M-parameter student model suitable for edge deployment.
Distillation Mechanisms
The distillation relies on three mechanisms:
- Feature Pyramid Network Student Encoder: A student encoder built on the TinyViT-21M-512 architecture, with a feature pyramid network that supplies multi-scale features.
- Four-Term Direction-Then-Scale Distillation Loss: A four-term loss that, as its name suggests, aligns the direction of the student's features with the teacher's before matching their scale.
- Backbone-Substitution Inference: The student backbone replaces the teacher's at inference time, and sliding-window session pruning keeps GPU memory usage bounded during deployment.
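The summary above does not spell out the four loss terms, so the following is only a minimal sketch of the general direction-then-scale idea, not the study's loss. It assumes two terms on a single feature vector: a direction term (one minus cosine similarity) and a scale term (squared difference of L2 norms). The function name `direction_then_scale_loss` and the `scale_weight` parameter are hypothetical.

```python
import math


def direction_then_scale_loss(student, teacher, scale_weight=0.1, eps=1e-8):
    """Toy two-term loss on one pair of feature vectors (lists of floats).

    ASSUMPTION: this is an illustrative decomposition, not the four-term
    loss from the study.
      direction term: 1 - cosine similarity (insensitive to magnitude)
      scale term:     squared difference of L2 norms (insensitive to direction)
    Returns (total, direction, scale).
    """
    dot = sum(s * t for s, t in zip(student, teacher))
    s_norm = math.sqrt(sum(s * s for s in student))
    t_norm = math.sqrt(sum(t * t for t in teacher))
    direction = 1.0 - dot / (s_norm * t_norm + eps)
    scale = (s_norm - t_norm) ** 2
    total = direction + scale_weight * scale
    return total, direction, scale


# A student vector that points the right way but is twice as long
# is penalized only by the scale term.
total, direction, scale = direction_then_scale_loss([2.0, 0.0], [1.0, 0.0])
print(f"direction={direction:.4f} scale={scale:.4f} total={total:.4f}")
```

Separating the two terms lets a training schedule weight direction first and raise `scale_weight` later, which is one plausible reading of "direction-then-scale".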
DINOv3 Integration
The pipeline also draws on the DINOv3 model family, specifically the pre-distilled ViT-S/16 variant with 21.6M parameters, which was itself distilled from the much larger 6,716M-parameter ViT-7B teacher. The ViT-S model serves as the embedder for individual animals, supporting monitoring of each animal's behavior.
Performance Metrics
On the Edinburgh Pig dataset, the compressed pipeline reported the following results:
- MOTA: 92.29%, close to the SAM 3 teacher.
- IDF1: 96.15%, indicating that identities are largely preserved across frames.
- System-Level Parameter Reduction: 7.77-fold fewer parameters than the original pipeline.
- Peak VRAM Usage: Reduced from 19.52 GB to 6.49 GB, roughly a threefold reduction.
- Behavior Classification: 97.34% top-1 accuracy and 91.67% macro-F1 across nine classes of pig behavior.
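For readers unfamiliar with the metrics above, here is a short sketch of their standard definitions. The counts in the usage lines are invented for illustration and are not taken from the study. The two-class example shows why macro-F1 can sit below top-1 accuracy: macro-F1 averages per-class scores, so a rare, harder class pulls it down even when overall accuracy stays high.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """ID F1 score: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1; input is a list of (tp, fp, fn)."""
    scores = []
    for tp, fp, fn in per_class_counts:
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical counts, for illustration only.
print(f"MOTA = {mota(50, 20, 7, 1000):.4f}")
print(f"IDF1 = {idf1(90, 5, 5):.4f}")

# One frequent class and one rare class (1000 samples in total).
counts = [(950, 10, 5), (35, 5, 10)]
accuracy = sum(tp for tp, _, _ in counts) / 1000
print(f"accuracy = {accuracy:.4f}, macro-F1 = {macro_f1(counts):.4f}")
```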
Edge Compatibility and Future Implications
The distilled pipeline is designed to fit within an NVIDIA Jetson Orin NX 16GB system with 4.9 GB of headroom. That headroom supports a proposed, but not yet empirically validated, on-device embedding-pool re-identification mechanism. The mechanism is projected to build a longitudinal visual record with a footprint of approximately 94 MB per animal per year. Such a record could support retrospective analyses of disease, lameness, reproductive issues, and growth outcomes.
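To give a feel for what a 94 MB yearly budget could hold, the sketch below does some back-of-envelope arithmetic. It rests on assumptions that are not stated in the summary: each stored record is a single 384-dimensional embedding (the ViT-S feature width) kept as float32, 1 MB is 10^6 bytes, and metadata overhead is ignored. The study's actual sampling rate and storage format may differ, and both function names are hypothetical.

```python
EMBED_DIM = 384        # ASSUMPTION: one ViT-S embedding per stored record
BYTES_PER_VALUE = 4    # ASSUMPTION: float32 storage, no compression
BYTES_PER_MB = 1_000_000


def embeddings_per_year(budget_mb, dim=EMBED_DIM, bytes_per_value=BYTES_PER_VALUE):
    """How many whole embeddings fit in a yearly per-animal budget."""
    bytes_per_embedding = dim * bytes_per_value
    return int(budget_mb * BYTES_PER_MB) // bytes_per_embedding


def yearly_footprint_mb(embeddings_per_day, dim=EMBED_DIM,
                        bytes_per_value=BYTES_PER_VALUE, days=365):
    """Yearly storage in MB for a given daily embedding rate."""
    return embeddings_per_day * days * dim * bytes_per_value / BYTES_PER_MB


per_year = embeddings_per_year(94)
print(f"{per_year} embeddings/year, about {per_year / 365:.0f} per day")
print(f"168 embeddings/day -> {yearly_footprint_mb(168):.2f} MB/year")
```

Under these assumptions the budget corresponds to a few embeddings per hour per animal, which suggests a sparse pool rather than a frame-by-frame log.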
In summary, distilling SAM 3 and DINOv3 improves the feasibility of individual-level livestock monitoring on edge devices and opens the way to longitudinal visual analytics in precision agriculture.
