Securing workers and workspaces: Contextual privacy for vision-based ergonomics
Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.cviu.2026.104675
Sander De Coninck, Emilio Gamba, Bart Van Doninck, Abdellatif Bey-Temsamani, Thorsten Cardoen, Sam Leroux, Pieter Simoens
Multi-camera computer vision in industry offers advantages but poses risks to worker privacy and intellectual property through exposure of sensitive contextual information. Existing privacy methods often inadequately protect background details crucial in manufacturing. This issue is prominent in applications like automated ergonomic assessment, where visual data for posture analysis can reveal sensitive workplace information. We propose a system for simultaneous personal privacy and enhanced contextual intellectual property protection, featuring a novel probabilistic obfuscation technique. Our edge-based Generative Adversarial Privacy system employs a modified obfuscator that learns to inject controlled, pixel-wise random noise, particularly into non-critical background regions. This more effectively obscures IP-sensitive environmental details before data transmission for central analysis (e.g., pose estimation). Our approach, validated in a multi-camera ergonomic study, effectively protects worker privacy and contextual IP, as evaluated with quantitative metrics, and maintains 3D pose accuracy for reliable ergonomic assessment. This work provides a solution for deploying vision systems in sensitive industrial settings by holistically addressing privacy requirements through an advanced, adaptive obfuscation strategy.
{"title":"Securing workers and workspaces: Contextual privacy for vision-based ergonomics","authors":"Sander De Coninck , Emilio Gamba , Bart Van Doninck , Abdellatif Bey-Temsamani , Thorsten Cardoen , Sam Leroux , Pieter Simoens","doi":"10.1016/j.cviu.2026.104675","DOIUrl":"10.1016/j.cviu.2026.104675","url":null,"abstract":"<div><div>Multi-camera computer vision in industry offers advantages but poses risks to worker privacy and intellectual property through exposure of sensitive contextual information. Existing privacy methods often inadequately protect background details crucial in manufacturing. This issue is prominent in applications like automated ergonomic assessment, where visual data for posture analysis can reveal sensitive workplace information. We propose a system for simultaneous personal privacy and enhanced contextual intellectual property protection, featuring a novel probabilistic obfuscation technique. Our edge-based Generative Adversarial Privacy system employs a modified obfuscator that learns to inject controlled, pixel-wise random noise, particularly into non-critical background regions. This more effectively obscures IP-sensitive environmental details before data transmission for central analysis (e.g., pose estimation). Our approach, validated in a multi-camera ergonomic study, effectively protects worker privacy and contextual IP (metrics-evaluated) and maintains 3D pose accuracy for reliable ergonomic assessment. This work provides a solution for deploying vision systems in sensitive industrial settings by holistically addressing privacy requirements through an advanced, adaptive obfuscation strategy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104675"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances of local mechanisms in vision foundation models: A survey and outlook
Pub Date: 2026-03-01 | Epub Date: 2026-02-18 | DOI: 10.1016/j.cviu.2026.104679
Qiangchang Wang, Jing Li, Yilong Yin, Huimin Lu
Vision foundation models (VFMs) play a crucial role in complex, real-world computer vision applications. Key developments such as Vision Transformers, self-supervised learning, and multi-modal learning have significantly contributed to the advancement of VFMs. However, these models face challenges, including high resource demands, weak locality, and a lack of generalized features. Drawing inspiration from the human brain’s ability to emphasize important information while suppressing irrelevant input, local mechanisms have been designed to address these challenges and enhance VFM development. Local mechanisms not only focus on specific parts of an input to learn discriminative, fine-grained features but also selectively process information, thereby improving model efficiency. These mechanisms exhibit unique characteristics across different domains. In this survey, we provide a systematic review of local mechanisms across three key areas of VFMs: multi-modal learning, self-supervised learning, and Vision Transformers. We categorize the various local mechanisms, summarize quantitative results in each area, and analyze the advantages and disadvantages of each category in depth, offering insights for further exploration. We examine the potential of VFMs in semi-supervised learning, weakly supervised learning, domain adaptation, and few-shot learning, highlighting their specific applications in remote sensing and medical image analysis. Finally, we discuss potential research directions related to local mechanisms that could inspire future work. To the best of our knowledge, this is the first comprehensive survey on local mechanisms across different fields. We hope this review will provide valuable insights and guide future research in VFMs.
{"title":"Recent advances of local mechanisms in vision foundation models: A survey and outlook","authors":"Qiangchang Wang , Jing Li , Yilong Yin , Huimin Lu","doi":"10.1016/j.cviu.2026.104679","DOIUrl":"10.1016/j.cviu.2026.104679","url":null,"abstract":"<div><div>Vision foundation models (VFMs) play a crucial role in complex, real-world computer vision applications. Key developments such as Vision Transformers, self-supervised learning, and multi-modal learning have significantly contributed to the advancement of VFMs. However, these models face challenges, including high resource demands, weak locality, and a lack of generalized features. Drawing inspiration from the human brain’s ability to emphasize important information while suppressing irrelevant input, local mechanisms have been designed to address these challenges and enhance VFMs development. Local mechanisms not only focus on specific parts of an input to learn discriminative, fine-grained features but also selectively process information, thereby improving model efficiency. These mechanisms exhibit unique characteristics across different domains. In this survey, we provide a systematic review of local mechanisms across three key areas of VFMs: multi-modal learning, self-supervised learning, and Vision Transformers. We categorize the various local mechanisms, summarize quantitative results in each area, and analyze the advantages and disadvantages of each category in depth, offering insights for further exploration. We examine the potential of VFMs in semi-supervised learning, weakly supervised learning, domain adaptation, and few-shot learning, highlighting their specific applications in remote sensing and medical image analysis. Finally, we discuss potential research directions related to local mechanisms that could inspire future work. To the best of our knowledge, this is the first comprehensive survey on local mechanisms across different fields. We hope this review will provide valuable insights and guide future research in VFMs.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104679"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147422113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile Auslan: A multimodal dialogue-centered sign language learning system
Pub Date: 2026-03-01 | Epub Date: 2026-02-03 | DOI: 10.1016/j.cviu.2026.104646
Hongwei Sheng, Xin Shen, Heming Du, Xin Yu
Learning sign language is not only a gateway to linguistic competence but also to cultural participation, self-expression, and meaningful interaction within the Deaf community. Recent advances in sign language education technologies have made notable progress in supporting vocabulary acquisition and sentence-level translation. However, dialogue, as one critical component of natural language use, remains largely absent from existing systems. We identify two underlying challenges that contribute to this gap: the lack of recognition robustness to viewpoint variation, which constrains expressive freedom during signing; and the limited semantic modeling capabilities needed to support discourse-level interpretation and interaction. To address these issues, we propose a learning-centered system that integrates pose-based multi-view augmentation and a multi-agent language modeling workflow. This system supports free-form input across diverse signing perspectives and provides structured feedback across word, sentence, and dialogue levels. Built as a modular and deployable platform, the system demonstrates strong recognition performance and learner engagement across varied input conditions. Through this integration of spatial robustness and semantic scaffolding, our work advances the design of sign language learning technologies toward more interactive, expressive, and pedagogically grounded experiences. However, the current system remains limited by its focus on successive (non-continuous) signing and by its moderate vocabulary and restricted pedagogical scope. Future work will therefore extend lexical coverage, incorporate more natural continuous signing, and investigate richer educational functions.
{"title":"Mobile Auslan: A multimodal dialogue-centered sign language learning system","authors":"Hongwei Sheng, Xin Shen, Heming Du, Xin Yu","doi":"10.1016/j.cviu.2026.104646","DOIUrl":"10.1016/j.cviu.2026.104646","url":null,"abstract":"<div><div>Learning sign language is not only a gateway to linguistic competence but also to cultural participation, self-expression, and meaningful interaction within the Deaf community. Recent advances in sign language education technologies have made notable progress in supporting vocabulary acquisition and sentence-level translation. However, dialogue, as one critical component of natural language use, remains largely absent from existing systems. We identify two underlying challenges that contribute to this gap: the lack of recognition robustness to viewpoint variation, which constrains expressive freedom during signing; and the limited semantic modeling capabilities needed to support discourse-level interpretation and interaction. To address these issues, we propose a learning-centered system that integrates pose-based multi-view augmentation and a multi-agent language modeling workflow. This system supports free-form input across diverse signing perspectives and provides structured feedback across word, sentence, and dialogue levels. Built as a modular and deployable platform, the system demonstrates strong recognition performance and learner engagement across varied input conditions. Through this integration of spatial robustness and semantic scaffolding, our work advances the design of sign language learning technologies toward more interactive, expressive, and pedagogically grounded experiences. However, the current system remains limited by its focus on successive (non-continuous) signing and by its moderate vocabulary and restricted pedagogical scope. Future work will therefore extend lexical coverage, incorporate more natural continuous signing, and investigate richer educational functions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104646"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FDE: A Frequency-Domain Enhancement method for object detection in complex traffic scenes
Pub Date: 2026-03-01 | DOI: 10.1016/j.cviu.2026.104677
Wenyan Sun, Feifei Xie, Yuxuan Zhang, Liangrui Wei, Fuzheng Chu, Xiaoyu Tang
Object detectors in complex traffic scenes often struggle with small objects, cluttered boundaries, and heavy occlusions due to a low-frequency bias, where models capture smooth backgrounds but overlook high-frequency details. To address this, we propose a Frequency-Domain Enhancement (FDE) framework that integrates three modules: (i) Frequency-Adaptive Attention (FAA) to dynamically emphasize informative frequency components, (ii) Enhanced Frequency Dynamic Convolution (EFDConv) to adaptively model diverse spectral patterns, and (iii) Frequency-Aware Supervision (FAS) to guide training towards edge and fine-structure preservation. Embedded into RT-DETR, FDE achieves consistent gains of +1.7 mAP on the KITTI dataset and +1.9 mAP on the COCO dataset, with negligible computational overhead. These results demonstrate that FDE effectively alleviates low-frequency bias and improves detection of small and adjacent objects, making it suitable for deployment in complex traffic environments.
{"title":"FDE: A Frequency-Domain Enhancement method for object detection in complex traffic scenes","authors":"Wenyan Sun , Feifei Xie , Yuxuan Zhang , Liangrui Wei , Fuzheng Chu , Xiaoyu Tang","doi":"10.1016/j.cviu.2026.104677","DOIUrl":"10.1016/j.cviu.2026.104677","url":null,"abstract":"<div><div>Object detectors in complex traffic scenes often struggle with small objects, cluttered boundaries, and heavy occlusions due to a low-frequency bias, where models capture smooth backgrounds but overlook high-frequency details. To address this, we propose a Frequency-Domain Enhancement (FDE) framework that integrates three modules: (i) Frequency-Adaptive Attention (FAA) to dynamically emphasize informative frequency components, (ii) Enhanced Frequency Dynamic Convolution (EFDConv) to adaptively model diverse spectral patterns, and (iii) Frequency-Aware Supervision (FAS) to guide training towards edge and fine-structure preservation. Embedded into RT-DETR, FDE achieves consistent gains of +1.7 mAP on KITTI dataset and +1.9 mAP on COCO dataset, with negligible computational overhead. These results demonstrate that FDE effectively alleviates low-frequency bias and improves detection of small and adjacent objects, making it suitable for deployment in complex traffic environments.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104677"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grey wolf optimization for color quantization
Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.cviu.2026.104659
María-Luisa Pérez-Delgado, Jesús-Ángel Román-Gallego, M. Emre Celebi
Color quantization is an image processing operation that attempts to reduce the number of distinct colors used to represent an image without significant loss of quality. This operation is useful as an initial step to perform further processing on the image. The interest in this operation has led to several solution methods being proposed over the years. Among these, several swarm-based methods have been applied to the problem in recent years. This article discusses the application of one of these methods, called grey wolf optimization, to perform color quantization. This algorithm has generated good results when applied to a variety of optimization problems. Among the features that differentiate this algorithm from other swarm algorithms are its ability to converge toward better solutions, its speed, and the existence of a single control parameter. The article describes in detail how the grey wolf optimization method should be adapted to perform color quantization, so that it generates a quantized palette that allows the quantized image to be represented. In this case, each individual in the group represents a quantized palette, which is improved with the information provided by the group. The detailed description of the algorithm is complemented by an extensive testing section that compares the results of the proposed method to those of 16 other techniques. The results, based on the comparison of MSE, MAE, PSNR, SSIM, and runtime, show that grey wolf optimization can generate good-quality images, better than most of the compared methods.
{"title":"Grey wolf optimization for color quantization","authors":"María-Luisa Pérez-Delgado , Jesús-Ángel Román-Gallego , M. Emre Celebi","doi":"10.1016/j.cviu.2026.104659","DOIUrl":"10.1016/j.cviu.2026.104659","url":null,"abstract":"<div><div>Color quantization is an image processing operation that attempts to reduce the number of distinct colors used to represent an image without significant loss of quality. This operation is useful as an initial step to perform further processing on the image. The interest in this operation has led to several solution methods being proposed over the years. Within these methods several swarm-based methods have been used in recent years to solve the problem. This article discusses the application of one of these methods, called grey wolf optimization, to perform color quantization. This algorithm has generated good results when applied to a variety of optimization problems. Among the features that differentiate this algorithm from other swarm algorithms are its ability to converge toward better solutions, its speed, and the existence of a single control parameter. The article describes in detail how the grey wolf optimization method should be adapted to perform color quantization, so that it generates a quantized palette that allows the quantized image to be represented. In this case, each individual in the group represents a quantized palette, which is improved with the information provided by the group. The detailed description of the algorithm is complemented by an extensive testing section that compares the results of the proposed method to those of 16 others techniques. The results, based on the comparison of MSE, MAE, PSNR, SSIM, and runtime, show that grey wolf optimization can generate good quality images, better than most of the compared methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104659"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced local homogenization and reconstruction network for few-shot fine-grained image classification
Pub Date: 2026-03-01 | Epub Date: 2026-02-06 | DOI: 10.1016/j.cviu.2026.104686
Meiyin Hu, Huan Wan, Hui Wang, Jianfeng Xu, Xin Wei
Few-shot fine-grained image classification is a challenging task as it requires the ability to detect minor variations across subclasses. Metric-based methods are among the most popular approaches to this task, but they suffer from spatial information loss and local feature misalignment. Feature reconstruction-based methods can partially mitigate these two problems and exhibit substantial potential for fine-grained image classification. However, we found and verified that feature reconstruction-based methods have two unresolved difficulties: inadequate extraction of discriminative local information, and insufficient fitting capability when reconstructing numerous support set features with limited query set features. These difficulties are key obstacles to further improving the performance of few-shot fine-grained image classification. Therefore, we propose the enhanced local homogenization and reconstruction network (ELHRN) for few-shot fine-grained image classification. The proposed method includes three main modules: the local homogenization and distinction module (HDM) to learn more subtle discriminative local features; the double-layer cross-reconstruction module (DCM) to increase the hierarchy and complexity of feature reconstruction, effectively enhancing feature diversity; and the branch weighting module (BWM) to adjust the weights of mutual reconstruction between the support set and query set, thereby mitigating the issue of insufficient fitting capability when reconstructing numerous features with a limited number of features. Extensive experiments are conducted on five benchmark fine-grained datasets, and the results demonstrate that ELHRN outperforms state-of-the-art methods. The code is available at https://github.com/sausage0611/ELHRN.
{"title":"Enhanced local homogenization and reconstruction network for few-shot fine-grained image classification","authors":"Meiyin Hu , Huan Wan , Hui Wang , Jianfeng Xu , Xin Wei","doi":"10.1016/j.cviu.2026.104686","DOIUrl":"10.1016/j.cviu.2026.104686","url":null,"abstract":"<div><div>Few-shot fine-grained image classification is a challenging task as it requires the ability to detect minor variations across subclasses. Metric-based methods are some of the most popular methods for dealing with this issue, but they have the problems of spatial information loss and local feature misalignment. Feature reconstruction-based methods can partially mitigate these two problems and exhibit substantial potential for fine-grained image classification. However, we found and verified that feature reconstruction-based methods have two unresolved difficulties, including inadequate extraction of discriminative local information and insufficient fitting capability when reconstructing numerous support set features with limited query set features. These difficulties are key obstacles to further improving the performance of few-shot fine-grained image classification. Therefore, we propose the enhanced local homogenization and reconstruction network (ELHRN) for few-shot fine-grained image classification. The proposed method includes three main modules: the local homogenization and distinction module (HDM) to learn more subtle discriminative local features; the double-layer cross-reconstruction module (DCM) to increase the hierarchy and complexity of feature reconstruction, effectively enhancing feature diversity; and the branch weighting module (BWM) to adjust the weights of mutual reconstruction between the support set and query set, thereby mitigating the issue of insufficient fitting capability when reconstructing numerous features with a limited number of features. Extensive experiments are conducted on five benchmark fine-grained datasets, and the results demonstrate that ELHRN outperforms state-of-the-art methods. The code is available at <span><span>https://github.com/sausage0611/ELHRN</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104686"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slope-Track: Multiple Object Tracking on Ski Slopes
Pub Date: 2026-02-01 | DOI: 10.1016/j.cviu.2026.104663
M’Saydez Campbell, Christophe Ducottet, Damien Muselet, Rémi Emonet
In this paper, we introduce Slope-Track, a novel multiple object tracking (MOT) dataset designed to reflect the complexities of real ski slope environments. The dataset has over 96,000 frames collected from 10 different ski resorts under various weather and visibility conditions. Slope-Track addresses significant challenges in slope monitoring, including small object sizes, occlusions, fast and irregular motion, and low appearance consistency. It is densely annotated with bounding boxes and object identities, facilitating the evaluation of detection and tracking algorithms. We analyze the dataset’s characteristics, comparing it to existing MOT datasets. The results demonstrate that Slope-Track encapsulates a combination of challenges found in other datasets. Additionally, we benchmark a range of existing tracking algorithms and propose a new module that improves motion-based association by dealing with the specific shape of trajectories along ski slopes. Our results demonstrate that incorporating appearance features can have a mixed impact, depending on how they are used within each tracking algorithm. In contrast, motion-based methods and spatial association strategies show more reliable performance. Overall, we provide a challenging benchmark for evaluating and improving multi-object tracking systems in real-world outdoor environments. The dataset and code can be found at https://slopetrack.github.io/.
{"title":"Slope-Track: Multiple Object Tracking on Ski Slopes","authors":"M’Saydez Campbell , Christophe Ducottet , Damien Muselet , Rémi Emonet","doi":"10.1016/j.cviu.2026.104663","DOIUrl":"10.1016/j.cviu.2026.104663","url":null,"abstract":"<div><div>In this paper, we introduce Slope-Track. Slope-Track is a novel multiple object tracking (MOT) dataset designed to reflect the complexities of real ski slope environments. The dataset has over 96,000 frames collected from 10 different ski resorts under various weather and visibility conditions. Slope-Track addresses significant challenges in slope monitoring, including small object sizes, occlusions, fast and irregular motion, and low appearance consistency. It is densely annotated with bounding boxes and object identities, facilitating the evaluation of detection and tracking algorithms. We analyze the dataset’s characteristics comparing it to the existing MOT datasets. The results demonstrate that Slope-Track encapsulates a combination of challenges found in other datasets. Additionally, we benchmark a range of existing tracking algorithms and propose a new module that improves motion-based association by dealing with the specific shape of trajectories along ski slopes. Our results demonstrate that incorporating appearance features can have a mixed impact, depending on how they are used within each tracking algorithm. In contrast, motion-based methods and spatial association strategies show more reliable performance. Overall, we provide a challenging benchmark for evaluating and improving multi-object tracking systems in real-world outdoor environments. The dataset and code can be found at <span><span>https://slopetrack.github.io/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104663"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146078310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UAVDet: A CNN–Mamba hybrid network for efficient small object detection in UAV imagery
Pub Date: 2026-02-01 | Epub Date: 2026-01-07 | DOI: 10.1016/j.cviu.2026.104637
Yiming Yang, Feng Guo, Pei Niu
Real-time object detection is pivotal in traffic-related Unmanned Aerial Vehicles (UAV) applications. However, UAV imagery presents significant challenges due to the predominance of small objects and complex backgrounds. Traditional backbones generally perform aggressive early-stage downsampling, causing the loss of fine-grained features. To address these issues, we propose UAVDet, a real-time detection model that combines Convolutional Neural Network (CNN) and Mamba architectures. First, we revisit the conventional backbone design by reconfiguring its depth and width, with a focus on preserving fine-grained details crucial for small object detection. Second, we propose the Cross Stage Partial Mamba (CSPMB) module, which integrates the Mamba structure into the CNN framework to enhance global feature representation and improve robustness against complex background interference. Third, we design Tiny-focused Feature Pyramid Network (TFPN) by rebalancing the feature fusion flow and replacing the large-object detection head with a tiny-object detection head, which significantly improves the perception of small objects. Comprehensive experiments on the VisDrone dataset show that our method improves AP and AP_S (small-object AP) by 4.5% and 5.0%, respectively, while reducing parameters by 84.9% compared to the baseline. It also reaches 53 FPS on an RTX 4090, exceeding the 30 FPS real-time threshold. Additional evaluations on UAVDT and DroneVehicle further verify the method’s robust generalization. These results indicate the effectiveness of the developed method in UAV image detection.
{"title":"UAVDet: A CNN–Mamba hybrid network for efficient small object detection in UAV imagery","authors":"Yiming Yang, Feng Guo, Pei Niu","doi":"10.1016/j.cviu.2026.104637","DOIUrl":"10.1016/j.cviu.2026.104637","url":null,"abstract":"<div><div>Real-time object detection is pivotal in traffic-related Unmanned Aerial Vehicles (UAV) applications. However, UAV imagery presents significant challenges due to the predominance of small objects and complex backgrounds. Traditional backbones generally perform aggressive early-stage downsampling, causing the loss of fine-grained features. To address these issues, we propose UAVDet, a real-time detection model that combines Convolutional Neural Network (CNN) and Mamba architectures. First, we revisit the conventional backbone design by reconfiguring its depth and width, with a focus on preserving fine-grained details crucial for small object detection. Second, we propose the Cross Stage Partial Mamba (CSPMB) module, which integrates the Mamba structure into the CNN framework to enhance global feature representation and improve robustness against complex background interference. Third, we design Tiny-focused Feature Pyramid Network (TFPN) by rebalancing the feature fusion flow and replacing the large-object detection head with a tiny-object detection head, which significantly improves the perception of small objects. Comprehensive experiments on the VisDrone dataset show that our method improves AP and AP<span><math><msub><mrow></mrow><mrow><mi>S</mi></mrow></msub></math></span> by 4.5% and 5.0%, respectively, while reducing parameters by 84.9% compared to the baseline. It also reaches 53 FPS on an RTX 4090, exceeding the 30 FPS real-time threshold. Additional evaluations on UAVDT and DroneVehicle further verify the method’s robust generalization. These results indicate the effectiveness of the developed method in UAV image detection.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104637"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments
Pub Date: 2026-02-01 | Epub Date: 2026-01-06 | DOI: 10.1016/j.cviu.2025.104629
Ofer Idan, Yoli Shavit, Yosi Keller
Relative pose regressors (RPRs) determine the pose of a query image by estimating its relative translation and rotation to a reference pose-labeled camera. Unlike other regression-based localization techniques confined to a scene’s absolute parameters, RPRs learn residuals, making them adaptable to new environments. However, RPRs have exhibited limited generalization to scenes not utilized during training (“unseen scenes”). In this work, we explore the ability of RPRs to localize in unseen scenes and propose algorithmic modifications to enhance their generalization. These modifications include attention-based aggregation of coarse feature maps, dynamic adaptation of model weights, and geometry-aware optimization. Our proposed approach improves the localization accuracy of RPRs in unseen scenes by a notable margin across multiple indoor and outdoor benchmarks and under various conditions while maintaining comparable performance in scenes used during training. We assess the contribution of each component through ablation studies and further analyze the uncertainty of our model in unseen scenes. Our code and pre-trained models are available at https://github.com/yolish/relformer.
{"title":"Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments","authors":"Ofer Idan, Yoli Shavit, Yosi Keller","doi":"10.1016/j.cviu.2025.104629","DOIUrl":"10.1016/j.cviu.2025.104629","url":null,"abstract":"<div><div>Relative pose regressors (RPRs) determine the pose of a query image by estimating its relative translation and rotation to a reference pose-labeled camera. Unlike other regression-based localization techniques confined to a scene’s absolute parameters, RPRs learn residuals, making them adaptable to new environments. However, RPRs have exhibited limited generalization to scenes not utilized during training (“unseen scenes”). In this work, we explore the ability of RPRs to localize in unseen scenes and propose algorithmic modifications to enhance their generalization. These modifications include attention-based aggregation of coarse feature maps, dynamic adaptation of model weights, and geometry-aware optimization. Our proposed approach improves the localization accuracy of RPRs in unseen scenes by a notable margin across multiple indoor and outdoor benchmarks and under various conditions while maintaining comparable performance in scenes used during training. We assess the contribution of each component through ablation studies and further analyze the uncertainty of our model in unseen scenes. Our Code and pre-trained models are available at <span><span>https://github.com/yolish/relformer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104629"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support
Pub Date: 2026-02-01 | Epub Date: 2025-12-23 | DOI: 10.1016/j.cviu.2025.104621
Edgar R. Guzman, Letizia Gionfrida, Robert D. Howe
This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of 5.77 cm, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of 1.20 ± 0.49 cm in height and 1.35 ± 0.45 cm in depth for ascending stairs, and 1.28 ± 0.55 cm in height and 1.47 ± 0.65 cm in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.
{"title":"RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support","authors":"Edgar R. Guzman , Letizia Gionfrida , Robert D. Howe","doi":"10.1016/j.cviu.2025.104621","DOIUrl":"10.1016/j.cviu.2025.104621","url":null,"abstract":"<div><div>This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of <span><math><mrow><mn>5</mn><mo>.</mo><mn>77</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span>, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of <span><math><mrow><mn>1</mn><mo>.</mo><mn>20</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>49</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in height and <span><math><mrow><mn>1</mn><mo>.</mo><mn>35</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>45</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in depth for ascending stairs, and <span><math><mrow><mn>1</mn><mo>.</mo><mn>28</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>55</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in height and <span><math><mrow><mn>1</mn><mo>.</mo><mn>47</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>65</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. 
These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104621"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145847576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
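A toy sketch of one way riser height and tread depth can be recovered from a staircase point cloud, by grouping point heights into tread levels. The axis conventions, the gap threshold, and the synthetic staircase are assumptions; the paper's point-cloud algorithm is not reproduced here.

```python
import numpy as np

def stair_dimensions(points, gap=0.05):
    """Estimate mean riser height and tread depth from a staircase point cloud.

    points: (N, 3) array of (x, y, z) samples on the tread surfaces, in meters,
    with y vertical and x pointing up the staircase. Heights are grouped into
    tread levels wherever consecutive sorted heights jump by more than `gap`.
    """
    order = np.argsort(points[:, 1])
    heights = points[order, 1]
    splits = np.where(np.diff(heights) > gap)[0] + 1
    groups = np.split(order, splits)                       # point indices per tread
    levels = np.array([points[g, 1].mean() for g in groups])
    depths = np.array([np.ptp(points[g, 0]) for g in groups])
    return float(np.diff(levels).mean()), float(depths.mean())

# Synthetic three-step staircase: 0.17 m risers, 0.28 m treads.
rng = np.random.default_rng(1)
steps = []
for k in range(3):
    x = rng.uniform(k * 0.28, (k + 1) * 0.28, 400)         # tread extent
    y = rng.normal((k + 1) * 0.17, 0.003, 400)             # tread height
    z = rng.uniform(-0.4, 0.4, 400)                        # stair width
    steps.append(np.stack([x, y, z], axis=1))
riser, depth = stair_dimensions(np.concatenate(steps))
print(round(riser, 2), round(depth, 2))                    # ≈ 0.17 0.28
```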