Securing workers and workspaces: Contextual privacy for vision-based ergonomics
Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.cviu.2026.104675
Sander De Coninck, Emilio Gamba, Bart Van Doninck, Abdellatif Bey-Temsamani, Thorsten Cardoen, Sam Leroux, Pieter Simoens
Multi-camera computer vision in industry offers advantages but poses risks to worker privacy and intellectual property through exposure of sensitive contextual information. Existing privacy methods often inadequately protect background details crucial in manufacturing. This issue is prominent in applications like automated ergonomic assessment, where visual data for posture analysis can reveal sensitive workplace information. We propose a system for simultaneous personal privacy and enhanced contextual intellectual property protection, featuring a novel probabilistic obfuscation technique. Our edge-based Generative Adversarial Privacy system employs a modified obfuscator that learns to inject controlled, pixel-wise random noise, particularly into non-critical background regions. This more effectively obscures IP-sensitive environmental details before data transmission for central analysis (e.g., pose estimation). Our approach, validated in a multi-camera ergonomic study, effectively protects worker privacy and contextual IP, as evaluated with quantitative metrics, and maintains 3D pose accuracy for reliable ergonomic assessment. This work provides a solution for deploying vision systems in sensitive industrial settings by holistically addressing privacy requirements through an advanced, adaptive obfuscation strategy.
{"title":"Securing workers and workspaces: Contextual privacy for vision-based ergonomics","authors":"Sander De Coninck , Emilio Gamba , Bart Van Doninck , Abdellatif Bey-Temsamani , Thorsten Cardoen , Sam Leroux , Pieter Simoens","doi":"10.1016/j.cviu.2026.104675","DOIUrl":"10.1016/j.cviu.2026.104675","url":null,"abstract":"<div><div>Multi-camera computer vision in industry offers advantages but poses risks to worker privacy and intellectual property through exposure of sensitive contextual information. Existing privacy methods often inadequately protect background details crucial in manufacturing. This issue is prominent in applications like automated ergonomic assessment, where visual data for posture analysis can reveal sensitive workplace information. We propose a system for simultaneous personal privacy and enhanced contextual intellectual property protection, featuring a novel probabilistic obfuscation technique. Our edge-based Generative Adversarial Privacy system employs a modified obfuscator that learns to inject controlled, pixel-wise random noise, particularly into non-critical background regions. This more effectively obscures IP-sensitive environmental details before data transmission for central analysis (e.g., pose estimation). Our approach, validated in a multi-camera ergonomic study, effectively protects worker privacy and contextual IP (metrics-evaluated) and maintains 3D pose accuracy for reliable ergonomic assessment. This work provides a solution for deploying vision systems in sensitive industrial settings by holistically addressing privacy requirements through an advanced, adaptive obfuscation strategy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104675"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances of local mechanisms in vision foundation models: A survey and outlook
Pub Date: 2026-03-01 | Epub Date: 2026-02-18 | DOI: 10.1016/j.cviu.2026.104679
Qiangchang Wang, Jing Li, Yilong Yin, Huimin Lu
Vision foundation models (VFMs) play a crucial role in complex, real-world computer vision applications. Key developments such as Vision Transformers, self-supervised learning, and multi-modal learning have significantly contributed to the advancement of VFMs. However, these models face challenges, including high resource demands, weak locality, and a lack of generalized features. Drawing inspiration from the human brain’s ability to emphasize important information while suppressing irrelevant input, local mechanisms have been designed to address these challenges and enhance VFM development. Local mechanisms not only focus on specific parts of an input to learn discriminative, fine-grained features but also selectively process information, thereby improving model efficiency. These mechanisms exhibit unique characteristics across different domains. In this survey, we provide a systematic review of local mechanisms across three key areas of VFMs: multi-modal learning, self-supervised learning, and Vision Transformers. We categorize the various local mechanisms, summarize quantitative results in each area, and analyze the advantages and disadvantages of each category in depth, offering insights for further exploration. We examine the potential of VFMs in semi-supervised learning, weakly supervised learning, domain adaptation, and few-shot learning, highlighting their specific applications in remote sensing and medical image analysis. Finally, we discuss potential research directions related to local mechanisms that could inspire future work. To the best of our knowledge, this is the first comprehensive survey on local mechanisms across different fields. We hope this review will provide valuable insights and guide future research in VFMs.
{"title":"Recent advances of local mechanisms in vision foundation models: A survey and outlook","authors":"Qiangchang Wang , Jing Li , Yilong Yin , Huimin Lu","doi":"10.1016/j.cviu.2026.104679","DOIUrl":"10.1016/j.cviu.2026.104679","url":null,"abstract":"<div><div>Vision foundation models (VFMs) play a crucial role in complex, real-world computer vision applications. Key developments such as Vision Transformers, self-supervised learning, and multi-modal learning have significantly contributed to the advancement of VFMs. However, these models face challenges, including high resource demands, weak locality, and a lack of generalized features. Drawing inspiration from the human brain’s ability to emphasize important information while suppressing irrelevant input, local mechanisms have been designed to address these challenges and enhance VFMs development. Local mechanisms not only focus on specific parts of an input to learn discriminative, fine-grained features but also selectively process information, thereby improving model efficiency. These mechanisms exhibit unique characteristics across different domains. In this survey, we provide a systematic review of local mechanisms across three key areas of VFMs: multi-modal learning, self-supervised learning, and Vision Transformers. We categorize the various local mechanisms, summarize quantitative results in each area, and analyze the advantages and disadvantages of each category in depth, offering insights for further exploration. We examine the potential of VFMs in semi-supervised learning, weakly supervised learning, domain adaptation, and few-shot learning, highlighting their specific applications in remote sensing and medical image analysis. Finally, we discuss potential research directions related to local mechanisms that could inspire future work. To the best of our knowledge, this is the first comprehensive survey on local mechanisms across different fields. We hope this review will provide valuable insights and guide future research in VFMs.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104679"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147422113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile Auslan: A multimodal dialogue-centered sign language learning system
Pub Date: 2026-03-01 | Epub Date: 2026-02-03 | DOI: 10.1016/j.cviu.2026.104646
Hongwei Sheng, Xin Shen, Heming Du, Xin Yu
Learning sign language is not only a gateway to linguistic competence but also to cultural participation, self-expression, and meaningful interaction within the Deaf community. Recent advances in sign language education technologies have made notable progress in supporting vocabulary acquisition and sentence-level translation. However, dialogue, as one critical component of natural language use, remains largely absent from existing systems. We identify two underlying challenges that contribute to this gap: the lack of recognition robustness to viewpoint variation, which constrains expressive freedom during signing; and the limited semantic modeling capabilities needed to support discourse-level interpretation and interaction. To address these issues, we propose a learning-centered system that integrates pose-based multi-view augmentation and a multi-agent language modeling workflow. This system supports free-form input across diverse signing perspectives and provides structured feedback across word, sentence, and dialogue levels. Built as a modular and deployable platform, the system demonstrates strong recognition performance and learner engagement across varied input conditions. Through this integration of spatial robustness and semantic scaffolding, our work advances the design of sign language learning technologies toward more interactive, expressive, and pedagogically grounded experiences. However, the current system remains limited by its focus on successive (non-continuous) signing and by its moderate vocabulary and restricted pedagogical scope. Future work will therefore extend lexical coverage, incorporate more natural continuous signing, and investigate richer educational functions.
{"title":"Mobile Auslan: A multimodal dialogue-centered sign language learning system","authors":"Hongwei Sheng, Xin Shen, Heming Du, Xin Yu","doi":"10.1016/j.cviu.2026.104646","DOIUrl":"10.1016/j.cviu.2026.104646","url":null,"abstract":"<div><div>Learning sign language is not only a gateway to linguistic competence but also to cultural participation, self-expression, and meaningful interaction within the Deaf community. Recent advances in sign language education technologies have made notable progress in supporting vocabulary acquisition and sentence-level translation. However, dialogue, as one critical component of natural language use, remains largely absent from existing systems. We identify two underlying challenges that contribute to this gap: the lack of recognition robustness to viewpoint variation, which constrains expressive freedom during signing; and the limited semantic modeling capabilities needed to support discourse-level interpretation and interaction. To address these issues, we propose a learning-centered system that integrates pose-based multi-view augmentation and a multi-agent language modeling workflow. This system supports free-form input across diverse signing perspectives and provides structured feedback across word, sentence, and dialogue levels. Built as a modular and deployable platform, the system demonstrates strong recognition performance and learner engagement across varied input conditions. Through this integration of spatial robustness and semantic scaffolding, our work advances the design of sign language learning technologies toward more interactive, expressive, and pedagogically grounded experiences. However, the current system remains limited by its focus on successive (non-continuous) signing and by its moderate vocabulary and restricted pedagogical scope. Future work will therefore extend lexical coverage, incorporate more natural continuous signing, and investigate richer educational functions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104646"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FDE: A Frequency-Domain Enhancement method for object detection in complex traffic scenes
Pub Date: 2026-03-01 | DOI: 10.1016/j.cviu.2026.104677
Wenyan Sun, Feifei Xie, Yuxuan Zhang, Liangrui Wei, Fuzheng Chu, Xiaoyu Tang
Object detectors in complex traffic scenes often struggle with small objects, cluttered boundaries, and heavy occlusions due to a low-frequency bias, where models capture smooth backgrounds but overlook high-frequency details. To address this, we propose a Frequency-Domain Enhancement (FDE) framework that integrates three modules: (i) Frequency-Adaptive Attention (FAA) to dynamically emphasize informative frequency components, (ii) Enhanced Frequency Dynamic Convolution (EFDConv) to adaptively model diverse spectral patterns, and (iii) Frequency-Aware Supervision (FAS) to guide training towards edge and fine-structure preservation. Embedded into RT-DETR, FDE achieves consistent gains of +1.7 mAP on the KITTI dataset and +1.9 mAP on the COCO dataset, with negligible computational overhead. These results demonstrate that FDE effectively alleviates low-frequency bias and improves detection of small and adjacent objects, making it suitable for deployment in complex traffic environments.
{"title":"FDE: A Frequency-Domain Enhancement method for object detection in complex traffic scenes","authors":"Wenyan Sun , Feifei Xie , Yuxuan Zhang , Liangrui Wei , Fuzheng Chu , Xiaoyu Tang","doi":"10.1016/j.cviu.2026.104677","DOIUrl":"10.1016/j.cviu.2026.104677","url":null,"abstract":"<div><div>Object detectors in complex traffic scenes often struggle with small objects, cluttered boundaries, and heavy occlusions due to a low-frequency bias, where models capture smooth backgrounds but overlook high-frequency details. To address this, we propose a Frequency-Domain Enhancement (FDE) framework that integrates three modules: (i) Frequency-Adaptive Attention (FAA) to dynamically emphasize informative frequency components, (ii) Enhanced Frequency Dynamic Convolution (EFDConv) to adaptively model diverse spectral patterns, and (iii) Frequency-Aware Supervision (FAS) to guide training towards edge and fine-structure preservation. Embedded into RT-DETR, FDE achieves consistent gains of +1.7 mAP on KITTI dataset and +1.9 mAP on COCO dataset, with negligible computational overhead. These results demonstrate that FDE effectively alleviates low-frequency bias and improves detection of small and adjacent objects, making it suitable for deployment in complex traffic environments.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104677"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grey wolf optimization for color quantization
Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.cviu.2026.104659
María-Luisa Pérez-Delgado, Jesús-Ángel Román-Gallego, M. Emre Celebi
Color quantization is an image processing operation that attempts to reduce the number of distinct colors used to represent an image without significant loss of quality. This operation is useful as an initial step to perform further processing on the image. The interest in this operation has led to several solution methods being proposed over the years. Among these, several swarm-based methods have been applied to the problem in recent years. This article discusses the application of one of these methods, called grey wolf optimization, to perform color quantization. This algorithm has generated good results when applied to a variety of optimization problems. Among the features that differentiate this algorithm from other swarm algorithms are its ability to converge toward better solutions, its speed, and the existence of a single control parameter. The article describes in detail how the grey wolf optimization method should be adapted to perform color quantization, so that it generates a quantized palette that allows the quantized image to be represented. In this case, each individual in the group represents a quantized palette, which is improved with the information provided by the group. The detailed description of the algorithm is complemented by an extensive testing section that compares the results of the proposed method to those of 16 other techniques. The results, based on the comparison of MSE, MAE, PSNR, SSIM, and runtime, show that grey wolf optimization can generate good-quality images, better than most of the compared methods.
{"title":"Grey wolf optimization for color quantization","authors":"María-Luisa Pérez-Delgado , Jesús-Ángel Román-Gallego , M. Emre Celebi","doi":"10.1016/j.cviu.2026.104659","DOIUrl":"10.1016/j.cviu.2026.104659","url":null,"abstract":"<div><div>Color quantization is an image processing operation that attempts to reduce the number of distinct colors used to represent an image without significant loss of quality. This operation is useful as an initial step to perform further processing on the image. The interest in this operation has led to several solution methods being proposed over the years. Within these methods several swarm-based methods have been used in recent years to solve the problem. This article discusses the application of one of these methods, called grey wolf optimization, to perform color quantization. This algorithm has generated good results when applied to a variety of optimization problems. Among the features that differentiate this algorithm from other swarm algorithms are its ability to converge toward better solutions, its speed, and the existence of a single control parameter. The article describes in detail how the grey wolf optimization method should be adapted to perform color quantization, so that it generates a quantized palette that allows the quantized image to be represented. In this case, each individual in the group represents a quantized palette, which is improved with the information provided by the group. The detailed description of the algorithm is complemented by an extensive testing section that compares the results of the proposed method to those of 16 others techniques. The results, based on the comparison of MSE, MAE, PSNR, SSIM, and runtime, show that grey wolf optimization can generate good quality images, better than most of the compared methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104659"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced local homogenization and reconstruction network for few-shot fine-grained image classification
Pub Date: 2026-03-01 | Epub Date: 2026-02-06 | DOI: 10.1016/j.cviu.2026.104686
Meiyin Hu, Huan Wan, Hui Wang, Jianfeng Xu, Xin Wei
Few-shot fine-grained image classification is a challenging task as it requires the ability to detect minor variations across subclasses. Metric-based methods are among the most popular approaches to this task, but they suffer from spatial information loss and local feature misalignment. Feature reconstruction-based methods can partially mitigate these two problems and exhibit substantial potential for fine-grained image classification. However, we found and verified that feature reconstruction-based methods have two unresolved difficulties: inadequate extraction of discriminative local information, and insufficient fitting capability when reconstructing numerous support set features with limited query set features. These difficulties are key obstacles to further improving the performance of few-shot fine-grained image classification. Therefore, we propose the enhanced local homogenization and reconstruction network (ELHRN) for few-shot fine-grained image classification. The proposed method includes three main modules: the local homogenization and distinction module (HDM) to learn more subtle discriminative local features; the double-layer cross-reconstruction module (DCM) to increase the hierarchy and complexity of feature reconstruction, effectively enhancing feature diversity; and the branch weighting module (BWM) to adjust the weights of mutual reconstruction between the support set and query set, thereby mitigating the issue of insufficient fitting capability when reconstructing numerous features with a limited number of features. Extensive experiments are conducted on five benchmark fine-grained datasets, and the results demonstrate that ELHRN outperforms state-of-the-art methods. The code is available at https://github.com/sausage0611/ELHRN.
{"title":"Enhanced local homogenization and reconstruction network for few-shot fine-grained image classification","authors":"Meiyin Hu , Huan Wan , Hui Wang , Jianfeng Xu , Xin Wei","doi":"10.1016/j.cviu.2026.104686","DOIUrl":"10.1016/j.cviu.2026.104686","url":null,"abstract":"<div><div>Few-shot fine-grained image classification is a challenging task as it requires the ability to detect minor variations across subclasses. Metric-based methods are some of the most popular methods for dealing with this issue, but they have the problems of spatial information loss and local feature misalignment. Feature reconstruction-based methods can partially mitigate these two problems and exhibit substantial potential for fine-grained image classification. However, we found and verified that feature reconstruction-based methods have two unresolved difficulties, including inadequate extraction of discriminative local information and insufficient fitting capability when reconstructing numerous support set features with limited query set features. These difficulties are key obstacles to further improving the performance of few-shot fine-grained image classification. Therefore, we propose the enhanced local homogenization and reconstruction network (ELHRN) for few-shot fine-grained image classification. The proposed method includes three main modules: the local homogenization and distinction module (HDM) to learn more subtle discriminative local features; the double-layer cross-reconstruction module (DCM) to increase the hierarchy and complexity of feature reconstruction, effectively enhancing feature diversity; and the branch weighting module (BWM) to adjust the weights of mutual reconstruction between the support set and query set, thereby mitigating the issue of insufficient fitting capability when reconstructing numerous features with a limited number of features. Extensive experiments are conducted on five benchmark fine-grained datasets, and the results demonstrate that ELHRN outperforms state-of-the-art methods. The code is available at <span><span>https://github.com/sausage0611/ELHRN</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104686"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slope-Track: Multiple Object Tracking on Ski Slopes
Pub Date: 2026-02-01 | DOI: 10.1016/j.cviu.2026.104663
M’Saydez Campbell, Christophe Ducottet, Damien Muselet, Rémi Emonet
In this paper, we introduce Slope-Track, a novel multiple object tracking (MOT) dataset designed to reflect the complexities of real ski slope environments. The dataset has over 96,000 frames collected from 10 different ski resorts under various weather and visibility conditions. Slope-Track addresses significant challenges in slope monitoring, including small object sizes, occlusions, fast and irregular motion, and low appearance consistency. It is densely annotated with bounding boxes and object identities, facilitating the evaluation of detection and tracking algorithms. We analyze the dataset’s characteristics, comparing it to existing MOT datasets. The results demonstrate that Slope-Track encapsulates a combination of challenges found in other datasets. Additionally, we benchmark a range of existing tracking algorithms and propose a new module that improves motion-based association by dealing with the specific shape of trajectories along ski slopes. Our results demonstrate that incorporating appearance features can have a mixed impact, depending on how they are used within each tracking algorithm. In contrast, motion-based methods and spatial association strategies show more reliable performance. Overall, we provide a challenging benchmark for evaluating and improving multi-object tracking systems in real-world outdoor environments. The dataset and code can be found at https://slopetrack.github.io/.
{"title":"Slope-Track: Multiple Object Tracking on Ski Slopes","authors":"M’Saydez Campbell , Christophe Ducottet , Damien Muselet , Rémi Emonet","doi":"10.1016/j.cviu.2026.104663","DOIUrl":"10.1016/j.cviu.2026.104663","url":null,"abstract":"<div><div>In this paper, we introduce Slope-Track. Slope-Track is a novel multiple object tracking (MOT) dataset designed to reflect the complexities of real ski slope environments. The dataset has over 96,000 frames collected from 10 different ski resorts under various weather and visibility conditions. Slope-Track addresses significant challenges in slope monitoring, including small object sizes, occlusions, fast and irregular motion, and low appearance consistency. It is densely annotated with bounding boxes and object identities, facilitating the evaluation of detection and tracking algorithms. We analyze the dataset’s characteristics comparing it to the existing MOT datasets. The results demonstrate that Slope-Track encapsulates a combination of challenges found in other datasets. Additionally, we benchmark a range of existing tracking algorithms and propose a new module that improves motion-based association by dealing with the specific shape of trajectories along ski slopes. Our results demonstrate that incorporating appearance features can have a mixed impact, depending on how they are used within each tracking algorithm. In contrast, motion-based methods and spatial association strategies show more reliable performance. Overall, we provide a challenging benchmark for evaluating and improving multi-object tracking systems in real-world outdoor environments. The dataset and code can be found at <span><span>https://slopetrack.github.io/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104663"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146078310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UAVDet: A CNN–Mamba hybrid network for efficient small object detection in UAV imagery
Pub Date: 2026-02-01 | Epub Date: 2026-01-07 | DOI: 10.1016/j.cviu.2026.104637
Yiming Yang, Feng Guo, Pei Niu
Real-time object detection is pivotal in traffic-related Unmanned Aerial Vehicles (UAV) applications. However, UAV imagery presents significant challenges due to the predominance of small objects and complex backgrounds. Traditional backbones generally perform aggressive early-stage downsampling, causing the loss of fine-grained features. To address these issues, we propose UAVDet, a real-time detection model that combines Convolutional Neural Network (CNN) and Mamba architectures. First, we revisit the conventional backbone design by reconfiguring its depth and width, with a focus on preserving fine-grained details crucial for small object detection. Second, we propose the Cross Stage Partial Mamba (CSPMB) module, which integrates the Mamba structure into the CNN framework to enhance global feature representation and improve robustness against complex background interference. Third, we design Tiny-focused Feature Pyramid Network (TFPN) by rebalancing the feature fusion flow and replacing the large-object detection head with a tiny-object detection head, which significantly improves the perception of small objects. Comprehensive experiments on the VisDrone dataset show that our method improves AP and AP_S (small-object AP) by 4.5% and 5.0%, respectively, while reducing parameters by 84.9% compared to the baseline. It also reaches 53 FPS on an RTX 4090, exceeding the 30 FPS real-time threshold. Additional evaluations on UAVDT and DroneVehicle further verify the method’s robust generalization. These results indicate the effectiveness of the developed method in UAV image detection.
{"title":"UAVDet: A CNN–Mamba hybrid network for efficient small object detection in UAV imagery","authors":"Yiming Yang, Feng Guo, Pei Niu","doi":"10.1016/j.cviu.2026.104637","DOIUrl":"10.1016/j.cviu.2026.104637","url":null,"abstract":"<div><div>Real-time object detection is pivotal in traffic-related Unmanned Aerial Vehicles (UAV) applications. However, UAV imagery presents significant challenges due to the predominance of small objects and complex backgrounds. Traditional backbones generally perform aggressive early-stage downsampling, causing the loss of fine-grained features. To address these issues, we propose UAVDet, a real-time detection model that combines Convolutional Neural Network (CNN) and Mamba architectures. First, we revisit the conventional backbone design by reconfiguring its depth and width, with a focus on preserving fine-grained details crucial for small object detection. Second, we propose the Cross Stage Partial Mamba (CSPMB) module, which integrates the Mamba structure into the CNN framework to enhance global feature representation and improve robustness against complex background interference. Third, we design Tiny-focused Feature Pyramid Network (TFPN) by rebalancing the feature fusion flow and replacing the large-object detection head with a tiny-object detection head, which significantly improves the perception of small objects. Comprehensive experiments on the VisDrone dataset show that our method improves AP and AP<span><math><msub><mrow></mrow><mrow><mi>S</mi></mrow></msub></math></span> by 4.5% and 5.0%, respectively, while reducing parameters by 84.9% compared to the baseline. It also reaches 53 FPS on an RTX 4090, exceeding the 30 FPS real-time threshold. Additional evaluations on UAVDT and DroneVehicle further verify the method’s robust generalization. These results indicate the effectiveness of the developed method in UAV image detection.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104637"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments
Pub Date: 2026-02-01 | Epub Date: 2026-01-06 | DOI: 10.1016/j.cviu.2025.104629
Ofer Idan, Yoli Shavit, Yosi Keller
Relative pose regressors (RPRs) determine the pose of a query image by estimating its relative translation and rotation to a reference pose-labeled camera. Unlike other regression-based localization techniques confined to a scene’s absolute parameters, RPRs learn residuals, making them adaptable to new environments. However, RPRs have exhibited limited generalization to scenes not utilized during training (“unseen scenes”). In this work, we explore the ability of RPRs to localize in unseen scenes and propose algorithmic modifications to enhance their generalization. These modifications include attention-based aggregation of coarse feature maps, dynamic adaptation of model weights, and geometry-aware optimization. Our proposed approach improves the localization accuracy of RPRs in unseen scenes by a notable margin across multiple indoor and outdoor benchmarks and under various conditions while maintaining comparable performance in scenes used during training. We assess the contribution of each component through ablation studies and further analyze the uncertainty of our model in unseen scenes. Our code and pre-trained models are available at https://github.com/yolish/relformer.
{"title":"Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments","authors":"Ofer Idan, Yoli Shavit, Yosi Keller","doi":"10.1016/j.cviu.2025.104629","DOIUrl":"10.1016/j.cviu.2025.104629","url":null,"abstract":"<div><div>Relative pose regressors (RPRs) determine the pose of a query image by estimating its relative translation and rotation to a reference pose-labeled camera. Unlike other regression-based localization techniques confined to a scene’s absolute parameters, RPRs learn residuals, making them adaptable to new environments. However, RPRs have exhibited limited generalization to scenes not utilized during training (“unseen scenes”). In this work, we explore the ability of RPRs to localize in unseen scenes and propose algorithmic modifications to enhance their generalization. These modifications include attention-based aggregation of coarse feature maps, dynamic adaptation of model weights, and geometry-aware optimization. Our proposed approach improves the localization accuracy of RPRs in unseen scenes by a notable margin across multiple indoor and outdoor benchmarks and under various conditions while maintaining comparable performance in scenes used during training. We assess the contribution of each component through ablation studies and further analyze the uncertainty of our model in unseen scenes. Our Code and pre-trained models are available at <span><span>https://github.com/yolish/relformer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104629"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support
Pub Date: 2026-02-01 | Epub Date: 2025-12-23 | DOI: 10.1016/j.cviu.2025.104621
Edgar R. Guzman, Letizia Gionfrida, Robert D. Howe
This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of 5.77 cm, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of 1.20 ± 0.49 cm in height and 1.35 ± 0.45 cm in depth for ascending stairs, and 1.28 ± 0.55 cm in height and 1.47 ± 0.65 cm in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.
{"title":"RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support","authors":"Edgar R. Guzman , Letizia Gionfrida , Robert D. Howe","doi":"10.1016/j.cviu.2025.104621","DOIUrl":"10.1016/j.cviu.2025.104621","url":null,"abstract":"<div><div>This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of <span><math><mrow><mn>5</mn><mo>.</mo><mn>77</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span>, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of <span><math><mrow><mn>1</mn><mo>.</mo><mn>20</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>49</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in height and <span><math><mrow><mn>1</mn><mo>.</mo><mn>35</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>45</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in depth for ascending stairs, and <span><math><mrow><mn>1</mn><mo>.</mo><mn>28</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>55</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in height and <span><math><mrow><mn>1</mn><mo>.</mo><mn>47</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>65</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. 
These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104621"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145847576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
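A toy sketch of one way riser height and tread depth can be recovered from a staircase point cloud, by grouping point heights into tread levels. The axis conventions, the gap threshold, and the synthetic staircase are assumptions; the paper's point-cloud algorithm is not reproduced here.

```python
import numpy as np

def stair_dimensions(points, gap=0.05):
    """Estimate mean riser height and tread depth from a staircase point cloud.

    points: (N, 3) array of (x, y, z) samples on the tread surfaces, in meters,
    with y vertical and x pointing up the staircase. Heights are grouped into
    tread levels wherever consecutive sorted heights jump by more than `gap`.
    """
    order = np.argsort(points[:, 1])
    heights = points[order, 1]
    splits = np.where(np.diff(heights) > gap)[0] + 1
    groups = np.split(order, splits)                       # point indices per tread
    levels = np.array([points[g, 1].mean() for g in groups])
    depths = np.array([np.ptp(points[g, 0]) for g in groups])
    return float(np.diff(levels).mean()), float(depths.mean())

# Synthetic three-step staircase: 0.17 m risers, 0.28 m treads.
rng = np.random.default_rng(1)
steps = []
for k in range(3):
    x = rng.uniform(k * 0.28, (k + 1) * 0.28, 400)         # tread extent
    y = rng.normal((k + 1) * 0.17, 0.003, 400)             # tread height
    z = rng.uniform(-0.4, 0.4, 400)                        # stair width
    steps.append(np.stack([x, y, z], axis=1))
riser, depth = stair_dimensions(np.concatenate(steps))
print(round(riser, 2), round(depth, 2))                    # ≈ 0.17 0.28
```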