Training-Free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing
Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Lin Lin, Xiang Bai
Pub Date: 2026-04-02 | DOI: 10.1109/tpami.2026.3680162
Learning Continuous Spatiotemporal Implicit Neural Fields for Unsupervised Video Denoising
Xiaowan Hu, Henan Liu, Ce Zheng, Xinyang Li, Mai Xu
Pub Date: 2026-04-02 | DOI: 10.1109/tpami.2026.3680159
SparseBEV: A Fully Sparse Framework for Multi-View 3D Object Detection
Yang Chen, Haisong Liu, Limin Wang
Pub Date: 2026-04-01 | DOI: 10.1109/tpami.2026.3679808
Camera-based 3D object detection in BEV (bird's-eye-view) space has attracted considerable attention over the past few years. Dense detectors typically follow a two-stage pipeline, first constructing a dense BEV feature map and then performing object detection in BEV space, which suffers from complex view transformations and high computation costs. Sparse detectors, on the other hand, follow a query-based paradigm without explicit dense BEV feature construction, but generally underperform their dense counterparts. In this paper, we find that the key to closing this performance gap is the adaptability of the detector in both BEV and image space. To this end, we propose a fully sparse 3D object detector that outperforms dense counterparts while running faster. Our sparse detector contains three key designs: (1) scale-adaptive self-attention to aggregate features with an adaptive receptive field in BEV space, (2) scale-adaptive cross-attention to capture the unique temporal dynamics associated with different objects, and (3) adaptive sampling and mixing to perform interactions between queries and image features under the guidance of the queries. These components enhance the adaptability of the detector in both BEV and image space. Furthermore, we explore two distinct temporal modeling approaches to leverage temporal features effectively: sampling-point-based multi-frame stacking (dubbed SparseBEV) and query-based recurrent temporal fusion (dubbed SparseBEV++). Experiments are conducted on the nuScenes and Waymo datasets. On the val split of nuScenes, both SparseBEV and SparseBEV++ surpass all previous methods: SparseBEV achieves 55.8 NDS at 23.5 FPS, and SparseBEV++ further achieves a remarkable 57.1 NDS while maintaining a real-time inference speed of 24.6 FPS. On the Waymo dataset, our best-performing model, SparseBEV++, outperforms previous methods, reaching 58.9 mAP and 55.2 mAPH.
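A minimal PyTorch sketch of the distance-penalized self-attention idea behind design (1): each query carries a learned scale tau that shrinks or widens its receptive field over the other queries' BEV centers. The function name, tensor shapes, and the exact penalty form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def scale_adaptive_self_attention(q, k, v, centers, tau):
    """Self-attention over sparse BEV queries with a per-query receptive field.

    q, k, v : (N, d) features of the N object queries
    centers : (N, 2) predicted (x, y) centers of the queries in BEV space
    tau     : (N, 1) non-negative learned scale; larger tau -> more local focus
    """
    d = q.size(-1)
    content = (q @ k.t()) / d ** 0.5                 # (N, N) standard attention logits
    dist = torch.cdist(centers, centers)             # (N, N) pairwise BEV center distances
    attn = F.softmax(content - tau * dist, dim=-1)   # distance-penalized weights
    return attn @ v
```

With tau near zero this reduces to plain global self-attention; a large tau restricts each query to its spatial neighbours, which is the kind of adaptability in BEV space the abstract refers to.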
{"title":"SparseBEV: A Fully Sparse Framework for Multi-View 3D Object Detection.","authors":"Yang Chen,Haisong Liu,Limin Wang","doi":"10.1109/tpami.2026.3679808","DOIUrl":"https://doi.org/10.1109/tpami.2026.3679808","url":null,"abstract":"Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation costs. On the other side, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction but generally underperform compared to dense ones. In this paper, we find that the key to mitigating this performance gap is the adaptability of the detector in both BEV and image space. To this end, we propose a fully sparse 3D object detector that outperforms the dense counterparts and enjoys a higher running speed. Our sparse detector contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) scale-adaptive cross attention to capture the unique temporal dynamics associated with different objects, (3) adaptive sampling and mixing to perform interactions between queries and image features under the guidance of queries. These key components enhance the adaptability of the detector in both BEV and image space. Furthermore, we explore two distinct temporal modeling approaches: sampling-point-based multi-frame stacking (dubbed SparseBEV) and query-based recurrent temporal fusion (dubbed SparseBEV++) to leverage temporal features effectively. Experiments are conducted on the nuScenes and Waymo datasets. On the val split of nuScenes, both SparseBEV and SparseBEV++ surpass all previous methods. Our SparseBEV achieves a performance of 55.8 NDS and a speed of 23.5 FPS, and SparseBEV++ further achieves a remarkable 57.1 NDS while maintaining a real-time inference speed of 24.6 FPS. On the Waymo dataset, our best-performing model, SparseBEV++, outperforms previous methods with a lead of 58.9 mAP and 55.2 mAPH.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"17 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147585516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao
Pub Date: 2026-03-31 | DOI: 10.1109/tpami.2026.3679561
Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both the category and spatial levels during multi-step inference. Without additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, we demonstrate its applicability to real-world scenarios using only 2D inputs, without computationally expensive 3D reconstruction. Code will be made available at https://github.com/ZzZZCHS/Chat-Scene.
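A hedged sketch of the object-centric serialization idea: each detected object is paired with a dedicated identifier token embedding and its projected feature, interleaved into one sequence so the LLM can ground answers by emitting identifiers. All names here (ObjectSequenceBuilder, the dimensions, the identifier vocabulary) are illustrative assumptions, not the released Chat-Scene++ code.

```python
import torch
import torch.nn as nn

class ObjectSequenceBuilder(nn.Module):
    """Turn per-object features into an interleaved LLM input sequence:
    <OBJ_0> e_0 <OBJ_1> e_1 ... so the model can refer to objects by token."""

    def __init__(self, feat_dim: int, llm_dim: int, max_objects: int = 100):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)             # map fused 3D+2D features to LLM space
        self.id_embed = nn.Embedding(max_objects, llm_dim)   # one embedding per identifier token

    def forward(self, object_feats: torch.Tensor) -> torch.Tensor:
        # object_feats: (num_objects, feat_dim), e.g. concatenated 3D and 2D encodings
        n = object_feats.size(0)
        ids = self.id_embed(torch.arange(n, device=object_feats.device))  # (n, llm_dim)
        feats = self.proj(object_feats)                                   # (n, llm_dim)
        # interleave identifiers and features: (n, 2, llm_dim) -> (2n, llm_dim)
        return torch.stack((ids, feats), dim=1).reshape(2 * n, -1)
```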
{"title":"Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM.","authors":"Haifeng Huang,Yilun Chen,Zehan Wang,Jiangmiao Pang,Zhou Zhao","doi":"10.1109/tpami.2026.3679561","DOIUrl":"https://doi.org/10.1109/tpami.2026.3679561","url":null,"abstract":"Recent advancements in multi-modal large language models (MLLMs) have shown strong potential for 3D scene understanding. However, existing methods struggle with fine-grained object grounding and contextual reasoning, limiting their ability to interpret and interact with complex 3D environments. In this paper, we present Chat-Scene++, an MLLM framework that represents 3D scenes as context-rich object sequences. By structuring scenes as sequences of objects with contextual semantics, Chat-Scene++ enables object-centric representation and interaction. It decomposes a 3D scene into object representations paired with identifier tokens, allowing LLMs to follow instructions across diverse 3D vision-language tasks. To capture inter-object relationships and global semantics, Chat-Scene++ extracts context-rich object features using large-scale pre-trained 3D scene-level and 2D image-level encoders, unlike the isolated per-object features in Chat-Scene. Its flexible object-centric design also supports grounded chain-of-thought (G-CoT) reasoning, enabling the model to distinguish objects at both category and spatial levels during multi-step inference. Without the need for additional task-specific heads or fine-tuning, Chat-Scene++ achieves state-of-the-art performance on five major 3D vision-language benchmarks: ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D. These results highlight its effectiveness in scene comprehension, object grounding, and spatial reasoning. Additionally, without reconstructing 3D worlds through computationally expensive processes, we demonstrate its applicability to real-world scenarios using only 2D inputs. Code will be made available at https://github.com/ZzZZCHS/Chat-Scene https://github.com/ZzZZCHS/Chat-Scene.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"12 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147584078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Match Stereo Videos Via Bidirectional Alignment
Junpeng Jing, Ye Mao, Anlan Qiu, Krystian Mikolajczyk
Pub Date: 2026-03-31 | DOI: 10.1109/tpami.2026.3679033
Video stereo matching is the task of estimating consistent disparity maps from rectified stereo videos. There is considerable scope for improvement in both datasets and methods in this area. Recent learning-based methods often optimize performance on independent stereo pairs, leading to temporal inconsistencies in videos. Existing video methods typically employ a sliding-window operation over the time dimension, which can result in low-frequency oscillations corresponding to the window size. To address these challenges, we propose a bidirectional alignment mechanism for adjacent frames as a fundamental operation. Building on this, we introduce a novel video processing framework, BiDAStereo, and a plug-in stabilizer network, BiDAStabilizer, compatible with general image-based methods. Regarding datasets, current synthetic object-based and indoor datasets are commonly used for training and benchmarking, and outdoor natural scenarios are lacking. To bridge this gap, we present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation. Extensive in-domain, out-of-domain, and robustness experiments demonstrate the contribution of our methods and datasets, showing improvements in prediction quality and state-of-the-art results on various commonly used benchmarks. The project page, demos, code, and datasets are available at: https://tomtomtommi.github.io/BiDAVideo/.
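A minimal sketch of the kind of bidirectional alignment the abstract describes: features of the previous and next frames are backward-warped into the current frame's coordinates (here using externally supplied optical flow) before fusion. The flow source and the simple averaging fusion are assumptions; BiDAStereo's actual alignment and aggregation modules are learned.

```python
import torch
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Warp feat (B,C,H,W) into the reference frame using flow (B,2,H,W),
    where flow maps reference-frame pixels to source-frame locations."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device),
        torch.arange(w, device=feat.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow             # (B, 2, H, W) sampling locations
    # normalize to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def bidirectional_align(prev_f, cur_f, next_f, flow_to_prev, flow_to_next):
    """Fuse the current frame with both temporal neighbours after aligning
    each neighbour into the current frame's coordinates."""
    aligned_prev = backward_warp(prev_f, flow_to_prev)
    aligned_next = backward_warp(next_f, flow_to_next)
    return (aligned_prev + cur_f + aligned_next) / 3.0
```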
{"title":"Match Stereo Videos Via Bidirectional Alignment.","authors":"Junpeng Jing,Ye Mao,Anlan Qiu,Krystian Mikolajczyk","doi":"10.1109/tpami.2026.3679033","DOIUrl":"https://doi.org/10.1109/tpami.2026.3679033","url":null,"abstract":"Video stereo matching is the task of estimating consistent disparity maps from rectified stereo videos. There is considerable scope for improvement in both datasets and methods within this area. Recent learning-based methods often focus on optimizing performance for independent stereo pairs, leading to temporal inconsistencies in videos. Existing video methods typically employ sliding window operation over time dimension, which can result in low-frequency oscillations corresponding to the window size. To address these challenges, we propose a bidirectional alignment mechanism for adjacent frames as a fundamental operation. Building on this, we introduce a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods. Regarding datasets, current synthetic object-based and indoor datasets are commonly used for training and benchmarking, with a lack of outdoor nature scenarios. To bridge this gap, we present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation. Extensive experiments on in-domain, out-of-domain, and robustness evaluation demonstrate the contribution of our methods and datasets, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks. The project page, demos, code, and datasets are available at: https://tomtomtommi.github.io/BiDAVideo/.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"41 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147584077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual Geometry Margin Optimization for Coupled-Noisy Robust Ensemble Learning
Zheng Wang, Guanxiong He, Jie Wang, Runxin Zhang, Liaoyuan Tang, Rong Wang, Feiping Nie
Pub Date: 2026-03-31 | DOI: 10.1109/tpami.2026.3679394
Ensemble learning methods, such as Bagging and Boosting, are well-regarded for their ability to enhance model performance by combining diverse base learners. These approaches leverage the strengths of individual models to achieve more accurate and robust predictions. However, real-world datasets often contain noise, which can significantly impair model effectiveness. This paper focuses on two prevalent and challenging types of noise: feature noise, which causes fitting instability and poor generalization, and label noise, which introduces erroneous supervision and leads to overfitting. Recognizing the inherent properties of ensemble learning, particularly its focus on optimizing the decision margin to improve classification accuracy, we see an opportunity to bolster ensemble robustness. To address both feature and label noise, we propose a novel approach called Dual Geometry Margin Boosting (DGMB). This method employs two key strategies: the Decision Plane Margin (DPM), which enhances class separation, and the Hyper-Sphere Margin (HSM), which filters out potentially noisy samples during the learning process. Our experiments demonstrate DGMB's strong resistance to both feature and label noise. Through rigorous testing on various noise-contaminated datasets, we show that DGMB maintains strong performance and outperforms other robust ensemble methods.
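A hedged NumPy sketch of the hyper-sphere intuition: samples far from their class centroid are flagged as potentially noisy so a booster can down-weight them. The centroid-and-quantile radius rule is an illustrative stand-in; DGMB learns its margins jointly with the ensemble rather than using this fixed heuristic.

```python
import numpy as np

def hypersphere_noise_mask(X, y, quantile=0.9):
    """Return a boolean mask that is False for samples falling outside a
    per-class hyper-sphere (distance from centroid above the given quantile)."""
    keep = np.ones(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        center = X[idx].mean(axis=0)                  # class centroid
        dist = np.linalg.norm(X[idx] - center, axis=1)
        radius = np.quantile(dist, quantile)          # assumed radius rule
        keep[idx] = dist <= radius                    # inside sphere -> trusted
    return keep

# Example use in a boosting round: down-weight suspected-noisy samples.
# weights[~hypersphere_noise_mask(X, y)] *= 0.1
```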
{"title":"Dual Geometry Margin Optimization for Coupled-Noisy Robust Ensemble Learning.","authors":"Zheng Wang,Guanxiong He,Jie Wang,Runxin Zhang,Liaoyuan Tang,Rong Wang,Feiping Nie","doi":"10.1109/tpami.2026.3679394","DOIUrl":"https://doi.org/10.1109/tpami.2026.3679394","url":null,"abstract":"Ensemble learning methods, such as Bagging and Boosting, are well-regarded for their ability to enhance model performance by combining diverse base learners. These approaches leverage the strengths of individual models to achieve more accurate and robust predictions. However, real-world datasets often contain noise, which can significantly impair model effectiveness. This paper focuses on two prevalent and challenging types: feature noise, which can lead to fitting instability and poor generalization, and label noise, which can lead to erroneous supervision and model overfitting. Recognizing the inherent properties of ensemble learning, particularly its focus on optimizing the decision margin to improve classification accuracy, we see an opportunity to bolster ensemble model robustness. To address both feature and label noise, we propose a novel approach called Dual Geometry Margin Boosting (DGMB). This method employs two key strategies: the Decision Plane Margin (DPM), which enhances class separation, and the Hyper-Sphere Margin (HSM), which effectively filters out potentially noisy samples during the learning process. Our experiments demonstrate the impressive ability of DGMB to resist both feature and label noise. Through rigorous testing on various noise-contaminated datasets, we show that DGMB maintains strong performance and outperforms other robust Ensemble methods.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"17 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147584116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-View Clustering Via Bilaterally Constrained Anchor Graph
Qianyao Qiang, Bin Zhang, Yunjia Hua, Feiping Nie
Pub Date: 2026-03-31 | DOI: 10.1109/tpami.2026.3678628
The anchor similarity matrix, widely used for efficient clustering, exhibits an imbalance between its rows and columns: typically only the rows are constrained by probabilistic properties, unlike the regular similarity matrix, where both dimensions are regulated. This paper addresses the critical question of how to impose meaningful constraints on the columns to better capture the data structure. We propose a novel method, termed Multi-view Clustering via Bilaterally constrained anchor Graph (MCBG), which learns a fused anchor similarity matrix under bilateral constraints. To ensure consistency across views, we quantitatively assess each view's contribution and integrate the views into a unified model. By applying distinct constraints to rows and columns, MCBG promotes a balanced and expressive anchor similarity distribution and avoids degenerate solutions. Furthermore, a rank constraint on the Laplacian matrix of an anchor-pairwise graph is incorporated, yielding a one-step, post-processing-free multi-view clustering framework. An efficient alternating iterative optimization algorithm, tailored to the structure of the target problem, is developed. Extensive experiments validate the superiority of the proposed method.
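To make the row/column asymmetry concrete, here is a Sinkhorn-style sketch that alternately projects an n-by-m anchor similarity matrix onto row simplices and rescales its columns toward balanced mass. This is only an illustrative stand-in: MCBG's actual bilateral constraints are imposed inside a joint optimization with the Laplacian rank constraint, not by this normalization loop.

```python
import numpy as np

def bilateral_normalize(S, n_iters=50, eps=1e-12):
    """Alternately enforce row-stochastic rows (each sample's anchor
    similarities sum to 1) and balanced column mass (each of the m anchors
    receives n/m total weight), Sinkhorn style."""
    n, m = S.shape
    S = np.clip(S, eps, None).astype(float)
    for _ in range(n_iters):
        S = S / S.sum(axis=1, keepdims=True)              # rows: probability simplex
        S = S * (n / m) / S.sum(axis=0, keepdims=True)    # columns: balanced anchor load
    return S
```

The column step prevents the degenerate case where most samples pile their similarity mass onto a few anchors, which is the imbalance the abstract highlights.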
{"title":"Multi-View Clustering Via Bilaterally Constrained Anchor Graph.","authors":"Qianyao Qiang,Bin Zhang,Yunjia Hua,Feiping Nie","doi":"10.1109/tpami.2026.3678628","DOIUrl":"https://doi.org/10.1109/tpami.2026.3678628","url":null,"abstract":"The anchor similarity matrix, widely used for efficient clustering, exhibits an imbalance between its rows and columns - only the rows are typically constrained by probabilistic properties, unlike the regular similarity matrix where both dimensions are regulated. This paper addresses the critical question of how to impose meaningful constraints on the columns to better capture the data structure. We propose a novel method, termed Multi-view Clustering via Bilaterally constrained anchor Graph (MCBG), which learns a fused anchor similarity matrix with bilateral constraints. To ensure consistency across views, we quantitatively assess their contributions and integrate them into a unified model. By applying distinct constraints to rows and columns, MCBG promotes a balanced and expressive anchor similarity distribution, avoiding degenerate cases. Furthermore, a rank constraint on the Laplacian matrix of an anchor-pairwise graph is incorporated, ensuring a one-step post-processing-free multi-view clustering framework. An efficient alternating iterative optimization algorithm is developed, adapted to the natural properties of the target problem. Extensive experiments validate the superiority of the proposed method.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"7 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147584076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}