A Sensorimotor Vision Transformer

Abstract

This paper presents the Sensorimotor Transformer (SMT), a vision model inspired by human saccadic eye movements, which prioritize high-saliency regions of the visual input. By adopting the same strategy, SMT improves computational efficiency and reduces memory consumption. Unlike traditional models that process all image patches uniformly, SMT identifies and selects the most salient patches based on intrinsic two-dimensional (i2D) features, such as corners and occlusions, which are known to carry high information content and to align with human fixation patterns. The SMT architecture applies this biological principle by passing only the most informative patches to a vision transformer, which substantially reduces memory usage because memory scales with the sequence length of the selected patches. This approach is consistent with findings in visual neuroscience suggesting that the human visual system optimizes information gathering through selective, spatially dynamic focus. Experimental evaluations on ImageNet-1k show that SMT achieves competitive top-1 accuracy while significantly reducing memory consumption and computational complexity, particularly when only a limited number of patches is used. This work introduces a saccade-like selection mechanism into transformer-based vision models, offering an efficient alternative for image analysis and providing new insights into biologically motivated architectures for resource-constrained applications.
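
To make the selection step concrete, the following is a minimal sketch of saliency-based patch selection, not the authors' implementation. It assumes the determinant of a smoothed structure tensor as a stand-in i2D measure (the abstract does not state which operator SMT uses), and the names `box_blur`, `i2d_saliency`, and `select_patches`, as well as the patch size and `k`, are hypothetical. The measure is zero on flat regions and straight edges and positive at corners, so only patches containing two-dimensional structure receive a high score.

```python
import numpy as np

def box_blur(a, r=1):
    """Mean filter over a (2r+1) x (2r+1) window, with edge padding."""
    k = 2 * r + 1
    p = np.pad(a, r, mode="edge")
    out = np.zeros(a.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def i2d_saliency(img):
    """ASSUMED i2D proxy: determinant of the locally averaged structure tensor."""
    gy, gx = np.gradient(img.astype(np.float64))
    jxx = box_blur(gx * gx)
    jyy = box_blur(gy * gy)
    jxy = box_blur(gx * gy)
    return jxx * jyy - jxy ** 2

def select_patches(img, patch=16, k=8):
    """Return the k most salient flattened patches and their grid indices."""
    h, w = img.shape
    gh, gw = h // patch, w // patch
    crop = img[:gh * patch, :gw * patch]
    sal = i2d_saliency(img)[:gh * patch, :gw * patch]
    # Sum the saliency inside each patch to get one score per patch.
    scores = sal.reshape(gh, patch, gw, patch).sum(axis=(1, 3)).ravel()
    top = np.argsort(scores)[::-1][:k]
    # Flatten every patch to a vector, in row-major grid order.
    patches = (crop.reshape(gh, patch, gw, patch)
                   .transpose(0, 2, 1, 3)
                   .reshape(gh * gw, patch * patch))
    return patches[top], top

# Toy input: a bright square on a dark background (64 x 64, 16 patches).
img = np.zeros((64, 64))
img[20:44, 20:44] = 1.0
tokens, idx = select_patches(img, patch=16, k=4)
print(tokens.shape, sorted(idx.tolist()))
```

In a full model, the selected patch vectors would be embedded together with their positions and passed to a transformer encoder. Because the cost of standard self-attention grows with the number of tokens, processing `k` selected patches instead of the full grid is where the memory saving described above would come from.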

A Sensorimotor Vision Transformer - arXiv