Journal Information
Machine Intelligence Research

Editors-in-Chief: Tieniu Tan, Guo-Ping Liu, Huosheng Hu

Frequency: Bimonthly

ISSN: 2731-538X

E-mail: ijac@ia.ac.cn

Tel: 010-62655893

Postcode: 100190

Address: P.O. Box 2728, No. 95 Zhongguancun East Road, Haidian District, Beijing

Machine Intelligence Research. Indexed by: CSCD, CSTPCD, Peking University Core Journals (北大核心), EI
International Journal of Automation and Computing is a publication of the Institute of Automation, Chinese Academy of Sciences, and the Chinese Automation and Computing Society in the United Kingdom. The journal publishes papers on original theoretical and experimental research and development in automation and computing. The scope of the journal is extensive. Topics include: artificial intelligence, automatic control, bioinformatics, computer science, information technology, modeling and simulation, networks and communications, optimization and decision, pattern recognition, robotics, signal processing, and systems engineering.

    Editorial for Special Issue on Multi-modal Representation Learning

    Deng-Ping Fan, Nick Barnes, Ming-Ming Cheng, Luc Van Gool, et al.
    pp. 615-616

    Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-world Applications

    Wei Ji, Jingjing Li, Qi Bi, Tingwei Liu, et al.
    pp. 617-630
    Abstract: Recently, Meta AI Research introduced a general, promptable segment anything model (SAM) pre-trained on an unprecedentedly large segmentation dataset (SA-1B). Without a doubt, the emergence of SAM will yield significant benefits for a wide array of practical image segmentation applications. In this study, we conduct a series of intriguing investigations into the performance of SAM across various applications, particularly in the fields of natural images, agriculture, manufacturing, remote sensing and healthcare. We analyze and discuss the benefits and limitations of SAM, while also presenting an outlook on its future development in segmentation tasks. By doing so, we aim to give a comprehensive understanding of SAM's practical applications. This work is expected to provide insights that facilitate future research activities toward generic segmentation. Source code is publicly available at https://github.com/LiuTingWed/SAM-Not-Perfect.
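
    A minimal sketch of the promptable interface the abstract refers to, using the official segment-anything package; the checkpoint filename, placeholder image, and click coordinates are assumptions for illustration:

        import numpy as np
        from segment_anything import SamPredictor, sam_model_registry

        # Load a pre-trained SAM backbone (ViT-H checkpoint downloaded separately).
        sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
        predictor = SamPredictor(sam)

        image = np.zeros((375, 500, 3), dtype=np.uint8)  # stand-in for a real RGB image
        predictor.set_image(image)

        # Prompt SAM with a single foreground click; it returns candidate masks.
        masks, scores, _ = predictor.predict(
            point_coords=np.array([[250, 180]]),  # (x, y) pixel coordinates
            point_labels=np.array([1]),           # 1 = foreground point
            multimask_output=True,
        )
        best_mask = masks[scores.argmax()]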

    Rethinking Polyp Segmentation from an Out-of-distribution Perspective

    Ge-Peng Ji, Jing Zhang, Dylan Campbell, Huan Xiong, et al.
    pp. 631-639
    Abstract: Unlike existing fully-supervised approaches, we rethink colorectal polyp segmentation from an out-of-distribution perspective with a simple but effective self-supervised learning approach. We leverage the ability of masked autoencoders (self-supervised vision transformers trained on a reconstruction task) to learn in-distribution representations, here, the distribution of healthy colon images. We then perform out-of-distribution reconstruction and inference, with feature space standardisation to align the latent distribution of the diverse abnormal samples with the statistics of the healthy samples. We generate per-pixel anomaly scores for each image by calculating the difference between the input and reconstructed images and use this signal for out-of-distribution (i.e., polyp) segmentation. Experimental results on six benchmarks show that our model has excellent segmentation performance and generalises across datasets. Our code is publicly available at https://github.com/GewelsJI/Polyp-OOD.
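
    The anomaly-scoring step described above reduces to a reconstruction error; a minimal sketch, assuming image tensors in B x C x H x W layout and precomputed healthy-set feature statistics (all names here are hypothetical, not the authors' code):

        import torch

        def standardize_features(z, mu_healthy, sigma_healthy, eps=1e-6):
            # Align latent features of a test sample with healthy-set statistics.
            return (z - mu_healthy) / (sigma_healthy + eps)

        def anomaly_map(x, x_hat):
            # Per-pixel anomaly score: squared reconstruction error, averaged
            # over channels; high values flag out-of-distribution (polyp) pixels.
            return ((x - x_hat) ** 2).mean(dim=1)  # B x H x W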

    Rethinking Global Context in Crowd Counting

    Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, et al.
    pp. 640-651
    Abstract: This paper investigates the role of global context in crowd counting. Specifically, a pure transformer is used to extract features with global information from overlapping image patches. Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with the tokens corresponding to image patches throughout the transformer layers. Since transformers do not explicitly model channel-wise interactions, we propose a token-attention module (TAM) that recalibrates encoded features through channel-wise attention informed by the context token. Beyond that, the context token is used to predict the total person count of the image through a regression-token module (RTM). Extensive experiments on various datasets, including ShanghaiTech, UCF-QNRF, JHU-CROWD++ and NWPU, demonstrate that the proposed context extraction techniques can significantly improve performance over the baselines.
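
    A hypothetical sketch of such a token-attention module, with the context token gating channels in squeeze-and-excitation style; the class name and dimensions are assumptions, not the authors' implementation:

        import torch
        import torch.nn as nn

        class TokenAttentionSketch(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

            def forward(self, context_token, patch_tokens):
                # context_token: B x C, patch_tokens: B x N x C
                weights = self.gate(context_token).unsqueeze(1)  # B x 1 x C
                return patch_tokens * weights  # channel-wise recalibration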

    Towards Domain-agnostic Depth Completion

    Guangkai Xu, Wei Yin, Jianming Zhang, Oliver Wang, et al.
    pp. 652-669
    Abstract: Existing depth completion methods are often targeted at a specific sparse depth type and generalize poorly across task domains. We present a method to complete sparse/semi-dense, noisy, and potentially low-resolution depth maps obtained by various range sensors, including those in modern mobile phones, or by multi-view reconstruction algorithms. Our method leverages a data-driven prior in the form of a single-image depth prediction network trained on large-scale datasets, the output of which is used as an input to our model. We propose an effective training scheme in which we simulate various sparsity patterns from typical task domains. In addition, we design two new benchmarks to evaluate the generalizability and robustness of depth completion methods. Our simple method shows superior cross-domain generalization ability over state-of-the-art depth completion methods, introducing a practical solution to high-quality depth capture on a mobile device.
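
    One way to read "simulating sparsity patterns" is random subsampling plus sensor noise; a sketch under that assumption (the exact patterns and noise model used in the paper may differ):

        import numpy as np

        def simulate_sparse_depth(dense_depth, keep_ratio=0.01, noise_std=0.05, seed=0):
            # Keep a random fraction of pixels and perturb them, mimicking a
            # sparse, noisy range sensor; zeros mark missing measurements.
            rng = np.random.default_rng(seed)
            mask = rng.random(dense_depth.shape) < keep_ratio
            noisy = dense_depth + rng.normal(0.0, noise_std, dense_depth.shape)
            return np.where(mask, noisy, 0.0), mask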

    Vision Transformers with Hierarchical Attention

    Yun Liu, Yu-Huan Wu, Guolei Sun, Le Zhang, et al.
    pp. 670-683
    Abstract: This paper tackles the high computational/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
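
    The two-stage computation can be sketched as windowed self-attention followed by attention over pooled window tokens; this is illustrative only, with dimensions and mean-pooling as assumptions rather than the HAT-Net code:

        import torch
        import torch.nn as nn

        B, C, grid, window = 2, 64, 8, 4           # 8x8 token grid, 4x4 local windows
        attn = nn.MultiheadAttention(C, num_heads=4, batch_first=True)
        sa = lambda t: attn(t, t, t)[0]            # plain self-attention helper

        x = torch.randn(B, grid * grid, C)         # patch tokens
        g = grid // window                         # windows per side

        # Step 1: local attention inside each window (local relationship modeling).
        xw = x.view(B, g, window, g, window, C).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(B * g * g, window * window, C)
        local = sa(xw)

        # Step 2: merge each window into one token, then attend globally over
        # the much smaller merged sequence (global dependency modeling).
        merged = local.mean(dim=1).view(B, g * g, C)
        global_feats = sa(merged)                  # B x (g*g) x C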

    A Novel Divide and Conquer Solution for Long-term Video Salient Object Detection

    Yun-Xiao Li, Cheng-Li-Zhao Chen, Shuai Li, Ai-Min Hao, et al.
    pp. 684-703
    Abstract: Recently, a new research trend in the video salient object detection (VSOD) community has focused on enhancing detection results via model self-fine-tuning using sparsely mined high-quality keyframes from the given sequence. Although such a learning scheme is generally effective, it has a critical limitation: the model learned on sparse frames possesses only weak generalization ability. This situation could become worse on "long" videos, since they tend to have intensive scene variations. Moreover, in such videos, keyframe information from a longer time span is less relevant to the previous frames, which could also cause learning conflicts and deteriorate model performance. Thus, the learning scheme is usually incapable of handling complex pattern modeling. To solve this problem, we propose a divide-and-conquer framework, which converts a complex problem domain into multiple simple ones. First, we devise a novel background consistency analysis (BCA) that effectively divides the mined frames into disjoint groups. Then, for each group, we assign an individual deep model to capture its key attribute during the fine-tuning phase. During the testing phase, we design a model-matching strategy that dynamically selects the best-matched model from the fine-tuned ones to handle the given testing frame. Comprehensive experiments show that our method can adapt to severe background appearance variation coupled with object movement and obtain robust saliency detection compared with the previous scheme and state-of-the-art methods.
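
    The model-matching step can be illustrated with a simple nearest-prototype rule; the histogram similarity below is an assumed stand-in for illustration, and the paper's background consistency analysis is more involved:

        import numpy as np

        def select_model(frame_gray, group_prototypes):
            # frame_gray: H x W uint8 test frame; group_prototypes: per-group
            # background histograms collected during fine-tuning (hypothetical inputs).
            hist = np.histogram(frame_gray, bins=32, range=(0, 256), density=True)[0]
            distances = [np.abs(hist - proto).sum() for proto in group_prototypes]
            return int(np.argmin(distances))  # index of the best-matched model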

    TextFormer: A Query-based End-to-end Text Spotter with Mixed Supervision

    Yukun Zhai, Xiaoqiang Zhang, Xiameng Qin, Sanyuan Zhao, et al.
    pp. 704-717
    Abstract: End-to-end text spotting is a vital computer vision task that aims to integrate scene text detection and recognition into a unified framework. Typical methods rely heavily on region-of-interest (RoI) operations to extract local features and on complex post-processing steps to produce final predictions. To address these limitations, we propose TextFormer, a query-based end-to-end text spotter with a transformer architecture. Specifically, using one query embedding per text instance, TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. This allows mutual training and optimization of the classification, segmentation and recognition branches, resulting in deeper feature sharing without sacrificing flexibility or simplicity. Additionally, we design an adaptive global aggregation (AGG) module to transfer global features into sequential features for reading arbitrarily-shaped texts, which overcomes the sub-optimization problem of RoI operations. Furthermore, potential corpus information is exploited, from weak annotations to full labels, through mixed supervision, further improving text detection and end-to-end text spotting results. Extensive experiments on various bilingual (i.e., English and Chinese) benchmarks demonstrate the superiority of our method. In particular, on the TDA-ReCTS dataset, TextFormer surpasses the state-of-the-art method in terms of 1-NED by 13.2%.
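
    Query-based decoding of the kind described, one learnable embedding per potential text instance attending to encoder features, can be sketched DETR-style; all dimensions below are assumptions, not the actual TextFormer configuration:

        import torch
        import torch.nn as nn

        num_queries, dim = 100, 256                     # one query per potential text instance
        queries = nn.Parameter(torch.zeros(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        decoder = nn.TransformerDecoder(layer, num_layers=6)

        memory = torch.randn(2, 400, dim)               # flattened image-encoder features (B, H*W, C)
        out = decoder(queries.unsqueeze(0).expand(2, -1, -1), memory)
        # out: B x num_queries x dim; each query embedding would feed shared
        # classification, segmentation and recognition heads.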

    Interpretability of Neural Networks Based on Game-theoretic Interactions

    Huilin Zhou, Jie Ren, Huiqi Deng, Xu Cheng, et al.
    pp. 718-739
    Abstract: This paper introduces the system of game-theoretic interactions, which connects the explanation of knowledge encoded in a deep neural network (DNN) with the explanation of the representation power of a DNN. In this system, we define two game-theoretic interaction indexes, namely the multi-order interaction and the multivariate interaction. More crucially, we use these interaction indexes to explain the feature representations encoded in a DNN from the following four aspects: 1) quantifying knowledge concepts encoded by a DNN; 2) exploring how a DNN encodes visual concepts, and extracting prototypical concepts encoded in the DNN; 3) learning optimal baseline values for the Shapley value, and providing a unified perspective from which to compare fourteen different attribution methods; 4) theoretically explaining the representation bottleneck of DNNs. Furthermore, we prove the relationship between the interactions encoded in a DNN and the representation power of a DNN (e.g., generalization power, adversarial transferability, and adversarial robustness). In this way, game-theoretic interactions successfully bridge the gap between "the explanation of knowledge concepts encoded in a DNN" and "the explanation of the representation capacity of a DNN" as a unified explanation.
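
    For reference, the multi-order interaction mentioned above is commonly defined as follows in the authors' related publications (f denotes the network output on a subset of input variables, N the full set of input variables; taken as an assumption of the notation here):

        \Delta f(i, j, S) = f(S \cup \{i, j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S)

        I^{(m)}(i, j) = \mathbb{E}_{S \subseteq N \setminus \{i, j\},\, |S| = m} \big[ \Delta f(i, j, S) \big]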

    Toward Human-centered XAI in Practice: A Survey

    Xiangwei Kong, Shujie Liu, Luhao Zhu
    pp. 740-770
    Abstract: Human adoption of artificial intelligence (AI) techniques is largely hampered by the increasing complexity and opacity of AI development. Explainable AI (XAI) techniques, with various methods and tools, have been developed to bridge the gap between high-performance black-box AI models and human understanding. However, the current adoption of XAI techniques still lacks "human-centered" guidance for designing solutions that meet different stakeholders' needs in XAI practice. We first summarize a human-centered demand framework that categorizes stakeholders into five key roles with specific demands by reviewing existing research, and then extract six commonly used human-centered XAI evaluation measures that are helpful for validating the effect of XAI. In addition, a taxonomy of XAI methods is developed for visual computing, with an analysis of method properties. Holding clearer human demands and XAI methods in mind, we take a medical image diagnosis scenario as an example to present an overview of how extant XAI approaches for visual computing fulfil stakeholders' human-centered demands in practice, and we check the availability of open-source XAI tools for stakeholders' use. This survey provides guidance for matching diverse human demands with appropriate XAI methods or tools in specific applications, with a summary of the main challenges and future work toward human-centered XAI in practice.