China, characterized by its expansive territory and diverse ecological conditions, hosts a rich forest flora with extensive botanical diversity. Accurate leaf recognition is a pivotal component of botanical research, requiring careful identification and classification of intricate leaf attributes such as shape, texture, and color. This study introduced a leaf classification and recognition methodology based on the Cross Vision Transformer (CrossViT). The research focused on the leaves of ten species: Fatsia japonica, Rhododendron simsii, Magnolia grandiflora, Cinnamomum cassia, Pittosporum tobira, Hibiscus syriacus, Photinia serratifolia, Firmiana simplex, Ginkgo biloba, and Camphora officinarum. Datasets were curated by capturing leaf images both under controlled experimental conditions and in diverse real-world environments, so that the CrossViT model was trained and validated on data covering both settings. Central to the methodology is the enhancement of the CrossViT model's architecture. Dual independent branches were incorporated to generate embedding vectors of different dimensions, capturing a wide range of leaf image features. The Transformer encoder was further optimized by integrating a cross-attention mechanism that fuses the embedding vectors across the different scales. This refinement aimed to strike a balance between computational efficiency and classification accuracy in high-precision leaf categorization tasks. The final classification was performed by a Multilayer Perceptron (MLP) Head. Evaluation across the two environmental settings yielded an overall accuracy of approximately 92.5% on the controlled experimental dataset and 75.2% on the real-world dataset. The comparative analysis with traditional convolutional neural networks (CNNs) highlighted notable performance advantages of the CrossViT-based approach. In the controlled experimental environment, performance improvements ranged from 0.6 to 4.0 percentage points, while in the real-world scenario, improvements ranged from 1.3 to 3.3 percentage points. Despite a modest increase in floating-point operations (FLOPs) and model parameters, the CrossViT model achieved these accuracy gains, underscoring its efficacy in leaf classification and recognition tasks. In conclusion, the proposed CrossViT-based methodology represents an efficient and effective approach to advancing tree research and ecological conservation. By leveraging advanced deep learning techniques, this study contributes to botany and environmental science, addressing critical challenges in biodiversity monitoring and sustainable natural resource management. The findings hold promise for enhancing the understanding and preservation of global forest ecosystems, emphasizing the importance of technological innovation in fostering environmental stewardship and conservation efforts worldwide.
tree species identification; cross vision transformer (CrossViT model); self-attention; visualization; plant phenotype analysis
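The cross-attention fusion between the two branches can be illustrated with a minimal sketch. It follows the general idea of the published CrossViT design, in which the class (CLS) token of one branch is projected into the other branch's dimension, attends to that branch's patch tokens, and is projected back; it is not the modified model of this study. The function name `cross_attention_cls`, the projection matrices `W_in` and `W_out`, and all dimensions are hypothetical, and layer normalization, multiple heads, and learned query/key/value projections are omitted for brevity.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_cls(cls_a, patches_b, W_in, W_out):
    """Fuse branch A's CLS token with branch B's patch tokens.

    cls_a     : CLS embedding of branch A, length d_a
    patches_b : patch embeddings of branch B, each of length d_b
    W_in      : d_b x d_a projection into branch B's dimension (hypothetical)
    W_out     : d_a x d_b projection back to branch A's dimension (hypothetical)
    """
    q = matvec(W_in, cls_a)            # CLS token expressed in branch B's space
    tokens = [q] + patches_b           # keys/values: projected CLS + B's patches
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
    weights = softmax(scores)          # attention of the CLS query over all tokens
    attended = [sum(w * t[j] for w, t in zip(weights, tokens))
                for j in range(len(q))]
    fused = [qi + ai for qi, ai in zip(q, attended)]   # residual connection
    return matvec(W_out, fused), weights


# Toy example: a small-dimension branch queries a large-dimension branch.
random.seed(0)
d_small, d_large, n_patches = 4, 6, 5
cls_small = [random.gauss(0, 1) for _ in range(d_small)]
patches_large = [[random.gauss(0, 1) for _ in range(d_large)]
                 for _ in range(n_patches)]
W_in = [[random.gauss(0, 0.5) for _ in range(d_small)] for _ in range(d_large)]
W_out = [[random.gauss(0, 0.5) for _ in range(d_large)] for _ in range(d_small)]

fused_cls, attn = cross_attention_cls(cls_small, patches_large, W_in, W_out)
```

Because only the single CLS token acts as the query, the cost of this fusion step grows linearly with the number of patch tokens rather than quadratically, which is one way such a design can trade a modest increase in computation for information exchange between scales.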