Diabetic Retinopathy Lesion Segmentation Based on Hierarchical Feature Progressive Fusion in Retinal Fundus Images
Objective Diabetic retinopathy (DR) is one of the most common complications of diabetes and one of the main causes of irreversible vision impairment or permanent blindness among the working-age population. Early detection has been shown to slow the disease's progression and prevent vision loss. Fundus photography is a widely used modality for DR-related lesion identification and large-scale screening owing to its non-invasive and cost-effective characteristics. Ophthalmologists typically observe fundus lesions, including microaneurysms (MAs), hemorrhages (HEs), hard exudates (EXs), and soft exudates (SEs), in images to perform manual DR diagnosis and grading for all suspected patients. However, expert identification of these lesions is cumbersome, time-consuming, and easily affected by individual expertise and clinical experience. With the increasing prevalence of DR, automated segmentation methods are urgently required to identify multiclass fundus lesions. Recently, deep-learning technology, represented by convolutional neural networks (CNNs) and Transformers, has progressed significantly in the domain of medical-image analysis and has become the mainstream technology for DR-related lesion segmentation. The most commonly used methods are semantic segmentation-oriented CNNs, Transformers, or their combinations. These deep-learning methods exhibit promising results in terms of both accuracy and efficiency. Nevertheless, CNN-based methods capture global contextual information poorly owing to their intrinsically limited receptive field, whereas Transformer-based approaches exhibit weak local inductive biases and subpar perception of multiscale feature dependencies. Although models combining CNNs with Transformers exhibit clear advantages, they extract deep semantic features and directly concatenate features from the same feature level without fully considering the importance of concrete boundary information for small-lesion segmentation, resulting in inadequate feature interaction between adjacent layers and conflicts among different feature scales. Moreover, these methods typically focus on a single type of DR lesion and seldom delineate multiple lesion types simultaneously, thereby hampering their practical clinical application.

Methods In this study, we developed a novel progressive multifeature fusion network based on an encoder-decoder U-shaped structure, named PMFF-Net, to achieve accurate multiclass DR-related fundus lesion segmentation. The overall framework of the proposed PMFF-Net is shown in Fig. 1. It primarily comprises an encoder module embedding a hybrid Transformer (HT) module, a gradual characteristic fusion (GCF) module, a selective edge aggregation (SEA) module, a dynamic attention (DA) module, and a decoder module. In the encoder module, we sequentially cascaded four HT blocks to form four stages that excavate multiscale long-range features and local spatial information. Given a fundus image I ∈ R^(H×W×C) (with height H, width W, and C channels) as the input, we first applied a convolutional stem comprising a convolutional layer and a max-pooling layer for patch partitioning, which yielded N patches X. The resulting patches X were embedded into image tokens E using a trainable linear projection, and we denoted the output of the convolutional stem as F_0 = E. Subsequently, the embedded tokens E were fed into the four encoder stages to generate hierarchical feature maps F_i ∈ R^(H/2^(i+1) × W/2^(i+1) × C_i) (i = 1, 2, 3, 4). The designed GCF module gradually aggregates adjacent features of various scales under the guidance of high-level semantic cues to generate an enhanced feature representation F_GCF_i (i = 2, 3, 4) in each layer except the first, narrowing the semantic gaps between different feature levels. Subsequently, the presented DA module dynamically selects useful features and refines the merged characteristics to obtain consistent multiscale features A_i (i = 2, 3, 4) using a dynamic learning algorithm. Meanwhile, the developed SEA module incorporates the low-level boundary feature F_1 and the high-level semantic features A_3 and A_4 to dynamically establish the association between lesion areas and edges, refine lesion boundary features, and recalibrate lesion locations. In the decoder module, we introduced a successive patch-expanding layer between adjacent resolution blocks to double the size of the feature map and halve the number of channels. Within each convolution block, a convolution layer was embedded to learn informative features. Finally, we applied a prediction head to obtain the lesion-segmentation probability map Y ∈ R^(H×W×K), where K indicates the number of categories, corresponding to K-1 lesion maps and one background map.
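To make the tokenization step concrete, the following is a minimal PyTorch sketch of a convolutional stem of the kind described above: one convolutional layer followed by max pooling for patch partitioning, with a trainable linear projection into token embeddings. The kernel sizes, strides, and channel widths are illustrative assumptions rather than the configuration used in PMFF-Net.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Convolutional stem: conv + max pooling for patch partitioning,
    followed by a trainable linear projection into token embeddings E.
    All layer sizes here are assumptions for illustration."""
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(in_chans, embed_dim, kernel_size=3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # overall downsampling: 4x
        self.proj = nn.Linear(embed_dim, embed_dim)        # linear projection to tokens

    def forward(self, x):                        # x: (B, C, H, W)
        f = self.pool(self.conv(x))              # (B, D, H/4, W/4)
        tokens = f.flatten(2).transpose(1, 2)    # (B, N, D), N = (H/4)*(W/4) patches
        return self.proj(tokens)                 # token embeddings E = F_0

stem = ConvStem()
E = stem(torch.randn(1, 3, 256, 256))
print(E.shape)  # torch.Size([1, 4096, 64])
```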
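The HT block itself pairs convolutional and Transformer pathways (the full design, detailed in the paper, also employs multiscale channel attention). Below is a minimal sketch of such a hybrid block, assuming a depthwise-convolution local branch, a multi-head self-attention global branch, and a single-scale SE-style channel gate standing in for the paper's multiscale attention; it approximates the idea rather than reproducing the exact HT module.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Sketch of a hybrid CNN-Transformer block: a convolutional branch for
    local spatial detail, a self-attention branch for long-range context,
    and a channel-attention gate to fuse them. Illustrative approximation,
    not the paper's exact HT module."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise conv
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(                          # SE-style channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // 4, 1), nn.GELU(),
            nn.Conv2d(dim // 4, dim, 1), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        local = self.local(x)
        t = self.norm(x.flatten(2).transpose(1, 2))  # (B, HW, C) tokens
        g, _ = self.attn(t, t, t)                    # global self-attention
        glob = g.transpose(1, 2).reshape(B, C, H, W)
        fused = local + glob
        return fused * self.gate(fused)              # channel re-weighting

block = HybridBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```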
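Likewise, a rough sketch of semantics-guided fusion of two adjacent encoder levels, in the spirit of the GCF and DA modules, is shown next. The sigmoid gating and the dynamic re-weighting forms are assumptions for illustration; the paper's exact formulations differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseAdjacent(nn.Module):
    """Sketch of semantics-guided fusion of adjacent encoder levels: the
    higher-level (lower-resolution) feature is upsampled and used to gate
    the lower-level feature before concatenation, followed by a DA-style
    dynamic re-weighting of the fused result (assumed forms)."""
    def __init__(self, c_low, c_high, c_out):
        super().__init__()
        self.guide = nn.Sequential(nn.Conv2d(c_high, c_low, 1), nn.Sigmoid())
        self.fuse = nn.Sequential(
            nn.Conv2d(c_low + c_high, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )
        self.select = nn.Sequential(  # dynamic channel-wise feature selection
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_out, c_out, 1), nn.Sigmoid()
        )

    def forward(self, f_low, f_high):
        up = F.interpolate(f_high, size=f_low.shape[-2:], mode="bilinear",
                           align_corners=False)
        f_low = f_low * self.guide(up)         # high-level semantic guidance
        fused = self.fuse(torch.cat([f_low, up], dim=1))
        return fused * self.select(fused)      # suppress irrelevant responses

gcf = FuseAdjacent(c_low=64, c_high=128, c_out=64)
out = gcf(torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```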
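Finally, the patch-expanding step (doubling spatial resolution while halving channels) admits several standard realizations. The sketch below uses a 1×1 convolution followed by PixelShuffle, one plausible choice rather than the paper's actual layer, and attaches a hypothetical K-channel prediction head with K = 5 (four lesion classes plus background).

```python
import torch
import torch.nn as nn

class PatchExpand(nn.Module):
    """Sketch of a patch-expanding layer: doubles the spatial size of the
    feature map and halves its channels, here via PixelShuffle (one of
    several standard realizations; the paper's layer may differ)."""
    def __init__(self, dim):
        super().__init__()
        self.expand = nn.Conv2d(dim, 2 * dim, 1)  # 2*dim = (dim/2) * 2^2 channels
        self.shuffle = nn.PixelShuffle(2)         # (B, 2C, H, W) -> (B, C/2, 2H, 2W)

    def forward(self, x):
        return self.shuffle(self.expand(x))

head = nn.Conv2d(32, 5, kernel_size=1)    # K = 5: four lesion maps + background
x = torch.randn(1, 64, 16, 16)
y = head(PatchExpand(64)(x))              # per-pixel class logits
print(y.shape)  # torch.Size([1, 5, 32, 32])
```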
Results and Discussions We used two publicly available DR datasets, IDRiD and DDR, to verify the proposed PMFF-Net. The comparison results (Tables 1 and 2) show that PMFF-Net outperforms current state-of-the-art DR lesion-segmentation models on both datasets, with mDice and mIoU values of 45.11% and 33.39%, respectively, for predicting EX, HE, MA, and SE simultaneously on the IDRiD dataset, and mDice and mIoU values of 36.64% and 35.04%, respectively, on the DDR dataset. Specifically, on the IDRiD dataset, our model achieves mDice and mIoU values that are 3.94 and 3.28 percentage points higher, respectively, than those of H2Former, and 4.55 and 4.69 percentage points higher, respectively, than those of PMCNet. On the DDR dataset, our model achieves the best segmentation results, outperforming H2Former by 5.17 and 6.15 percentage points in terms of mDice and mIoU, respectively, and surpassing PMCNet by 6.36 and 7.43 percentage points, respectively. Meanwhile, our model can provide real-time DR-lesion analysis, with inference times of approximately 34.74 and 38.48 ms per image on the IDRiD and DDR datasets, respectively. The visualized comparison results in Figs. 6 and 7 indicate that the predictions of our model are closer to the ground truth than those of other advanced methods. The cross-dataset validation results in Tables 3 and 4 show that our model offers better generalizability than other advanced segmentation methods. The superior segmentation performance of the developed PMFF-Net may be attributed to the HT module capturing global contextual information and local spatial details, the GCF module gradually aggregating multiscale features from different levels under the guidance of high-level semantic information, the DA module eliminating irrelevant noise and enhancing the identification of discriminative DR-lesion features, and the SEA module establishing a constraint between DR-lesion regions and boundaries. Additionally, the effectiveness of each component of the proposed PMFF-Net, including the HT, GCF, DA, and SEA modules, was verified on the IDRiD dataset.
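For reference, the mDice and mIoU figures above follow the standard per-class Dice and IoU definitions averaged over the lesion classes. The sketch below shows one plausible reading of these metrics from integer label maps; averaging conventions (e.g., whether background is included) can differ between papers, so this is an assumption rather than the paper's exact evaluation code.

```python
import numpy as np

def dice_iou(pred, gt, num_classes):
    """Per-class Dice and IoU from integer label maps; mDice/mIoU are the
    means over the lesion classes (background class 0 excluded here)."""
    dices, ious = [], []
    for k in range(1, num_classes):
        p, g = pred == k, gt == k
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        dices.append(2 * inter / (p.sum() + g.sum() + 1e-8))
        ious.append(inter / (union + 1e-8))
    return float(np.mean(dices)), float(np.mean(ious))

pred = np.random.randint(0, 5, (256, 256))  # toy prediction, K = 5 classes
gt = np.random.randint(0, 5, (256, 256))    # toy ground truth
print(dice_iou(pred, gt, num_classes=5))
```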
Conclusions In this study, we developed a novel PMFF-Net for the simultaneous segmentation of four types of DR lesions in retinal fundus images. In the PMFF-Net, we constructed an HT module by integrating a CNN, multiscale channel attention, and a Transformer to model the long-range global dependencies of lesions and their local spatial features. The GCF module was designed to progressively merge features from adjacent encoder layers under the guidance of high-level semantic cues. We utilized a DA module to suppress irrelevant noisy interference and dynamically refine the fused multiscale features from the GCF module. Furthermore, we incorporated an SEA module to emphasize lesion boundary contours and recalibrate lesion locations. Extensive experimental results on the IDRiD and DDR datasets show that our PMFF-Net performs better than other competitive segmentation methods. Cross-dataset validation further demonstrates the excellent generalizability of our model. Finally, we demonstrated the effectiveness and necessity of each proposed component via a comprehensive ablation analysis. The developed method can serve as a general segmentation framework and can be applied to segment other types of biomedical images.