DCT-YOLOv5: Designing Object Detection Algorithms from a Frequency Perspective
The discrete cosine transform (DCT) is one of the core steps of the JPEG compression algorithm; it converts pixel data in the spatial domain of an image into coefficients in the frequency domain. Algorithms that combine the DCT with deep learning are common, but they do not analyze convolutional structures from a frequency perspective. To address this problem and further improve object detection performance, we propose an improved algorithm, DCT-YOLOv5. First, it is shown that convolutional neural networks (CNNs), Transformers, and MLP architectures all implicitly model the frequency domain, which validates established model design principles: the effective receptive field is always smaller than the theoretical receptive field, and multiple small convolutional kernels are preferred to a single large convolutional kernel. Second, the number of output channels is chosen with regard to the input channels and the convolution kernel so that the transformation is approximately lossless, and the number of channels changes only at the down-sampling stage. Finally, a comparison of the DCT with a convolution that has fixed parameters shows that the difference between the two stays within ±0.8%. To minimize computation, grouped convolution with a fixed number of in-groups is introduced. The model is benchmarked against YOLOv5, and extensive experiments on the COCO2017 dataset validate the effectiveness of the proposed method. The results show a detection speed of 277.8 FPS and a mAP@.5 of 28.9%, a relative improvement of 1.3% over the benchmark model. The test results indicate that the enhanced model has significantly improved accuracy and can operate on lower computing platforms.
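
The relationship between the DCT and a convolution with fixed parameters can be illustrated with a minimal, dependency-free sketch. It is not the DCT-YOLOv5 implementation: the helper names (`dct_matrix`, `dct_kernels`, `strided_conv`, `block_dct`), the block size `n = 4`, and the toy image are assumptions made for illustration. Under these assumptions, a block-wise orthonormal DCT-II equals a stride-`n` convolution with `n * n` fixed kernels, so one input channel maps to `n * n` output channels at `1/n` of the resolution without loss of signal energy. This is one possible reading of how the output channel count can be tied to the input channels and the kernel size.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix: D[u][x] = a(u) * cos(pi * (2x + 1) * u / (2n))."""
    rows = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                     for x in range(n)])
    return rows

def dct_kernels(n):
    """n*n fixed kernels; kernel index u*n + v is the outer product of DCT rows u and v."""
    d = dct_matrix(n)
    return [[[d[u][y] * d[v][x] for x in range(n)] for y in range(n)]
            for u in range(n) for v in range(n)]

def strided_conv(image, kernels, n):
    """Cross-correlate a single-channel image with each kernel at stride n, no padding.
    Returns out[k][i][j], the response of kernel k on block (i, j)."""
    h, w = len(image), len(image[0])
    return [[[sum(kern[y][x] * image[i * n + y][j * n + x]
                  for y in range(n) for x in range(n))
              for j in range(w // n)]
             for i in range(h // n)]
            for kern in kernels]

def block_dct(block, n):
    """Reference 2D DCT of one n x n block, computed as D @ block @ D^T."""
    d = dct_matrix(n)
    tmp = [[sum(d[u][y] * block[y][x] for y in range(n)) for x in range(n)]
           for u in range(n)]
    return [[sum(tmp[u][x] * d[v][x] for x in range(n)) for v in range(n)]
            for u in range(n)]

n = 4  # assumed block size for the toy example (JPEG uses 8)
image = [[float((3 * r + 5 * c) % 7) for c in range(8)] for r in range(8)]

# One input channel becomes n*n coefficient channels at 1/n resolution.
coeffs = strided_conv(image, dct_kernels(n), n)

# Compare the top-left block against the matrix-form DCT.
ref = block_dct([row[:n] for row in image[:n]], n)
max_err = max(abs(coeffs[u * n + v][0][0] - ref[u][v])
              for u in range(n) for v in range(n))

# An orthonormal transform preserves signal energy (Parseval).
energy_in = sum(p * p for row in image for p in row)
energy_out = sum(c * c for plane in coeffs for row in plane for c in row)
```

In this sketch `max_err` and `energy_in - energy_out` differ from zero only by floating-point rounding, which is the sense in which the fixed-kernel convolution is a lossless change of representation. In a deep learning framework the same kernels would typically be loaded as frozen weights of a strided convolution layer; applying them to each input channel separately corresponds to a grouped convolution.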