FFConvNeXt3D: A Large Convolutional Kernel Network for Extracting Features of Medium- and Large-Sized Targets
Large convolutional kernel models have proven effective in the image domain, but existing 3D large convolutional kernel models still underperform in the video domain. In addition, previous work on spatio-temporal action detection used backbone networks that extract features only for generic targets, overlooking the fact that humans are the subject of the task. To address these issues, a 3D large convolutional kernel neural network with a feature fusion structure (FFConvNeXt3D) is proposed. First, the mature ConvNeXt network is extended to the video domain as a ConvNeXt3D network, and its pre-trained weights are processed to fit the expanded network. Second, the effect of the size and position of the temporal dimension of the convolutional kernel on model performance is investigated. Finally, a feature fusion structure is proposed that focuses on improving the backbone network's ability to extract features from medium-sized and larger targets such as humans. Ablation and comparison experiments were conducted on the UCF101-24 dataset. The experimental results verify the effectiveness of the feature fusion structure, and the model performs better than the compared methods.
large convolution kernel; object detection; spatio-temporal action detection; action recognition; feature fusion
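
The abstract says that pre-trained weights are processed for the expanded ConvNeXt3D network but does not say how. As an illustration only, the sketch below shows one common way to carry 2D weights into a 3D kernel: repeat the 2D kernel along a new temporal axis and rescale it, or place it in the middle temporal slice. This is an assumption about the general technique, not the authors' stated procedure, and the names `inflate_kernel` and `response` are hypothetical.

```python
def inflate_kernel(kernel_2d, t, mode="average"):
    """Inflate a kH x kW kernel (nested lists) into a t x kH x kW kernel.

    Hypothetical helper: the paper's actual weight processing is not given
    in the abstract, so both modes here are illustrative assumptions.
    """
    if t < 1:
        raise ValueError("temporal size t must be at least 1")
    if mode == "average":
        # Copy the 2D weights into every temporal slice and divide by t,
        # so a clip of identical frames yields the original 2D response.
        return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]
    if mode == "center":
        # Keep the 2D weights in the middle slice and zero the others.
        mid = t // 2
        return [
            [[w if i == mid else 0.0 for w in row] for row in kernel_2d]
            for i in range(t)
        ]
    raise ValueError("mode must be 'average' or 'center'")


def response(kernel_3d, clip):
    """Sum of element-wise products of a t x kH x kW kernel and clip patch."""
    return sum(
        w * x
        for k_slice, frame in zip(kernel_3d, clip)
        for k_row, f_row in zip(k_slice, frame)
        for w, x in zip(k_row, f_row)
    )


# Toy 2x2 kernel and a static "clip" made of three identical frames.
kernel_2d = [[1.0, 2.0], [3.0, 4.0]]
frame = [[0.5, 1.0], [1.5, 2.0]]
clip = [frame, frame, frame]

response_2d = response([kernel_2d], [frame])
response_avg = response(inflate_kernel(kernel_2d, 3, "average"), clip)
response_center = response(inflate_kernel(kernel_2d, 3, "center"), clip)
```

With either mode, the inflated kernel gives the same response on a static clip as the 2D kernel gives on a single frame, which is why initializations of this kind let a 3D network start from the behavior of its 2D counterpart.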