Abstract
Developing intelligent visual systems for next-generation smart classrooms has become an active area of research in computer vision. Advances in computer vision and deep learning technologies have enabled the development of such systems, which can automatically classify students’ behavior and provide feedback to teachers. Recently, several vision-based methods have been proposed for this purpose. However, most works do not integrate multiple visual cues, such as facial expressions and body poses, which can effectively improve classification accuracy. Moreover, these methods cannot be extended to provide feedback on individual students’ behavior. This paper attempts to fill these research gaps by proposing a novel multiple-visual-cues-based automated system that monitors and reports both individual students’ and overall class behavior. First, the system detects and tracks each student in the input classroom video frames. Then, it extracts body pose, mobile proximity, and facial features using the OpenPose and Py-Feat frameworks and combines them into a single feature vector. This vector is fed into the trained behavior model, which classifies each student’s behavior as “positive” or “negative.” Subsequently, the individual labels are aggregated frame by frame to estimate the overall class behavior. The behavior model was developed by training a customized neural network architecture on our newly developed dataset, named the “Classroom Spontaneous Student Behavior Dataset.” The model trained on the concatenated features achieved 91.30% training accuracy and 90.80% validation accuracy, outperforming the models trained on individual features as well as other relevant methods. Additionally, we empirically analyzed the proposed system’s computational complexity and demonstrated its output on a sample classroom video.
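The pipeline described above (concatenating per-student cues, classifying each student, then aggregating labels into a class-level result) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions, the linear stand-in for the trained behavior model, and the majority-vote aggregation are all assumptions made for the sketch.

```python
import numpy as np

# Hypothetical feature dimensions; the actual sizes depend on the
# OpenPose and Py-Feat outputs used by the system.
POSE_DIM, PROXIMITY_DIM, FACE_DIM = 50, 1, 20

def concat_features(pose, proximity, face):
    """Combine the three per-student cues into a single feature vector."""
    return np.concatenate([pose, proximity, face])

def classify(features, w, b=0.0):
    """Toy linear scorer standing in for the trained neural behavior
    model: maps one student's feature vector to a binary label."""
    score = features @ w + b
    return "positive" if score >= 0 else "negative"

def class_behavior(labels):
    """Aggregate per-student labels (one frame) into an overall class
    label; majority vote is one simple aggregation choice."""
    pos = sum(label == "positive" for label in labels)
    return "positive" if pos >= len(labels) / 2 else "negative"

# Example with random stand-in data for five detected students.
rng = np.random.default_rng(0)
w = rng.normal(size=POSE_DIM + PROXIMITY_DIM + FACE_DIM)
students = [
    concat_features(rng.normal(size=POSE_DIM),
                    rng.normal(size=PROXIMITY_DIM),
                    rng.normal(size=FACE_DIM))
    for _ in range(5)
]
labels = [classify(f, w) for f in students]
print(labels, "->", class_behavior(labels))
```

In the actual system, `classify` would be the trained network and the per-frame labels would additionally be aggregated over time to produce the reported class-behavior estimate.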