Situational teaching in preschool often struggles to match scenarios to children’s physical and cognitive development. This study optimizes situational design by focusing on children’s gestures and vocal emotions. To improve recognition in complex scenes, we develop a hand-movement recognition model based on YOLOv7 and integrate the CBAM attention module into it. To reduce noise in children’s speech, we construct a two-stage speech enhancement model that combines self-attention and spatial attention. Speech emotion recognition is then performed using Mel-frequency cepstral coefficients (MFCCs) and a residual neural network. After the model was applied in situational teaching, the children in Class E1 showed significantly better self-control (P<0.05), which supports the effectiveness of the method.