To address the problems of homogeneous content, insufficient personalization, and low interactivity in traditional English listening and speaking instruction, this study builds a multimodal learning model that combines speech recognition, natural language processing, computer vision, and big data techniques. The model fuses multimodal information, profiles learner characteristics accurately, delivers instruction adaptively, and supports diversified evaluation, thereby forming a closed-loop teaching system. Teaching experiments conducted with 200 college students show that the model significantly improves students’ listening and speaking proficiency, learning engagement, and satisfaction. This study offers a standardized and replicable scheme for the intelligent and personalized transformation of English listening and speaking classes.