As IoT and machine learning have become part of everyday life, interaction has become an important factor in the development of English listening and speaking skills. The more learners interact with one another, the more actively they participate and the better their results. Vision is one of the main ways people perceive and understand the world: they use their eyes and brains to acquire, process and interpret visual information. Mobile English listening and speaking applications still suffer from a number of unresolved problems, such as non-optimal functions and poor operability. Building on smart sensing and communication, we use computer vision technology to design a set of computer vision modules for an interactive English self-study system, and we complete the user demand analysis, the overall architecture design and the functional applications. The system supports audio and video playback, and it adds images, mind maps, word art and other multimedia to increase learners' enthusiasm and interest and to give users an intuitive and vivid experience. Experiments show that interaction with the system is more convenient and pleasant for English learners. By adding new functions, the system addresses some deficiencies of the existing system, improves the user experience, and achieves an effective voice recognition accuracy of over 95 per cent. This study provides support for the application of both IoT networks and machine learning.