Sounds are captured in recordings and shared on the Internet on a minute-by-minute basis. These recordings, which are predominantly videos, constitute the largest archive of sounds we've ever seen. However, most of this audio is unlabeled in terms of its content, requiring methods for automatic content analysis, which poses unique challenges at large scale. Therefore, we propose the Never-Ending Learner of Sounds (NELS), a framework for large-scale learning of sounds and their associated knowledge. Currently, NELS has a vocabulary of 605 sounds drawn from four different datasets and has processed over 4 million segments of YouTube videos. The never-ending approach is inspired by the Never Ending Language Learner (NELL) and the Never Ending Image Learner (NEIL).
The framework consists of three modules, Crawl, Hear & Learn, and Website, illustrated in the Figure. In its current form, the Crawl module fetches audio and its associated metadata from YouTube videos using search queries corresponding to the class labels ($<$sound event label$>$ sound) drawn from four datasets. In Hear & Learn, the datasets are used to train multi-class classifiers, which are then run on the crawled video segments; the classifiers' predictions and the metadata are used to index the segments. Lastly, the Website module allows users to search the indexed audio with a text query; the results are displayed and users can provide feedback to evaluate their quality. Additionally, a user can submit a YouTube video link or upload their own audio to have NELS predict the dominant sound.
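The Crawl and Hear & Learn steps above can be sketched as follows. This is a minimal illustrative sketch, not the NELS implementation: the query template follows the "$<$sound event label$>$ sound" pattern described in the text, while the function names, the toy classifier, and the dictionary-based index are assumptions made here for clarity.

```python
def build_query(label):
    """Form a YouTube search query of the shape "<sound event label> sound"."""
    return f"{label} sound"

def index_segments(segments, predict):
    """Index crawled video segments by the classifier's predicted label,
    keeping each segment's metadata for later text-based retrieval."""
    index = {}
    for seg in segments:
        label = predict(seg["audio"])          # run the trained classifier
        index.setdefault(label, []).append(seg["metadata"])
    return index

# Toy stand-ins for crawled segments and a trained multi-class classifier.
segments = [
    {"audio": "a", "metadata": {"video_id": "v1", "start": 0}},
    {"audio": "b", "metadata": {"video_id": "v2", "start": 10}},
]
predict = lambda audio: "dog bark" if audio == "a" else "siren"

queries = [build_query(lbl) for lbl in ["dog bark", "siren"]]
index = index_segments(segments, predict)
```

A text query on the Website side would then amount to a lookup into this label-to-metadata index.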