Speech technologies are heavily dependent on textual information for the training of acoustic and language models (orthographic annotations, phonetic dictionaries, text). Yet half of the world's languages do not have an orthography, and many others do not use one in a consistent fashion. This represents millions of potential users who cannot be served by existing speech technologies. As any human 4-year-old demonstrates, however, it is possible to learn a language before learning to read and write, from raw sensory signals and limited human supervision.
During the 2017 JSALT workshop at Carnegie Mellon University, the Speaking Rosetta team will construct a system that automatically discovers symbolic representations of speech sounds (i.e., a machine-internal ‘alphabet’ and dictionary) and puts them to use for end-to-end tasks (image retrieval, translation, speech synthesis). We will work from multi-modal datasets that can be easily collected in a language without orthography: spoken image descriptions, plus their translations in a resource-rich language. The main idea is to use cross-modal correspondences as a form of weak supervision to constrain the unsupervised discovery of linguistic units. We will evaluate the discovered units both quantitatively (performance on applications and psycholinguistic tasks) and qualitatively (language description).
This workshop will bring together experts from machine learning, speech and language technology, image processing, machine translation, field linguistics, and psycholinguistics. This work opens up a new way of collecting language resources for speech technologies without relying on orthographic or phonetic transcriptions, provides tools for language documentation, and will increase our knowledge of unsupervised and weakly supervised learning.
Figure 1. General description of the project.