Data Collection Types and Locations
Appen has performed speech and language data collections in more than 80 languages across 40+ countries around the world, including North and South-East Asia, North Africa, the Middle East, Europe, Scandinavia, and North and South America.
Appen has experience collecting in a variety of modes. These include:
- telephony - fixed-line, mobile, in-car
- microphone recorded - for embedded device applications
- broadcast - for acoustic search applications
- desktop
- web interface
- field microphone
- tablet-style PCs
Appen data collections have been based in a wide range of locations:
- in-car (microphone and telephony), including collections that draw on our experience with the Lombard effect
- recording studio
- office environment
- street and public place recordings
The range of speech and language types collected includes:
- scripted speech
- elicited speech
- free speech
- two-way conversational speech
- multi-speaker meeting interaction
- text corpora - emails, SMS, vowelised Arabic, ontologies, domain-specific materials
- handwriting - databases for handwriting recognition and document parsing (diagrams, etc.)