Text Corpora

Appen has a variety of text collections in different languages available for license. Additional corpora are being added continuously.

  • Single-language texts: texts collected from many different situations and writers, in various languages. The text types we collect are: travel, sports, recipes, instructions, descriptions, narratives, letters, opinions, fables, children's stories and email.
  • Parallel text corpora: some of these texts are now available in parallel format, i.e. as a translation of one text into a second language.
  • Named entity annotated texts: corpora of 500,000 words have also been developed in several languages (named entities include persons, titles, quantities, geopolitical entities, locations, facilities, etc.).

Please use our interactive Product Catalogue for a list and detailed descriptions of text corpora available for license.