Conditions for Data Publication in LinguRep


LinguRep archives linguistic corpora and data collections in the fields of variationist linguistics, historical linguistics, dialectology, and regional language research in general. The sections below give an overview of the terms and conditions for depositing data and of the procedures involved.

1. Objectives

The aim is the long-term preservation, archiving, and reuse of linguistic data (e.g., recordings of conversations or interviews, data from corpus analyses, videos of participant observation, datasets from indirect surveys) in a discipline-specific data repository.

2. Criteria for Data Ingest

LinguRep primarily ingests data from completed research projects with a focus on regional language variation (e.g., dialects, regiolects). The emphasis is on data from the entire German-speaking area, including extraterritorial varieties (language islands).

Specifically, the following should apply:

  • the data originate from a completed project and are suitable for reuse,

  • data preparation (recordings, transcriptions, annotations, metadata) is as standardized and well documented as possible,

  • agreements on use, access, licensing, and privacy and data protection have been clarified in advance,

  • metadata allow for a sufficiently detailed description (e.g., documentation of the period of data collection, location, speakers, recording conditions, annotations, etc.).

3. Preferred Data Formats

To ensure smooth archiving and reuse, we recommend the following formats – they are not mandatory, but strongly preferred:

Audio & Video

  • Audio: uncompressed recordings are preferred, for example WAV at 48 kHz and 16 bit. Other uncompressed or losslessly compressed formats may also be used. If possible, avoid lossy compressed formats (e.g., MP3).

  • Video: files in MPEG-4 format with H.264 encoding are preferred. Common frame rates are 25 or 50 fps, with resolutions ranging from Full HD (1920×1080) up to UHD (3840×2160). Other well-documented formats are also accepted.
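Data providers who want to check their audio files against the preferred parameters before submission could use a small script along the following lines. This is only a sketch that uses Python's standard `wave` module; the function name `check_wav` and the constants are illustrative assumptions and not part of any LinguRep tooling, and deviations are reported as hints only, since the formats are recommended rather than mandatory.

```python
import wave

# Preferred parameters taken from the recommendation above.
PREFERRED_RATE_HZ = 48_000      # 48 kHz
PREFERRED_SAMPLE_WIDTH = 2      # 2 bytes per sample = 16 bit


def check_wav(path):
    """Return a list of deviations from the preferred WAV parameters.

    Hypothetical helper for illustration; an empty list means the file
    matches the preferred sample rate and bit depth.
    """
    issues = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        if rate != PREFERRED_RATE_HZ:
            issues.append(f"sample rate is {rate} Hz, preferred is {PREFERRED_RATE_HZ} Hz")
        if width != PREFERRED_SAMPLE_WIDTH:
            issues.append(f"bit depth is {width * 8} bit, preferred is 16 bit")
    return issues
```

Calling `check_wav("interview_01.wav")` (a hypothetical file name) returns an empty list for a 48 kHz, 16-bit file and a list of hints otherwise. The `wave` module only reads uncompressed PCM WAV files, so other formats need a different tool.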

Transcription & Annotation

  • Phonetic transcriptions in IPA, SAMPA, or Teuthonista should be provided as text files (e.g., .txt, .textgrid, or .csv).

  • Aligned transcriptions based on, for example, Praat (TextGrids), ELAN, or EXMARaLDA are well suited.

  • If possible, avoid office formats (e.g., .docx, .xlsx, .odt, .ods, .pages, .numbers).
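To spot office formats in a data collection before submission, a simple check on file extensions may be enough. The following sketch assumes the list of extensions given above; the function name `find_office_files` is hypothetical and serves only as an illustration.

```python
from pathlib import Path

# Office formats named in the recommendation above (to be avoided if possible).
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".odt", ".ods", ".pages", ".numbers"}


def find_office_files(filenames):
    """Return the file names whose extension is an office format.

    Hypothetical helper for illustration; the comparison ignores case,
    so "Notes.DOCX" is detected as well.
    """
    return [
        name for name in filenames
        if Path(name).suffix.lower() in OFFICE_EXTENSIONS
    ]
```

Files reported by this check could then be exported to a plain-text format such as .txt or .csv where the content allows it.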

Metadata

  • Structured formats (e.g., XML) are ideal; alternatively, tabular formats (e.g., .csv) may be used.

  • It is important that all relevant information required for indexing the corpus is included, such as details on data collection, contributors, time and place, and methodological procedures.

  • LinguRep provides a metadata schema that must be completed by the data providers.

4. Licensing and Rights Clearance

The rights to the data are specified in a Deposit Agreement. This agreement clarifies who holds the rights and under which conditions archiving and reuse take place. Data protection and personal rights must be given special consideration; where necessary, consent from participants or anonymization of the collected data is required.

5. Checklist

LinguRep ingest checklist