Conventions

Database name

The name of a database should be lowercase and must not contain blanks or special characters. If you have different versions or very long names, you can use - to separate the parts and improve readability.

import audformat

audformat.Database(name="librispeech-mfa-cseg-pho")
name: librispeech-mfa-cseg-pho
source: ''
usage: unrestricted
languages: []
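
The naming rule can be checked automatically. The following is a minimal sketch, not part of audformat: the helper is_valid_database_name and its pattern are assumptions that read "no special characters" as "lowercase letters and digits only, with single hyphens as separators".

```python
import re

# Assumed pattern (not defined by audformat): lowercase letters and digits,
# optionally separated by single hyphens.
_NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_database_name(name: str) -> bool:
    """Return True if name follows the naming convention as interpreted here."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_database_name("librispeech-mfa-cseg-pho"))
print(is_valid_database_name("Libri Speech"))
```

The first call accepts the example name from above, the second rejects a name with uppercase letters and a blank.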

Table and scheme names

Use lower case for table and scheme names.

Here’s a list of common scheme names:

Name                           Content
-----------------------------  ---------------------------------------------------------------------------
emotion                        emotion categories
arousal / valence / dominance  emotion dimensions
speaker                        unique speaker id
role                           the role a speaker has, e.g. agent vs. client
transcription                  transcriptions on word or phonetic level accurate enough to be used for ASR
text                           transcriptions that are not accurate enough for ASR

If you have several schemes of the same type, append a suffix of the form -xxx. For example, use transcription-word and transcription-phoneme if a database offers word and phoneme transcriptions.

Tables, schemes, and raters

Consider using one table per scheme, named after the scheme. Use lower case for table and scheme names. If you have multiple raters, name each column after its rater.

db = audformat.Database("mydata")

db.schemes["arousal"] = audformat.Scheme(audformat.define.DataType.FLOAT)
db.schemes["valence"] = audformat.Scheme(audformat.define.DataType.FLOAT)

db.raters["rater1"] = audformat.Rater()
db.raters["rater2"] = audformat.Rater()

for scheme_id in db.schemes:
    db[scheme_id] = audformat.Table(audformat.filewise_index())
    for rater_id in db.raters:
        db[scheme_id][rater_id] = audformat.Column(
            scheme_id=scheme_id,
            rater_id=rater_id,
        )

db
name: mydata
source: ''
usage: unrestricted
languages: []
raters:
  rater1: {type: human}
  rater2: {type: human}
schemes:
  arousal: {dtype: float}
  valence: {dtype: float}
tables:
  arousal:
    type: filewise
    columns:
      rater1: {scheme_id: arousal, rater_id: rater1}
      rater2: {scheme_id: arousal, rater_id: rater2}
  valence:
    type: filewise
    columns:
      rater1: {scheme_id: valence, rater_id: rater1}
      rater2: {scheme_id: valence, rater_id: rater2}

Database splits

If an official split into training, development, and test sets exists, consider one table per split, named scheme_id.split_id, e.g. arousal.train.

db = audformat.Database("mydata")

db.schemes["arousal"] = audformat.Scheme(audformat.define.DataType.FLOAT)

db.splits["train"] = audformat.Split(audformat.define.SplitType.TRAIN)
db.splits["dev"] = audformat.Split(audformat.define.SplitType.DEVELOP)
db.splits["test"] = audformat.Split(audformat.define.SplitType.TEST)

for scheme_id in db.schemes:
    for split_id in db.splits:
        table_id = f"{scheme_id}.{split_id}"
        db[table_id] = audformat.Table(
            index=audformat.filewise_index(),
            split_id=split_id,
        )

db
name: mydata
source: ''
usage: unrestricted
languages: []
schemes:
  arousal: {dtype: float}
splits:
  dev: {type: dev}
  test: {type: test}
  train: {type: train}
tables:
  arousal.dev: {type: filewise, split_id: dev}
  arousal.test: {type: filewise, split_id: test}
  arousal.train: {type: filewise, split_id: train}

Gold standard annotation

Annotations by several raters belonging to the same scheme should be stored in a single table, but not aggregated there, e.g. by adding a column with the mean or some other metric. Instead, create a new table with the suffix .gold_standard to store the average of all raters. In addition, create a rater with the ID "gold_standard" and the type audformat.define.RaterType.VOTE, and associate it with the column holding the gold standard values.

db = audformat.Database("mydata")

db.schemes["arousal"] = audformat.Scheme(audformat.define.DataType.FLOAT)

db.raters["rater1"] = audformat.Rater()
db.raters["rater2"] = audformat.Rater()
db.raters["gold_standard"] = audformat.Rater(audformat.define.RaterType.VOTE)

for scheme_id in db.schemes:
    db[scheme_id] = audformat.Table(audformat.filewise_index())
    for rater_id in ["rater1", "rater2"]:
        db[scheme_id][rater_id] = audformat.Column(
            scheme_id=scheme_id,
            rater_id=rater_id,
        )
    gold_id = f"{scheme_id}.gold_standard"
    db[gold_id] = audformat.Table(audformat.filewise_index())
    db[gold_id][scheme_id] = audformat.Column(
        scheme_id=scheme_id,
        rater_id="gold_standard",
    )

db
name: mydata
source: ''
usage: unrestricted
languages: []
raters:
  gold_standard: {type: vote}
  rater1: {type: human}
  rater2: {type: human}
schemes:
  arousal: {dtype: float}
tables:
  arousal:
    type: filewise
    columns:
      rater1: {scheme_id: arousal, rater_id: rater1}
      rater2: {scheme_id: arousal, rater_id: rater2}
  arousal.gold_standard:
    type: filewise
    columns:
      arousal: {scheme_id: arousal, rater_id: gold_standard}

Confidence values

Assume an annotation provides not only a value, but also a confidence for that value. In this case, create two schemes: one for the value and one for the confidence, using the same scheme ID followed by .confidence.

The confidence values should be stored in a separate table. Alternatively, they can be stored in the same table as an additional column, which might be worth considering when storing the gold standard.

db = audformat.Database("mydata")

db.schemes["arousal"] = audformat.Scheme(audformat.define.DataType.FLOAT)
db.schemes["arousal.confidence"] = audformat.Scheme(
    audformat.define.DataType.FLOAT,
    minimum=0,
    maximum=1,
)

db.raters["gold_standard"] = audformat.Rater(audformat.define.RaterType.VOTE)

db["arousal"] = audformat.Table(audformat.filewise_index())
for scheme_id in db.schemes:
    db["arousal"][scheme_id] = audformat.Column(
        scheme_id=scheme_id,
        rater_id="gold_standard",
    )

db
name: mydata
source: ''
usage: unrestricted
languages: []
raters:
  gold_standard: {type: vote}
schemes:
  arousal: {dtype: float}
  arousal.confidence: {dtype: float, minimum: 0, maximum: 1}
tables:
  arousal:
    type: filewise
    columns:
      arousal: {scheme_id: arousal, rater_id: gold_standard}
      arousal.confidence: {scheme_id: arousal.confidence, rater_id: gold_standard}

File and speaker information

Meta information like speaker ID that is not included in another table should be collected in a table named files. If you have metadata belonging only to segments, collect it in a table named segments.

Additional meta information that is bound to other information, such as the age of a speaker, should be collected in the header, as it can later be mapped automatically.

db = audformat.Database("mydata")

M = audformat.define.Gender.MALE
F = audformat.define.Gender.FEMALE
speaker = {
    "speaker1": {"gender": F, "age": 31},
    "speaker2": {"gender": M, "age": 85},
}

db.schemes["speaker"] = audformat.Scheme(labels=speaker)
db["files"] = audformat.Table(
    index=audformat.filewise_index(["a.wav", "b.wav"])
)
db["files"]["speaker"] = audformat.Column(scheme_id="speaker")
db["files"]["speaker"].set(["speaker1", "speaker2"])

db
name: mydata
source: ''
usage: unrestricted
languages: []
schemes:
  speaker:
    dtype: str
    labels:
      speaker1: {gender: female, age: 31}
      speaker2: {gender: male, age: 85}
tables:
  files:
    type: filewise
    columns:
      speaker: {scheme_id: speaker}
db["files"].get()
        speaker
file
a.wav  speaker1
b.wav  speaker2

You can access the additional information with the map argument of audformat.Table.get(); see Map scheme labels for more details.

db["files"].get(map={"speaker": "gender"})
       gender
file
a.wav  female
b.wav    male

Temporal data

Durations, such as the response time of a rater, should be stored as pd.Timedelta. Points in time, such as the time of a rating, should be stored as datetime.datetime.

import pandas as pd


times = [2.1, 0.1]  # in seconds

db = audformat.Database("mydata")

db.schemes["time"] = audformat.Scheme(audformat.define.DataType.TIME)
db.raters["rater"] = audformat.Rater()

db["files"] = audformat.Table(
    index=audformat.filewise_index(["a.wav", "b.wav"])
)
db["files"]["time"] = audformat.Column(
    scheme_id="time",
    rater_id="rater",
)
db["files"]["time"].set(pd.to_timedelta(times, unit="s"))

db["files"].get()
                        time
file
a.wav 0 days 00:00:02.100000
b.wav 0 days 00:00:00.100000