Conventions¶
Database name¶
The name of a database should be lowercase
and must not contain blanks or
special characters.
If you have different versions,
or very long names you can use -
to increase readability.
audformat.Database(name="librispeech-mfa-cseg-pho")
name: librispeech-mfa-cseg-pho
source: ''
usage: unrestricted
languages: []
Table and scheme names¶
Use lower case for table and scheme names.
Here’s a list of common scheme names:
Name |
Content |
---|---|
emotion |
emotion categories |
arousal / valence / dominance |
emotion dimensions |
speaker |
unique speaker id |
role |
the role a speaker has, e.g. agent vs. client |
transcription |
transcriptions on word or phonetic level accurate enough to be used for ASR |
text |
transcriptions that are not accurate enough for ASR |
In case you have several schemes of the same type,
append -xxx
.
E.g. use transcription-word
and transcription-phoneme
if a database offers word
and phoneme transcriptions.
Tables, schemes, and raters¶
Consider one table per scheme with the name of the scheme. Use lower case for table and scheme names. If you have multiple raters, name each column after the name of the rater.
db = audformat.Database("mydata")
db.schemes["arousal"] = audformat.Scheme(audformat.define.DataType.FLOAT)
db.schemes["valence"] = audformat.Scheme(audformat.define.DataType.FLOAT)
db.raters["rater1"] = audformat.Rater()
db.raters["rater2"] = audformat.Rater()
for scheme_id in db.schemes:
db[scheme_id] = audformat.Table(audformat.filewise_index())
for rater_id in db.raters:
db[scheme_id][rater_id] = audformat.Column(
scheme_id=scheme_id,
rater_id=rater_id,
)
db
name: mydata
source: ''
usage: unrestricted
languages: []
raters:
rater1: {type: human}
rater2: {type: human}
schemes:
arousal: {dtype: float}
valence: {dtype: float}
tables:
arousal:
type: filewise
columns:
rater1: {scheme_id: arousal, rater_id: rater1}
rater2: {scheme_id: arousal, rater_id: rater2}
valence:
type: filewise
columns:
rater1: {scheme_id: valence, rater_id: rater1}
rater2: {scheme_id: valence, rater_id: rater2}
Database splits¶
If an official split into training,
development
and test set
consists,
consider one table per split,
named scheme_id.split
.
db = audformat.Database("mydata")
db.schemes["arousal"] = audformat.Scheme(audformat.define.DataType.FLOAT)
db.splits["train"] = audformat.Split(audformat.define.SplitType.TRAIN)
db.splits["dev"] = audformat.Split(audformat.define.SplitType.DEVELOP)
db.splits["test"] = audformat.Split(audformat.define.SplitType.TEST)
for scheme_id in db.schemes:
for split_id in db.splits:
table_id = f"{scheme_id}.{split_id}"
db[table_id] = audformat.Table(
index=audformat.filewise_index(),
split_id=split_id,
)
db
name: mydata
source: ''
usage: unrestricted
languages: []
schemes:
arousal: {dtype: float}
splits:
dev: {type: dev}
test: {type: test}
train: {type: train}
tables:
arousal.dev: {type: filewise, split_id: dev}
arousal.test: {type: filewise, split_id: test}
arousal.train: {type: filewise, split_id: train}
Gold standard annotation¶
Annotations by several raters
belonging to the same scheme
should be stored in a single table,
but not aggregated,
e.g. by adding a column with mean or some other metric.
Instead a new table with the postfix .gold_standard
should be created
to store the average of all rater.
In addition,
a rater with the id "gold_standard"
and the type audformat.define.RaterType.VOTE
should be created
and associated with the column
holding the gold standard values.
db = audformat.Database("mydata")
db.schemes["arousal"] = audformat.Scheme(audformat.define.DataType.FLOAT)
db.raters["rater1"] = audformat.Rater()
db.raters["rater2"] = audformat.Rater()
db.raters["gold_standard"] = audformat.Rater(audformat.define.RaterType.VOTE)
for scheme_id in db.schemes:
db[scheme_id] = audformat.Table(audformat.filewise_index())
for rater_id in ["rater1", "rater2"]:
db[scheme_id][rater_id] = audformat.Column(
scheme_id=scheme_id,
rater_id=rater_id,
)
gold_id = f"{scheme_id}.gold_standard"
db[gold_id] = audformat.Table(audformat.filewise_index())
db[gold_id][scheme_id] = audformat.Column(
scheme_id=scheme_id,
rater_id="gold_standard",
)
db
name: mydata
source: ''
usage: unrestricted
languages: []
raters:
gold_standard: {type: vote}
rater1: {type: human}
rater2: {type: human}
schemes:
arousal: {dtype: float}
tables:
arousal:
type: filewise
columns:
rater1: {scheme_id: arousal, rater_id: rater1}
rater2: {scheme_id: arousal, rater_id: rater2}
arousal.gold_standard:
type: filewise
columns:
arousal: {scheme_id: arousal, rater_id: gold_standard}
Confidence values¶
Assume you have an annotation
that does not only provide a value,
but also a confidence of that value.
In this case you create
two schemes,
one for the value,
and one for the confidence
using the same scheme ID,
but followed by .confidence
.
The confidence values should be stored in a separate table. Or it can be stored within the same table as a different column, which might be worth considering when storing the gold standard.
db = audformat.Database("mydata")
db.schemes["arousal"] = audformat.Scheme(audformat.define.DataType.FLOAT)
db.schemes["arousal.confidence"] = audformat.Scheme(
audformat.define.DataType.FLOAT,
minimum=0,
maximum=1,
)
db.raters["gold_standard"] = audformat.Rater(audformat.define.RaterType.VOTE)
db["arousal"] = audformat.Table(audformat.filewise_index())
for scheme_id in db.schemes:
db["arousal"][scheme_id] = audformat.Column(
scheme_id=scheme_id,
rater_id="gold_standard",
)
db
name: mydata
source: ''
usage: unrestricted
languages: []
raters:
gold_standard: {type: vote}
schemes:
arousal: {dtype: float}
arousal.confidence: {dtype: float, minimum: 0, maximum: 1}
tables:
arousal:
type: filewise
columns:
arousal: {scheme_id: arousal, rater_id: gold_standard}
arousal.confidence: {scheme_id: arousal.confidence, rater_id: gold_standard}
File and speaker information¶
Meta information like speaker ID
that is not included in another table
should be collected in a table files
.
If you have metadata
belonging only to segments,
collect it in a table segments
.
Additional meta information, that is bound to another information like age of speaker, should be collected in the header as it can be later automatically mapped.
db = audformat.Database("mydata")
M = audformat.define.Gender.MALE
F = audformat.define.Gender.FEMALE
speaker = {
"speaker1": {"gender": F, "age": 31},
"speaker2": {"gender": M, "age": 85},
}
db.schemes["speaker"] = audformat.Scheme(labels=speaker)
db["files"] = audformat.Table(
index=audformat.filewise_index(["a.wav", "b.wav"])
)
db["files"]["speaker"] = audformat.Column(scheme_id="speaker")
db["files"]["speaker"].set(["speaker1", "speaker2"])
db
name: mydata
source: ''
usage: unrestricted
languages: []
schemes:
speaker:
dtype: str
labels:
speaker1: {gender: female, age: 31}
speaker2: {gender: male, age: 85}
tables:
files:
type: filewise
columns:
speaker: {scheme_id: speaker}
db["files"].get()
speaker | |
---|---|
file | |
a.wav | speaker1 |
b.wav | speaker2 |
You can access the additional information with the map
argument
of audformat.Table.get()
,
see Map scheme labels
for an extended documentation.
db["files"].get(map={"speaker": "gender"})
gender | |
---|---|
file | |
a.wav | female |
b.wav | male |
Temporal data¶
Temporal duration data
like response time of a rater
should be stored as pd.Timedelta
.
Temporal dates
like time of rating
should be stored as datetime.datetime
.
import pandas as pd
times = [2.1, 0.1] # in seconds
db = audformat.Database("mydata")
db.schemes["time"] = audformat.Scheme(audformat.define.DataType.TIME)
db.raters["rater"] = audformat.Rater()
db["files"] = audformat.Table(
index=audformat.filewise_index(["a.wav", "b.wav"])
)
db["files"]["time"] = audformat.Column(
scheme_id="time",
rater_id="rater",
)
db["files"]["time"].set(pd.to_timedelta(times, unit="s"))
db["files"].get()
time | |
---|---|
file | |
a.wav | 0 days 00:00:02.100000 |
b.wav | 0 days 00:00:00.100000 |