Database¶

class audformat.Database(name, source='', usage='unrestricted', *, expires=None, languages=None, description=None, author=None, organization=None, license=None, license_url=None, meta=None)[source]¶

Database object.

A database consists of a header holding raters, schemes, splits, and other meta information. In addition, it links to a number of tables listing files and labels.

For a start see how to create a database and inspect the example of the emodb database.

Parameters

name (str) – name of database
source (str) – data source (e.g. link to website)
usage (str) – permission of usage, see audformat.define.Usage. Set to 'other' if none of the other fields fit.
expires (Optional[date]) – expiry date
languages (Union[str, Sequence[str], None]) – list of languages. Will be mapped to ISO 639-3 strings with audformat.utils.map_language()
description (Optional[str]) – database description
author (Optional[str]) – database author(s)
organization (Optional[str]) – organization(s) maintaining the database
license (Union[str, License, None]) – database license. You can use a custom license or pick one from audformat.define.License. In the latter case, license_url will be set automatically if it is not given
license_url (Optional[str]) – URL of database license
meta (Optional[dict]) – additional meta fields

Raises

BadValueError – if an invalid usage value is passed
ValueError – if language is unknown

Examples

>>> db = Database(
...     "mydb",
...     "https://www.audeering.com/",
...     define.Usage.COMMERCIAL,
...     languages=["English", "de"],
... )
>>> db
name: mydb
source: https://www.audeering.com/
usage: commercial
languages: [eng, deu]
>>> labels = ["positive", "neutral", "negative"]
>>> db.schemes["emotion"] = Scheme(labels=labels)
>>> db.schemes["match"] = Scheme(dtype="bool")
>>> db.raters["rater"] = Rater()
>>> db.media["audio"] = Media(
...     define.MediaType.AUDIO,
...     format="wav",
...     sampling_rate=16000,
... )
>>> index = filewise_index(["f1.wav", "f2.wav"])
>>> db["table"] = Table(index, media_id="audio")
>>> db["table"]["column"] = Column(
...     scheme_id="emotion",
...     rater_id="rater",
... )
>>> db["table"]["column"].set(["neutral", "positive"])
>>> index = pd.Index([], dtype="string", name="idx")
>>> db["misc-table"] = MiscTable(index)
>>> db["misc-table"]["column"] = Column(scheme_id="match")
>>> db
name: mydb
source: https://www.audeering.com/
usage: commercial
languages: [eng, deu]
media:
  audio: {type: audio, format: wav, sampling_rate: 16000}
raters:
  rater: {type: human}
schemes:
  emotion:
    dtype: str
    labels: [positive, neutral, negative]
  match: {dtype: bool}
tables:
  table:
    type: filewise
    media_id: audio
    columns:
      column: {scheme_id: emotion, rater_id: rater}
misc_tables:
  misc-table:
    levels: {idx: str}
    columns:
      column: {scheme_id: match}
>>> list(db)
['misc-table', 'table']
>>> db.get("emotion")
         emotion
file
f1.wav   neutral
f2.wav  positive

__contains__()¶

Database.__contains__(table_id)[source]¶

Check if (miscellaneous) table exists.

Parameters: table_id (str) – table identifier
Return type: bool

__eq__()¶

Database.__eq__(other)[source]¶

Check if database equals another database.

Return type: bool

__getitem__()¶

Database.__getitem__(table_id)[source]¶

Get (miscellaneous) table from database.

Parameters: table_id (str) – table identifier
Raises: BadKeyError – if table does not exist
Return type: MiscTable | Table

__iter__()¶

Database.__iter__()[source]¶

Iterate over (miscellaneous) tables of database.

Return type: MiscTable | Table

__setitem__()¶

Database.__setitem__(table_id, table)[source]¶

Add table to database.

Parameters

table_id (str) – table identifier
table (MiscTable | Table) – the table

Raises

BadIdError – if table has a split_id or media_id that is not specified in the underlying database
TableExistsError – if setting a miscellaneous table when a filewise or segmented table with the same ID exists (or vice versa)

Return type

MiscTable | Table

attachments¶

Database.attachments¶

Dictionary of attachments.

Raises: RuntimeError – if the path of a newly assigned attachment overlaps with the path of an existing attachment

author¶

Database.author¶: Author(s) of database

description¶

Database.description¶: Description

drop_files()¶

Database.drop_files(files, num_workers=1, verbose=False)[source]¶

Drop files from tables.

Iterate through all tables and remove rows with a reference to listed or matching files.

Parameters

files (str | Sequence[str] | Callable[[str], bool]) – list of files or condition function
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar
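The files argument accepts a single file name, a list of file names, or a condition function. The following stdlib-only sketch shows one way such an argument can be normalized into a predicate and applied to a list of rows. The helpers to_condition and drop_rows are hypothetical names for illustration and are not part of audformat.

```python
from typing import Callable, List, Sequence, Union

Files = Union[str, Sequence[str], Callable[[str], bool]]


def to_condition(files: Files) -> Callable[[str], bool]:
    """Turn a file name, a list of file names, or a function into a predicate."""
    if callable(files):
        return files
    if isinstance(files, str):
        files = [files]
    listed = set(files)
    return lambda file: file in listed


def drop_rows(rows: List[str], files: Files) -> List[str]:
    """Remove rows whose file is listed or matches the condition."""
    condition = to_condition(files)
    return [file for file in rows if not condition(file)]


rows = ["a/f1.wav", "a/f2.wav", "b/f3.wav"]
print(drop_rows(rows, "a/f1.wav"))
print(drop_rows(rows, lambda file: file.startswith("a/")))
```

Picking files is the mirror image: keep the rows for which the predicate is true instead of removing them.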

drop_tables()¶

Database.drop_tables(table_ids)[source]¶

Drop (miscellaneous) tables by ID.

Parameters

table_ids (str | Sequence[str]) – table IDs to drop

Raises

audformat.errors.BadIdError – if a table with provided ID does not exist in the database
RuntimeError – if a misc table that is used in a scheme would be removed

dump()¶

Database.dump(stream=None, indent=2)¶

Serialize object to YAML.

Parameters

stream – file-like object. If None serializes to string
indent (int) – indent

Return type

str

Returns

YAML string

expires¶

Database.expires¶: Expiry date

files¶

Database.files¶

Files referenced in the database.

Includes files from filewise and segmented tables.

Returns: files

files_duration()¶

Database.files_duration(files, *, root=None)[source]¶

Duration of files in the database.

Use db.files_duration(db.files).sum() to get the total duration of all files in a database. Or db.files_duration(db[table_id].files).sum() to get the total duration of all files assigned to a table.

Note

Durations are cached, i.e. changing the files on disk after calling this function can lead to wrong results. The cache is cleared when the database is reloaded from disk.

Parameters

files (str | Sequence[str]) – file names
root (Optional[str]) – root directory under which the files are stored. Provide if file names are relative and database was not saved or loaded from disk. If None audformat.Database.root is used

Return type

Series

Returns

mapping from file to duration

Raises

ValueError – if root is not set when using relative file names with a database that was not saved or loaded from disk
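The caching behaviour described in the note can be illustrated with a small stdlib-only sketch. It is not audformat's implementation: the cache dictionary and the helpers write_silence and cached_duration are hypothetical, and WAV durations are read with Python's wave module.

```python
import os
import tempfile
import wave

_duration_cache = {}


def write_silence(path, seconds, rate=16000):
    """Write a mono 16-bit WAV file containing silence."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))


def cached_duration(path):
    """Return duration in seconds, reading the file only on first request."""
    if path not in _duration_cache:
        with wave.open(path, "rb") as w:
            _duration_cache[path] = w.getnframes() / w.getframerate()
    return _duration_cache[path]


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "f1.wav")
    write_silence(path, 1.0)
    first = cached_duration(path)  # read from disk
    write_silence(path, 2.0)  # file changes on disk
    stale = cached_duration(path)  # still the cached value
    _duration_cache.clear()  # stand-in for reloading the database
    fresh = cached_duration(path)  # read from disk again
```

In this sketch the second request still reports the old duration, because the file is not read again until the cache is cleared.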

from_dict()¶

Database.from_dict(d, ignore_keys=None)¶

Deserialize object from dictionary.

Parameters

d (dict) – dictionary of class variables to assign
ignore_keys (Optional[Sequence[str]]) – variables listed here will be ignored

get()¶

Database.get(scheme, additional_schemes=[], *, tables=None, splits=None, strict=False, map=True, original_column_names=False, aggregate_function=None, aggregate_strategy='mismatch')[source]¶

Get labels by scheme.

Return all labels from columns assigned to an audformat.Scheme with name scheme. The request can be limited to specific tables and/or splits. By providing additional_schemes the result can be enriched with labels from other schemes (searched in all tables). If strict is False, a scheme is defined more broadly: it matches not only schemes of the database, but also columns with the same name and labels of a scheme that have the requested name as key. If at least one returned label belongs to a segmented table, the returned data frame has a segmented index. An aggregate_function can be provided that specifies how values are combined if more than one value is found for the same file or segment.

Parameters

scheme (str) – scheme ID for which labels should be returned. The search can be restricted to specific tables and splits by the tables and splits arguments. Or extended to columns with that same name or the name of a label in the scheme using the strict argument
additional_schemes (str | Sequence) – scheme ID or sequence of scheme IDs for which additional labels should be returned. The search is not affected by the tables and splits arguments
tables (Union[str, Sequence, None]) – limit search for scheme to selected tables
splits (Union[str, Sequence, None]) – limit search for scheme to selected splits
strict (bool) – if False the search is extended to columns that match the name of the scheme or the name of a label in the scheme
map (bool) – if True and a requested scheme has labels with mappings, those will be returned
original_column_names (bool) – if True keep the original column names (possibly results in multiple columns). For mapped schemes, the column name before mapping is returned, e.g. when requesting 'gender' it might return a column named 'speaker'
aggregate_function (Optional[Callable[[Series], object]]) – callable to aggregate overlapping values. The function gets a pandas.Series with overlapping values as input. E.g. set to lambda y: y.mean() to average the values or to tuple to return them as a tuple
aggregate_strategy (str) – if aggregate_function is not None, aggregate_strategy decides when aggregate_function is applied. 'overlap': apply to all samples that have an overlapping index; 'mismatch': apply to all samples that have an overlapping index and a different value

Return type

DataFrame

Returns

data frame with values

Raises

ValueError – if different labels are found for a requested scheme under the same index entry
ValueError – if original_column_names is True and two columns in the returned data frame have the same name and cannot be joined due to overlapping data or different data type
TypeError – if labels of different data type are found for a requested scheme

Examples

Return all labels that match a requested scheme.

>>> import audb
>>> db = audb.load(
...     "emodb",
...     version="1.4.1",
...     only_metadata=True,
...     full_path=False,
...     verbose=False,
... )
>>> db.get("emotion").head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
>>> db.get("transcription").head()
                                        transcription
file
wav/03a01Fa.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a01Nc.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a01Wa.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a02Fc.wav     Das will sie am Mittwoch abgeben.
wav/03a02Nc.wav     Das will sie am Mittwoch abgeben.
>>> db.get("emotion", ["transcription"], map=False).head()
                   emotion transcription
file
wav/03a01Fa.wav  happiness           a01
wav/03a01Nc.wav    neutral           a01
wav/03a01Wa.wav      anger           a01
wav/03a02Fc.wav  happiness           a02
wav/03a02Nc.wav    neutral           a02

Non-existent schemes are ignored.

>>> db.get("emotion", ["non-existing"]).head()
                   emotion non-existing
file
wav/03a01Fa.wav  happiness          NaN
wav/03a01Nc.wav    neutral          NaN
wav/03a01Wa.wav      anger          NaN
wav/03a02Fc.wav  happiness          NaN
wav/03a02Nc.wav    neutral          NaN

Limit to a particular table or split.

>>> db.get("emotion", tables=["emotion.categories.train.gold_standard"]).head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
>>> db.get("emotion", splits=["test"]).head()
                   emotion
file
wav/12a01Fb.wav  happiness
wav/12a01Lb.wav    boredom
wav/12a01Nb.wav    neutral
wav/12a01Wc.wav      anger
wav/12a02Ac.wav       fear

Return requested scheme name independent of column ID.

>>> db["emotion"].columns
emotion:
  {scheme_id: emotion, rater_id: gold}
emotion.confidence:
  {scheme_id: confidence, rater_id: gold}
>>> db.get("confidence").head()
                 confidence
file
wav/03a01Fa.wav        0.90
wav/03a01Nc.wav        1.00
wav/03a01Wa.wav        0.95
wav/03a02Fc.wav        0.85
wav/03a02Nc.wav        1.00

If strict is True only values that have an attached scheme are returned.

>>> db.get("emotion.confidence").head()
                 emotion.confidence
file
wav/03a01Fa.wav                0.90
wav/03a01Nc.wav                1.00
wav/03a01Wa.wav                0.95
wav/03a02Fc.wav                0.85
wav/03a02Nc.wav                1.00
>>> db.get("emotion.confidence", strict=True).head()
Empty DataFrame
Columns: [emotion.confidence]
Index: []

If more than one value exists for the requested scheme and index entry, an error is raised. Use aggregate_function to combine the values instead.

>>> # Add a shuffled version of emotion ratings as `random` column
>>> db["emotion"]["random"] = Column(scheme_id="emotion")
>>> db["emotion"]["random"].set(
...     db["emotion"]["emotion"].get().sample(frac=1, random_state=1)
... )
>>> db.get("emotion")
Traceback (most recent call last):
    ...
ValueError: Found overlapping data in column 'emotion':
                      left    right
file
wav/03a01Nc.wav    neutral  disgust
wav/03a01Wa.wav      anger  neutral
wav/03a02Fc.wav  happiness  neutral
wav/03a02Ta.wav    sadness  boredom
wav/03a02Wb.wav      anger  sadness
wav/03a04Ad.wav       fear  neutral
wav/03a04Fd.wav  happiness    anger
wav/03a04Nc.wav    neutral  sadness
wav/03a04Wc.wav      anger  boredom
wav/03a05Aa.wav       fear  sadness
...
>>> db.get("emotion", aggregate_function=lambda y: y.iloc[0]).head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral

Alternatively, use original_column_names to return column IDs.

>>> db.get("emotion", original_column_names=True).head()
                   emotion     random
file
wav/03a01Fa.wav  happiness  happiness
wav/03a01Nc.wav    neutral    disgust
wav/03a01Wa.wav      anger    neutral
wav/03a02Fc.wav  happiness    neutral
wav/03a02Nc.wav    neutral    neutral
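The difference between the two aggregate_strategy values can be sketched without audformat. The function combine below is a hypothetical stand-in that follows the rules stated in the parameter description. It works on plain dictionaries, and aggregate_function receives a list rather than a pandas.Series.

```python
def combine(values_per_file, aggregate_function=None, aggregate_strategy="mismatch"):
    """Combine several values per file into one value per file."""
    combined = {}
    for file, values in values_per_file.items():
        differ = len(set(values)) > 1
        if len(values) == 1:
            # Nothing overlaps, keep the single value
            combined[file] = values[0]
        elif aggregate_function is None:
            # Without an aggregate function, differing values are an error
            if differ:
                raise ValueError(f"Found overlapping data for '{file}': {values}")
            combined[file] = values[0]
        elif aggregate_strategy == "overlap" or differ:
            # 'overlap': always aggregate; 'mismatch': only if values differ
            combined[file] = aggregate_function(values)
        else:
            combined[file] = values[0]
    return combined


values = {
    "f1.wav": ["neutral"],
    "f2.wav": ["anger", "anger"],
    "f3.wav": ["anger", "neutral"],
}
mismatch = combine(values, tuple)
overlap = combine(values, tuple, aggregate_strategy="overlap")
```

With 'mismatch', only f3.wav is passed to the aggregate function. With 'overlap', f2.wav is passed as well, even though both of its values agree.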

is_portable¶

Database.is_portable¶

Check if database can be moved to another location.

To be portable, media must not be referenced with an absolute path, and file paths must not contain a backslash (\) or the path components . and ... If a database is portable, it can be moved to another folder or updated by another database.

Returns: True if the database is portable
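A rough stdlib-only sketch of such a portability check on a single file path is shown below. The function looks_portable is hypothetical and not audformat's implementation. It assumes that the rule refers to . and .. as path components, so a file extension such as .wav is still allowed.

```python
import ntpath
import posixpath


def looks_portable(path: str) -> bool:
    """Return True if path is relative and free of '\\', '.', and '..' parts."""
    # Reject absolute paths in both POSIX and Windows notation
    if posixpath.isabs(path) or ntpath.isabs(path):
        return False
    # Reject backslashes anywhere in the path
    if "\\" in path:
        return False
    # Reject '.' and '..' as path components
    return not any(part in (".", "..") for part in path.split("/"))


print(looks_portable("wav/03a01Fa.wav"))
print(looks_portable("../wav/03a01Fa.wav"))
```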

languages¶

Database.languages¶: List of included languages

license¶

Database.license¶: License of database

license_url¶

Database.license_url¶: URL of database license

load()¶

static Database.load(root, *, name='db', load_data=False, num_workers=1, verbose=False)[source]¶

Load database from disk.

Expects a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv|parquet|pkl]. Media files should be located under root.

Parameters

root (str) – root directory
name (str) – base name of header and table files
load_data (bool) – if False, audformat.Table data is only loaded on demand, e.g. when audformat.Table.get() is called for the first time. Set to True to load all audformat.Table data immediately
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar

Return type

Database

Returns

database object

Raises

FileNotFoundError – if the database header file cannot be found under root
RuntimeError – if a CSV or PARQUET table file is newer than the corresponding PKL file
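The expected file layout can be spelled out with a short stdlib-only sketch. The helper expected_files is hypothetical, and the table IDs and the root folder are made-up examples.

```python
from pathlib import PurePosixPath


def expected_files(root, name="db", table_ids=(), extension="csv"):
    """Return the header path and one table path per table ID."""
    root = PurePosixPath(root)
    header = root / f"{name}.yaml"
    tables = [root / f"{name}.{table_id}.{extension}" for table_id in table_ids]
    return header, tables


header, tables = expected_files("/data/emodb", table_ids=["emotion", "files"])
print(header)
for table in tables:
    print(table)
```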

load_header_from_yaml()¶

static Database.load_header_from_yaml(header)[source]¶

Load database header from YAML.

Parameters: header (dict) – YAML header definition
Return type: Database
Returns: database object

map_files()¶

Database.map_files(func, num_workers=1, verbose=False)[source]¶

Apply function to file names in all tables.

If speed is crucial, see audformat.utils.map_file_path() for further hints how to optimize your code.

Parameters

func (Callable[[str], str]) – map function
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar
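The num_workers rule and the mapping itself can be sketched with the standard library. This is an illustration only: the helpers resolve_num_workers and map_file_names are hypothetical, and the use of a thread pool is an assumption, not a statement about how audformat parallelizes the work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_num_workers(num_workers):
    """Map None to five times the number of processors."""
    if num_workers is None:
        return (os.cpu_count() or 1) * 5
    return max(1, num_workers)


def map_file_names(files, func, num_workers=1):
    """Apply func to every file name, keeping the original order."""
    workers = resolve_num_workers(num_workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, files))


files = ["wav/f1.wav", "wav/f2.wav"]
mapped = map_file_names(
    files,
    lambda file: file.replace("wav/", "audio/"),
    num_workers=None,
)
print(mapped)
```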

media¶

Database.media¶: Dictionary of media information

meta¶

Database.meta¶: Dictionary with meta fields

misc_tables¶

Database.misc_tables¶: Dictionary of miscellaneous tables

name¶

Database.name¶: Name of database

organization¶

Database.organization¶: Organization that created the database

pick_files()¶

Database.pick_files(files, num_workers=1, verbose=False)[source]¶

Pick files from tables.

Iterate through all tables and keep only rows with a reference to listed files or matching files.

Parameters

files (str | Sequence[str] | Callable[[str], bool]) – list of files or condition function
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar

pick_tables()¶

Database.pick_tables(table_ids)[source]¶

Pick (miscellaneous) tables by ID.

Parameters

table_ids (str | Sequence[str]) – table IDs to pick

Raises

audformat.errors.BadIdError – if a table with provided ID does not exist in the database
RuntimeError – if a misc table that is used in a scheme would be removed

raters¶

Database.raters¶: Dictionary of raters

root¶

Database.root¶

Database root directory.

Returns None if database has not been stored yet.

Returns: root directory

save()¶

Database.save(root, *, name='db', indent=2, storage_format='parquet', update_other_formats=True, header_only=False, num_workers=1, verbose=False)[source]¶

Save database to disk.

Creates a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv,parquet,pkl].

Existing files will be overwritten. If update_other_formats is True, existing files in other formats are overwritten as well.

Parameters

root (str) – root directory (possibly created)
name (str) – base name of files
indent (int) – indent size
storage_format (str) – storage format of tables. See audformat.define.TableStorageFormat for available formats
update_other_formats (bool) – if True it will not only save to the given storage_format, but update all files stored in other storage formats as well
header_only (bool) – store header only
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar
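The effect of update_other_formats can be sketched as follows. The function files_to_write is hypothetical. It assumes that the value of storage_format is also used as the file extension and that only files that already exist in another format are updated.

```python
FORMATS = ("csv", "parquet", "pkl")


def files_to_write(
    existing,
    name,
    table_id,
    storage_format="parquet",
    update_other_formats=True,
):
    """List the table files a save would write or overwrite."""
    written = [f"{name}.{table_id}.{storage_format}"]
    if update_other_formats:
        for extension in FORMATS:
            candidate = f"{name}.{table_id}.{extension}"
            # Only update files in other formats that already exist
            if extension != storage_format and candidate in existing:
                written.append(candidate)
    return written


existing = {"db.emotion.csv", "db.emotion.pkl"}
print(files_to_write(existing, "db", "emotion"))
print(files_to_write(existing, "db", "emotion", update_other_formats=False))
```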

schemes¶

Database.schemes¶: Dictionary of schemes

segments¶

Database.segments¶

Segments referenced in the database.

Returns: segments

source¶

Database.source¶: Database source

splits¶

Database.splits¶: Dictionary of splits

tables¶

Database.tables¶: Dictionary of audformat tables

to_dict()¶

Database.to_dict()¶

Serialize object to dictionary.

Return type: dict
Returns: dictionary with attributes

update()¶

Database.update(others, *, copy_attachments=False, copy_media=False, overwrite=False)[source]¶

Update database with other database(s).

In order to update a database, license and usage have to match. Labels and values of schemes with the same ID are combined. Media, raters, schemes, splits, and attachments that are not part of the database yet are added. Other fields will be updated by applying the following rules:

field         result
author        'db.author, other.author'
description   db.description
expires       min(db.expires, other.expires)
languages     db.languages + other.languages
license_url   db.license_url
meta          db.meta + other.meta
name          db.name
organization  'db.organization, other.organization'
source        'db.source, other.source'
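The field rules in the table above can be mirrored in a small stdlib-only sketch that operates on plain dictionaries. The function merge_fields is hypothetical. How missing values, duplicate languages, and clashing meta keys are handled is an assumption of this sketch and may differ from audformat.

```python
from datetime import date


def join(first, second):
    """Join two optional strings with ', ', skipping empty ones."""
    parts = [part for part in (first, second) if part]
    return ", ".join(parts) if parts else None


def merge_fields(db: dict, other: dict) -> dict:
    """Merge header fields of two databases following the table."""
    expires = [d for d in (db.get("expires"), other.get("expires")) if d is not None]
    languages = list(db.get("languages", []))
    # Assumption: languages already present are not added twice
    languages += [lang for lang in other.get("languages", []) if lang not in languages]
    return {
        "author": join(db.get("author"), other.get("author")),
        "description": db.get("description"),
        "expires": min(expires) if expires else None,
        "languages": languages,
        "license_url": db.get("license_url"),
        # Assumption: on a key clash the value from db is kept
        "meta": {**other.get("meta", {}), **db.get("meta", {})},
        "name": db.get("name"),
        "organization": join(db.get("organization"), other.get("organization")),
        "source": join(db.get("source"), other.get("source")),
    }


db = {
    "name": "db1",
    "author": "A",
    "expires": date(2030, 1, 1),
    "languages": ["eng"],
    "meta": {"k": 1},
    "source": "s1",
}
other = {
    "name": "db2",
    "author": "B",
    "expires": date(2025, 1, 1),
    "languages": ["eng", "deu"],
    "meta": {"k": 2, "j": 3},
    "source": "s2",
}
merged = merge_fields(db, other)
print(merged)
```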

Parameters

others (Database | Sequence[Database]) – database object(s)
copy_attachments (bool) – if True it copies the attachment files associated with others to the current database root folder
copy_media (bool) – if True it copies the media files associated with others to the current database root folder
overwrite (bool) – overwrite table values where indices overlap

Return type

Database

Returns

the updated database

Raises

ValueError – if database has different license or usage
ValueError – if different media, rater, scheme, split, or attachment with same ID is found
ValueError – if schemes cannot be combined, e.g. labels have different dtype
ValueError – if tables cannot be combined (e.g. values in same position overlap or level and dtypes of table indices do not match)
RuntimeError – if copy_media or copy_attachments is True, but one of the involved databases was not saved (contains files but no root folder)
RuntimeError – if any involved database is not portable

usage¶

Database.usage¶

Usage permission.

Possible return values are given by audformat.define.Usage.
