DatabaseIterator

class audb.DatabaseIterator(db, table, *, version, map, batch_size, shuffle, buffer_size, only_metadata, bit_depth, channels, format, mixdown, sampling_rate, full_path, cache_root, num_workers, timeout, verbose)[source]

Database iterator.

This class cannot be created directly; instances are returned by audb.stream().

Examples

Create audb.DatabaseIterator object.

>>> db = audb.stream(
...     "emodb",
...     "files",
...     version="1.4.1",
...     batch_size=4,
...     only_metadata=True,
...     full_path=False,
...     verbose=False,
... )

The audb.DatabaseIterator object is restricted to the requested table, and all related schemes and misc tables used as labels in a related scheme.

>>> db
name: emodb
...
schemes:
  age: {description: Age of speaker, dtype: int, minimum: 0}
  duration: {dtype: time}
  gender:
    description: Gender of speaker
    dtype: str
    labels: [female, male]
  language: {description: Language of speaker, dtype: str}
  speaker: {description: The actors could produce each sentence as often as they liked
      and were asked to remember a real situation from their past when they had felt
      this emotion., dtype: int, labels: speaker}
  transcription:
    description: Sentence produced by actor.
    dtype: str
    labels: ...
tables:
  files:
    type: filewise
    columns:
      duration: {scheme_id: duration}
      speaker: {scheme_id: speaker}
      transcription: {scheme_id: transcription}
misc_tables:
  speaker:
    levels: {speaker: int}
    columns:
      age: {scheme_id: age}
      gender: {scheme_id: gender}
      language: {scheme_id: language}
...

Request the first batch of data.

>>> next(db)
                                 duration  speaker transcription
file
wav/03a01Fa.wav    0 days 00:00:01.898250        3           a01
wav/03a01Nc.wav    0 days 00:00:01.611250        3           a01
wav/03a01Wa.wav 0 days 00:00:01.877812500        3           a01
wav/03a02Fc.wav    0 days 00:00:02.006250        3           a02

During the iteration, the audb.DatabaseIterator object provides access to the current batch of data.

>>> db["files"].get(map={"speaker": "age"})
                                 duration transcription  age
file
wav/03a01Fa.wav    0 days 00:00:01.898250           a01   31
wav/03a01Nc.wav    0 days 00:00:01.611250           a01   31
wav/03a01Wa.wav 0 days 00:00:01.877812500           a01   31
wav/03a02Fc.wav    0 days 00:00:02.006250           a02   31
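Because the object is an iterator, all remaining batches can also be consumed in a plain for loop. A minimal sketch of collecting them into one frame (the collect_batches helper is illustrative, not part of audb; it works with any iterable of data frames):

```python
import pandas as pd


def collect_batches(iterator):
    """Concatenate every batch yielded by an iterator of data frames.

    Works with any iterable of pandas.DataFrame objects,
    including an audb.DatabaseIterator.
    """
    return pd.concat(list(iterator))
```

For example, collect_batches(db) would gather the remaining rows of the streamed table into a single data frame.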

__contains__()

DatabaseIterator.__contains__(table_id)

Check if (miscellaneous) table exists.

Parameters

table_id (str) – table identifier

Return type

bool

__eq__()

DatabaseIterator.__eq__(other)

Check whether the database equals another database.

Return type

bool

__getitem__()

DatabaseIterator.__getitem__(table_id)

Get (miscellaneous) table from database.

Parameters

table_id (str) – table identifier

Raises

BadKeyError – if table does not exist

Return type

Union[MiscTable, Table]

__iter__()

DatabaseIterator.__iter__()[source]

Iterator generator.

Return type

DatabaseIterator
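Returning the object itself from __iter__() is the standard pattern for classes that are their own iterator; it is what allows both next(db) and for batch in db. A generic sketch of the protocol (BatchIterator is an illustrative stand-in, not audb code):

```python
class BatchIterator:
    """Minimal object that, like audb.DatabaseIterator, is its own iterator."""

    def __init__(self, batches):
        self._batches = iter(batches)

    def __iter__(self):
        # Return the object itself, so it can be used directly in a for loop
        return self

    def __next__(self):
        # Delegate to the underlying iterator;
        # raises StopIteration when exhausted
        return next(self._batches)
```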

attachments

DatabaseIterator.attachments

Dictionary of attachments.

Raises

RuntimeError – if the path of a newly assigned attachment overlaps with the path of an existing attachment

author

DatabaseIterator.author

Author(s) of database

description

DatabaseIterator.description

Description

drop_files()

DatabaseIterator.drop_files(files, num_workers=1, verbose=False)

Drop files from tables.

Iterate through all tables and remove rows with a reference to listed or matching files.

Parameters
  • files (Union[str, Sequence[str], Callable[[str], bool]]) – list of files or condition function

  • num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar
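When a callable is passed as files, every file name for which it returns True is removed. The filtering semantics can be sketched as follows (illustrative only, not audb's implementation; the file names are hypothetical):

```python
# Hypothetical file list in audformat's relative-path style
files = ["wav/03a01Fa.wav", "wav/03a01Nc.wav", "wav/12a01Fb.wav"]


# Condition function: match all files of speaker 03
def condition(file):
    return file.startswith("wav/03")


# drop_files keeps only rows whose file does NOT match the condition
kept = [file for file in files if not condition(file)]
```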

drop_tables()

DatabaseIterator.drop_tables(table_ids)

Drop (miscellaneous) tables by ID.

Parameters

table_ids (Union[str, Sequence[str]]) – table IDs to drop

dump()

DatabaseIterator.dump(stream=None, indent=2)

Serialize object to YAML.

Parameters
  • stream – file-like object. If None serializes to string

  • indent (int) – indent

Return type

str

Returns

YAML string

expires

DatabaseIterator.expires

Expiry date

files

DatabaseIterator.files

Files referenced in the database.

Includes files from filewise and segmented tables.

Returns

files

files_duration()

DatabaseIterator.files_duration(files, *, root=None)

Duration of files in the database.

Use db.files_duration(db.files).sum() to get the total duration of all files in a database, or db.files_duration(db[table_id].files).sum() to get the total duration of all files assigned to a table.

Note

Durations are cached, i.e. changing the files on disk after calling this function can lead to wrong results. The cache is cleared when the database is reloaded from disk.

Parameters
  • files (Union[str, Sequence[str]]) – file name or sequence of file names

  • root (Optional[str]) – root directory under which the files are stored

Return type

Series

Returns

mapping from file to duration

Raises

ValueError – if root is not set when using relative file names with a database that was not saved or loaded from disk
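Since the result is a pandas.Series of timedeltas indexed by file, totals follow directly from .sum(). A small sketch with made-up durations mirroring the example data above:

```python
import pandas as pd

# Hypothetical result as files_duration() would return it:
# a Series mapping file names to pandas timedeltas
durations = pd.Series(
    pd.to_timedelta(["0 days 00:00:01.898250", "0 days 00:00:01.611250"]),
    index=["wav/03a01Fa.wav", "wav/03a01Nc.wav"],
)

# Total duration of all files
total = durations.sum()
```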

from_dict()

DatabaseIterator.from_dict(d, ignore_keys=None)

Deserialize object from dictionary.

Parameters
  • d (dict) – dictionary of class variables to assign

  • ignore_keys (Optional[Sequence[str]]) – variables listed here will be ignored

get()

DatabaseIterator.get(scheme, additional_schemes=[], *, tables=None, splits=None, strict=False, map=True, original_column_names=False, aggregate_function=None, aggregate_strategy='mismatch')

Get labels by scheme.

Return all labels from columns assigned to a audformat.Scheme with name scheme. The request can be limited to specific tables and/or splits. By providing additional_schemes the result can be enriched with labels from other schemes (searched in all tables). If strict is False, a scheme is defined more broadly and does not only match schemes of the database, but also columns with the same name or labels of a scheme with the requested name as key. If at least one returned label belongs to a segmented table, the returned data frame has a segmented index. An aggregate_function can be provided that specifies how values are combined if more than one value is found for the same file or segment.

Parameters
  • scheme (str) – scheme ID for which labels should be returned. The search can be restricted to specific tables and splits by the tables and splits arguments. Or extended to columns with that same name or the name of a label in the scheme using the strict argument

  • additional_schemes (Union[str, Sequence]) – scheme ID or sequence of scheme IDs for which additional labels should be returned. The search is not affected by the tables and splits arguments

  • tables (Union[str, Sequence, None]) – limit search for scheme to selected tables

  • splits (Union[str, Sequence, None]) – limit search for scheme to selected splits

  • strict (bool) – if False the search is extended to columns that match the name of the scheme or the name of a label in the scheme

  • map (bool) – if True and a requested scheme has labels with mappings, those will be returned

  • original_column_names (bool) – if True keep the original column names (possibly results in multiple columns). For mapped schemes, the column name before mapping is returned, e.g. when requesting 'gender' it might return a column named 'speaker'

  • aggregate_function (Optional[Callable[[Series], Any]]) – callable to aggregate overlapping values. The function gets a pandas.Series with overlapping values as input. E.g. set to lambda y: y.mean() to average the values or to tuple to return them as a tuple

  • aggregate_strategy (str) – if aggregate_function is not None, aggregate_strategy decides when aggregate_function is applied. 'overlap': apply to all samples that have an overlapping index; 'mismatch': apply to all samples that have an overlapping index and a different value

Return type

DataFrame

Returns

data frame with values

Raises
  • ValueError – if different labels are found for a requested scheme under the same index entry

  • ValueError – if original_column_names is True and two columns in the returned data frame have the same name and cannot be joined due to overlapping data or different data type

  • TypeError – if labels of different data type are found for a requested scheme

Examples

Return all labels that match a requested scheme.

>>> import audb
>>> db = audb.load(
...     "emodb", version="1.4.1", only_metadata=True, full_path=False, verbose=False
... )
>>> db.get("emotion").head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
>>> db.get("transcription").head()
                                        transcription
file
wav/03a01Fa.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a01Nc.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a01Wa.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a02Fc.wav     Das will sie am Mittwoch abgeben.
wav/03a02Nc.wav     Das will sie am Mittwoch abgeben.
>>> db.get("emotion", ["transcription"], map=False).head()
                   emotion transcription
file
wav/03a01Fa.wav  happiness           a01
wav/03a01Nc.wav    neutral           a01
wav/03a01Wa.wav      anger           a01
wav/03a02Fc.wav  happiness           a02
wav/03a02Nc.wav    neutral           a02

Non-existent schemes are ignored.

>>> db.get("emotion", ["non-existing"]).head()
                   emotion non-existing
file
wav/03a01Fa.wav  happiness          NaN
wav/03a01Nc.wav    neutral          NaN
wav/03a01Wa.wav      anger          NaN
wav/03a02Fc.wav  happiness          NaN
wav/03a02Nc.wav    neutral          NaN

Limit to a particular table or split.

>>> db.get("emotion", tables=["emotion.categories.train.gold_standard"]).head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
>>> db.get("emotion", splits=["test"]).head()
                   emotion
file
wav/12a01Fb.wav  happiness
wav/12a01Lb.wav    boredom
wav/12a01Nb.wav    neutral
wav/12a01Wc.wav      anger
wav/12a02Ac.wav       fear

Return requested scheme name independent of column ID.

>>> db["emotion"].columns
emotion:
  {scheme_id: emotion, rater_id: gold}
emotion.confidence:
  {scheme_id: confidence, rater_id: gold}
>>> db.get("confidence").head()
                 confidence
file
wav/03a01Fa.wav        0.90
wav/03a01Nc.wav        1.00
wav/03a01Wa.wav        0.95
wav/03a02Fc.wav        0.85
wav/03a02Nc.wav        1.00

If strict is True only values that have an attached scheme are returned.

>>> db.get("emotion.confidence").head()
                 emotion.confidence
file
wav/03a01Fa.wav                0.90
wav/03a01Nc.wav                1.00
wav/03a01Wa.wav                0.95
wav/03a02Fc.wav                0.85
wav/03a02Nc.wav                1.00
>>> db.get("emotion.confidence", strict=True).head()
Empty DataFrame
Columns: [emotion.confidence]
Index: []

If more than one value exists for the requested scheme and index entry, an error is raised; aggregate_function can be used to combine the values.

>>> # Add a shuffled version of emotion ratings as `random` column
>>> import audformat
>>> db["emotion"]["random"] = audformat.Column(scheme_id="emotion")
>>> db["emotion"]["random"].set(
...     db["emotion"]["emotion"].get().sample(frac=1, random_state=1)
... )
>>> db.get("emotion")
Traceback (most recent call last):
    ...
ValueError: Found overlapping data in column 'emotion':
                      left    right
file
wav/03a01Nc.wav    neutral  disgust
wav/03a01Wa.wav      anger  neutral
wav/03a02Fc.wav  happiness  neutral
wav/03a02Ta.wav    sadness  boredom
wav/03a02Wb.wav      anger  sadness
wav/03a04Ad.wav       fear  neutral
wav/03a04Fd.wav  happiness    anger
wav/03a04Nc.wav    neutral  sadness
wav/03a04Wc.wav      anger  boredom
wav/03a05Aa.wav       fear  sadness
...
>>> db.get("emotion", aggregate_function=lambda y: y[0]).head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral

Alternatively, use original_column_names to return column IDs.

>>> db.get("emotion", original_column_names=True).head()
                   emotion     random
file
wav/03a01Fa.wav  happiness  happiness
wav/03a01Nc.wav    neutral    disgust
wav/03a01Wa.wav      anger    neutral
wav/03a02Fc.wav  happiness    neutral
wav/03a02Nc.wav    neutral    neutral

is_portable

DatabaseIterator.is_portable

Check if database can be moved to another location.

To be portable, media files must not be referenced with an absolute path, and paths must not contain \, ., or .. components. If a database is portable it can be moved to another folder or updated by another database.

Returns

True if the database is portable
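The rules above can be checked per media path roughly as follows (a sketch of the documented rules, not audformat's actual implementation):

```python
import os


def looks_portable(path):
    """Apply the documented portability rules to a single media path."""
    parts = path.split("/")
    return (
        not os.path.isabs(path)  # no absolute paths
        and "\\" not in path     # no backslashes
        and "." not in parts     # no '.' path components
        and ".." not in parts    # no '..' path components
    )
```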

languages

DatabaseIterator.languages

List of included languages

license

DatabaseIterator.license

License of database

license_url

DatabaseIterator.license_url

URL of database license

load()

static DatabaseIterator.load(root, *, name='db', load_data=False, num_workers=1, verbose=False)

Load database from disk.

Expects a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv|parquet|pkl]. Media files should be located under root.

Parameters
  • root (str) – root directory

  • name (str) – base name of header and table files

  • load_data (bool) – if False, audformat.Table data is only loaded on demand, e.g. when audformat.Table.get() is called for the first time. Set to True to load all audformat.Table data immediately

  • num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar

Return type

Database

Returns

database object

Raises
  • FileNotFoundError – if the database header file cannot be found under root

  • RuntimeError – if a CSV or PARQUET table file is newer than the corresponding PKL file

load_header_from_yaml()

static DatabaseIterator.load_header_from_yaml(header)

Load database header from YAML.

Parameters

header (dict) – YAML header definition

Return type

Database

Returns

database object

map_files()

DatabaseIterator.map_files(func, num_workers=1, verbose=False)

Apply function to file names in all tables.

If speed is crucial, see audformat.utils.map_file_path() for further hints on how to optimize your code.

Parameters
  • func (Callable[[str], str]) – map function

  • num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar

media

DatabaseIterator.media

Dictionary of media information

meta

DatabaseIterator.meta

Dictionary with meta fields

misc_tables

DatabaseIterator.misc_tables

Dictionary of miscellaneous tables

name

DatabaseIterator.name

Name of database

organization

DatabaseIterator.organization

Organization that created the database

pick_files()

DatabaseIterator.pick_files(files, num_workers=1, verbose=False)

Pick files from tables.

Iterate through all tables and keep only rows with a reference to listed or matching files.

Parameters
  • files (Union[str, Sequence[str], Callable[[str], bool]]) – list of files or condition function

  • num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar

pick_tables()

DatabaseIterator.pick_tables(table_ids)

Pick (miscellaneous) tables by ID.

Parameters

table_ids (Union[str, Sequence[str]]) – table IDs to pick

raters

DatabaseIterator.raters

Dictionary of raters

root

DatabaseIterator.root

Database root directory.

Returns None if database has not been stored yet.

Returns

root directory

save()

DatabaseIterator.save(root, *, name='db', indent=2, storage_format='parquet', update_other_formats=True, header_only=False, num_workers=1, verbose=False)

Save database to disk.

Creates a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv|parquet|pkl].

Existing files will be overwritten. If update_other_formats is True, existing files in other storage formats are overwritten as well.

Parameters
  • root (str) – root directory (possibly created)

  • name (str) – base name of files

  • indent (int) – indent size

  • storage_format (str) – storage format of tables. See audformat.define.TableStorageFormat for available formats

  • update_other_formats (bool) – if True it will not only save to the given storage_format, but update all files stored in other storage formats as well

  • header_only (bool) – store header only

  • num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar

schemes

DatabaseIterator.schemes

Dictionary of schemes

segments

DatabaseIterator.segments

Segments referenced in the database.

Returns

segments

source

DatabaseIterator.source

Database source

splits

DatabaseIterator.splits

Dictionary of splits

tables

DatabaseIterator.tables

Dictionary of audformat tables

to_dict()

DatabaseIterator.to_dict()

Serialize object to dictionary.

Return type

dict

Returns

dictionary with attributes

update()

DatabaseIterator.update(others, *, copy_attachments=False, copy_media=False, overwrite=False)

Update database with other database(s).

In order to update a database, license and usage have to match. Labels and values of schemes with the same ID are combined. Media, raters, schemes, splits, and attachments that are not part of the database yet are added. Other fields will be updated by applying the following rules:

field          result
author         'db.author, other.author'
description    db.description
expires        min(db.expires, other.expires)
languages      db.languages + other.languages
license_url    db.license_url
meta           db.meta + other.meta
name           db.name
organization   'db.organization, other.organization'
source         'db.source, other.source'
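The merge rules for these header fields can be expressed as small functions (an illustrative sketch of the rules listed above, not audformat code):

```python
# How update() combines selected header fields of db and other,
# per the rules documented above
merge_rules = {
    "author": lambda db, other: f"{db}, {other}",  # comma-joined strings
    "expires": lambda db, other: min(db, other),   # earliest date wins
    "languages": lambda db, other: db + other,     # lists concatenated
    "description": lambda db, other: db,           # db value is kept
}

author = merge_rules["author"]("Alice", "Bob")
```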

Parameters
  • others (Union[Database, Sequence[Database]]) – database object(s)

  • copy_attachments (bool) – if True it copies the attachment files associated with others to the current database root folder

  • copy_media (bool) – if True it copies the media files associated with others to the current database root folder

  • overwrite (bool) – overwrite table values where indices overlap

Return type

Database

Returns

the updated database

Raises
  • ValueError – if database has different license or usage

  • ValueError – if different media, rater, scheme, split, or attachment with same ID is found

  • ValueError – if schemes cannot be combined, e.g. labels have different dtype

  • ValueError – if tables cannot be combined (e.g. values in same position overlap or level and dtypes of table indices do not match)

  • RuntimeError – if copy_media or copy_attachments is True, but one of the involved databases was not saved (contains files but no root folder)

  • RuntimeError – if any involved database is not portable

usage

DatabaseIterator.usage

Usage permission.

Possible return values are given by audformat.define.Usage.