audformat

assert_index

audformat.assert_index(obj)[source]

Assert that an object conforms to the table specifications.

Parameters

obj (Union[Index, Series, DataFrame]) – object

Raises

ValueError – if the object does not conform to the table specifications

Column

class audformat.Column(*, scheme_id=None, rater_id=None, description=None, meta=None)[source]

Table column.

Represents a table column (see audformat.Table) and optionally links it to a scheme (see audformat.Scheme) and a rater (see audformat.Rater).

Parameters

Example

>>> Column(scheme_id='emotion')
{scheme_id: emotion}
get(index=None, *, map=None, copy=True)[source]

Get labels.

By default, all labels of the column are returned. Use index to get a subset.

Examples are provided with the table specifications.

Parameters
  • index (Optional[Index]) – index conforming to the table specifications

  • copy (bool) – return a copy of the labels

  • map (Optional[str]) – map scheme or scheme field to column values. For example, if your column holds speaker IDs and is assigned to a scheme that contains a dict mapping speaker IDs to age entries, map='age' will replace the ID values with the age of the speaker

Return type

Series

Returns

labels

Raises
  • RuntimeError – if column is not assigned to a table

  • ValueError – if trying to map without a scheme

  • ValueError – if trying to map from a scheme that has no labels

  • ValueError – if trying to map to a non-existing field
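The effect of map can be pictured as a dictionary lookup. The following is a minimal sketch in plain Python, not audformat's implementation. The names speaker_labels, column_values and map_values, as well as the data, are made up for illustration.

```python
# Scheme labels as a dict: speaker ID -> fields (hypothetical data).
speaker_labels = {
    0: {'age': 33, 'gender': 'female'},
    1: {'age': 44, 'gender': 'male'},
}

# Column values: one speaker ID per file (hypothetical data).
column_values = {'a.wav': 0, 'b.wav': 1, 'c.wav': 0}


def map_values(values, labels, field):
    """Replace every ID by the entry stored under ``field`` for that ID."""
    if not any(field in entry for entry in labels.values()):
        raise ValueError(f"field '{field}' does not exist")
    return {
        file: labels[label].get(field)
        for file, label in values.items()
    }


# Comparable in spirit to requesting map='age' for the column.
ages = map_values(column_values, speaker_labels, 'age')
print(ages)
```

Here every speaker ID is replaced by the age stored for that speaker, and asking for a field that no label defines raises a ValueError, mirroring the last error case listed above.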

property rater: Optional[audformat.core.rater.Rater]

Rater object.

Return type

Optional[Rater]

Returns

rater object or None if not available

rater_id

Rater identifier

property scheme: Optional[audformat.core.scheme.Scheme]

Scheme object.

Return type

Optional[Scheme]

Returns

scheme object or None if not available

scheme_id

Scheme identifier

set(values, *, index=None)[source]

Set labels.

By default, all labels of the column are replaced. Use index to set a subset. If the column is assigned to a Scheme, the values have to match its dtype.

Examples are provided with the table specifications.

Parameters
Raises
  • RuntimeError – if column is not assigned to a table

  • ValueError – if trying to set values of a filewise column using a segmented index

  • ValueError – if values do not match scheme

property table

Table object.

Returns

table object or None if not assigned yet

filewise_index

audformat.filewise_index(files=None)[source]

Create a filewise index.

The index conforms to the table specifications.

Parameters

files (Union[str, Sequence[str], Index, Series, None]) – list of files

Return type

Index

Returns

filewise index

Raises

ValueError – if created index contains duplicates

Example

>>> filewise_index(['a.wav', 'b.wav'])
Index(['a.wav', 'b.wav'], dtype='object', name='file')

segmented_index

audformat.segmented_index(files=None, starts=None, ends=None)[source]

Create a segmented index.

The index conforms to the table specifications.

If a non-empty index is created and starts is set to None, the level will be filled up with 0. If a non-empty index is created and ends is set to None, the level will be filled up with NaT.

Parameters
Return type

Index

Returns

segmented index

Raises
  • ValueError – if created index contains duplicates

  • ValueError – if files, starts and ends differ in size

Example

>>> segmented_index('a.wav', 0, 1.1)
MultiIndex([('a.wav', '0 days', '0 days 00:00:01.100000')],
           names=['file', 'start', 'end'])
>>> segmented_index('a.wav', '0ms', '1ms')
MultiIndex([('a.wav', '0 days', '0 days 00:00:00.001000')],
           names=['file', 'start', 'end'])
>>> segmented_index(['a.wav', 'b.wav'])
MultiIndex([('a.wav', '0 days', NaT),
            ('b.wav', '0 days', NaT)],
           names=['file', 'start', 'end'])
>>> segmented_index(['a.wav', 'b.wav'], [None, 1], [1, None])
MultiIndex([('a.wav',               NaT, '0 days 00:00:01'),
            ('b.wav', '0 days 00:00:01',               NaT)],
           names=['file', 'start', 'end'])
>>> segmented_index(
...     files=['a.wav', 'a.wav'],
...     starts=[0, 1],
...     ends=pd.to_timedelta([1000, 2000], unit='ms'),
... )
MultiIndex([('a.wav', '0 days 00:00:00', '0 days 00:00:01'),
            ('a.wav', '0 days 00:00:01', '0 days 00:00:02')],
           names=['file', 'start', 'end'])

Database

class audformat.Database(name, source='', usage='unrestricted', *, expires=None, languages=None, description=None, author=None, organization=None, license=None, license_url=None, meta=None)[source]

Database object.

A database consists of a header holding raters, schemes, splits, and other meta information. In addition, it links to a number of tables listing files and labels.

Parameters
Raises

Example

>>> db = Database(
...     'mydb',
...     'https://www.audeering.com/',
...     define.Usage.COMMERCIAL,
...     languages=['English', 'de'],
... )
>>> db
name: mydb
source: https://www.audeering.com/
usage: commercial
languages: [eng, deu]
>>> labels = ['positive', 'neutral', 'negative']
>>> db.schemes['emotion'] = Scheme(
...     labels=labels,
... )
>>> db.raters['rater'] = Rater()
>>> db.media['audio'] = Media(
...     define.MediaType.AUDIO,
...     format='wav',
...     sampling_rate=16000,
... )
>>> db['table'] = Table(
...     media_id='audio',
... )
>>> db['table']['column'] = Column(
...     scheme_id='emotion',
...     rater_id='rater',
... )
>>> db
name: mydb
source: https://www.audeering.com/
usage: commercial
languages: [eng, deu]
media:
  audio: {type: audio, format: wav, sampling_rate: 16000}
raters:
  rater: {type: human}
schemes:
  emotion:
    dtype: str
    labels: [positive, neutral, negative]
tables:
  table:
    type: filewise
    media_id: audio
    columns:
      column: {scheme_id: emotion, rater_id: rater}
__contains__(table_id)[source]

Check if table exists.

Parameters

table_id (str) – table identifier

Return type

bool

__getitem__(table_id)[source]

Get table from database.

Parameters

table_id (str) – table identifier

Return type

Table

__setitem__(table_id, table)[source]

Add table to database.

Parameters
  • table_id (str) – table identifier

  • table (Table) – the table

Raises

BadIdError – if table has a split_id or media_id that is not specified in the underlying database

Return type

Table

author

Author(s) of database

drop_files(files, num_workers=1, verbose=False)[source]

Drop files from tables.

Iterate through all tables and remove rows with a reference to listed or matching files.

Parameters
  • files (Union[str, Sequence[str], Callable[[str], bool]]) – list of files or condition function

  • num_workers (Optional[int]) – number of parallel jobs. If None, it will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar
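The files argument accepts a single file, a list of files, or a condition function. The following sketch shows how these three forms could be resolved for a list of file names. It is plain Python under stated assumptions, not the library's code, and the helper matches and the file names are hypothetical.

```python
from typing import Callable, Sequence, Union


def matches(
    file: str,
    files: Union[str, Sequence[str], Callable[[str], bool]],
) -> bool:
    """Return True if ``file`` is listed in or matched by ``files``."""
    if callable(files):
        return bool(files(file))
    if isinstance(files, str):
        return file == files
    return file in files


table_files = ['train/a.wav', 'train/b.wav', 'test/c.wav']

# Drop by explicit list: keep the rows that are NOT listed.
kept_by_list = [f for f in table_files if not matches(f, ['train/a.wav'])]

# Drop by condition function: keep the rows for which it returns False.
kept_by_condition = [
    f for f in table_files
    if not matches(f, lambda name: name.startswith('test/'))
]
print(kept_by_list, kept_by_condition)
```

Picking files works the other way round: rows are kept when the file is listed or the condition returns True.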

drop_tables(table_ids)[source]

Drop tables by ID.

Parameters

table_ids (Union[str, Sequence[str]]) – table IDs to drop

expires

Expiry date

property files: pandas.core.indexes.base.Index

Files referenced in the database.

Includes files from filewise and segmented tables.

Return type

Index

Returns

files

property is_portable: bool

Check if a database can be moved to another location.

To be portable, media must not be referenced with an absolute path, and paths must not contain . or .. to specify a folder. If a database is portable, it can be moved to another folder or updated by another database.

Return type

bool

Returns

True if the database is portable
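The portability rule can be sketched as a check on a single media path. This is an illustration of the rule as stated above, not audformat's implementation. The function name looks_portable is hypothetical, and treating both POSIX and Windows style absolute paths as non-portable is an assumption.

```python
import ntpath
import posixpath


def looks_portable(path: str) -> bool:
    """Check a single media path against the rule stated above."""
    # Absolute paths (POSIX or Windows style) tie the database to one machine.
    if posixpath.isabs(path) or ntpath.isabs(path):
        return False
    # '.' and '..' folder components are not allowed either.
    parts = path.replace('\\', '/').split('/')
    return not any(part in ('.', '..') for part in parts)


for candidate in ['audio/a.wav', '/data/a.wav', './a.wav', 'audio/../a.wav']:
    print(candidate, looks_portable(candidate))
```

A database would then count as portable only if the check passes for every referenced file.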

languages

List of included languages

license

License of database

license_url

URL of database license

static load(root, *, name='db', load_data=True, num_workers=1, verbose=False)[source]

Load database from disk.

Expects a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv|pkl]. Media files should be located under root.

Parameters
  • root (str) – root directory

  • name (str) – base name of header and table files

  • load_data (bool) – if False, the audformat.Table objects will contain empty tables

  • num_workers (Optional[int]) – number of parallel jobs. If None, it will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar

Return type

Database

Returns

database object
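To make the expected file layout concrete, the following sketch lists the header and table files for a given root, name, and set of table IDs. The helper expected_files, the root folder, and the table IDs are hypothetical and only illustrate the naming scheme described above.

```python
import posixpath


def expected_files(root, name, table_ids, storage_format='csv'):
    """List header and table files following the naming scheme above."""
    header = posixpath.join(root, f'{name}.yaml')
    tables = [
        posixpath.join(root, f'{name}.{table_id}.{storage_format}')
        for table_id in table_ids
    ]
    return [header] + tables


files = expected_files('/data/mydb', 'db', ['train', 'test'])
for file in files:
    print(file)
```

With the default name 'db', a database with the tables train and test is therefore expected as db.yaml, db.train.csv and db.test.csv under the root folder (or with the pkl extension if stored in that format).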

static load_header_from_yaml(header)[source]

Load database header from YAML.

Parameters

header (dict) – YAML header definition

Return type

Database

Returns

database object

map_files(func, num_workers=1, verbose=False)[source]

Apply function to file names in all tables.

Relies on pandas.Index.map(), which can be slow. If speed is crucial, consider changing the index directly. In the following example we prefix every file with a folder:

root = '/root/'
for table in db.tables.values():
    if table.is_filewise:
        table.df.index = root + table.df.index
        table.df.index.name = audformat.define.IndexField.FILE
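    elif len(table.df.index) > 0:
        # set_levels() returns a new index, so assign it back
        # (the inplace argument is not available in recent pandas versions)
        table.df.index = table.df.index.set_levels(
            root + table.df.index.levels[0],
            level=audformat.define.IndexField.FILE,
        )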
Parameters
  • func (Callable[[str], str]) – map function

  • num_workers (Optional[int]) – number of parallel jobs. If None, it will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar
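A map function takes a file name and returns the new file name. As a small sketch, the hypothetical function to_flac below replaces the file extension. It is applied to a plain list here so that the example runs without a database.

```python
import posixpath


def to_flac(file: str) -> str:
    """Map function: replace the file extension by '.flac'."""
    base, _ = posixpath.splitext(file)
    return base + '.flac'


files = ['audio/a.wav', 'audio/b.wav']
mapped = [to_flac(f) for f in files]
print(mapped)

# With a database object `db` (not created here), the same function
# could be passed as `db.map_files(to_flac)`.
```

Note that such a function only changes the file names stored in the tables. Converting the media files themselves is a separate step.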

media

Dictionary of media information

name

Name of database

organization

Organization that created the database

pick_files(files, num_workers=1, verbose=False)[source]

Pick files from tables.

Iterate through all tables and keep only rows with a reference to listed files or matching files.

Parameters
  • files (Union[str, Sequence[str], Callable[[str], bool]]) – list of files or condition function

  • num_workers (Optional[int]) – number of parallel jobs. If None, it will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar

pick_tables(table_ids)[source]

Pick tables by ID.

Parameters

table_ids (Union[str, Sequence[str]]) – table IDs to pick

raters

Dictionary of raters

property root: Optional[str]

Database root directory.

Returns None if database has not been stored yet.

Return type

Optional[str]

Returns

root directory

save(root, *, name='db', indent=2, storage_format='csv', update_other_formats=True, header_only=False, num_workers=1, verbose=False)[source]

Save database to disk.

Creates a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv|pkl].

Existing files will be overwritten. If update_other_formats is True, existing files in other formats are overwritten as well.

Parameters
  • root (str) – root directory (possibly created)

  • name (str) – base name of files

  • indent (int) – indent size

  • storage_format (str) – storage format of tables. See audformat.define.TableStorageFormat for available formats

  • update_other_formats (bool) – if True it will not only save to the given storage_format, but update all files stored in other storage formats as well

  • header_only (bool) – store header only

  • num_workers (Optional[int]) – number of parallel jobs. If None, it will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar

schemes

Dictionary of schemes

property segments: pandas.core.indexes.multi.MultiIndex

Segments referenced in the database.

Return type

MultiIndex

Returns

segments

source

Database source

splits

Dictionary of splits

tables

Dictionary of tables

update(others, *, copy_media=False, overwrite=False)[source]

Update database with other database(s).

In order to update a database, license and usage have to match. Media, raters, schemes and splits that are not part of the database yet are added. Other fields will be updated by applying the following rules:

field         result
------------  -------------------------------------
author        'db.author, other.author'
description   db.description
expires       min(db.expires, other.expires)
languages     db.languages + other.languages
license_url   db.license_url
meta          db.meta + other.meta
name          db.name
organization  'db.organization, other.organization'
source        'db.source, other.source'

Parameters
  • others (Union[Database, Sequence[Database]]) – database object(s)

  • copy_media (bool) – if True it copies the media files associated with others to the current database root folder

  • overwrite (bool) – overwrite table values where indices overlap

Return type

Database

Returns

the updated database

Raises
  • ValueError – if database has different license or usage

  • ValueError – if different media, rater, scheme or split with same ID is found

  • ValueError – if table data cannot be combined (e.g. values in same position overlap)

  • RuntimeError – if copy_media=True, but one of the involved databases was not saved (contains files but no root folder)

  • RuntimeError – if any involved database is not portable
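The header rules listed above can be illustrated with plain dictionaries. This is a simplified sketch, not the library's implementation: merge_header and the example values are hypothetical, meta is left out, and expiry dates are assumed to be set on both sides.

```python
import datetime


def merge_header(db: dict, other: dict) -> dict:
    """Merge two header dicts following the rules listed above."""
    for field in ('license', 'usage'):
        if db[field] != other[field]:
            raise ValueError(f'{field} has to match')
    merged = dict(db)  # name, description, license_url, ... stay as they are
    # Text fields are joined with a comma.
    for field in ('author', 'organization', 'source'):
        merged[field] = f'{db[field]}, {other[field]}'
    # The earlier expiry date is kept, language lists are concatenated.
    merged['expires'] = min(db['expires'], other['expires'])
    merged['languages'] = db['languages'] + other['languages']
    return merged


db_header = {
    'name': 'mydb',
    'author': 'A',
    'organization': 'Org A',
    'source': 'src-a',
    'license': 'CC0',
    'usage': 'research',
    'expires': datetime.date(2030, 1, 1),
    'languages': ['eng'],
}
other_header = {
    'name': 'other',
    'author': 'B',
    'organization': 'Org B',
    'source': 'src-b',
    'license': 'CC0',
    'usage': 'research',
    'expires': datetime.date(2025, 1, 1),
    'languages': ['deu'],
}
merged = merge_header(db_header, other_header)
print(merged['author'], merged['expires'], merged['languages'])
```

The name of the updated database is kept, while the author, organization and source of both databases are joined with a comma.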

usage

Usage permission

index_type

audformat.index_type(obj)[source]

Derive index type.

Parameters

obj (Union[Index, Series, DataFrame]) – object conforming to the table specifications

Return type

IndexType

Returns

table type

Raises

ValueError – if the object does not conform to the table specifications

Example

>>> index_type(filewise_index())
'filewise'
>>> index_type(segmented_index())
'segmented'

Media

class audformat.Media(type='other', *, format=None, sampling_rate=None, channels=None, bit_depth=None, video_fps=None, video_resolution=None, video_channels=None, video_depth=None, description=None, meta=None)[source]

Media information.

File format is always converted to lower case.

Parameters
Raises

BadValueError – if an invalid type is passed

Example

>>> Media(
...     type=define.MediaType.AUDIO,
...     format='wav',
...     sampling_rate=16000,
...     channels=2,
... )
{type: audio, format: wav, channels: 2, sampling_rate: 16000}
bit_depth

Audio bit depth

channels

Audio channels

format

File format

sampling_rate

Audio sampling rate in Hz

type

Media type

video_channels

Video channels per pixel

video_depth

Video bit depth

video_fps

Video frames per second

video_resolution

Video resolution

Rater

class audformat.Rater(type='human', *, description=None, meta=None)[source]

A rater is the author of an annotation.

Parameters
Raises

BadValueError – if an invalid type is passed

Example

>>> Rater(define.RaterType.HUMAN)
{type: human}
type

Rater type

Scheme

class audformat.Scheme(dtype=None, *, labels=None, minimum=None, maximum=None, description=None, meta=None)[source]

A scheme defines valid values of an annotation.

Allowed values for dtype are: 'bool', 'str', 'int', 'float', 'time', and 'date' (see audformat.define.DataType). Values can be restricted to a set of labels provided by a list or a dictionary. A continuous range can be limited by a minimum and maximum value.

Parameters
Raises

Example

>>> Scheme()
{dtype: str}
>>> Scheme(labels=['a', 'b', 'c'])
dtype: str
labels: [a, b, c]
>>> Scheme(define.DataType.INTEGER)
{dtype: int}
>>> Scheme(float, minimum=0, maximum=1)
{dtype: float, minimum: 0, maximum: 1}
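The two ways of restricting values, by labels or by a range, can be sketched as a simple validity check. This is an illustration only: the function is_valid is hypothetical, and treating minimum and maximum as inclusive bounds is an assumption.

```python
def is_valid(value, *, labels=None, minimum=None, maximum=None) -> bool:
    """Check a value against labels or against a minimum/maximum range."""
    if labels is not None:
        # Works for a list of labels and for the keys of a label dict.
        return value in labels
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


print(is_valid('a', labels=['a', 'b', 'c']))
print(is_valid(1, labels={0: {'gender': 'female'}, 1: {'gender': 'male'}}))
print(is_valid(1.5, minimum=0, maximum=1))
```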
draw(n, *, str_len=10, p_none=None)[source]

Randomly draws values from scheme.

Parameters
  • n (int) – number of values

  • str_len (int) – string length if drawing from a string scheme without labels

  • p_none (Optional[bool]) – probability for drawing an invalid value

Return type

list

Returns

list with values

dtype

Data type

property is_numeric: bool

Check if data type is numeric.

Return type

bool

Returns

True if data type is numeric.

labels

List of labels

maximum

Maximum value

minimum

Minimum value

replace_labels(labels)[source]

Replace labels.

If the scheme is part of an audformat.Database, the dtype of all audformat.Column objects that reference the scheme will be updated. Removed labels are set to NaN.

Parameters

labels (Union[dict, list]) – new labels

Raises
  • ValueError – if scheme does not define labels

  • ValueError – if dtype of new labels does not match dtype of scheme

Example

>>> speaker = Scheme(
...     labels={
...         0: {'gender': 'female'},
...         1: {'gender': 'male'},
...     }
... )
>>> speaker
dtype: int
labels:
  0: {gender: female}
  1: {gender: male}
>>> speaker.replace_labels(
...     {
...         1: {'gender': 'male', 'age': 33},
...         2: {'gender': 'female', 'age': 44},
...     }
... )
>>> speaker
dtype: int
labels:
  1: {gender: male, age: 33}
  2: {gender: female, age: 44}
to_pandas_dtype()[source]

Convert data type to pandas data type.

If labels is not None, pandas.CategoricalDtype is returned. Otherwise the following rules are applied:

  • str -> str

  • int -> Int64 (to allow NaN)

  • float -> float

  • time -> timedelta64[ns]

  • date -> datetime64[ns]

Return type

Union[str, CategoricalDtype]

Returns

pandas data type

Split

Tables can be classified by splits. A split type is usually one of the values defined in define.SplitType.

class audformat.Split(type='other', *, description=None, meta=None)[source]

Database split.

Defines if a subset of a database should be used for training, development or testing.

Parameters
Raises

BadValueError – if an invalid type is passed

Example

>>> Split(define.SplitType.TEST)
{type: test}
type

Split type

Table

Annotation data is organized in tables, which consist of file names and columns that assign labels or numerical values to the files.

There are two types of tables:

  • define.TableType.FILEWISE tables annotate whole files

  • define.TableType.SEGMENTED tables annotate file segments

class audformat.Table(index=None, *, split_id=None, media_id=None, description=None, meta=None)[source]

Table with annotation data.

Consists of a list of file names to which it assigns numerical values or labels. To fill a table with labels, add one or more audformat.Column objects and use audformat.Table.set() to set the values.

Parameters
Raises

ValueError – if index does not conform to the table specifications

Example

>>> index = filewise_index(['f1', 'f2', 'f3'])
>>> table = Table(
...     index,
...     split_id=define.SplitType.TEST,
... )
>>> table['values'] = Column()
>>> table
type: filewise
split_id: test
columns:
  values: {}
>>> table.get()
     values
file
f1      NaN
f2      NaN
f3      NaN
>>> table.set({'values': [0, 1, 2]})
>>> table.get()
     values
file
f1        0
f2        1
f3        2
>>> table.get(index[:2])
     values
file
f1        0
f2        1
>>> index_ex = filewise_index('f4')
>>> table_ex = table.extend_index(
...     index_ex,
...     inplace=False,
... )
>>> table_ex.get()
     values
file
f1        0
f2        1
f3        2
f4      NaN
>>> table_ex.set(
...     {'values': 3},
...     index=index_ex,
... )
>>> table_ex.get()
     values
file
f1        0
f2        1
f3        2
f4        3
>>> table_str = Table(index)
>>> table_str['strings'] = Column()
>>> table_str.set({'strings': ['a', 'b', 'c']})
>>> (table + table_str).get()
     values strings
file
f1        0       a
f2        1       b
f3        2       c
>>> (table_ex + table_str).get()
     values strings
file
f1        0       a
f2        1       b
f3        2       c
f4        3     NaN
__add__(other)[source]

Create new table by combining two tables.

The new table contains index and columns of both tables. Missing values will be set to NaN. If at least one table is segmented, the output has a segmented index.

Columns with the same identifier are combined into a single column. This requires that:

  1. both columns have the same dtype

  2. in places where the indices overlap the values of both columns match or one column contains NaN

Media and split information, as well as references to schemes and raters, are discarded. If you intend to keep them, use audformat.Table.update().

Parameters

other (Table) – the other table

Raises
  • ValueError – if columns with the same name have different dtypes

  • ValueError – if values in the same position do not match

Return type

Table
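The rule for combining columns with the same identifier can be sketched with plain dictionaries that map index entries to values. This is not the library's code: combine_columns and is_missing are hypothetical helpers, None stands in for NaN, and the dtype check is left out.

```python
import math


def is_missing(value) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def combine_columns(a: dict, b: dict) -> dict:
    """Combine two columns given as ``{index: value}`` dicts."""
    combined = dict(a)
    for key, value in b.items():
        if key not in combined or is_missing(combined[key]):
            # New index entry, or the first column has no value here.
            combined[key] = value
        elif not is_missing(value) and combined[key] != value:
            raise ValueError(f'values do not match at {key!r}')
    return combined


left = {'f1': 0, 'f2': None}
right = {'f2': 1, 'f3': 2}
result = combine_columns(left, right)
print(result)
```

The missing value for f2 in the first column is filled from the second one, whereas two different non-missing values at the same position would raise a ValueError.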

__getitem__(column_id)[source]

Return view to a column.

Parameters

column_id (str) – column identifier

Return type

Column

__setitem__(column_id, column)[source]

Add new column to table.

Parameters
  • column_id (str) – column identifier

  • column (Column) – column

Raises

BadIdError – if a column with a scheme_id or rater_id is added that does not exist

Return type

Column

columns

Table columns

copy()[source]

Copy table.

Return type

Table

Returns

new Table object

property db

Database object.

Returns

database object or None if not assigned yet

property df: pandas.core.frame.DataFrame

Table data.

Return type

DataFrame

Returns

data

drop_columns(column_ids, *, inplace=False)[source]

Drop columns by ID.

Parameters
Return type

Table

Returns

new Table if inplace=False, otherwise self

drop_files(files, *, inplace=False)[source]

Drop files.

Remove rows with a reference to listed or matching files.

Parameters
Return type

Table

Returns

new Table if inplace=False, otherwise self

drop_index(index, *, inplace=False)[source]

Drop rows from index.

Parameters
Return type

Table

Returns

new Table if inplace=False, otherwise self

Raises

ValueError – if table type is not matched

property ends: pandas.core.indexes.base.Index

Segment end times.

Return type

Index

Returns

timestamps

extend_index(index, *, fill_values=None, inplace=False)[source]

Extend table by new rows.

Parameters
  • index (Index) – index conforming to the table specifications

  • fill_values (Union[Any, Dict[str, Any], None]) – replace NaN with these values (either a scalar applied to all columns or a dictionary with column name as key)

  • inplace (bool) – extend index in place

Return type

Table

Returns

new Table if inplace=False, otherwise self

Raises

ValueError – if index type is not matched

property files: pandas.core.indexes.base.Index

Files referenced in the table.

Return type

Index

Returns

files

get(index=None, *, map=None, copy=True)[source]

Get labels.

By default, all labels of the table are returned. Use index to get a subset.

Examples are provided with the table specifications.

Parameters
  • index (Optional[Index]) – index conforming to the table specifications

  • copy (bool) – return a copy of the labels

  • map (Optional[Dict[str, Union[str, Sequence[str]]]]) – map scheme or scheme fields to column values. For example, if your table holds a column speaker with speaker IDs, which is assigned to a scheme that contains a dict mapping speaker IDs to age and gender entries, map={'speaker': ['age', 'gender']} will replace the column with two new columns that map ID values to age and gender, respectively. To also keep the original column with speaker IDs, use map={'speaker': ['speaker', 'age', 'gender']}

Return type

DataFrame

Returns

labels

Raises
  • RuntimeError – if table is not assigned to a database

  • ValueError – if trying to map without a scheme

  • ValueError – if trying to map from a scheme that has no labels

  • ValueError – if trying to map to a non-existing field

property index: pandas.core.indexes.base.Index

Table index.

Return type

Index

Returns

index

property is_filewise: bool

Check if filewise table.

Return type

bool

Returns

True if filewise table.

property is_segmented: bool

Check if segmented table.

Return type

bool

Returns

True if segmented table.

load(path)[source]

Load table data from disk.

Tables can be stored on disk as PKL and/or CSV files. If both files are present, the PKL file is loaded as long as its modification date is newer. Otherwise, an error is raised that asks you to delete one of the files.
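The choice between the two files can be sketched with modification times from the standard library. This is an illustration of the rule described above, not audformat's implementation: the function choose_table_file is hypothetical, the error types are assumptions, and a tie in modification time is treated as "not newer".

```python
import os
import tempfile


def choose_table_file(path: str) -> str:
    """Pick the file to load for a table stored at ``path`` (no extension)."""
    csv_file, pkl_file = path + '.csv', path + '.pkl'
    has_csv, has_pkl = os.path.exists(csv_file), os.path.exists(pkl_file)
    if has_csv and has_pkl:
        if os.path.getmtime(pkl_file) > os.path.getmtime(csv_file):
            return pkl_file
        # Assumed error type, the source does not name it.
        raise RuntimeError(
            f"'{pkl_file}' is not newer than '{csv_file}', delete one of them"
        )
    if has_pkl:
        return pkl_file
    if has_csv:
        return csv_file
    raise FileNotFoundError(f"no table file found for '{path}'")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'db.table')
    for ext in ('.csv', '.pkl'):
        open(path + ext, 'w').close()
    # Set explicit modification times so that the PKL file is newer.
    os.utime(path + '.csv', (1000, 1000))
    os.utime(path + '.pkl', (2000, 2000))
    chosen = choose_table_file(path)
    print(os.path.basename(chosen))
```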

Parameters

path (str) – file path without extension

Raises
property media: Optional[audformat.core.media.Media]

Media object.

Return type

Optional[Media]

Returns

media object or None if not available

media_id

Media ID

pick_columns(column_ids, *, inplace=False)[source]

Pick columns by ID.

All other columns will be dropped.

Parameters
Return type

Table

Returns

new Table if inplace=False, otherwise self

pick_files(files, *, inplace=False)[source]

Pick files.

Keep only rows with a reference to listed files or matching files.

Parameters
Return type

Table

Returns

new Table if inplace=False, otherwise self

pick_index(index, *, inplace=False)[source]

Pick rows from index.

Parameters
Return type

Table

Returns

new Table if inplace=False, otherwise self

Raises

ValueError – if table type is not matched

save(path, *, storage_format='csv', update_other_formats=True)[source]

Save table data to disk.

Existing files will be overwritten.

Parameters
  • path (str) – file path without extension

  • storage_format (str) – storage format of table. See audformat.define.TableStorageFormat for available formats

  • update_other_formats (bool) – if True it will not only save to the given storage_format, but update all files stored in other storage formats as well

set(values, *, index=None)[source]

Set labels.

By default, all labels of the table are replaced. Use index to select a subset. If a column is assigned to a Scheme, the values have to match its dtype.

Examples are provided with the table specifications.

Parameters
Raises

ValueError – if values do not match scheme

property split: Optional[audformat.core.split.Split]

Split object.

Return type

Optional[Split]

Returns

split object or None if not available

split_id

Split ID

property starts: pandas.core.indexes.base.Index

Segment start times.

Return type

Index

Returns

timestamps

type

Table type

update(others, *, overwrite=False)[source]

Update table with other table(s).

The table that calls update() must be assigned to a database. For all tables, media and split must match.

Columns that are not yet part of the table will be added and referenced schemes or raters are copied. For overlapping columns, schemes and raters must match.

Columns with the same identifier are combined into a single column. This requires that both columns have the same dtype and, if overwrite is set to False, that values in places where the indices overlap match or one column contains NaN. If overwrite is set to True, the value of the last table in the list is kept.

The index type of the table must not change.

Parameters
Return type

Table

Returns

the updated table

Raises
  • RuntimeError – if table is not assigned to a database

  • ValueError – if split or media does not match

  • ValueError – if overlapping columns reference different schemes or raters

  • ValueError – if a missing scheme or rater cannot be copied because a different object with the same ID exists

  • ValueError – if values in same position overlap

  • ValueError – if operation would change the index type of the table