Database
- class audformat.Database(name, source='', usage='unrestricted', *, expires=None, languages=None, description=None, author=None, organization=None, license=None, license_url=None, meta=None)
Database object.
A database consists of a header holding raters, schemes, splits, and other meta information. In addition, it links to a number of tables listing files and labels.
To get started, see how to create a database and inspect the example of the emodb database.
- Parameters
name (str) – name of database
source (str) – data source (e.g. link to website)
usage (str) – permission of usage, see audformat.define.Usage. Set to 'other' if none of the other fields fit.
languages (Union[str, Sequence[str], None]) – list of languages. Will be mapped to ISO 639-3 strings with audformat.utils.map_language()
organization (Optional[str]) – organization(s) maintaining the database
license (Union[str, License, None]) – database license. You can use a custom license or pick one from audformat.define.License. In the latter case, license_url will be automatically set if it is not given
- Raises
BadValueError – if an invalid usage value is passed
ValueError – if language is unknown
Examples
>>> db = Database(
...     "mydb",
...     "https://www.audeering.com/",
...     define.Usage.COMMERCIAL,
...     languages=["English", "de"],
... )
>>> db
name: mydb
source: https://www.audeering.com/
usage: commercial
languages: [eng, deu]
>>> labels = ["positive", "neutral", "negative"]
>>> db.schemes["emotion"] = Scheme(labels=labels)
>>> db.schemes["match"] = Scheme(dtype="bool")
>>> db.raters["rater"] = Rater()
>>> db.media["audio"] = Media(
...     define.MediaType.AUDIO,
...     format="wav",
...     sampling_rate=16000,
... )
>>> index = filewise_index(["f1.wav", "f2.wav"])
>>> db["table"] = Table(index, media_id="audio")
>>> db["table"]["column"] = Column(
...     scheme_id="emotion",
...     rater_id="rater",
... )
>>> db["table"]["column"].set(["neutral", "positive"])
>>> index = pd.Index([], dtype="string", name="idx")
>>> db["misc-table"] = MiscTable(index)
>>> db["misc-table"]["column"] = Column(scheme_id="match")
>>> db
name: mydb
source: https://www.audeering.com/
usage: commercial
languages: [eng, deu]
media:
  audio: {type: audio, format: wav, sampling_rate: 16000}
raters:
  rater: {type: human}
schemes:
  emotion:
    dtype: str
    labels: [positive, neutral, negative]
  match: {dtype: bool}
tables:
  table:
    type: filewise
    media_id: audio
    columns:
      column: {scheme_id: emotion, rater_id: rater}
misc_tables:
  misc-table:
    levels: {idx: str}
    columns:
      column: {scheme_id: match}
>>> list(db)
['misc-table', 'table']
>>> db.get("emotion")
         emotion
file
f1.wav   neutral
f2.wav  positive
__contains__()
__eq__()
__getitem__()
__iter__()
__setitem__()
- Database.__setitem__(table_id, table)
Add table to database.
- Parameters
- Raises
BadIdError – if table has a split_id or media_id, which is not specified in the underlying database
TableExistsError – if setting a miscellaneous table when a filewise or segmented table with the same ID exists (or vice versa)
- Return type
attachments
- Database.attachments
Dictionary of attachments.
- Raises
RuntimeError – if the path of a newly assigned attachment overlaps with the path of an existing attachment
drop_files()
drop_tables()
- Database.drop_tables(table_ids)
Drop (miscellaneous) tables by ID.
- Parameters
- Raises
audformat.errors.BadIdError – if a table with provided ID does not exist in the database
RuntimeError – if a misc table that is used in a scheme would be removed
dump()
files
- Database.files
Files referenced in the database.
Includes files from filewise and segmented tables.
- Returns
files
files_duration()
- Database.files_duration(files, *, root=None)
Duration of files in the database.
Use db.files_duration(db.files).sum() to get the total duration of all files in a database, or db.files_duration(db[table_id].files).sum() to get the total duration of all files assigned to a table.
Note
Durations are cached, i.e. changing the files on disk after calling this function can lead to wrong results. The cache is cleared when the database is reloaded from disk.
- Parameters
- Return type
- Returns
mapping from file to duration
- Raises
ValueError – if root is not set when using relative file names with a database that was not saved or loaded from disk
from_dict()
get()
- Database.get(scheme, additional_schemes=[], *, tables=None, splits=None, strict=False, map=True, original_column_names=False, aggregate_function=None, aggregate_strategy='mismatch')
Get labels by scheme.
Return all labels from columns assigned to an audformat.Scheme with name scheme. The request can be limited to specific tables and/or splits. By providing additional_schemes the result can be enriched with labels from other schemes (searched in all tables). If strict is False, a scheme is defined more broadly and does not only match schemes of the database, but also columns with the same name or labels of a scheme with the requested name as key. If at least one returned label belongs to a segmented table, the returned data frame has a segmented index. An aggregate_function can be provided that specifies how values are combined if more than one value is found for the same file or segment.
- Parameters
scheme (str) – scheme ID for which labels should be returned. The search can be restricted to specific tables and splits by the tables and splits arguments. Or extended to columns with that same name or the name of a label in the scheme using the strict argument
additional_schemes (Union[str, Sequence]) – scheme ID or sequence of scheme IDs for which additional labels should be returned. The search is not affected by the tables and splits arguments
tables (Union[str, Sequence, None]) – limit search for scheme to selected tables
splits (Union[str, Sequence, None]) – limit search for scheme to selected splits
strict (bool) – if False the search is extended to columns that match the name of the scheme or the name of a label in the scheme
map (bool) – if True and a requested scheme has labels with mappings, those will be returned
original_column_names (bool) – if True keep the original column names (possibly results in multiple columns). For mapped schemes, the column name before mapping is returned, e.g. when requesting 'gender' it might return a column named 'speaker'
aggregate_function (Optional[Callable[[Series], Any]]) – callable to aggregate overlapping values. The function gets a pandas.Series with overlapping values as input. E.g. set to lambda y: y.mean() to average the values or to tuple to return them as a tuple
aggregate_strategy (str) – if aggregate_function is not None, aggregate_strategy decides when aggregate_function is applied. 'overlap': apply to all samples that have an overlapping index; 'mismatch': apply to all samples that have an overlapping index and a different value
- Return type
- Returns
data frame with values
- Raises
ValueError – if different labels are found for a requested scheme under the same index entry
ValueError – if original_column_names is True and two columns in the returned data frame have the same name and cannot be joined due to overlapping data or different data type
TypeError – if labels of different data type are found for a requested scheme
Examples
Return all labels that match a requested scheme.
>>> import audb
>>> db = audb.load(
...     "emodb", version="1.4.1", only_metadata=True, full_path=False, verbose=False
... )
>>> db.get("emotion").head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
>>> db.get("transcription").head()
                                        transcription
file
wav/03a01Fa.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a01Nc.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a01Wa.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a02Fc.wav     Das will sie am Mittwoch abgeben.
wav/03a02Nc.wav     Das will sie am Mittwoch abgeben.
>>> db.get("emotion", ["transcription"], map=False).head()
                   emotion  transcription
file
wav/03a01Fa.wav  happiness            a01
wav/03a01Nc.wav    neutral            a01
wav/03a01Wa.wav      anger            a01
wav/03a02Fc.wav  happiness            a02
wav/03a02Nc.wav    neutral            a02
Non-existent schemes do not raise an error; their column is filled with NaN.
>>> db.get("emotion", ["non-existing"]).head()
                   emotion  non-existing
file
wav/03a01Fa.wav  happiness           NaN
wav/03a01Nc.wav    neutral           NaN
wav/03a01Wa.wav      anger           NaN
wav/03a02Fc.wav  happiness           NaN
wav/03a02Nc.wav    neutral           NaN
Limit to a particular table or split.
>>> db.get("emotion", tables=["emotion.categories.train.gold_standard"]).head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
>>> db.get("emotion", splits=["test"]).head()
                   emotion
file
wav/12a01Fb.wav  happiness
wav/12a01Lb.wav    boredom
wav/12a01Nb.wav    neutral
wav/12a01Wc.wav      anger
wav/12a02Ac.wav       fear
Return requested scheme name independent of column ID.
>>> db["emotion"].columns
emotion: {scheme_id: emotion, rater_id: gold}
emotion.confidence: {scheme_id: confidence, rater_id: gold}
>>> db.get("confidence").head()
                 confidence
file
wav/03a01Fa.wav        0.90
wav/03a01Nc.wav        1.00
wav/03a01Wa.wav        0.95
wav/03a02Fc.wav        0.85
wav/03a02Nc.wav        1.00
If strict is True only values that have an attached scheme are returned.
>>> db.get("emotion.confidence").head()
                 emotion.confidence
file
wav/03a01Fa.wav                0.90
wav/03a01Nc.wav                1.00
wav/03a01Wa.wav                0.95
wav/03a02Fc.wav                0.85
wav/03a02Nc.wav                1.00
>>> db.get("emotion.confidence", strict=True).head()
Empty DataFrame
Columns: [emotion.confidence]
Index: []
If more than one value exists for the requested scheme and index entry, an error is raised and aggregate_function can be used to combine the values.
>>> # Add a shuffled version of emotion ratings as `random` column
>>> db["emotion"]["random"] = Column(scheme_id="emotion")
>>> db["emotion"]["random"].set(
...     db["emotion"]["emotion"].get().sample(frac=1, random_state=1)
... )
>>> db.get("emotion")
Traceback (most recent call last):
    ...
ValueError: Found overlapping data in column 'emotion':
                      left      right
file
wav/03a01Nc.wav    neutral    disgust
wav/03a01Wa.wav      anger    neutral
wav/03a02Fc.wav  happiness    neutral
wav/03a02Ta.wav    sadness    boredom
wav/03a02Wb.wav      anger    sadness
wav/03a04Ad.wav       fear    neutral
wav/03a04Fd.wav  happiness      anger
wav/03a04Nc.wav    neutral    sadness
wav/03a04Wc.wav      anger    boredom
wav/03a05Aa.wav       fear    sadness
...
>>> db.get("emotion", aggregate_function=lambda y: y[0]).head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
Alternatively, use original_column_names to return column IDs.
>>> db.get("emotion", original_column_names=True).head()
                   emotion     random
file
wav/03a01Fa.wav  happiness  happiness
wav/03a01Nc.wav    neutral    disgust
wav/03a01Wa.wav      anger    neutral
wav/03a02Fc.wav  happiness    neutral
wav/03a02Nc.wav    neutral    neutral
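The difference between the two aggregate_strategy values described for get() can be illustrated without a database. The following is only a sketch of the documented semantics in plain Python, not audformat's implementation: combine_labels is a hypothetical helper, the inputs are plain dictionaries instead of table columns, and the aggregate function receives a list rather than a pandas.Series.

```python
from typing import Any, Callable, Dict, List


def combine_labels(
    left: Dict[str, Any],
    right: Dict[str, Any],
    aggregate_function: Callable[[List[Any]], Any],
    aggregate_strategy: str = "mismatch",
) -> Dict[str, Any]:
    """Combine two file -> label mappings (hypothetical helper)."""
    if aggregate_strategy not in ("overlap", "mismatch"):
        raise ValueError(f"unknown aggregate_strategy: {aggregate_strategy!r}")
    result = dict(left)
    for file, value in right.items():
        if file not in left:
            # No overlapping index entry: take the value over unchanged
            result[file] = value
        elif aggregate_strategy == "overlap" or left[file] != value:
            # 'overlap': aggregate every shared index entry
            # 'mismatch': aggregate only shared entries whose values differ
            result[file] = aggregate_function([left[file], value])
    return result


first = {"f1.wav": "anger", "f2.wav": "neutral"}
second = {"f1.wav": "disgust", "f2.wav": "neutral", "f3.wav": "fear"}

by_mismatch = combine_labels(first, second, tuple, "mismatch")
by_overlap = combine_labels(first, second, tuple, "overlap")
print(by_mismatch)
print(by_overlap)
```

With 'mismatch' only f1.wav is passed to the aggregate function, because f2.wav carries the same label in both inputs. With 'overlap' both shared files are aggregated.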
is_portable
- Database.is_portable
Check if database can be moved to another location.
To be portable, media must not be referenced with an absolute path, and must not contain '\', '.', or '..'. If a database is portable it can be moved to another folder or updated by another database.
- Returns
True if the database is portable
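The portability rule can be sketched for a single media path as follows. This is an illustration only and not the code audformat runs: is_portable_path is a hypothetical helper, and it assumes that '.' and '..' are meant as folder components, so a dot inside a file name such as f1.wav is allowed. Windows drive prefixes written with forward slashes are not covered by this sketch.

```python
def is_portable_path(path: str) -> bool:
    """Check one media path against the portability rule (hypothetical helper)."""
    # Backslashes are not allowed anywhere in the path
    if "\\" in path:
        return False
    # Absolute POSIX paths start with a slash
    if path.startswith("/"):
        return False
    # Assumption: '.' and '..' are forbidden as folder components only
    return not any(part in (".", "..") for part in path.split("/"))


paths = [
    "audio/f1.wav",
    "/data/audio/f1.wav",
    "audio\\f1.wav",
    "./audio/f1.wav",
    "audio/../f1.wav",
]
portable = {path: is_portable_path(path) for path in paths}
print(portable)
```

Under these assumptions only the first path is portable.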
load()
- static Database.load(root, *, name='db', load_data=False, num_workers=1, verbose=False)
Load database from disk.
Expects a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv|parquet|pkl]. Media files should be located under root.
- Parameters
root (str) – root directory
name (str) – base name of header and table files
load_data (bool) – if False, audformat.Table data is only loaded on demand, e.g. when audformat.Table.get() is called for the first time. Set to True to load all audformat.Table data immediately
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar
- Return type
- Returns
database object
- Raises
FileNotFoundError – if the database header file cannot be found under root
RuntimeError – if a CSV or PARQUET table file is newer than the corresponding PKL file
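The file layout that load() expects can be spelled out with a small sketch. The helper expected_files, the root /data/mydb, and the table IDs are made up for illustration and are not part of audformat. The sketch only builds the path names from the pattern given above and does not touch the disk.

```python
from pathlib import PurePosixPath
from typing import List, Sequence


def expected_files(
    root: str,
    name: str,
    table_ids: Sequence[str],
    extension: str = "csv",
) -> List[str]:
    """List header and table files for a database (hypothetical helper)."""
    # Header: <root>/<name>.yaml
    files = [PurePosixPath(root) / f"{name}.yaml"]
    # One file per table: <root>/<name>.<table-id>.<extension>
    for table_id in table_ids:
        files.append(PurePosixPath(root) / f"{name}.{table_id}.{extension}")
    return [str(file) for file in files]


files = expected_files("/data/mydb", "db", ["table", "misc-table"])
print(files)
```

Passing extension="parquet" or extension="pkl" yields the names for the other two storage formats.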
load_header_from_yaml()
map_files()
- Database.map_files(func, num_workers=1, verbose=False)
Apply function to file names in all tables.
If speed is crucial, see audformat.utils.map_file_path() for further hints on how to optimize your code.
pick_files()
pick_tables()
- Database.pick_tables(table_ids)
Pick (miscellaneous) tables by ID.
- Parameters
- Raises
audformat.errors.BadIdError – if a table with provided ID does not exist in the database
RuntimeError – if a misc table that is used in a scheme would be removed
root
- Database.root
Database root directory.
Returns None if database has not been stored yet.
- Returns
root directory
save()
- Database.save(root, *, name='db', indent=2, storage_format='parquet', update_other_formats=True, header_only=False, num_workers=1, verbose=False)
Save database to disk.
Creates a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv,parquet,pkl].
Existing files will be overwritten. If update_other_formats is True, existing files in other storage formats are overwritten as well.
- Parameters
root (str) – root directory (possibly created)
name (str) – base name of files
indent (int) – indent size
storage_format (str) – storage format of tables. See audformat.define.TableStorageFormat for available formats
update_other_formats (bool) – if True it will not only save to the given storage_format, but update all files stored in other storage formats as well
header_only (bool) – store header only
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar
to_dict()
update()
- Database.update(others, *, copy_attachments=False, copy_media=False, overwrite=False)
Update database with other database(s).
In order to update a database, license and usage have to match. Labels and values of schemes with the same ID are combined. Media, raters, schemes, splits, and attachments that are not part of the database yet are added. Other fields will be updated by applying the following rules:
field         result
author        ‘db.author, other.author’
description   db.description
expires       min(db.expires, other.expires)
languages     db.languages + other.languages
license_url   db.license_url
meta          db.meta + other.meta
name          db.name
organization  ‘db.organization, other.organization’
source        ‘db.source, other.source’
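The header rules listed above can be written out as a small sketch on plain dictionaries. This is not audformat's implementation: merge_header is a hypothetical helper, all fields are assumed to be set (no None handling), and db.meta + other.meta is read here as a dictionary merge, which is an assumption. Whether duplicate languages are removed is not stated above, so the sketch simply concatenates the lists.

```python
import datetime
from typing import Any, Dict


def merge_header(db: Dict[str, Any], other: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two header dictionaries following the rules above (hypothetical helper)."""
    return {
        # Joined as comma-separated strings
        "author": f"{db['author']}, {other['author']}",
        "organization": f"{db['organization']}, {other['organization']}",
        "source": f"{db['source']}, {other['source']}",
        # Taken from the database that is updated
        "description": db["description"],
        "license_url": db["license_url"],
        "name": db["name"],
        # Earlier expiry date wins
        "expires": min(db["expires"], other["expires"]),
        # Concatenated as listed in the rules
        "languages": db["languages"] + other["languages"],
        # Assumption: meta dictionaries are merged key by key
        "meta": {**db["meta"], **other["meta"]},
    }


db_header = {
    "author": "A", "organization": "Org A", "source": "src-a",
    "description": "first", "license_url": "url-a", "name": "db-a",
    "expires": datetime.date(2030, 1, 1), "languages": ["eng"],
    "meta": {"k1": 1},
}
other_header = {
    "author": "B", "organization": "Org B", "source": "src-b",
    "description": "second", "license_url": "url-b", "name": "db-b",
    "expires": datetime.date(2025, 6, 1), "languages": ["deu"],
    "meta": {"k2": 2},
}
merged = merge_header(db_header, other_header)
print(merged)
```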
- Parameters
others (Union[Database, Sequence[Database]]) – database object(s)
copy_attachments (bool) – if True it copies the attachment files associated with others to the current database root folder
copy_media (bool) – if True it copies the media files associated with others to the current database root folder
overwrite (bool) – overwrite table values where indices overlap
- Return type
- Returns
the updated database
- Raises
ValueError – if database has different license or usage
ValueError – if different media, rater, scheme, split, or attachment with same ID is found
ValueError – if schemes cannot be combined, e.g. labels have different dtype
ValueError – if tables cannot be combined (e.g. values in same position overlap or level and dtypes of table indices do not match)
RuntimeError – if copy_media or copy_attachments is True, but one of the involved databases was not saved (contains files but no root folder)
RuntimeError – if any involved database is not portable
usage
- Database.usage
Usage permission.
Possible return values are given by audformat.define.Usage.