Database
- class audformat.Database(name, source='', usage='unrestricted', *, expires=None, languages=None, description=None, author=None, organization=None, license=None, license_url=None, meta=None)
Database object.
A database consists of a header holding raters, schemes, splits, and other meta information. In addition, it links to a number of tables listing files and labels.
To get started, see how to create a database and inspect the example of the emodb database.
- Parameters
name (str) – name of database
source (str) – data source (e.g. link to website)
usage (str) – permission of usage, see audformat.define.Usage. Set to 'other' if none of the other fields fit.
languages (Union[str, Sequence[str], None]) – list of languages. Will be mapped to ISO 639-3 strings with audformat.utils.map_language()
organization (Optional[str]) – organization(s) maintaining the database
license (Union[str, License, None]) – database license. You can use a custom license or pick one from audformat.define.License. In the latter case, license_url will be automatically set if it is not given
- Raises
BadValueError – if an invalid usage value is passed
ValueError – if language is unknown
Examples
>>> db = Database(
...     "mydb",
...     "https://www.audeering.com/",
...     define.Usage.COMMERCIAL,
...     languages=["English", "de"],
... )
>>> db
name: mydb
source: https://www.audeering.com/
usage: commercial
languages: [eng, deu]
>>> labels = ["positive", "neutral", "negative"]
>>> db.schemes["emotion"] = Scheme(labels=labels)
>>> db.schemes["match"] = Scheme(dtype="bool")
>>> db.raters["rater"] = Rater()
>>> db.media["audio"] = Media(
...     define.MediaType.AUDIO,
...     format="wav",
...     sampling_rate=16000,
... )
>>> index = filewise_index(["f1.wav", "f2.wav"])
>>> db["table"] = Table(index, media_id="audio")
>>> db["table"]["column"] = Column(
...     scheme_id="emotion",
...     rater_id="rater",
... )
>>> db["table"]["column"].set(["neutral", "positive"])
>>> index = pd.Index([], dtype="string", name="idx")
>>> db["misc-table"] = MiscTable(index)
>>> db["misc-table"]["column"] = Column(scheme_id="match")
>>> db
name: mydb
source: https://www.audeering.com/
usage: commercial
languages: [eng, deu]
media:
  audio: {type: audio, format: wav, sampling_rate: 16000}
raters:
  rater: {type: human}
schemes:
  emotion:
    dtype: str
    labels: [positive, neutral, negative]
  match: {dtype: bool}
tables:
  table:
    type: filewise
    media_id: audio
    columns:
      column: {scheme_id: emotion, rater_id: rater}
misc_tables:
  misc-table:
    levels: {idx: str}
    columns:
      column: {scheme_id: match}
>>> list(db)
['misc-table', 'table']
>>> db.get("emotion")
         emotion
file
f1.wav   neutral
f2.wav  positive
__contains__()
__eq__()
__getitem__()
__iter__()
__setitem__()
- Database.__setitem__(table_id, table)
Add table to database.
- Parameters
- Raises
BadIdError – if table has a split_id or media_id, which is not specified in the underlying database
TableExistsError – if setting a miscellaneous table when a filewise or segmented table with the same ID exists (or vice versa)
- Return type
attachments
- Database.attachments
Dictionary of attachments.
- Raises
RuntimeError – if the path of a newly assigned attachment overlaps with the path of an existing attachment
drop_files()
drop_tables()
- Database.drop_tables(table_ids)
Drop (miscellaneous) tables by ID.
- Parameters
- Raises
audformat.errors.BadIdError – if a table with provided ID does not exist in the database
RuntimeError – if a misc table that is used in a scheme would be removed
dump()
files
- Database.files
Files referenced in the database.
Includes files from filewise and segmented tables.
- Returns
files
files_duration()
- Database.files_duration(files, *, root=None)
Duration of files in the database.
Use db.files_duration(db.files).sum() to get the total duration of all files in a database, or db.files_duration(db[table_id].files).sum() to get the total duration of all files assigned to a table.
Note
Durations are cached, i.e. changing the files on disk after calling this function can lead to wrong results. The cache is cleared when the database is reloaded from disk.
- Parameters
- Return type
- Returns
mapping from file to duration
- Raises
ValueError – if root is not set when using relative file names with a database that was not saved or loaded from disk
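The caching and root-resolution behavior described for files_duration() can be illustrated with a small standard-library sketch. It is not audformat's implementation: the helper name wav_duration, the module-level cache, and the restriction to WAV files are assumptions made only for illustration.

```python
import os
import wave
from typing import Dict, Optional

# Hypothetical module-level cache, keyed by the resolved file path
_duration_cache: Dict[str, float] = {}


def wav_duration(file: str, root: Optional[str] = None) -> float:
    """Return the duration of a WAV file in seconds, with caching."""
    if not os.path.isabs(file):
        if root is None:
            # Mirrors the documented ValueError for relative names without root
            raise ValueError("root must be set when using relative file names")
        file = os.path.join(root, file)
    if file not in _duration_cache:
        # Read the header only once per path; later calls hit the cache
        with wave.open(file, "rb") as wav:
            _duration_cache[file] = wav.getnframes() / wav.getframerate()
    return _duration_cache[file]
```

Because results are cached by path, a file that is modified after the first call still reports its old duration until the cache is cleared, which is the pitfall the note on caching warns about.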
from_dict()
get()
- Database.get(scheme, additional_schemes=[], *, tables=None, splits=None, strict=False, map=True, original_column_names=False, aggregate_function=None, aggregate_strategy='mismatch')
Get labels by scheme.
Return all labels from columns assigned to an audformat.Scheme with name scheme. The request can be limited to specific tables and/or splits. By providing additional_schemes the result can be enriched with labels from other schemes (searched in all tables). If strict is False, a scheme is defined more broadly and does not only match schemes of the database, but also columns with the same name or labels of a scheme with the requested name as key. If at least one returned label belongs to a segmented table, the returned data frame has a segmented index. An aggregate_function can be provided that specifies how values are combined if more than one value is found for the same file or segment.
- Parameters
scheme (str) – scheme ID for which labels should be returned. The search can be restricted to specific tables and splits by the tables and splits arguments, or extended to columns with that same name or the name of a label in the scheme using the strict argument
additional_schemes (str | Sequence) – scheme ID or sequence of scheme IDs for which additional labels should be returned. The search is not affected by the tables and splits arguments
tables (Union[str, Sequence, None]) – limit search for scheme to selected tables
splits (Union[str, Sequence, None]) – limit search for scheme to selected splits
strict (bool) – if False the search is extended to columns that match the name of the scheme or the name of a label in the scheme
map (bool) – if True and a requested scheme has labels with mappings, those will be returned
original_column_names (bool) – if True keep the original column names (possibly results in multiple columns). For mapped schemes, the column name before mapping is returned, e.g. when requesting 'gender' it might return a column named 'speaker'
aggregate_function (Optional[Callable[[Series], object]]) – callable to aggregate overlapping values. The function gets a pandas.Series with overlapping values as input. E.g. set to lambda y: y.mean() to average the values or to tuple to return them as a tuple
aggregate_strategy (str) – if aggregate_function is not None, aggregate_strategy decides when aggregate_function is applied. 'overlap': apply to all samples that have an overlapping index; 'mismatch': apply to all samples that have an overlapping index and a different value
- Return type
- Returns
data frame with values
- Raises
ValueError – if different labels are found for a requested scheme under the same index entry
ValueError – if original_column_names is True and two columns in the returned data frame have the same name and cannot be joined due to overlapping data or different data type
TypeError – if labels of different data type are found for a requested scheme
Examples
Return all labels that match a requested scheme.
>>> import audb
>>> db = audb.load(
...     "emodb",
...     version="1.4.1",
...     only_metadata=True,
...     full_path=False,
...     verbose=False,
... )
>>> db.get("emotion").head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
>>> db.get("transcription").head()
                                        transcription
file
wav/03a01Fa.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a01Nc.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a01Wa.wav  Der Lappen liegt auf dem Eisschrank.
wav/03a02Fc.wav     Das will sie am Mittwoch abgeben.
wav/03a02Nc.wav     Das will sie am Mittwoch abgeben.
>>> db.get("emotion", ["transcription"], map=False).head()
                   emotion  transcription
file
wav/03a01Fa.wav  happiness            a01
wav/03a01Nc.wav    neutral            a01
wav/03a01Wa.wav      anger            a01
wav/03a02Fc.wav  happiness            a02
wav/03a02Nc.wav    neutral            a02
Non-existent schemes are ignored.
>>> db.get("emotion", ["non-existing"]).head()
                   emotion  non-existing
file
wav/03a01Fa.wav  happiness           NaN
wav/03a01Nc.wav    neutral           NaN
wav/03a01Wa.wav      anger           NaN
wav/03a02Fc.wav  happiness           NaN
wav/03a02Nc.wav    neutral           NaN
Limit to a particular table or split.
>>> db.get("emotion", tables=["emotion.categories.train.gold_standard"]).head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
>>> db.get("emotion", splits=["test"]).head()
                   emotion
file
wav/12a01Fb.wav  happiness
wav/12a01Lb.wav    boredom
wav/12a01Nb.wav    neutral
wav/12a01Wc.wav      anger
wav/12a02Ac.wav       fear
Return requested scheme name independent of column ID.
>>> db["emotion"].columns
emotion: {scheme_id: emotion, rater_id: gold}
emotion.confidence: {scheme_id: confidence, rater_id: gold}
>>> db.get("confidence").head()
                 confidence
file
wav/03a01Fa.wav        0.90
wav/03a01Nc.wav        1.00
wav/03a01Wa.wav        0.95
wav/03a02Fc.wav        0.85
wav/03a02Nc.wav        1.00
If strict is True, only values that have an attached scheme are returned.
>>> db.get("emotion.confidence").head()
                 emotion.confidence
file
wav/03a01Fa.wav                0.90
wav/03a01Nc.wav                1.00
wav/03a01Wa.wav                0.95
wav/03a02Fc.wav                0.85
wav/03a02Nc.wav                1.00
>>> db.get("emotion.confidence", strict=True).head()
Empty DataFrame
Columns: [emotion.confidence]
Index: []
If more than one value exists for the requested scheme and index entry, an error is raised and aggregate_function can be used to combine the values.
>>> # Add a shuffled version of emotion ratings as `random` column
>>> db["emotion"]["random"] = Column(scheme_id="emotion")
>>> db["emotion"]["random"].set(
...     db["emotion"]["emotion"].get().sample(frac=1, random_state=1)
... )
>>> db.get("emotion")
Traceback (most recent call last):
    ...
ValueError: Found overlapping data in column 'emotion':
                      left    right
file
wav/03a01Nc.wav    neutral  disgust
wav/03a01Wa.wav      anger  neutral
wav/03a02Fc.wav  happiness  neutral
wav/03a02Ta.wav    sadness  boredom
wav/03a02Wb.wav      anger  sadness
wav/03a04Ad.wav       fear  neutral
wav/03a04Fd.wav  happiness    anger
wav/03a04Nc.wav    neutral  sadness
wav/03a04Wc.wav      anger  boredom
wav/03a05Aa.wav       fear  sadness
...
>>> db.get("emotion", aggregate_function=lambda y: y[0]).head()
                   emotion
file
wav/03a01Fa.wav  happiness
wav/03a01Nc.wav    neutral
wav/03a01Wa.wav      anger
wav/03a02Fc.wav  happiness
wav/03a02Nc.wav    neutral
Alternatively, use original_column_names to return column IDs.
>>> db.get("emotion", original_column_names=True).head()
                   emotion     random
file
wav/03a01Fa.wav  happiness  happiness
wav/03a01Nc.wav    neutral    disgust
wav/03a01Wa.wav      anger    neutral
wav/03a02Fc.wav  happiness    neutral
wav/03a02Nc.wav    neutral    neutral
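The difference between the two aggregate_strategy values can be sketched in plain Python. This models only the rule as stated in the parameter description of get(), not audformat's implementation. The helper combine and its list-based input are assumptions for illustration; the library itself passes a pandas.Series to aggregate_function.

```python
from typing import Callable, List


def combine(
    values: List[object],
    aggregate_function: Callable[[List[object]], object],
    strategy: str = "mismatch",
) -> object:
    """Combine the values found for a single file or segment."""
    if len(values) == 1:
        # Only one value for this index entry, nothing overlaps
        return values[0]
    if strategy == "overlap":
        # Aggregate whenever more than one value shares the index entry
        return aggregate_function(values)
    if strategy == "mismatch":
        # Aggregate only if the overlapping values actually differ
        if all(value == values[0] for value in values):
            return values[0]
        return aggregate_function(values)
    raise ValueError(f"unknown strategy: {strategy}")


print(combine(["neutral", "neutral"], tuple, "overlap"))   # ('neutral', 'neutral')
print(combine(["neutral", "neutral"], tuple, "mismatch"))  # neutral
print(combine(["neutral", "anger"], tuple, "mismatch"))    # ('neutral', 'anger')
```

With 'mismatch', entries whose values agree pass through unchanged, so the aggregate function only sees the conflicting cases.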
is_portable
- Database.is_portable
Check if database can be moved to another location.
To be portable, media must not be referenced with an absolute path, and must not contain '\', '.', or '..'. If a database is portable, it can be moved to another folder or updated by another database.
- Returns
True if the database is portable
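One way to read the portability rule is as a check on each media path. The sketch below is an assumption-laden illustration, not the code behind is_portable: the helper name looks_portable is hypothetical, and treating '.' and '..' as whole path components (rather than as characters, which every file extension would contain) is an interpretation of the description.

```python
import posixpath


def looks_portable(path: str) -> bool:
    """Check a single media path against the portability rule."""
    if "\\" in path:
        # Backslashes are not allowed, which also excludes Windows paths
        return False
    if posixpath.isabs(path):
        # Absolute paths tie the database to one location
        return False
    # Assumption: '.' and '..' are forbidden as path components
    parts = path.split("/")
    return "." not in parts and ".." not in parts


print(looks_portable("wav/03a01Fa.wav"))     # True
print(looks_portable("/data/03a01Fa.wav"))   # False
print(looks_portable("wav/../03a01Fa.wav"))  # False
```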
load()
- static Database.load(root, *, name='db', load_data=False, num_workers=1, verbose=False)
Load database from disk.
Expects a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv|parquet|pkl]. Media files should be located under root.
- Parameters
root (str) – root directory
name (str) – base name of header and table files
load_data (bool) – if False, audformat.Table data is only loaded on demand, e.g. when audformat.Table.get() is called for the first time. Set to True to load all audformat.Table data immediately
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar
- Return type
- Returns
database object
- Raises
FileNotFoundError – if the database header file cannot be found under root
RuntimeError – if a CSV or PARQUET table file is newer than the corresponding PKL file
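The file layout expected by load() and the condition behind the RuntimeError can be sketched with the standard library. The helper names header_path, table_path, and pickle_is_stale are hypothetical, and comparing modification times is an assumption about how "newer" might be determined; audformat's own check may differ.

```python
import os


def header_path(root: str, name: str = "db") -> str:
    """Build <root>/<name>.yaml."""
    return os.path.join(root, f"{name}.yaml")


def table_path(root: str, name: str, table_id: str, ext: str) -> str:
    """Build <root>/<name>.<table-id>.<ext>."""
    return os.path.join(root, f"{name}.{table_id}.{ext}")


def pickle_is_stale(root: str, name: str, table_id: str) -> bool:
    """Return True if a CSV or PARQUET table file is newer than the PKL file."""
    pkl = table_path(root, name, table_id, "pkl")
    if not os.path.exists(pkl):
        return False
    pkl_time = os.path.getmtime(pkl)
    for ext in ("csv", "parquet"):
        other = table_path(root, name, table_id, ext)
        # Assumption: "newer" means a later modification time on disk
        if os.path.exists(other) and os.path.getmtime(other) > pkl_time:
            return True
    return False
```

A stale pickle usually means the CSV or PARQUET file was edited by hand after the cache was written, so loading the pickle would silently ignore those edits.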
load_header_from_yaml()
map_files()
- Database.map_files(func, num_workers=1, verbose=False)
Apply function to file names in all tables.
If speed is crucial, see audformat.utils.map_file_path() for further hints on how to optimize your code.
pick_files()
pick_tables()
- Database.pick_tables(table_ids)
Pick (miscellaneous) tables by ID.
- Parameters
- Raises
audformat.errors.BadIdError – if a table with provided ID does not exist in the database
RuntimeError – if a misc table that is used in a scheme would be removed
root
- Database.root
Database root directory.
Returns None if database has not been stored yet.
- Returns
root directory
save()
- Database.save(root, *, name='db', indent=2, storage_format='parquet', update_other_formats=True, header_only=False, num_workers=1, verbose=False)
Save database to disk.
Creates a header <root>/<name>.yaml and for every table a file <root>/<name>.<table-id>.[csv,parquet,pkl].
Existing files will be overwritten. If update_other_formats is True, all existing files in other formats will be overwritten as well.
- Parameters
root (str) – root directory (possibly created)
name (str) – base name of files
indent (int) – indent size
storage_format (str) – storage format of tables. See audformat.define.TableStorageFormat for available formats
update_other_formats (bool) – if True it will not only save to the given storage_format, but update all files stored in other storage formats as well
header_only (bool) – store header only
num_workers (Optional[int]) – number of parallel jobs. If None will be set to the number of processors on the machine multiplied by 5
verbose (bool) – show progress bar
to_dict()
update()
- Database.update(others, *, copy_attachments=False, copy_media=False, overwrite=False)
Update database with other database(s).
In order to update a database, license and usage have to match. Labels and values of schemes with the same ID are combined. Media, raters, schemes, splits, and attachments that are not part of the database yet are added. Other fields will be updated by applying the following rules:
field          result
author         'db.author, other.author'
description    db.description
expires        min(db.expires, other.expires)
languages      db.languages + other.languages
license_url    db.license_url
meta           db.meta + other.meta
name           db.name
organization   'db.organization, other.organization'
source         'db.source, other.source'
- Parameters
copy_attachments (bool) – if True it copies the attachment files associated with others to the current database root folder
copy_media (bool) – if True it copies the media files associated with others to the current database root folder
overwrite (bool) – overwrite table values where indices overlap
- Return type
- Returns
the updated database
- Raises
ValueError – if database has different license or usage
ValueError – if different media, rater, scheme, split, or attachment with same ID is found
ValueError – if schemes cannot be combined, e.g. labels have different dtype
ValueError – if tables cannot be combined (e.g. values in same position overlap or level and dtypes of table indices do not match)
RuntimeError – if copy_media or copy_attachments is True, but one of the involved databases was not saved (contains files but no root folder)
RuntimeError – if any involved database is not portable
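The header field rules of update() can be modelled with plain dictionaries. This is a minimal sketch under stated assumptions, not audformat's code: the function merge_header is hypothetical, every field is assumed to be set (no None handling), duplicates are not removed, and the precedence of clashing meta keys is a guess.

```python
import datetime
from typing import Any, Dict


def merge_header(db: Dict[str, Any], other: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the header fields of two databases following the field rules."""
    # License and usage have to match before anything is merged
    for field in ("license", "usage"):
        if db[field] != other[field]:
            raise ValueError(f"databases have different {field}")

    def join(a: str, b: str) -> str:
        # String fields are concatenated as 'db.field, other.field'
        return f"{a}, {b}"

    return {
        "name": db["name"],
        "license": db["license"],
        "usage": db["usage"],
        "license_url": db["license_url"],
        "description": db["description"],
        "author": join(db["author"], other["author"]),
        "organization": join(db["organization"], other["organization"]),
        "source": join(db["source"], other["source"]),
        # The earlier expiry date wins
        "expires": min(db["expires"], other["expires"]),
        "languages": db["languages"] + other["languages"],
        # Assumption: on clashing keys the value from `other` is kept
        "meta": {**db["meta"], **other["meta"]},
    }
```

Fields such as name and description are taken from the database being updated, so the order in which databases are combined matters.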
usage
- Database.usage
Usage permission.
Possible return values are given by audformat.define.Usage.