Overview

audb is similar to a version control system for text and binary data. It lets you manage databases for machine learning applications and other tasks that require reproducibility and easy combination of different data sources.

The databases themselves can be stored on different backends. You can publish or download different versions of the same database without copying data that hasn’t changed between versions.
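To illustrate how unchanged data can be skipped between versions, here is a minimal sketch that compares file checksums. It assumes that a checksum per file is stored for each published version; the helper names `md5` and `files_to_upload` and the file names are hypothetical and not part of audb.

```python
import hashlib


def md5(data: bytes) -> str:
    """Return the hex MD5 digest of some file content."""
    return hashlib.md5(data).hexdigest()


def files_to_upload(previous: dict, current: dict) -> list:
    """List files that are new or changed compared to the previous version.

    previous maps file name -> checksum stored for the last published version.
    current maps file name -> file content of the new version.
    """
    return sorted(
        name
        for name, content in current.items()
        if previous.get(name) != md5(content)
    )


# Hypothetical version 1.0.0 with two files
v1 = {"audio/a.wav": b"aaa", "audio/b.wav": b"bbb"}
v1_checksums = {name: md5(content) for name, content in v1.items()}

# Hypothetical version 1.1.0: one file changed, one file added
v2 = {"audio/a.wav": b"aaa", "audio/b.wav": b"BBB", "audio/c.wav": b"ccc"}

changed = files_to_upload(v1_checksums, v2)
print(changed)  # audio/a.wav is unchanged and is not listed
```

Only the files returned by `files_to_upload` would need to be transferred; everything else can be referenced from the earlier version.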

The following sections give a technical overview of how audb works internally. If you just want to use it, you can continue with Publish a database or Load a database.

Backends

audb abstracts database storage by using the audbackend package to communicate with the underlying backend. At the moment, it supports storing data in a folder on a local file system, in a bucket on MinIO or S3 storage, or inside a Generic repository on an Artifactory server.

You can extend this by adding your own backend that implements the required functions.
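As a rough idea of what "implementing the required functions" can look like, the sketch below defines a small storage interface and an in-memory implementation. The `Backend` class and its methods `put`, `get`, and `exists` are hypothetical and chosen for illustration only; they are not the actual audbackend interface, which should be taken from the audbackend documentation.

```python
import abc


class Backend(abc.ABC):
    """Hypothetical minimal storage interface (not the audbackend API)."""

    @abc.abstractmethod
    def put(self, path: str, version: str, data: bytes) -> None:
        """Store data under the given path and version."""

    @abc.abstractmethod
    def get(self, path: str, version: str) -> bytes:
        """Return the data stored under the given path and version."""

    @abc.abstractmethod
    def exists(self, path: str, version: str) -> bool:
        """Check whether a path exists in the given version."""


class InMemoryBackend(Backend):
    """Toy backend that keeps all files in a dictionary."""

    def __init__(self):
        self._store = {}

    def put(self, path, version, data):
        self._store[(path, version)] = data

    def get(self, path, version):
        return self._store[(path, version)]

    def exists(self, path, version):
        return (path, version) in self._store


backend = InMemoryBackend()
backend.put("mydb/db.yaml", "1.0.0", b"name: mydb")
print(backend.exists("mydb/db.yaml", "1.0.0"))
print(backend.exists("mydb/db.yaml", "2.0.0"))
```

Code that only talks to the abstract interface does not need to know whether files end up on disk, in a bucket, or on a server, which is the point of the abstraction.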

Storage on backends is managed by audb.Repository objects. For example, to store all your data on your local disk under /data/data-local, you would use the following repository.

import audb

# Repository "data-local" located below the host folder "/data"
repository = audb.Repository(
    name="data-local",
    host="/data",
    backend="file-system",
)

The default repositories are configured in audb.config.REPOSITORIES and are best managed by specifying them in the Configuration.

Publish

When publishing your data with audb.publish() the following operations are performed:

  1. calculate database dependencies

  2. pack media and csv files into ZIP archives

  3. upload all files to the backend

digraph G {
    rankdir=LR
    node[shape=Mrecord, style=filled, color=orange]
    compound=true
    subgraph cluster_project {
        label="Database Project"
        subgraph cluster_folder {
            label="Database Folder"
            header_in[label="Header"]
            tables_in[label="Tables"]
            media_in[label="Media"]
            deps_in[label="(Deps)"]
        }
    }
    subgraph cluster_publish {
        label="Publish with audb"
        pack[label="(Pack)"]
        upload[label="Upload"]
    }
    subgraph cluster_backend {
        label="Backend"
        subgraph cluster_database {
            label="Database vX.Y.Z"
            header_out[label="Header"]
            tables_out[label="Tables"]
            media_out[label="Media"]
            deps_out[label="Deps"]
        }
    }
    header_in->pack [ltail=cluster_folder]
    pack->upload
    upload->header_out [lhead=cluster_database]
}
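The three publish steps can be sketched with the Python standard library alone. This is a toy model under stated assumptions, not the implementation of audb.publish(): the function `publish`, the folder layout `<backend>/<name>/<version>/`, and the choice to name each archive after the MD5 checksum of its file are all hypothetical.

```python
import hashlib
import pathlib
import shutil
import tempfile
import zipfile


def publish(db_root, backend_root, name, version):
    """Toy publish: checksum every file, pack it into a ZIP, copy it to a folder."""
    target = backend_root / name / version
    target.mkdir(parents=True, exist_ok=True)
    deps = {}
    with tempfile.TemporaryDirectory() as tmp:
        for path in sorted(p for p in db_root.rglob("*") if p.is_file()):
            rel = path.relative_to(db_root).as_posix()
            # 1. record the file and its checksum as a dependency
            deps[rel] = hashlib.md5(path.read_bytes()).hexdigest()
            # 2. pack the file into a ZIP archive
            archive = pathlib.Path(tmp) / (deps[rel] + ".zip")
            with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
                zf.write(path, arcname=rel)
            # 3. "upload" the archive, here a copy into the backend folder
            shutil.copy(archive, target / archive.name)
    return deps


with tempfile.TemporaryDirectory() as root:
    root = pathlib.Path(root)
    # Build a tiny hypothetical database folder with one media file and one table
    db_root = root / "build"
    (db_root / "audio").mkdir(parents=True)
    (db_root / "audio" / "a.wav").write_bytes(b"fake audio")
    (db_root / "db.files.csv").write_text("file\naudio/a.wav\n")

    deps = publish(db_root, root / "backend", "mydb", "1.0.0")
    uploaded = sorted(
        p.name for p in (root / "backend" / "mydb" / "1.0.0").iterdir()
    )

print(sorted(deps))
print(len(uploaded))
```

The returned dictionary plays the role of the dependency information: it records which files belong to the version and lets a later publish detect which of them have changed.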

Load

When loading data with audb.load(), the following operations are performed:

  1. find the backends where the database is stored

  2. find the latest version of a database (optional)

  3. calculate database dependencies

  4. download (archive) files from the selected backend

  5. unpack the archive files (optional)

  6. inspect and convert the audio files (optional)

  7. store the data in a cache folder

digraph G {
    rankdir=LR
    node[shape=Mrecord, style=filled, color=orange]
    compound=true
    subgraph cluster_backend {
        label="Backend"
        subgraph cluster_database {
            label="Database vX.Y.Z"
            header_in[label="Header"]
            tables_in[label="Tables"]
            media_in[label="Media"]
            deps_in[label="Deps"]
        }
    }
    subgraph cluster_load {
        label="Load with audb"
        download[label="Download"]
        unpack[label="(Unpack)"]
        convert[label="(Convert)"]
    }
    subgraph cluster_cache {
        label="Cache"
        subgraph cluster_flavor {
            label="Database vX.Y.Z Flavor"
            header_out[label="Header"]
            tables_out[label="Tables"]
            media_out[label="(Conv.) Media"]
            deps_out[label="Deps"]
        }
    }
    header_in->download [ltail=cluster_database]
    download->unpack->convert
    convert->header_out [lhead=cluster_flavor]
}
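Part of the load procedure can be sketched in the same spirit: resolve the latest version, download the archives, unpack them, and store the result in a cache folder. This is a simplified model, not the implementation of audb.load(): the functions `latest_version` and `load`, the folder layout, and the X.Y.Z version format are assumptions, and the optional audio conversion step is left out.

```python
import pathlib
import shutil
import tempfile
import zipfile


def latest_version(versions):
    """Pick the highest version by comparing numeric parts (assumes X.Y.Z)."""
    return max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


def load(backend_root, cache_root, name, version=None):
    """Toy load: resolve the version, fetch the archives, unpack into the cache."""
    db_dir = backend_root / name
    if version is None:
        version = latest_version([p.name for p in db_dir.iterdir() if p.is_dir()])
    target = cache_root / name / version
    target.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        for archive in sorted((db_dir / version).glob("*.zip")):
            # "download" the archive, here a copy from the backend folder
            local = pathlib.Path(shutil.copy(archive, tmp))
            # unpack the archive into the cache folder
            with zipfile.ZipFile(local) as zf:
                zf.extractall(target)
    return target


with tempfile.TemporaryDirectory() as root:
    root = pathlib.Path(root)
    # Create two hypothetical versions of a database on a folder backend
    for version, content in [("1.2.0", b"old"), ("1.10.0", b"new")]:
        folder = root / "backend" / "mydb" / version
        folder.mkdir(parents=True)
        with zipfile.ZipFile(folder / "media.zip", "w") as zf:
            zf.writestr("audio/a.wav", content)

    cached = load(root / "backend", root / "cache", "mydb")
    loaded_version = cached.name
    loaded_content = (cached / "audio" / "a.wav").read_bytes()

print(loaded_version)
```

Versions are compared numerically rather than as strings, so that 1.10.0 is correctly treated as newer than 1.2.0.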