Overview
audb is similar to a version control system
for text and binary data.
It lets you manage your databases
for machine learning applications
and other tasks
where reproducibility
and easy combination of different data sources are needed.
The databases themselves can be stored on different backends. You can publish or download different versions of the same database without copying data that hasn’t changed between versions.
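To illustrate how a new version can avoid copying unchanged data, the following minimal sketch records for every file a checksum and the version in which that content was last stored. It assumes that files are compared by checksum; the helper `new_version_table` and the table layout are hypothetical and not part of the audb API.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def new_version_table(previous: dict, files: dict, version: str) -> dict:
    """Build a file table for `version` (hypothetical helper).

    previous: {path: (checksum, version)} of the last published version
    files:    {path: content} of the version to publish
    """
    table = {}
    for path, content in files.items():
        digest = checksum(content)
        if path in previous and previous[path][0] == digest:
            # Unchanged file: keep pointing to the version that already stores it
            table[path] = previous[path]
        else:
            # New or modified file: it has to be stored under the new version
            table[path] = (digest, version)
    return table


v1 = new_version_table({}, {"a.wav": b"aaa", "b.wav": b"bbb"}, "1.0.0")
v2 = new_version_table(v1, {"a.wav": b"aaa", "b.wav": b"BBB"}, "1.1.0")

# Only files whose content changed need to be uploaded for 1.1.0
to_upload = [path for path, (_, ver) in v2.items() if ver == "1.1.0"]
print(to_upload)
```

In this toy example only `b.wav` would be uploaded for the second version, while `a.wav` keeps referring to the copy stored with the first one.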
In the following, we provide a technical overview
of the underlying workings of audb.
If you just want to use it,
you can continue with Publish a database
or Load a database.
Backends
audb abstracts the database storage
by using the audbackend package
to communicate with the underlying backend.
At the moment,
it supports storing the data
in a folder on a local file system,
in a bucket on MinIO or S3 storage,
or inside a Generic repository
on an Artifactory server.
You can extend this by adding your own backend that implements the required functions.
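The functions a backend actually has to implement are defined by audbackend and are not reproduced here. As a rough sketch of the idea only, assuming a backend needs to store, retrieve, and list versioned files, a toy in-memory backend could look as follows. The classes `Backend` and `InMemoryBackend` and their method names are illustrative assumptions, not the audbackend API.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical minimal storage interface (not the audbackend API)."""

    @abstractmethod
    def put(self, path: str, version: str, content: bytes) -> None:
        """Store `content` under `path` for the given version."""

    @abstractmethod
    def get(self, path: str, version: str) -> bytes:
        """Return the content stored under `path` for the given version."""

    @abstractmethod
    def versions(self, path: str) -> list:
        """Return all versions available for `path`."""


class InMemoryBackend(Backend):
    """Toy backend that keeps all files in a dictionary."""

    def __init__(self):
        self._files = {}  # maps (path, version) to file content

    def put(self, path, version, content):
        self._files[(path, version)] = content

    def get(self, path, version):
        return self._files[(path, version)]

    def versions(self, path):
        # Plain string sorting, which is sufficient for this small example
        return sorted(v for (p, v) in self._files if p == path)


backend = InMemoryBackend()
backend.put("db.yaml", "1.0.0", b"name: db")
backend.put("db.yaml", "1.1.0", b"name: db\nusage: research")
print(backend.versions("db.yaml"))
```

Code that only talks to the abstract `Backend` interface does not need to know whether the files live in memory, on disk, or on a remote server, which is the point of the abstraction.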
Storage on backends is managed by audb.Repository objects.
For example,
to store all your data
on your local disk under /data/data-local,
you would use the following repository.
import audb

repository = audb.Repository(
    name="data-local",
    host="/data",
    backend="file-system",
)
The default repositories are configured in audb.config.REPOSITORIES
and are best managed
by specifying them in the Configuration.
Publish
When publishing your data with audb.publish(),
the following operations are performed:
1. calculate database dependencies
2. pack media and CSV files into ZIP archives
3. upload all files to the backend
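The publish steps above can be sketched with the standard library alone. This is a simplified illustration, not the audb implementation: dependencies are reduced to one checksum per file, every file gets its own ZIP archive, and "uploading" means copying into a versioned folder. The function `publish_sketch` and the folder layout are assumptions made for this example.

```python
import hashlib
import os
import shutil
import tempfile
import zipfile


def publish_sketch(build_dir: str, backend_dir: str, version: str) -> dict:
    """Toy version of the three publish steps (not the audb implementation)."""
    # 1. Calculate dependencies: here simply one checksum per file
    deps = {}
    for root, _, names in os.walk(build_dir):
        for name in sorted(names):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, build_dir)
            with open(path, "rb") as fp:
                deps[rel] = hashlib.sha256(fp.read()).hexdigest()

    target = os.path.join(backend_dir, version)
    os.makedirs(target, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        for rel in deps:
            # 2. Pack the file into a ZIP archive
            archive = os.path.join(tmp, rel.replace(os.sep, "_") + ".zip")
            with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
                zf.write(os.path.join(build_dir, rel), arcname=rel)
            # 3. "Upload" the archive by copying it to the backend folder
            shutil.copy(archive, target)
    return deps


with tempfile.TemporaryDirectory() as root:
    build_dir = os.path.join(root, "build")
    backend_dir = os.path.join(root, "backend")
    os.makedirs(build_dir)
    with open(os.path.join(build_dir, "db.files.csv"), "w") as fp:
        fp.write("file,speaker\naudio.wav,s1\n")
    with open(os.path.join(build_dir, "audio.wav"), "wb") as fp:
        fp.write(b"\x00" * 16)

    deps = publish_sketch(build_dir, backend_dir, "1.0.0")
    uploaded = sorted(os.listdir(os.path.join(backend_dir, "1.0.0")))
    print(uploaded)
```

The returned checksum table is what a later version could compare against to decide which files have changed.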
Load
When loading data with audb.load(),
the following operations are performed:
1. find the backends where the database is stored
2. find the latest version of a database (optional)
3. calculate database dependencies
4. download (archive) files from the selected backend
5. unpack the archive files (optional)
6. inspect and convert the audio files (optional)
7. store the data in a cache folder
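Most of the load steps above can be sketched in a similar way. The example below is a simplified illustration, not the audb implementation: backends are plain folders laid out as `<backend>/<name>/<version>/`, downloading is reading from that folder, and the dependency calculation and audio conversion steps are left out. The function `load_sketch` and the folder layout are assumptions made for this example.

```python
import os
import tempfile
import zipfile


def version_key(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def load_sketch(name, backend_dirs, cache_dir, version=None):
    """Toy version of the load steps (not the audb implementation)."""
    # Find the backends that store the database
    hosts = [d for d in backend_dirs if os.path.isdir(os.path.join(d, name))]
    if not hosts:
        raise RuntimeError(f"Cannot find database '{name}'")
    host = hosts[0]

    # Find the latest version if none was requested
    if version is None:
        version = max(os.listdir(os.path.join(host, name)), key=version_key)

    # "Download" and unpack all archives into the cache folder
    source = os.path.join(host, name, version)
    target = os.path.join(cache_dir, name, version)
    os.makedirs(target, exist_ok=True)
    for archive in sorted(os.listdir(source)):
        with zipfile.ZipFile(os.path.join(source, archive)) as zf:
            zf.extractall(target)
    return target


with tempfile.TemporaryDirectory() as root:
    empty = os.path.join(root, "backend-a")  # does not contain the database
    host = os.path.join(root, "backend-b")
    os.makedirs(empty)
    for version in ["1.2.0", "1.10.0"]:
        folder = os.path.join(host, "db", version)
        os.makedirs(folder)
        with zipfile.ZipFile(os.path.join(folder, "data.zip"), "w") as zf:
            zf.writestr("version.txt", version)

    db_root = load_sketch("db", [empty, host], os.path.join(root, "cache"))
    loaded_version = os.path.basename(db_root)
    with open(os.path.join(db_root, "version.txt")) as fp:
        content = fp.read()
    print(loaded_version, content)
```

Comparing versions as integer tuples rather than strings matters here, since a plain string comparison would rank `1.2.0` above `1.10.0`.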