Usage

The basic idea of audinterface is to provide easy and standardized interfaces to apply a machine learning model, or other digital signal processing algorithms to audio files. The only prerequisite is the algorithm provides a callable that takes at least the signal as a numpy.ndarray and the sampling rate as input.

The interface can then apply the algorithm on a list of files, a folder, or an index conform to the audformat database specification. Results are always returned containing a segmented index. In the following we load three files from the emodb database and define a list of files, a folder, and an index.

import audb
import os

media = [
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.3.0",
    media=media,
    verbose=False,
)

files = list(db.files)
folder = os.path.dirname(files[0])
index = db["emotion"].index

Processing interface

Let’s assume we want to calculate the root mean square (RMS) value in dB. We first define the function and create an interface for it using audinterface.Process.

import audinterface
import numpy as np

def rms(signal, sampling_rate):
    return 20 * np.log10(np.sqrt(np.mean(signal ** 2)))

interface = audinterface.Process(process_func=rms)

The following three commands apply the algorithm and all return the same result as a pandas.Series.

y = interface.process_files(files)
y = interface.process_folder(folder)
y = interface.process_index(index)
y
file start end
.../03a01Fa.wav 0 days 0 days 00:00:01.898250 -21.690142
.../03a01Nc.wav 0 days 0 days 00:00:01.611250 -18.040703
.../16b10Wb.wav 0 days 0 days 00:00:02.522499999 -20.394533

To calculate RMS with a sliding window, we create a new interface and set a window and hop duration.

interface = audinterface.Process(
    process_func=rms,
    win_dur=1.0,
    hop_dur=0.5,
)
y = interface.process_files(files)
y
file start end
.../03a01Fa.wav 0 days 00:00:00 0 days 00:00:01 -20.165249
0 days 00:00:00.500000 0 days 00:00:01.500000 -23.472969
.../03a01Nc.wav 0 days 00:00:00 0 days 00:00:01 -16.386614
... ... ... ...
.../16b10Wb.wav 0 days 00:00:00.500000 0 days 00:00:01.500000 -20.233055
0 days 00:00:01 0 days 00:00:02 -18.856522
0 days 00:00:01.500000 0 days 00:00:02.500000 -20.403574

Feature interface

When the result of the processing function has multiple dimensions it is recommended to use audinterface.Feature, which returns a pandas.DataFrame and assigns names to the dimensions/features.

def features(signal, sampling_rate):
    return [signal.mean(), signal.std()]

interface = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
)

df = interface.process_index(index)
df
mean std
file start end
.../03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.000311 0.082317
.../03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.000312 0.125304
.../16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.000464 0.095558

To calculate features with a sliding window, we create a new interface and set a window and hop duration. By setting process_func_applies_sliding_window=False the windowing is automatically handled and single frames are passed on to the processing function.

interface = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
    process_func_applies_sliding_window=False,
    win_dur=1.0,
    hop_dur=0.5,
)
df = interface.process_files(files)
df
mean std
file start end
.../03a01Fa.wav 0 days 00:00:00 0 days 00:00:01 -0.000329 0.098115
0 days 00:00:00.500000 0 days 00:00:01.500000 -0.000285 0.067042
.../03a01Nc.wav 0 days 00:00:00 0 days 00:00:01 0.000039 0.151590
... ... ... ... ...
.../16b10Wb.wav 0 days 00:00:00.500000 0 days 00:00:01.500000 -0.000461 0.097351
0 days 00:00:01 0 days 00:00:02 -0.000469 0.114070
0 days 00:00:01.500000 0 days 00:00:02.500000 -0.000447 0.095459

Feature interface for multi-channel input

By default, an interface will process the first channel of an audio signal. We can prove this by running the previous interface on the following multi-channel signal.

import audiofile

signal, sampling_rate = audiofile.read(
    files[0],
    always_2d=True,
)
signal_multi_channel = np.concatenate(
    [
        signal,
        signal * 0,
        signal - 0.5,
        signal + 0.5,
    ],
)
signal_multi_channel.shape
(4, 30372)
df = interface.process_signal(
    signal_multi_channel,
    sampling_rate,
)
df
mean std
start end
0 days 00:00:00 0 days 00:00:01 -0.000329 0.098115
0 days 00:00:00.500000 0 days 00:00:01.500000 -0.000285 0.067042

To process the second and fourth channel, we create a new interface and set channels=[1, 3]. To reuse our processing function, we additionally set process_func_is_mono=True. This will apply the function on each channel and combine the results. Otherwise, the processing function must return an array with the correct number of channels (here 2).

interface_multi_channel = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
    process_func_is_mono=True,
    process_func_applies_sliding_window=False,
    win_dur=1.0,
    hop_dur=0.5,
    channels=[1, 3],
)

df = interface_multi_channel.process_signal(
    signal_multi_channel,
    sampling_rate,
)
df
1 3
mean std mean std
start end
0 days 00:00:00 0 days 00:00:01 0.0 0.0 0.499671 0.098115
0 days 00:00:00.500000 0 days 00:00:01.500000 0.0 0.0 0.499715 0.067042

We can access the features of a specific channel by its index.

df[3]
mean std
start end
0 days 00:00:00 0 days 00:00:01 0.499671 0.098115
0 days 00:00:00.500000 0 days 00:00:01.500000 0.499715 0.067042

Feature interface for external function

If we interface a function from an external library that already applies a sliding window, we again specfiy the win_dur and hop_dur arguments. However, by setting process_func_applies_sliding_window=True we still request that the whole signal is passed on. Now, the processing function is responsible for extracting the features in a framewise manner and returning the values in the correct shape, namely (num_channels, num_features, num_frames), whereas the first dimension is optionally.

import librosa

def features(signal, sampling_rate, win_dur, hop_dur, n_mfcc):
    hop_length = int(hop_dur * sampling_rate)
    win_length = int(win_dur * sampling_rate)
    mfcc = librosa.feature.mfcc(
        y=signal,
        sr=sampling_rate,
        n_mfcc=13,
        hop_length=hop_length,
        win_length=win_length,
    )
    return mfcc

n_mfcc = 13
interface = audinterface.Feature(
    [f"mfcc-{idx}" for idx in range(n_mfcc)],
    process_func=features,
    process_func_args={"n_mfcc": n_mfcc},  # "win_dur" and "hop_dur" can be omitted
    process_func_applies_sliding_window=True,
    win_dur=0.02,
    hop_dur=0.01,
)
df = interface.process_index(index)
df
mfcc-0 mfcc-1 ... mfcc-11 mfcc-12
file start end
.../03a01Fa.wav 0 days 00:00:00 0 days 00:00:00.020000 -611.993286 47.627602 ... 10.659678 1.151390
0 days 00:00:00.010000 0 days 00:00:00.030000 -668.175842 35.650566 ... 13.644274 14.068543
0 days 00:00:00.020000 0 days 00:00:00.040000 -664.612793 43.068939 ... 5.081633 7.949757
... ... ... ... ... ... ... ...
.../16b10Wb.wav 0 days 00:00:02.500000 0 days 00:00:02.520000 -644.156494 34.024551 ... 2.472639 7.411011
0 days 00:00:02.510000 0 days 00:00:02.530000 -618.545898 44.741577 ... 12.690891 17.645359
0 days 00:00:02.520000 0 days 00:00:02.540000 -666.805237 19.845566 ... 4.124321 3.711080

Serializable feature interface

To use a feature extractor as an input transform of a machine learning model it is recommend to provide it in a serializable way so it can be stored as part of the model. One example of such a feature extractor is opensmile.Smile.

To create such a feature extractor, we create a class that inherits from audinterface.Feature and audobject.Object.

import audobject

class MeanStd(audinterface.Feature, audobject.Object):

    def __init__(self):
        super().__init__(
            ["mean", "std"],
            process_func=self.features,
        )

    def features(self, signal, sampling_rate):
        return [signal.mean(), signal.std()]

fex = MeanStd()
df = fex.process_index(index)
df
mean std
file start end
.../03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.000311 0.082317
.../03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.000312 0.125304
.../16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.000464 0.095558

The advantage of the feature extraction object is that we can save it to a YAML file and re-instantiate it from there.

fex.to_yaml("mean-std.yaml")
fex2 = audobject.from_yaml("mean-std.yaml")
df = fex2.process_index(index)
df
mean std
file start end
.../03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.000311 0.082317
.../03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.000312 0.125304
.../16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.000464 0.095558

Segmentation interface

When the result of the processing function is an index it is recommended to use audinterface.Segment, which returns a segmented index conform to audformat. An example for such a processing function would be a voice activity detection algorithm.

import auditok

def segments(signal, sampling_rate):

    # Convert floating point array to 16bit PCM little-endian
    ints = (signal[0, :] * 32767).astype(np.int16)
    little_endian = ints.astype("<u2")
    signal = little_endian.tobytes()

    regions = auditok.split(
        signal,
        sampling_rate=sampling_rate,
        sample_width=2,
        channels=1,
        min_dur=0.2,
        energy_threshold=70,
    )
    index = pd.MultiIndex.from_tuples(
        [
            (
                pd.Timedelta(region.meta.start, unit="s"),
                pd.Timedelta(region.meta.end, unit="s"),
            )
            for region in regions
        ],
        names=["start", "end"],
    )
    return index

interface = audinterface.Segment(process_func=segments)
idx = interface.process_file(files[0])
idx
file start end
.../03a01Fa.wav 0 days 00:00:00.150000 0 days 00:00:00.700000
0 days 00:00:00.900000 0 days 00:00:01.600000

Sometimes, it is required that a table (i.e., pandas.Series or :class`pandas.DataFrame`) is segmented and the labels from the original segments should be kept. For this, audinterface.Segment has a dedicated method process_table(). This method is useful, if a segmentation (e.g., voice activity detection) is performed on an already labelled dataset in order to do data augmentation or teacher-student training.

table = pd.DataFrame({"label": [n * 2 for n in range(len(index))]}, index=index)
table
label
file
.../03a01Fa.wav 0
.../03a01Nc.wav 2
.../16b10Wb.wav 4
table_segmented = interface.process_table(table)
table_segmented
label
file start end
.../03a01Fa.wav 0 days 00:00:00.150000 0 days 00:00:00.700000 0
0 days 00:00:00.900000 0 days 00:00:01.600000 0
.../03a01Nc.wav 0 days 00:00:00.100000 0 days 00:00:01.350000 2
.../16b10Wb.wav 0 days 00:00:00.300000 0 days 00:00:01 4
0 days 00:00:01.050000 0 days 00:00:02.500000 4

Special processing function arguments

There are some special arguments to the processing function, which will be automatically set if they are not specified in process_func_args:

argument

value

idx

running index

file

file path

root

root folder

The following processing function returns the values of "idx" and "file".

def special_args(signal, sampling_rate, idx, file):
    return idx, os.path.basename(file)

interface = audinterface.Process(process_func=special_args)
y = interface.process_files(files)
y
file start end
.../03a01Fa.wav 0 days 0 days 00:00:01.898250 (0, 03a01Fa.wav)
.../03a01Nc.wav 0 days 0 days 00:00:01.611250 (1, 03a01Nc.wav)
.../16b10Wb.wav 0 days 0 days 00:00:02.522499999 (2, 16b10Wb.wav)

For instance, we can pass a list with gender labels to the processing function and use the running index to select the appropriate f0 range.

gender = db["files"]["speaker"].get(map="gender")  # gender per file
f0_range = {
    "female": [160, 300],  # [fmin, fmax]
    "male": [60, 180],
}

def f0(signal, sampling_rate, idx, gender, f0_range):
    # extract mean f0 using a gender adapted range
    y = librosa.yin(
        signal,
        fmin=f0_range[gender.iloc[idx]][0],
        fmax=f0_range[gender.iloc[idx]][1],
        sr=sampling_rate,
    ).mean()
    return y, gender.iloc[idx]

interface = audinterface.Feature(
    ["f0", "gender"],
    process_func=f0,
    process_func_args={
        "gender": gender,
        "f0_range": f0_range,
    },
)
df = interface.process_index(gender.index)
df
f0 gender
file start end
.../03a01Fa.wav 0 days 0 days 00:00:01.898250 128.8100011977164 male
.../03a01Nc.wav 0 days 0 days 00:00:01.611250 111.63351213181389 male
.../16b10Wb.wav 0 days 0 days 00:00:02.522499999 229.09341877352415 female