Usage¶
The basic idea of audinterface is to provide easy and standardized interfaces for applying a machine learning model, or other digital signal processing algorithms, to audio files. The only prerequisite is that the algorithm provides a callable that takes at least the signal as a numpy.ndarray and the sampling rate as input.
The interface can then apply the algorithm to a list of files, a folder, or an index that conforms to the audformat database specification. Results are always returned with a segmented index. In the following, we load three files from the emodb database and define a list of files, a folder, and an index.
import audb
import os
media = [
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.3.0",
    media=media,
    full_path=False,
    verbose=False,
)
files = list(db.files)
folder = os.path.join(db.root, os.path.dirname(files[0]))
index = db["emotion"].index
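As a minimal sketch of the prerequisite described above, the following hypothetical function `duration` is a callable that takes the signal and the sampling rate. It assumes the signal arrives as a numpy.ndarray of shape (channels, samples); the function name and the synthetic signal are illustrative and not part of audinterface.

```python
import numpy as np


def duration(signal, sampling_rate):
    """Return the duration of the signal in seconds."""
    # Assumption: samples are stored along the last axis
    return signal.shape[-1] / sampling_rate


# Synthetic one-channel signal used instead of a file from emodb
sampling_rate = 16000
signal = np.zeros((1, 30372), dtype=np.float32)
seconds = duration(signal, sampling_rate)
print(seconds)
```

Any function with this signature could in principle be passed as process_func to the interfaces introduced in the following sections.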
Processing interface¶
Let’s assume we want to calculate the root mean square (RMS) value in dB. We first define the function and create an interface for it using audinterface.Process.
import audinterface
import numpy as np
def rms(signal, sampling_rate):
    return 20 * np.log10(np.sqrt(np.mean(signal ** 2)))
interface = audinterface.Process(process_func=rms)
You can then use one of the process_files(), process_folder(), or process_index() methods to apply the algorithm and return the result as a pandas.Series.
>>> interface.process_index(index, root=db.root)
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -21.6901
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -18.0407
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -20.3945
dtype: float32
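To get a feeling for the values returned by `rms`, it can help to apply the function to signals with a known level. The sketch below repeats the function and uses synthetic sine signals, which are an assumption made for illustration: a full-scale sine has an RMS of 1/sqrt(2), which is about -3.01 dB, and halving the amplitude lowers the level by about 6.02 dB.

```python
import numpy as np


def rms(signal, sampling_rate):
    return 20 * np.log10(np.sqrt(np.mean(signal ** 2)))


sampling_rate = 16000
t = np.arange(sampling_rate) / sampling_rate
# One second of a 100 Hz sine, shaped as (channels, samples)
sine = np.sin(2 * np.pi * 100 * t).reshape(1, -1)

full_scale = rms(sine, sampling_rate)
half_scale = rms(0.5 * sine, sampling_rate)
print(round(full_scale, 2), round(half_scale, 2))
```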
To calculate RMS with a sliding window, we create a new interface and set a window and hop duration.
>>> interface = audinterface.Process(process_func=rms, win_dur=1.0, hop_dur=0.5)
>>> interface.process_files(files, root=db.root)
file start end
wav/03a01Fa.wav 0 days 00:00:00 0 days 00:00:01 -20.1652
0 days 00:00:00.500000 0 days 00:00:01.500000 -23.4730
wav/03a01Nc.wav 0 days 00:00:00 0 days 00:00:01 -16.3866
0 days 00:00:00.500000 0 days 00:00:01.500000 -19.5026
wav/16b10Wb.wav 0 days 00:00:00 0 days 00:00:01 -21.7340
0 days 00:00:00.500000 0 days 00:00:01.500000 -20.2331
0 days 00:00:01 0 days 00:00:02 -18.8565
0 days 00:00:01.500000 0 days 00:00:02.500000 -20.4036
dtype: float32
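The start and end times in the output above are consistent with frames of length win_dur placed every hop_dur, where frames that would extend beyond the end of the signal are dropped. The following sketch reproduces that framing under this assumption. `frame_bounds` is a hypothetical helper and not audinterface's implementation; the sample counts correspond to the durations of the first and third file at 16 kHz.

```python
def frame_bounds(num_samples, sampling_rate, win_dur, hop_dur):
    """Return (start, end) times in seconds of all complete frames."""
    win = int(round(win_dur * sampling_rate))
    hop = int(round(hop_dur * sampling_rate))
    # Only keep frames that fit completely into the signal
    starts = range(0, num_samples - win + 1, hop)
    return [(s / sampling_rate, (s + win) / sampling_rate) for s in starts]


sampling_rate = 16000
short = frame_bounds(30372, sampling_rate, win_dur=1.0, hop_dur=0.5)
long = frame_bounds(40360, sampling_rate, win_dur=1.0, hop_dur=0.5)
print(short)
print(long)
```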
Feature interface¶
When the result of the processing function has multiple dimensions, it is recommended to use audinterface.Feature, which returns a pandas.DataFrame and assigns names to the dimensions/features.
def features(signal, sampling_rate):
    return [signal.mean(), signal.std()]
interface = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
)
>>> interface.process_index(index, root=db.root)
mean std
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.0003 0.0823
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.0003 0.1253
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.0005 0.0956
To calculate features with a sliding window, we create a new interface and set a window and hop duration. By setting process_func_applies_sliding_window=False, the windowing is handled automatically and single frames are passed on to the processing function.
interface = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
    process_func_applies_sliding_window=False,
    win_dur=1.0,
    hop_dur=0.5,
)
>>> interface.process_files(files, root=db.root)
mean std
file start end
wav/03a01Fa.wav 0 days 00:00:00 0 days 00:00:01 -3.2866e-04 0.0981
0 days 00:00:00.500000 0 days 00:00:01.500000 -2.8513e-04 0.0670
wav/03a01Nc.wav 0 days 00:00:00 0 days 00:00:01 3.8935e-05 0.1516
0 days 00:00:00.500000 0 days 00:00:01.500000 -4.1219e-04 0.1059
wav/16b10Wb.wav 0 days 00:00:00 0 days 00:00:01 -4.5467e-04 0.0819
0 days 00:00:00.500000 0 days 00:00:01.500000 -4.6149e-04 0.0974
0 days 00:00:01 0 days 00:00:02 -4.6923e-04 0.1141
0 days 00:00:01.500000 0 days 00:00:02.500000 -4.4670e-04 0.0955
Feature interface for multi-channel input¶
By default, an interface will process the first channel of an audio signal. We can verify this by running the previous interface on the following multi-channel signal.
import audiofile
signal, sampling_rate = audiofile.read(
    os.path.join(db.root, files[0]),
    always_2d=True,
)
signal_multi_channel = np.concatenate(
    [
        signal,
        signal * 0,
        signal - 0.5,
        signal + 0.5,
    ],
)
>>> signal_multi_channel.shape
(4, 30372)
>>> interface.process_signal(signal_multi_channel, sampling_rate)
mean std
start end
0 days 00:00:00 0 days 00:00:01 -0.0003 0.0981
0 days 00:00:00.500000 0 days 00:00:01.500000 -0.0003 0.0670
To process the second and fourth channel, we create a new interface and set channels=[1, 3]. To reuse our processing function, we additionally set process_func_is_mono=True. This will apply the function on each channel and combine the results. Otherwise, the processing function must return an array with the correct number of channels (here 2).
interface_multi_channel = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
    process_func_is_mono=True,
    process_func_applies_sliding_window=False,
    win_dur=1.0,
    hop_dur=0.5,
    channels=[1, 3],
)
df = interface_multi_channel.process_signal(signal_multi_channel, sampling_rate)
>>> df
1 3
mean std mean std
start end
0 days 00:00:00 0 days 00:00:01 0.0 0.0 0.4997 0.0981
0 days 00:00:00.500000 0 days 00:00:01.500000 0.0 0.0 0.4997 0.0670
We can access the features of a specific channel by its index.
>>> df[3]
mean std
start end
0 days 00:00:00 0 days 00:00:01 0.4997 0.0981
0 days 00:00:00.500000 0 days 00:00:01.500000 0.4997 0.0670
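What "apply the function on each channel and combine the results" means can be sketched with plain NumPy. The helper `apply_per_channel` below is hypothetical and only illustrates the idea; it is not how audinterface is implemented, and the random signal stands in for the audio file used above.

```python
import numpy as np


def features(signal, sampling_rate):
    return [signal.mean(), signal.std()]


def apply_per_channel(func, signal, sampling_rate, channels):
    """Apply a mono function to each selected channel and stack the results."""
    return np.stack(
        [
            # Slice keeps the channel dimension, so func sees shape (1, samples)
            np.asarray(func(signal[channel : channel + 1], sampling_rate))
            for channel in channels
        ]
    )


sampling_rate = 16000
rng = np.random.default_rng(0)
mono = rng.uniform(-0.1, 0.1, size=(1, sampling_rate))
signal_multi_channel = np.concatenate([mono, mono * 0, mono - 0.5, mono + 0.5])

# Result has shape (num_channels, num_features)
result = apply_per_channel(features, signal_multi_channel, sampling_rate, channels=[1, 3])
print(result.shape)
```

As in the output above, the silent second channel yields a mean and standard deviation of zero, while the fourth channel keeps the standard deviation of the original signal and shifts its mean by 0.5.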
Feature interface for external function¶
If we interface a function from an external library that already applies a sliding window, we again specify the win_dur and hop_dur arguments. However, by setting process_func_applies_sliding_window=True, we still request that the whole signal is passed on. Now, the processing function is responsible for extracting the features in a framewise manner and returning the values in the correct shape, namely (num_channels, num_features, num_frames), where the first dimension is optional.
import librosa
def features(signal, sampling_rate, win_dur, hop_dur, n_mfcc):
    # Convert window and hop duration from seconds to samples
    hop_length = int(hop_dur * sampling_rate)
    win_length = int(win_dur * sampling_rate)
    mfcc = librosa.feature.mfcc(
        y=signal,
        sr=sampling_rate,
        n_mfcc=n_mfcc,  # use the argument instead of a hard-coded value
        hop_length=hop_length,
        win_length=win_length,
    )
    return mfcc
n_mfcc = 13
interface = audinterface.Feature(
    [f"mfcc-{idx}" for idx in range(n_mfcc)],
    process_func=features,
    process_func_args={"n_mfcc": n_mfcc},  # "win_dur" and "hop_dur" can be omitted
    process_func_applies_sliding_window=True,
    win_dur=0.02,
    hop_dur=0.01,
)
>>> interface.process_index(index, root=db.root)
mfcc-0 ... mfcc-12
file start end ...
wav/03a01Fa.wav 0 days 00:00:00 0 days 00:00:00.020000 -611.9933 ... 1.1514
0 days 00:00:00.010000 0 days 00:00:00.030000 -668.1758 ... 14.0685
0 days 00:00:00.020000 0 days 00:00:00.040000 -664.6128 ... 7.9498
0 days 00:00:00.030000 0 days 00:00:00.050000 -667.7147 ... 12.9575
0 days 00:00:00.040000 0 days 00:00:00.060000 -669.3674 ... 4.3968
... ... ... ...
wav/16b10Wb.wav 0 days 00:00:02.480000 0 days 00:00:02.500000 -664.6736 ... 1.8637
0 days 00:00:02.490000 0 days 00:00:02.510000 -658.9581 ... 9.3450
0 days 00:00:02.500000 0 days 00:00:02.520000 -644.1565 ... 7.4110
0 days 00:00:02.510000 0 days 00:00:02.530000 -618.5459 ... 17.6454
0 days 00:00:02.520000 0 days 00:00:02.540000 -666.8052 ... 3.7111
[605 rows x 13 columns]
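The expected return shape can also be illustrated without an external library. The following sketch defines a hypothetical function `framewise_mean_std` that applies its own sliding window and returns an array of shape (num_features, num_frames), omitting the optional channel dimension. Its naive framing drops incomplete frames and uses no padding, so the number of frames will generally differ from what librosa returns.

```python
import numpy as np


def framewise_mean_std(signal, sampling_rate, win_dur, hop_dur):
    """Return mean and std per frame with shape (num_features, num_frames)."""
    win_length = int(round(win_dur * sampling_rate))
    hop_length = int(round(hop_dur * sampling_rate))
    mono = signal[0]  # assumption: signal has shape (channels, samples)
    starts = range(0, len(mono) - win_length + 1, hop_length)
    frames = np.stack([mono[start : start + win_length] for start in starts])
    # First axis: features (mean, std); second axis: frames
    return np.stack([frames.mean(axis=1), frames.std(axis=1)])


sampling_rate = 16000
rng = np.random.default_rng(0)
signal = rng.uniform(-0.1, 0.1, size=(1, sampling_rate))
values = framewise_mean_std(signal, sampling_rate, win_dur=0.02, hop_dur=0.01)
print(values.shape)
```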
Serializable feature interface¶
To use a feature extractor as an input transform of a machine learning model, it is recommended to provide it in a serializable way so it can be stored as part of the model. One example of such a feature extractor is opensmile.Smile. To create such a feature extractor, we create a class that inherits from audinterface.Feature and audobject.Object.
import audobject
class MeanStd(audinterface.Feature, audobject.Object):
    def __init__(self):
        super().__init__(
            ["mean", "std"],
            process_func=self.features,
        )
    def features(self, signal, sampling_rate):
        return [signal.mean(), signal.std()]
fex = MeanStd()
>>> fex.process_index(index, root=db.root)
mean std
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.0003 0.0823
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.0003 0.1253
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.0005 0.0956
The advantage of the feature extraction object is that we can save it to a YAML file and re-instantiate it from there.
>>> fex.to_yaml("mean-std.yaml")
>>> fex2 = audobject.from_yaml("mean-std.yaml")
>>> fex2.process_index(index, root=db.root)
mean std
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.0003 0.0823
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.0003 0.1253
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.0005 0.0956
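The general idea behind such a round trip, storing the class name together with its constructor arguments and rebuilding the object from them, can be sketched with the standard library. Everything in this example is hypothetical (the class `MeanStdConfig`, the `REGISTRY` lookup, and the JSON layout); it is not audobject's file format or implementation, and it uses JSON instead of YAML only to stay dependency-free.

```python
import json


class MeanStdConfig:
    """Hypothetical stand-in for a serializable feature extractor."""

    def __init__(self, feature_names=("mean", "std")):
        self.feature_names = list(feature_names)

    def to_dict(self):
        # Store what is needed to call the constructor again
        return {
            "class": type(self).__name__,
            "arguments": {"feature_names": self.feature_names},
        }


# Hypothetical lookup table from class names to classes
REGISTRY = {"MeanStdConfig": MeanStdConfig}


def from_dict(config):
    """Re-instantiate an object from its stored class name and arguments."""
    return REGISTRY[config["class"]](**config["arguments"])


fex = MeanStdConfig()
serialized = json.dumps(fex.to_dict())
fex2 = from_dict(json.loads(serialized))
print(fex2.feature_names)
```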
Segmentation interface¶
When the result of the processing function is an index, it is recommended to use audinterface.Segment, which returns a segmented index that conforms to audformat. An example of such a processing function would be a voice activity detection algorithm.
import auditok
import pandas as pd
def segments(signal, sampling_rate):
    # Convert floating point array to 16bit PCM little-endian
    ints = (signal[0, :] * 32767).astype(np.int16)
    little_endian = ints.astype("<u2")
    signal = little_endian.tobytes()
    regions = auditok.split(
        signal,
        sampling_rate=sampling_rate,
        sample_width=2,
        channels=1,
        min_dur=0.2,
        energy_threshold=70,
    )
    index = pd.MultiIndex.from_tuples(
        [
            (
                pd.Timedelta(region.start, unit="s"),
                pd.Timedelta(region.end, unit="s"),
            )
            for region in regions
        ],
        names=["start", "end"],
    )
    return index
interface = audinterface.Segment(process_func=segments)
>>> interface.process_file(files[0], root=db.root)
MultiIndex([('wav/03a01Fa.wav', '0 days 00:00:00.150000', ...),
('wav/03a01Fa.wav', '0 days 00:00:00.900000', ...)],
names=['file', 'start', 'end'])
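A processing function for segmentation does not need an external library, as long as it returns a pandas.MultiIndex with start and end levels like the one above. The following sketch uses a naive frame energy threshold; the function `energy_segments`, its parameters frame_dur and threshold_db, and the synthetic test signal are assumptions made for illustration and are not related to auditok's algorithm.

```python
import numpy as np
import pandas as pd


def energy_segments(signal, sampling_rate, frame_dur=0.05, threshold_db=-40.0):
    """Return a (start, end) index of regions whose frame RMS exceeds threshold_db."""
    mono = np.atleast_2d(signal)[0]
    frame_len = int(round(frame_dur * sampling_rate))
    num_frames = len(mono) // frame_len
    frames = mono[: num_frames * frame_len].reshape(num_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Clip to avoid log10(0) for digital silence
    active = 20 * np.log10(np.maximum(rms, 1e-10)) > threshold_db

    # Merge consecutive active frames into (first_frame, end_frame) regions
    regions, start = [], None
    for n, is_active in enumerate(active):
        if is_active and start is None:
            start = n
        elif not is_active and start is not None:
            regions.append((start, n))
            start = None
    if start is not None:
        regions.append((start, num_frames))

    return pd.MultiIndex.from_tuples(
        [
            (
                pd.Timedelta(first * frame_len / sampling_rate, unit="s"),
                pd.Timedelta(end * frame_len / sampling_rate, unit="s"),
            )
            for first, end in regions
        ],
        names=["start", "end"],
    )


# Synthetic signal: silence, half a second of a 440 Hz tone, silence
sampling_rate = 16000
t = np.arange(sampling_rate) / sampling_rate
signal = np.zeros((1, sampling_rate))
signal[0, 4000:12000] = 0.5 * np.sin(2 * np.pi * 440 * t[4000:12000])
index = energy_segments(signal, sampling_rate)
print(index)
```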
Sometimes, it is required that a table (i.e., a pandas.Series or pandas.DataFrame) is segmented and the labels from the original segments are kept. For this, audinterface.Segment has a dedicated method process_table(). This method is useful if a segmentation (e.g., voice activity detection) is performed on an already labelled dataset in order to do data augmentation or teacher-student training.
>>> table = pd.DataFrame({"label": [n * 2 for n in range(len(index))]}, index=index)
>>> table
label
file
wav/03a01Fa.wav 0
wav/03a01Nc.wav 2
wav/16b10Wb.wav 4
>>> interface.process_table(table, root=db.root)
label
file start end
wav/03a01Fa.wav 0 days 00:00:00.150000 0 days 00:00:00.700000 0
0 days 00:00:00.900000 0 days 00:00:01.600000 0
wav/03a01Nc.wav 0 days 00:00:00.100000 0 days 00:00:01.350000 2
wav/16b10Wb.wav 0 days 00:00:00.300000 0 days 00:00:01 4
0 days 00:00:01.050000 0 days 00:00:02.500000 4
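The effect shown above, where every new segment inherits the label of the file it was cut from, can be sketched with pandas alone. The file names, labels, and segment boundaries below are made up for illustration, and the code is not the implementation of process_table().

```python
import pandas as pd

# Hypothetical file-level labels and detected segments in seconds
labels = pd.Series({"a.wav": 0, "b.wav": 2}, name="label")
detected = {
    "a.wav": [(0.15, 0.70), (0.90, 1.60)],
    "b.wav": [(0.10, 1.35)],
}

# One row per detected segment, keyed by (file, start, end)
rows = [
    (file, pd.Timedelta(start, unit="s"), pd.Timedelta(end, unit="s"))
    for file, regions in detected.items()
    for start, end in regions
]
segmented_index = pd.MultiIndex.from_tuples(rows, names=["file", "start", "end"])

# Each segment inherits the label of its file
segmented = pd.Series(
    [labels[file] for file, _, _ in rows],
    index=segmented_index,
    name="label",
)
print(segmented)
```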
Special processing function arguments¶
There are some special arguments to the processing function, which will be automatically set if they are not specified in process_func_args:
argument | value |
---|---|
idx | running index |
file | file path |
root | root folder |
The following processing function returns the values of "idx" and "file".
def special_args(signal, sampling_rate, idx, file):
    return idx, os.path.basename(file)
interface = audinterface.Process(process_func=special_args)
>>> interface.process_files(files, root=db.root)
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 (0, 03a01Fa.wav)
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 (1, 03a01Nc.wav)
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 (2, 16b10Wb.wav)
dtype: object
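One plausible way to provide such special arguments is to inspect the signature of the processing function and fill in only those names it declares and that the user has not set. The helper `call_with_special_args` below is a hypothetical sketch of that idea and makes no claim about how audinterface implements it; the root folder `/data` is made up.

```python
import inspect
import os


def call_with_special_args(
    process_func, signal, sampling_rate, *, idx, file, root, process_func_args=None
):
    """Call process_func and fill in idx, file, and root where requested."""
    kwargs = dict(process_func_args or {})
    special = {"idx": idx, "file": file, "root": root}
    parameters = inspect.signature(process_func).parameters
    for name, value in special.items():
        # Only set arguments the function declares and the user did not provide
        if name in parameters and name not in kwargs:
            kwargs[name] = value
    return process_func(signal, sampling_rate, **kwargs)


def special_args(signal, sampling_rate, idx, file):
    return idx, os.path.basename(file)


result = call_with_special_args(
    special_args, [0.0], 16000, idx=2, file="wav/16b10Wb.wav", root="/data"
)
print(result)
```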
For instance, we can pass a pandas.Series with gender labels to the processing function and use the running index to select the appropriate f0 range.
gender = db["files"]["speaker"].get(map="gender") # gender per file
f0_range = {
    "female": [160, 300],  # [fmin, fmax]
    "male": [60, 180],
}
def f0(signal, sampling_rate, idx, gender, f0_range):
    # extract mean f0 using a gender adapted range
    y = librosa.yin(
        signal,
        fmin=f0_range[gender.iloc[idx]][0],
        fmax=f0_range[gender.iloc[idx]][1],
        sr=sampling_rate,
    ).mean().round(2)
    return y, gender.iloc[idx]
interface = audinterface.Feature(
    ["f0", "gender"],
    process_func=f0,
    process_func_args={
        "gender": gender,
        "f0_range": f0_range,
    },
)
>>> interface.process_index(gender.index, root=db.root)
f0 gender
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 128.81 male
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 111.63 male
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 229.09 female
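The lookup that happens inside the processing function can be isolated in a few lines. The sketch below rebuilds the gender labels by hand from the output above instead of loading them from the database, and the helper `select_f0_range` is hypothetical. It assumes the interface is run on gender.index, as in the call above, so that the running index refers to the same row as the label.

```python
import pandas as pd

# Gender labels per file, copied from the output above
gender = pd.Series(
    ["male", "male", "female"],
    index=["wav/03a01Fa.wav", "wav/03a01Nc.wav", "wav/16b10Wb.wav"],
)
f0_range = {
    "female": [160, 300],  # [fmin, fmax]
    "male": [60, 180],
}


def select_f0_range(idx, gender, f0_range):
    """Return (fmin, fmax) for the file at position idx."""
    fmin, fmax = f0_range[gender.iloc[idx]]
    return fmin, fmax


ranges = [select_f0_range(idx, gender, f0_range) for idx in range(len(gender))]
print(ranges)
```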