Usage
The basic idea of audinterface is
to provide easy and standardized interfaces
for applying a machine learning model
or another digital signal processing algorithm
to audio files.
The only prerequisite is
that the algorithm provides a callable
that takes at least the signal
as a numpy.ndarray
and the sampling rate as input.
The interface can then apply the algorithm
to a list of files,
a folder,
or an index conforming to the audformat database specification.
Results are always returned with a segmented index.
In the following,
we load three files from the emodb database
and define a list of files,
a folder,
and an index.
import audb
import os
media = [
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.3.0",
    media=media,
    full_path=False,
    verbose=False,
)
files = list(db.files)
folder = os.path.join(db.root, os.path.dirname(files[0]))
index = db["emotion"].index
Processing interface
Let’s assume we want to calculate the root mean square (RMS)
value in dB.
We first define the function
and create an interface for it using audinterface.Process.
import audinterface
import numpy as np
def rms(signal, sampling_rate):
    return 20 * np.log10(np.sqrt(np.mean(signal ** 2)))
interface = audinterface.Process(process_func=rms)
You can then use one of the
process_files(),
process_folder(),
or process_index() methods
to apply the algorithm
and return the result
as a pandas.Series.
>>> interface.process_index(index, root=db.root)
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -21.6901
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -18.0407
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -20.3945
dtype: float32
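As a quick sanity check of the rms function, independent of audinterface, we can apply it to a synthetic signal whose level is known. The sketch below assumes a full-scale sine tone; the frequency of 440 Hz and the sampling rate of 16 kHz are arbitrary choices for illustration. The RMS of such a tone is 1/sqrt(2), which corresponds to roughly -3.01 dB.

```python
import numpy as np


def rms(signal, sampling_rate):
    # Same function as above: RMS level in dB relative to full scale
    return 20 * np.log10(np.sqrt(np.mean(signal ** 2)))


sampling_rate = 16000  # assumed for illustration
t = np.arange(sampling_rate) / sampling_rate  # one second of sample times
# Full-scale 440 Hz sine with shape (1, 16000), i.e. (channels, samples)
sine = np.sin(2 * np.pi * 440 * t).astype(np.float32)[None, :]

level = rms(sine, sampling_rate)
print(round(float(level), 2))  # -3.01
```

Note that for an all-zero signal the expression evaluates to -inf and NumPy emits a divide-by-zero warning, so silent input may need special handling.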
To calculate RMS with a sliding window, we create a new interface and set a window and hop duration.
>>> interface = audinterface.Process(process_func=rms, win_dur=1.0, hop_dur=0.5)
>>> interface.process_files(files, root=db.root)
file start end
wav/03a01Fa.wav 0 days 00:00:00 0 days 00:00:01 -20.1652
0 days 00:00:00.500000 0 days 00:00:01.500000 -23.4730
wav/03a01Nc.wav 0 days 00:00:00 0 days 00:00:01 -16.3866
0 days 00:00:00.500000 0 days 00:00:01.500000 -19.5026
wav/16b10Wb.wav 0 days 00:00:00 0 days 00:00:01 -21.7340
0 days 00:00:00.500000 0 days 00:00:01.500000 -20.2331
0 days 00:00:01 0 days 00:00:02 -18.8565
0 days 00:00:01.500000 0 days 00:00:02.500000 -20.4036
dtype: float32
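The segment boundaries in the output above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not audinterface's actual implementation: it assumes that only frames fitting completely into the signal are kept, which is consistent with the output shown. The helper name frame_bounds is hypothetical, and the sample count is derived from the file duration at the 16 kHz sampling rate of emodb.

```python
def frame_bounds(num_samples, sampling_rate, win_dur, hop_dur):
    """Return (start, end) times in seconds of all complete frames."""
    win = round(win_dur * sampling_rate)  # window length in samples
    hop = round(hop_dur * sampling_rate)  # hop length in samples
    if num_samples < win:
        return []  # signal is shorter than a single window
    num_frames = (num_samples - win) // hop + 1
    return [
        (n * hop / sampling_rate, (n * hop + win) / sampling_rate)
        for n in range(num_frames)
    ]


# wav/16b10Wb.wav lasts about 2.5225 s, i.e. 40360 samples at 16 kHz
for start, end in frame_bounds(40360, 16000, win_dur=1.0, hop_dur=0.5):
    print(start, end)
# 0.0 1.0
# 0.5 1.5
# 1.0 2.0
# 1.5 2.5
```

The last 0.0225 s of the file are not covered by any frame, because a further window starting at 2.0 s would extend beyond the end of the signal.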
Feature interface
When the result of the processing function has multiple dimensions
it is recommended to use audinterface.Feature,
which returns a pandas.DataFrame
and assigns names to the dimensions/features.
def features(signal, sampling_rate):
    return [signal.mean(), signal.std()]
interface = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
)
>>> interface.process_index(index, root=db.root)
mean std
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.0003 0.0823
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.0003 0.1253
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.0005 0.0956
To calculate features with a sliding window,
we create a new interface
and set a window and hop duration.
By setting
process_func_applies_sliding_window=False
the windowing is automatically handled
and single frames are passed on to the processing function.
interface = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
    process_func_applies_sliding_window=False,
    win_dur=1.0,
    hop_dur=0.5,
)
>>> interface.process_files(files, root=db.root)
mean std
file start end
wav/03a01Fa.wav 0 days 00:00:00 0 days 00:00:01 -3.2866e-04 0.0981
0 days 00:00:00.500000 0 days 00:00:01.500000 -2.8513e-04 0.0670
wav/03a01Nc.wav 0 days 00:00:00 0 days 00:00:01 3.8935e-05 0.1516
0 days 00:00:00.500000 0 days 00:00:01.500000 -4.1219e-04 0.1059
wav/16b10Wb.wav 0 days 00:00:00 0 days 00:00:01 -4.5467e-04 0.0819
0 days 00:00:00.500000 0 days 00:00:01.500000 -4.6149e-04 0.0974
0 days 00:00:01 0 days 00:00:02 -4.6923e-04 0.1141
0 days 00:00:01.500000 0 days 00:00:02.500000 -4.4670e-04 0.0955
Feature interface for multi-channel input
By default, an interface will process the first channel of an audio signal. We can verify this by running the previous interface on the following multi-channel signal.
import audiofile
signal, sampling_rate = audiofile.read(
    os.path.join(db.root, files[0]),
    always_2d=True,
)
signal_multi_channel = np.concatenate(
    [
        signal,
        signal * 0,
        signal - 0.5,
        signal + 0.5,
    ],
)
>>> signal_multi_channel.shape
(4, 30372)
>>> interface.process_signal(signal_multi_channel, sampling_rate)
mean std
start end
0 days 00:00:00 0 days 00:00:01 -0.0003 0.0981
0 days 00:00:00.500000 0 days 00:00:01.500000 -0.0003 0.0670
To process the second and fourth channel,
we create a new interface
and set
channels=[1, 3].
To reuse our processing function,
we additionally set
process_func_is_mono=True.
This will apply the function
on each channel and combine the results.
Otherwise,
the processing function must
return an array with the correct
number of channels (here 2).
interface_multi_channel = audinterface.Feature(
    ["mean", "std"],
    process_func=features,
    process_func_is_mono=True,
    process_func_applies_sliding_window=False,
    win_dur=1.0,
    hop_dur=0.5,
    channels=[1, 3],
)
df = interface_multi_channel.process_signal(signal_multi_channel, sampling_rate)
>>> df
1 3
mean std mean std
start end
0 days 00:00:00 0 days 00:00:01 0.0 0.0 0.4997 0.0981
0 days 00:00:00.500000 0 days 00:00:01.500000 0.0 0.0 0.4997 0.0670
We can access the features of a specific channel by its index.
>>> df[3]
mean std
start end
0 days 00:00:00 0 days 00:00:01 0.4997 0.0981
0 days 00:00:00.500000 0 days 00:00:01.500000 0.4997 0.0670
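Conceptually, combining process_func_is_mono=True with channels=[1, 3] means that the mono function is called once per selected channel and the results are stacked. The sketch below illustrates this idea with plain numpy and without windowing. It is not how audinterface is implemented; the helper apply_per_channel is hypothetical, and random noise stands in for the recording.

```python
import numpy as np


def features(signal, sampling_rate):
    # Mono processing function from above
    return [signal.mean(), signal.std()]


def apply_per_channel(func, signal, sampling_rate, channels):
    """Apply a mono function to the selected channels and stack the results."""
    results = [
        # Slice instead of index to keep the 2D shape (1, samples)
        func(signal[channel:channel + 1, :], sampling_rate)
        for channel in channels
    ]
    return np.array(results)  # shape (num_channels, num_features)


rng = np.random.default_rng(0)
base = 0.1 * rng.standard_normal((1, 16000))  # synthetic stand-in for a recording
signal_multi_channel = np.concatenate([base, base * 0, base - 0.5, base + 0.5])

values = apply_per_channel(features, signal_multi_channel, 16000, channels=[1, 3])
print(values.shape)  # (2, 2)
```

As in the output above, the silent channel 1 yields a mean and standard deviation of zero, whereas channel 3 has the mean shifted by 0.5 and the standard deviation of the original signal.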
Feature interface for external function
If we interface a function from an external library
that already applies a sliding window,
we again specify the
win_dur and hop_dur
arguments.
However,
by setting
process_func_applies_sliding_window=True
we still request that the whole signal is passed on.
Now,
the processing function is responsible
for extracting the features in a framewise manner
and returning the values in the correct shape,
namely (num_channels, num_features, num_frames),
where the first dimension is optional.
import librosa
def features(signal, sampling_rate, win_dur, hop_dur, n_mfcc):
    # Convert window and hop duration from seconds to samples
    hop_length = int(hop_dur * sampling_rate)
    win_length = int(win_dur * sampling_rate)
    mfcc = librosa.feature.mfcc(
        y=signal,
        sr=sampling_rate,
        n_mfcc=n_mfcc,
        hop_length=hop_length,
        win_length=win_length,
    )
    return mfcc
n_mfcc = 13
interface = audinterface.Feature(
    [f"mfcc-{idx}" for idx in range(n_mfcc)],
    process_func=features,
    process_func_args={"n_mfcc": n_mfcc},  # "win_dur" and "hop_dur" can be omitted
    process_func_applies_sliding_window=True,
    win_dur=0.02,
    hop_dur=0.01,
)
>>> interface.process_index(index, root=db.root)
mfcc-0 ... mfcc-12
file start end ...
wav/03a01Fa.wav 0 days 00:00:00 0 days 00:00:00.020000 -611.9933 ... 1.1514
0 days 00:00:00.010000 0 days 00:00:00.030000 -668.1758 ... 14.0685
0 days 00:00:00.020000 0 days 00:00:00.040000 -664.6128 ... 7.9498
0 days 00:00:00.030000 0 days 00:00:00.050000 -667.7147 ... 12.9575
0 days 00:00:00.040000 0 days 00:00:00.060000 -669.3674 ... 4.3968
... ... ... ...
wav/16b10Wb.wav 0 days 00:00:02.480000 0 days 00:00:02.500000 -664.6736 ... 1.8637
0 days 00:00:02.490000 0 days 00:00:02.510000 -658.9581 ... 9.3450
0 days 00:00:02.500000 0 days 00:00:02.520000 -644.1565 ... 7.4110
0 days 00:00:02.510000 0 days 00:00:02.530000 -618.5459 ... 17.6454
0 days 00:00:02.520000 0 days 00:00:02.540000 -666.8052 ... 3.7111
[605 rows x 13 columns]
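To make the expected return shape more tangible without depending on librosa, here is a numpy-only sketch of a function that applies its own sliding window and returns an array of shape (num_channels, num_features, num_frames). The function name framewise_mean_std and the sampling rate are assumptions for illustration. Unlike librosa with its default settings, the sketch does not pad the signal, so the number of frames differs from the MFCC example above.

```python
import numpy as np


def framewise_mean_std(signal, sampling_rate, win_dur, hop_dur):
    """Return mean and std per frame with shape (channels, 2, frames)."""
    win_length = round(win_dur * sampling_rate)
    hop_length = round(hop_dur * sampling_rate)
    # Build all windows of win_length samples along the time axis
    # and keep every hop_length-th of them
    frames = np.lib.stride_tricks.sliding_window_view(
        signal, win_length, axis=1
    )[:, ::hop_length]  # shape (channels, frames, win_length)
    # Stack the two features along a new second axis
    return np.stack([frames.mean(axis=-1), frames.std(axis=-1)], axis=1)


sampling_rate = 8000  # assumed for illustration
signal = np.random.default_rng(0).uniform(-1, 1, (1, sampling_rate))
values = framewise_mean_std(signal, sampling_rate, win_dur=0.02, hop_dur=0.01)
print(values.shape)  # (1, 2, 99)
```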
Serializable feature interface
To use a feature extractor as an input transform
of a machine learning model,
it is recommended to provide it in a serializable way
so it can be stored as part of the model.
One example of such a feature extractor is opensmile.Smile.
To create such a feature extractor,
we create a class that inherits
from audinterface.Feature
and audobject.Object.
import audobject
class MeanStd(audinterface.Feature, audobject.Object):
    def __init__(self):
        super().__init__(
            ["mean", "std"],
            process_func=self.features,
        )
    def features(self, signal, sampling_rate):
        return [signal.mean(), signal.std()]
fex = MeanStd()
>>> fex.process_index(index, root=db.root)
mean std
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.0003 0.0823
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.0003 0.1253
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.0005 0.0956
The advantage of the feature extraction object is that we can save it to a YAML file and re-instantiate it from there.
>>> fex.to_yaml("mean-std.yaml")
>>> fex2 = audobject.from_yaml("mean-std.yaml")
>>> fex2.process_index(index, root=db.root)
mean std
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.0003 0.0823
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.0003 0.1253
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.0005 0.0956
Segmentation interface
When the result of the processing function is an index,
it is recommended to use audinterface.Segment,
which returns a segmented index conforming to audformat.
An example of such a processing function
would be a voice activity detection algorithm.
import auditok
import pandas as pd
def segments(signal, sampling_rate):
    # Convert floating point array to 16bit PCM little-endian
    ints = (signal[0, :] * 32767).astype(np.int16)
    little_endian = ints.astype("<u2")
    signal = little_endian.tobytes()
    regions = auditok.split(
        signal,
        sampling_rate=sampling_rate,
        sample_width=2,
        channels=1,
        min_dur=0.2,
        energy_threshold=70,
    )
    index = pd.MultiIndex.from_tuples(
        [
            (
                pd.Timedelta(region.start, unit="s"),
                pd.Timedelta(region.end, unit="s"),
            )
            for region in regions
        ],
        names=["start", "end"],
    )
    return index
interface = audinterface.Segment(process_func=segments)
>>> interface.process_file(files[0], root=db.root)
MultiIndex([('wav/03a01Fa.wav', '0 days 00:00:00.150000', ...),
('wav/03a01Fa.wav', '0 days 00:00:00.900000', ...)],
names=['file', 'start', 'end'])
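The first step of the processing function, the conversion of a floating point signal to 16-bit PCM bytes, can be examined in isolation. The sketch below uses a hypothetical helper float_to_pcm16 and a signed little-endian dtype ("<i2"). The clipping step is an addition that the function above does not perform. Casting the int16 values to "<u2", as done above, yields the same bytes, since the two's complement bit pattern is preserved.

```python
import numpy as np


def float_to_pcm16(signal):
    """Convert a float signal in [-1, 1] to 16-bit little-endian PCM bytes."""
    clipped = np.clip(signal, -1.0, 1.0)  # guard against integer overflow
    return (clipped * 32767).astype("<i2").tobytes()


signal = np.array([0.0, 0.5, -0.5, 1.0, -1.0], dtype=np.float32)
pcm = float_to_pcm16(signal)

# Decode the bytes again to inspect the integer sample values
decoded = np.frombuffer(pcm, dtype="<i2")
print(len(pcm), decoded.tolist())  # 10 [0, 16383, -16383, 32767, -32767]
```

Each sample occupies two bytes, which matches the sample_width=2 argument passed to auditok.split().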
Sometimes,
it is required that a table
(i.e., a pandas.Series or pandas.DataFrame)
is segmented
and the labels from the original segments
are kept.
For this,
audinterface.Segment has a dedicated method
process_table().
This method is useful
if a segmentation
(e.g., voice activity detection)
is performed on an already labelled dataset
in order to do data augmentation
or teacher-student training.
>>> table = pd.DataFrame({"label": [n * 2 for n in range(len(index))]}, index=index)
>>> table
label
file
wav/03a01Fa.wav 0
wav/03a01Nc.wav 2
wav/16b10Wb.wav 4
>>> interface.process_table(table, root=db.root)
label
file start end
wav/03a01Fa.wav 0 days 00:00:00.150000 0 days 00:00:00.700000 0
0 days 00:00:00.900000 0 days 00:00:01.600000 0
wav/03a01Nc.wav 0 days 00:00:00.100000 0 days 00:00:01.350000 2
wav/16b10Wb.wav 0 days 00:00:00.300000 0 days 00:00:01 4
0 days 00:00:01.050000 0 days 00:00:02.500000 4
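The label handling shown above boils down to every new segment inheriting the label of the row it was cut from. The following plain Python sketch illustrates only this idea; it is not how process_table() is implemented. The file names, segment times (in seconds), and labels are copied from the first two files of the output above.

```python
# Labels of the original, file-level rows
labels = {"wav/03a01Fa.wav": 0, "wav/03a01Nc.wav": 2}

# Segments found per file as (start, end) in seconds
segments = {
    "wav/03a01Fa.wav": [(0.15, 0.70), (0.90, 1.60)],
    "wav/03a01Nc.wav": [(0.10, 1.35)],
}

# Every segment inherits the label of the file it belongs to
rows = [
    (file, start, end, labels[file])
    for file, file_segments in segments.items()
    for start, end in file_segments
]
for row in rows:
    print(row)
```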
Segmentation with feature interface
In some cases,
a processing function performs
both segmentation and feature extraction.
For this,
audinterface.SegmentWithFeature can be used.
This interface returns a pandas.DataFrame
with a segmented index conforming to audformat.
An example of such a processing function
would be a speech recognition model
that also generates timestamps for its results.
from faster_whisper import WhisperModel
import pandas as pd
model_size = "tiny"
model = WhisperModel(model_size, device="cpu")
def word_transcripts(signal, sampling_rate):
    segments, _ = model.transcribe(
        signal[0], task="transcribe", word_timestamps=True
    )
    index = []
    words = []
    for segment in segments:
        for word in segment.words:
            index.append(
                (
                    pd.to_timedelta(word.start, unit="s"),
                    pd.to_timedelta(word.end, unit="s"),
                )
            )
            words.append(word.word.strip())
    index = pd.MultiIndex.from_tuples(index, names=["start", "end"])
    return pd.Series(data=words, index=index)
interface = audinterface.SegmentWithFeature(
    feature_names="word", process_func=word_transcripts
)
>>> interface.process_file(files[0], root=db.root)
word
file start end
wav/03a01Fa.wav 0 days 00:00:00 0 days 00:00:00.360000 Der
0 days 00:00:00.360000 0 days 00:00:00.720000 Lappen
0 days 00:00:00.720000 0 days 00:00:00.880000 liegt
0 days 00:00:00.880000 0 days 00:00:01.080000 auf
0 days 00:00:01.080000 0 days 00:00:01.220000 dem
0 days 00:00:01.220000 0 days 00:00:01.820000 Eisschrank.
Similarly to audinterface.Segment,
audinterface.SegmentWithFeature
also has a method
process_table(),
which can be applied to an already labelled dataset.
>>> interface.process_table(table.head(2), root=db.root)
word label
file start end
wav/03a01Fa.wav 0 days 00:00:00 0 days 00:00:00.360000 Der 0
0 days 00:00:00.360000 0 days 00:00:00.720000 Lappen 0
0 days 00:00:00.720000 0 days 00:00:00.880000 liegt 0
0 days 00:00:00.880000 0 days 00:00:01.080000 auf 0
0 days 00:00:01.080000 0 days 00:00:01.220000 dem 0
0 days 00:00:01.220000 0 days 00:00:01.820000 Eisschrank. 0
wav/03a01Nc.wav 0 days 00:00:00 0 days 00:00:00.240000 Der 2
0 days 00:00:00.240000 0 days 00:00:00.520000 Lappen 2
0 days 00:00:00.520000 0 days 00:00:00.660000 liegt 2
0 days 00:00:00.660000 0 days 00:00:00.820000 auf 2
0 days 00:00:00.820000 0 days 00:00:00.960000 dem 2
0 days 00:00:00.960000 0 days 00:00:01.480000 Eiscrank. 2
Special processing function arguments
There are some special arguments
to the processing function,
which will be automatically set
if they are not specified in
process_func_args:
| argument | value |
|---|---|
| idx | running index |
| file | file path |
| root | root folder |
The following processing function
returns the values of
"idx" and "file".
def special_args(signal, sampling_rate, idx, file):
    return idx, os.path.basename(file)
interface = audinterface.Process(process_func=special_args)
>>> interface.process_files(files, root=db.root)
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 (0, 03a01Fa.wav)
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 (1, 03a01Nc.wav)
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 (2, 16b10Wb.wav)
dtype: object
For instance, we can pass a pandas.Series with gender labels to the processing function and use the running index to select the appropriate f0 range.
gender = db["files"]["speaker"].get(map="gender") # gender per file
f0_range = {
    "female": [160, 300],  # [fmin, fmax]
    "male": [60, 180],
}
def f0(signal, sampling_rate, idx, gender, f0_range):
    # extract mean f0 using a gender adapted range
    y = librosa.yin(
        signal,
        fmin=f0_range[gender.iloc[idx]][0],
        fmax=f0_range[gender.iloc[idx]][1],
        sr=sampling_rate,
    ).mean().round(2)
    return y, gender.iloc[idx]
interface = audinterface.Feature(
    ["f0", "gender"],
    process_func=f0,
    process_func_args={
        "gender": gender,
        "f0_range": f0_range,
    },
)
>>> interface.process_index(gender.index, root=db.root)
f0 gender
file start end
wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 134.0 male
wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 113.16 male
wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 234.86 female
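The automatic handling of idx, file, and root described in this section can be pictured as a form of signature inspection: a special argument is only passed on if the processing function accepts it and the user has not already provided a value. The sketch below illustrates this idea with the standard library. The helper call_with_special_args and the example values are hypothetical, and audinterface's internals may work differently.

```python
import inspect
import os


def call_with_special_args(func, signal, sampling_rate, special, extra_args=None):
    """Call func, adding special arguments it accepts and that are not preset."""
    kwargs = dict(extra_args or {})  # plays the role of process_func_args
    accepted = inspect.signature(func).parameters
    for name, value in special.items():
        if name in accepted and name not in kwargs:
            kwargs[name] = value
    return func(signal, sampling_rate, **kwargs)


def special_args(signal, sampling_rate, idx, file):
    return idx, os.path.basename(file)


# Hypothetical values for the first file of a processing run
special = {"idx": 0, "file": "wav/03a01Fa.wav", "root": "/path/to/db"}
print(call_with_special_args(special_args, None, 16000, special))
# (0, '03a01Fa.wav')
```

Since special_args has no root parameter, that value is silently dropped, and passing extra_args={"idx": 7} would take precedence over the running index.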