Usage

audonnx offers a simple interface to load and use models in ONNX format. Models with single or multiple input and output nodes are supported.

We begin by creating some test input: a file path, a signal array, and a segmented index in audformat.

import audiofile
import pandas as pd


file = './docs/_static/test.wav'
signal, sampling_rate = audiofile.read(
    file,
    always_2d=True,
)
index = pd.MultiIndex.from_arrays(
    [
        [file, file],
        pd.to_timedelta(['0s', '3s']),
        pd.to_timedelta(['3s', '5s']),
    ],
    names=['file', 'start', 'end'],
)

Torch model

Create a Torch model with a single input and a single output node.

import torch


class TorchModelSingle(torch.nn.Module):

    def __init__(
        self,
    ):
        super().__init__()
        self.hidden = torch.nn.Linear(18, 8)
        self.out = torch.nn.Linear(8, 2)

    def forward(self, x: torch.Tensor):
        y = self.hidden(x.mean(dim=-1))
        y = self.out(y)
        return y.squeeze()


torch_model = TorchModelSingle()

Create an openSMILE feature extractor that converts the raw audio signal to a sequence of low-level descriptors.

import opensmile


smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

Calculate features and run the Torch model.

y = smile(signal, sampling_rate)
with torch.no_grad():
    z = torch_model(torch.from_numpy(y))
z
tensor([  29.3742, -214.7639])

Export model

To export the model to ONNX format, we pass some dummy input, which allows the function to infer the correct input and output shapes. Since the number of extracted feature frames varies with the length of the input signal, we tell the function that the last dimension of the input has a dynamic size. We also assign meaningful names to the nodes.

import audeer
import os


onnx_root = audeer.mkdir('onnx')
onnx_model_path = os.path.join(onnx_root, 'model.onnx')

dummy_input = torch.randn(y.shape[1:])
torch.onnx.export(
    torch_model,
    dummy_input,
    onnx_model_path,
    input_names=['feature'],  # assign custom name to input node
    output_names=['gender'],  # assign custom name to output node
    dynamic_axes={'feature': {1: 'time'}},  # dynamic size
    opset_version=12,
)
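
Optionally, we can verify that the exported file is a valid ONNX graph. A minimal sketch, assuming the onnx package is installed:

import onnx


# Load the exported file and run the ONNX checker on it
onnx.checker.check_model(onnx.load(onnx_model_path))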

From the exported model file we now create an audonnx.Model object. We pass the feature extractor, so that the model can automatically convert the input signal to the desired representation, and we assign labels to the dimensions of the output node. Printing the model provides a summary of the input and output nodes.

import audonnx


onnx_model = audonnx.Model(
    onnx_model_path,
    labels=['female', 'male'],
    transform=smile,
)
onnx_model
Input:
  feature:
    shape: [18, -1]
    dtype: tensor(float)
    transform: opensmile.core.smile.Smile
Output:
  gender:
    shape: [2]
    dtype: tensor(float)
    labels: [female, male]

Get information for individual nodes.

onnx_model.inputs['feature']
{shape: [18, -1], dtype: tensor(float), transform: opensmile.core.smile.Smile}
print(onnx_model.inputs['feature'].transform)
$opensmile.core.smile.Smile:
  feature_set: GeMAPSv01b
  feature_level: LowLevelDescriptors
  options: {}
  sampling_rate: null
  channels:
  - 0
  mixdown: false
  resample: false

onnx_model.outputs['gender']
{shape: [2], dtype: tensor(float), labels: [female, male]}
onnx_model.outputs['gender'].labels
['female', 'male']

Check that the exported model gives the expected output.

onnx_model(signal, sampling_rate)
array([  29.374172, -214.76393 ], dtype=float32)
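
If we want to check this programmatically, here is a minimal sketch using numpy, assuming the numerical differences between the Torch and the ONNX model stay small:

import numpy as np


# Outputs of the Torch model and the exported ONNX model
# should agree up to small numerical differences
np.testing.assert_allclose(
    onnx_model(signal, sampling_rate),
    z.numpy(),
    rtol=1e-4,
)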

Create interface

audonnx.Model does not come with a fancy interface itself, but we can use audinterface to create one.

import audinterface


interface = audinterface.Feature(
    feature_names=onnx_model.outputs['gender'].labels,
    process_func=onnx_model,
)
interface.process_index(index)
                                                            female        male
file                    start           end
./docs/_static/test.wav 0 days 00:00:00 0 days 00:00:03 30.218712 -214.255234
                        0 days 00:00:03 0 days 00:00:05 28.661716 -211.555435

Or if we are only interested in the majority class.

interface.process_index(index).idxmax(axis=1)
file                     start            end
./docs/_static/test.wav  0 days 00:00:00  0 days 00:00:03    female
                         0 days 00:00:03  0 days 00:00:05    female

Save and load

Save the model to a YAML file; the stored file has the following content.

onnx_meta_path = os.path.join(onnx_root, 'model.yaml')
onnx_model.to_yaml(onnx_meta_path)
$audonnx.core.model.Model==0.7.0:
  path: model.onnx
  labels:
  - female
  - male
  transform:
    $opensmile.core.smile.Smile==2.5.0:
      feature_set: GeMAPSv01b
      feature_level: LowLevelDescriptors
      options: {}
      sampling_rate: null
      channels:
      - 0
      mixdown: false
      resample: false

Load the model from a YAML file.

import audobject

onnx_model_2 = audobject.from_yaml(onnx_meta_path)
onnx_model_2(signal, sampling_rate)
array([  29.374172, -214.76393 ], dtype=float32)

Or shorter:

onnx_model_3 = audonnx.load(onnx_root)
onnx_model_3(signal, sampling_rate)
array([  29.374172, -214.76393 ], dtype=float32)

Quantize weights

To reduce the memory footprint of a model, we can quantize it (compare the MobilenetV2 example). For instance, we can store the model weights as 8-bit integers. For quantization make sure you have installed onnx as well as onnxruntime.

import onnxruntime.quantization


onnx_infer_path = os.path.join(onnx_root, 'model_infer.onnx')
onnxruntime.quantization.quant_pre_process(
    onnx_model_path,
    onnx_infer_path,
)
onnx_quant_path = os.path.join(onnx_root, 'model_quant.onnx')
onnxruntime.quantization.quantize_dynamic(
    onnx_infer_path,
    onnx_quant_path,
    weight_type=onnxruntime.quantization.QuantType.QUInt8,
)
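
To see the effect on disk, we can compare the file sizes of the original and the quantized model. This is a rough check; the exact numbers depend on the model:

# The quantized file should be noticeably smaller
# than the original model file
for path in [onnx_model_path, onnx_quant_path]:
    print(os.path.basename(path), os.path.getsize(path), 'bytes')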

The output of the quantized model differs slightly.

onnx_model_4 = audonnx.Model(
    onnx_quant_path,
    labels=['female', 'male'],
    transform=smile,
)
onnx_model_4(signal, sampling_rate)
array([  29.231592, -212.91867 ], dtype=float32)

Custom transform

So far, we have used opensmile.Smile as a feature extractor. It derives from audobject.Object and is therefore serializable by default. However, with audonnx.Function we can turn any function into a serializable object. For instance, we can define a function that extracts Mel-frequency cepstral coefficients (MFCCs) with librosa.

def mfcc(x, sr):
    import librosa  # import here to make function self-contained
    y = librosa.feature.mfcc(
        y=x.squeeze(),
        sr=sr,
        n_mfcc=18,
    )
    return y.reshape(1, 18, -1)

As long as the function is self-contained (i.e. does not depend on external variables or imports) we can turn it into a serializable object.

transform = audonnx.Function(mfcc)
print(transform)
$audonnx.core.function.Function:
  func: "def mfcc(x, sr):\n    import librosa  # import here to make function self-contained\n\
    \    y = librosa.feature.mfcc(\n        y=x.squeeze(),\n        sr=sr,\n     \
    \   n_mfcc=18,\n    )\n    return y.reshape(1, 18, -1)\n"
  func_args: {}
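
Before using it in a model, we can call the transform directly to inspect the shape of the extracted features. A quick sketch, assuming the Function object can be called like the wrapped function with a signal and a sampling rate:

# Expected shape: (1, 18, number_of_frames),
# where the number of frames depends on the signal length
features = transform(signal, sampling_rate)
features.shape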

And use it to initialize our model.

onnx_model_5 = audonnx.Model(
    onnx_model_path,
    labels=['female', 'male'],
    transform=transform,
)
onnx_model_5
Input:
  feature:
    shape: [18, -1]
    dtype: tensor(float)
    transform: audonnx.core.function.Function(mfcc)
Output:
  gender:
    shape: [2]
    dtype: tensor(float)
    labels: [female, male]

Then we can save and load the model as before.

onnx_model_5.to_yaml(onnx_meta_path)
onnx_model_6 = audonnx.load(onnx_root)
onnx_model_6(signal, sampling_rate)
array([-32.218803,  47.5395  ], dtype=float32)

Multiple nodes

Define a model that takes the raw audio as input in addition to the features and provides two more output nodes: the output of the hidden layer and a confidence value.

class TorchModelMulti(torch.nn.Module):

    def __init__(
        self,
    ):

        super().__init__()

        self.hidden_left = torch.nn.Linear(1, 4)
        self.hidden_right = torch.nn.Linear(18, 4)
        self.out = torch.nn.ModuleDict(
            {
                'gender': torch.nn.Linear(8, 2),
                'confidence': torch.nn.Linear(8, 1),
            }
        )

    def forward(self, signal: torch.Tensor, feature: torch.Tensor):

        y_left = self.hidden_left(signal.mean(dim=-1))
        y_right = self.hidden_right(feature.mean(dim=-1))
        y_hidden = torch.cat([y_left, y_right], dim=-1)
        y_gender = self.out['gender'](y_hidden)
        y_confidence = self.out['confidence'](y_hidden)

        return (
            y_hidden.squeeze(),
            y_gender.squeeze(),
            y_confidence,
        )

Export the new model to ONNX format and load it. Note that we do not assign labels to all output nodes; where labels are missing, they are automatically created from the name of the output node. And since the first node expects the raw audio signal, we do not set a transform for it.

onnx_multi_path = os.path.join(onnx_root, 'model.onnx')

torch.onnx.export(
    TorchModelMulti(),
    (
        torch.randn(signal.shape),
        torch.randn(y.shape[1:]),
    ),
    onnx_multi_path,
    input_names=['signal', 'feature'],
    output_names=['hidden', 'gender', 'confidence'],
    dynamic_axes={
        'signal': {1: 'time'},
        'feature': {1: 'time'},
    },
    opset_version=12,
)

onnx_model_7 = audonnx.Model(
    onnx_multi_path,
    labels={
        'gender': ['female', 'male']
    },
    transform={
        'feature': smile,
    },
)
onnx_model_7
Input:
  signal:
    shape: [1, -1]
    dtype: tensor(float)
    transform: None
  feature:
    shape: [18, -1]
    dtype: tensor(float)
    transform: opensmile.core.smile.Smile
Output:
  hidden:
    shape: [8]
    dtype: tensor(float)
    labels: [hidden-0, hidden-1, hidden-2, (...), hidden-5, hidden-6, hidden-7]
  gender:
    shape: [2]
    dtype: tensor(float)
    labels: [female, male]
  confidence:
    shape: [1]
    dtype: tensor(float)
    labels: [confidence]

By default, the model returns a dictionary with the output of every node.

onnx_model_7(signal, sampling_rate)
{'hidden': array([ 7.6027595e-02, -8.8102180e-01, -4.5986521e-01, -2.4757692e-01,
        -3.5762204e+02, -6.2052740e+02, -6.6322235e+02, -2.5100070e+02],
       dtype=float32),
 'gender': array([-85.367584, 330.33902 ], dtype=float32),
 'confidence': array([-17.792288], dtype=float32)}

To request a specific node use the outputs argument.

onnx_model_7(
    signal,
    sampling_rate,
    outputs='gender',
)
array([-85.367584, 330.33902 ], dtype=float32)

Or provide a list of names to request several outputs.

onnx_model_7(
    signal,
    sampling_rate,
    outputs=['gender', 'confidence'],
)
{'gender': array([-85.367584, 330.33902 ], dtype=float32),
 'confidence': array([-17.792288], dtype=float32)}

To concatenate the outputs into a single array, do:

onnx_model_7(
    signal,
    sampling_rate,
    outputs=['gender', 'confidence'],
    concat=True,
)
array([-85.367584, 330.33902 , -17.792288], dtype=float32)

Create an interface and process a file.

outputs = ['gender', 'confidence']
interface = audinterface.Feature(
    feature_names=onnx_model_7.labels(outputs),
    process_func=onnx_model_7,
    process_func_args={
        'outputs': outputs,
        'concat': True,
    },
)
interface.process_file(file)
                                                            female       male  confidence
file                    start  end
./docs/_static/test.wav 0 days 0 days 00:00:05.247687500 -85.367584  330.33902  -17.792288

Run on the GPU

To run a model on the GPU, install onnxruntime-gpu. Note that its version has to match the installed CUDA version; we can get the required combination from this table.

Then select the CUDA device when loading the model:

import os
import audonnx

model = audonnx.load(..., device='cuda:2')

With onnxruntime-gpu<1.8 it is not possible to specify a device ID directly. In that case do:

os.environ['CUDA_VISIBLE_DEVICES'] = '2'
model = audonnx.load(..., device='cuda')
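
To check which execution providers the installed onnxruntime actually offers, here is a quick sketch; 'CUDAExecutionProvider' should be listed when onnxruntime-gpu and a matching CUDA installation are available:

import onnxruntime


# List the execution providers available in this installation
print(onnxruntime.get_available_providers())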