audformat.utils

concat

audformat.utils.concat(objs, *, overwrite=False)[source]

Concatenate objects.

Objects must conform to table specifications.

The new object contains the index entries and columns of all input objects. Missing values will be set to NaN. If at least one object is segmented, the output has a segmented index.

Columns with the same identifier are combined into a single column. This requires that the columns have the same dtype. If overwrite is set to False, values at overlapping index positions must match, or one of them must be NaN. If overwrite is set to True, the value of the last object in the list is kept.

Parameters
Return type

Union[Series, DataFrame]

Returns

concatenated objects

Raises

Example

>>> obj1 = pd.Series(
...     [0., 1.],
...     index=filewise_index(['f1', 'f2']),
...     name='float',
... )
>>> obj2 = pd.DataFrame(
...     {
...         'float': [1., 2.],
...         'string': ['a', 'b'],
...     },
...     index=segmented_index(['f2', 'f3']),
... )
>>> concat([obj1, obj2])
                 float string
file start  end
f1   0 days NaT    0.0    NaN
f2   0 days NaT    1.0      a
f3   0 days NaT    2.0      b
>>> obj1 = pd.Series(
...     [0., 1.],
...     index=filewise_index(['f1', 'f2']),
...     name='float',
... )
>>> obj2 = pd.DataFrame(
...     {
...         'float': [np.nan, 2.],
...         'string': ['a', 'b'],
...     },
...     index=filewise_index(['f2', 'f3']),
... )
>>> concat([obj1, obj2])
      float string
file
f1      0.0    NaN
f2      1.0      a
f3      2.0      b
>>> obj1 = pd.Series(
...     [0., 0.],
...     index=filewise_index(['f1', 'f2']),
...     name='float',
... )
>>> obj2 = pd.DataFrame(
...     {
...         'float': [1., 2.],
...         'string': ['a', 'b'],
...     },
...     index=segmented_index(['f2', 'f3']),
... )
>>> concat([obj1, obj2], overwrite=True)
                 float string
file start  end
f1   0 days NaT    0.0    NaN
f2   0 days NaT    1.0      a
f3   0 days NaT    2.0      b
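The rule for combining columns with the same identifier can be illustrated without pandas. The following is a minimal sketch, not audformat's implementation: combine_column and is_missing are hypothetical helpers, columns are modelled as plain dictionaries, and both the ValueError and the handling of a missing later value under overwrite=True are assumptions.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def combine_column(columns, overwrite=False):
    """Merge several {index_key: value} mappings into one column."""
    combined = {}
    for column in columns:
        for key, value in column.items():
            if key not in combined or is_missing(combined[key]):
                # New position, or the earlier value was missing.
                combined[key] = value
            elif is_missing(value):
                # Assumption: a missing later value never replaces a real one.
                continue
            elif overwrite:
                # Both values are present: the last object wins.
                combined[key] = value
            elif combined[key] != value:
                # Assumption: the error type is not stated in this section.
                raise ValueError(f"values for {key!r} do not match")
    return combined


first = {"f1": 0.0, "f2": 0.0}
second = {"f2": 1.0, "f3": 2.0}
print(combine_column([first, second], overwrite=True))
# {'f1': 0.0, 'f2': 1.0, 'f3': 2.0}
```

Calling combine_column([first, second]) without overwrite=True raises in this sketch, because the two values for 'f2' differ and neither is NaN.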

duration

audformat.utils.duration(obj, *, root=None, num_workers=1, verbose=False)[source]

Total duration of all entries present in the object.

The object might contain a segmented or a filewise index. For a segmented index, the duration is calculated from its start and end values. If an end value is NaT or the object contains a filewise index, the duration is calculated from the media file by calling audiofile.duration().

Parameters
  • obj (Union[Index, Series, DataFrame]) – object conforming to table specifications

  • root (Optional[str]) – root directory under which the files referenced in the index are stored. Only relevant when the duration of the files needs to be detected from the file

  • num_workers (int) – number of parallel jobs. Only relevant when the duration of the files needs to be detected from the file. If None, it will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar. Only relevant when the duration of the files needs to be detected from the file

Return type

Timedelta

Returns

duration

Example

>>> idx = segmented_index(
...     files=['a', 'b', 'c'],
...     starts=[0, 1, 3],
...     ends=[1, 2, 4],
... )
>>> duration(idx)
Timedelta('0 days 00:00:03')
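The fallback for open-ended segments can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: total_duration is a hypothetical helper, None stands in for NaT, and file_duration is a caller-supplied lookup that takes the place of audiofile.duration(). It also assumes that an open-ended segment runs from its start to the end of the file.

```python
from datetime import timedelta


def total_duration(segments, file_duration):
    """Sum segment durations; `end=None` stands in for NaT."""
    total = timedelta(0)
    for file, start, end in segments:
        if end is None:
            # Open-ended segment: fall back to the length of the file.
            end = file_duration(file)
        total += end - start
    return total


# Hypothetical file lengths, used instead of reading real media files.
lengths = {"a": timedelta(seconds=1), "b": timedelta(seconds=5)}
segments = [
    ("a", timedelta(seconds=0), timedelta(seconds=1)),
    ("b", timedelta(seconds=1), None),
]
print(total_duration(segments, lengths.__getitem__))
# 0:00:05
```

A filewise entry corresponds to start=0 and end=None in this model, so it contributes the full length of the file.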

intersect

audformat.utils.intersect(objs)[source]

Intersect index objects.

Index objects must conform to table specifications.

If at least one object is segmented, the output is a segmented index.

Parameters

objs (Sequence[Index]) – index objects conforming to table specifications

Return type

Index

Returns

intersection of index objects

Raises

ValueError – if one or more objects do not conform to table specifications

Example

>>> i1 = filewise_index(['f1', 'f2', 'f3'])
>>> i2 = filewise_index(['f2', 'f3', 'f4'])
>>> intersect([i1, i2])
Index(['f2', 'f3'], dtype='object', name='file')
>>> i3 = segmented_index(
...     ['f1', 'f2', 'f3', 'f4'],
...     [0, 0, 0, 0],
...     [1, 1, 1, 1],
... )
>>> i4 = segmented_index(
...     ['f1', 'f2', 'f3'],
...     [0, 0, 1],
...     [1, 1, 2],
... )
>>> intersect([i3, i4])
MultiIndex([('f1', '0 days', '0 days 00:00:01'),
            ('f2', '0 days', '0 days 00:00:01')],
           names=['file', 'start', 'end'])
>>> intersect([i1, i2, i3, i4])
MultiIndex([('f2', '0 days', '0 days 00:00:01')],
           names=['file', 'start', 'end'])
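In the last example, the filewise indices appear to act as a filter on file names rather than contributing segments of their own. The sketch below reproduces the three results above with plain lists and tuples. It is inferred from the example output only: intersect_sketch is a hypothetical helper, and the real function may behave differently in cases not shown here.

```python
def intersect_sketch(filewise, segmented):
    """Intersect lists of file names and lists of (file, start, end) tuples."""
    if not segmented:
        # Only filewise inputs: plain intersection of file names.
        return sorted(set.intersection(*map(set, filewise)))
    segments = set.intersection(*map(set, segmented))
    for files in filewise:
        # Assumption: a filewise input only restricts which files may appear.
        segments = {s for s in segments if s[0] in files}
    return sorted(segments)


i1 = ["f1", "f2", "f3"]
i2 = ["f2", "f3", "f4"]
i3 = [("f1", 0, 1), ("f2", 0, 1), ("f3", 0, 1), ("f4", 0, 1)]
i4 = [("f1", 0, 1), ("f2", 0, 1), ("f3", 1, 2)]
print(intersect_sketch([i1, i2], []))        # ['f2', 'f3']
print(intersect_sketch([], [i3, i4]))        # [('f1', 0, 1), ('f2', 0, 1)]
print(intersect_sketch([i1, i2], [i3, i4]))  # [('f2', 0, 1)]
```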

join_labels

audformat.utils.join_labels(labels)[source]

Combine scheme labels.

Parameters

labels (Sequence[Union[List, Dict]]) – sequence of labels to join. For dictionary labels, labels further to the right can overwrite previous labels

Returns

joined labels

Raises

Example

>>> join_labels([{'a': 0, 'b': 1}, {'b': 2, 'c': 2}])
{'a': 0, 'b': 2, 'c': 2}
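The overwrite behaviour for dictionary labels matches a left-to-right dictionary update. A minimal sketch follows, with join_labels_sketch as a hypothetical stand-in for the real function. The handling of list labels (first-seen order, duplicates dropped) and the ValueError for mixed inputs are assumptions and are not taken from this section.

```python
def join_labels_sketch(labels):
    """Join a sequence of list labels or a sequence of dict labels."""
    if all(isinstance(item, dict) for item in labels):
        joined = {}
        for item in labels:
            # Later dictionaries overwrite keys set by earlier ones.
            joined.update(item)
        return joined
    if all(isinstance(item, list) for item in labels):
        joined = []
        for item in labels:
            for label in item:
                # Assumption: keep first-seen order and drop duplicates.
                if label not in joined:
                    joined.append(label)
        return joined
    # Assumption: mixing lists and dictionaries is treated as an error.
    raise ValueError("labels must be all lists or all dictionaries")


print(join_labels_sketch([{"a": 0, "b": 1}, {"b": 2, "c": 2}]))
# {'a': 0, 'b': 2, 'c': 2}
print(join_labels_sketch([["a"], ["b", "a"]]))
# ['a', 'b']
```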

join_schemes

audformat.utils.join_schemes(dbs, scheme_id)[source]

Join and update scheme of databases.

This joins the given scheme of several databases using audformat.utils.join_labels() and replaces the scheme in each database with the joined one. The dtype of all audformat.Column objects that reference the scheme in the databases will be updated. Removed labels are set to NaN.

This might be useful if you want to combine databases with audformat.Database.update().

Parameters
  • dbs (Sequence[Database]) – sequence of databases

  • scheme_id (str) – scheme ID of a scheme with labels that should be joined

Example

>>> db1 = Database('db1')
>>> db2 = Database('db2')
>>> db1.schemes['scheme_id'] = Scheme(labels=['a'])
>>> db2.schemes['scheme_id'] = Scheme(labels=['b'])
>>> join_schemes([db1, db2], 'scheme_id')
>>> db1.schemes
scheme_id:
  dtype: str
  labels: [a, b]

map_language

audformat.utils.map_language(language)[source]

Map language to ISO 639-3.

Parameters

language (str) – language string

Return type

str

Returns

mapped string

Raises

ValueError – if language is not supported

Example

>>> map_language('en')
'eng'
>>> map_language('eng')
'eng'
>>> map_language('English')
'eng'

read_csv

audformat.utils.read_csv(*args, **kwargs)[source]

Read object from CSV file.

Automatically detects the index type and returns an object that conforms to table specifications. If conversion is not possible, an error is raised.

See pandas.read_csv() for supported arguments.

Parameters
  • *args – arguments

  • **kwargs – keyword arguments

Return type

Union[Index, Series, DataFrame]

Returns

object conforming to table specifications

Raises

ValueError – if the CSV file does not conform to table specifications

Example

>>> from io import StringIO
>>> string = StringIO('''file,start,end,value
... f1,00:00:00,00:00:01,0.0
... f1,00:00:01,00:00:02,1.0
... f2,00:00:02,00:00:03,2.0''')
>>> read_csv(string)
file  start            end
f1    0 days 00:00:00  0 days 00:00:01    0.0
      0 days 00:00:01  0 days 00:00:02    1.0
f2    0 days 00:00:02  0 days 00:00:03    2.0
Name: value, dtype: float64
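One plausible way to detect the index type is to look at the leading column names in the header row. The sketch below uses only the standard library and is an assumption about the approach, not a description of the actual implementation. detect_index_type is a hypothetical helper that returns the guessed index type together with the remaining value columns.

```python
import csv
from io import StringIO


def detect_index_type(stream):
    """Guess the index type from the header row of a CSV stream."""
    header = next(csv.reader(stream))
    if header[:3] == ["file", "start", "end"]:
        return "segmented", header[3:]
    if header[:1] == ["file"]:
        return "filewise", header[1:]
    # Assumption: any other header cannot be converted.
    raise ValueError("CSV header does not match a filewise or segmented table")


segmented_csv = StringIO("file,start,end,value\nf1,00:00:00,00:00:01,0.0\n")
filewise_csv = StringIO("file,value\nf1,0.0\n")
print(detect_index_type(segmented_csv))  # ('segmented', ['value'])
print(detect_index_type(filewise_csv))   # ('filewise', ['value'])
```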

to_filewise_index

audformat.utils.to_filewise_index(obj, root, output_folder, *, num_workers=1, progress_bar=False)[source]

Convert to filewise index.

If the input is segmented, each segment is saved to a separate file in output_folder. The directory structure of the original data is preserved within output_folder. If the input is filewise, no action is applied.

Parameters
  • obj (Union[Index, Series, DataFrame]) – object conforming to table specifications

  • root (str) – path to root folder of data. Even if the file paths of obj are absolute, this argument is needed in order to reconstruct the directory structure of the original data

  • output_folder (str) – path to folder of the created audio segments. If it is relative (absolute), the file paths of the returned object are also relative (absolute)

  • num_workers (int) – number of threads to spawn

  • progress_bar (bool) – show progress bar

Return type

Union[Index, Series, DataFrame]

Returns

object with filewise index

Raises

ValueError – if output_folder is contained in the path to the files of the original data
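How the directory structure might be preserved can be sketched as a path mapping. This is an illustration only: segment_path is a hypothetical helper, and the `_<number>` suffix for segment files is an assumed naming scheme that this section does not specify.

```python
import posixpath


def segment_path(file, root, output_folder, segment_number):
    """Build a hypothetical output path for one segment of `file`."""
    # The path relative to root keeps the original folder layout.
    relative = posixpath.relpath(file, root) if posixpath.isabs(file) else file
    stem, extension = posixpath.splitext(relative)
    # Assumed naming scheme: append the segment number to the file stem.
    return posixpath.join(output_folder, f"{stem}_{segment_number}{extension}")


print(segment_path("/data/db/audio/speaker1/f1.wav", "/data/db", "segments", 0))
# segments/audio/speaker1/f1_0.wav
print(segment_path("audio/speaker1/f1.wav", "/data/db", "/tmp/out", 1))
# /tmp/out/audio/speaker1/f1_1.wav
```

The second call shows why a relative or absolute output_folder carries over to the returned file paths: the output folder is simply the prefix of every new path.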

to_segmented_index

audformat.utils.to_segmented_index(obj, *, allow_nat=True, root=None, num_workers=1, verbose=False)[source]

Convert to segmented index.

If the input is a filewise table, start and end will be added as new levels to the index. By default, start will be set to 0 and end to NaT.

If allow_nat is set to False, all occurrences of end=NaT are replaced with the duration of the file. This, however, requires that the referenced file exists. If file names in the index are relative, the root argument can be used to provide the location where the files are stored.
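The conversion can be modelled with plain tuples. The following is a minimal sketch, not the library's implementation: to_segmented_sketch is a hypothetical helper, None stands in for NaT, and file_duration is a caller-supplied lookup that replaces reading the referenced files from disk.

```python
from datetime import timedelta


def to_segmented_sketch(files, allow_nat=True, file_duration=None):
    """Turn file names into (file, start, end) tuples; None stands in for NaT."""
    segments = []
    for file in files:
        end = None
        if not allow_nat:
            # Requires that the file exists so its duration can be read.
            end = file_duration(file)
        segments.append((file, timedelta(0), end))
    return segments


# Hypothetical file lengths, used instead of reading real media files.
lengths = {"f1": timedelta(seconds=2), "f2": timedelta(seconds=3)}
print(to_segmented_sketch(["f1", "f2"]))
print(to_segmented_sketch(["f1", "f2"], allow_nat=False, file_duration=lengths.__getitem__))
```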

Parameters
  • obj (Union[Index, Series, DataFrame]) – object conforming to table specifications

  • allow_nat (bool) – if set to False, end=NaT is replaced with file duration

  • root (Optional[str]) – root directory under which the files referenced in the index are stored

  • num_workers (Optional[int]) – number of parallel jobs. If None, it will be set to the number of processors on the machine multiplied by 5

  • verbose (bool) – show progress bar

Return type

Union[Index, Series, DataFrame]

Returns

object with segmented index

Raises

union

audformat.utils.union(objs)[source]

Create union of index objects.

Index objects must conform to table specifications.

If at least one object is segmented, the output is a segmented index.

Parameters

objs (Sequence[Index]) – index objects conforming to table specifications

Return type

Index

Returns

union of index objects

Raises

ValueError – if one or more objects do not conform to table specifications

Example

>>> i1 = filewise_index(['f1', 'f2', 'f3'])
>>> i2 = filewise_index(['f2', 'f3', 'f4'])
>>> union([i1, i2])
Index(['f1', 'f2', 'f3', 'f4'], dtype='object', name='file')
>>> i3 = segmented_index(
...     ['f1', 'f2', 'f3', 'f4'],
...     [0, 0, 0, 0],
...     [1, 1, 1, 1],
... )
>>> i4 = segmented_index(
...     ['f1', 'f2', 'f3'],
...     [0, 0, 1],
...     [1, 1, 2],
... )
>>> union([i3, i4])
MultiIndex([('f1', '0 days 00:00:00', '0 days 00:00:01'),
            ('f2', '0 days 00:00:00', '0 days 00:00:01'),
            ('f3', '0 days 00:00:00', '0 days 00:00:01'),
            ('f3', '0 days 00:00:01', '0 days 00:00:02'),
            ('f4', '0 days 00:00:00', '0 days 00:00:01')],
           names=['file', 'start', 'end'])
>>> union([i1, i2, i3, i4])
MultiIndex([('f1', '0 days 00:00:00',               NaT),
            ('f1', '0 days 00:00:00', '0 days 00:00:01'),
            ('f2', '0 days 00:00:00',               NaT),
            ('f2', '0 days 00:00:00', '0 days 00:00:01'),
            ('f3', '0 days 00:00:00',               NaT),
            ('f3', '0 days 00:00:00', '0 days 00:00:01'),
            ('f3', '0 days 00:00:01', '0 days 00:00:02'),
            ('f4', '0 days 00:00:00',               NaT),
            ('f4', '0 days 00:00:00', '0 days 00:00:01')],
           names=['file', 'start', 'end'])
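The last example suggests that, when filewise and segmented indices are mixed, each filewise entry is first promoted to a segment with start 0 and end NaT and then merged with the others. The sketch below reproduces that result with plain tuples. union_sketch is a hypothetical helper, None stands in for NaT, and the promotion rule and the sort order are inferred from the example output rather than taken from the implementation.

```python
def union_sketch(filewise, segmented):
    """Union of lists of file names and lists of (file, start, end) tuples."""
    if not segmented:
        # Only filewise inputs: plain union of file names.
        return sorted(set().union(*filewise))
    segments = set().union(*segmented)
    for files in filewise:
        # Assumption: a filewise entry becomes (file, 0, NaT).
        segments |= {(file, 0, None) for file in files}
    # Sort open-ended segments (end is None) before timed ones.
    return sorted(segments, key=lambda s: (s[0], s[1], s[2] is not None, s[2] or 0))


i1 = ["f1", "f2", "f3"]
i2 = ["f2", "f3", "f4"]
i3 = [("f1", 0, 1), ("f2", 0, 1), ("f3", 0, 1), ("f4", 0, 1)]
i4 = [("f1", 0, 1), ("f2", 0, 1), ("f3", 1, 2)]
print(union_sketch([i1, i2], []))  # ['f1', 'f2', 'f3', 'f4']
for segment in union_sketch([i1, i2], [i3, i4]):
    print(segment)  # nine entries, starting with ('f1', 0, None)
```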