Combine tables

It can happen that labels in your database are stored in tables of different type as some labels belong to the whole file, others don’t. The following examples highlights this with the labels for age and likability:

import audformat.testing


db = audformat.testing.create_db(minimal=True)
db.schemes["age"] = audformat.Scheme(
    audformat.define.DataType.INTEGER,
    minimum=20,
    maximum=50,
)
db.schemes["likability"] = audformat.Scheme(
    audformat.define.DataType.FLOAT,
)
audformat.testing.add_table(
    db,
    table_id="age",
    index_type=audformat.define.IndexType.FILEWISE,
    columns="age",
    num_files=3,
)
audformat.testing.add_table(
    db,
    table_id="likability",
    index_type=audformat.define.IndexType.SEGMENTED,
    columns="likability",
    num_files=4,
)

Which results in the following two pandas.DataFrame:

display(
    db["age"].get(),
    db["likability"].get(),
)
age
file
audio/001.wav 42
audio/002.wav 42
audio/003.wav 38
likability
file start end
audio/001.wav 0 days 00:00:00.076266121 0 days 00:00:00.619946911 0.435226
0 days 00:00:01.148784506 0 days 00:00:01.521320981 0.841527
0 days 00:00:01.752849085 0 days 00:00:01.821368375 0.646402
0 days 00:00:01.992789395 0 days 00:00:02.011086746 0.329573
0 days 00:00:02.462067804 0 days 00:00:02.520139007 0.283158
audio/002.wav 0 days 00:00:00.414579600 0 days 00:00:00.942075097 0.481720
0 days 00:00:01.852233207 0 days 00:00:02.707899678 0.260380
0 days 00:00:03.355573727 0 days 00:00:03.438177386 0.403139
0 days 00:00:03.577501107 0 days 00:00:03.703908381 0.822152
0 days 00:00:03.916778363 0 days 00:00:04.771936687 0.510847
audio/003.wav 0 days 00:00:01.388510311 0 days 00:00:01.639387281 0.666027
0 days 00:00:02.244054817 0 days 00:00:03.013083281 0.048390
0 days 00:00:03.311775472 0 days 00:00:03.988677214 0.115562
0 days 00:00:04.399860268 0 days 00:00:04.471247023 0.313857
0 days 00:00:04.474487824 0 days 00:00:04.695643611 0.631005
audio/004.wav 0 days 00:00:00.588490047 0 days 00:00:01.148163469 0.529147
0 days 00:00:01.406743481 0 days 00:00:02.263371657 0.038501
0 days 00:00:03.505024941 0 days 00:00:03.561733415 0.953864
0 days 00:00:03.637172321 0 days 00:00:03.896031945 0.328594
0 days 00:00:04.152167598 0 days 00:00:04.826367184 0.867975

You can simply combine both tables with:

combined_table = db["likability"] + db["age"]

Which results in the following pandas.DataFrame:

combined_table.get()
likability age
file start end
audio/001.wav 0 days 00:00:00.076266121 0 days 00:00:00.619946911 0.435226 <NA>
0 days 00:00:01.148784506 0 days 00:00:01.521320981 0.841527 <NA>
0 days 00:00:01.752849085 0 days 00:00:01.821368375 0.646402 <NA>
0 days 00:00:01.992789395 0 days 00:00:02.011086746 0.329573 <NA>
0 days 00:00:02.462067804 0 days 00:00:02.520139007 0.283158 <NA>
audio/002.wav 0 days 00:00:00.414579600 0 days 00:00:00.942075097 0.481720 <NA>
0 days 00:00:01.852233207 0 days 00:00:02.707899678 0.260380 <NA>
0 days 00:00:03.355573727 0 days 00:00:03.438177386 0.403139 <NA>
0 days 00:00:03.577501107 0 days 00:00:03.703908381 0.822152 <NA>
0 days 00:00:03.916778363 0 days 00:00:04.771936687 0.510847 <NA>
audio/003.wav 0 days 00:00:01.388510311 0 days 00:00:01.639387281 0.666027 <NA>
0 days 00:00:02.244054817 0 days 00:00:03.013083281 0.048390 <NA>
0 days 00:00:03.311775472 0 days 00:00:03.988677214 0.115562 <NA>
0 days 00:00:04.399860268 0 days 00:00:04.471247023 0.313857 <NA>
0 days 00:00:04.474487824 0 days 00:00:04.695643611 0.631005 <NA>
audio/004.wav 0 days 00:00:00.588490047 0 days 00:00:01.148163469 0.529147 <NA>
0 days 00:00:01.406743481 0 days 00:00:02.263371657 0.038501 <NA>
0 days 00:00:03.505024941 0 days 00:00:03.561733415 0.953864 <NA>
0 days 00:00:03.637172321 0 days 00:00:03.896031945 0.328594 <NA>
0 days 00:00:04.152167598 0 days 00:00:04.826367184 0.867975 <NA>
audio/001.wav 0 days 00:00:00 NaT NaN 42
audio/002.wav 0 days 00:00:00 NaT NaN 42
audio/003.wav 0 days 00:00:00 NaT NaN 38

Or, if you just want to have the likability information for all segments, for which age information is available:

df_likability = db["likability"].get(
    db["age"].files,
)

Which results in the following pandas.DataFrame:

df_likability
likability
file start end
audio/001.wav 0 days 00:00:00.076266121 0 days 00:00:00.619946911 0.435226
0 days 00:00:01.148784506 0 days 00:00:01.521320981 0.841527
0 days 00:00:01.752849085 0 days 00:00:01.821368375 0.646402
0 days 00:00:01.992789395 0 days 00:00:02.011086746 0.329573
0 days 00:00:02.462067804 0 days 00:00:02.520139007 0.283158
audio/002.wav 0 days 00:00:00.414579600 0 days 00:00:00.942075097 0.481720
0 days 00:00:01.852233207 0 days 00:00:02.707899678 0.260380
0 days 00:00:03.355573727 0 days 00:00:03.438177386 0.403139
0 days 00:00:03.577501107 0 days 00:00:03.703908381 0.822152
0 days 00:00:03.916778363 0 days 00:00:04.771936687 0.510847
audio/003.wav 0 days 00:00:01.388510311 0 days 00:00:01.639387281 0.666027
0 days 00:00:02.244054817 0 days 00:00:03.013083281 0.048390
0 days 00:00:03.311775472 0 days 00:00:03.988677214 0.115562
0 days 00:00:04.399860268 0 days 00:00:04.471247023 0.313857
0 days 00:00:04.474487824 0 days 00:00:04.695643611 0.631005

Or, if you want to have the age information for segments in the likeability table:

df_age = db["age"].get(df_likability.index)

Which results in the following pandas.DataFrame:

df_age
age
file start end
audio/001.wav 0 days 00:00:00.076266121 0 days 00:00:00.619946911 42
0 days 00:00:01.148784506 0 days 00:00:01.521320981 42
0 days 00:00:01.752849085 0 days 00:00:01.821368375 42
0 days 00:00:01.992789395 0 days 00:00:02.011086746 42
0 days 00:00:02.462067804 0 days 00:00:02.520139007 42
audio/002.wav 0 days 00:00:00.414579600 0 days 00:00:00.942075097 42
0 days 00:00:01.852233207 0 days 00:00:02.707899678 42
0 days 00:00:03.355573727 0 days 00:00:03.438177386 42
0 days 00:00:03.577501107 0 days 00:00:03.703908381 42
0 days 00:00:03.916778363 0 days 00:00:04.771936687 42
audio/003.wav 0 days 00:00:01.388510311 0 days 00:00:01.639387281 38
0 days 00:00:02.244054817 0 days 00:00:03.013083281 38
0 days 00:00:03.311775472 0 days 00:00:03.988677214 38
0 days 00:00:04.399860268 0 days 00:00:04.471247023 38
0 days 00:00:04.474487824 0 days 00:00:04.695643611 38

So far we have combined tables using the + operator. The result is a table that is no longer attached to a database. That means that meta information about the media or referenced schemes is discarded. If you want to keep this information, you can use audformat.Table.update(), which also works across databases, as we will demonstrate with the following example.

First we create a second database and add a gender scheme:

db2 = audformat.testing.create_db(minimal=True)
db2.schemes["gender"] = audformat.Scheme(
    labels=["female", "male"],
)
db2.schemes
gender:
  dtype: str
  labels: [female, male]

Next, we add a table and fill in some gender information:

audformat.testing.add_table(
    db2,
    table_id="gender_and_age",
    index_type=audformat.define.IndexType.FILEWISE,
    columns="gender",
    num_files=[2, 3, 4],
).get()
gender
file
audio/002.wav female
audio/003.wav female
audio/004.wav male

Now, we update the table with age values from the other database.

db2["gender_and_age"].update(db["age"]).get()
gender age
file
audio/002.wav female 42
audio/003.wav female 38
audio/004.wav male <NA>
audio/001.wav NaN 42

And also copies the according scheme to the database:

db2.schemes
age:
  {dtype: int, minimum: 20, maximum: 50}
gender:
  dtype: str
  labels: [female, male]