cstr-vctk

Created by Junichi Yamagishi, Christophe Veaux, Kirsten MacDonald

version

1.0.0

license

CC-BY-4.0

usage

commercial

languages

eng

format

flac

channel

1

sampling rate

48000

bit depth

16

duration

3 days 10:39:12.224729162

files

88328, duration distribution: 1.2 s to 16.6 s
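As a quick sanity check on the header fields, the total duration and file count can be combined into a mean file duration. The sketch below only uses the numbers listed above and assumes that 1.2 s and 16.6 s are the shortest and longest file durations; the variable names are illustrative.

```python
from datetime import timedelta

# Values copied from the "duration" and "files" fields above.
total = timedelta(days=3, hours=10, minutes=39, seconds=12.224729162)
n_files = 88328

total_seconds = total.total_seconds()
total_hours = total_seconds / 3600
mean_seconds = total_seconds / n_files

print(f"total duration: {total_hours:.1f} h")
print(f"mean file duration: {mean_seconds:.2f} s")

# Assumption: 1.2 s and 16.6 s bound the per-file durations,
# so the mean has to fall between them.
assert 1.2 <= mean_seconds <= 16.6
```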

repository

audb-public

Description

The CSTR’s VCTK Corpus (Centre for Speech Technology Voice Cloning Toolkit) includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.

The newspaper texts were taken from Herald Glasgow, with permission from Herald & Times Group. Each speaker has a different set of the newspaper texts, selected based on a greedy algorithm that increases the contextual and phonetic coverage. The details of the text selection algorithms are described in the following paper: C. Veaux, J. Yamagishi and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database”, https://doi.org/10.1109/ICSDA.2013.6709856

The rainbow passage and elicitation paragraph are the same for all speakers. The rainbow passage can be found at the International Dialects of English Archive (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the speech accent archive (http://accent.gmu.edu). The details of the speech accent archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf

All speech data was recorded using an identical recording setup: an omni-directional microphone (DPA 4035) and a small diaphragm condenser microphone with very wide bandwidth (Sennheiser MKH 800), at a 96 kHz sampling frequency and 24 bits, in a hemi-anechoic chamber of the University of Edinburgh. (However, two speakers, p280 and p315, had technical issues with the audio recordings made using the MKH 800.) All recordings were converted to 16 bits, downsampled to 48 kHz, and manually end-pointed.

This corpus was originally aimed at HMM-based text-to-speech synthesis systems, especially speaker-adaptive HMM-based speech synthesis that uses average voice models trained on multiple speakers and speaker adaptation technologies. It is also suitable for DNN-based multi-speaker text-to-speech synthesis systems and neural waveform modeling. The dataset was referenced in the Google DeepMind work on WaveNet: https://arxiv.org/pdf/1609.03499.pdf

Please note that while text files containing transcripts of the speech are provided for 109 of the 110 recordings in the ‘/txt’ folder, the ‘p315’ text was lost due to a hard disk error.
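The figure of about 400 sentences per speaker can be cross-checked against the file count in the header. The sketch below assumes that almost every utterance is stored twice, once per microphone (file names ending in mic1 and mic2); this is an approximation, since the description mentions MKH 800 problems for two speakers.

```python
n_files = 88328         # from the "files" field
n_speakers = 110        # from the description
mics_per_utterance = 2  # assumption: one mic1 and one mic2 file per utterance

# Rough estimate of distinct utterances, then of utterances per speaker.
utterances = n_files / mics_per_utterance
per_speaker = utterances / n_speakers

print(f"roughly {per_speaker:.0f} utterances per speaker")
```

Under that assumption the estimate lands close to the stated 400 sentences per speaker.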

Example

flac/p245_099_mic1.flac

Tables

| ID | Type | Columns |
| --- | --- | --- |
| files | filewise | recording_id, speaker, age, gender, language, accent, region, transcription, microphone |

| file | recording_id | speaker | age | gender | language | accent | region | transcription | microphone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| flac/p225_001_mic1.flac | p225_001 | p225 | 23 | female | eng | English | Southern England | Please call Stella. | mic1 |
| flac/p225_001_mic2.flac | p225_001 | p225 | 23 | female | eng | English | Southern England | Please call Stella. | mic2 |
| flac/p225_002_mic1.flac | p225_002 | p225 | 23 | female | eng | English | Southern England | Ask her to bring these things with her from the store. | mic1 |
| flac/p225_002_mic2.flac | p225_002 | p225 | 23 | female | eng | English | Southern England | Ask her to bring these things with her from the store. | mic2 |
| flac/p225_003_mic1.flac | p225_003 | p225 | 23 | female | eng | English | Southern England | Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother... | mic1 |

88328 rows x 9 columns
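In the preview rows, the recording_id, speaker and microphone values can all be read off the file name. The sketch below parses them from a path; the pattern is inferred from the rows shown here and is an assumption, not a documented naming rule, and `parse_file` is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from names such as flac/p225_001_mic1.flac:
# <speaker>_<utterance index>_<microphone>. This is an assumption.
FILE_PATTERN = re.compile(
    r"^(?P<speaker>[^_]+)_(?P<index>\d+)_(?P<microphone>mic[12])$"
)


def parse_file(path: str) -> dict:
    """Split a corpus file path into recording_id, speaker and microphone."""
    stem = PurePosixPath(path).stem
    match = FILE_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    speaker = match["speaker"]
    return {
        "recording_id": f"{speaker}_{match['index']}",
        "speaker": speaker,
        "microphone": match["microphone"],
    }


print(parse_file("flac/p225_001_mic1.flac"))
```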

| ID | Type | Columns |
| --- | --- | --- |
| speaker | misc | age, gender, language, accent, region |

| speaker | age | gender | language | accent | region |
| --- | --- | --- | --- | --- | --- |
| p225 | 23 | female | eng | English | Southern England |
| p226 | 22 | male | eng | English | Surrey |
| p227 | 38 | male | eng | English | Cumbria |
| p228 | 22 | female | eng | English | Southern England |
| p229 | 23 | female | eng | English | Southern England |

110 rows x 5 columns
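The speaker columns of the files table repeat, per file, what the speaker table stores once per speaker. As a minimal sketch of that relationship, the lookup below attaches speaker metadata to a file row by its speaker ID. It uses plain dictionaries with two rows copied from the preview; it is not how the dataset library itself resolves the columns, and `describe` is a hypothetical helper.

```python
# Two rows copied from the speaker preview above, keyed by speaker ID.
SPEAKERS = {
    "p225": {"age": 23, "gender": "female", "language": "eng",
             "accent": "English", "region": "Southern England"},
    "p226": {"age": 22, "gender": "male", "language": "eng",
             "accent": "English", "region": "Surrey"},
}


def describe(file_row: dict) -> dict:
    """Return the file row extended with the metadata of its speaker."""
    return {**file_row, **SPEAKERS[file_row["speaker"]]}


row = describe({
    "file": "flac/p225_001_mic1.flac",
    "speaker": "p225",
    "microphone": "mic1",
})
print(row["age"], row["gender"], row["region"])
```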

Schemes

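| ID | Dtype | Labels |
| --- | --- | --- |
| accent | str | |
| age | int | |
| gender | str | female, male |
| language | str | eng |
| microphone | str | mic1, mic2 |
| recording_id | str | |
| region | str | |
| speaker | str | |
| transcription | str | |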
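The schemes can be read as simple constraints: each column has a dtype and, for some columns, a closed set of labels. The sketch below checks a row against those constraints; the `SCHEMES` dictionary mirrors the schemes listed above, while `validate` is a hypothetical helper written for illustration.

```python
# Scheme ID -> (Python type, allowed labels or None when unrestricted).
SCHEMES = {
    "accent": (str, None),
    "age": (int, None),
    "gender": (str, {"female", "male"}),
    "language": (str, {"eng"}),
    "microphone": (str, {"mic1", "mic2"}),
    "recording_id": (str, None),
    "region": (str, None),
    "speaker": (str, None),
    "transcription": (str, None),
}


def validate(row: dict) -> list[str]:
    """Return a list of problems found in a row; empty if the row is valid."""
    errors = []
    for column, value in row.items():
        if column not in SCHEMES:
            errors.append(f"{column}: unknown scheme")
            continue
        dtype, labels = SCHEMES[column]
        if not isinstance(value, dtype):
            errors.append(f"{column}: expected {dtype.__name__}")
        elif labels is not None and value not in labels:
            errors.append(f"{column}: {value!r} is not an allowed label")
    return errors


good = {"speaker": "p225", "age": 23, "gender": "female", "microphone": "mic1"}
bad = {"age": "23", "microphone": "mic3"}
print(validate(good))
print(validate(bad))
```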