Optimizing spectral and contextual features for neural network detection of Korean /y/ in spontaneous speech
Soonhyun Hong (Inha University) | pp. 151-178
Abstract
This study examines how acoustic dynamics and contextual information support automatic detection of the Korean glide /y/ in spontaneous speech. Using the Seoul Corpus, we trained artificial neural network classifiers on formant measurements (F1, F2, and F3) sampled at the onset, 20%, and 50% points of the vocalic interval, under both single-point and two-point temporal schemes and with single- and multi-formant feature sets. We also evaluated contextual predictors, including vowel identity, preceding consonant place and manner, word-internal position, F0, token duration, and speaker gender, both individually and in combination. Models were evaluated with stratified five-fold cross-validation, and performance was reported as the F1 score for the positive /yV/ class. The results show that early F2 information, especially at onset and 20%, provides the strongest acoustic evidence for /y/. Contextual predictors, most notably preceding consonant place and manner and vowel identity, yield substantial additional gains, with the best configurations reaching F1 scores of approximately 0.90. Once these contextual cues are included, compact single-point models at 20% perform nearly as well as two-point early models, enabling efficient low-latency feature designs. These findings sharpen empirical characterizations of Korean /y/ realization and provide practical guidance for feature selection in automatic /y/ detection.
Keywords
/y/ detection, Seoul Corpus, formant transitions, contextual cues, neural network
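The evaluation scheme described in the abstract (a neural-network classifier on formant features, scored by stratified five-fold cross-validation with the positive-class F1 metric) can be sketched as follows. This is not the authors' code: it assumes scikit-learn, uses a small MLP as a stand-in for the paper's network, and runs on synthetic data in place of the Seoul Corpus measurements.

```python
# Hedged sketch of the evaluation pipeline: stratified five-fold
# cross-validation of a small neural network on formant-style features,
# scored by the F1 of the positive class. Data here is synthetic; the
# feature layout (6 columns, e.g. F1/F2/F3 at onset and at 20% of the
# vocalic interval) is an illustrative assumption.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in tokens: 400 samples, 6 acoustic features each.
n = 400
X = rng.normal(size=(n, 6))
# Label depends mostly on the "early F2" columns, loosely mirroring
# the paper's finding that early F2 carries the strongest evidence.
y = (X[:, 1] + 0.5 * X[:, 4] + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_f1 = make_scorer(f1_score, pos_label=1)  # F1 of the positive class only

scores = cross_val_score(clf, X, y, cv=cv, scoring=pos_f1)
print(round(scores.mean(), 3))
```

The scorer restricts F1 to the positive class (here, the /yV/ tokens), which matters when the class distribution is skewed: overall accuracy would reward a classifier that rarely predicts the minority glide class.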