This study tests models that automatically classify Korean nouns as native Korean,
Sino-Korean, or loanwords using a machine learning method, naïve Bayes
classification. We collected 500 native Korean words, Sino-Korean words, and
loanwords, romanized them, and decomposed them into bigram and trigram lists,
which were then entered into the naïve Bayes classifier. We tested models with and
without syllable boundaries, and
found that both the bigram and trigram models were over 80% accurate. Contrary to
the expectation that the performance of the models would improve as more
information about Korean phonotactics was included in the training and validation
data, the difference in performance between the bigram and trigram models was not
significant. The model that included syllable boundaries in the phoneme sequence
information had slightly higher accuracy than the model without syllable boundary
information. Across all five models, the bigram model with syllable boundaries
achieved the highest accuracy, 83.55%. The current model considers only phoneme
sequence information and syllable boundaries; we expect that its accuracy can be
improved by excluding from training those bigrams and trigrams that occur in
similar proportions in all categories, and by increasing the size of the data.
Keywords
phonotactics, native Korean, Sino-Korean, loanword, machine learning, naïve Bayes classification, bigram model, trigram model
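
The pipeline summarized in the abstract (romanized words decomposed into n-grams, with or without syllable boundaries, then passed to a naïve Bayes classifier) can be illustrated with a minimal sketch. This is not the study's implementation. It rests on several assumptions: each romanized character is treated as one symbol, "." is a hypothetical syllable-boundary marker, the classifier is a multinomial naïve Bayes with add-one smoothing, and the six toy words in `TOY_DATA` are illustrative only and are not the study's data. The names `ngrams`, `NaiveBayes`, and `TOY_DATA` are introduced here for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(syllables, n, boundaries=True):
    """Split a word, given as a list of romanized syllables, into n-grams.

    Assumption: "." marks a syllable boundary when boundaries=True.
    """
    seq = (".".join(syllables)) if boundaries else "".join(syllables)
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over n-gram counts (add-alpha smoothing)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.words = Counter()              # label -> number of training words
        self.vocab = set()

    def fit(self, samples):
        for feats, label in samples:
            self.words[label] += 1
            self.counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        n_words = sum(self.words.values())
        best_label, best_score = None, -math.inf
        for label in self.words:
            total = sum(self.counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.words[label] / n_words)
            for f in feats:
                score += math.log((self.counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Illustrative toy data only (syllabified romanizations), not the study's word lists.
TOY_DATA = [
    (["ha", "neul"], "native"),
    (["sa", "ram"], "native"),
    (["hak", "gyo"], "sino"),
    (["si", "gan"], "sino"),
    (["keo", "pi"], "loan"),
    (["beo", "seu"], "loan"),
]

# Bigram model with syllable boundaries, trained on the toy data.
model = NaiveBayes().fit(
    (ngrams(syls, 2, boundaries=True), label) for syls, label in TOY_DATA
)
prediction = model.predict(ngrams(["hak", "gyo"], 2, boundaries=True))
```

Switching `n` to 3 or setting `boundaries=False` gives the other model variants in the same way. With a toy set this small the prediction only shows that the pieces fit together; it says nothing about the accuracies reported above.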