SOLVING UNKNOWN WORD PROBLEMS IN NATURAL LANGUAGE PROCESSING

Chalermpol Tapsai; Wilailuk Rakbumrung

Keywords: unknown word, natural language processing, complete Soundex, Thai language.

Abstract

Unknown words are a major problem in Natural Language Processing (NLP) because they prevent a system from correctly analyzing the meaning of a sentence. This research aims to provide a model that allows an NLP system to detect unknown words and replace them with the correct words. To this end, the researchers first analyzed the characteristics of unknown words that are not recognized by the NLP model, collecting 12,800 text files of messages from various online and offline sources that cover all levels of language: formal, semi-formal, and informal. The unknown words found in these files were classified into 7 types: excess characters, missing characters, repeated characters, typographical errors, misplaced characters, slang words, and mixed-type errors. To overcome the unknown-word problem, Complete Soundex is used to handle misspellings, spelling variants, slang, and modified forms of traditional words by analyzing each unknown word and proposing the correct word with the highest similarity. The model was evaluated on a test dataset of 3,750 natural language sentences collected from a sample group of 125 people. The sentences were fed into the model, which detected the unknown words and proposed correct words to replace them; precision, recall, and F1-score were then calculated from the outputs. The results showed that the model performed very well: precision, recall, and F1-score were all greater than 90%, and the unknown words that the model could not correct accounted for 6.88% of all unknown words. There are two reasons the model could not resolve these unknown words: 1) the unknown word contains too many misspelled positions, and 2) the word segmentation module cannot identify the boundaries of the unknown word correctly.
To address these limitations, further research should improve the analysis of unknown-word boundaries and apply word co-occurrence analysis, both of which would help improve the model's efficiency.
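
The idea of replacing an unknown word with the most similar known word can be illustrated with a minimal sketch. This is not the Complete Soundex model itself, whose Thai-specific coding rules are not given in this abstract. As a stand-in it assumes classic American Soundex on English words, a small hypothetical lexicon (LEXICON), and the ratio from Python's difflib.SequenceMatcher as the similarity measure. The names soundex and suggest are illustrative only.

```python
from difflib import SequenceMatcher

# Classic American Soundex digit groups (an English stand-in, NOT the
# paper's Complete Soundex for Thai, which is not specified here).
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex key of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return (key + "000")[:4]


def suggest(unknown, lexicon):
    """Pick the known word with the same phonetic key and highest similarity.

    Returns None when no lexicon entry shares the key (assumed behavior).
    """
    key = soundex(unknown)
    candidates = [w for w in lexicon if soundex(w) == key]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda w: SequenceMatcher(None, unknown.lower(), w.lower()).ratio(),
    )


# Hypothetical lexicon and inputs, chosen only to illustrate the error types.
LEXICON = ["tomorrow", "message", "please", "thanks"]

if __name__ == "__main__":
    for token in ["tomorow", "mesage", "pleeease", "xyzzy"]:
        print(token, "->", suggest(token, LEXICON))
```

In this sketch a missing character ("tomorow") and repeated characters ("pleeease") share a phonetic key with "tomorrow" and "please" and are therefore corrected, while a token with no phonetically matching entry is left unresolved. This loosely mirrors the case reported above, where words with too many misspelled positions cannot be corrected.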