How to Use Machine Learning to Identify Compound Words

Some words are written in more than one form depending on the writer and the context. For example, a compound can appear as a single closed word (bookshelf), with a hyphen (ice-cream), or as two separate words (ice cream). Because compound words can appear in so many ways, it’s not always easy to recognize them by simply looking at their text format. Fortunately, you can use machine learning to identify compound words automatically so that your software handles them more accurately and consistently.

In natural language processing, compound words are a special type of word. In general, they are two or more base words that are combined into one single word. This can happen in several ways: by joining whole base words (example: house + boat = houseboat) or through portmanteau blending, where parts of each base are fused together (example: breakfast + lunch = brunch). Another important detail about compound words is that they usually have different morphological characteristics depending on whether they were created by joining whole words or through portmanteau blending. Therefore, it is important to know how a compound word was created in order to identify it correctly and to use it correctly in documents. I’ll go into more detail later on.

Model

To train a model for identifying compound words, start by collecting a large sample of candidate compound words (with and without hyphens) from an online dictionary. Then discard every entry that does not match the dictionary definition of a compound word (basically what you’d find in Merriam-Webster or your favorite online source). Next, split the remaining samples based on whether they contain hyphens. This gives you two training datasets: one for words with hyphens and one for words without them. Load these datasets into Python (I recommend using Anaconda 3). Once you have cleaned, normalized, organized, and analyzed your data, train two models: one that detects compounds with hyphens and one that detects compounds without them.
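The cleaning and splitting step can be sketched in a few lines of Python. This is only a minimal illustration, not the exact pipeline: the function names `normalize` and `split_by_hyphen` and the sample word list are made up for the example, and it assumes the dictionary entries have already been filtered down to genuine compound words.

```python
def normalize(word: str) -> str:
    """Lowercase, trim whitespace, and map common dash variants to '-'."""
    word = word.strip().lower()
    # Hyphen, non-breaking hyphen, en dash, em dash (assumed variants).
    for dash in ("\u2010", "\u2011", "\u2013", "\u2014"):
        word = word.replace(dash, "-")
    return word


def split_by_hyphen(words):
    """Return (hyphenated, unhyphenated) lists without duplicates."""
    hyphenated, unhyphenated = [], []
    seen = set()
    for raw in words:
        word = normalize(raw)
        if not word or word in seen:
            continue  # skip empty entries and repeats
        seen.add(word)
        if "-" in word:
            hyphenated.append(word)
        else:
            unhyphenated.append(word)
    return hyphenated, unhyphenated


# Hypothetical sample; a real run would read the dictionary export instead.
sample = ["Bookshelf", "well-known", "ice cream",
          "mother\u2011in\u2011law", "bookshelf ", ""]
hyphenated, unhyphenated = split_by_hyphen(sample)
print(hyphenated)
print(unhyphenated)
```

Each of the two lists can then be used as the training set for its own model.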

Results

To determine whether two words form a compound word, we used a technique called word2vec, which represents each word as an embedding vector learned from the contexts in which the word appears. First, we queried a trained word2vec model for the embedding vector of each target word. Each unique token was represented by one vector with 128 dimensions, that is, a point in a 128-dimensional space. Next, we measured how similar each pair of target words was by calculating the cosine similarity of their vectors and comparing it to a threshold of 0.3 (pairs with similarities higher than 0.3 were likely to be compound words).
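The similarity check itself is easy to sketch. The snippet below is an illustration under stated assumptions: the three-dimensional vectors are invented for the example (real ones would come from the trained word2vec model and have 128 dimensions), and the helper names `cosine_similarity` and `likely_compound` are hypothetical. Only the 0.3 threshold comes from the text.

```python
import math

THRESHOLD = 0.3  # similarity cutoff quoted in the text


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def likely_compound(u, v, threshold=THRESHOLD):
    """True when the similarity of the two vectors exceeds the threshold."""
    return cosine_similarity(u, v) > threshold


# Made-up toy embeddings, NOT real word2vec output.
toy_vectors = {
    "book": [0.9, 0.1, 0.3],
    "shelf": [0.7, 0.2, 0.5],
    "cloud": [-0.4, 0.9, -0.1],
}

print(likely_compound(toy_vectors["book"], toy_vectors["shelf"]))
print(likely_compound(toy_vectors["book"], toy_vectors["cloud"]))
```

With these toy values the first pair clears the threshold and the second does not, which is the behavior the threshold rule describes.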
