NLP POS tagging using Hidden Markov Models and Mahout
Yogesh Pawar Big Data Architect and Trainer STLSOFT www.stlsoft.in Pune 411 057
Abstract
This article discusses the use of Hidden Markov Models (HMMs) for Natural Language Processing (NLP), implemented on Hadoop. NLP involves processing large volumes of unstructured text to derive value. Hidden Markov Models are widely used in NLP for Part of Speech (POS) tagging. Parameterised Hidden Markov Models are included in Mahout, the machine learning library for Hadoop. Large training datasets can be processed in distributed mode on Hadoop, and further large datasets can subsequently be analysed. Textual data can be processed in parallel without the need for global state, which makes Hadoop MapReduce a suitable distributed framework for NLP.
1. Introduction
"Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages." – Wikipedia 1