Coding Mines: Nepali Parser

I thought it would not be fair to keep the research and findings of my friends and me confined to some sectors of a hard disk. So I am publishing the objectives we achieved while researching a Nepali Parser two years ago.

This work would not have been carried out, and the results would not have reached this level, without the generous support of Madan Puraskar Pustakalaya (an organization). They provided us with the requirements and the rules that needed to be incorporated into the technical part of the project.

Further expertise came from Mr. G. Prasanna David, former Lecturer at Gandaki College of Engineering and Science. (He has great expertise in Linux system administration, networking, Natural Language Processing and much more.)

Mr. Chandra Kanta Adhikari (teacher at Gandaki Boarding School, Nepali Department) provided us with his expertise, suggestions and guidelines. His expertise in the Nepali language was greatly appreciated.

We did this work as the major project to fulfill the requirements of our B.E. final semester.

As the project could not be done alone, my friends Mr. Shakeel Shrestha and Ms. Muna Khadka were also on our team. We all contributed equally to reach our goal.

The term 'Parser' largely explains itself, but at the level of research and implementation we needed to break it down into two components: the Chunker and the Parse Tree Generator. The Chunker is the core part, but the Tree Generator is also important because it plays a vital role in how the user views the result prepared by the Chunker. I will first elaborate on the Chunker.

Chunking is the part of Natural Language Processing in which the relationships between the words or phrases in a sentence are determined and the related items are grouped together. The Chunker in this context takes a tagged input sentence, finds the relations between words or between phrases, and represents them in parenthesized notation. The system operates on the basis of a set of rules. Each word of the input sentence must be given in Unicode, followed by a slash and then the tag of the word in capital English letters. Two consecutive words must be separated by exactly one space. The Chunker then finds the phrases among the input words according to those rules. Put simply, the generated phrases are called chunks.
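To make the input format and the rule-based grouping more concrete, here is a minimal sketch in Python. It is not the code of our project: the tag names (JJ, NN, VB), the three rules, the example sentence and the exact shape of the parenthesized output are all assumptions chosen for illustration. It only shows the general idea of reading word/TAG tokens separated by a single space and grouping them by matching tag patterns.

```python
# Hypothetical tag names and rules, for illustration only; the actual
# tag set and rule set of the project are not given in this post.
# Rules are tried in order, so longer patterns are listed first.
RULES = [
    ("NP", ("JJ", "NN")),  # adjective + noun -> noun phrase
    ("NP", ("NN",)),       # single noun -> noun phrase
    ("VP", ("VB",)),       # single verb -> verb phrase
]


def parse_tagged(sentence):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    Splitting on exactly one space means a doubled space produces an
    empty token, which is rejected as malformed.
    """
    pairs = []
    for token in sentence.split(" "):
        word, sep, tag = token.rpartition("/")
        tag_ok = tag.isascii() and tag.isalpha() and tag.isupper()
        if not sep or not word or not tag_ok:
            raise ValueError(f"malformed token: {token!r}")
        pairs.append((word, tag))
    return pairs


def chunk(pairs):
    """Group (word, tag) pairs into chunks using the first matching rule."""
    chunks = []
    i = 0
    while i < len(pairs):
        for label, pattern in RULES:
            window = pairs[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                chunks.append((label, window))
                i += len(pattern)
                break
        else:
            # No rule matched: keep the word as its own chunk under its tag.
            chunks.append((pairs[i][1], [pairs[i]]))
            i += 1
    return chunks


def to_parens(chunks):
    """Render chunks in an (assumed) parenthesized notation."""
    parts = []
    for label, words in chunks:
        inner = " ".join(f"{word}/{tag}" for word, tag in words)
        parts.append(f"({label} {inner})")
    return "(S " + " ".join(parts) + ")"


if __name__ == "__main__":
    sentence = "सानो/JJ केटो/NN भात/NN खान्छ/VB"
    print(to_parens(chunk(parse_tagged(sentence))))
```

With these assumed rules, the example sentence is printed as (S (NP सानो/JJ केटो/NN) (NP भात/NN) (VP खान्छ/VB)). A real rule set would be much larger and would also have to decide what to do when more than one rule can apply at the same position.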

The Chunker in this context is domain specific and chunks simple sentences like those found in primary-level school books. Complex sentences are outside the scope of this project. To achieve broad domain coverage, a well-analyzed and complete set of rules would need to be provided.

Such a system is useful for many other systems, such as machine translation, grammar checking and information retrieval. Below is the block diagram of the whole system.

[Figure: block diagram of the Nepali Parser]
