1) Sentence completion task using web-scale data

We propose a method to automatically answer SAT-style sentence completion questions using web-scale data. Web-scale data have been used in many language studies and have been found to be a very useful resource for improving accuracy in the sentence completion task. Our method employs assorted N-gram probability information for each candidate word. We also propose a back-off strategy to remove zero probabilities. We found that the accuracy of our proposed method improved by 52-87% over the current state-of-the-art.
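
As a rough illustration of N-gram scoring with back-off, the sketch below fills a blank by scoring each candidate against a tiny toy corpus. The abstract does not specify the back-off scheme, so the fixed penalty `alpha`, the add-one unigram fallback, and all function names here are assumptions made for the example, not the authors' method.

```python
from collections import Counter

def build_counts(corpus, max_n=3):
    """Count every 1- to max_n-gram in a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_score(counts, total, context, word, alpha=0.4):
    """Score `word` after `context`, shortening the context when the n-gram is unseen.

    alpha is an assumed fixed penalty applied at each back-off step.
    """
    context = tuple(context)
    penalty = 1.0
    while context:
        seen = counts[context + (word,)]
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]      # drop the oldest context word
        penalty *= alpha
    # Unigram fallback with add-one smoothing, so the score is never zero.
    return penalty * (counts[(word,)] + 1) / (total + 1)

def complete(counts, tokens, blank, candidates):
    """Return the candidate that best fits position `blank` in `tokens`."""
    total = sum(c for gram, c in counts.items() if len(gram) == 1)

    def score(word):
        filled = tokens[:blank] + [word] + tokens[blank + 1:]
        result = 1.0
        # Multiply the scores of every trigram window that touches the blank.
        for i in range(blank, min(blank + 3, len(filled))):
            result *= backoff_score(counts, total, filled[max(0, i - 2):i], filled[i])
        return result

    return max(candidates, key=score)

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat sat on a chair",
]
counts = build_counts(corpus)
question = "the cat ___ on the mat".split()
print(complete(counts, question, question.index("___"), ["sat", "flew", "sang"]))  # sat
```

With web-scale counts the same structure applies; only the source of `counts` changes.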

2) Semantic graph based approach for text mining

A semantic network is a graphical notation for representing knowledge in the form of interconnected nodes and arcs. In this paper we propose a novel approach to constructing a semantic graph from a text document. Our approach considers all the nouns of a document and builds a semantic graph that represents the entire document. We believe that our graph captures many properties of the text document and can be used for different applications in text mining and NLP, such as keyword extraction and determining the nature of the document. Our approach to constructing a semantic graph is language-independent. We performed an experimental analysis to validate our results on extracting the keywords of a document and deriving the nature of the graph. We present experimental results on graph construction for the FIRE data set, along with its application to keyword extraction and to commenting on the nature of a document.
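
The abstract does not say how nouns are linked, so the following is only a sketch under stated assumptions: nouns are already extracted per sentence, an edge connects two nouns that occur in the same sentence, and keywords are ranked by total edge weight. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(noun_sentences):
    """Build a weighted graph; edge weight = number of sentences shared by two nouns.

    Sentence-level co-occurrence is an assumption, not the paper's stated rule.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for nouns in noun_sentences:
        for a, b in combinations(sorted(set(nouns)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def top_keywords(graph, k=3):
    """Rank nodes by the sum of their edge weights (ties broken alphabetically)."""
    strength = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    return sorted(strength, key=lambda n: (-strength[n], n))[:k]

# Nouns per sentence, assumed to come from an upstream part-of-speech tagger.
noun_sentences = [
    ["graph", "document", "noun"],
    ["graph", "keyword"],
    ["document", "keyword", "graph"],
]
graph = build_graph(noun_sentences)
print(top_keywords(graph, k=2))  # ['graph', 'document']
```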

3) Text Simplification Tools: Using Machine Learning to Discover Features that Identify Difficult Text

Although providing understandable information is a critical component in healthcare, few tools exist to help clinicians identify difficult sections in text. We systematically examine sixteen features for predicting the difficulty of health texts using six different machine learning algorithms. Three represent new features not previously examined: medical concept density; specificity (calculated using word-level depth in MeSH); and ambiguity (calculated using the number of UMLS Metathesaurus concepts associated with a word). We examine these features for a binary prediction task on 118,000 simple and difficult sentences from a sentence-aligned corpus. Using all features, random forests is the most accurate, with 84% accuracy. Model analysis of the six models and a complementary ablation study show that the specificity and ambiguity features are the strongest predictors (24% combined impact on accuracy). Notably, a training size study showed that even with a 1% sample (1,062 sentences) an accuracy of 80% can be achieved.

Dhiya: A stemmer for morphological level analysis of Gujarati language

Understanding a language requires analysis at the word, sentence, context and discourse levels. Morphological analysis lies at the base of all of these, as it is the first step in understanding a given sentence. One task that can be performed at the morphological level is stemming, which identifies the stem of a given word. Stemming is important not only in Natural Language Processing but equally in Information Retrieval. In this paper, the authors propose DHIYA, a stemmer for the Gujarati language. The stemmer is based on the morphology of Gujarati. To develop it, the inflections that appear most frequently in Gujarati text were identified, and a rule set was created from them. The EMILLE corpus is used for training and for evaluating the stemmer's performance. The accuracy of the stemmer is 92.41%.
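
A rule-based stemmer of this kind can be sketched as longest-suffix stripping. The four suffixes below are common Gujarati case and plural markers chosen purely for illustration; they are not the DHIYA rule set, and the minimum stem length is likewise an assumption.

```python
# Hypothetical suffix rules for illustration only; NOT the DHIYA rule set.
SUFFIXES = ["માં", "થી", "ને", "ઓ"]

def stem(word, suffixes=SUFFIXES, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applies: the word is treated as its own stem

def accuracy(gold_pairs):
    """Fraction of (word, expected_stem) pairs the stemmer gets right."""
    correct = sum(1 for word, expected in gold_pairs if stem(word) == expected)
    return correct / len(gold_pairs)

print(stem("ઘરમાં"))   # "in the house" -> ઘર
print(stem("શહેરથી"))  # "from the city" -> શહેર
```

Given a hand-stemmed sample, `accuracy` would yield the kind of figure reported above; a real rule set would be much larger and derived from corpus frequencies.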

Medical Document Classification Based on MeSH

One of the most challenging projects in information systems is extracting information from unstructured texts, including medical document classification. I am developing a classification algorithm that classifies a medical document by analyzing its content and categorizing it under predefined topics from the Medical Subject Headings (MeSH). I collected a corpus of 50 full-text journal articles (N=50) from MEDLINE, which were already indexed by experts based on MeSH. Using natural language processing (NLP), my algorithm classifies the collected articles under MeSH subject headings. I evaluated the algorithm's output by measuring the precision and recall of the subject headings it produced against the documents' actual subject headings. The algorithm classified the articles correctly under 45% to 60% of the actual subject headings and got 40% to 53% of the total subject headings correct. This is a promising approach for indexing and classifying medical documents expeditiously in the global health arena.
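
The evaluation described above compares predicted headings with expert-assigned ones. A minimal sketch of that comparison for a single article follows; the heading sets are made up for the example and do not come from the study.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall for one document's subject headings."""
    predicted, actual = set(predicted), set(actual)
    correct = predicted & actual
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical headings for one article (illustrative values only).
predicted = {"Neoplasms", "Humans", "Risk Factors", "Diet"}
actual = {"Neoplasms", "Humans", "Risk Factors", "Smoking", "Adult"}

precision, recall = precision_recall(predicted, actual)
print(precision, recall)  # 0.75 0.6
```

Precision is the share of predicted headings that are correct; recall is the share of the expert headings that were recovered.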

Korean-Thai Lexicon for Natural Language Processing

This paper presents a Korean-Thai lexicon. The research aims to study and collect the features necessary to construct a Korean-Thai lexicon for natural language processing (NLP) and speech processing research. The research method consisted of (1) creating a Korean-Thai lexicon with 7 parts: Korean word, Korean Revised Romanization, part of speech, sub part of speech, special characteristic, Thai meaning and description of meaning; and (2) Korean transcription. Because useful tools for Korean-Thai machine translation are lacking, we propose creating a Korean-Thai lexicon for machine translation. The lexicon consists of 36,000 Korean words. As it would take a lot of time and effort to gather enough Korean words to cover all domains, Korean Revised Romanization was applied for some words, such as terminology, names and places.

Analysis of stock market using text mining and natural language processing

The stock market has become one of the major components of the economy, not only in developed countries but also in developing countries. Making decisions in the stock market is not easy because many factors are involved in every choice. Therefore, a lot of analysis is required to make an optimal move in the stock market, which may involve the price trend, the market's nature, the company's stability, and news and rumors about stocks. The objective of this study is to extract fundamental information from relevant news sources and use it to analyze, or sometimes forecast, the stock market from the common investor's viewpoint. We surveyed existing research on business text mining and propose a framework that uses our text parser and analyzer algorithm with an open source natural language processing tool to analyze (machine learning and text mining), retrieve (natural language processing) and forecast (comparison with historic data) investment decisions from any text data source on the stock market. For our research we used data from the Dhaka Stock Exchange (DSE), the capital market of Bangladesh.

CyberMate ∼ Artificial Intelligent business help desk assistant with instance messaging services

Most existing Artificial Intelligence software agents serve a single specific purpose, whereas CyberMate is a multipurpose AI agent based on pattern recognition that uses a chatterbot approach to interact with remote users. The system introduces a new technology to automate customer care services by hosting business information and technical support to address inquiries from end users. CyberMate allows remote end users to connect via Instant Messaging services and get requested information in English, Tamil, Sinhala or any other natural language. The system can act as a dedicated AI agent between the client and the service hosting party, providing information via instant messages. This is a much faster and more familiar solution than hosting a Frequently Asked Questions (FAQ) page as part of a business website. The system can also reduce the workload of the customer care waiting queue, which decreases customer dissatisfaction, and it makes the service accessible from any device with Internet connectivity. Business information can be easily modeled in the CyberMate system using the newly introduced CyberMate Scripting Language (CSL) and its own development environment (CSL-IDE).

On novel chatterbot system by means of web information

Recently, the use of various chatterbots has been proposed to simulate conversation with human users. Several chatterbots can talk with users very well without a high-level contextual understanding. However, it may be difficult for chatterbots to reply to specific and interesting sentences because chatterbots lack intelligence. To solve this problem, we propose a novel chatterbot that can directly use Web information. We carried out computational experiments by applying the proposed chatterbot to “2channel” (2ch) and “Twitter”.

CHARLIE: An AIML-based chatterbot which works as an interface among INES and humans

INES (INtelligent Educational System) is a functional prototype of an online learning platform that combines three essential capabilities related to e-learning activities: those of an LMS (learning management system), an LCMS (learning content management system), and an ITS (intelligent tutoring system). To carry out all these functionalities, our system as a whole comprises a set of different tools and technologies, including: semantic tools for managing users (administrators, teachers, students...) and contents; an intelligent chatterbot able to communicate with students in natural language; an intelligent agent based on BDI (beliefs, desires, intentions) technology that acts as the brain of the system; an inference engine based on JESS (a rule engine for the Java platform); and ontologies (to model the user, his/her activities, and the learning contents) that contribute the semantics of the system. In the present paper we focus on the chatterbot, CHARLIE (CHAtteR Learning Interface Entity), developed and used in the platform, which is an AIML-based (artificial intelligence markup language) bot. We specifically address its performance and its contribution to INES.
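
To give a feel for how an AIML-style bot pairs patterns with response templates, here is a simplified sketch. It mimics AIML's uppercase patterns and `*` wildcard with regular expressions, but it is not an AIML interpreter: real AIML uses XML categories and its own matching priority, and the categories and replies below are invented for the example rather than taken from CHARLIE.

```python
import re

# (pattern, template) pairs, checked in order; "*" captures one or more words.
# All categories are hypothetical examples.
CATEGORIES = [
    ("HELLO", "Hello! How can I help you with the course?"),
    ("MY NAME IS *", "Nice to meet you, {star}."),
    ("WHAT IS *", "I do not have a definition for {star} yet."),
    ("*", "Could you rephrase that?"),  # catch-all, must come last
]

def normalize(text):
    """Uppercase the input, drop punctuation and collapse whitespace."""
    return " ".join(re.sub(r"[^A-Z0-9 ]", "", text.upper()).split())

def pattern_to_regex(pattern):
    """Turn an AIML-like pattern into an anchored regular expression."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "(.+)".join(parts) + "$")

def respond(text):
    """Return the template of the first category whose pattern matches."""
    sentence = normalize(text)
    for pattern, template in CATEGORIES:
        match = pattern_to_regex(pattern).match(sentence)
        if match:
            star = match.group(1).title() if match.groups() else ""
            return template.format(star=star)
    return ""  # only reached for empty input

print(respond("My name is Ana!"))  # Nice to meet you, Ana.
```

Ordering the list from specific to general stands in for AIML's own rule that more specific patterns win over wildcards.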

Extending chatterbot system into multimodal interaction framework with embodied contextual understanding

This work aims to realize multimodal interaction with embodied contextual understanding based on a simple chatterbot system. A system framework is proposed that integrates the dialogue system into a 3D simulation platform, SIGVerse, to attain multimodal interaction. The chatterbot's AIML implementations are described, showing how conversations with embodied contextual understanding are achieved in HRI simulations.

java Projects

Friday, April 25, 2014

IEEE Projects in Natural Language Processing

