Semantic Web for Effective Healthcare Systems. Group of authors
information extraction and text analysis techniques.
Analyzing social media text or user-generated content poses many challenges. In languages like English, the same word can have multiple meanings (polysemy), and different words can have the same meaning (synonymy). People show “variety” and use heterogeneous words when expressing their views, which often complicates the processing of textual data. Most feature extraction techniques do not consider the semantic relationships between terms. The subjectivity inherent in text processing techniques adds complexity to the process, which in turn affects the evaluation of results. Also, the scarcity of gold-standard or annotated text data for different domains adds further challenges to text analysis [6]. Hence, the identification and application of suitable Natural Language Processing (NLP) techniques are the main research focus in text data analysis.
Figure 1.1 Decision-making process from social media reviews.
Text analytics supports context matching between the reader and the writer. This challenge can be managed if the different vocabularies of features and their relationships are well represented in the data model. For example, content-based contextual analysis of user feedback helps users buy new products or avail of a service by highlighting the best features of those products or services. Challenges and issues in information retrieval can be overcome if Ontology representation and topic modeling techniques are used for modeling the text documents. The chapter focuses on extracting relevant features from a set of documents and building a domain Ontology for them. The Ontology helps in building a predictive or sentiment analysis model by using suitable information retrieval (IR) techniques and a contextual representation of data, so as to enable an automated decision-making process before buying a new product or availing of a new service, as shown in Figure 1.2.
Figure 1.2 User-generated content analysis (UCA) model.
1.1.1 Ontology-Based Information Extraction
An Ontology describes a domain in terms of classes. It is defined as a conceptual model of knowledge representation. The Ontology model describes the concepts of the domain (classes), their attributes, their properties, and their relationships. It also explains the meanings of the terms applicable to the domain. Ontology is one of the key components of semantic web technology. Semantic web technologies such as Ontology, RDF, and SPARQL are used to describe different words and their dependencies by modeling the textual data. Components of Ontology include:
• Concepts are also known as Classes. A concept is a unit of knowledge shared among an identified group of persons for the concept’s domain. Relationships exist among concepts.
• Instances are individuals of concepts. They represent specific elements attached to the domain Ontology. An instance is the “thing” represented by a concept.
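The components listed above can be pictured with a small data structure. The following is a minimal Python sketch, not the chapter's implementation: the `Concept` and `Ontology` classes, the `hasFeature` relation, and the hotel-review examples are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A class of the domain, with its vocabulary, relations, and individuals."""
    name: str
    terms: set = field(default_factory=set)      # words that denote the concept
    related: dict = field(default_factory=dict)  # relation name -> set of concept names
    instances: set = field(default_factory=set)  # individuals of the concept


class Ontology:
    """A toy container for concepts; real systems would use RDF/OWL tooling."""

    def __init__(self):
        self.concepts = {}

    def add_concept(self, name, terms=()):
        self.concepts[name] = Concept(name, set(terms))
        return self.concepts[name]

    def relate(self, subject, relation, obj):
        # Record a named relationship between two concepts.
        self.concepts[subject].related.setdefault(relation, set()).add(obj)

    def add_instance(self, concept, individual):
        self.concepts[concept].instances.add(individual)

    def instances_of(self, concept):
        return sorted(self.concepts[concept].instances)


# Hypothetical hotel-review domain, used only as an example.
onto = Ontology()
onto.add_concept("Hotel")
onto.add_concept("Room", terms={"room", "suite"})
onto.relate("Hotel", "hasFeature", "Room")
onto.add_instance("Hotel", "Seaside Inn")
```

Here `Hotel` and `Room` are concepts, `hasFeature` is a relationship among concepts, and `Seaside Inn` is an instance, the "thing" that the `Hotel` concept represents.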
Information Extraction (IE) and Ontology are related to one another: Ontology is used in information extraction as part of the process of understanding the domain, while IE is used to design and enrich the Ontology [7]. Ontology enables a common vocabulary and a shared understanding among different people. The contextual representation of data semantics is well described by the Ontology [8]. UML diagrams along with Ontology support biologists in classifying the entities and interactions between proteins and genes [9]. The terms (vocabularies) and the concepts (classes) in the source Ontology are used in term matching and thereby in tagging the text documents. Thus the Ontology and its specifications are used in the information extraction process.
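Term matching against a source Ontology can be sketched in a few lines of Python. This is an illustrative assumption about how such tagging might look, not the method of the cited works; the `ONTOLOGY_TERMS` dictionary, its concepts, and the `tag_document` function are hypothetical.

```python
import re

# Hypothetical source ontology: concept (class) -> terms that denote it.
ONTOLOGY_TERMS = {
    "Staff": {"staff", "receptionist", "employees"},
    "Room": {"room", "suite", "bed"},
    "Price": {"price", "cost", "rate"},
}


def tag_document(text, ontology_terms=ONTOLOGY_TERMS):
    """Return the sorted concepts whose terms occur in the text."""
    # Lowercase and split into alphabetic tokens for exact term matching.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        concept for concept, terms in ontology_terms.items() if tokens & terms
    )


review = "The receptionist was friendly, but the bed was too small."
print(tag_document(review))  # prints ['Room', 'Staff']
```

Exact token matching is the simplest possible choice; a fuller pipeline would typically add stemming or lemmatization so that inflected forms such as "rooms" also match.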
1.1.2 Ontology-Based Knowledge Representation
Knowledge is data that represents the outcome of computer-based cognitive processes such as perception, learning, association, and reasoning, or the translation of knowledge acquired by humans [10]. It is the language by which humans express their understanding of a concept. The concepts and the instances of a particular domain are expressed in the knowledge base, also referred to as the semantic knowledge dictionary. It is one of the most important techniques to represent the knowledge for a domain. Domain Ontology is developed to formally define the concepts, relationships, and rules so as to include the semantic content of the domain. The semantic approach uses the concepts in the documents, rather than the terms, to establish the contextual relationship. Issues like synonymy and polysemy may not be resolved if terms are used as indices while modeling the text documents. Semantic-based information extraction approaches such as Latent Semantic Indexing [11] and Latent Dirichlet Allocation [12, 13] are used for building the relationships among the indexed terms, so as to represent the contexts between the concepts. This chapter focuses on developing a domain Ontology to represent the features and their related terms mentioned in the product/service reviews generated on social media websites.
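The effect of indexing by concepts rather than by terms can be illustrated with a toy comparison. The Python sketch below is neither LSI nor LDA: it replaces the learned latent structure with a hand-written, hypothetical `TERM_TO_CONCEPT` dictionary, purely to show why synonymy defeats term-based indices.

```python
import math
import re
from collections import Counter

# Hypothetical mapping from surface terms to ontology concepts.
TERM_TO_CONCEPT = {
    "cost": "Price", "price": "Price", "rate": "Price",
    "staff": "Staff", "employees": "Staff",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def term_vector(text):
    """Index the document by its raw terms."""
    return Counter(tokenize(text))


def concept_vector(text):
    """Index the document by the concepts its terms map to."""
    return Counter(
        TERM_TO_CONCEPT[t] for t in tokenize(text) if t in TERM_TO_CONCEPT
    )


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


doc1 = "cost high staff rude"
doc2 = "price steep employees impolite"

term_sim = cosine(term_vector(doc1), term_vector(doc2))
concept_sim = cosine(concept_vector(doc1), concept_vector(doc2))
```

The two reviews share no terms, so `term_sim` is 0.0, while `concept_sim` is approximately 1.0 because both reviews mention the same two concepts. LSI and LDA aim at a similar effect but infer the term groupings from the corpus instead of from a fixed dictionary.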
1.2 Related Work
Ontology facilitates shared understanding among people by formalizing the conceptualization of a specific domain. The contextual representation of data semantics is well described by the Ontology [8]. Ontology defines concepts (domain) by using a common vocabulary and describes attributes, behavior, relationships, and constraints. UML diagrams along with Ontology representation explain the interactions between proteins and genes well, which supports biologists in classification [9]. Reviews of hotels and movies have been classified using rule-based systems and Ontology [14–16]. Document annotation and rules were used to create a knowledge base of web documents from extracted semantic data such as named entities [14, 17, 18]. Ontology learning and RDF repositories were used for building knowledge and information management, which in turn enabled the automatic annotation and retrieval of documents [19]. WordNet Ontology was used in extracting sentiments based on lexicon dictionaries [20, 21].
The information extraction process uses Ontology for understanding the domain and for extracting the relevant information. Its complexity is reduced because it is domain specific. IE techniques are then used for populating and enhancing the Ontology. These Ontologies can be enriched from useful sources of knowledge [7]. SVM classification along with SentiWordNet enabled the building of a sentiment dictionary for positive and negative categorization of text documents [23]. Opinion extraction techniques along with entropy-based classification techniques were used for building a structured Ontology for a Digital Camera dataset [24]. Products and their attributes were classified according to their hierarchy using the hierarchical learning sentiment ontology tree (HL-SOT) algorithm, which in turn was used for opinion mining of products and their features [25].
A knowledge base refers to the dictionary of the vocabulary used to represent the concepts of a specific domain. The Ontology provides the semantic knowledge for class instances, like a dictionary. The meaning of documents may be extracted using a semantic-based approach by establishing a suitable context within the document, instead of using the terms present in the document. Related terms were extracted and categorized using semantic-based approaches such as LSI [11] and LDA [13]. An Ontology-based sentiment analysis model was developed for mining product features from customer reviews [1]. A hybrid model combining Ontology with a Genetic Algorithm was used for automatic grouping of Chinese proposals into different clusters, resulting in an F-measure above 90% [26]. Sentiment lexicons of emotional categories were derived from Twitter posts about mobile products by using Ontology learning and lexicon-based techniques [27]. Ontology and a vector analysis method were used in feature selection and sentiment analysis of movie reviews [22]. An Ontology-based sentiment analysis model along with rule-based classification was used in the postal services