EVENT EXTRACTION FROM DIGITAL MEDIA

Gupta, Swati

URI: http://localhost:8081/xmlui/handle/123456789/15327

Date: 2018-09

Abstract:

Digital media has become a source of a huge amount of up-to-date information, which grows exponentially day by day. This information is buried in unstructured data, so end users cannot directly access what they need. One solution is to collect the important facts from the unstructured data and store them in a form that helps answer end users' queries. The procedure of selectively organizing and consolidating data that is explicitly expressed or implied in one or more natural language documents is known as Information Extraction (IE). Generally, information in digital media is reported in the form of events. Events can take several forms, such as specific names, changes of state, situations, actions, and relations. Standard datasets such as ACE represent an event as a triplet of event mention, event trigger, and event arguments, and divide events into eight categories. Event extraction is therefore an important information extraction task, as it supports systems for story-telling, news event exploration, social media information fusion, question answering, and similar applications. To tackle the information overload problem, this thesis focuses on extracting information from news media and social media (Twitter) in terms of events and related key-phrases. In particular, the following problems are addressed:

• Named event extraction from a news headlines dataset using a knowledge-driven approach. The knowledge-driven approach uses patterns or templates that encode expert, domain-specific knowledge. The system uses the syntactic and semantic patterns of headlines to identify named events, which are short phrases that name an event, such as 2016 Rio Olympic Games, 2G Case, and Adarsh Society Scam. The named events are enriched with their type, categories, popular durations, and popularity. They are categorized into candidate-level and high-level categories using URL information, and their popular durations are extracted using the temporal information of the news headlines.

• Key-phrase extraction from news content, to offer the news audience a broad overview of news events, especially those with high content volume. Given an input query, the system extracts key-phrases and enriches them by tagging, ranking, and finding the role of frequently associated key-phrases. The system uses the syntactic and linguistic features of the text to extract key-phrases from news media content.

• Event extraction from a large-scale Twitter repository using an unsupervised approach. The amount of data acquired from streaming media like Twitter is vast, and it contains readily available information about important events taking place during a given time span. Hence, it is difficult to deploy supervised learning strategies to analyze tweets for meaningful information extraction. In addition, tweets are unstructured, reflecting the diversity of the end users who post them. A self-learning max-margin clustering approach, which applies the notion of the Support Vector Machine (SVM) in an unsupervised setup, is used to cluster semantically similar tweets.

In this thesis, machine learning algorithms and Natural Language Processing (NLP) tools are used to extract data from news media and Twitter. For each of the problems above, the significant literature is studied thoroughly and the limitations of some existing methods are highlighted. The main motivation for selecting these problems is to develop methods that address those limitations to the extent feasible. News media data (headlines, articles, meta keywords, etc.) and Twitter data are used to evaluate the performance of the proposed methods against relevant state-of-the-art methods.
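As an illustration of the knowledge-driven idea described in the first problem above, the following is a minimal sketch of pattern-based named event extraction from headlines. It is not the method used in the thesis: the list of head words (`EVENT_HEAD_WORDS`), the token pattern, and the function name `extract_named_events` are all hypothetical and chosen only to cover the three example names given in the abstract.

```python
import re

# Hypothetical head words that often end a named event.
# This list is an assumption for illustration, not taken from the thesis.
EVENT_HEAD_WORDS = ("Games", "Case", "Scam", "Elections", "Summit", "Cup")

# A token is either a capitalized word ("Rio") or a number with an
# optional trailing capital letter ("2016", "2G").
TOKEN = r"(?:[A-Z][A-Za-z]*|\d+[A-Z]?)"

# Template: one to four tokens followed by an event head word.
PATTERN = re.compile(
    r"\b((?:" + TOKEN + r"\s+){1,4}(?:" + "|".join(EVENT_HEAD_WORDS) + r"))\b"
)


def extract_named_events(headline):
    """Return candidate named events found in a single headline."""
    return [match.group(1) for match in PATTERN.finditer(headline)]


if __name__ == "__main__":
    headlines = [
        "India wins gold at 2016 Rio Olympic Games",
        "Court hearing in 2G Case adjourned",
        "Adarsh Society Scam: panel submits report",
    ]
    for headline in headlines:
        print(headline, "->", extract_named_events(headline))
```

A purely surface-level template like this over-generates (for example, it would absorb a capitalized word that happens to precede the event name), which is one reason a knowledge-driven system would combine such templates with syntactic and semantic cues, as the abstract describes.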
