Abstract:
Digital media has become a source of a huge amount of up-to-date information, and that
volume grows rapidly day by day. The information is buried in unstructured data, so end
users cannot directly access the facts they want. One solution is to collect important facts
from unstructured data and store them in a form that can answer end-user queries. The
process of selectively structuring and consolidating data that is explicitly stated or implied
in one or more natural language documents is known as Information Extraction (IE).
Generally, the information in
digital media is reported in the form of events. Events can take several forms, such as
named occurrences, changes of state, situations, actions, and relations. Standard datasets
in the ACE format represent an event as a triplet of event mention, event trigger, and
event arguments, and divide events into eight categories. Event extraction is therefore
an important task in information extraction, as it supports systems such as story-telling,
news event exploration, social media information fusion, and question answering.
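
The triplet representation described above can be illustrated with a small data structure. This is only a sketch: the class name `Event`, its field names, the example sentence, and the argument role labels are assumptions made for illustration, and the eight type names are the top-level event types of the ACE 2005 annotation scheme rather than something listed in this abstract.

```python
from dataclasses import dataclass, field

# The eight top-level event types of the ACE 2005 annotation scheme.
ACE_EVENT_TYPES = {
    "Life", "Movement", "Transaction", "Business",
    "Conflict", "Contact", "Personnel", "Justice",
}


@dataclass
class Event:
    """Hypothetical container for one event: mention, trigger, arguments."""
    mention: str      # the text span (here a sentence) that reports the event
    trigger: str      # the word that most clearly expresses the event
    event_type: str   # one of the eight categories above
    arguments: dict = field(default_factory=dict)  # role -> participant text

    def __post_init__(self):
        # Reject records that do not fit the representation.
        if self.event_type not in ACE_EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")
        if self.trigger not in self.mention:
            raise ValueError("trigger must occur in the mention")


# Illustrative example; the sentence and role labels are made up.
example = Event(
    mention="Rebels attacked the convoy on Tuesday.",
    trigger="attacked",
    event_type="Conflict",
    arguments={"Attacker": "Rebels", "Target": "the convoy", "Time": "Tuesday"},
)
```

Storing events in a structured form like this is what lets a downstream system answer queries such as "which events involve a given participant" without re-reading the raw text.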
To tackle the information overload issue, this thesis focuses on extracting information
from news media and social media (Twitter) in terms of events and related key-phrases. In
particular, the following problems are addressed:
• Named event extraction from a news headlines dataset using a knowledge-driven approach.
This approach uses patterns or templates that encode expert, domain-specific
knowledge. The extracted named events are enriched with their type,
categories, popular durations, and popularity. The system uses the syntactic and
semantic patterns of headlines to identify named events. Named events are short
phrases that name events, such as 2016 Rio Olympic Games, 2G Case, and
Adarsh Society Scam. Named events are assigned candidate-level and high-level
categories using URL information, and their popular durations are
extracted using the temporal information of news headlines.
• Key-phrase extraction from news content, to offer the news audience a broad overview
of news events, especially those with a high volume of content. Given an input
query, the system extracts key-phrases and enriches them by tagging, ranking, and
finding the roles of frequently associated key-phrases. The system uses syntactic
and linguistic features of the text to extract key-phrases from news media content.
• Event extraction from a large-scale Twitter repository using an unsupervised approach.
The volume of data acquired from streaming media such as Twitter is vast, and it
contains readily available information about important events taking place during
a given time span. Because labelling data at this scale is impractical, it is difficult
to deploy supervised learning strategies to extract meaningful information from tweets.
In addition, tweets are unstructured, given the diversity of the end users who post them.
A self-learning max-margin clustering approach, which applies the notion of the Support
Vector Machine (SVM) in an unsupervised setup, is used to cluster semantically similar
tweets.
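
As an illustration of the knowledge-driven idea in the first problem above, the sketch below applies one hand-written pattern to headlines: a run of capitalised or digit-initial tokens that ends in an event cue word. It is not the pattern set used in the thesis. The cue-word list, the function names, and the sample headlines are all assumptions chosen so that the named events quoted above are matched.

```python
import re

# Hypothetical cue words that often end an event name; an assumption made
# for this sketch, not the templates used in the thesis.
CUE_WORDS = {"Games", "Case", "Scam"}


def is_name_token(token):
    """A token may belong to an event name if it starts with an
    uppercase letter or a digit (e.g. 'Rio', '2016', '2G')."""
    return token[0].isupper() or token[0].isdigit()


def extract_named_events(headline):
    """Return runs of name-like tokens that end in a cue word."""
    tokens = re.findall(r"[A-Za-z0-9]+", headline)
    events = []
    run = []  # current run of consecutive name-like tokens
    for token in tokens:
        if not is_name_token(token):
            run = []  # a lowercase word breaks the run
            continue
        run.append(token)
        # Require at least one token before the cue word.
        if token in CUE_WORDS and len(run) >= 2:
            events.append(" ".join(run))
            run = []
    return events


# Made-up headlines built around the named events mentioned in the text.
headlines = [
    "India wins first medal at 2016 Rio Olympic Games",
    "Verdict in 2G Case expected today",
    "Court defers hearing in Adarsh Society Scam",
]
found = [extract_named_events(h) for h in headlines]
```

A single surface pattern like this breaks down on headlines written in title case, where every word is capitalised, which is one reason a practical system would combine several syntactic and semantic patterns.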
In this thesis, machine learning algorithms and Natural Language Processing (NLP) tools
are used to extract information from news media and Twitter. For each of the problems
above, the relevant literature is reviewed and the limitations of some existing
methods are highlighted. The main motivation for selecting these problems is
to develop methods that address those limitations to the extent feasible. News media data
(headlines, articles, meta keywords, etc.) and Twitter data are used to evaluate the performance
of the proposed methods against relevant state-of-the-art methods.