dc.description.abstract |
Many business and scientific domains require the collection and analysis of time series
data. Some typical application domains are finance, sales, biometrics, and weather
forecasting. Data mining performed on time series data is called time series data
mining. In the last decade, there have been several attempts to model time series data,
design query languages for it, and develop access structures for its efficient storage and
retrieval. The work presented in this thesis proposes new and efficient techniques for
feature extraction from time series data, similarity search in time sequences, clustering,
and association rule mining from time series data.
In the first part, novel feature extraction techniques for time series data have been
proposed using first moments, second (time weighted) moments and cumulative variation
in slopes. The techniques that employ moments are based on the observation that a
high-dimensional time sequence can be represented as a single point, called its centroid
(center of gravity), in 2-dimensional space. Time sequences are thus mapped to centroids in
the 2-dimensional plane. The techniques for feature extraction based on cumulative
variation of weighted slopes utilize the weighted sum of variation of slopes computed at
corresponding points of the time sequences. The weight assigned to each slope depends
on its location along the time axis, so the feature reflects not only the value of a slope
but also the position along the time axis at which it occurs.
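As an illustration of the moment-based mapping, the following minimal Python sketch treats each sample as a thin vertical strip and computes the center of gravity of those strips. The exact moment definitions used in the thesis are not given in this abstract, so the formulas, the names `centroid` and `centroid_distance`, and the assumption of positive sample values at unit-spaced time steps are all hypothetical.

```python
import math


def centroid(seq):
    """Map a time sequence to a 2-D point (assumed center-of-gravity formulas).

    Assumption: sample i is a strip of height seq[i] at time position i.
    The first moments about the two axes are divided by the total "mass".
    """
    total = float(sum(seq))
    if total == 0:
        raise ValueError("sequence must have a non-zero sum")
    # Time coordinate: first moment of the strips about the amplitude axis.
    t_bar = sum(t * x for t, x in enumerate(seq)) / total
    # Amplitude coordinate: each strip's own center sits at half its height.
    x_bar = sum(x * x / 2.0 for x in seq) / total
    return (t_bar, x_bar)


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two sequences."""
    return math.dist(centroid(a), centroid(b))
```

Under these assumptions, sequences of any length reduce to a single point in the plane, so two sequences can be compared by the distance between their centroids.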
The proposed schemes for feature extraction are faster than the existing ones, as a
comparison of the time complexities of the various feature extraction techniques shows.
Moreover, our suggested techniques are simple to understand and implement.
The next part of the thesis deals with the problem of similarity search in time series
data. We have suggested new techniques for similarity search in time series data and have
proved them to be superior to the competing techniques. Two of the proposed techniques
for similarity search rely on moments. Centroids are computed using these moments (first
or second). These techniques are based on the observation that similar time sequences
would have their centroids close to each other.
Our proposed moment-based techniques for similarity search in time series data
are simple to understand, easy to visualize, and more time efficient than their
existing counterparts. Moreover, they are capable of handling variable length queries,
horizontal and vertical shifts between the time sequences, and global scaling or shrinking
of the time sequences along both the amplitude and the time axes. They are also capable of
handling flexible distance measures, including the weighted Euclidean distance measure,
which is not possible with most other techniques that have been suggested so far.
Two more similarity search techniques have been suggested that employ cumulative
variation in slopes or cumulative variation in time weighted slopes for assessing
similarity in time sequences. For similar time sequences, the cumulative variation in
slopes is very small; ideally, for exact matches, it evaluates to zero. The technique
based on cumulative variation in time weighted slopes rests on the same idea. The only
difference is that the variations in slopes are assigned weights
depending upon their location along the time axis. These slope based techniques also
have all the advantages mentioned earlier for the moment based methods of
similarity search.
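The slope-based comparison can be sketched in Python as follows. This is an illustrative reading rather than the thesis's definition: the abstract does not give the formulas, so the use of first differences as slopes, the absolute difference as the "variation", the linear weights `i + 1`, and the function names are all assumptions.

```python
def slopes(seq):
    """Slopes between consecutive samples, assuming unit time steps."""
    return [b - a for a, b in zip(seq, seq[1:])]


def cumulative_slope_variation(a, b, weighted=False):
    """Sum of absolute slope differences at corresponding positions.

    With weighted=True, each difference is multiplied by a weight that grows
    with its position along the time axis (hypothetical choice: i + 1).
    """
    if len(a) != len(b):
        raise ValueError("this sketch compares equal-length sequences only")
    total = 0.0
    for i, (sa, sb) in enumerate(zip(slopes(a), slopes(b))):
        weight = (i + 1) if weighted else 1
        total += weight * abs(sa - sb)
    return total
```

In this sketch an exact match scores zero, and so does a vertically shifted copy, since shifting a sequence along the amplitude axis leaves its slopes unchanged.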
New schemes for clustering time sequences have been proposed in the later part of the
thesis. These techniques are based on the concept of whole sequence clustering. The
sequences are mapped to a 2-dimensional plane as points called centroids using first or
second moments. These points are then clustered using the k-means clustering algorithm.
Another clustering technique has been proposed based on cumulative weighted slopes.
The sum of weighted slopes is used for feature extraction from time sequences. These
cumulative slopes are then clustered using the k-means clustering algorithm.
In the last portion of the thesis, techniques for association rule mining from time series
data have been proposed. They are based on discretizing the time sequences. The time
series are divided into all possible subsequences using a fixed size window. Feature
extraction is then done from these subsequences using the method of first or second
moments or cumulative weighted slopes. These features are then clustered using the k-means
clustering algorithm. Each cluster represents a basic shape. The clusters (shapes) with
high frequency are then used for mining association rules from time series data. We have
used the Apriori algorithm to prune the candidate set of shapes. Our proposed technique
is capable of global as well as local rule discovery within one particular time sequence or
across multiple sequences.
Our work also includes extensive implementation examples on synthetic data generated
specifically for testing the suggested techniques. Real-life data has also
been used in experiments for every proposed technique. |
en_US |