Abstract:
Over the past few decades advances in genomic technologies have led to an explo
sive growth in the biological information generated by the scientific community.
There are over 65 billion nucleotides from more than 61 million individual se
quences in GenBank as on September 2006. With the enormous amount of
genomic and proteomic data available in the public domain, it is becoming in
creasingly important to be able to analyze the data and interpret the results to
decipher the connections between the genomic data and the biological function
ality of living cells and organisms.
Mapping the symbolic data into one or more numerical sequences opens the
possibility of applying signal processing techniques, especially digital signal pro
cessing (DSP) for solving highly relevant problems of biological sequence anal
ysis. Genomic signal processing (GSP) is a quickly evolving interdisciplinary
field that blends bioscience, medicine and signal processing. GSP offers several
robust and computationally efficient tools like discrete Fourier transform (DFT),
digital filters, discrete wavelet transform and several other tools for obtaining
solutions to biological problems. In several biological problems, application of
signal processing techniques forms the foundation of data analysis.
The goal of the current research work is to apply digital signal processing
(DSP) concepts for solving important problems related to sequence analysis.
iii -
Abstract
The thesis combines the advantages of DSP with pattern recognition technique
for identification and classification of sequence patterns. GSP techniques have
been successful especially for identification of hidden structures but have not
made much impact in sequence identification and sequence classification prob
lems. Furthermore, there exist very few signal processing techniques for protein
sequence analysis. In this thesis a broad methodology for analysis of DNA and
protein sequences is proposed. The aim of the thesis is not to replace the ex
isting techniques but to provide complementary approaches, to explore novel
applications of signal processing in bioinformatics, to devise simple and efficient
algorithms, to provide novel biological features and to apply machine learning
algorithms for improving the analysis capability of GSP algorithms. In chapter
two of the thesis, a brief review of the existing techniques and their limitations
for the problems that were taken up for the current research work is presented.
Chapter three of the thesis presents a signal processing technique for iden
tification of exact and inexact tandem repeat patterns in DNA sequences. It
is well known that tandem repeats in telomeres play important role in cancer
and are linked to over a dozen major neurodegenerative genetic disorders in hu
mans. Short tandem repeats are used for DNA fingerprinting. Despite their
importance, locating and characterizing these repeats within anonymous DNA
sequences remain a challenge. In past, signal processing (SP) algorithms based
on DFT and short periodicity transform (PT) techniques have been applied for
identifying tandem repeats. Periodicity transform based approach is computa
tionally expensive and inaccurate for inexact tandem repeat identification, espe
cially where it occurs due to insertion and deletion operations in DNA sequences.
Furthermore, both DFT and PT techniques for the case of inexact repeats cannot
clearly ascertain whether a pattern is due to period 'P' or its multiple. The pro-
IV
Abstract
posed algorithm applies a novel periodicity measure based on orthogonal exactly
periodic subspace decomposition (EPSD) technique. The algorithm is based on
the concept of identifying local periods in the input signal and is robust in iden
tifying inexact and hidden repeat patterns which otherwise are very difficult to
detect. The EPSD measure also resolves the problems that were present in pre
vious signal processing based approaches. The time complexity of the algorithm
is 0(NLwlog2Lw), where N is the length of the DNA sequence and Lw is the
window length for identifying repeats. To demonstrate the capabilities of the
algorithm, experiments were performed on artificially generated DNA sequences
and actual DNA sequences covering both exact and inexact repeats.
Chapter four of the thesis addresses the problem of identifying exact and inex
act inverted repeats present in DNA sequences. Fast correlation and periodicity
measure based algorithms are presented in this chapter for identifying both exact
and inexact IRs. Inverted repeats (IRs) are widespread in both prokaryotic and
eukaryotic genomes, and have been associated with a large number of possible
functions. Identification of inverted repeats and especially inexact inverted re
peats in a DNA sequence has remained one of the challenging problems in DNA
sequence analysis. Most of the existing methods for inverted repeat identification
are either very difficult to handle, as they require a large number of input param
eters or are inefficient in identifying inexact inverted repeats. Also, till date no
signal processing algorithm exists for identifying IRs. The algorithms require the
user to input only two easily understood parameters: maximum inverted repeat
size and minimum length of contiguous repeat. This makes IRs identification
job easier for the users, especially for biologists. The algorithm is evaluated by
performing experiment on biological dataset download from NCBI website. The
obtained results are compared with standard tool available online and the results
Abstract
show the effectiveness of the proposed approach. In Chapter five the correlation
based approach is extended for RNA secondary structure prediction problem
after modification in merging algorithm for IR detection.
Chapter six of the thesis deals with the problem of identifying protein coding
patterns and presents a novel pattern recognition framework based on wavelet
variance features for identifying protein coding DNA patterns. The identification
task is a very challenging because there is no specific criterion based on which
every coding and non-coding pattern can be identified. Currently, the most
accurate identification techniques are based on linear/slope model of Z-curve
components. However, the linear model provides apoor approximation for highly
non-linear Z-curve components. In-addition, the slope based techniques ignore
the local statistical information present in DNA sequences which are important
for identifying small coding patterns. The existing signal processing methods
are based on only period-3 feature of protein coding region. In the proposed
approach a wavelet based time series analysis technique has been applied for
extracting coding feature from Z-curve components. Till now, wavelets have
never been applied in identification of coding patterns. Also, pattern recognition
approach has not been explored for identification of coding DNA sequences. The
wavelet coefficient provides both local and global information contents of DNA
sequences. The proposed approach provided a 10-fold cross-validation accuracy
of more than 93% on recall patterns of Yeast genome. Furthermore, a combined
feature vector (i.e., slope andwavelet variance features) based SVM classification
is also proposed. The combined feature vector provided a 10-fold cross-validation
accuracy of 96% for recall patterns of Yeast genome and more than 96% recall
pattern accuracy for the E. coli genome.
Chapter seven of the thesis presents the development of a novel feature vector
VI
Abstract
for efficient identification of G-protein-coupled receptors (GPCRs), GPCRs fam
ilies, subfamilies and sub-subfamilies using SVM. GPCRs are one of the largest
groups of proteins in vertebrate species. Their classification and functional an
notation are very important in present medical and pharmaceutical research
because GPCRs play key roles in many diseases. The large dimension of fea
ture vector for the existing popular SVM based technique (SVMpred) makes the
classification task quite expensive in terms of computational and memory used.
The proposed feature vector is based on wavelet variance of seven important
physicochemical properties of amino acids. Furthermore, the dimension of the
proposed feature vector is also reduced to 35. This helps in building faster and
memory efficient classifier which can be implemented on any normal desktop
computer. The technique classifies GPCRs and non-GPCRs using a 5-fold crossvalidation
with accuracy and Matthews correlation coefficient (MCC) of 99.9%
and 0.998 respectively. The technique is further able to detect major classes
or families, subfamilies and sub-subfamilies of GPCRs with a total accuracy of
97.63%, 96.64% and 93.38% respectively. In addition, the technique classifies
the human GPCRs with accuracy and MCC of 99.88% and 0.998 respectively.
Finally in chapter eight, the contributions made in the thesis are summarized
and scope of future work is outlined.