TECHNIQUES FOR ANALYSIS OF GENOMIC SEQUENCE DATA

Gupta, Ravi

DSpace Home
→
ELECTRONICS AND COMMUNICATION ENGINEERING (FORMERLY ELECTRONICS & COMPUTER ENGINEERING)
→
DOCTORAL THESES (E & C)
→
View Item

TECHNIQUES FOR ANALYSIS OF GENOMIC SEQUENCE DATA

Gupta, Ravi

URI: http://hdl.handle.net/123456789/1797

Date: 2007

Abstract:

Over the past few decades advances in genomic technologies have led to an explo sive growth in the biological information generated by the scientific community. There are over 65 billion nucleotides from more than 61 million individual se quences in GenBank as on September 2006. With the enormous amount of genomic and proteomic data available in the public domain, it is becoming in creasingly important to be able to analyze the data and interpret the results to decipher the connections between the genomic data and the biological function ality of living cells and organisms. Mapping the symbolic data into one or more numerical sequences opens the possibility of applying signal processing techniques, especially digital signal pro cessing (DSP) for solving highly relevant problems of biological sequence anal ysis. Genomic signal processing (GSP) is a quickly evolving interdisciplinary field that blends bioscience, medicine and signal processing. GSP offers several robust and computationally efficient tools like discrete Fourier transform (DFT), digital filters, discrete wavelet transform and several other tools for obtaining solutions to biological problems. In several biological problems, application of signal processing techniques forms the foundation of data analysis. The goal of the current research work is to apply digital signal processing (DSP) concepts for solving important problems related to sequence analysis. iii - Abstract The thesis combines the advantages of DSP with pattern recognition technique for identification and classification of sequence patterns. GSP techniques have been successful especially for identification of hidden structures but have not made much impact in sequence identification and sequence classification prob lems. Furthermore, there exist very few signal processing techniques for protein sequence analysis. In this thesis a broad methodology for analysis of DNA and protein sequences is proposed. The aim of the thesis is not to replace the ex isting techniques but to provide complementary approaches, to explore novel applications of signal processing in bioinformatics, to devise simple and efficient algorithms, to provide novel biological features and to apply machine learning algorithms for improving the analysis capability of GSP algorithms. In chapter two of the thesis, a brief review of the existing techniques and their limitations for the problems that were taken up for the current research work is presented. Chapter three of the thesis presents a signal processing technique for iden tification of exact and inexact tandem repeat patterns in DNA sequences. It is well known that tandem repeats in telomeres play important role in cancer and are linked to over a dozen major neurodegenerative genetic disorders in hu mans. Short tandem repeats are used for DNA fingerprinting. Despite their importance, locating and characterizing these repeats within anonymous DNA sequences remain a challenge. In past, signal processing (SP) algorithms based on DFT and short periodicity transform (PT) techniques have been applied for identifying tandem repeats. Periodicity transform based approach is computa tionally expensive and inaccurate for inexact tandem repeat identification, espe cially where it occurs due to insertion and deletion operations in DNA sequences. Furthermore, both DFT and PT techniques for the case of inexact repeats cannot clearly ascertain whether a pattern is due to period 'P' or its multiple. The pro- IV Abstract posed algorithm applies a novel periodicity measure based on orthogonal exactly periodic subspace decomposition (EPSD) technique. The algorithm is based on the concept of identifying local periods in the input signal and is robust in iden tifying inexact and hidden repeat patterns which otherwise are very difficult to detect. The EPSD measure also resolves the problems that were present in pre vious signal processing based approaches. The time complexity of the algorithm is 0(NLwlog2Lw), where N is the length of the DNA sequence and Lw is the window length for identifying repeats. To demonstrate the capabilities of the algorithm, experiments were performed on artificially generated DNA sequences and actual DNA sequences covering both exact and inexact repeats. Chapter four of the thesis addresses the problem of identifying exact and inex act inverted repeats present in DNA sequences. Fast correlation and periodicity measure based algorithms are presented in this chapter for identifying both exact and inexact IRs. Inverted repeats (IRs) are widespread in both prokaryotic and eukaryotic genomes, and have been associated with a large number of possible functions. Identification of inverted repeats and especially inexact inverted re peats in a DNA sequence has remained one of the challenging problems in DNA sequence analysis. Most of the existing methods for inverted repeat identification are either very difficult to handle, as they require a large number of input param eters or are inefficient in identifying inexact inverted repeats. Also, till date no signal processing algorithm exists for identifying IRs. The algorithms require the user to input only two easily understood parameters: maximum inverted repeat size and minimum length of contiguous repeat. This makes IRs identification job easier for the users, especially for biologists. The algorithm is evaluated by performing experiment on biological dataset download from NCBI website. The obtained results are compared with standard tool available online and the results Abstract show the effectiveness of the proposed approach. In Chapter five the correlation based approach is extended for RNA secondary structure prediction problem after modification in merging algorithm for IR detection. Chapter six of the thesis deals with the problem of identifying protein coding patterns and presents a novel pattern recognition framework based on wavelet variance features for identifying protein coding DNA patterns. The identification task is a very challenging because there is no specific criterion based on which every coding and non-coding pattern can be identified. Currently, the most accurate identification techniques are based on linear/slope model of Z-curve components. However, the linear model provides apoor approximation for highly non-linear Z-curve components. In-addition, the slope based techniques ignore the local statistical information present in DNA sequences which are important for identifying small coding patterns. The existing signal processing methods are based on only period-3 feature of protein coding region. In the proposed approach a wavelet based time series analysis technique has been applied for extracting coding feature from Z-curve components. Till now, wavelets have never been applied in identification of coding patterns. Also, pattern recognition approach has not been explored for identification of coding DNA sequences. The wavelet coefficient provides both local and global information contents of DNA sequences. The proposed approach provided a 10-fold cross-validation accuracy of more than 93% on recall patterns of Yeast genome. Furthermore, a combined feature vector (i.e., slope andwavelet variance features) based SVM classification is also proposed. The combined feature vector provided a 10-fold cross-validation accuracy of 96% for recall patterns of Yeast genome and more than 96% recall pattern accuracy for the E. coli genome. Chapter seven of the thesis presents the development of a novel feature vector VI Abstract for efficient identification of G-protein-coupled receptors (GPCRs), GPCRs fam ilies, subfamilies and sub-subfamilies using SVM. GPCRs are one of the largest groups of proteins in vertebrate species. Their classification and functional an notation are very important in present medical and pharmaceutical research because GPCRs play key roles in many diseases. The large dimension of fea ture vector for the existing popular SVM based technique (SVMpred) makes the classification task quite expensive in terms of computational and memory used. The proposed feature vector is based on wavelet variance of seven important physicochemical properties of amino acids. Furthermore, the dimension of the proposed feature vector is also reduced to 35. This helps in building faster and memory efficient classifier which can be implemented on any normal desktop computer. The technique classifies GPCRs and non-GPCRs using a 5-fold crossvalidation with accuracy and Matthews correlation coefficient (MCC) of 99.9% and 0.998 respectively. The technique is further able to detect major classes or families, subfamilies and sub-subfamilies of GPCRs with a total accuracy of 97.63%, 96.64% and 93.38% respectively. In addition, the technique classifies the human GPCRs with accuracy and MCC of 99.88% and 0.998 respectively. Finally in chapter eight, the contributions made in the thesis are summarized and scope of future work is outlined.

Show full item record