ITTC Bioinformatics Cluster

A Computing Facility for Bioinformatics and Life Sciences Research

Project Active: Start Date 2004-09-29

Broadly, we can view computing associated with life sciences research as the application of advanced information technology to solve biological problems. These technologies are aimed at organizing biological data, analyzing the data, and then facilitating its interpretation. The knowledge gained from this process can then be used to build predictive models of biological systems. Thus, life sciences research needs information technologies, e.g., advanced algorithms, high performance computing, data and information management (including databases and data mining systems), and software to support communications and collaboration.

While some off-the-shelf solutions may exist for some aspects of these technologies, it is clear that new solutions are needed. This will necessitate making advances in the areas of algorithms, high performance computing, parallel computing, statistical pattern analysis, modeling, signal processing, data management, and data mining.

A computing facility (including computing cluster, storage facility, backup system and associated software) will facilitate the development, testing and deployment of these new computing technologies in support of research on a variety of life sciences problems. The facility will initially provide for research into new algorithms and methods for genomics (including analysis of microarray data), proteomics, molecular dynamics, molecular docking and analysis of magneto-encephalography (MEG) data. The facility will also support various public domain software and academic licensed software to serve the needs of our research and development community. The close collaboration between chemists, biologists, mathematicians, and computer scientists using this facility will not only result in the creation of new computing and information technologies but will also lead directly to a better understanding of biological systems.

Faculty Investigator(s): Victor Frost (PI), Xue-wen Chen, Terry Clark
Staff Investigator(s): Adam Hock, David Johnson
Student Investigator(s): Doug Herbers, Justin Ward
Primary Sponsor(s): NIH-Health Resources and Services Administration (HRSA) (of U.S. Dept. of Health and Human Services)


An Online Clearinghouse for Bioinformatics Software Sharing and Evaluation

Project Active: Start Date 2006-09-29, Projected End Date 2007-04-30

The registry will help prevent the duplication of comparable tools and publicize technical developments. It will collect user feedback and provide reliable cross-platform documentation.

After testing is complete, all K-INBRE (Kansas IDeA Network of Biomedical Research Excellence) participants and other Central States INBRE institutions will have access on a password-protected registration basis. A successful beta test would further broaden access to academic institutions on a national scale, subject to registration and verification of academic status. Developers of a given tool could remove access to that tool and any associated documentation in the event of an anticipated commercialization initiative. The Clearinghouse would be housed on an existing K-INBRE-funded server.

Faculty Investigator(s): Gerald Lushington (PI), Jianwen Fang
Primary Sponsor(s): University of Kansas Medical Center Research Institute


CAREER: Machine Learning Approaches for Genome-wide Biological Network Inference

Project Active: Start Date 2007-02-22, Projected End Date 2010-04-30

Because of technological limitations, molecular biology research has had to focus on individual genes and gene products. This has led to a wealth of knowledge about individual cellular components and their functions. Isolated cellular components are not sufficient to understand most cellular functions, which are carried out by complex networks. It is therefore imperative to employ network-based approaches to address the complexity of living systems.

Scientists in life-science research must find how to computationally model and elucidate complex networks from high-throughput biological data sets. Thus, this research focuses on developing and applying novel computational methods for reconstructing genome-wide biological networks from high-throughput data.

ITTC researchers will develop and apply novel computational approaches for uncovering networks of interactions between genes and proteins. They will conduct related educational activities in a newly established bioinformatics program in the Department of Electrical Engineering and Computer Science at the University of Kansas. A wide range of students, from high school through graduate school, will receive special training opportunities in the interdisciplinary area of bioinformatics.

ITTC Investigator Xue-wen Chen will develop machine learning methods for effectively integrating prior knowledge from multiple data sources, learning from highly heterogeneous data, and large-scale network learning. Learning with prior knowledge and learning from highly heterogeneous data sources are fundamental to computational biology, information theory, machine learning, data mining, and other areas. The research will produce new methods and user-friendly software for molecular biologists.

Faculty Investigator(s): Xue-wen Chen (PI)
Student Investigator(s): Mei Liu, Jong Jeong, Bing Han, Michael Wasikowski, Jae Kim, Alexander Senf, Matthew Mandelbaum, Patrick Dermyer
Primary Sponsor(s): NSF


CAREER: Mining Genome-wide Chemical-Structure Activity Relationships in Emergent Chemical Genomics Databases

Project Active: Start Date 2009-07-01, Projected End Date 2014-06-30

ITTC will develop an integrated research and education program for advancing the underlying theoretical and computational principles of data mining in the emergent chemical genomics databases. The core technical innovations are advances in (i) developing effective kernel-based representations and structure pattern extraction and selection methods to capture the intrinsic characteristics of irregular and discrete spaces such as the chemical space, (ii) designing methods for adaptive and scalable similarity search in large databases of complex data and methods for accurate classification model construction with imbalanced and out-of-domain data, and (iii) deriving application oriented validation.

A key strength of this work is the application of the theoretic and computational advancements to real-world problems, namely, chemical toxicity prediction based on microarray gene expression profiles and high-throughput chemical screening. By developing innovative tools for graphs and geometric structures, ITTC will enable much better techniques for searching, mining, and analyzing domains of complex data. The timely effort integrates and advances knowledge in three communities: cheminformatics, data mining, and machine learning.
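
As a rough illustration of similarity search over chemical structures, the sketch below ranks database compounds against a query by Tanimoto similarity, which is a valid kernel on sets. It is not the project's method: the fingerprint representation (a set of substructure labels), the compound names, and the functions `tanimoto` and `top_k` are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[str]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def top_k(query: Fingerprint, database: Dict[str, Fingerprint], k: int) -> List[Tuple[str, float]]:
    """Return the k most similar compounds, best first (ties broken by name)."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical fingerprints: each string stands for one substructure pattern.
database = {
    "compound_a": frozenset({"ring6", "C=O", "O-H"}),
    "compound_b": frozenset({"ring6", "N-H"}),
    "compound_c": frozenset({"C=O", "O-H", "C-Cl"}),
}
query = frozenset({"ring6", "C=O", "O-H", "N-H"})
hits = top_k(query, database, k=2)
print(hits)
```

A linear scan like this does not scale to large databases; the adaptive and scalable search described above would need indexing on top of the similarity function.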

Faculty Investigator(s): Jun Huan (PI)
Student Investigator(s): Brian Quanz
Primary Sponsor(s): National Science Foundation


Computational Prediction of Beta-Sheet Arrangement (K-INBRE)

Project Active: Start Date 2005-07-01, Projected End Date 2006-04-30

It is widely believed that protein misfolding into beta aggregates or fibrils is a significant contributor to the onset of Alzheimer’s, Parkinson’s, and other neurodegenerative diseases. Although knowledge of the mechanism for conformational change may be critical to control of these diseases, considerable uncertainty exists about the nature of aggregate formation and the nature of the fibrils. Computational prediction of the transformation may present a plausible approach to resolving some of the uncertainties. Use of long-range interactions in prediction of β-strand arrangement in the formation of β-sheets may well be an essential step to forecasting/determining the 3-D structure of proteins from amino acid sequences. Thus, a better understanding of β-strand arrangement in β-sheets will not only provide possible solutions to prevent intermolecular β-sheet formation associated with neurodegenerative diseases but may also contribute to the success of 3-D structure prediction.

Faculty Investigator(s): Jianwen Fang (PI)
Primary Sponsor(s): University of Kansas Medical Center Research Institute, Inc. (KUMCRI)


Computational Proteomics: Protein Interaction Prediction

Project Active: Start Date 2004-09-01, Projected End Date 2006-06-30

Proteins perform biological functions by interacting with other molecules. During a protein-protein interaction, the conserved domains physically interact with each other. Thus, understanding protein interactions at the domain level gives detailed functional insights into proteins that are either characterized or newly discovered. However, unlike protein-protein interactions, which can be discovered by some high-throughput technologies, domain-domain interactions largely remain unknown. This project addresses this issue by developing computational models to infer domain-domain interactions from protein-protein interactions; the models can then be used to validate and predict unknown protein interactions.

The long-term objectives of this research include better understanding protein functions based on their domain structures and predicting protein domains in terms of their functions. Specific aims include:

(1) The development of new computational models for inferring domain-domain interactions and for predicting protein-protein interactions. Kernel-based learning models will be developed to extract information from known protein-protein interactions, which is then used to infer the probabilities of domain-domain interactions. The newly developed computational models allow us to (i) predict the undiscovered protein-protein interactions, (ii) identify protein domains in terms of protein functions, and (iii) validate the newly discovered protein-protein interactions through biological experiments or other means.

(2) The development of an online system based on the computational models. This system will allow users to find the possible proteins that will interact with newly discovered proteins, validate protein-protein interactions, and identify protein domains.
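
The idea of inferring domain-domain interactions from known protein-protein interactions can be sketched with a simple frequency-based baseline. This is not the kernel-based model named in aim (1): it estimates each domain pair's interaction probability as the fraction of protein pairs containing it that are known to interact, then combines those estimates with a noisy-OR to score a new protein pair. The protein names, domain labels, and function names are hypothetical.

```python
from itertools import combinations, product
from typing import Dict, List, Set, Tuple

DomainPair = Tuple[str, ...]


def domain_pairs(doms_a: Set[str], doms_b: Set[str]) -> Set[DomainPair]:
    """All unordered domain pairs that span two proteins."""
    return {tuple(sorted(p)) for p in product(doms_a, doms_b)}


def estimate_ddi(proteins: Dict[str, Set[str]],
                 interactions: List[Tuple[str, str]]) -> Dict[DomainPair, float]:
    """Fraction of protein pairs containing a domain pair that interact."""
    interacting = {frozenset(p) for p in interactions}
    seen: Dict[DomainPair, int] = {}
    hits: Dict[DomainPair, int] = {}
    for a, b in combinations(sorted(proteins), 2):
        for pair in domain_pairs(proteins[a], proteins[b]):
            seen[pair] = seen.get(pair, 0) + 1
            if frozenset((a, b)) in interacting:
                hits[pair] = hits.get(pair, 0) + 1
    return {pair: hits.get(pair, 0) / n for pair, n in seen.items()}


def predict_ppi(doms_a: Set[str], doms_b: Set[str],
                ddi: Dict[DomainPair, float]) -> float:
    """Noisy-OR: two proteins interact if at least one domain pair does."""
    p_none = 1.0
    for pair in domain_pairs(doms_a, doms_b):
        p_none *= 1.0 - ddi.get(pair, 0.0)
    return 1.0 - p_none


# Hypothetical training data: protein -> domains, plus known interactions.
proteins = {"P1": {"SH3"}, "P2": {"PRM"}, "P3": {"SH3", "KIN"}, "P4": {"PRM"}}
interactions = [("P1", "P2"), ("P3", "P4")]
ddi = estimate_ddi(proteins, interactions)
print(predict_ppi({"SH3", "KIN"}, {"PRM"}, ddi))
```

An online system along the lines of aim (2) could expose `predict_ppi` for a newly discovered protein once its domains are known.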

In collaboration with Higuchi Biosciences Center

Faculty Investigator(s): Xue-wen Chen (PI)
Primary Sponsor(s): National Institutes of Health


Constructing Gene Networks from Microarray Data for Age-Dependent Epileptogenesis

Project Active: Start Date 2004-07-01, Projected End Date 2005-06-30

Epilepsy, characterized by the repetitive occurrence of seizures, currently afflicts approximately 4 percent of Americans of all backgrounds and ages. There are no current therapies available which can completely arrest the epileptic process in most individuals. In order to develop effective prevention and therapeutic intervention approaches, the molecular mechanisms of epilepsy must be identified. Bioinformatics approaches will unravel relationships among the specific genes and generate hypotheses on the molecular mechanisms of the epileptogenic process.

This project will establish collaborative research programs in neuroscience and bioinformatics, programs that integrate experimental biology with computation and modeling. As the research is interdisciplinary in nature, a collaborative team including scientists from the Department of Electrical Engineering and Computer Science (Dr. Xue-wen Chen) and the Center for Neurobiology and Immunology Research at HBC (Drs. Eli Michaelis and Xinkun Wang) has been formed. Dr. Michaelis and Dr. Wang will conduct neurobiological experiments and generate epileptogenic microarray data, while Dr. Chen will develop new computational models to analyze the generated microarray data, and construct gene networks from them. Researchers will build collaborations between the two disciplines, neuroscience and bioinformatics. The problem-solving skills of computer scientists are expected to complement and enhance the discovery orientation of scientists in the biosciences field. This interdisciplinary collaboration will increase the likelihood of finding potential gene targets for clinical epilepsy prevention and intervention.
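
The project's own computational models are not described here. As a generic baseline only, the sketch below shows one common way to construct a gene network from microarray data: connect two genes when the Pearson correlation of their expression profiles is strong. The gene names, expression values, and threshold are invented for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Set, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(vx * vy)


def coexpression_network(expr: Dict[str, List[float]],
                         threshold: float) -> Set[Tuple[str, str]]:
    """Edges between genes whose absolute correlation meets the threshold."""
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


# Hypothetical expression levels across four microarray samples.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [1.0, 3.0, 1.0, 3.0],
}
network = coexpression_network(expression, threshold=0.9)
print(sorted(network))
```

Correlation captures association rather than regulation, so edges found this way are hypotheses for follow-up experiments, not established mechanisms.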

Faculty Investigator(s): Xue-wen Chen (PI)
Primary Sponsor(s): Center of Biomedical Research Excellence (COBRE)-NIH


Development of an Integrated Bioinformatics Information Infrastructure

Project Active: Start Date 2004-10-13, Projected End Date 2006-09-29

The Army's chemical and biological defense research and development interests reflect numerous activities that should benefit significantly from the increased facility of data flow and hypothesis testing that arises from an enhanced informatics infrastructure. Chemical and biological defense research is multifaceted, involving issues from the sub-cellular level through the ecological and geographic dynamics of a disease. The same is true of current life science research activities at the University of Kansas. Given the related nature of several of the KU efforts and those under way within Edgewood Chemical Biological Center (ECBC), it is logical to expect that the bioinformatics infrastructure to be developed under this effort at KU, in conjunction with local research activities, should be relevant to, and readily extensible to, the information management needs within the ECBC.

ITTC researchers will develop a system architecture to provide the computing, data storage, and networking capabilities that will compose a bioinformatics infrastructure to facilitate multi-faceted bioresearch efforts. The system architecture will support large-scale processing of heterogeneous data from diverse sources as well as sophisticated algorithms to extract meaningful information and suggest new experiments. This architecture must support the way biologists work, e.g., revising and redesigning experiments based on results from previous experiments. Providing feedback to earlier stages of an experiment based on downstream data is important to improving the efficiency of biological investigations. Thus, work flow, information retrieval, data storage, processing, and networking issues will factor into the design. The system design will specify suitable data and compute servers. Collaborative environments will also be a component of the overall system architecture. The network to support the system will be specified. Collaborations and sharing of knowledge across all aspects of multi-faceted bioresearch endeavors will be greatly enhanced through the systematic design of the supporting information and computing systems.

Faculty Investigator(s): Victor Frost (PI), Terry Clark, Susan Gauch, Gary Minden
Student Investigator(s): Alexander Garrett, Lance Feagan, Justin Rohrer, Jesse Stanley, Keith Preston, Doug Herbers, Andrew Ozor, Justin Ward, Heather Amthauer
Primary Sponsor(s): U.S. Army


First Award: Rapid Integration of Genomic Data from Multiple Sources

Project Active: Start Date 2005-03-21, Projected End Date 2006-05-31

The research will automate data integration and schema extensions toward intuitive and flexible interfaces to object-oriented databases for biologists, expert and non-expert users, and software systems. The target application scenario involves large collections of primary genomic data stored in an object-oriented genomic database. Here users are interested in integrating data and schemas from external sources with a comprehensive warehouse. In this work, XML is chosen as the input format for data; the target genomic data warehouse is the public domain Genomic Unified Schema, GUS. A framework is designed and developed to admit new data types (schema) and dynamically incorporate them through the database object layer using an automatically generated interface. This interface will automatically generate mappings between input data and data warehouse objects from compliant (based on a current prototype) and schema definitions. Input data may conform to the target GUS schema, or to new schemas. The proposed functionality will extend the XMLGUS data loading system developed by the PI. This interface has proven successful in application settings, yet it is tedious to generate the interface grammar manually. Thus, toward addressing and managing the complexity of GUS, a part of the three-year proposed work extends the XMLGUS framework to generate the variable components automatically.
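
To make the idea of automatically generated mappings concrete, the sketch below derives record classes from a schema definition and uses the same definition to load XML input into objects. It does not use or reproduce the GUS or XMLGUS interfaces; the `SCHEMA` layout, the `Gene` record, its fields, and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import make_dataclass

# Hypothetical schema definition: record tag -> {field name: Python type}.
SCHEMA = {
    "Gene": {"symbol": str, "chromosome": str, "start": int, "end": int},
}


def build_classes(schema):
    """Generate one dataclass per record type declared in the schema."""
    return {
        tag: make_dataclass(tag, list(fields.items()))
        for tag, fields in schema.items()
    }


def load_records(xml_text, schema):
    """Map each known XML record onto an object of its generated class."""
    classes = build_classes(schema)
    records = []
    for element in ET.fromstring(xml_text.strip()):
        if element.tag not in schema:
            continue  # record types the schema does not describe are skipped
        values = {}
        for name, typ in schema[element.tag].items():
            text = element.findtext(name)
            if text is None:
                raise ValueError(f"{element.tag} is missing field {name!r}")
            values[name] = typ(text)  # convert text to the declared type
        records.append(classes[element.tag](**values))
    return records


# Made-up input document for illustration.
SAMPLE = """
<records>
  <Gene><symbol>geneX</symbol><chromosome>7</chromosome><start>100</start><end>250</end></Gene>
  <Note>not part of the schema</Note>
</records>
"""
records = load_records(SAMPLE, SCHEMA)
print(records)
```

Adding a new data type then means adding an entry to the schema definition rather than writing a new loader by hand, which is the manual step the project aims to remove.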

Faculty Investigator(s): Terry Clark (PI)
Student Investigator(s): Krishna Kotcherlakota, Yi Jia
Primary Sponsor(s): NSF & KTEC


K-INBRE Cellular Pathogen Gene Identification via Graph Data Mining

Project Active: Start Date 2007-06-27, Projected End Date 2008-04-30

Genomics efforts continue to yield a myriad of new protein sequences. They offer unprecedented opportunities for knowledge-based sequence annotations that aim to automatically transfer experimentally gained biological knowledge from model organisms to newly sequenced genomes to expedite biological discovery. Applying rigorous data mining methods to large, sequentially diverse, and clinically important protein families, like the immunologic proteins, can yield reliable, intuitively predictive models readily extensible to annotating novel sequences. This would enable rational experimental design that may lead to improved medicine against refractory pathogens. Specifically, for characterizing and annotating immunological proteins, we plan to devise, refine, and disseminate statistical geometric analysis methods. These will include rigorous protein structure representation using geometric graphs, identifying conserved substructure patterns in protein structures based on graph database mining, mapping structure patterns to sequence motifs, and annotating genes using the obtained sequence motifs with advanced statistical learning methods such as support vector machines.

Our choice of foci herein is based on strong preliminary results in each of the above objectives, including (1) the development of Delaunay Tessellation and almost-Delaunay Tessellation for statistical geometric analysis of protein structures, (2) the development of a state-of-the-art subgraph mining algorithm that retrieves recurring subgraphs in a group of graph-represented 3D protein structures, and (3) the application of machine learning techniques, such as support vector machines, to build annotation models with high specificity and sensitivity.
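
The pipeline of representing structures as geometric graphs and then mining recurring patterns can be sketched in a much reduced form. Two simplifications are assumed here: a distance cutoff stands in for Delaunay tessellation, and the only "subgraphs" considered are single labeled edges (pairs of residue types in contact). The coordinates, residue labels, and cutoff are invented.

```python
from collections import Counter
from itertools import combinations
from math import dist
from typing import List, Set, Tuple

Residue = Tuple[str, Tuple[float, float, float]]  # (residue type, position)
Pattern = Tuple[str, ...]


def contact_patterns(residues: List[Residue], cutoff: float) -> Set[Pattern]:
    """Residue-type pairs that lie within `cutoff` of each other."""
    patterns = set()
    for (label_a, pos_a), (label_b, pos_b) in combinations(residues, 2):
        if dist(pos_a, pos_b) <= cutoff:
            patterns.add(tuple(sorted((label_a, label_b))))
    return patterns


def recurring_patterns(structures: List[List[Residue]], cutoff: float,
                       min_support: int) -> Set[Pattern]:
    """Patterns present in at least `min_support` structures."""
    support = Counter()
    for residues in structures:
        support.update(contact_patterns(residues, cutoff))  # once per structure
    return {p for p, n in support.items() if n >= min_support}


# Three hypothetical structures, one representative point per residue.
structures = [
    [("HIS", (0, 0, 0)), ("ASP", (3, 0, 0)), ("SER", (0, 4, 0)), ("GLY", (20, 0, 0))],
    [("HIS", (1, 1, 1)), ("ASP", (1, 1, 4)), ("SER", (10, 10, 10))],
    [("HIS", (0, 0, 0)), ("SER", (0, 0, 2)), ("ASP", (0, 0, 9))],
]
conserved = recurring_patterns(structures, cutoff=6.0, min_support=2)
print(sorted(conserved))
```

Real subgraph mining grows such edges into larger connected patterns and must handle graph isomorphism, which this sketch deliberately avoids.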

Faculty Investigator(s): Jun Huan (PI)
Student Investigator(s): Lin Yi, Vincent Buhr, Yi Jia, Jae Kim, Xiaotong (Cindy) Lin
Primary Sponsor(s): KUMCRI (flow-through from NIH)


K-INBRE: Complete, Upgrade and Enhance Data Handling in the Analytical Proteomics Laboratory

Project Active: Start Date 2007-06-26, Projected End Date 2008-04-30

Researchers will refine preliminary software designed to generate statistically justifiable and robust protein identifications, especially for KU investigators looking at targeted proteomes of hundreds to thousands of proteins (e.g., mitochondrial, lipid raft, liver microsome, and protein-protein interaction pull-down proteomes).
-- Current research will be used to validate peptide hits by the calculation of parameters matching the training set. This will improve robustness of PMF from 4700 data.
-- Build the "Inverse" database for the determination of false hits and thus the reliability of searches. This project will formalize the use of SeQuest, along with Mascot and X!Tandem, to obtain Scaffold-processed valid hits.
-- Adapt 4700 ms/ms data to scaffold data stream.
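
A minimal sketch of how an "Inverse" database can be used to gauge false hits is shown below. It assumes that "inverse" means reversed protein sequences (a common decoy construction) and that the false discovery rate is estimated as decoy hits divided by target hits above a score cutoff, which is one of several estimates in use. The sequences, scores, and the `DECOY_` prefix are made up, and nothing here reflects the actual output of SeQuest, Mascot, X!Tandem, or Scaffold.

```python
from typing import Dict, List, Tuple

DECOY_PREFIX = "DECOY_"  # hypothetical naming convention for decoy entries


def make_decoys(targets: Dict[str, str]) -> Dict[str, str]:
    """Build decoy entries by reversing each target protein sequence."""
    return {DECOY_PREFIX + name: seq[::-1] for name, seq in targets.items()}


def estimate_fdr(hits: List[Tuple[str, float]], score_cutoff: float) -> float:
    """Estimate the false discovery rate as decoy hits / target hits."""
    accepted = [name for name, score in hits if score >= score_cutoff]
    decoys = sum(name.startswith(DECOY_PREFIX) for name in accepted)
    targets = len(accepted) - decoys
    if targets == 0:
        return 1.0 if decoys else 0.0
    return min(1.0, decoys / targets)


# Made-up sequences and search scores for illustration only.
target_db = {"prot1": "MKTAYIAK", "prot2": "GSHMLE"}
decoy_db = make_decoys(target_db)
search_hits = [
    ("prot1", 95.0), ("prot2", 80.0), (DECOY_PREFIX + "prot1", 62.0),
    ("prot3", 60.0), ("prot4", 55.0), (DECOY_PREFIX + "prot2", 30.0),
]
print(estimate_fdr(search_hits, score_cutoff=50.0))
```

Raising the cutoff trades sensitivity for reliability: in this toy data, a cutoff of 70 removes every decoy hit but also discards two target hits.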

Protein Mapping
Propose a variety of refinement tasks including:
-- Improve custom database generation to allow search engines to assign amino acid substitutions, PTMs or non-specific cleavage assignments to the unassigned spectra in a data set. Included would be a systematic way to identify and exclude contamination proteins.
-- Adapt MSQuant software or develop custom tools with the primary MS data system to process LC/MS 1 data for quantitation by SILAC or peptide/protein internal standard introduction.
-- Continue "processing of high performance" MS 1 data for improved protein coverage and ill of trace components. Project findings were reported in a 2006 ASMS poster.

Signal Processing
Propose a variety of refinement tasks including:
-- Develop a search filter to exclude MS/MS data that are not peptidic (a set of criteria has been published), including developing a library of contaminating spectra (not proteins; e.g., detergents, phthalates, buffer and salt clusters).
-- Adapt an existing filter routine for MS 1 data that falls outside of the required "mass defect space" (the subject of a recent JASMS publication). The resulting routine would be applicable to both MALDI and high resolution ESI data.

Faculty Investigator(s): Gerald Lushington (PI), Jianwen Fang


K-INBRE: Web Server Tracker, an Automated Literature, Protein/DNA Sequence and Domain Tracking System

Project Active: Start Date 2007-06-26, Projected End Date 2008-04-30

Tracker is a widely used automatic literature and protein/DNA sequence and domain tracking system developed at KU under the K-BRIN program. ITTC investigators are procuring a powerful web application server to replace an old server that currently runs Tracker. Researchers also plan to update the application to meet new needs as specified by users.

Faculty Investigator(s): Jianwen Fang (PI), Gerald Lushington
Student Investigator(s): Brian Quanz, Raymond Anderson
Primary Sponsor(s): KUMCRI (flow-through from NIH)


Unified Data Format for Mass Spectrometry Analysis (UDF)

Project Active: Start Date 2005-01-13, Projected End Date 2005-06-30

Despite the similarity of information content across a wide variety of vendor-specific mass spectrometry formats (i.e., the pervasive mass/charge ratio), tools developed to process the data coming from one instrument are rarely capable of processing data derived from another platform. There is a great desire to be able to do so, since specific analysis options available on one platform are often of value to (and unavailable to) data arising from another. This communications issue can be largely overcome by a) constructing a set of conversion routines to deposit all data (except that arising from a small number of vendors that contractually forbid format reverse-engineering) into a consistent and unified format, and b) developing commensurate routines for back-converting from this unified format to vendor-specific structures.

University of Kansas researchers are developing a suite of data conversion and compression routines capable of generating compact repositories of mass spectrometry data in a unified format suitable for efficient analysis and rapid reconstitution to vendor-specific form. Such a suite will permit the data from multiple mass spectrometric platforms to be amalgamated into a single, homogeneous form suitable for collective analysis. It will also make data accessible to interrogation by a much broader slate of vendor-supplied analysis tools than is currently the case. By storing the unified format in a highly compressed form with efficient compression/extraction schemes, researchers hope to help establish a sensible standard for data-intensive endeavors such as proteomics and lipidomics. Having established such a standard, researchers envision that niche analysis methods (e.g., isotope signature identification, contaminant elimination, recognition of post-translational modifications, etc.) that may not currently be available within the slate of vendor-provided tools can be programmed to operate efficiently on the standardized compressed data medium.
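
The round trip between peak data and a compact unified representation might look like the sketch below. The byte layout is an assumption made for this example and is not the UDF specification: a hypothetical four-byte signature, a peak count, and a zlib-compressed body of (m/z, intensity) pairs stored as little-endian doubles.

```python
import struct
import zlib
from typing import List, Tuple

MAGIC = b"UDF0"        # hypothetical file signature, not a published format
PEAK = struct.Struct("<dd")  # one peak: m/z and intensity as 64-bit floats

Peaks = List[Tuple[float, float]]


def pack_spectrum(peaks: Peaks) -> bytes:
    """Serialize peaks as signature + count + compressed peak table."""
    body = b"".join(PEAK.pack(mz, intensity) for mz, intensity in peaks)
    return MAGIC + struct.pack("<I", len(peaks)) + zlib.compress(body)


def unpack_spectrum(blob: bytes) -> Peaks:
    """Restore the peak list written by pack_spectrum."""
    if blob[:4] != MAGIC:
        raise ValueError("not a unified-format spectrum")
    (count,) = struct.unpack("<I", blob[4:8])
    body = zlib.decompress(blob[8:])
    if len(body) != count * PEAK.size:
        raise ValueError("peak table length does not match header")
    return [PEAK.unpack_from(body, i * PEAK.size) for i in range(count)]


# Made-up peak list for illustration.
spectrum = [(100.05, 1200.0), (250.125, 830.5), (500.3, 15.0)]
blob = pack_spectrum(spectrum)
restored = unpack_spectrum(blob)
print(len(blob), restored == spectrum)
```

Because doubles are stored exactly, the round trip is lossless; a vendor-specific reader or writer would sit on either side of these two functions.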

Faculty Investigator(s): John Gauch
Student Investigator(s): Praveen Lakkaraju
Primary Sponsor(s): Kansas IDeA Network of Biomedical Research Excellence (KINBRE)-NIH


First Award: Identify Informative Genes for Cancer Classification

Project Expired On: 2005-06-30

With the completion of the Human Genome Project and the advance of microarray technologies, it is now possible to explore the whole genome both systematically and comprehensively. Microarrays have been extensively used for screening gene expression and for uncovering important clues to the role of genes and the underlying gene regulatory networks. Use of microarrays is rapidly generating large amounts of data (typically terabytes) that create both opportunities and challenging problems. Conventional methods are increasingly unable to deal with the huge amount of data. For example, when applied to cancer classification, microarray data overwhelm conventional machine learning algorithms because the number of samples is much smaller than the number of features (genes). A major challenge is the identification of informative genes for cancer classification from gene expression measurements. In fact, it has been demonstrated that only a small number of genes are relevant to a specific cancer classification problem. Identifying these relevant genes is important in numerous microarray-based applications such as drug discovery, early disease detection, and proper treatment guidance.

This project addresses the problem of identifying informative genes for cancer classification. The main objective of this work is to perform a preliminary investigation of a new margin- and genetic-algorithm-based feature-selection algorithm. In addition, we will conduct a comprehensive comparative study of several fundamental gene selection algorithms in microarray-based cancer classification problems to assess their performance on different data sets on an equal footing.
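
For orientation, the sketch below shows one classic filter-style gene selection criterion of the kind such a comparative study might include: ranking genes by the signal-to-noise ratio between two classes. It is not the margin- and genetic-algorithm-based method proposed here, and the gene names, expression values, and labels are invented.

```python
from math import copysign, inf
from statistics import mean, stdev
from typing import Dict, List


def signal_to_noise(class_a: List[float], class_b: List[float]) -> float:
    """(mean_a - mean_b) / (std_a + std_b) for one gene."""
    diff = mean(class_a) - mean(class_b)
    spread = stdev(class_a) + stdev(class_b)
    if spread == 0:
        # No within-class variation: perfectly separating or uninformative.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / spread


def rank_genes(expr: Dict[str, List[float]], labels: List[int],
               top_n: int) -> List[str]:
    """Return the top_n genes by absolute signal-to-noise ratio."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(signal_to_noise(a, b))
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_n]


# Hypothetical data: three tumor samples (1) and three normal samples (0).
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_1": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],
    "gene_2": [5.0, 6.0, 7.0, 5.0, 6.0, 7.0],
    "gene_3": [4.0, 6.0, 8.0, 2.0, 4.0, 6.0],
}
selected = rank_genes(expression, labels, top_n=2)
print(selected)
```

Filter criteria like this score each gene independently, so they can miss genes that are informative only in combination, which is one motivation for search-based selection methods.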

The intellectual merit of this project will include major progress in the informative gene identification problem that is a primary challenge in microarray data analysis, better understanding of feature selection algorithms in small sample problems, and potential solutions to choosing suitable gene selection algorithms for given problems. The new gene selection algorithm is expected to perform equally well on both training and test data in classification problems. When combined with support vector machines, the new algorithm will be able to predict the data that are unseen during training, even for small training samples.

The broader impacts resulting from project activities include a robust method to extract information from large datasets; the potential integration of the small number of identified genes into the cancer diagnosis process; applications to gene function discovery; and the integration of research activities into a new bioinformatics course, Machine Learning with Life Science Applications. The class will be offered to graduate and senior undergraduate students in EECS and to students in other departments, such as Biology, who are interested in bioinformatics.

Faculty Investigator(s): Xue-wen Chen (PI)
Student Investigator(s): Mei Liu, Manjunath Narayana
Primary Sponsor(s): NSF and KTEC