Research Statement                                                   Debra S. Goldberg, Ph.D.

I began my scientific career as a biologist, became a computer scientist, and am now active in the field of computational biology, which brings my interests together.  A common theme of my work at TRW, Cornell, and Harvard Medical School was the development of combinatorial algorithms to find underlying patterns in noisy, error-prone data.  There will be a continuing need for efficient algorithms to find and exploit patterns in biological data.

I have supervisory experience in both industry and academia, and I have the skills to initiate and guide collaborations with colleagues and students.  As a graduate student, I brought together Ph.D. advisors from the departments of computer science (Jon Kleinberg) and plant breeding (Susan McCouch).  While they did not know each other previously, we took time for each to better understand the other.s field, and this resulted in a project truly interesting to all. 

My biological background enables me to collaborate effectively with biologists, thus ensuring that my work is meaningful and relevant.  Mathematicians (computer scientists) and biologists often think about problems in different ways.  While some may find this an impediment to collaboration, I find that they complement one another, and I embrace diverse styles of attacking a problem.  This respect for others also allows me to facilitate communication between people with different backgrounds.

Summary

A theme in my work has been the development of combinatorial and graph algorithms to solve problems in genomics.  I have a particular interest in helping biologists determine the function of uncharacterized genes or proteins.  In the past, I worked in comparative genomics and genomic network analysis, and experimentalist collaborators validated some of my predictions.  In the future, I plan to integrate more sources of data to improve protein function prediction.  I will continue to collaborate with biologists for problem definition and during application development to ensure my work is relevant, and for experimental validation to ensure my work has practical value.

Background 

Genomic networks.  Much genomic and proteomic data is naturally suited to description by a graph (called a network in many disciplines), with nodes representing genes or gene products (proteins).  Many of the high-throughput experimental methods used to determine genome-wide data on genes or proteins are both expensive and error-prone.  However, biologically-relevant patterns can still be discerned in the global network topology.  Studies have shown that diverse networks have similar structures with respect to several large-scale topological attributes.  For example, many biological networks have been reported to be small-world (short average path length yet highly clustered) and scale-free (degree distribution follows a power law), and we can exploit these properties to make better use of the data.

Function prediction for uncharacterized proteins.  Determining the function of genes and proteins is the single defining challenge of the emerging field of functional genomics.  Understanding how proteins function is critical for finding new drug targets, and is essential if we hope to gain a systems-level understanding of any organism.  Computational methods that can find the protein function .signals. in noisy data are becoming increasingly critical to guide experimental research.  By integrating distinct data types and making use of topological patterns, we can have more confidence in the results produced than in the individual input data. 

Past Work

Comparative genome maps (doctoral thesis).  Comparative genome maps are a powerful tool for gaining insights about the genomes of related organisms, such as genome evolution and gene function determination. The underlying maps for individual species are often themselves constructed from incomplete or inconsistent data.  We represent a chromosome as a sequence of genes, or a linear graph, with edges indicating adjacency on the chromosome.  The construction of comparative maps was formalized, providing a rigorous framework upon which mathematical models can be built.  Broadly applicable models were then formulated that emphasize the global similarities important in tracing chromosome evolution and de-emphasize lower-level differences.  From this we developed dynamic programming algorithms, implemented in the open source software .DeCAL. (Detecting Common Ancestral Linkage-segments) which constructs comparative maps within a few minutes on a standard UNIX workstation.  The utility of these methods was demonstrated on diverse pairs of species, and on species maps with widely differing resolutions. 

Exploiting network topology to assess protein interactions (post‑doctoral work).  The quality of function predictions depends on the quality of the experimental data, and thus high-confidence results should be given greater weight than low-confidence results.  We showed that in the yeast S. cerevisiae protein interaction network, true interactions create a small-world network while false interactions do not.  To exploit the small-world properties we developed a measure of local neighborhood cohesiveness around potentially interacting protein pairs.  Using this measure, we were able to accurately stratify the reliability of protein-protein interactions observed in error-prone yeast two-hybrid studies and predict previously unknown interactions.  We were also able to use this measure to assess and predict genetic interactions in yeast and protein interactions in the worm C. elegans.

Integrated early embryogenesis network for C. elegans (post‑doctoral work).  In an interdisciplinary collaboration, we generated a network of functional relationships for early embryogenesis in worm (the events between fertilization and the four-cell stage).  We combined protein interaction and co-expression data with phenotype similarity based on one of the first large-scale studies of RNAi phenotypes affecting early embryogenesis.  From this network, we predicted gene pairs functioning in a common pathway or complex during embryogenesis.  We also predicted additional genes involved in early embryogenesis based on the integrated data and topological properties of the network.  Our results were validated computationally, and we have experimentalist collaborators who will be testing our predictions using localization studies.

Current Work

Shared protein function network for S. cerevisiae (post‑doctoral work).  In current work, we are integrating graph-theoretic algorithms with machine learning methods to improve protein function prediction.  Our approach is comprised of two steps.  In step one, we use decision trees to estimate the probability that each protein pair shares a function.  Decision trees were determined to give results as good as more complex machine learning approaches, and provide biologists with insights into attributes important for protein function prediction.  We have included many types of data in the construction of these decision trees, including sequence similarity (homology), protein interactions, genetic interactions, gene co-expression, and various topological measures.  In step two, we use these probabilities to construct a weighted graph of proteins.  Two proteins are connected by an edge if their probability of sharing a function exceeds a low threshold, and the edge is weighted by this probability.  Then, the probability an uncharacterized protein has a particular function will be computed as a measure of how well-connected it is to proteins known to have the function.  This formulation of the two-terminal network reliability problem, which is known to be #P‑complete, will be estimated using a Monte Carlo method.  By integrating diverse functional genomic data while considering network topology, and assessing the confidence of our results, we expect to significantly improve function prediction for uncharacterized proteins.

Revealing complexity in network topology (post-doctoral work).  One global network property.the degree distribution.provides fundamental insights into processes as diverse as genome evolution, the attack tolerance of an electrical power grid, and disease epidemics.  In particular, a power law (.scale-free.) degree distribution often has distinct practical implications.  We systematically fit various degree models to 63 networks of 10 different data types.  Different models were compared using maximum likelihood fitting and the Bayesian Information Criterion (to account for model complexity).  We considered four basic models from which the degree distribution can be drawn.power-law (scale-free), exponential, Poisson, and truncated power-law.and complex models that allow two underlying link types.  We showed that many networks in diverse fields previously reported to be scale-free or follow another simple model have a more complex structure than previously recognized.  Our NetClass (Network Classification) software is available on the web.

We plan to fit models of network evolution to observed degree distributions.  Several recent studies have also shown that the method of sampling can dramatically change the observed degree distribution, so we will incorporate sampling when analyzing networks that are not completely determined.  If we can more accurately determine the form of the degree distribution of a network, we may better understand the evolutionary processes that created it, and may be able to assess the error rate of the experimental data from only the structure of the data.

Future Work

In order to keep computational biology relevant, it must be tightly coupled with experimental work.  For this reason, my experimentalist collaborators will influence much of my future work. 

Discovering the function of uncharacterized genes will long remain a central goal in biomedical research.  I will work to find new ways to use patterns in error-prone data to improve function prediction and guide experimental functional studies.  I will ensure that my new methods are made practically useful and readily available to experimentalists.  This involves several key factors that are often overlooked in the rush to publish new methods.  For example, the software must be robust to different input formats and errors, and should have an intuitive user interface. 

Predict function of uncharacterized genes.  There is still much to be done in this area.  I believe partitioning the shared function graph will create subgraphs that are likely to be more uniform in protein function.  I would like to develop new topological measures that consider the specificity of a node.s edges to a subcategory of nodes (e.g., proteins known to have a particular function).  I will continue to work with experimental collaborators to test some of my predictions using phenotype or localization studies.

Improved automated ortholog assignment.  A homolog pair is a pair of genes descended from a common ancestral gene.  An ortholog pair is a homolog pair that diverged via a speciation event (as opposed to gene duplication within a species).  Orthologs often perform the same function in each organism, so assessment of orthology is important to function prediction.  Orthology is frequently assessed manually, or simply assigned for genes that are each other.s closest match in the genome (the bi‑directional best hit criterion).  I propose to improve automated assessment of orthology using a graph-theoretic approach.  I will use many data networks from each species, including protein interactions, genetic interactions, the regulatory network, gene co-expression, and chromosome maps.  Gene nodes of different species will be connected via edges weighted by a measure of homology (similar DNA or amino acid sequence).  I will develop a score to measure the similarity of network neighborhoods.  Synteny (common ancestral chromosome segments) will be assessed either using this score to assess the similarity of genomic neighborhoods or using results from DeCAL.