Research
Statement Debra S. Goldberg, Ph.D.
I began my scientific career as a biologist, became a computer
scientist, and am now active in the field of computational biology, which
brings my interests together. A common
theme of my work at TRW, Cornell, and
I have
supervisory experience in both industry and academia, and I have the skills to
initiate and guide collaborations with colleagues and students. As a graduate student, I brought together
Ph.D. advisors from the departments of computer science (Jon Kleinberg) and
plant breeding (Susan McCouch). While
they did not know each other previously, we took time for each to better
understand the other.s field, and this resulted in a project truly interesting
to all.
My
biological background enables me to collaborate effectively with biologists,
thus ensuring that my work is meaningful and relevant. Mathematicians (computer scientists) and
biologists often think about problems in different ways. While some may find this an impediment to
collaboration, I find that they complement one another, and I embrace diverse
styles of attacking a problem. This
respect for others also allows me to facilitate communication between people
with different backgrounds.
A theme in
my work has been the development of combinatorial and graph algorithms to solve
problems in genomics. I have a
particular interest in helping biologists determine the function of
uncharacterized genes or proteins. In
the past, I worked in comparative genomics and genomic network analysis, and
experimentalist collaborators validated some of my predictions. In the future, I plan to integrate more
sources of data to improve protein function prediction. I will continue to collaborate with
biologists for problem definition and during application development to ensure
my work is relevant, and for experimental validation to ensure my work has
practical value.
Genomic networks. Much genomic and proteomic data is naturally
suited to description by a graph (called a network
in many disciplines), with nodes representing genes or gene products
(proteins). Many of the high-throughput
experimental methods used to determine genome-wide data on genes or proteins
are both expensive and error-prone.
However, biologically-relevant patterns can still be discerned in the
global network topology. Studies have
shown that diverse networks have similar structures with respect to several
large-scale topological attributes. For
example, many biological networks have been reported to be small-world (short
average path length yet highly clustered) and scale-free (degree distribution
follows a power law), and we can exploit these properties to make better use of
the data.
Function prediction for uncharacterized
proteins. Determining the
function of genes and proteins is the single defining challenge of the emerging
field of functional genomics.
Understanding how proteins function is critical for finding new drug
targets, and is essential if we hope to gain a systems-level understanding of
any organism. Computational methods that
can find the protein function .signals. in noisy data are becoming increasingly
critical to guide experimental research.
By integrating distinct data types and making use of topological
patterns, we can have more confidence in the results produced than in the
individual input data.
Comparative genome maps (doctoral
thesis). Comparative genome maps are a powerful tool for gaining insights about
the genomes of related organisms, such as genome evolution and gene function
determination. The underlying maps for individual species are often themselves
constructed from incomplete or inconsistent data. We represent a chromosome as a sequence of
genes, or a linear graph, with edges indicating adjacency on the
chromosome. The construction of comparative
maps was formalized, providing a rigorous framework upon which mathematical
models can be built. Broadly applicable
models were then formulated that emphasize the global similarities important in
tracing chromosome evolution and de-emphasize lower-level differences. From this we developed dynamic programming
algorithms, implemented in the open source software .DeCAL. (Detecting Common
Ancestral Linkage-segments) which constructs comparative maps within a few
minutes on a standard UNIX workstation.
The utility of these methods was demonstrated on diverse pairs of
species, and on species maps with widely differing resolutions.
Exploiting network topology to assess protein
interactions (post‑doctoral work). The quality of function predictions depends on the quality of the
experimental data, and thus high-confidence results should be given greater
weight than low-confidence results. We
showed that in the yeast S. cerevisiae
protein interaction network, true interactions create a small-world network
while false interactions do not. To
exploit the small-world properties we developed a measure of local neighborhood
cohesiveness around potentially interacting protein pairs. Using this measure, we were able to
accurately stratify the reliability of protein-protein interactions observed in
error-prone yeast two-hybrid studies and predict previously unknown
interactions. We were also able to use
this measure to assess and predict genetic interactions in yeast and protein
interactions in the worm C. elegans.
Integrated early embryogenesis network for C. elegans (post‑doctoral work). In an interdisciplinary collaboration, we generated a network of
functional relationships for early embryogenesis in worm (the events between
fertilization and the four-cell stage).
We combined protein interaction and co-expression data with phenotype
similarity based on one of the first large-scale studies of RNAi phenotypes
affecting early embryogenesis. From this
network, we predicted gene pairs functioning in a common pathway or complex
during embryogenesis. We also predicted
additional genes involved in early embryogenesis based on the integrated data
and topological properties of the network.
Our results were validated computationally, and we have experimentalist
collaborators who will be testing our predictions using localization studies.
Shared protein function network for S. cerevisiae (post‑doctoral work). In current work, we are integrating graph-theoretic algorithms with
machine learning methods to improve protein function prediction. Our approach is comprised of two steps. In step one, we use decision trees to
estimate the probability that each protein pair shares a function. Decision trees were determined to give
results as good as more complex machine learning approaches, and provide
biologists with insights into attributes important for protein function
prediction. We have included many types
of data in the construction of these decision trees, including sequence
similarity (homology), protein interactions, genetic interactions, gene
co-expression, and various topological measures. In step two, we use these probabilities to
construct a weighted graph of proteins.
Two proteins are connected by an edge if their probability of sharing a
function exceeds a low threshold, and the edge is weighted by this
probability. Then, the probability an
uncharacterized protein has a particular function will be computed as a measure
of how well-connected it is to proteins known to have the function. This formulation of the two-terminal network
reliability problem, which is known to be #P‑complete, will be estimated
using a
Revealing complexity in network topology
(post-doctoral work). One
global network property.the degree distribution.provides fundamental insights
into processes as diverse as genome evolution, the attack tolerance of an
electrical power grid, and disease epidemics.
In particular, a power law (.scale-free.) degree distribution often has
distinct practical implications. We
systematically fit various degree models to 63 networks of 10 different data
types. Different models were compared
using maximum likelihood fitting and the Bayesian Information Criterion (to
account for model complexity). We
considered four basic models from which the degree distribution can be
drawn.power-law (scale-free), exponential, Poisson, and truncated power-law.and
complex models that allow two underlying link types. We showed that many networks in diverse
fields previously reported to be scale-free or follow another simple model have
a more complex structure than previously recognized. Our NetClass (Network Classification)
software is available on the web.
We
plan to fit models of network evolution to observed degree distributions. Several recent studies have also shown that
the method of sampling can dramatically change the observed degree
distribution, so we will incorporate sampling when analyzing networks that are
not completely determined. If we can
more accurately determine the form of the degree distribution of a network, we
may better understand the evolutionary processes that created it, and may be
able to assess the error rate of the experimental data from only the structure
of the data.
In
order to keep computational biology relevant, it must be tightly coupled with
experimental work. For this reason, my
experimentalist collaborators will influence much of my future work.
Discovering the function of uncharacterized
genes will long remain a central goal in biomedical research. I will work to find new ways to use patterns
in error-prone data to improve function prediction and guide experimental
functional studies. I will ensure that
my new methods are made practically useful and readily available to
experimentalists. This involves several
key factors that are often overlooked in the rush to publish new methods. For example, the software must be robust to
different input formats and errors, and should have an intuitive user
interface.
Predict function of uncharacterized genes. There is still much to be done in
this area. I believe partitioning the
shared function graph will create subgraphs that are likely to be more uniform
in protein function. I would like to
develop new topological measures that consider the specificity of a node.s
edges to a subcategory of nodes (e.g., proteins known to have a particular
function). I will continue to work with
experimental collaborators to test some of my predictions using phenotype or
localization studies.
Improved automated ortholog assignment. A homolog pair is a pair of genes descended from a common
ancestral gene. An ortholog pair is a homolog pair that
diverged via a speciation event (as opposed to gene duplication within a
species). Orthologs often perform the
same function in each organism, so assessment of orthology is important to
function prediction. Orthology is
frequently assessed manually, or simply assigned for genes that are each
other.s closest match in the genome (the bi‑directional
best hit criterion). I
propose to improve automated assessment of orthology using a graph-theoretic
approach. I will use many data networks
from each species, including protein interactions, genetic interactions, the
regulatory network, gene co-expression, and chromosome maps. Gene nodes of different species will be connected
via edges weighted by a measure of homology (similar DNA or amino acid
sequence). I will develop a score to
measure the similarity of network neighborhoods. Synteny
(common ancestral chromosome segments) will be assessed either using this score
to assess the similarity of genomic neighborhoods or using results from
DeCAL.