\documentclass[12pt]{article} \pagestyle{empty} \setlength{\oddsidemargin}{.25in} \setlength{\topmargin}{-.25in} \setlength{\textwidth}{6in} \setlength{\textheight}{8in} \begin{document} \begin{center}{\large Research Statement}\\ Reuben J. Settergren\end{center} My work in computational biology involves designing software to analyze data from physical mapping and sequencing experiments. In molecular biology, physical mapping is the process of locating landmarks on a string of DNA. Once correctly placed, these landmarks can be used to guide the search for new genes, since once the location of a gene is bounded by known landmarks, only genes between those bounds must be tested. Physical mapping also plays an integral role in large-scale sequencing, or determination of the actual string of nucleotides (commonly abbreviated A, C, T, and G) which comprise a piece of DNA. Since current technology limits the length of directly sequencable DNA to on the order of $10^3$ nucleotide base-pairs, larger sequences must be assembled from smaller parts; physical maps form the scaffold for this assembly. The Human Genome Project (HGP) is a national effort to sequence all of the human genetic material, approximately three billion base-pairs in length. Sequencing on such an unprecedented scale can only be successful with efficient automation, both in the laboratory and on the computer. And even though the HGP will naturally terminate once it produces a public database which will facilitate research of human genetics, technology and software for physical mapping must continue to be improved, as geneticists continue to study DNA of other species. The most common physical mapping methods involve some form of clone fingerprinting. A library of overlapping clones (subsequences) of the target DNA is built by randomly fracturing many copies of the DNA, either with restriction enzymes (which cut the DNA at every occurence of some small sequence of nucleotides) or with radiation. Biological experimentation on the library of clones then yields for each clone a ``fingerprint'', from which must be inferred information about incidence with any of the landmarks for the map, or overlap with the other clones. \bigskip {\bf Theoretical thresholds for unambiguous clone libraries.} Fingerprinting experiments can only be successful when performed on a high quality clone library. Part of my research investigtes the question of how many clones or probes must be in a library to ensure that only the correct reconstruction of the genome is possible. For instance, if there is a gap in coverage of the genome, it is impossible to guarantee correct reconstruction of the data, since the order of the two sets of overlapping clones (contigs) could be reversed with respect to each other. Even in the absence of gaps, there are other pathological configurations of the data which allow more than one reconstruction of the data (therefore at least one possible incorrect reconstruction). Part of this work involves a characterization of what sets of data are unambiguous, which is related to a known concept in the theory of interval graphs. The clone densities where such ambiguities disappear exhibit strong threshold tendencies (which are borne out by laboratory experience). Using a standard probabilistic model due to Lander and Waterman, I have been able to theoretically justify this threshold behavior for a number of properties of clone distributions, and am continuing to investigate other aspects of this problem. %For instance, using a standard probabilistic model due to Lander and %Waterman, let the genome be represented by a real interval of length %$L$, and let $n$ clones be subintervals, with left endpoints sampled %uniformly from the genome, and lengths distributed uniformly between 1 %and $b$. Let $n=L(\ln L + \omega)$ as $L\rightarrow\infty$. Then if %$\omega$ is any increasing function, the probability that the clones %form a single contig (i.e. there are no gaps) approaches 1, and if %$\omega$ is any decreasing function, the probability approaches 0. %Therefore, $n=L(\ln L + \omega)$, for constant $\omega$, is a %threshold for connectivity of clones. Other, similar, thresholds have %been found under this model, for other properties of the clone %distribution. \bigskip {\bf Ordering clone libraries. } Once a quality clone library is gathered, the clones can be fingerprinted with various biological experiments, and that data is used to place the clones on the genome in the correct order, so a map can be built. % %Three common forms of fingerprinting use oligonucleotide %probes, unique probes, and restriction enzymes. For the probing %methods, each clone is tested for containment of a number of DNA %probes. Oligonucleotide probes are small, and occur possibly many %times across the genome, while unique probes are large enough to %occur, with high probability, only once. In Multiple Complete Digest %(MCD) mapping, instead of probing for containment of subsequences, %restriction enzymes are used to fragment each of the clones. The %fingerprint of each clone then consists of the sets of fragment %lengths into which it is cut by each of the restriction enzymes. % %One essential step in building a physical map from a library of %fingerprinted clones is to order the clones correctly according to %their placement on the genome. % Since each technology for fingerprinting yields a different type of data, previous research has focused on devising specialized algorithms for ordering clones from each type of experiment. I am developing a more unified method for ordering clone libraries. A biological experiment which produces clone fingerprint data can be modeled probabilistically, and according to that model, the probability of any two clones overlapping, given their fingerprint data, can be estimated. Thus the results of any fingerprinting experiment can be described with a matrix of overlap probabilities. Given such a matrix, my algorithms use simulated annealing to search for a maximum-likelihood ordering. % %Simulated annealing is a local %search method that converges on an optimal solution by using a %neighborhood structure to traverse the solution space, all the time %gradually improving the solution quality. Here the solution space is %the set of all orderings of $n$ clones, and simulated annealing %considers a large number of orderings, so the likelihood of any %particular ordering must be calculated efficiently. I have reduced a %direct $O(n^2)$ approach from the literature to $O(cn)$ in the average %case, where $n$ is the number of clones, and $c$ is the average depth %of the coverage of the genome (the sum of the clone lengths, divided %by the genome length). In addition, it is important to take advantage %of the neighborhood structure followed by the annealing process. I %have designed heuristics that, using information from the likelihood %calculation of one clone ordering, can estimate the likelihood of a %neighboring ordering in $O(c^2)$. % Using a number of heuristics that take advantage of the neighborhood structure of the permutation space that the simulated annealing traverses, my algorithms can efficiently produce good clone orderings from a variety of real and simulated data. \bigskip {\bf Multiple Complete Digest Mapping.} In MCD mapping, the fingerprint of each clone is the sets of fragment sizes that result from digesting the clone with some restriction enzymes. For this technique, after the clones have been correctly ordered, work is still necessary to determine which fragments in the fingerprints of adjacent clones correspond to the same physical fragment of the original DNA. I am participating in a project, led by Richard Karp at the University of Washington, to developm a graphical, interactive system, called XRMAP, for performing just this task. XRMAP provides an interactive, graphical environment that, in addition to to performing MCD mapping, provides users with feedback on problem areas in the data, and allows users to edit the data. Work on this project is ongoing, and XRMAP promises to be of great use to scientists attempting to automate large-scale sequencing projects. \bigskip {\bf Goals.} In addition to continuing work in areas related to the problems discussed above, I am interested in graph theory and combinatorics, and the application of mathematics and computer science to problems in science and industry. I am also interested in developing research projects that can involve undergraduates, to prepare them either for graduate school, or to be able to deal creatively and effectively with the new problems they will face in their lives after college. \end{document} \begin{thebibliography}{9} \bibitem{me} Alizadeh, F., H. Chen, and R. Settergren, ``Thresholds for Ordering Libraries of Variable Length Clones'', RUTCOR Research Report RRR \# 5-96, 1996. \bibitem{AKNW} Alizadeh, F., R. M. Karp, L. A. Newberg, and D. K. Weisser, ``Physical Mapping of Chromosomes: A Combinatorial Problem in Molecular Biology,'' {\it Algorithmica} {\bf 13} (1995), pp. 52-76. \bibitem{AKWZ} Alizadeh, F., R. M. Karp, D. K. Weisser, and Geoffrey Zweig, ``Physical Mapping of Chromosomes Using Unique Probes,'' RUTCOR research report RRR \# 48-95, 1995. \bibitem{DFS} Dyer, M., A. Frieze, and S. Suen, ``Ordering Clone Libraries in Computational Biology,'' to appear in {\it Computational Biology}. \bibitem{MCD} Fasulo, D., T. Jiang, R. M. Karp, R. Settergren, E. Thayer, ``An Algorithmic Approach to Multiple Complete Digest Mapping'', RECOMB, Jan 1997. \bibitem{LW} Lander, E., and M. Waterman, ``Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis,'' {\it Genomics} {\bf 2}(3), 231--239, 1988. \bibitem{Nelson} Nelson, D., ``Statistical Methods in Physical Mapping,'' Ph. D. dissertation, Berkeley, 1995. \bibitem{slonim} Slonim, D. ``Radiation Hybrid Mapping'', RECOMB, Jan 1997. \end{thebibliography}