# «Abstract. Changes of gene ordering under rearrangements have been extensively used as a signal to reconstruct phylogenies and ancestral genomes. ...»

Reconstructing Ancestral Genomic Orders Using

Binary Encoding and Probabilistic Models

Fei Hu1,2, Lingxi Zhou2, and Jijun Tang1,2,

School of Computer Science and Technology, Tianjin University, China

Department of Computer Science & Engineering, Univ. of South Carolina, USA

jtang@cse.sc.edu

Abstract. Changes of gene ordering under rearrangements have been

extensively used as a signal to reconstruct phylogenies and ancestral

genomes. Inferring the gene order of an extinct species has the potential in revealing a more detailed evolutionary history of species descended from it. Current tools used in ancestral reconstruction may fall into parsimonious and probabilistic methods according to the criteria they follow.

In this study, we propose a new probabilistic method called PMAG to infer the ancestral genomic orders by calculating the conditional probabilities of gene adjacencies using Bayes’ theorem. The method incorporates a transition model designed particularly for genomic rearrangement scenarios, a reroot procedure to relocate the root to the target ancestor that is inferred as well as a greedy algorithm to connect adjacencies with high conditional probabilities into valid gene orders.

We conducted a series of simulation experiments to assess the performance of PMAG and compared it against previously existing probabilistic methods (InferCARsPro) and parsimonious methods (GRAPPA).

As we learned from the results, PMAG can reconstruct more correct ancestral adjacencies and yet run several orders of magnitude faster than InferCARsPro and GRAPPA.

Keywords: ancestral genome, gene order, probabilistic method.

1 Introduction 1.1 Overview Evolutionary biologists have had a long tradition in reconstructing genomes of extinct ancestral species. Mutations in a genomic sequence are made up not only at the level of base-pair changes but also by rearrangement operations on chromosomal structures such as inversions, transpositions, ﬁssions and fusions [1].

Over the past few years, ancestral gene-order inference has brought profound predictions of protein functional shift and positive selection [2].

Methods for ancestral genome reconstruction either assume a given phylogeny that represents the evolutionary history among given species, known as the small Corresponding author.

Z. Cai et al. (Eds.): ISBRA 2013, LNBI 7875, pp. 17–27, 2013.

c Springer-Verlag Berlin Heidelberg 2013 18 F. Hu, L. Zhou, and J. Tang phylogeny problem (SPP); or search the most appropriate tree along with a set of ancestral genomes to ﬁt the observed data, called the big phylogeny problem (BPP). Most of parsimony methods (such as GRAPPA [3], MGR [4,5]) typically solve the SPP exactly by searching a set of ancestral gene orders to minimize the sum of the rearrangement distance over the entire edges of the phylogeny. Ma proposed another method for the SPP in the probabilistic framework (InferCARsPro [6]) by approximating the conditional probabilities for all possible gene adjacencies in an ancestral genome.

Current methods such as GRAPPA and InferCARsPro are capable to handle modern whole-genome data due to their intrinsic high complexity. In this paper, we propose a new probabilistic method called PMAG to reconstruct ancestral genomic orders given a phylogeny. We conducted extensive experiments to evaluate the performance of PMAG with other existing methods. According to the results, our new method can outperform all the other methods under study and still run at least hundreds of times faster than GRAPPA and InferCARsPro.

1.2 Genome Rearrangement Given a set of n genes labeled as {1, 2, · · ·, n}, a genome can be represented by an ordering of these genes. Each gene is assigned with an orientation that is either positive, written i, or negative, written −i. Two genes i and j form an adjacency if i is immediately followed by j, or, equivalently, −j is immediately followed by −i. An breakpoint of two genomes is deﬁned as an adjacency appears in one but not in the other.

Genome rearrangement operations can change the ordering of genes. An inversion operation (also called reversal ) reverses a segment of a chromosome. A transposition is an operation that swaps two adjacent segments of a chromosome. In the case of multiple chromosomes, translocation breaks a chromosome and reattaches a portion of it to another chromosome. Later Yancopoulos et al. [8] proposed a universal double-cut-and-join (DCJ) operation that accounts for all common events which resulted in a new genomic distance that can be computed in linear time.

1.3 Parsimony Methods for Ancestral Gene-Order Reconstruction To ﬁnd a solution for SPP, parsimony algorithms typically iterate over each internal node to solve for the median genomes until the sum of all edge distances (tree score) is minimized. The median problem can be formalized as follows: give a set of m genomes with permutations {xi }1≤i≤m and a distance measurement d, ﬁnd m another permutation xt such that the median score deﬁned as i=1 d(xi, xt ) is minimized. GRAPPA and MGR (as well as their recently enhanced versions) are two widely-referenced methods that implemented a selection of median solvers for phylogeny and ancestral gene-order inference. However solving even the simplest case of median problem when m equals to three is NP-hard for most distance measurements.

Reconstructing Ancestral Genomic Orders 19 Exact solutions to the problem of ﬁnding a median of three genomes can be obtained for the inversion, breakpoint and DCJ distances. Among all the median solvers, the best one is the DCJ median solver proposed by Xu and Sankoﬀ (ASMedian [9]) based on the concept of adequate subgraph. Adequate subgraphs allow decompositions of an multiple breakpoint graph into smaller and easier graphs. Though the ASMedian solver could remarkably scale down the computational expenses of median searching, it yet runs very slow when the genomes are distant.

where priors are assumed equal and the likelihood P (Dx |Xi in x) can be calculated recursively in a post-order traversal fashion summed over q possible congurations. Its transition matrix is deﬁned as an extension of the Jukes-Cantor model such that probability of transition from any character to any diﬀerent character is always equal.

Let sx (·) denote the successor of a gene and px (·) denote the predecessor of a gene, an adjacency pair Ax (i, j) can be viewed as sx (i) = j and px (j) = i simultaneously. After ﬁnishing the calculation of conditional probabilities for every successor and predecessor relationships, the conditional probability of an adjacency Ax (i, j) in genome x can be approximated as

** P (Ax (i, j)|Dx ) = P (px (j) = i|Dx ) × P (sx (i) = j|Dx )**

Finally a fast greedy algorithm is adopted to connect adjacencies into contiguous ancestral regions. Although InferCARsPro showed good results and speedup over parsimonious methods, it is still too slow and inaccurate when dealing with even small number of distant genomes.

We investigated the following intrinsic characteristics of InferCARsPro that account for its diﬃculties in handling complex datasets, which in turn motivated us to propose our new method.

– InferCARsPro uses a neutral model accounting for all changes of adjacencies, however biased model for phylogeny reconstruction has been successfully applied for genome rearrangement scenarios [11].

– The total number of states for each gene is exactly equal to 2×n−2 where n is the number of genes. Thus computing the likelihood score on such excessive number of states clearly incurs huge computational burden.

20 F. Hu, L. Zhou, and J. Tang – The conditional probability of an adjacency is approximated from the predecessor and successor relations. Although such approximation is intuitive, it is more desirable to directly calculate the conditional probability of an adjacency.

– InferCARsPro requires branch lengths of a given phylogeny as part of its inputs, but it is not always handy to obtain in practice.

**2 Algorithm Detail**

Given the topology of a model tree and a collection of gene orders at the leaves, our approach ﬁrst encodes the gene orders into binary sequences and estimates the parameters in the transition model for adjacency changes. Ancestral nodes in the model tree are inferred independently and in each inference, we reroot the model tree to have the target ancestor as the root of a new tree. Then we utilize a probabilistic inference tool to compute the conditional probabilities of all the adjacencies encoded in the binary sequence of the target ancestor. At last we use a greedy algorithm as used in Ma’s work to connect the adjacencies into contiguous regions. We call our new approach Probabilistic Method of Ancestral Genomics (PMAG).

**2.1 Encoding Gene Orders into Binary Sequences**

A gene order can be expressed as a sequence of adjacency information that speciﬁes presence or absence of all the adjacencies [10,11]. Denote the head of a gene i by ih and its tail by it. We refer +i as an indication of direction from head to tail (ih → it ) and otherwise −i as (it → ih ). There are a total of four scenarios for two consecutive genes a and b in forming an adjacency: {at, bt }, {ah, bt }, {at, bh }, and {ah, bh }. If gene c is at the ﬁrst or last place of a linear chromosome, then we have a corresponding singleton set, {ct } or {ch }, called a telomere. A genome can then be expressed as a multiset of adjacencies and telomeres. For instance, a linear chromosome consists of four genes, (+1,+2,can be represented by the multiset of adjacencies and telomeres {{1h}, {1t, 2h }, {2t, 3t }, {3h, 4t }, {4h }}. We further write 1 (0) to indicate presence (absence) of an adjacency and we consider only those adjacencies and telomeres that appear at least once in the input genomes. Table 1 shows an example of encoding two artiﬁcial genomes into binary sequences.

Given a dataset D with m species and each of n genes, let k indicate the total number of linear chromosomes in D, then there are up to 2n+2 distinct adjacencies and telomeres. However in reality if the length of the binary sequences extracted from D is l, then l is typically far smaller. In fact, in the extreme case when genomes in D share no adjacency and telomere, l equals at most to n×m+k, and since m and k are commonly much smaller than n, thus the length of the binary sequences for a dataset is usually linear rather than quadratic to the number of genes.

Reconstructing Ancestral Genomic Orders 21 Table 1. Example of encoding gene orders into binary sequences

2.2 Estimating Transition Parameters Since we are handling binary sequences with two characters, we use a general time-reversible framework to simulate the transitions from presence (1) to absence (0) and vice versa. Thus the rate matrix is

The matrix involves 3 parameters: the relative rate a, and two frequencies π0 and π1.

Severl models have been proposed to probabilistically characterize the changes of gene adjacencies by common types of rearrangement operations such as inversion, transposition as well as DCJ [7,11]. In this study, we use the model that has been successfully applied for phylogeny reconstruction in the context of genome rearrangement as suggested in [11]. In particular, every DCJ operation breaks two random adjacencies uniformly chosen from the gene-order string and subsequently creates two new ones. Since each genome contains n + O(1) adjacencies and telomeres where n is the gene number and O(1) equals to the number of linear chromosomes in the genome, thus the probability that an adjacency changes from presence (1) to absence (0) in the sequence is n+O(1) under one operation. Since there are up to 2n+2 possible adjacencies and telomeres, the probability for an adjacency changing from absence (0) to presence (1) is 2n2 +O(n). Therefore we come to the conclusion that the transition from 1 to 0 is roughly 2n times more likely than that from 0 to 1.

2.3 Inferring the Probabilities of Ancestral Adjacencies for the Root Node In principle, our probabilistic inference is categorized as marginal reconstruction which assigns characters to a single ancestral genome at a time. Once we have the tree topology and binary sequences encoding the input gene orders, we use 22 F. Hu, L. Zhou, and J. Tang the extended probabilistic approach for sequence data described by Yang [12] to infer the ancestral gene orders at the root node. In the binary sequences, each site represents an adjacency with character either 0 (absence) or 1 (presence) and for each site we seek to calculate the conditional probability of observing that adjacency. As the true branch lengths are not available, we take advantage of the widely-used maximum-likelihood estimation from the binary sequences at the leaves to estimate the branch length.

Suppose x is the root of a model tree, then the conditional probability that node x has the character sx at the site, given Dx representing the observed data at the site in all leaves of the subtree rooted at x, is P (sx )P (Dx |sx ) πsx Lx (sx ) P (sx |Dx ) = = P (Dx ) sx πsx Lx (sx )

where f and g are the two direct descendants of x. pij (t) deﬁnes the transition probability that character i changes to j after an evolutionary distance t. Following the deduction of transition probability in [13], our transition-probability matrix can be written as pij (t) = πj + e−t (δij − πj ) Here the δij is 1 if i = j, otherwise δij is 0. In order to set up the 2n ratio, we simply set the rate a to 1 and add a direct assignment of the two frequencies in the code. For instance, if the character frequencies are π0 = 0.1 and π1 = 0.9, then the rate of 0 to 1 transitions is 10 times as high as the rate of transitions in the other direction under the same evolutionary distance.