«Forschungsinstitut für die Biologie landwirtschaftlicher Nutztiere (FBN), Dummerstorf, Germany CHRISTINE BAES and NORBERT REINSCH Computing the ...»
Arch. Tierz., Dummerstorf 50 (2007) 3, 294-308
Forschungsinstitut für die Biologie landwirtschaftlicher Nutztiere (FBN), Dummerstorf, Germany
CHRISTINE BAES and NORBERT REINSCH
Computing the condensed conditional gametic QTL relationship
matrix and its inverse
The inverse of the conditional gametic relationship matrix ( G -1 ) for a marked quantitative trait locus (MQTL) is
required for estimation of gametic effects in best linear unbiased prediction (BLUP) of breeding values if marker data are available. Calculation of the “condensed” gametic relationship matrix G * - a version of G where linear dependencies have been removed - and its inverse G *-1 is described using a series of simplified equations following a known algorithm. The software program COBRA (covariance between relatives for a marked QTL) is introduced, and techniques for storing and computing the condensed gametic relationship matrix G * and the non-zero elements of its inverse are discussed. The program operates with both simple pedigrees and those augmented by transmission probabilities derived from marker data. Using sparse matrix storage techniques, G * and its inverse can be efficiently stored in computer memory. COBRA is written in FORTRAN 90/95 and runs on a variety of computers. Pedigree data and information for a single MQTL in the German Holstein population are used to test the efficiency of the program.
Key Words: marker assisted selection, best linear unbiased prediction, gametic relationship matrix Zusammenfassung Titel der Arbeit: Berechnung der eingedampften gametischen QTL Verwandtschaftsmatrix und ihrer Inversen Die Inverse G -1 der Gametischen Verwandtschaftsmatrix für einen QTL (quantitative trait locus) mit gekoppelten Markern wird für eine markerunterstützte Zuchtwertschätzung benötigt. Es wird beschrieben wie die eingedampfte gametische Verwandtschaftsmatrix G * - eine Variante von G ohne lineare Abhängigkeiten in einer Abfolge einfacher Rechenschritte nach einem bekannten Algorithmus berechnet werden kann. Das EDV-Programm COBRA (covariance between relatives for a marked QTL) wird vorgestellt, um anschließend Techniken für die Speicherung und die Berechnung von G * und die Nicht-Nullelemente ihrer Inversen zu besprechen. COBRA kann sowohl einfache Pedigreedaten als auch Pedigreedaten mit Markerinformation verarbeiten, wobei Speichertechniken für dünnbesetzte Matrizen eingesetzt werden. COBRA ist in FORTRAN 90/95 geschrieben und kann auf einer Vielzahl verschiedener Rechner laufen. Pedigreedaten und Informationen für einen einzelnen MQTL in der deutschen Holstein-Population werden benutzt, um die Effizienz des Programms zu prüfen.
Schlüsselwörter: Marker-gestützte Selektion, bester linearer unverzerrter Prediktor, gametische Verwandtschaftsmatrix 1. Introduction The joint utilization of marker and phenotype information in current genetic prediction models is evolving rapidly. FERNANDO and GROSSMAN (1989) incorporated marked quantitative trait loci (MQTL) information into the existing best linear unbiased prediction (BLUP) breeding value estimation model by splitting the ‘genetic’ portion of the model into the additive effects of the unique QTL gametes and the polygenic effects. In order to include MQTL information in the estimation model, the Arch. Tierz. 50 (2007) 3 inverse of the conditional covariance matrix of QTL allele effects G is required ( G −1 ).The efficient computation of matrices like G and G −1 for large pedigrees is crucial for further successful incorporation of marker data in genetic evaluation models.
A numerically efficient algorithm for the calculation of G and its inverse for an MQTL was developed by ABDEL-AZIM and FREEMAN (2001) based on the work of FERNANDO and GROSSMAN (1989), VAN ARENDONK et al. (1994) and WANG et al. (1995). TUCHSCHERER et al. (2004) showed that the calculation of G depends on the mode of gamete identification (gametes identified by markers vs.
gametes identified by parental origin), although the final MA BLUP breeding value of each animal is identical irrespective of the gamete identification method employed.
They suggested that gamete identification by parental origin may have practical advantages compared to that by markers; for example, fewer values are required to denote marker related transmission probabilities. More importantly, the generalisation developed by TUCHSCHERER et al. (2004) showed that under certain circumstances identical rows and columns in G may occur if parents pass identical copies of their gametes to their offspring (i.e. G may be rank deficient); in such cases, the inverse matrix G −1 is not defined. By excluding duplicate gamete information, the number of gametic effects in G can be reduced to a smaller set of unique effects in a ‘condensed’ gametic relationship matrix G * ; G * is always of full rank and G *−1 is defined.
Additionally, the smaller G * matrix requires less memory and may be calculated faster than a larger one.
In the first section of this article, the calculation of G * and its inverse is described to illustrate the practical application of the generalised algorithm presented by TUCHSCHERER et al (2004). Secondly, the software program COBRA (covariance between relatives at a marked QTL) is introduced, and techniques for determining, storing and computing the condensed gametic relationship matrix G * and the non-zero elements of its inverse are discussed. Finally, pedigree data and information for a single marker in the German Holstein population are used to test the efficiency of the program.
Calculating G * and its inverse 2.
The matrix G was developed by SMITH (1984) and SMITH and ALLAIRE (1985) to calculate the probability of any two gametes being identical by descent in an inbred population (cited by SCHAEFFER et al., 1989). It is symmetric and contains one row and one column for each gamete (i), which are calculated from the rows and columns belonging to the gametes’ predecessor gametes (called the gametes’ parents here forth). The diagonal elements of G are equal to one ( G*i,i ) = 1.0 for gamete i), as the ( probability of a gamete being identical to itself is always equal to one. The probability that the parental gamete received from the sire of the animal, Pga and the parental gamete from the dam Mga are identical by descent is the inbreeding coefficient fa of individual a, G*Pg, Mg ) = f a ).
( a a In a pedigree with n animals, the matrix G has the dimension 2n x 2n because every animal is assumed to have two unique gametes. However, if an exact copy of a parental gamete is passed to its offspring, the effects of that gamete are included twice BAES; REINSCH: Computing the condensed conditional gametic QTL relationship matrix and its inverse in G ; the computation of G −1 fails due to the linear dependencies in G (see section 4.
in TUCHSCHERER et al., 2004). If copied alleles are excluded and only unique gametes are assigned rows and columns in G *, the linear dependencies caused by identical rows and columns no longer exist. This means that animals with two unique gametes contribute two rows and two columns to G *, animals with one unique gamete and one copy of a gamete contribute only one row and column, and for animals with two copied gametes no rows or columns are added. The determination of unique and copied gametes depends on the pedigree and transmission probability information and is outlined below.
2.1 Assumptions Consider one or several marker loci closely linked to a quantitative trait locus (MQTL), with linkage equilibrium between the markers and the MQTL. A recombination rate of zero may occur between an MQTL and one (or more) marker(s).
Markers may be single, flanking or multiple for the MQTL, and the number of markers and the distance between them is not limited. The first allele is assumed to be the paternal gamete (Pg) and the second is considered to be the maternal gamete (Mg).
The paternal transmission probability (T(Pg)) is the probability that the sire of an animal passed his paternal (first) gamete to his offspring. Likewise, the maternal transmission probability (T(Mg)) is the probability that the dam of an animal passed her paternal (first) gamete to her offspring. The probability of the maternal (second) gamete of the sire or dam being passed on to the offspring can be calculated by subtracting the paternal or maternal transmission probability (respectively) from one. Thus two values (paternal and maternal) actually represent all four possible probabilities.
2.2 Method The calculation of the gametic relationship matrix using recursive algorithms requires an ordered pedigree in which parental animals occur before their progeny, and transmission probabilities for all non-founder individuals conditional on marker information for the paths sire progeny and dam progeny. The calculation of transmission probabilities is described by MAYER et al. (2007, submitted). For missing and non-informative markers the transmission probability is 0.5; the resulting contributions to the relationship matrix are identical to those in the classical numerator relationship matrix.
If an animal has a transmission probability of one for a certain gamete, that animal will receive the paternal gamete from its respective parent with 100% certainty; the gamete received by the animal is an exact copy of the parental gamete and is not unique.
Conversely, if a transmission probability of zero occurs for a given gamete, the animal will receive the maternal gamete from its respective parent with 100% certainty.
It is possible to set up a gametic pedigree following the calculation of transmission probabilities; the number of unique gametes can be determined and the size of G * can be calculated. All unique paternal and maternal gametes are assigned an integer identification number in ascending order. Predecessor gametes of paternal gametes, Pg(Pg) and Mg(Pg), and predecessor gametes of maternal gametes, Pg(Mg) and Mg(Mg) must also be considered.
In contrast to the gametic identification numbering method employed in the calculation of G, non-unique gametes do not receive unique identification numbers in Arch. Tierz. 50 (2007) 3 G *. If a gamete is passed from a parent to its offspring with 100% certainty, it is included in the gametic pedigree with the same identification number as the gamete from which it originated.
Calculation of the matrix G * 2.3 The calculation of G * is described by equation (9) in TUCHSCHERER et al. (2004).
If the information in the pedigree is sorted with parents precede their progeny, it is possible to build G * recursively starting with the top left corner and working towards the right. The transmission probability assigned to the gamete depends on whether it is the paternal or maternal gamete of the animal in question.
The matrix G * is symmetric, therefore only the upper triangular matrix needs to be calculated; the lower triangular matrix is identical to its upper counterpart. Let (i,j) represent the row and column indices for an element in G * and assume that ij. The elements of each column are calculated from row one until the last row before the diagonal (j-1), using the information from the predecessor gametes and the transmission probability of the gamete being calculated.
For columns of paternal gametes with known predecessors (i.e. j is a paternal gamete), G *i, j ) = T( Pg ) j × G *i, Pg( Pg ) j ) + (1 − T( Pg ) j ) × G *i, Mg( Pg ) j ) (1) ( ( ( where T(Pg)j is the transmission probability for the paternal gamete j, i denotes the row to be calculated, Pg(Pg)j is the paternal predecessor gamete of the paternal gamete j and Mg(Pg)j is the maternal predecessor gamete of the paternal gamete j. The calculation of columns of maternal gametes with known parents (i.e. j is a maternal gamete) is accomplished similarly, G *i, j ) = T( Mg ) j × G *i, Pg( Mg ) j ) + (1 − T( Mg ) j ) × G *i, Mg( Mg ) j ) (2) ( ( ( where T(Mg)j is the transmission probability for the maternal gamete j, i is the row to be calculated, Pg(Mg)j is the paternal predecessor gamete of the gamete j and Mg(Mg)j is the maternal predecessor gamete of the gamete j. If the predecessors of i and j are unknown, G*i, j ) = 0. In this way, G * can be calculated for all unique gametes and no (
linear dependencies occur.
2.4 The inbreeding coefficient The inbreeding coefficient at the MQTL, fa, is the probability that the gametes at the MQTL in individual a are identical by descent and can be found in the element (Pg,Mg) of G * for each individual. In G, this value is always located directly above the diagonal of the maternal gamete (or below the diagonal of the paternal gamete) for every individual. Although G * also contains inbreeding coefficients for each individual, they are not necessarily located in the same position as in G (i.e. directly above the maternal gamete of the individual) because the matrix is ordered differently due to the exclusion of non-unique gametes.
It is important to realize that the fa of an animal is a function of its respective transmission probabilities and the relationships between its parental gametes. Although the fa are the only elements of G * required to calculate its inverse, it is apparent that the fa of an individual depends on certain pre-existing elements of G *. Technically, BAES; REINSCH: Computing the condensed conditional gametic QTL relationship matrix and its inverse only very specific elements of G * are needed for the calculation of its inverse. TIER (1990) described how the minimum subset of matrix G required for the calculation of G −1 can be determined. However, the computation method presented here is sufficient for medium-sized pedigrees with a half sib structure.