A threshold of disequilibrium parameter using Multiallelic model


A threshold of disequilibrium parameter using cumulative relative frequency of Haplotypes on Multiallelic model

Makoto Tomita

Nanzan University

27 Seirei-cho, Seto, Aichi, 489-0863, JAPAN.

tomita@nanzan-u.ac.jp

Summary.

A domain in which recombination does not often occur, but in which linkage disequilibrium is maintained, is known as a “haplotype block” or “LD block”. In a biallelic model, there are many methods for identifying haplotype blocks using disequilibrium parameters, in particular the method of Kamatani et al. (2004). Although their method requires a long calculation time, it is advantageous among current methods when analyzing large amounts of genotype data. Tomita et al. (2005) considered that the cause of the extended calculation time lies in the threshold of $D'$ used as the initial condition for identifying haplotype blocks. They reported a method that identifies more appropriate haplotype blocks and greatly shortens the calculation time by setting the optimal threshold of $D'$. In this paper, we apply their method to a multiallelic model.

Key words: linkage disequilibrium, haplotype block, multiallelic model

1 Introduction

A major goal of current human genome-wide studies is to identify the genetic basis of complex disorders. One of the analyses used in statistical genetics for this purpose is linkage disequilibrium analysis. Here, “linkage” is a relation between loci observed over only a few generations, whereas “linkage disequilibrium” is the state in which that relation is still maintained after many generations. Methods of trait mapping based on the theory of linkage disequilibrium analysis have developed quickly in recent years.

Why do we consider linkage disequilibrium? The data that we analyze consist of genotypes and phenotypes. A genotype is DNA data, and a phenotype is data that can be observed directly, for instance disease status or blood pressure. We want to know the relation between genotype and phenotype. However, the number of kinds of genotype is enormous. This number can be reduced by knowing the linkage disequilibrium. We therefore want to know the ranges where linkage disequilibrium is strong, and our purpose is to identify those ranges.


Human chromosomes undergo recombination in each generation, and many recombinations mean a greater diversity of genotypes. On DNA sequences there are domains called “hotspots”, where recombination has occurred often. On the other hand, there are domains where recombination has not occurred often and linkage disequilibrium has been maintained. Such a domain is known as a “haplotype block” or “LD block”. The value of $D'$, a disequilibrium parameter, has been used until now as an important step in identifying haplotype blocks, but its use is backed more by experience than by theory. Tomita et al. (2005) examined the relation between the identification of haplotype blocks and $D'$ on a biallelic model. In this report, we apply their method to a multiallelic model.

2 Recent studies and their problem on the biallelic model

2.1 Recent studies

Gabriel et al. (2002) defined “strong linkage disequilibrium” as the state where the upper bound of the 95% confidence interval of $D'$ exceeds 0.98. Zhu et al. (2003) evaluated haplotypes whose relative frequency is more than 0.04 each. Furthermore, the ideas of Gabriel's and Zhu's methods were combined to identify haplotype blocks in another paper (Kamatani et al., 2004). The detailed procedure is as follows.

1 Since the loci with minor allele frequencies of less than 0.1 are likely to have been generated by recent mutations, they were excluded from the genotype data for the determination of haplotype blocks.

2 The initial satellite of a block (which they define as the “minimum block”) was constructed using a pair of adjacent SNPs with $D' \geq 0.9$.

3 Using the satellite block, possible extension of the block to an adjacent SNP in either direction is examined. If the haplotype heterozygosity is unchanged by the extension in one direction, the block is extended in that direction. If the extension increases the haplotype heterozygosity, the SNP before the extension is considered an end of the block. This is judged by whether the set of major haplotypes whose cumulative relative frequency reaches 0.95 (or 0.9) changes when the alleles of the added locus are appended to the block.
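Steps 1 and 2 of the procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Kamatani et al.: the function name `seed_minimum_blocks`, the callable `d_prime` that returns $D'$ for a pair of locus indices, and the toy values are all hypothetical, and the extension of step 3 is not covered.

```python
from typing import Callable, List, Tuple


def seed_minimum_blocks(
    maf: List[float],
    d_prime: Callable[[int, int], float],
    maf_min: float = 0.1,
    d_prime_min: float = 0.9,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of neighbouring retained SNPs that seed a block.

    maf[i] is the minor allele frequency of locus i; d_prime(i, j) is assumed
    to return the D' value for loci i and j (how it is estimated is left open).
    """
    # Step 1: drop loci whose minor allele frequency is below maf_min.
    kept = [i for i, freq in enumerate(maf) if freq >= maf_min]
    # Step 2: two neighbouring retained SNPs form a minimum block
    # when their D' reaches the threshold.
    return [(i, j) for i, j in zip(kept, kept[1:]) if d_prime(i, j) >= d_prime_min]


# Toy data (hypothetical): locus 1 is excluded by its low frequency,
# so loci 0 and 2 become neighbours.
toy_maf = [0.30, 0.05, 0.25, 0.40]
toy_d_prime = {(0, 2): 0.95, (2, 3): 0.60}
seeds = seed_minimum_blocks(toy_maf, lambda i, j: toy_d_prime[(i, j)])
print(seeds)
```

With the toy values only the pair of loci 0 and 2 is kept as a seed, because the pair of loci 2 and 3 falls below the 0.9 threshold.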

2.2 A Problem of a Threshold

In the above system for the identification of haplotype blocks, the initial condition for satellite blocks, $D' \geq 0.9$, was determined by experience, and alternative conditions are worth considering. Because this condition is strict, the satellite blocks are very short, and the calculation time for the identification of blocks may become huge. We therefore examine other thresholds for constructing the initial satellite blocks in the following sections.


2.3 A method for a model with two kinds of allele and two loci

A SNP (single nucleotide polymorphism) is a kind of DNA marker. There are about three billion loci on the human chromosomes, and each locus carries two alleles. There are four kinds of allele: adenine (A), cytosine (C), guanine (G) and thymine (T). A SNP is a locus with two kinds of allele. However, a locus whose minor allele has a relative frequency of less than 5% is not called a SNP. Some definitions use 1% or 0.1% instead of 5%.

Tomita et al. (2005) introduced their idea as follows. When we consider linkage disequilibrium, the simplest model is the case in which there are only two loci and only two kinds of allele on each locus (see Table 1). In a SNP marker there are, of course, only two kinds of allele. Let locus 1 be the first locus and locus 2 the second. Since there are only two kinds of allele at every locus, let 1 and 2 be the labels of the alleles on locus 1, and 3 and 4 the labels of the alleles on locus 2. As relative frequencies of each allele, let $p_1$ and $p_2$ be the relative frequencies on locus 1, and $p_3$ and $p_4$ the relative frequencies on locus 2, where $p_2 = 1 - p_1$ and $p_4 = 1 - p_3$. A “haplotype” is a combination of alleles at two or more loci, and a “diplotype” is a pair of two haplotypes. In this model the haplotypes are 1-3, 1-4, 2-3 and 2-4. If there is no correlation between the alleles of locus 1 and the alleles of locus 2, for example because the loci lie on different chromosomes, the theoretical frequency of haplotype 1-3 is $p_1 \cdot p_3$.

Table 1. 4 haplotypes on 2 loci.

                       locus 2
                       allele 3     allele 4
locus 1    allele 1    $p_{13}$     $p_{14}$
           allele 2    $p_{23}$     $p_{24}$

Let the genotype of locus 1 be $G_1$ and the genotype of locus 2 be $G_2$. Let the alleles of locus 1 be $\{X_1, X_2\}$ and the alleles of locus 2 be $\{X_3, X_4\}$. The random variables $G_i$, $i = 1, 2$, can be written as $G_1 = \{X_1, X_2\}$, $G_2 = \{X_3, X_4\}$ by using only $\{X_1, X_2, \ldots, X_4\}$. Then the allele relative frequencies are $P\{X_1 = 1\} = p_1$ or $P\{X_2 = 1\} = p_1$ for allele 1 of locus 1, and $P\{X_1 = 2\} = p_2$ or $P\{X_2 = 2\} = p_2$ for allele 2 of locus 1. Alleles $\{3, 4\}$ are treated similarly.

And let

$$P\{X_1 = 1\} = p_1 \geq P\{X_2 = 2\} = p_2,$$

$$P\{X_3 = 3\} = p_3 \geq P\{X_4 = 4\} = p_4,$$

that is, $X_1$ is the major allele on locus 1 and $X_3$ is the major one on locus 2.

And let the observed relative frequencies of the haplotypes be

$$P\{X_1 = 1, X_3 = 3\} = p_{13}, \quad P\{X_1 = 1, X_4 = 4\} = p_{14},$$

$$P\{X_2 = 2, X_3 = 3\} = p_{23}, \quad P\{X_2 = 2, X_4 = 4\} = p_{24}.$$

Linkage equilibrium exists when

$$P\{X_1 = 1, X_3 = 3\} = P\{X_1 = 1\} \times P\{X_3 = 3\},$$

and linkage disequilibrium exists when

$$P\{X_1 = 1, X_3 = 3\} \neq P\{X_1 = 1\} \times P\{X_3 = 3\}.$$

A disequilibrium parameter $D$ is defined by

$$D = P\{X_1 = 1, X_3 = 3\} - P\{X_1 = 1\} \times P\{X_3 = 3\}. \tag{1}$$
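As a small worked illustration of the definition of $D$, the following Python sketch computes it from the four haplotype relative frequencies of Table 1. The function name `disequilibrium_d` and the numerical values are assumptions made for this example only.

```python
def disequilibrium_d(p13: float, p14: float, p23: float, p24: float) -> float:
    """Disequilibrium parameter D for two biallelic loci.

    The arguments are the relative frequencies of haplotypes 1-3, 1-4, 2-3, 2-4.
    """
    total = p13 + p14 + p23 + p24
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype relative frequencies must sum to 1")
    p1 = p13 + p14  # relative frequency of allele 1 on locus 1
    p3 = p13 + p23  # relative frequency of allele 3 on locus 2
    return p13 - p1 * p3


# Linkage equilibrium: every haplotype frequency is a product of allele frequencies
# (p1 = 0.6, p3 = 0.7), so D is zero up to rounding error.
d_equilibrium = disequilibrium_d(0.42, 0.18, 0.28, 0.12)

# Linkage disequilibrium: haplotypes 1-3 and 2-4 are over-represented
# (p1 = p3 = 0.6), so D = 0.5 - 0.36 = 0.14.
d_disequilibrium = disequilibrium_d(0.5, 0.1, 0.1, 0.3)
```

The first call illustrates linkage equilibrium and the second a positive $D$, where haplotypes 1-3 and 2-4 occur more often than the allele frequencies alone would predict.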

2.4 A Disequilibrium Parameter and Cumulative relative frequencies of haplotypes

Let coefficients $\alpha$, $\beta$ be

$$P\{X_1 = 1\} = p_1 = \alpha, \quad P\{X_3 = 3\} = p_3 = \beta, \quad \alpha > \beta, \quad 0.5 < \alpha < 1.0, \quad 0.5 < \beta < 1.0. \tag{2}$$

Therefore, we get the following equation.

$$P\{X_1 = 1, X_3 = 3\} + P\{X_2 = 2, X_4 = 4\} = \alpha\beta + (1 - \alpha)(1 - \beta) + 2D. \tag{3}$$

Let the cumulative frequency $f$ of the two haplotypes {1-3, 2-4} be

$$f = P\{X_1 = 1, X_3 = 3\} + P\{X_2 = 2, X_4 = 4\}.$$

We get the following equation for $D$ as a function of $f$:

$$D(f) = \frac{\alpha(1 - \beta) + (1 - \alpha)\beta + f - 1}{2}. \tag{4}$$

Let $D_{\max}$, which is used to normalize $D$, be given by the following equations. When $D \geq 0$,

$$D_{\max} = \min[\alpha(1 - \beta), (1 - \alpha)\beta], \tag{5}$$

and when $D < 0$,

$$D_{\max} = \max[-\alpha\beta, -(1 - \alpha)(1 - \beta)]. \tag{6}$$

Then $D'$ is given by the following equation, using Equation (4), Equation (5) and Equation (6).

$$D'(f) = \frac{\alpha(1 - \beta) + (1 - \alpha)\beta + f - 1}{2 D_{\max}} \tag{7}$$

Equation (7) expresses the disequilibrium parameter $D'$ in terms of the cumulative frequency $f$ of the haplotypes {1-3, 2-4}. It is the kernel of their method.

2.5 Unit of haplotype block

There is a very close relationship between the cumulative relative frequency of haplotypes and the value of $D'$. When $D' = 0.9$ and each allele relative frequency is 0.5, the cumulative haplotype relative frequency is analytically 0.95. As the allele relative frequencies shift away from 0.5, this relationship no longer holds.
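The relationship stated above can be checked numerically with a short Python sketch of Equations (4) to (7). The function name `d_prime_from_f` and the second set of allele frequencies are assumptions chosen for illustration.

```python
def d_prime_from_f(alpha: float, beta: float, f: float) -> float:
    """D' as a function of the cumulative frequency f of haplotypes {1-3, 2-4}.

    alpha and beta are the major allele relative frequencies on locus 1 and
    locus 2, assumed to lie strictly between 0.5 and 1.
    """
    # Equation (4): D expressed through f.
    d = (alpha * (1 - beta) + (1 - alpha) * beta + f - 1) / 2
    if d >= 0:
        # Equation (5): normalizer when D >= 0.
        d_max = min(alpha * (1 - beta), (1 - alpha) * beta)
    else:
        # Equation (6): normalizer when D < 0 (negative, so D' stays positive).
        d_max = max(-alpha * beta, -(1 - alpha) * (1 - beta))
    # Equation (7).
    return d / d_max


# Both allele frequencies equal to 0.5: f = 0.95 corresponds to D' = 0.9.
print(round(d_prime_from_f(0.5, 0.5, 0.95), 6))

# Allele frequencies shifted from 0.5: the same D' = 0.9 is already reached
# at the smaller cumulative frequency f = 0.864.
print(round(d_prime_from_f(0.7, 0.6, 0.864), 6))
```

In this example the same value $D' = 0.9$ corresponds to $f = 0.95$ when both allele frequencies are 0.5, but to $f = 0.864$ when they are 0.7 and 0.6, which is one way to see why a single fixed threshold does not carry over between allele frequencies.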

Next, we consider $f$ in Equation (7). It can be written as

$$P\{X_1 = 1, X_3 = 3\} \geq \alpha f, \quad P\{X_2 = 2, X_4 = 4\} \geq (1 - \beta) f. \tag{8}$$

The greatest lower bound of the cumulative relative frequency of the haplotypes {1-3, 2-4} is $(\alpha + (1 - \beta)) f$, given by Equation (8):

$$\inf\left[P\{X_1 = 1, X_3 = 3\} + P\{X_2 = 2, X_4 = 4\}\right] = (\alpha + (1 - \beta)) f.$$

Then the “unit of haplotype block” is a pair that satisfies

$$D' \geq D'(f_{\mathrm{unit}}), \quad \text{where } f_{\mathrm{unit}} = (\alpha - \beta + 1) f. \tag{9}$$

A large shortening of the calculation time of block identification is expected from using our optimal threshold value $D'(f_{\mathrm{unit}})$ of Equation (9).

The above is the technique proposed by Tomita et al. (2005). They compared it with the method of Kamatani et al. on the publicly available data of the HapMap project. The calculation time was shortened by a factor of about ten, while almost the same blocks were identified.

3 Applying a threshold of $D'$ using cumulative relative frequency of haplotypes to the multiallelic model

3.1 Disequilibrium parameter $D'$ on multi-allelic model

On a multi-allelic model, in which loci have three or more kinds of allele, $D'$ is defined as follows (Hedrick, 1987).

$$D'_{\mathrm{multiallelic}} = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i q_j \left| D'_{ij} \right|, \tag{10}$$

where

$$D_{ij} = x_{ij} - p_i q_j, \tag{11}$$

$p_i$ and $q_j$ are the relative frequencies of the alleles, and $x_{ij}$ is the relative frequency of the haplotype, when there are $m$ kinds of allele on locus 1 and $n$ kinds of allele on locus 2.
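Equations (10) and (11) can be sketched in Python as follows. The text does not spell out how each $D_{ij}$ is normalized to $D'_{ij}$, so this sketch assumes the usual pairwise normalization, which agrees in magnitude with Equations (5) and (6) of the biallelic case. The function name `d_prime_multiallelic` and the example matrices are hypothetical.

```python
from typing import List


def d_prime_multiallelic(x: List[List[float]]) -> float:
    """Weighted multiallelic D' for an m-by-n matrix of haplotype relative frequencies."""
    m, n = len(x), len(x[0])
    p = [sum(row) for row in x]                             # allele frequencies on locus 1
    q = [sum(x[i][j] for i in range(m)) for j in range(n)]  # allele frequencies on locus 2
    total = 0.0
    for i in range(m):
        for j in range(n):
            d = x[i][j] - p[i] * q[j]                       # Equation (11)
            if d >= 0:
                d_max = min(p[i] * (1 - q[j]), (1 - p[i]) * q[j])
            else:
                # Magnitude of the negative bound (assumed normalization).
                d_max = min(p[i] * q[j], (1 - p[i]) * (1 - q[j]))
            if d_max > 0:
                total += p[i] * q[j] * abs(d / d_max)       # summand of Equation (10)
    return total


# Independent loci: every haplotype frequency is a product of allele frequencies.
independent = [[0.06, 0.09, 0.15],
               [0.14, 0.21, 0.35]]
# Complete association: each allele on locus 1 occurs with exactly one allele on locus 2.
complete = [[0.5, 0.0, 0.0],
            [0.0, 0.3, 0.0],
            [0.0, 0.0, 0.2]]
print(d_prime_multiallelic(independent), d_prime_multiallelic(complete))
```

The two matrices mark the ends of the range: the measure is close to 0 for independent loci and close to 1 under complete association.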

3.2 An application to multi-allelic model

We also consider applying Equation (7) to Equation (10) to obtain a threshold of linkage disequilibrium.

Let the matrix $H$ of all haplotype relative frequencies be as follows, where $m$ is the number of kinds of allele on locus 1 and $n$ is the number of kinds of allele on locus 2.

$$H = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix}$$

Here we pay attention to $x_{11}$ in order to make the calculation of the disequilibrium parameter $D$ easy. Calculating $D_{ij}$ from $x_{ij}$ then follows the same procedure as calculating $D_{11}$ from $x_{11}$. Therefore, let the above matrix be regarded as follows.

$$H = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix} = \begin{pmatrix} x_{11} & X_{1\cdot} \\ X_{\cdot 1} & X_{\mathrm{other}(11)} \end{pmatrix},$$

where

$$X_{1\cdot} = \{x_{12}, \ldots, x_{1n}\},$$

$$X_{\cdot 1} = \{x_{21}, \ldots, x_{m1}\},$$

$$X_{\mathrm{other}(11)} = \{x_{22}, \ldots, x_{2n}, x_{32}, \ldots, x_{3n}, \ldots, x_{m2}, \ldots, x_{mn}\}.$$

And Equation (1) can be transformed as follows.

$$
\begin{aligned}
D &= (p_1 + p_2) D \\
&= (p_1 (p_3 + p_4) + p_2 (p_3 + p_4)) D \\
&= p_1 p_2 p_3 p_4 + p_2 p_4 D + p_1 p_3 D + D^2 - p_1 p_2 p_3 p_4 + p_2 p_3 D + p_1 p_4 D - D^2 \\
&= p_1 p_2 p_3 p_4 + p_2 p_4 D + p_1 p_3 D + D^2 - (p_1 p_2 p_3 p_4 - p_2 p_3 D - p_1 p_4 D + D^2) \\
&= (p_1 \cdot p_3 + D)(p_2 \cdot p_4 + D) - (p_1 \cdot p_4 - D)(p_2 \cdot p_3 - D) \\
&= p_{13} \cdot p_{24} - p_{14} \cdot p_{23} \\
&= P\{X_1 = 1, X_3 = 3\} \cdot P\{X_2 = 2, X_4 = 4\} - P\{X_1 = 1, X_4 = 4\} \cdot P\{X_2 = 2, X_3 = 3\}
\end{aligned}
$$

Then, by the same transformation as for Equation (1), Equation (11) becomes as follows when $i = 1$ and $j = 1$.

$$D_{11} = x_{11} - p_1 q_1 = x_{11} \cdot \sum X_{\mathrm{other}(11)} - \sum X_{1\cdot} \cdot \sum X_{\cdot 1},$$

where

$$\sum X_{1\cdot} = x_{12} + \cdots + x_{1n},$$

$$\sum X_{\cdot 1} = x_{21} + \cdots + x_{m1},$$

$$\sum X_{\mathrm{other}(11)} = x_{22} + \cdots + x_{2n} + x_{32} + \cdots + x_{3n} + \cdots + x_{m2} + \cdots + x_{mn}.$$

Because $D_{11}$ can be calculated in this way, $D_{ij}$ can be calculated by a similar procedure. We then get the following equation for $i = 1, \ldots, m$ and $j = 1, \ldots, n$.

$$D_{ij} = x_{ij} \cdot \sum X_{\mathrm{other}(ij)} - \sum X_{i\cdot} \cdot \sum X_{\cdot j}, \tag{12}$$

where

$$\sum X_{i\cdot} = x_{i1} + \cdots + x_{in} - x_{ij},$$

$$\sum X_{\cdot j} = x_{1j} + \cdots + x_{mj} - x_{ij},$$

$$\sum X_{\mathrm{other}(ij)} = 1 - x_{ij} - \sum X_{i\cdot} - \sum X_{\cdot j}.$$
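The agreement between Equation (11) and Equation (12) can be checked numerically. The following Python sketch compares the two forms for every cell of a small haplotype matrix. The function names and the 3-by-3 matrix are assumptions for illustration, and the matrix entries are assumed to sum to 1.

```python
from typing import List


def d_ij_direct(x: List[List[float]], i: int, j: int) -> float:
    """Equation (11): D_ij = x_ij - p_i * q_j."""
    p_i = sum(x[i])
    q_j = sum(row[j] for row in x)
    return x[i][j] - p_i * q_j


def d_ij_partitioned(x: List[List[float]], i: int, j: int) -> float:
    """Equation (12): D_ij from the partition of H around the cell (i, j)."""
    row_rest = sum(x[i]) - x[i][j]                  # sum of X_{i.}
    col_rest = sum(row[j] for row in x) - x[i][j]   # sum of X_{.j}
    other = 1.0 - x[i][j] - row_rest - col_rest     # sum of X_other(ij)
    return x[i][j] * other - row_rest * col_rest


# Hypothetical haplotype relative frequencies (entries sum to 1).
h = [[0.20, 0.05, 0.05],
     [0.05, 0.25, 0.05],
     [0.05, 0.05, 0.25]]

max_gap = max(
    abs(d_ij_direct(h, i, j) - d_ij_partitioned(h, i, j))
    for i in range(3)
    for j in range(3)
)
print(max_gap)
```

The largest difference between the two forms is at the level of floating-point rounding, as expected from the identity $x_{ij} - (x_{ij} + \sum X_{i\cdot})(x_{ij} + \sum X_{\cdot j}) = x_{ij} \sum X_{\mathrm{other}(ij)} - \sum X_{i\cdot} \sum X_{\cdot j}$.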

Equation (12) is substituted into Equation (4):

$$D_{ij}(f_{ij}) = \frac{p_i (1 - q_j) + (1 - p_i) q_j + f_{ij} - 1}{2},$$

where $p_i$ is the relative frequency of the $i$th allele on locus 1, $q_j$ is the relative frequency of the $j$th allele on locus 2, and

$$f_{ij} = (x_{ij} + D_{ij}) + \left(\sum X_{\mathrm{other}(ij)} + D_{ij}\right) = x_{ij} + \sum X_{\mathrm{other}(ij)} + 2 D_{ij}. \tag{13}$$

Then,

$$D'_{ij}(f_{ij}) = \frac{p_i (1 - q_j) + (1 - p_i) q_j + f_{ij} - 1}{2 D_{\max}}. \tag{14}$$

Equation (14) is substituted into Equation (10):

$$D'_{\mathrm{multiallelic}}(f) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i q_j \left| D'_{ij}(f_{ij}) \right| \tag{15}$$

We can use the threshold $D'_{\mathrm{multiallelic}}(f)$ when analyzing multiallelic DNA data, for example microsatellite markers.

Acknowledgment

The present study was supported by a grant from the Pache Research Subsidy I-A-2, Nanzan University (2006). It was also supported by a grant from The Hori Information Science Promotion Foundation.

References

[G02] Gabriel S.B. et al. (2002). The structure of haplotype blocks in the human genome. Science. 296, 2225-2229.

[H87] Hedrick P.W. (1987). Gametic Disequilibrium Measures: Proceed With Caution. Genetics. 117(2), 331-341.

[K01] Kamatani N. (ed.) (2001). Statistical Genetics in Post-Genome (in Japanese). Yodosha, Japan.

[KSKISKIKHN04] Kamatani N., Sekine A., Kitamoto T., Iida A., Saito S., Kogame A., Inoue E., Kawamoto M., Harigai M. and Nakamura Y. (2004). Large-scale single-nucleotide polymorphism (SNP) and haplotype analyses, using dense SNP maps, of 199 drug-related genes in 752 subjects: the analysis of the association between uncommon SNPs within haplotype blocks and the haplotypes constructed with haplotype-tagging SNPs. American Journal of Human Genetics. 75(2), 190-203.

[KMKMMTK02] Kitamura Y., Moriguchi M., Kaneko H., Morisaki H., Morisaki T., Toyama K. and Kamatani N. (2002). Determination of probability distribution of diplotype configuration (diplotype distribution) for each subject from genotypic data using the EM algorithm. Annals of Human Genetics. 66(3), 183-193.

[T02] Tomita M. (2002). A Relationship Between a Contingency Table and a Linkage Disequilibrium. Proceedings of The 4th Conference of the Asian Regional Section of the International Association for Statistical Computing. 70-73.

[T03] The International HapMap Consortium (2003). The International HapMap Project. Nature. 426(6968), 789-796.

[TT05] Tomita M. and Takemura R. (2005). A Threshold to Identify Haplotype Blocks on Various DNA Data. Proceedings of The 5th IASC Asian Conference on Statistical Computing. 171-174.

[ZCR01] Zapata C., Carollo C. and Rodriguez S. (2001). Sampling variance and distribution of the $D'$ measure of overall gametic disequilibrium between multiallelic loci. Annals of Human Genetics. 65(4), 395-406.

[ZYCLICWC03] Zhu X., Yan D., Cooper R.S., Luke A., Ikeda M.A., Chang Y.P., Weder A. and Chakravarti A. (2003). Linkage disequilibrium and haplotype diversity in the genes of the renin-angiotensin system: findings from the family blood pressure program. Genome Research. 13, 173-181.
