Statistical Analysis of Molecular Data


Analysis of Molecular Data

Anas and Tomohiko Yoshida

(Faculty of Agriculture, Utsunomiya University)

For details, ask Anas (anasyayak@yahoo.com).

I. Overview

The development of PCR methods has greatly widened the use of molecular techniques for detecting genetic variation in natural populations. Several molecular techniques are commonly used in genetic diversity studies, namely Amplified Fragment Length Polymorphisms (AFLPs), Random Amplified Polymorphic DNA (RAPDs), Restriction Fragment Length Polymorphisms (RFLPs), and Simple Sequence Repeats (SSRs). We do not discuss here which technique is most applicable to genetic diversity studies; we outline only the statistical analysis of molecular data. More detailed explanations of the molecular techniques can be found in a number of papers (e.g. Silva and Russo, 2000; Karp et al., 1996).

Three main steps are involved in the statistical analysis of molecular data in diversity studies:

1. Data collection.

Entry of band information into the computer. This can be done manually, or the bands can be read directly from the gel by a computer with suitable software. We usually use Image Reader Software Version 1.01 (Fuji Photo Film Co., Ltd.) to read bands from a gel, and Science Lab. 2001 – Image Gauge Ver. 4.0 software (Fuji Photo Film Co., Ltd.) to determine the size of DNA fragments.

Scoring of bands. For dominant markers (such as RAPDs or AFLPs), which cannot distinguish homozygotes from heterozygotes, bands (alleles) are scored as presence/absence (1 or 0). For codominant markers (such as SSRs, i.e. microsatellites/minisatellites, or RFLPs), in which heterozygotes yield two bands and homozygotes yield one band, alleles can be coded simply as their integer size in base pairs.

SSR and RFLP alleles (bands) can also be scored as presence/absence, but some genetic information will then be lost (such as the F coefficient, $G_{ST}$, $H_T$, $H_S$, etc.).
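The two scoring schemes can be illustrated with a minimal Python sketch. The genotypes below are hypothetical, and the function name `score_presence_absence` is introduced here only for illustration. The size-coded genotypes keep allele identity per sample, while the 0/1 matrix only records which bands are visible.

```python
def score_presence_absence(genotypes):
    """Convert size-coded genotypes to a 0/1 band matrix.

    genotypes: list of tuples of allele sizes in bp, one tuple per sample.
    Returns (bands, matrix): bands is the sorted list of observed sizes,
    and matrix[s][b] is 1 if sample s shows band b, otherwise 0.
    """
    bands = sorted({size for g in genotypes for size in g})
    matrix = [[1 if band in g else 0 for band in bands] for g in genotypes]
    return bands, matrix


# Hypothetical codominant (SSR) genotypes at one locus, sizes in bp:
# a 250/250 homozygote, a 250/260 heterozygote, a 260/260 homozygote.
genotypes = [(250, 250), (250, 260), (260, 260)]
bands, matrix = score_presence_absence(genotypes)
print(bands)   # [250, 260]
print(matrix)  # [[1, 0], [1, 1], [0, 1]]
```

Allele frequencies can be counted directly from the size-coded tuples; from the 0/1 matrix alone the number of copies of each allele is no longer recorded.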

2. Data analysis, which can be grouped into two main categories:

F coefficient and other differentiation measures, usually applied in intra-population analysis.

Genetic distance and analysis of molecular variance (AMOVA), usually used in inter-population analysis.

3. Interpretation of the data.

Many software packages implementing different analytical methods can be downloaded from the internet. Some of the packages most commonly used in population genetics and related topics are:

1. AMOVA (Analysis of Molecular Variance)
2. Arlequin http://lgb.unige.ch/arlequin/
3. GDA (Genetic Data Analysis)
4. GENEPOP http://wbiomed.curtin.edu.au/genepop/
5. Phylip http://evolution.genetics.washington.edu/phylip/getme.html
6. Microsatellite http://oscar.gen.tcd.ie/~sdepark/ms-toolkit/
7. TreeView http://taxonomy.zoology.gla.ac.uk/rod/treeview.html

Almost all of them can be obtained free of charge. Other software can be found at http://courses.washington.edu/fish543/Software.htm.

To avoid confusion about methodology, we focus only on SSR data analysis. First we discuss how to calculate the F coefficient using Excel. Calculation using statistical software packages is discussed in the last part of this paper.

II. Data analysis

1. Coefficient of gene differentiation for one level of structure for the total population.

$$G_{ST} = \frac{H_T - H_S}{H_T}$$ (Nei, 1973)

$G_{ST}$ is a measure of the relative differentiation among subpopulations.

$H_T$ is the heterozygosity in the total population: $H_T = 1 - \sum_i p_i^2$ for any number of alleles, where $p_i$ is the allele frequency.

$H_S$ is the average heterozygosity within subpopulations (see Example 1).
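As a minimal sketch of this one-level calculation, assuming $G_{ST} = (H_T - H_S)/H_T$ with $H = 1 - \sum_i p_i^2$ and equally weighted subpopulations, the following Python function computes $G_{ST}$ from subpopulation allele frequencies. The function names and the two-subpopulation frequencies are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def heterozygosity(freqs):
    """Hardy-Weinberg heterozygosity, 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def g_st(subpop_freqs):
    """G_ST for equally weighted subpopulations.

    subpop_freqs: list of allele-frequency lists, one per subpopulation,
    all listing the same alleles in the same order.
    """
    n = len(subpop_freqs)
    # H_S: average of the within-subpopulation heterozygosities.
    h_s = sum(heterozygosity(f) for f in subpop_freqs) / n
    # H_T: heterozygosity of the mean allele frequencies.
    mean_freqs = [sum(column) / n for column in zip(*subpop_freqs)]
    h_t = heterozygosity(mean_freqs)
    return (h_t - h_s) / h_t


# Hypothetical data: two subpopulations, two alleles.
example = [[0.8, 0.2], [0.2, 0.8]]
print(round(g_st(example), 3))  # 0.36
```

Here each subpopulation has H = 0.32, so $H_S$ = 0.32, while the mean frequencies (0.5, 0.5) give $H_T$ = 0.5.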

2. Coefficient of gene differentiation for more than one level of structure for the total population.

For example, subpopulations may be grouped into regions, or divided into colonies (Weir and Cockerham, 1984; Nei, 1987). (See Example 1.)

$$F_{SR} = \frac{H_R - H_S}{H_R}$$

partitions the variation into the diversity among subpopulations within a region. Here $H_R$ is the mean heterozygosity of the regions, calculated from the mean allelic frequencies within each region.

$$F_{RT} = \frac{H_T - H_R}{H_T}$$

partitions the variation among regions for the total population.

Example 1.

Table 1 gives the allelic frequencies at microsatellite locus Sb1-10 for two groups of sorghum germplasm, each containing two subpopulations. The two groups are Breeding Materials and Japanese Cultivated Sorghum. Breeding Materials consists of two subpopulations (Utsunomiya Univ. and ICRISAT). Japanese Cultivated Sorghum is divided into Japan accessions and USA accessions.

Task:

1. Determine the proportion of variation among subpopulations within regions.
2. Determine the proportion of variation among regions.

To do this, first calculate the Hardy-Weinberg heterozygosity within each of the four subpopulations (H, given in the rightmost column). For example, H for Utsunomiya Univ. = 1 - (0.6² + 0.4²) = 0.48. The mean of these four values is $H_S$ = (0.480 + 0.645 + 0.870 + 0.870)/4 = 0.716.

Next, calculate the mean allelic frequencies within each germplasm group (region). Using these mean allelic frequencies, the Hardy-Weinberg heterozygosity for each region can be calculated (also in the rightmost column of Table 1). For example, H for the Breeding Materials group = 1 - (0.527² + 0.336² + 0.136²) = 0.590. The weighted mean of these values is $H_R$ = [2(0.590) + 2(0.888)]/4 = 0.739.

Finally, calculate the mean frequency of each allele over all four subpopulations. For example, the total mean for allele 180 (band size 180 bp) = (0.103 + 0.206)/4 = 0.077. The mean frequencies of all alleles are shown in the bottom row, and the Hardy-Weinberg heterozygosity of these frequencies is $H_T$ = 1 - (0.077² + 0.360² + … + 0.006²) = 0.798.

With these values, the proportion of variation among subpopulations within regions is $F_{SR}$ = (0.739 - 0.716)/0.739 = 0.031, and the proportion of variation among regions is $F_{RT}$ = (0.798 - 0.739)/0.798 = 0.074. The proportion of variation among regions was higher (average over five loci: 0.042) than that among subpopulations (average over five loci: 0.027).
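The arithmetic of Example 1 can be retraced with a short Python sketch. It is only an illustration of the steps described above, not the Excel procedure itself: the frequencies are transcribed from Table 1, alleles marked absent are treated as frequency 0, and the names `GROUPS`, `heterozygosity`, `mean_frequencies` and `hierarchical_f` are introduced here for convenience.

```python
# Allele frequencies at locus Sb1-10, transcribed from Table 1.
# Alleles marked absent in the table are omitted (treated as frequency 0).
GROUPS = {
    "Breeding Materials": {
        "Utsunomiya Univ.": {190: 0.600, 200: 0.400},
        "ICRISAT": {190: 0.455, 200: 0.273, 210: 0.273},
    },
    "Japanese Cultivated Sorghum": {
        "Japan accessions": {
            180: 0.103, 190: 0.179, 200: 0.179, 210: 0.026, 220: 0.051,
            230: 0.154, 240: 0.026, 250: 0.026, 260: 0.026, 460: 0.154,
            470: 0.026, 610: 0.026, 725: 0.026,
        },
        "USA accessions": {
            180: 0.206, 190: 0.206, 200: 0.029, 210: 0.059, 220: 0.029,
            230: 0.118, 240: 0.029, 250: 0.118, 260: 0.000, 400: 0.029,
            440: 0.029, 450: 0.088, 460: 0.029, 470: 0.029,
        },
    },
}


def heterozygosity(freqs):
    """Hardy-Weinberg heterozygosity, 1 - sum(p_i^2), for an {allele: freq} dict."""
    return 1.0 - sum(p * p for p in freqs.values())


def mean_frequencies(freq_dicts):
    """Unweighted mean frequency of every allele across the given dicts."""
    alleles = set().union(*freq_dicts)
    n = len(freq_dicts)
    return {a: sum(f.get(a, 0.0) for f in freq_dicts) / n for a in alleles}


def hierarchical_f(groups):
    """Return H_S, H_R, H_T, F_SR and F_RT for regions made of subpopulations."""
    subpops = [f for region in groups.values() for f in region.values()]
    # H_S: mean heterozygosity of the subpopulations.
    h_s = sum(heterozygosity(f) for f in subpops) / len(subpops)
    # H_R: region heterozygosities, weighted by subpopulations per region.
    h_r = sum(
        len(region) * heterozygosity(mean_frequencies(list(region.values())))
        for region in groups.values()
    ) / len(subpops)
    # H_T: heterozygosity of the mean frequencies over all subpopulations.
    h_t = heterozygosity(mean_frequencies(subpops))
    return {
        "H_S": h_s, "H_R": h_r, "H_T": h_t,
        "F_SR": (h_r - h_s) / h_r,
        "F_RT": (h_t - h_r) / h_t,
    }


for name, value in hierarchical_f(GROUPS).items():
    print(f"{name} = {value:.3f}")
```

Rounded to three decimals, the output agrees with the values in the text (0.716, 0.739, 0.798, 0.031 and 0.074). Like the worked example, the sketch uses unweighted means over subpopulations and applies no sample-size correction.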

Note: we calculated these values for practice purposes and included all faint or shadow bands (alleles) in the analysis. We did not apply the correction for sample size suggested by Nei and Roychoudhury (1974). The unbiased estimate of the expected heterozygosity at a locus (if N < 50) is:

$$H_E = \frac{2N}{2N - 1}\left(1 - \sum_i p_i^2\right)$$

where N is the sample size and $p_i$ is the allele frequency. For references and another example, see Hedrick, P.W. 2000. Genetics of Populations. Jones and Bartlett.
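A minimal sketch of the sample-size correction, assuming it has the form $H_E = \frac{2N}{2N-1}(1 - \sum_i p_i^2)$ with N the number of sampled individuals. The function name `unbiased_heterozygosity` is hypothetical; the example reuses the Utsunomiya Univ. frequencies and sample size from Table 1.

```python
def unbiased_heterozygosity(freqs, n_individuals):
    """Expected heterozygosity with the 2N / (2N - 1) sample-size correction.

    freqs: allele frequencies at one locus (should sum to about 1).
    n_individuals: sample size N.
    """
    two_n = 2 * n_individuals
    uncorrected = 1.0 - sum(p * p for p in freqs)
    return two_n / (two_n - 1) * uncorrected


# Utsunomiya Univ. subpopulation: frequencies 0.6 and 0.4, N = 40.
print(round(unbiased_heterozygosity([0.6, 0.4], 40), 3))  # 0.486
```

The uncorrected value for this subpopulation is 0.480, so the correction is small at N = 40 and shrinks further as N grows.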

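Table 1. The frequency of alleles at the microsatellite locus for the sorghum populations from two germplasm groups; "-" indicates that the allele is absent. The numeral in parentheses after each population name is the sample size of that subpopulation.

| Population | 180 | 190 | 200 | 210 | 220 | 230 | 240 | 250 | 260 | 400 | 440 | 450 | 460 | 470 | 610 | 725 | H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Breed. Mat. | | | | | | | | | | | | | | | | | |
| Utsu. Univ. (40) | - | 0.600 | 0.400 | - | - | - | - | - | - | - | - | - | - | - | - | - | 0.480 |
| Icrisat (49) | - | 0.455 | 0.273 | 0.273 | - | - | - | - | - | - | - | - | - | - | - | - | 0.645 |
| Mean | - | 0.527 | 0.336 | 0.136 | - | - | - | - | - | - | - | - | - | - | - | - | 0.590 |
| Japan Cul. | | | | | | | | | | | | | | | | | |
| Japan Acc. (103) | 0.103 | 0.179 | 0.179 | 0.026 | 0.051 | 0.154 | 0.026 | 0.026 | 0.026 | - | - | - | 0.154 | 0.026 | 0.026 | 0.026 | 0.870 |
| USA Acc. (131) | 0.206 | 0.206 | 0.029 | 0.059 | 0.029 | 0.118 | 0.029 | 0.118 | 0.000 | 0.029 | 0.029 | 0.088 | 0.029 | 0.029 | - | - | 0.870 |
| Mean | 0.154 | 0.193 | 0.104 | 0.042 | 0.040 | 0.136 | 0.028 | 0.072 | 0.013 | 0.015 | 0.015 | 0.044 | 0.092 | 0.028 | 0.013 | 0.013 | 0.888 |
| Total mean | 0.077 | 0.360 | 0.220 | 0.089 | 0.020 | 0.068 | 0.014 | 0.036 | 0.006 | 0.007 | 0.007 | 0.022 | 0.046 | 0.014 | 0.006 | 0.006 | 0.798 |

Column headings are allele sizes in bp. Each H in the rightmost column is $H_E = 1 - \sum_i p_i^2$ for that row. The Total mean entry for allele 180 is (0.103 + 0.206)/4; $H_T = 1 - \sum_i p_i^2$ is calculated from the Total mean row; $H_S$ is the mean of the H values of the four subpopulations.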

Note: Analysis of F statistics using the Dispan statistical program is discussed in the last part of this paper.

3. Making a phylogenetic tree or dendrogram

To make a phylogenetic tree or dendrogram, we need at least four different software packages (four steps). All of this software can be obtained free of charge from the internet. There is also software that can do everything in one package, but it is usually not shareware and can cost a lot of money.

A. MICROSAT software. This software is needed to make the input data file. It can also calculate observed heterozygosity, expected heterozygosity and other genetic parameters if your data have only one band/allele. Microsat runs inside Excel as an add-in. For installation, see the instruction manual from its author.

Point: Each statistical software package usually needs a different input data file, so we must first prepare an input file appropriate for the software we have. The common input data files are text files (example: sorghum.txt, with the txt extension) and dat files (example: sorghum.dat, with the dat extension).

Entering bands into the computer was explained in the Overview.

Scoring of bands: simply code alleles as their integer size in base pairs (e.g. 250 base pairs).

Prepare the input data in Excel for Microsat as follows:

a. Arrange your data in Excel according to allele size (the example uses haploid data).

[Figure: example Excel worksheet. Labels: locus (SSR marker); allele (e.g. A4 = 310 bp); haplotype; Pop A; Pop B.]

b. Click Tool, select Microsatellite, then select Microsatellite Toolkit. Select the input data format (in this case, haploid data). Uncheck data error and press OK.

c. The next screen confirms the number of alleles and the number of loci. You can exclude samples (alleles) or loci in this section. If you want to analyze all populations and loci, just click OK.

[Figure: Microsatellite Toolkit screens with numbered callouts 1 to 4.]

d. The next screen shows the format options and F statistics. Check the Arlequin - Microsatellite format, because we will use the Arlequin software in the next step. At the bottom, check "treat population separately".

[Figure: format option screen with numbered callouts 5 to 8.]

e. Microsatellite automatically creates a new Excel worksheet named Arlequin. Replace all question marks with zeros.

Point: If you do not replace the question marks with zeros, Arlequin will treat them as missing data.

f. Click Tool, select Microsatellite, then select Save Sheet as Textfile.

Point: Stay in the Arlequin worksheet. When the program asks for a file name, enter a name with the arp extension (example: sorghum.arp). Save the file in the same folder as the Arlequin program.

B. ARLEQUIN software. This software is needed to make the genetic distance matrix. The method we apply is the estimation of distances between standard haplotypes as the number of pairwise differences (see pages 82-83 of the program manual). For installation, see the instruction manual from the authors.

a. Open Arlequin, click open project, and browse to the file that we made with the Microsatellite software. Follow the instructions in the image below.

[Figure: Arlequin screenshot showing the selected file, with numbered callouts 1 to 11 marking the settings.]

b. Arlequin saves the computation results in a folder named sorghum.res. The results are in the file sorghum_main.html inside this folder. See the image below.

[Figure: Arlequin result page. Callout: copy the selected area and save it as a text file using Windows Notepad.]

c. Copy the distance matrix by right-clicking with the mouse, open Notepad (found in Accessories), paste the matrix, and save it (example: sorghum tree.txt) in the Phylip program folder. You must rearrange the data in the text file before it is ready for Phylip (see the example input data supplied with Phylip).

Point: you can use either the upper-triangular part of the data matrix (above the diagonal) or the full data matrix. You will get a different tree, and you can decide which one is most reliable for your data.
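To make the distance step concrete, here is a minimal Python sketch that counts pairwise differences between haplotypes and writes the result as a square PHYLIP distance matrix (first line: number of taxa; then one row per taxon, with the name padded to 10 characters). It is not Arlequin's implementation, the haplotypes are hypothetical, and the function names are introduced only for illustration.

```python
def pairwise_differences(h1, h2):
    """Number of loci at which two haplotypes carry different alleles."""
    return sum(1 for a, b in zip(h1, h2) if a != b)


def distance_matrix(haplotypes):
    """haplotypes: {name: tuple of allele sizes, one entry per locus}."""
    names = list(haplotypes)
    matrix = [
        [pairwise_differences(haplotypes[a], haplotypes[b]) for b in names]
        for a in names
    ]
    return names, matrix


def to_phylip_square(names, matrix):
    """Format a full (square) distance matrix as PHYLIP input text."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        # PHYLIP expects the taxon name in the first 10 characters.
        lines.append(name[:10].ljust(10) + " " + " ".join(f"{d:.4f}" for d in row))
    return "\n".join(lines) + "\n"


# Hypothetical haplotypes scored at four SSR loci (allele sizes in bp).
haplotypes = {
    "Can1": (250, 310, 180, 200),
    "Can2": (250, 310, 190, 200),
    "H13": (260, 300, 190, 210),
}
names, matrix = distance_matrix(haplotypes)
print(to_phylip_square(names, matrix), end="")
```

Writing the full square matrix avoids having to tell the tree-building program which triangle of the matrix is present.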

C. PHYLIP program. This program is needed to make the tree or dendrogram. The method we apply is the neighbor-joining method of Saitou and Nei (1987). For installation, see the instruction manual from the authors.

a. Open the Neighbor program and type the file name of the matrix data. If you accept all the settings in this program, type Y (yes) and press Enter. The program will then build the tree.

[Figure: Neighbor program screen. Callouts: "Type file name here"; "If you use a square data matrix, you do not need to set this section."]

b. The results are written to the files outfile and outtree. You can view the tree directly in outfile using Windows Notepad. If you want to view it with other tree software, use the outtree file as the source.
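For readers who want to see what the neighbor-joining step does, the following Python sketch implements the basic algorithm on a square distance matrix and returns the tree in Newick format, the kind of text stored in outtree. It is a simplified illustration and not PHYLIP's code; the function name `neighbor_joining` and the four-taxon test distances are hypothetical.

```python
def neighbor_joining(names, dist):
    """Build an unrooted neighbor-joining tree and return it as Newick text.

    names: list of taxon names; dist: square matrix (list of lists) of
    distances in the same order. Requires at least three taxa.
    """
    if len(names) < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    nodes = list(names)
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair (a, b) that minimizes the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from a and b to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # Join the last three nodes at a central point.
    x, y, z = nodes
    len_x = 0.5 * (d[x][y] + d[x][z] - d[y][z])
    len_y = 0.5 * (d[x][y] + d[y][z] - d[x][z])
    len_z = 0.5 * (d[x][z] + d[y][z] - d[x][y])
    return f"({x}:{len_x:.4f},{y}:{len_y:.4f},{z}:{len_z:.4f});"


# Hypothetical additive distances for four taxa.
names = ["A", "B", "C", "D"]
dist = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
print(neighbor_joining(names, dist))
# (C:3.0000,D:4.0000,(A:1.0000,B:2.0000):1.0000);
```

The sketch breaks ties by taking the first minimal pair and does no input checking beyond the taxon count, so results on real data should be compared against the Neighbor program's output.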

D. TREEVIEW software. This program is needed to view the tree or dendrogram. For installation, see the instruction manual from the authors. Open TreeView, click File, then open file. Browse to the outtree file in the Phylip folder (you can also put an outtree file directly into TreeView) and you will see the tree/dendrogram (see image below), which looks much better than the tree display in Phylip.

[Figure: tree displayed in TreeView. Labels: Can 1, Can 2, H13, H11, C9, C8; groups: Elite inbred lines, U.S.A. pure lines, Japan pure lines.]

Note: We re-rooted the original tree at the midpoint using the Retree program in the Phylip package.
