Composed by Zahid Hussain Gopang, 2k10-IT. Attempt any five questions. Q.1 and Q.2 (objective and MCQs) are not compulsory; attempt them only if you can solve them within 20 minutes.

Q.3: What is Statistics and its role in IT?
Answer: Statisticians apply statistical thinking and methods to a wide variety of scientific, social, and business endeavours, in areas such as astronomy, biology, education, economics, engineering, genetics, marketing, medicine, psychology, public health, and sports, among many others. "The best thing about being a statistician is that you get to play in everyone else's backyard." Statistics is the science of learning from data, and of measuring, controlling, and communicating uncertainty; it thereby provides the navigation essential for controlling the course of scientific and societal advances. Statistics is a profession related to information, so its success depends on the way in which we are able to collect, process, safeguard, and disseminate that information. Having entered the 21st century, the management of statistical information is unimaginable without the support of adequate tools of modern information technology.

Q.4: Representation of data by charts?
The answer to Q.4 is available on the website as a downloadable MS PowerPoint file; click on Q.4 to download it.

Q.5: Frequency distribution tables
Answer: The frequency (f) of a particular observation is the number of times the observation occurs in the data. The distribution of a variable is the pattern of frequencies of the observations. Frequency distributions are portrayed as frequency tables, histograms, or polygons. Frequency distributions can show either the actual number of observations falling in each range or the percentage of observations; in the latter case, the distribution is called a relative frequency distribution. Frequency distribution tables can be used for both categorical and numeric variables. Continuous variables should only be used with class intervals, which are explained below.

Example 1 – Constructing a frequency distribution table
A survey was taken on Maple Avenue. In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Use the following steps to present these data in a frequency distribution table.
1. Divide the results (x) into intervals, and then count the number of results in each interval. In this case, the intervals are the number of households with no car (0), one car (1), two cars (2), and so forth.
2. Make a table with separate columns for the interval numbers (the number of cars per household), the tallied results, and the frequency of results in each interval. Label these columns Number of cars, Tally and Frequency.
3. Read the list of data from left to right and place a tally mark in the appropriate row. For example, the first result is a 1, so place a tally mark in the row beside where 1 appears in the interval column (Number of cars). The next result is a 2, so place a tally mark in the row beside the 2, and so on. When you reach your fifth tally mark, draw a tally line through the preceding four marks to make your final frequency calculations easier to read.
4. Add up the number of tally marks in each row and record them in the final column, entitled Frequency.
Your frequency distribution table for this exercise should look like this:
Table 1. Frequency table for the number of cars registered in each household

    Number of cars (x)    Frequency (f)
    0                     4
    1                     6
    2                     5
    3                     3
    4                     2

(The hand-drawn Tally column is omitted here.) By looking at this frequency distribution table quickly, we can see that out of the 20 households surveyed, 4 households had no cars, 6 households had 1 car, and so on.

Example 2 – Constructing a cumulative frequency distribution table
A cumulative frequency distribution table is a more detailed table. It looks almost the same as a frequency distribution table, but it has added columns that give the cumulative frequency and the cumulative percentage of the results as well.
At a recent chess tournament, all 10 of the participants had to fill out a form that gave their names, address and age. The ages of the participants were recorded as follows:
36, 48, 54, 92, 57, 63, 66, 76, 66, 80
Use the following steps to present these data in a cumulative frequency distribution table.
1. Divide the results into intervals, and then count the number of results in each interval. In this case, intervals of 10 are appropriate. Since 36 is the lowest age and 92 is the highest age, start the intervals at 35 to 44 and end the intervals with 85 to 94.
2. Create a table similar to the frequency distribution table but with three extra columns.
   o In the first column, the Lower value column, list the lower value of each result interval. For example, in the first row you would put the number 35.
   o The next column is the Upper value column. Place the upper value of each result interval here. For example, you would put the number 44 in the first row.
   o The third column is the Frequency column. Record the number of times a result appears between the lower and upper values. In the first row, place the number 1.
   o The fourth column is the Cumulative frequency column. Here we add the cumulative frequency of the previous row to the frequency of the current row. Since this is the first row, the cumulative frequency is the same as the frequency. In the second row, however, the frequency for the 35–44 interval (i.e., 1) is added to the frequency for the 45–54 interval (i.e., 2). Thus the cumulative frequency is 1 + 2 = 3, meaning we have 3 participants in the 35 to 54 age group.
   o The next column is the Percentage column. In this column, list the percentage of the frequency. To do this, divide the frequency by the total number of results and multiply by 100. In this case, the frequency of the first row is 1 and the total number of results is 10, so the percentage is (1 ÷ 10) X 100 = 10.0.
   o The final column is the Cumulative percentage column. In this column, divide the cumulative frequency by the total number of results and multiply by 100 to make a percentage. Note that the last number in this column should always equal 100.0. In this example, the cumulative frequency of the first row is 1 and the total number of results is 10, so the cumulative percentage of the first row is (1 ÷ 10) X 100 = 10.0.
3. The cumulative frequency distribution table should look like this:

Table 2. Ages of participants at a chess tournament

    Lower value    Upper value    Frequency (f)    Cumulative frequency    Percentage    Cumulative percentage
    35             44             1                1                       10.0          10.0
    45             54             2                3                       20.0          30.0
    55             64             2                5                       20.0          50.0
    65             74             2                7                       20.0          70.0
    75             84             2                9                       20.0          90.0
    85             94             1                10                      10.0          100.0

For more information on how to make cumulative frequency tables, see the section on Cumulative frequency and Cumulative percentage.
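As a quick cross-check of Tables 1 and 2, here is a minimal Python sketch (not part of the original exam answer) that tallies the car data and prints frequency, cumulative frequency, percentage, and cumulative percentage columns; only the standard library is assumed.

    from collections import Counter

    # Cars registered in each of the 20 Maple Avenue households (Example 1)
    cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

    freq = Counter(cars)          # frequency f of each value x
    total = len(cars)

    cumulative = 0
    print(f"{'x':>3} {'f':>3} {'cum f':>6} {'%':>7} {'cum %':>7}")
    for x in sorted(freq):
        f = freq[x]
        cumulative += f
        print(f"{x:>3} {f:>3} {cumulative:>6} "
              f"{100 * f / total:>7.1f} {100 * cumulative / total:>7.1f}")

The same loop applied to the chess-tournament ages (after binning them into the 35–44, 45–54, ... intervals) reproduces Table 2.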
Class intervals
If a variable takes a large number of values, it is easier to present and handle the data by grouping the values into class intervals. Continuous variables are more likely to be presented in class intervals, while discrete variables can be grouped into class intervals or not. To illustrate, suppose we set out age ranges for a study of young people, while allowing for the possibility that some older people may also fall into the scope of our study.
The frequency of a class interval is the number of observations that occur in a particular predefined interval. So, for example, if 20 people aged 5 to 9 appear in our study's data, the frequency for the 5–9 interval is 20.
The endpoints of a class interval are the lowest and highest values that a variable can take. So, the intervals in our study are 0 to 4 years, 5 to 9 years, 10 to 14 years, 15 to 19 years, 20 to 24 years, and 25 years and over. The endpoints of the first interval are 0 and 4 if the variable is discrete, and 0 and 4.999 if the variable is continuous. The endpoints of the other class intervals are determined in the same way.
Class interval width is the difference between the lower endpoint of an interval and the lower endpoint of the next interval. Thus, if our study's continuous intervals are 0 to 4, 5 to 9, and so on, the width of the first five intervals is 5, and the last interval is open, since no higher endpoint is assigned to it. The intervals could also be written as 0 to less than 5, 5 to less than 10, 10 to less than 15, 15 to less than 20, 20 to less than 25, and 25 and over.

Rules for data sets that contain a large number of observations
In summary, follow these basic rules when constructing a frequency distribution table for a data set that contains a large number of observations:
- find the lowest and highest values of the variables
- decide on the width of the class intervals
- include all possible values of the variable.
In deciding on the width of the class intervals, you will have to find a compromise: the intervals should be short enough that not all of the observations fall in the same interval, but long enough that you do not end up with only one observation per interval. It is also important to make sure that the class intervals are mutually exclusive.

Example 3 – Constructing a frequency distribution table for a large number of observations
Thirty AA batteries were tested to determine how long they would last. The results, to the nearest minute, were recorded as follows:
423, 369, 387, 411, 393, 394, 371, 377, 389, 409, 392, 408, 431, 401, 363, 391, 405, 382, 400, 381, 399, 415, 428, 422, 396, 372, 410, 419, 386, 390
Use the steps in Example 1 and the above rules to help you construct a frequency distribution table.
Answer
The lowest value is 363 and the highest is 431. Using the given data and a class interval width of 10, the interval for the first class is 360 to 369 and includes 363 (the lowest value). Remember, there should always be enough class intervals so that the highest value is included. The completed frequency distribution table should look like this:

Table 3. Life of AA batteries, in minutes

    Battery life, minutes (x)    Frequency (f)
    360–369                      2
    370–379                      3
    380–389                      5
    390–399                      7
    400–409                      5
    410–419                      4
    420–429                      3
    430–439                      1
    Total                        30

(The hand-drawn Tally column is omitted here.)
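The binning in Example 3 can be sketched in Python as follows (an illustration, not part of the original answer); the interval arithmetic assumes, as in the answer above, classes of width 10 starting at 360.

    # Battery lifetimes to the nearest minute (Example 3)
    lifetimes = [423, 369, 387, 411, 393, 394, 371, 377, 389, 409,
                 392, 408, 431, 401, 363, 391, 405, 382, 400, 381,
                 399, 415, 428, 422, 396, 372, 410, 419, 386, 390]

    width, start = 10, 360
    counts = {}
    for t in lifetimes:
        lo = start + width * ((t - start) // width)   # lower endpoint of t's class
        counts[lo] = counts.get(lo, 0) + 1

    for lo in sorted(counts):                         # reproduces Table 3
        print(f"{lo}-{lo + width - 1}: {counts[lo]}")
    print("Total:", sum(counts.values()))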
Relative frequency and percentage frequency
An analyst studying these data might want to know not only how long batteries last, but also what proportion of the batteries falls into each class interval of battery life. The relative frequency of a particular observation or class interval is found by dividing the frequency (f) by the number of observations (n): that is, f ÷ n. Thus:
Relative frequency = frequency ÷ number of observations
The percentage frequency is found by multiplying each relative frequency value by 100. Thus:
Percentage frequency = relative frequency X 100 = (f ÷ n) X 100

Example 4 – Constructing relative frequency and percentage frequency tables
Use the data from Example 3 to make a table giving the relative frequency and percentage frequency of each interval of battery life. Here is what that table looks like:

Table 4. Life of AA batteries, in minutes

    Battery life, minutes (x)    Frequency (f)    Relative frequency    Percentage frequency
    360–369                      2                0.07                  7
    370–379                      3                0.10                  10
    380–389                      5                0.17                  17
    390–399                      7                0.23                  23
    400–409                      5                0.17                  17
    410–419                      4                0.13                  13
    420–429                      3                0.10                  10
    430–439                      1                0.03                  3
    Total                        30               1.00                  100

An analyst of these data could now say that 7% of AA batteries have a life of from 360 minutes up to but less than 370 minutes, and that the probability of any randomly selected AA battery having a life in this range is approximately 0.07. Keep in mind that these analytical statements assume that a representative sample was drawn. In the real world, an analyst would also refer to an estimate of variability (see the section titled Measures of spread) to complete the analysis. For our purposes, however, it is enough to know that frequency distribution tables can provide important information about the population from which a sample was drawn.

In statistics, a frequency distribution is an arrangement of the values that one or more variables take in a sample. Each entry in the table contains the frequency or count of the occurrences of values within a particular group or interval, and in this way the table summarizes the distribution of values in the sample.

Univariate frequency tables

    Rank    Degree of agreement    Number
    1       Strongly agree         20
    2       Agree somewhat         30
    3       Not sure               20
    4       Disagree somewhat      15
    5       Strongly disagree      15

A different tabulation scheme aggregates values into bins such that each bin encompasses a range of values. For example, the heights of the students in a class could be organized into the following frequency table.

    Height range          Number of students    Cumulative number
    less than 5.0 feet    25                    25
    5.0–5.5 feet          35                    60
    5.5–6.0 feet          20                    80
    6.0–6.5 feet          20                    100

A frequency distribution shows us a summarized grouping of data divided into mutually exclusive classes and the number of occurrences in each class. It is a way of organizing otherwise unorganized data, e.g. to show the results of an election, the income of people in a certain region, sales of a product within a certain period, student loan amounts of graduates, etc. Some of the graphs that can be used with frequency distributions are histograms, line graphs, bar charts and pie charts. Frequency distributions are used for both qualitative and quantitative data.

Joint frequency distributions
Bivariate joint frequency distributions are often presented as (two-way) contingency tables:

Two-way contingency table with marginal frequencies

              Dance    Sports    TV    Total
    Men       2        10        8     20
    Women     16       6         8     30
    Total     18       16        16    50

The total row and total column report the marginal frequencies or marginal distribution, while the body of the table reports the joint frequencies.[1]

Applications
Managing and operating on frequency-tabulated data is much simpler than operating on raw data.
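To illustrate that point, here is a hedged sketch of how summary statistics can be computed directly from a grouped table, using interval midpoints in place of the discarded raw values (so the results are approximations); the table used is Table 3, and the variable names are illustrative.

    import math

    # (lower endpoint, frequency) pairs from Table 3; class width 10
    table = [(360, 2), (370, 3), (380, 5), (390, 7),
             (400, 5), (410, 4), (420, 3), (430, 1)]
    width = 10

    n = sum(f for _, f in table)
    mids = [(lo + (width - 1) / 2, f) for lo, f in table]   # interval midpoints

    mean = sum(m * f for m, f in mids) / n                  # grouped mean
    var = sum(f * (m - mean) ** 2 for m, f in mids) / n     # grouped population variance
    print(f"mean ≈ {mean:.1f} min, std dev ≈ {math.sqrt(var):.1f} min")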
There are simple algorithms to calculate the median, mean, standard deviation, etc. from such tables, as the sketch above illustrates.
Statistical hypothesis testing is founded on the assessment of differences and similarities between frequency distributions. This assessment involves measures of central tendency or averages, such as the mean and median, and measures of variability or statistical dispersion, such as the standard deviation or variance.
A frequency distribution is said to be skewed when its mean and median are different. The kurtosis of a frequency distribution is the concentration of scores at the mean, or how peaked the distribution appears if depicted graphically, for example in a histogram. If the distribution is more peaked than the normal distribution it is said to be leptokurtic; if less peaked it is said to be platykurtic.
Letter frequency distributions are also used in frequency analysis to crack codes, and refer to the relative frequency of letters in different languages.

Q.6: Measures of dispersion or variability
Measures of variability: the range, inter-quartile range and standard deviation
This guide outlines three methods used to summarise the variability in a dataset. It will help you identify which measure is most appropriate to use for a particular set of data. Examples are also given of the use of these measures and of how the standard deviation can be calculated using Excel. Other useful guides: Using Averages, Working with Percentages.

Introduction
Measures of average such as the median and mean represent the typical value for a dataset. Within the dataset, the actual values usually differ from one another and from the average value itself. The extent to which the median and mean are good representatives of the values in the original dataset depends upon the variability or dispersion in the original data. Datasets are said to have high dispersion when they contain values considerably higher and lower than the mean value.
In figure 1, the numbers of different-sized tutorial groups in semester 1 and semester 2 are presented. In both semesters the mean and median tutorial group size is 5 students; however, the groups in semester 2 show more dispersion (or variability in size) than those in semester 1.
Dispersion within a dataset can be measured or described in several ways, including the range, inter-quartile range and standard deviation.

The Range
The range is the most obvious measure of dispersion and is the difference between the lowest and highest values in a dataset. In figure 1, the size of the largest semester 1 tutorial group is 6 students and the size of the smallest group is 4 students, resulting in a range of 2 (6 − 4). In semester 2, the largest tutorial group size is 7 students and the smallest tutorial group contains 3 students, so the range is 4 (7 − 3).
The range is simple to compute and is useful when you wish to evaluate the whole of a dataset. It is useful for showing the spread within a dataset and for comparing the spread between similar datasets.
An example of the use of the range to compare spread within datasets is provided in table 1. The scores of individual students in the examination and coursework components of a module are shown. To find the range in marks, the highest and lowest values need to be found from the table. The highest coursework mark was 48 and the lowest was 27, giving a range of 21. In the examination, the highest mark was 45 and the lowest 12, producing a range of 33.
This indicates that there was wider variation in the students' performance in the examination than in the coursework for this module.
Since the range is based solely on the two most extreme values within the dataset, if one of these is either exceptionally high or low (sometimes referred to as an outlier), it will result in a range that is not typical of the variability within the dataset. For example, imagine that in the above example one student failed to hand in any coursework and was awarded a mark of zero, but sat the exam and scored 40. The range for the coursework marks would now become 48 (48 − 0) rather than 21, but the new range is not typical of the dataset as a whole and is distorted by the outlier in the coursework marks. In order to reduce the problems caused by outliers in a dataset, the inter-quartile range is often calculated instead of the range.

The Inter-quartile Range
The inter-quartile range is a measure that indicates the extent to which the central 50% of values within the dataset are dispersed. It is based upon, and related to, the median. In the same way that the median divides a dataset into two halves, the dataset can be further divided into quarters by identifying the upper and lower quartiles. The lower quartile is found one quarter of the way along a dataset when the values have been arranged in order of magnitude; the upper quartile is found three quarters of the way along the dataset. Therefore, the upper quartile lies halfway between the median and the highest value in the dataset, whilst the lower quartile lies halfway between the median and the lowest value in the dataset. The inter-quartile range is found by subtracting the lower quartile from the upper quartile.
For example, the examination marks for 20 students following a particular module are arranged in order of magnitude.
The median lies at the mid-point between the two central values (10th and 11th) = halfway between 60 and 62 = 61.
The lower quartile lies at the mid-point between the 5th and 6th values = halfway between 52 and 53 = 52.5.
The upper quartile lies at the mid-point between the 15th and 16th values = halfway between 70 and 71 = 70.5.
The inter-quartile range for this dataset is therefore 70.5 − 52.5 = 18, whereas the range is 80 − 43 = 37. The inter-quartile range provides a clearer picture of the overall dataset by removing or ignoring the outlying values. Like the range, however, the inter-quartile range is a measure of dispersion that is based upon only two values from the dataset. Statistically, the standard deviation is a more powerful measure of dispersion because it takes into account every value in the dataset. The standard deviation is explored in the next section of this guide.

Calculating the inter-quartile range using Excel
The method Excel uses to calculate quartiles is not commonly used and tends to produce unusual results, particularly when the dataset contains only a few values. For this reason you may be better off calculating the inter-quartile range by hand.
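The worked example above can be reproduced with the guide's mid-point method. The sketch below is illustrative only: the guide does not list all 20 marks, so the dataset here is hypothetical, chosen so that the 5th/6th, 10th/11th and 15th/16th values match the example.

    # Quartiles by the guide's mid-point method for n = 20 sorted values:
    # Q1 between the 5th and 6th, median between the 10th and 11th,
    # Q3 between the 15th and 16th.
    def quartiles_midpoint(values):
        v = sorted(values)
        n = len(v)
        assert n % 4 == 0, "sketch assumes n divisible by 4, as in the example"
        mid = lambda i: (v[i - 1] + v[i]) / 2     # halfway between i-th and (i+1)-th
        q = n // 4
        return mid(q), mid(2 * q), mid(3 * q)

    marks = [43, 45, 47, 50, 52, 53, 55, 57, 59, 60,      # hypothetical data
             62, 63, 65, 68, 70, 71, 73, 75, 78, 80]
    q1, median, q3 = quartiles_midpoint(marks)
    print(q1, median, q3, "IQR =", q3 - q1)               # 52.5 61.0 70.5 IQR = 18.0

Note that spreadsheet functions and statistics libraries often interpolate quartiles differently, which is exactly the caveat raised above about Excel.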
The Standard Deviation
The standard deviation is a measure that summarises the amount by which every value within a dataset varies from the mean. Effectively, it indicates how tightly the values in the dataset are bunched around the mean value. It is the most robust and widely used measure of dispersion since, unlike the range and inter-quartile range, it takes into account every value in the dataset. When the values in a dataset are tightly bunched together, the standard deviation is small. When the values are spread apart, the standard deviation will be relatively large. The standard deviation is usually presented in conjunction with the mean and is measured in the same units.
In many datasets the values deviate from the mean value due to chance, and such datasets are said to display a normal distribution. In a dataset with a normal distribution, most of the values are clustered around the mean while relatively few values tend to be extremely high or extremely low. Many natural phenomena display a normal distribution.
For datasets that have a normal distribution, the standard deviation can be used to determine the proportion of values that lie within a particular range of the mean value. For such distributions it is always the case that 68% of values lie within one standard deviation (1SD) of the mean, 95% of values lie within two standard deviations (2SD) of the mean, and 99% of values lie within three standard deviations (3SD) of the mean. Figure 3 shows this concept in diagrammatical form.
If the mean of a dataset is 25 and its standard deviation is 1.6, then:
68% of the values in the dataset will lie between MEAN − 1SD (25 − 1.6 = 23.4) and MEAN + 1SD (25 + 1.6 = 26.6);
99% of the values will lie between MEAN − 3SD (25 − 4.8 = 20.2) and MEAN + 3SD (25 + 4.8 = 29.8).
If the dataset had the same mean of 25 but a larger standard deviation (for example, 2.3), it would indicate that the values were more dispersed. The frequency distribution for a dispersed dataset would still show a normal distribution, but when plotted on a graph the shape of the curve would be flatter, as in figure 4.

Population and sample standard deviations
There are two different calculations for the standard deviation. Which formula you use depends upon whether the values in your dataset represent an entire population or whether they form a sample of a larger population. For example, if all student users of the library were asked how many books they had borrowed in the past month, then the entire population has been studied, since all the students have been asked. In such cases the population standard deviation should be used. Sometimes it is not possible to collect information about an entire population, and it might be more realistic to ask a sample of 150 students about their library borrowing and use these results to estimate the borrowing habits of the entire population of students. In such cases the sample standard deviation should be used.

Formulae for the standard deviation
Whilst it is not necessary to learn the formula for calculating the standard deviation, there may be times when you wish to include it in a report or dissertation.
The standard deviation of an entire population is known as σ (sigma) and is calculated using:
σ = √( Σ(x − μ)² ÷ N )
where x represents each value in the population, μ is the mean value of the population, Σ is the summation (or total), and N is the number of values in the population.
The standard deviation of a sample is known as s and is calculated using:
s = √( Σ(x − x̄)² ÷ (n − 1) )
where x represents each value in the sample, x̄ is the mean value of the sample, Σ is the summation (or total), and n − 1 is the number of values in the sample minus 1.

Calculating the standard deviation using Excel
Excel has functions to calculate the population and sample standard deviations. The appropriate commands are entered into the formula bar towards the top of the spreadsheet, and the corresponding cells in the spreadsheet are updated to show the result.
For an example of calculating the population standard deviation, imagine you wish to know how fuel-efficient the new car you have just purchased is. You calculate how many kilometres you have done per litre on your first five trips. This information is presented in column A of the spreadsheet (figure 5). As you have only made 5 trips, you do not have any further information, and you are therefore measuring the whole population at this point in time. The command to find the population standard deviation in Excel is =STDEVP(VALUES); in this case the command is =STDEVP(A2:A6), which gives an answer of 0.49. Basing your results on the population standard deviation, and assuming that your first 5 trips in your new car have been typical of your usual journeys, you can be 99% confident that your new car will do between 14.75 (MEAN − 3SD) and 17.69 (MEAN + 3SD) kilometres per litre.
The same data can be used to demonstrate how to calculate the sample standard deviation in Excel. In this case, imagine that the data in column A represent the kilometres per litre found for a sample of 5 new cars tested by the manufacturer. The sample standard deviation is calculated using =STDEV(VALUES); in this case the command is =STDEV(A2:A6), which produces an answer of 0.55. The sample standard deviation will always be greater than the population standard deviation when they are calculated for the same dataset, because the formula for the sample standard deviation has to allow for the possibility of there being more variation in the true population than has been measured in the sample. Based on their sample of 5 cars, and therefore using the sample standard deviation, the manufacturer could state with 99% confidence that similar cars will do between 14.57 (MEAN − 3SD) and 17.87 (MEAN + 3SD) kilometres per litre.
These examples show the quick method of calculating standard deviations using a cell range. Each of the commands can also be written out in a longer format with the individual kilometres-per-litre values entered; for example, =STDEV(16.13,16.40,15.81,17.07,15.69) produces an identical result to =STDEV(A2:A6). However, if one of the values in column A were found to be incorrect and adjusted, the cell-range method would automatically update the calculation of the standard deviation, whereas the longer format would require manual adjustment of the command.

Summary
The range, inter-quartile range and standard deviation are all measures that indicate the amount of variability within a dataset. The range is the simplest measure of variability to calculate but can be misleading if the dataset contains extreme values. The inter-quartile range reduces this problem by considering the variability within the middle 50% of the dataset. The standard deviation is the most robust measure of variability since it takes into account a measure of how every value in the dataset varies from the mean. However, care must be taken when calculating the standard deviation to consider whether the entire population or a sample is being examined, and to use the appropriate formula.

Further help and information
Further study guides covering a range of numeracy topics are available from the Student Learning Centre in College House. The Maths Help service can also provide advice and support for students who have questions about any aspect of maths or numeracy, including using the range, inter-quartile range and standard deviation.
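The same calculation can be checked outside Excel with Python's standard library (a sketch, not part of the original guide); statistics.pstdev corresponds to =STDEVP and statistics.stdev to =STDEV.

    import statistics

    # Kilometres per litre on the five trips from the Excel example
    kml = [16.13, 16.40, 15.81, 17.07, 15.69]

    mean = statistics.mean(kml)          # 16.22
    pop_sd = statistics.pstdev(kml)      # 0.49, like =STDEVP(A2:A6)
    samp_sd = statistics.stdev(kml)      # 0.55, like =STDEV(A2:A6)

    # The guide rounds the SD to two decimals before forming the band,
    # giving 14.75 to 17.69 for the population case.
    print(f"{mean - 3 * round(pop_sd, 2):.2f} to {mean + 3 * round(pop_sd, 2):.2f}")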
Measures of statistical dispersion
A measure of statistical dispersion is a nonnegative real number that is zero if all the data are the same and increases as the data become more diverse. Most measures of dispersion have the same units as the quantity being measured; in other words, if the measurements are in metres or seconds, so is the measure of dispersion. Such measures of dispersion include:
- Sample standard deviation
- Interquartile range (IQR) or interdecile range
- Range
- Mean difference
- Median absolute deviation (MAD)
- Average absolute deviation (or simply average deviation)
- Distance standard deviation
These are frequently used (together with scale factors) as estimators of scale parameters, in which capacity they are called estimates of scale. Robust measures of scale are those unaffected by a small number of outliers; they include the IQR and MAD.
All the above measures of statistical dispersion have the useful property that they are location-invariant and linear in scale: if a random variable X has a dispersion of S_X, then the linear transformation Y = aX + b, for real a and b, has dispersion S_Y = |a| S_X.
Other measures of dispersion are dimensionless; in other words, they have no units even if the variable itself has units. These include:
- Coefficient of variation
- Quartile coefficient of dispersion
- Relative mean difference, equal to twice the Gini coefficient
There are other measures of dispersion:
- Variance (the square of the standard deviation), which is location-invariant but not linear in scale.
- Variance-to-mean ratio, mostly used for count data, when the term coefficient of dispersion is used and when this ratio is dimensionless (as count data are themselves dimensionless), not otherwise.
Some measures of dispersion have specialized purposes, among them the Allan variance and the Hadamard variance. For categorical variables, it is less common to measure dispersion by a single number; see qualitative variation. One measure that does so is the discrete entropy.

Sources of statistical dispersion
In the physical sciences, such variability may result from random measurement errors: instrument measurements are often not perfectly precise, i.e., reproducible, and there is additional inter-rater variability in interpreting and reporting the measured results. One may assume that the quantity being measured is stable, and that the variation between measurements is due to observational error. A system of a large number of particles is characterized by the mean values of a relatively small number of macroscopic quantities such as temperature, energy, and density. The standard deviation is an important measure in fluctuation theory, which explains many physical phenomena, including why the sky is blue.[2]
In the biological sciences, the quantity being measured is seldom unchanging and stable, and the variation observed might additionally be intrinsic to the phenomenon: it may be due to inter-individual variability, that is, distinct members of a population differing from each other; or it may be due to intra-individual variability, that is, one and the same subject differing in tests taken at different times or under other differing conditions. Such types of variability are also seen in the arena of manufactured products; even there, the meticulous scientist finds variation.
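The S_Y = |a| S_X property stated above can be illustrated numerically (a sketch assuming NumPy is available; any dataset would do):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=50, scale=4, size=1000)
    a, b = 2.5, -7.0
    y = a * x + b                      # linear transformation Y = aX + b

    iqr = lambda v: np.percentile(v, 75) - np.percentile(v, 25)
    for name, S in [("std dev", np.std), ("IQR", iqr), ("range", np.ptp)]:
        print(f"{name}: S_Y = {S(y):.3f}  |a|*S_X = {abs(a) * S(x):.3f}")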
In economics, finance, and other disciplines, regression analysis attempts to explain the dispersion of a dependent variable, generally measured by its variance, using one or more independent variables, each of which itself has positive dispersion. The fraction of variance explained is called the coefficient of determination.

A partial ordering of dispersion
A mean-preserving spread (MPS) is a change from one probability distribution A to another probability distribution B, where B is formed by spreading out one or more portions of A's probability density function while leaving the mean (the expected value) unchanged.[3] The concept of a mean-preserving spread provides a partial ordering of probability distributions according to their dispersions: of two probability distributions, one may be ranked as having more dispersion than the other, or alternatively neither may be ranked as having more dispersion.

Q.7: Central tendency of the beta distribution
Beta distribution (not to be confused with the beta function)
In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution. It is written Beta(α, β); its mean, median, mode, variance, skewness, excess kurtosis, entropy, moment generating function, characteristic function, and Fisher information are discussed in the sections below.
The beta distribution has been applied to model the behavior of random variables limited to intervals of finite length in a wide variety of disciplines. For example, it has been used as a statistical description of allele frequencies in population genetics;[1] time allocation in project management / control systems;[2] sunshine data;[3] variability of soil properties;[4] proportions of the minerals in rocks in stratigraphy;[5] and heterogeneity in the probability of HIV transmission.[6]
In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli, binomial and geometric distributions. For example, the beta distribution can be used in Bayesian analysis to describe initial knowledge concerning a probability of success, such as the probability that a space vehicle will successfully complete a specified mission. The beta distribution is a suitable model for the random behavior of percentages and proportions.
The usual formulation of the beta distribution is also known as the beta distribution of the first kind, whereas beta distribution of the second kind is an alternative name for the beta prime distribution.

Probability density function
The probability density function of the beta distribution, for 0 ≤ x ≤ 1 and shape parameters α > 0 and β > 0, is a power function of the variable x and of its reflection (1 − x):
f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β), with B(α, β) = Γ(α) Γ(β) / Γ(α + β),
where Γ(z) is the gamma function. The beta function B(α, β) appears as a normalization constant to ensure that the total probability integrates to 1. In the above equations, x is a realization (an observed value that actually occurred) of a random process X.
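A small sketch (assuming SciPy is available) evaluates the density formula above directly and compares it with scipy.stats.beta.pdf; the parameter values are arbitrary.

    import numpy as np
    from scipy.special import gamma
    from scipy.stats import beta

    a, b = 2.0, 5.0
    x = np.linspace(0.01, 0.99, 5)

    # f(x; a, b) = x^(a-1) (1-x)^(b-1) / B(a, b), with B from gamma functions
    B = gamma(a) * gamma(b) / gamma(a + b)
    manual = x ** (a - 1) * (1 - x) ** (b - 1) / B

    print(np.allclose(manual, beta.pdf(x, a, b)))   # True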
This definition includes both ends x = 0 and x = 1, which is consistent with definitions for other continuous distributions supported on a bounded interval that are special cases of the beta distribution (for example the arcsine distribution), and consistent with several authors, such as N. L. Johnson and S. Kotz.[7][8][9][10] However, several other authors, including W. Feller,[11][12][13] choose to exclude the ends x = 0 and x = 1 (such that the two ends are not actually part of the density function) and consider instead 0 < x < 1.
Several authors, including N. L. Johnson and S. Kotz,[7] use the symbols p and q (instead of α and β) for the shape parameters of the beta distribution, reminiscent of the symbols traditionally used for the parameters of the Bernoulli distribution, because the beta distribution approaches the Bernoulli distribution in the limit as both shape parameters α and β approach the value of zero. In the following, that a random variable X is beta-distributed with parameters α and β will be denoted by X ~ Beta(α, β).[14][15] Other notations for beta-distributed random variables are also used in the statistical literature.[16][11]

Cumulative distribution function
The cumulative distribution function is
F(x; α, β) = B(x; α, β) / B(α, β) = I_x(α, β),
where B(x; α, β) is the incomplete beta function and I_x(α, β) is the regularized incomplete beta function.

Properties
Measures of central tendency
Mode
The mode of a beta-distributed random variable X with α, β > 1 is given by the following expression:[7]
mode = (α − 1) / (α + β − 2).
When both parameters are less than one (α, β < 1), this is the anti-mode: the lowest point of the probability density curve.[9]
Letting α = β, the expression for the mode simplifies to 1/2, showing that for α = β > 1 the mode (respectively the anti-mode when α, β < 1) is at the center of the distribution: the distribution is symmetric in those cases. See the "Shapes" section in this article for a full list of mode cases, for arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or both ends. In some cases the (maximum) value of the density function occurring at the end is finite, for example in the case of α = 2, β = 1 (or α = 1, β = 2), the right-triangle distribution, while in several other cases there is a singularity at the end, where the value of the density function approaches infinity, for example in the case α = β = 1/2, the arcsine distribution. The choice of whether to include the ends x = 0 and x = 1 as part of the density function, whether a singularity can be considered a mode, and whether cases with two maxima are to be considered bimodal, is responsible for some authors considering these maximum values at the end of the density distribution to be modes[14] or not.[12]

Median
The median of the beta distribution is the unique real number x for which the regularized incomplete beta function satisfies I_x(α, β) = 1/2. There is no general closed-form expression for the median of the beta distribution for arbitrary values of α and β. Closed-form expressions for particular values of the parameters α and β follow:
- For symmetric cases α = β, median = 1/2.
- For α = 1 and β > 0, median = 1 − 2^(−1/β) (this case is the mirror-image of the power-function [0, 1] distribution).
- For α > 0 and β = 1, median = 2^(−1/α) (this case is the power-function [0, 1] distribution[12]).
- For α = 3 and β = 2, median = 0.6142724318676105..., the real [0, 1] solution to the quartic equation 1 − 8x³ + 6x⁴ = 0.
- For α = 2 and β = 3, median = 0.38572756813238945... = 1 − median(Beta(3, 2)).
With one parameter finite (non-zero) and the other approaching its limits, the median tends to 1 as β → 0 (α finite) and to 0 as α → 0 (β finite).
A reasonable approximation of the value of the median of the beta distribution, for both α and β greater than or equal to one, is given by the formula[17]
median ≈ (α − 1/3) / (α + β − 2/3) for α, β ≥ 1.
For α ≥ 1 and β ≥ 1, the relative error (the absolute error divided by the median) in this approximation is less than 4%, and for both α ≥ 2 and β ≥ 2 it is less than 1%. The absolute error divided by the difference between the mean and the mode is similarly small.

Mean
The expected value (mean) μ of a beta-distributed random variable X with two parameters α and β is a function of only the ratio β/α of these parameters:[7]
μ = E[X] = α / (α + β) = 1 / (1 + β/α).
Letting α = β in the above expression one obtains μ = 1/2, showing that for α = β the mean is at the center of the distribution: it is symmetric. Also, the following limits can be obtained from the above expression: μ → 1 as β/α → 0, and μ → 0 as β/α → ∞.
Therefore, for β/α → 0, or for α/β → ∞, the mean is located at the right end, x = 1. For these limit ratios, the beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the right end, x = 1, with probability 1 and zero probability everywhere else: there is 100% probability (absolute certainty) concentrated at the right end, x = 1. Similarly, for β/α → ∞, or for α/β → 0, the mean is located at the left end, x = 0, and the beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the left end, x = 0, with probability 1 and zero probability everywhere else: there is 100% probability (absolute certainty) concentrated at the left end, x = 0.
While for typical unimodal distributions (with centrally located modes, inflexion points on both sides of the mode, and longer tails), with Beta(α, β) such that α, β > 2, it is known that the sample mean (as an estimate of location) is not as robust as the sample median, the opposite is the case for uniform or "U-shaped" bimodal distributions, with Beta(α, β) such that α, β ≤ 1, with the modes located at the ends of the distribution. As Mosteller and Tukey remark ([18] p. 207), "the average of the two extreme observations uses all the sample information. This illustrates how, for short-tailed distributions, the extreme observations should get more weight." By contrast, the median of "U-shaped" bimodal distributions with modes at the edges of the distribution (Beta(α, β) with α, β ≤ 1) is not robust, as the sample median drops the extreme sample observations from consideration. A practical application of this occurs, for example, for random walks, since the probability for the time of the last visit to the origin in a random walk is distributed as the arcsine distribution Beta(1/2, 1/2):[11][19] the mean of a number of realizations of a random walk is a much more robust estimator than the median (which is an inappropriate sample-measure estimate in this case).
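As a numeric check of the median results above (a sketch assuming SciPy, which computes the median by inverting the CDF):

    from scipy.stats import beta

    # Closed form for α = 3, β = 2: the real [0, 1] root of 1 - 8x^3 + 6x^4 = 0
    print(beta(3, 2).median())               # 0.6142724318676105...

    # Approximation (α - 1/3)/(α + β - 2/3), valid for α, β >= 1
    a, b = 3, 2
    approx = (a - 1/3) / (a + b - 2/3)       # 8/13 ≈ 0.6154
    exact = beta(a, b).median()
    print(abs(approx - exact) / exact)       # relative error, well below 1%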
Geometric mean
The logarithm of the geometric mean G_X of a distribution with random variable X is the arithmetic mean of ln(X), or, equivalently, its expected value:
ln G_X = E[ln X].
For a beta distribution, the expected-value integral gives
E[ln X] = ψ(α) − ψ(α + β),
where ψ is the digamma function. Therefore, the geometric mean of a beta distribution with shape parameters α and β is the exponential of the digamma functions of α and β as follows:
G_X = exp(ψ(α) − ψ(α + β)).
While for a beta distribution with equal shape parameters α = β it follows that skewness = 0 and mode = mean = median = 1/2, the geometric mean is less than 1/2: 0 < G_X < 1/2. The reason for this is that the logarithmic transformation strongly weights the values of X close to zero, as ln(X) tends strongly towards negative infinity as X approaches zero, while ln(X) flattens towards zero as X → 1.
Along the line α = β, the following limits apply: G_X → 0 as α = β → 0, and G_X → 1/2 as α = β → ∞. With one parameter finite (non-zero) and the other approaching a limit, G_X → 0 as α → 0 and G_X → 1 as β → 0.
The accompanying plot shows the difference between the mean and the geometric mean for shape parameters α and β from zero to 2. Besides the fact that the difference between them approaches zero as α and β approach infinity, and that the difference becomes large for values of α and β approaching zero, one can observe an evident asymmetry of the geometric mean with respect to the shape parameters α and β. The difference between the geometric mean and the mean is larger for small values of α in relation to β than when exchanging the magnitudes of β and α.
N. L. Johnson and S. Kotz[7] suggest the logarithmic approximation to the digamma function ψ(α) ≈ ln(α − 1/2), which results in the following approximation to the geometric mean:
G_X ≈ (α − 1/2) / (α + β − 1/2).
Numerical values for the relative error in this approximation follow: [(α = β = 1): 9.39%]; [(α = β = 2): 1.29%]; [(α = 2, β = 3): 1.51%]; [(α = 3, β = 2): 0.44%]; [(α = β = 3): 0.51%]; [(α = β = 4): 0.26%]; [(α = 3, β = 4): 0.55%]; [(α = 4, β = 3): 0.24%].
Similarly, one can calculate the values of the shape parameters required for the geometric mean to equal 1/2. Given the value of one parameter, β, what would be the value of the other parameter, α, required for the geometric mean to equal 1/2? The answer is that (for β > 1) the value of α required tends towards β + 1/2 as β → ∞. For example, all these couples have the same geometric mean of 1/2: [β = 1, α = 1.4427], [β = 2, α = 2.46958], [β = 3, α = 3.47943], [β = 4, α = 4.48449], [β = 5, α = 5.48756], [β = 10, α = 10.4938], [β = 100, α = 100.499].
The fundamental property of the geometric mean, which can be proven to be false for any other mean, is
G(X_i / Y_i) = G(X_i) / G(Y_i).
This makes the geometric mean the only correct mean when averaging normalized results, that is, results that are presented as ratios to reference values.[20] This is relevant because the beta distribution is a suitable model for the random behavior of percentages, and it is particularly suitable for the statistical modelling of proportions. The geometric mean plays a central role in maximum likelihood estimation; see the section "Parameter estimation, maximum likelihood".
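The digamma identity for the geometric mean can be verified numerically (a sketch assuming SciPy; expect() integrates against the density):

    import numpy as np
    from scipy.special import digamma
    from scipy.stats import beta

    a, b = 2.0, 3.0

    closed = np.exp(digamma(a) - digamma(a + b))   # G_X = exp(psi(a) - psi(a+b))
    direct = np.exp(beta(a, b).expect(np.log))     # exp(E[ln X]) by integration
    print(closed, direct)                          # agree to integration accuracy

    # Johnson & Kotz approximation (a - 1/2)/(a + b - 1/2);
    # about 1.5% relative error here, matching the tabulated value for (α = 2, β = 3)
    print((a - 0.5) / (a + b - 0.5))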
Actually, when performing maximum likelihood estimation, besides the geometric mean G_X based on the random variable X, a second geometric mean appears naturally: the geometric mean based on the linear transformation (1 − X), the mirror-image of X, denoted by G_(1−X):
G_(1−X) = exp(ψ(β) − ψ(α + β)).
Along the line α = β, the following limits apply: G_(1−X) → 0 as α = β → 0, and G_(1−X) → 1/2 as α = β → ∞. With one parameter finite (non-zero) and the other approaching a limit, G_(1−X) → 0 as β → 0 and G_(1−X) → 1 as α → 0. It has the approximate value
G_(1−X) ≈ (β − 1/2) / (α + β − 1/2).
Although both G_X and G_(1−X) are asymmetric, in the case that both shape parameters are equal, α = β, the geometric means are equal: G_X = G_(1−X). This equality follows from the symmetry displayed between the two geometric means: G_X(Beta(α, β)) = G_(1−X)(Beta(β, α)).

Harmonic mean
The inverse of the harmonic mean H_X of a distribution with random variable X is the arithmetic mean of 1/X, or, equivalently, its expected value. Therefore, the harmonic mean H_X of a beta distribution with shape parameters α and β is:
H_X = 1 / E[1/X] = (α − 1) / (α + β − 1) for α > 1.
The harmonic mean H_X of a beta distribution with α < 1 is undefined, because its defining expression is not bounded in [0, 1] for shape parameter α less than unity.
Letting α = β in the above expression one obtains H_X = (α − 1) / (2α − 1), showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞. With one parameter finite (non-zero) and the other approaching a limit, H_X → 0 as α → 1 (β finite) and H_X → 1 as β → 0 (α finite).
The harmonic mean plays a role in maximum likelihood estimation for the four-parameter case, in addition to the geometric mean. Actually, when performing maximum likelihood estimation for the four-parameter case, besides the harmonic mean H_X based on the random variable X, a second harmonic mean appears naturally: the harmonic mean based on the linear transformation (1 − X), the mirror-image of X, denoted by H_(1−X):
H_(1−X) = 1 / E[1/(1 − X)] = (β − 1) / (α + β − 1) for β > 1.
The harmonic mean H_(1−X) of a beta distribution with β < 1 is undefined, because its defining expression is not bounded in [0, 1] for shape parameter β less than unity.
Letting α = β in the above expression one obtains H_(1−X) = (α − 1) / (2α − 1), showing that for α = β this harmonic mean also ranges from 0, for α = β = 1, to 1/2, for α = β → ∞.
Although both H_X and H_(1−X) are asymmetric, in the case that both shape parameters are equal, α = β, the harmonic means are equal: H_X = H_(1−X). This equality follows from the symmetry displayed between the two harmonic means: H_X(Beta(α, β)) = H_(1−X)(Beta(β, α)).

Measures of statistical dispersion
Variance
The variance (the second moment centered on the mean) of a beta-distributed random variable X with parameters α and β is:[7]
var(X) = αβ / ((α + β)² (α + β + 1)).
Letting α = β in the above expression one obtains var(X) = 1 / (4(2α + 1)), showing that for α = β the variance decreases monotonically as α = β increases. Setting α = β = 0 in this expression, one finds the maximum variance var(X) = 1/4,[7] which only occurs approaching the limit, at α = β = 0.
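Both the harmonic-mean and variance formulas can be checked the same way (a sketch, SciPy assumed):

    from scipy.stats import beta

    a, b = 3.0, 2.0
    X = beta(a, b)

    # Variance: closed form vs. scipy
    print(X.var(), a * b / ((a + b) ** 2 * (a + b + 1)))              # both 0.04

    # Harmonic mean H_X = 1/E[1/X] = (a - 1)/(a + b - 1), defined for a > 1
    print(1.0 / X.expect(lambda x: 1.0 / x), (a - 1) / (a + b - 1))   # both 0.5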
The beta distribution may also be parametrized in terms of its mean μ (0 < μ < 1) and sample size ν = α + β (ν > 0), with α = μν and β = (1 − μ)ν (see the section below titled "Mean and sample size"). Using this parametrization, one can express the variance in terms of the mean μ and the sample size ν as follows:
var(X) = μ(1 − μ) / (1 + ν).
Since ν = α + β > 0, it must follow that var(X) < μ(1 − μ).
For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore var(X) = 1 / (4(1 + ν)). Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions: var(X) → 0 as ν → ∞, and var(X) → μ(1 − μ) as ν → 0.

Geometric variance and covariance
The logarithm of the geometric variance of a distribution with random variable X is the second moment of the logarithm of X centered on the geometric mean of X:
ln var_GX = E[(ln X − ln G_X)²] = var(ln X),
and therefore the geometric variance is var_GX = exp(var(ln X)).
In the Fisher information matrix, and in the curvature of the log-likelihood function, the logarithm of the geometric variance of the reflected variable (1 − X) and the logarithm of the geometric covariance between X and (1 − X) appear.
For a beta distribution, higher-order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two gamma distributions and differentiating through the integral. They can be expressed in terms of higher-order polygamma functions; see the section titled "Other moments, Moments of transformed random variables, Moments of logarithmically transformed random variables". The variances of the logarithmic variables and the covariance of ln X and ln(1 − X) are:
var(ln X) = ψ₁(α) − ψ₁(α + β),
var(ln(1 − X)) = ψ₁(β) − ψ₁(α + β),
cov(ln X, ln(1 − X)) = −ψ₁(α + β),
where the trigamma function ψ₁(α), the second of the polygamma functions, is defined as the derivative of the digamma function: ψ₁(α) = dψ(α)/dα. The limits with one parameter finite (non-zero) and the other approaching its limits, and the limits with two parameters varying, follow from these expressions.
The accompanying plots (omitted here) show the log geometric variances and log geometric covariance versus the shape parameters α and β. They show that the log geometric variances and log geometric covariance are close to zero for shape parameters α and β greater than 2, and that the log geometric variances rapidly rise in value for shape-parameter values α and β less than unity. The log geometric variances are positive for all values of the shape parameters. The log geometric covariance is negative for all values of the shape parameters, and it reaches large negative values for α and β less than unity.
Although both ln(var_GX) and ln(var_G(1−X)) are asymmetric, in the case that both shape parameters are equal, α = β, the log geometric variances are equal: ln(var_GX) = ln(var_G(1−X)). This equality follows from the symmetry ln var_GX(Beta(α, β)) = ln var_G(1−X)(Beta(β, α)). The log geometric covariance is symmetric in α and β.
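A sketch verifying the trigamma identities above numerically (SciPy assumed; polygamma(1, .) is the trigamma function):

    import numpy as np
    from scipy.special import polygamma
    from scipy.stats import beta

    a, b = 2.0, 3.0
    X = beta(a, b)

    # var(ln X) = psi_1(a) - psi_1(a + b)
    closed = polygamma(1, a) - polygamma(1, a + b)
    m = X.expect(np.log)                                   # E[ln X]
    numeric = X.expect(lambda x: (np.log(x) - m) ** 2)
    print(closed, numeric)

    # cov(ln X, ln(1 - X)) = -psi_1(a + b)
    m2 = X.expect(lambda x: np.log(1 - x))                 # E[ln(1 - X)]
    cov = X.expect(lambda x: (np.log(x) - m) * (np.log(1 - x) - m2))
    print(cov, -polygamma(1, a + b))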
Mean absolute deviation around the mean
The mean absolute deviation around the mean for the beta distribution with shape parameters α and β is:[12]
E[|X − E[X]|] = 2 α^α β^β / (B(α, β) (α + β)^(α+β+1)).
The mean absolute deviation around the mean is a more robust estimator of statistical dispersion than the standard deviation for beta distributions with tails and inflection points on each side of the mode, that is, Beta(α, β) distributions with α, β > 2, since it depends on the linear (absolute) deviations rather than the square deviations from the mean, so the effect of very large deviations from the mean is not overly weighted.
The term "absolute deviation" does not uniquely identify a measure of statistical dispersion, as there are several measures that can be used to measure absolute deviations, and there are several measures of central tendency that can be used as well. Thus, to uniquely identify the absolute deviation it is necessary to specify both the measure of deviation and the measure of central tendency. Unfortunately, the statistical literature has not yet adopted a standard notation: both the mean absolute deviation around the mean and the median absolute deviation around the median have been denoted by their initials "MAD" in the literature, which may lead to confusion, since in general they may have values considerably different from each other.
Using Stirling's approximation to the gamma function, N. L. Johnson and S. Kotz[7] derived an approximation to this expression for values of the shape parameters greater than unity (the relative error of the approximation is only −3.5% for α = β = 1, and it decreases to zero as α → ∞, β → ∞). At the limit α → ∞, β → ∞, the ratio of the mean absolute deviation to the standard deviation (for the beta distribution) becomes equal to the ratio of the same measures for the normal distribution: √(2/π) ≈ 0.798. For α = β = 1 this ratio equals √3/2 ≈ 0.866, so from α = β = 1 to α, β → ∞ the ratio decreases by 8.5%. For α = β = 0 the standard deviation is exactly equal to the mean absolute deviation around the mean. Therefore this ratio decreases by 15% from α = β = 0 to α = β = 1, and by 25% from α = β = 0 to α, β → ∞. However, for skewed beta distributions such that α → 0 or β → 0, the ratio of the standard deviation to the mean absolute deviation approaches infinity (although each of them, individually, approaches zero) because the mean absolute deviation approaches zero faster than the standard deviation.
Using the parametrization in terms of the mean μ and sample size ν = α + β > 0 (so that α = μν and β = (1 − μ)ν), one can express the mean absolute deviation around the mean in terms of μ and ν as follows:
E[|X − E[X]|] = 2 μ^(μν) (1 − μ)^((1−μ)ν) / (ν B(μν, (1 − μ)ν)).
For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and the expression simplifies accordingly; the limits (with only the noted variable approaching the limit) also follow from the above expressions.

Skewness
The skewness (the third moment centered on the mean, normalized by the 3/2 power of the variance) of the beta distribution is[7]
γ₁ = 2(β − α) √(α + β + 1) / ((α + β + 2) √(αβ)).
Letting α = β in the above expression one obtains γ₁ = 0, showing once again that for α = β the distribution is symmetric and hence the skewness is zero. The skew is positive (right-tailed) for α < β, and negative (left-tailed) for α > β.
Using the parametrization in terms of mean μ and sample size ν = α + β, one can express the skewness in terms of μ and ν as follows:
γ₁ = 2(1 − 2μ) √(1 + ν) / ((2 + ν) √(μ(1 − μ))).
The skewness can also be expressed just in terms of the variance var and the mean μ as follows:
γ₁ = 2(1 − 2μ) √var / (var + μ(1 − μ)).
The accompanying plot of skewness as a function of variance and mean shows that maximum variance (1/4) is coupled with zero skewness and the symmetry condition (μ = 1/2), and that maximum skewness (positive or negative infinity) occurs when the mean is located at one end or the other, so that the "mass" of the probability distribution is concentrated at the ends (minimum variance).
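The closed-form skewness can be compared against SciPy's (a sketch; the parameter values are arbitrary):

    from scipy.stats import beta

    a, b = 2.0, 5.0
    closed = 2 * (b - a) * (a + b + 1) ** 0.5 / ((a + b + 2) * (a * b) ** 0.5)
    print(closed)                            # positive (right-tailed) since a < b
    print(beta.stats(a, b, moments="s"))     # scipy's skewness, same value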
The following expression for the square of the skewness, in terms of the sample size ν = α + β and the variance var, is useful for the method-of-moments estimation of four parameters:
(γ₁)² = 4(1 − 4 var(1 + ν)) / (var (2 + ν)²).
This expression correctly gives a skewness of zero for α = β, since in that case (see the section titled "Variance") var = 1/(4(1 + ν)). For the symmetric case (α = β), skewness = 0 over the whole range. For the asymmetric cases (α ≠ β), the limits (with only the noted variable approaching the limit) can be obtained from the above expressions.

Kurtosis
The beta distribution has been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported to be a good indicator of the condition of a gear.[21] Kurtosis has also been used to distinguish the seismic signal generated by a person's footsteps from other signals. As persons or other targets moving on the ground generate continuous signals in the form of seismic waves, one can separate different targets based on the seismic waves they generate. Kurtosis is sensitive to impulsive signals, so it is much more sensitive to the signal generated by human footsteps than to other signals generated by vehicles, winds, noise, etc.[22]
Unfortunately, the notation for kurtosis has not been standardized. Kenney and Keeping[23] use the symbol γ₂ for the excess kurtosis, but Abramowitz and Stegun[24] use different terminology. To prevent confusion[25] between kurtosis (the fourth moment centered on the mean, normalized by the square of the variance) and excess kurtosis, they will be spelled out here.[12][13] The excess kurtosis of the beta distribution is
excess kurtosis = 6 [(α − β)²(α + β + 1) − αβ(α + β + 2)] / (αβ (α + β + 2)(α + β + 3)).
Letting α = β in the above expression one obtains
excess kurtosis = −6 / (3 + 2α) for α = β.
Therefore, for symmetric beta distributions the excess kurtosis is negative, increasing from a minimum value of −2 at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely concentrated at the two ends x = 0 and x = 1, with nothing in between: a 2-point Bernoulli distribution with equal probability 1/2 at each end (a coin toss; see the section below, "Kurtosis bounded by the square of the skewness", for further discussion).
The description of kurtosis as a measure of the "peakedness" (or "heavy tails") of the probability distribution is strictly applicable only to unimodal distributions (for example the normal distribution). For more general distributions, like the beta distribution, a more general description of kurtosis is that it is a measure of the proportion of the mass density near the mean: the higher the proportion of mass density near the mean, the higher the kurtosis, while the higher the mass density away from the mean, the lower the kurtosis. For α ≠ β (skewed beta distributions), the excess kurtosis can reach unlimited positive values (particularly for α → 0 with finite β, or for β → 0 with finite α), because all the mass density is concentrated at the mean when the mean coincides with one of the ends. Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is at the center), and there is no probability mass density in between the ends.
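The symmetric-case result can be checked numerically (a sketch, SciPy assumed; moments="k" returns Fisher, i.e. excess, kurtosis):

    from scipy.stats import beta

    # Symmetric case: excess kurtosis = -6/(3 + 2a), negative, rising toward 0
    for a in (0.5, 1.0, 2.0, 10.0, 100.0):
        print(a, -6 / (3 + 2 * a), float(beta.stats(a, a, moments="k")))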
Using the parametrization in terms of mean μ and sample size ν = α + β, one can express the excess kurtosis in terms of the mean μ and the sample size ν as follows:

excess kurtosis = 6[(1 − 2μ)²(ν + 1) − μ(1 − μ)(ν + 2)] / [μ(1 − μ)(ν + 2)(ν + 3)]

The excess kurtosis can also be expressed in terms of just the following two parameters: the variance var and the sample size ν, as follows:

excess kurtosis = 6[1 − var(5ν + 6)] / [var(ν + 2)(ν + 3)]

and, in terms of the variance var and the mean μ, as follows:

excess kurtosis = 6 var [1 − 5μ(1 − μ) − var] / [(μ(1 − μ) + var)(μ(1 − μ) + 2 var)]

The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess kurtosis (−2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This occurs for the symmetric case of α = β = 0, with zero skewness. At the limit, this is the 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else (a coin toss: one face of the coin being x = 0 and the other face being x = 1). Variance is maximum because the distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis reaches the minimum possible value (for any distribution) when the probability density function has two spikes at each end: it is bi-"peaky" with nothing in between them. On the other hand, the plot shows that for extreme skewed cases, where the mean is located near one or the other end (μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of the distribution approaches either end.

Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters: the square of the skewness, and the sample size ν, as follows:

excess kurtosis = (6/(ν + 3)) [((ν + 2)/4)(skewness)² − 1]

From this last expression, one can obtain the same limits published practically a century ago by Karl Pearson in his paper,[26] for the beta distribution (see section below titled "Kurtosis bounded by the square of the skewness"). Setting α + β = ν = 0 in the above expression, one obtains Pearson's lower boundary (values for the skewness and excess kurtosis below the boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region"). The limit α + β = ν → ∞ determines Pearson's upper boundary; therefore:

skewness² − 2 ≤ excess kurtosis ≤ (3/2) skewness²

Values of ν = α + β such that ν ranges from zero to infinity, 0 < ν < ∞, span the whole region of the beta distribution in the plane of excess kurtosis versus squared skewness. For the symmetric case (α = β), and for the asymmetric cases (α ≠ β), limits (with only the noted variable approaching the limit) can be obtained from the above expressions.

Characteristic function

Main article: Confluent hypergeometric function

[Plots: Re(characteristic function) for the symmetric case α = β, and for the skewed cases β = α + 1/2 and α = β + 1/2, with parameters ranging between 0 and 25]

The characteristic function is the Fourier transform of the probability density function.
The characteristic function of the beta distribution is Kummer's confluent hypergeometric function (of the first kind):[7][24][27]

φ(t) = E[e^(itX)] = 1F1(α; α + β; it) = Σ_(k=0..∞) (α)_k (it)^k / ((α + β)_k k!)

where (x)_k = x(x + 1)...(x + k − 1) is the rising factorial, also called the "Pochhammer symbol". The value of the characteristic function for t = 0 is one: φ(0) = 1. Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the origin of the variable t: the real part is an even function of t, and the imaginary part is an odd function of t. The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel-type function, since in the special case α + β = 2α the confluent hypergeometric function (of the first kind) reduces to a Bessel function (the modified Bessel function of the first kind) using Kummer's second transformation:

1F1(α; 2α; it) = e^(it/2) 0F1(; α + 1/2; −t²/16)

In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for symmetric (α = β) and skewed (α ≠ β) cases.

Other moments

Moment generating function

It also follows[7][12] that the moment generating function is

M_X(t) = E[e^(tX)] = 1F1(α; α + β; t) = Σ_(k=0..∞) (t^k/k!) Π_(r=0..k−1) (α + r)/(α + β + r)

In particular M_X(α; β; 0) = 1.

Higher moments

Using the moment generating function, the k-th raw moment is given by[7] the factor multiplying the term t^k/k! in the series of the moment generating function:

E[X^k] = (α)_k / (α + β)_k

where (x)_k is a Pochhammer symbol representing the rising factorial. It can also be written in a recursive form as

E[X^k] = [(α + k − 1)/(α + β + k − 1)] E[X^(k−1)], with E[X^0] = 1

Moments of transformed random variables

Moments of linearly transformed, product and inverted random variables

One can also show the following expectations for a transformed random variable,[7] where the random variable X is Beta-distributed with parameters α and β: X ~ Beta(α, β). The expected value of the variable (1 − X) is the mirror-symmetry of the expected value based on X:

E[1 − X] = β/(α + β) = 1 − E[X]

Due to the mirror-symmetry of the probability density function of the beta distribution, the variances based on the variables X and (1 − X) are identical, and the covariance of X and (1 − X) is the negative of the variance:

var[1 − X] = var[X], cov[X, 1 − X] = −var[X]

These are the expected values for inverted variables (these are related to the harmonic means, see section titled "Harmonic mean"):

E[1/X] = (α + β − 1)/(α − 1) if α > 1
E[1/(1 − X)] = (α + β − 1)/(β − 1) if β > 1

The following transformation, dividing the variable X by its mirror-image (1 − X), results in the expected value of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):[7]

E[X/(1 − X)] = α/(β − 1) if β > 1
E[(1 − X)/X] = β/(α − 1) if α > 1

Variances of these transformed variables can be obtained by integration, as the expected values of the second moments centered around the corresponding variables; for example:

var[1/X] = β(α + β − 1) / ((α − 1)²(α − 2)) if α > 2

The variance of the variable X divided by its mirror-image, X/(1 − X), is the variance of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):[7]

var[X/(1 − X)] = α(α + β − 1) / ((β − 1)²(β − 2)) if β > 2

The covariances of these transformed variables can be obtained similarly. These expectations and variances appear in the four-parameter Fisher information matrix (section titled "Fisher information", "four parameters").

Moments of logarithmically transformed random variables

[Plot: logit(X) = ln(X/(1 − X)) (vertical axis) vs. X in the domain of 0 to 1 (horizontal axis)]

Expected values for logarithmic transformations (useful for maximum likelihood estimates, see section titled "Parameter estimation, Maximum likelihood" below) are discussed in this section.
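Before turning to the logarithmic moments, the raw-moment recursion and the 1F1 form of the moment generating function above can be verified numerically. A minimal sketch, assuming SciPy is available (scipy.special.hyp1f1 evaluates 1F1 for real arguments):

# Raw moments via the rising-factorial recursion, checked against SciPy's
# beta.moment, plus a numerical check of the MGF M(t) = 1F1(a; a+b; t).
import numpy as np
from scipy.stats import beta
from scipy.special import hyp1f1
from scipy.integrate import quad

a, b = 2.0, 3.0

# E[X^k] = prod_{r=0}^{k-1} (a + r)/(a + b + r)  (recursive form)
m, moments = 1.0, []
for r in range(4):
    m *= (a + r) / (a + b + r)
    moments.append(m)
print("recursion:", moments)
print("scipy    :", [beta.moment(k, a, b) for k in range(1, 5)])

t = 0.7
mgf_closed = hyp1f1(a, a + b, t)
mgf_numeric, _ = quad(lambda x: np.exp(t * x) * beta.pdf(x, a, b), 0, 1)
print(f"MGF: 1F1={mgf_closed:.8f}, integral={mgf_numeric:.8f}")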
The following logarithmic linear transformations are related to the geometric means G_X and G_(1−X) (see section titled "Geometric mean"):

E[ln X] = ψ(α) − ψ(α + β) = ln G_X
E[ln(1 − X)] = ψ(β) − ψ(α + β) = ln G_(1−X)

where the digamma function ψ(α) is defined as the logarithmic derivative of the gamma function:[24]

ψ(α) = d ln Γ(α)/dα

Logit transformations are interesting,[28] as they usually transform various shapes (including J-shapes) into (usually skewed) bell-shaped densities over the logit variable, and they may remove the end singularities over the original variable:

E[ln(X/(1 − X))] = ψ(α) − ψ(β)

Johnson[29] considered the distribution of the logit-transformed variable ln(X/(1 − X)), including its moment generating function and approximations for large values of the shape parameters. This transformation extends the finite support [0, 1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞).

Higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two Gamma distributions and differentiating through the integral. They can be expressed in terms of higher order poly-gamma functions. The variances of the logarithmic variables and the covariance of ln(X) and ln(1−X) are:

var[ln X] = ψ1(α) − ψ1(α + β)
var[ln(1 − X)] = ψ1(β) − ψ1(α + β)
cov[ln X, ln(1 − X)] = −ψ1(α + β)

where the trigamma function, denoted ψ1(α), is the second of the polygamma functions, and is defined as the derivative of the digamma function: ψ1(α) = dψ(α)/dα. The variances and covariance of the logarithmically transformed variables X and (1−X) are different, in general, because the logarithmic transformation destroys the mirror-symmetry of the original variables X and (1−X), as the logarithm approaches negative infinity for the variable approaching zero. These logarithmic variances and covariance are the elements of the Fisher information matrix for the beta distribution. They are also a measure of the curvature of the log likelihood function (see section on Maximum likelihood estimation). The variances of the log inverse variables are identical to the variances of the log variables:

var[ln(1/X)] = var[ln X], var[ln(1/(1 − X))] = var[ln(1 − X)]

It also follows that the variances of the logit transformed variables are:

var[logit(X)] = var[ln X] + var[ln(1 − X)] − 2 cov[ln X, ln(1 − X)] = ψ1(α) + ψ1(β)

Quantities of information (entropy)

Given a beta distributed random variable, X ~ Beta(α, β), the differential entropy of X (measured in nats) is[30] the expected value of the negative of the logarithm of the probability density function:

h(X) = −E[ln f(X; α, β)] = ln B(α, β) − (α − 1)ψ(α) − (β − 1)ψ(β) + (α + β − 2)ψ(α + β)

where f(x; α, β) is the probability density function of the beta distribution:

f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β)

The digamma function ψ appears in the formula for the differential entropy as a consequence of Euler's integral formula for the harmonic numbers, which follows from the integral:

ψ(α) = −γ + ∫_0^1 (1 − x^(α−1))/(1 − x) dx

(where γ is the Euler–Mascheroni constant). The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential entropy reaches its maximum value of zero. It is to be expected that the maximum entropy should take place when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable. For α or β approaching zero, the differential entropy approaches its minimum value of negative infinity. For (either or both) α or β approaching zero, there is a maximum amount of order: all the probability density is concentrated at the ends, and there is zero probability density at points located between the ends. Similarly for (either or both) α or β approaching infinity, the differential entropy approaches its minimum value of negative infinity, and a maximum amount of order.
If either α or β approaches infinity (and the other is finite) all the probability density is concentrated at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero probability everywhere else.

The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part[31] of the same paper where he defined the discrete entropy. It has been known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, therefore the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy.

Given two beta distributed random variables, X1 ~ Beta(α, β) and X2 ~ Beta(α′, β′), the cross entropy is (measured in nats)[32]

H(X1, X2) = ln B(α′, β′) − (α′ − 1)ψ(α) − (β′ − 1)ψ(β) + (α′ + β′ − 2)ψ(α + β)

The cross entropy has been used as an error metric to measure the distance between two hypotheses.[33][34] Its absolute value is minimum when the two distributions are identical. It is the information measure most closely related to the log maximum likelihood[32] (see section on "Parameter estimation. Maximum likelihood estimation").

The relative entropy, or Kullback–Leibler divergence DKL(X1, X2), is a measure of the inefficiency of assuming that the distribution is X2 ~ Beta(α′, β′) when the distribution is really X1 ~ Beta(α, β). It is defined as follows (measured in nats):

DKL(X1, X2) = ln(B(α′, β′)/B(α, β)) + (α − α′)ψ(α) + (β − β′)ψ(β) + (α′ − α + β′ − β)ψ(α + β)

The relative entropy, or Kullback–Leibler divergence, is always non-negative. A few numerical examples follow:

X1 ~ Beta(1, 1) and X2 ~ Beta(3, 3); DKL(X1, X2) = 0.598803; DKL(X2, X1) = 0.267864; h(X1) = 0; h(X2) = −0.267864
X1 ~ Beta(3, 0.5) and X2 ~ Beta(0.5, 3); DKL(X1, X2) = 7.21574; DKL(X2, X1) = 7.21574; h(X1) = −1.10805; h(X2) = −1.10805

The Kullback–Leibler divergence is not symmetric, DKL(X1, X2) ≠ DKL(X2, X1), for the case in which the individual beta distributions Beta(1, 1) and Beta(3, 3) are symmetric but have different entropies h(X1) ≠ h(X2). The value of the Kullback divergence depends on the direction traveled: whether going from a higher (differential) entropy to a lower (differential) entropy or the other way around. In the numerical example above, the Kullback divergence measures the inefficiency of assuming that the distribution is (bell-shaped) Beta(3, 3), rather than (uniform) Beta(1, 1). The "h" entropy of Beta(1, 1) is higher than the "h" entropy of Beta(3, 3) because the uniform distribution Beta(1, 1) has a maximum amount of disorder. The Kullback divergence is more than two times higher (0.598803 instead of 0.267864) when measured in the direction of decreasing entropy: the direction that assumes that the (uniform) Beta(1, 1) distribution is (bell-shaped) Beta(3, 3), rather than the other way around. In this restricted sense, the Kullback divergence is consistent with the second law of thermodynamics.

The Kullback–Leibler divergence is symmetric, DKL(X1, X2) = DKL(X2, X1), for the skewed cases Beta(3, 0.5) and Beta(0.5, 3) that have equal differential entropy h(X1) = h(X2). The symmetry condition

DKL(Beta(α, β), Beta(α′, β′)) = DKL(Beta(β, α), Beta(β′, α′))

follows from the above definitions and the mirror-symmetry f(x; α, β) = f(1 − x; β, α) enjoyed by the beta distribution.
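The digamma-based formulas above (E[ln X], the differential entropy, and the Kullback–Leibler divergence) can be reproduced in a few lines. A sketch, assuming SciPy; the helper names are ours, and the printed values should match the numerical examples quoted above:

# E[ln X], differential entropy and KL divergence for beta distributions.
import numpy as np
from scipy.special import betaln, digamma as psi
from scipy.integrate import quad
from scipy.stats import beta

def entropy_beta(a, b):
    # h(X) = ln B(a,b) - (a-1)psi(a) - (b-1)psi(b) + (a+b-2)psi(a+b)
    return betaln(a, b) - (a - 1)*psi(a) - (b - 1)*psi(b) + (a + b - 2)*psi(a + b)

def kl_beta(a1, b1, a2, b2):
    # D_KL(Beta(a1,b1) || Beta(a2,b2))
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2)*psi(a1) + (b1 - b2)*psi(b1)
            + (a2 - a1 + b2 - b1)*psi(a1 + b1))

# E[ln X] = psi(a) - psi(a+b), checked by numerical integration:
a, b = 2.0, 3.0
num, _ = quad(lambda x: np.log(x) * beta.pdf(x, a, b), 0, 1)
print(psi(a) - psi(a + b), num)

print(entropy_beta(1, 1), entropy_beta(3, 3))    # 0 and -0.267864
print(kl_beta(1, 1, 3, 3), kl_beta(3, 3, 1, 1))  # 0.598803 and 0.267864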
Relationships between statistical measures

Mean, mode and median relationship

If 1 < α < β then mode ≤ median ≤ mean.[17] Expressing the mode (only for α > 1 and β > 1) and the mean in terms of α and β:

mode = (α − 1)/(α + β − 2), mean = α/(α + β)

If 1 < β < α then the order of the inequalities is reversed. For α > 1 and β > 1 the absolute distance between the mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum and minimum values of x for the (pathological) case of α = 1 and β = 1 (for which values the beta distribution approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum "disorder"). For example, for α = 1.0001 and β = 1.00000001:

mode = 0.9999; PDF(mode) = 1.00010
mean = 0.500025; PDF(mean) = 1.00003
median = 0.500035; PDF(median) = 1.00003
mean − mode = −0.499875
mean − median = −9.65538 × 10⁻⁶

(where PDF stands for the value of the probability density function).

Mean, geometric mean and harmonic mean relationship

[Plot: mean, median, geometric mean and harmonic mean for the beta distribution with 0 < α = β < 5]

It is known from the inequality of arithmetic and geometric means that the geometric mean is lower than the mean. Similarly, the harmonic mean is lower than the geometric mean. The accompanying plot shows that for α = β, both the mean and the median are exactly equal to 1/2, regardless of the value of α = β, and the mode is also equal to 1/2 for α = β > 1; however, the geometric and harmonic means are lower than 1/2, and they only approach this value asymptotically as α = β → ∞.

Kurtosis bounded by the square of the skewness

[Plot: beta distribution α and β parameters vs. excess kurtosis and squared skewness]

As remarked by Feller,[11] in the Pearson system the beta probability density appears as type I (any difference between the beta distribution and Pearson's type I distribution is only superficial and it makes no difference for the following discussion regarding the relationship between kurtosis and skewness). Karl Pearson showed, in Plate 1 of his paper[26] published in 1916, a graph with the kurtosis as the vertical axis (ordinate) and the square of the skewness as the horizontal axis (abscissa), in which a number of distributions were displayed.[35] The region occupied by the beta distribution is bounded by the following two lines in the (skewness², kurtosis) plane:

skewness² + 1 ≤ kurtosis ≤ (3/2) skewness² + 3

or, equivalently, in the (skewness², excess kurtosis) plane:

skewness² − 2 ≤ excess kurtosis ≤ (3/2) skewness²

At a time when there were no powerful digital computers, Karl Pearson accurately computed further boundaries,[10][26] for example, separating the "U-shaped" from the "J-shaped" distributions. The lower boundary line (excess kurtosis + 2 − skewness² = 0) is produced by skewed "U-shaped" beta distributions with both values of the shape parameters α and β close to zero. The upper boundary line (excess kurtosis − (3/2) skewness² = 0) is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. Karl Pearson showed[26] that this upper boundary line (excess kurtosis − (3/2) skewness² = 0) is also the intersection with Pearson's type III distribution, which has unlimited support in one direction (towards positive infinity), and can be bell-shaped or J-shaped.
His son, Egon Pearson, showed[35] that the region (in the kurtosis/squared-skewness plane) occupied by the beta distribution (equivalently, Pearson's distribution I) as it approaches this boundary (excess kurtosis − (3/2) skewness² = 0) is shared with the noncentral chi-squared distribution. Karl Pearson[36] (Pearson 1895, pp. 357, 360, 373–376) also showed that the gamma distribution is a Pearson type III distribution. Hence this boundary line for Pearson's type III distribution is known as the gamma line. (This can be shown from the fact that the excess kurtosis of the gamma distribution is 6/k and the square of the skewness is 4/k, hence (excess kurtosis − (3/2) skewness² = 0) is identically satisfied by the gamma distribution regardless of the value of the parameter "k".) Pearson later noted that the chi-squared distribution is a special case of Pearson's type III and also shares this boundary line (as is apparent from the fact that for the chi-squared distribution the excess kurtosis is 12/k and the square of the skewness is 8/k, hence (excess kurtosis − (3/2) skewness² = 0) is identically satisfied regardless of the value of the parameter "k"). This is to be expected, since the chi-squared distribution X ~ χ²(k) is a special case of the gamma distribution, with parametrization X ~ Γ(k/2, 1/2) where k is a positive integer that specifies the "number of degrees of freedom" of the chi-squared distribution.

An example of a beta distribution near the upper boundary (excess kurtosis − (3/2) skewness² = 0) is given by α = 0.1, β = 1000, for which the ratio (excess kurtosis)/(skewness²) = 1.49835 approaches the upper limit of 1.5 from below. An example of a beta distribution near the lower boundary (excess kurtosis + 2 − skewness² = 0) is given by α = 0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness²) = 1.01621 approaches the lower limit of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis reaches its minimum value of −2. This minimum value occurs at the point at which the lower boundary line intersects the vertical axis (ordinate). (However, in Pearson's original chart, the ordinate is kurtosis, instead of excess kurtosis, and it increases downwards rather than upwards.)

Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region." The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal "U"-shaped distributions for which the parameters α and β approach zero and hence all the probability density is concentrated at the ends: x = 0 and x = 1, with practically nothing in between them. Since for α ≈ β ≈ 0 the probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a 2-point distribution: the probability can only take 2 values (Bernoulli distribution), one value with probability p and the other with probability q = 1 − p. For cases approaching this limit boundary with symmetry α = β, skewness ≈ 0, excess kurtosis ≈ −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are p ≈ q ≈ 1/2.
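Both numerical examples can be reproduced directly from the skewness and excess kurtosis; a short sketch, assuming SciPy:

# Reproduce the two boundary examples: the ratio of excess kurtosis to squared
# skewness approaches 3/2 near the "gamma line", and (excess kurtosis + 2)
# over squared skewness approaches 1 near the impossible boundary.
from scipy.stats import beta

def skew_exkurt(a, b):
    s, k = beta.stats(a, b, moments='sk')
    return float(s), float(k)

s, k = skew_exkurt(0.1, 1000)
print(k / s**2)        # ~1.49835, just below the upper limit 3/2

s, k = skew_exkurt(0.0001, 0.1)
print((k + 2) / s**2)  # ~1.01621, just above the lower limit 1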
For cases approaching this limit boundary with skewness, excess kurtosis ≈ −2 + skewness², and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities p at the left end x = 0 and q = 1 − p at the right end x = 1.

Symmetry

All statements are conditional on α, β > 0. In the following, reflection refers to the mirror-image parameter exchange (α, β) → (β, α) together with x → 1 − x:

- Probability density function reflection symmetry: f(x; α, β) = f(1 − x; β, α)
- Cumulative distribution function reflection symmetry plus unitary translation: F(x; α, β) = 1 − F(1 − x; β, α)
- Mode reflection symmetry plus unitary translation: mode(Beta(α, β)) = 1 − mode(Beta(β, α)), for α, β > 1
- Median reflection symmetry plus unitary translation: median(Beta(α, β)) = 1 − median(Beta(β, α))
- Mean reflection symmetry plus unitary translation: μ(Beta(α, β)) = 1 − μ(Beta(β, α))
- Geometric means: each is individually asymmetric; the following symmetry applies between the geometric mean based on X and the geometric mean based on its reflection (1 − X): G_X(Beta(α, β)) = G_(1−X)(Beta(β, α))
- Harmonic means: each is individually asymmetric; the following symmetry applies between the harmonic mean based on X and the harmonic mean based on its reflection (1 − X): H_X(Beta(α, β)) = H_(1−X)(Beta(β, α))
- Variance symmetry: var(Beta(α, β)) = var(Beta(β, α))
- Geometric variances: each is individually asymmetric; the following symmetry applies between the log geometric variance based on X and the log geometric variance based on its reflection (1 − X)
- Geometric covariance symmetry under the exchange (α, β) → (β, α)
- Mean absolute deviation around the mean symmetry: E[|X − E[X]|] is the same for Beta(α, β) and Beta(β, α)
- Skewness skew-symmetry: skewness(Beta(α, β)) = −skewness(Beta(β, α))
- Excess kurtosis symmetry: excess kurtosis(Beta(α, β)) = excess kurtosis(Beta(β, α))
- Characteristic function: symmetry of the real part (with respect to the origin of the variable t)
- Characteristic function: skew-symmetry of the imaginary part (with respect to the origin of the variable t)
- Characteristic function: symmetry of the absolute value (with respect to the origin of the variable t)
- Differential entropy symmetry: h(Beta(α, β)) = h(Beta(β, α))
- Relative entropy (Kullback–Leibler divergence) symmetry: DKL(Beta(α, β), Beta(α′, β′)) = DKL(Beta(β, α), Beta(β′, α′))
- Fisher information matrix symmetry under the exchange (α, β) → (β, α)

Geometry of the probability distribution function

Inflection points

[Plots: inflection point location versus α and β, showing the regions with one inflection point and the region with two inflection points]

For certain values of the shape parameters α and β, the probability density function has inflection points, at which the curvature changes sign. The position of these inflection points can be useful as a measure of the dispersion or spread of the distribution. Defining the following quantity:

κ = √((α − 1)(β − 1)/(α + β − 3)) / (α + β − 2)

points of inflection occur,[7][9][12][13] depending on the value of the shape parameters α and β, as follows:

- (α > 2, β > 2): The distribution is bell-shaped (symmetric for α = β and skewed otherwise), with two inflection points, equidistant from the mode: x = mode ± κ
- (α = 2, β > 2): The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode
- (α > 2, β = 2): The distribution is unimodal, negatively skewed, left-tailed, with one inflection point, located to the left of the mode
- (1 < α < 2, β > 2): The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode
- (0 < α < 1, 1 < β < 2): The distribution has a mode at the left end x = 0 and is positively skewed, right-tailed, with one inflection point, located to the right of the mode
- (α > 2, 1 < β < 2): The distribution is unimodal, negatively skewed, left-tailed, with one inflection point, located to the left of the mode
- (1 < α < 2, 0 < β < 1): The distribution has a mode at the right end x = 1 and is negatively skewed, left-tailed.
There is one inflection point, located to the left of the mode.

There are no inflection points in the remaining (symmetric and skewed) regions: U-shaped (α < 1, β < 1), upside-down-U-shaped (1 < α < 2, 1 < β < 2), reverse-J-shaped (α < 1, β > 2) or J-shaped (α > 2, β < 1).

The accompanying plots show the inflection point locations (shown vertically, ranging from 0 to 1) versus α and β (the horizontal axes ranging from 0 to 5). There are large cuts at surfaces intersecting the lines α = 1, β = 1, α = 2, and β = 2 because at these values the beta distribution changes from two modes, to one mode, to no mode.

Shapes

[Plots: PDF of the symmetric beta distribution vs. x with α = β ranging from 0 to 30 and from 0 to 2; PDF of the skewed beta distribution vs. x with β = 2.5, 5.5 and 8, and α ranging from 0 to about 10]

The beta density function can take a wide variety of different shapes depending on the values of the two parameters α and β. The ability of the beta distribution to take this great diversity of shapes (using only two parameters) is partly responsible for finding wide application for modeling actual measurements.

Symmetric (α = β): the density function is symmetric about 1/2 (blue & teal plots); median = mean = 1/2; skewness = 0.

α = β < 1
  o U-shaped (blue plot)
  o bimodal: left mode = 0, right mode = 1, anti-mode = 1/2
  o 1/12 < var(X) < 1/4[7]
  o −2 < excess kurtosis(X) < −6/5
  o α = β = 1/2 is the arcsine distribution: var(X) = 1/8, excess kurtosis(X) = −3/2
  o α = β → 0 is a 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1, and zero probability everywhere else (a coin toss: one face of the coin being x = 0 and the other face being x = 1); excess kurtosis(X) approaches −2, and a lower value than this is impossible for any distribution to reach; the differential entropy approaches a minimum value of −∞

α = β = 1
  o the uniform [0, 1] distribution
  o no mode
  o var(X) = 1/12
  o excess kurtosis(X) = −6/5
  o the differential entropy (negative for all other parameter values) reaches its maximum value of zero

α = β > 1
  o symmetric unimodal
  o mode = 1/2
  o 0 < var(X) < 1/12[7]
  o −6/5 < excess kurtosis(X) < 0
  o α = β = 3/2 is a semi-elliptic [0, 1] distribution, see: Wigner semicircle distribution; var(X) = 1/16, excess kurtosis(X) = −1
  o α = β = 2 is the parabolic [0, 1] distribution; var(X) = 1/20, excess kurtosis(X) = −6/7
  o α = β > 2 is bell-shaped, with inflection points located to either side of the mode; 0 < var(X) < 1/20, −6/7 < excess kurtosis(X) < 0
  o α = β → ∞ is a 1-point degenerate distribution with a Dirac delta function spike at the midpoint x = 1/2 with probability 1, and zero probability everywhere else; there is 100% probability (absolute certainty) concentrated at the single point x = 1/2; the differential entropy approaches a minimum value of −∞

Skewed (α ≠ β): the density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of the initial curve. Some more specific cases:

α < 1, β < 1
  o U-shaped
  o positive skew for α < β, negative skew for α > β
  o bimodal: left mode = 0, right mode = 1, anti-mode = (α − 1)/(α + β − 2)
  o 0 < median < 1
  o 0 < var(X) < 1/4

α > 1, β > 1
  o unimodal (magenta & cyan plots)
  o positive skew for α < β, negative skew for α > β
  o mode = (α − 1)/(α + β − 2)
  o 0 < median < 1
  o 0 < var(X) < 1/12

α < 1, β ≥ 1
  o reverse J-shaped with a right tail
  o positively skewed
  o strictly decreasing, convex
  o mode = 0
  o 0 < median < 1/2
  o maximum variance occurs for α = Φ = (√5 − 1)/2 ≈ 0.618 and β = 1, where Φ is the golden ratio conjugate

α ≥ 1, β < 1
  o J-shaped with a left tail
  o negatively skewed
  o strictly increasing, convex
  o mode = 1
  o 1/2 < median < 1
  o maximum variance occurs for α = 1 and β = Φ = (√5 − 1)/2 ≈ 0.618, the golden ratio conjugate

α = 1, β > 1
  o positively skewed
  o strictly decreasing (red plot)
  o a reversed (mirror-image) power function [0, 1] distribution
  o mode = 0
  o α = 1, 1 < β < 2: concave; 1/18 < var(X) < 1/12
  o α = 1, β = 2: a straight line with slope −2, the right-triangular distribution with the right angle at the left end, at x = 0; var(X) = 1/18
  o α = 1, β > 2: reverse J-shaped with a right tail, convex; 0 < var(X) < 1/18

α > 1, β = 1
  o negatively skewed
  o strictly increasing (green plot)
  o the power function [0, 1] distribution[12]
  o mode = 1
  o 2 > α > 1, β = 1: concave; 1/18 < var(X) < 1/12
  o α = 2, β = 1: a straight line with slope +2, the right-triangular distribution with the right angle at the right end, at x = 1; var(X) = 1/18
  o α > 2, β = 1: J-shaped with a left tail, convex; 0 < var(X) < 1/18

Parameter estimation

Method of moments

Two unknown parameters

Two unknown parameters (α and β, of a beta distribution supported on the [0, 1] interval) can be estimated, using the method of moments, with the first two moments (sample mean and sample variance) as follows. Let x̄ be the sample mean estimate and v̄ be the sample variance estimate. The method-of-moments estimates of the parameters are

α̂ = x̄ (x̄(1 − x̄)/v̄ − 1), conditional on v̄ < x̄(1 − x̄)
β̂ = (1 − x̄)(x̄(1 − x̄)/v̄ − 1), conditional on v̄ < x̄(1 − x̄)

When the distribution is required over a known interval other than [0, 1], say [a, c] with random variable Y instead of X, then replace x̄ with (ȳ − a)/(c − a), and v̄ with v̄_Y/(c − a)², in the above couple of equations for the shape parameters (see the "Alternative parametrizations, four parameters" section below).[37] A short code sketch of this two-parameter fit is given after the four-parameter discussion below.

Four unknown parameters

[Plot: solutions for the parameter estimates vs. (sample) excess kurtosis and (sample) squared skewness of the beta distribution]

All four parameters (α, β, and the support endpoints a and c, of a beta distribution supported on the [a, c] interval; see section "Alternative parametrizations, Four parameters") can be estimated, using the method of moments developed by Karl Pearson, by equating sample and population values of the first four central moments (mean, variance, skewness and excess kurtosis).[7][38][39] The excess kurtosis was expressed in terms of the square of the skewness and the sample size ν = α + β (see previous section "Kurtosis") as follows:

excess kurtosis = (6/(ν + 3)) [((ν + 2)/4)(skewness)² − 1]

One can use this equation to solve for the sample size ν = α + β in terms of the square of the skewness and the excess kurtosis as follows:[38]

ν = α + β = 3 (excess kurtosis + 2 − skewness²) / ((3/2) skewness² − excess kurtosis)

This is the ratio (multiplied by a factor of 3) between the previously derived limit boundaries for the beta distribution in a space (as originally done by Karl Pearson[26]) defined with coordinates of the square of the skewness in one axis and the excess kurtosis in the other axis (see previous section titled "Kurtosis bounded by the square of the skewness").

The case of zero skewness can be immediately solved, because for zero skewness α = β and hence ν = 2α = 2β, therefore α = β = ν/2. (Excess kurtosis is negative for the beta distribution with zero skewness, ranging from −2 to 0, so that ν, and therefore the sample shape parameters, is positive, ranging from zero when the shape parameters approach zero and the excess kurtosis approaches −2, to infinity when the shape parameters approach infinity and the excess kurtosis approaches zero.) For non-zero sample skewness one needs to solve a system of two coupled equations.
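Returning to the two-unknown-parameters case above: the method-of-moments fit on [0, 1] takes only a few lines. A minimal sketch, assuming NumPy and SciPy; the true parameters (2, 5) are an arbitrary choice for illustration:

# Two-parameter method-of-moments fit for a beta distribution on [0, 1].
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
x = beta.rvs(2.0, 5.0, size=10_000, random_state=rng)

xbar = x.mean()
vbar = x.var()  # sample variance; the fit requires vbar < xbar*(1 - xbar)

common = xbar * (1 - xbar) / vbar - 1
alpha_hat = xbar * common
beta_hat = (1 - xbar) * common
print(alpha_hat, beta_hat)  # should be near (2, 5)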
Since the skewness and the excess kurtosis are independent of the support parameters a and c, the shape parameters α and β can be uniquely determined from the sample skewness and the sample excess kurtosis, by solving the coupled equations with two known variables (sample skewness and sample excess kurtosis) and two unknowns (the shape parameters). This results in a pair of solutions,[38] where one should take one of the solutions for (negative) sample skewness < 0, and the other for (positive) sample skewness > 0.

The accompanying plot shows these two solutions as surfaces in a space with horizontal axes of (sample excess kurtosis) and (sample squared skewness), and the shape parameters as the vertical axis. The surfaces are constrained by the condition that the sample excess kurtosis must be bounded by the sample squared skewness as stipulated in the above equation. The two surfaces meet at the right edge defined by zero skewness. Along this right edge, both parameters are equal and the distribution is symmetric: U-shaped for α = β < 1, uniform for α = β = 1, upside-down-U-shaped for 1 < α = β < 2 and bell-shaped for α = β > 2. The surfaces also meet at the front (lower) edge defined by the "impossible boundary" line (excess kurtosis + 2 − skewness² = 0). Along this front (lower) boundary both shape parameters approach zero, and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities p at the left end x = 0 and q = 1 − p at the right end x = 1. The two surfaces become further apart towards the rear edge. At this rear edge the two solutions for the shape parameters are quite different from each other.

As remarked, for example, by Bowman and Shenton,[40] sampling in the neighborhood of the line (sample excess kurtosis − (3/2)(sample skewness)² = 0) (the just-J-shaped portion of the rear edge where blue meets beige) "is dangerously near to chaos", because at that line the denominator of the expression above for the estimate ν = α + β becomes zero, and hence ν approaches infinity as that line is approached. Bowman and Shenton[40] write that "the higher moment parameters (kurtosis and skewness) are extremely fragile (near that line). However the mean and standard deviation are fairly reliable." Therefore the problem arises in four-parameter estimation for very skewed distributions such that the excess kurtosis approaches (3/2) times the square of the skewness. This boundary line is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. See the section titled "Kurtosis bounded by the square of the skewness" for a numerical example and further comments about this rear edge boundary line (sample excess kurtosis − (3/2)(sample skewness)² = 0). As remarked by Karl Pearson himself,[41] this issue may not be of much practical importance, as this trouble arises only for very skewed J-shaped (or mirror-image J-shaped) distributions with very different values of the shape parameters that are unlikely to occur much in practice. The usual skewed bell-shaped distributions that occur in practice do not have this parameter estimation problem.

The remaining two parameters can be determined using the sample mean and the sample variance using a variety of equations.[7][38] One alternative is to calculate the support interval range (ĉ − â) based on the sample variance and the sample kurtosis.
For this purpose one can solve, in terms of the range (ĉ − â), the equation expressing the excess kurtosis in terms of the sample variance and the sample size ν (see sections titled "Kurtosis" and "Alternative parametrizations, four parameters") to obtain the range. Another alternative is to calculate the support interval range (ĉ − â) based on the sample variance and the sample skewness:[38] for this purpose one can solve, in terms of the range, the equation expressing the squared skewness in terms of the sample variance and the sample size ν (see sections titled "Skewness" and "Alternative parametrizations, four parameters").[38] The remaining parameter can then be determined from the sample mean and the previously obtained parameters:

â = ȳ − (α̂/(α̂ + β̂)) (ĉ − â)

and finally, of course, ĉ = â + (ĉ − â).

In the above formulas one may take, for example, as estimates of the sample moments, the sample mean ȳ = (1/N) Σ yᵢ and the sample variance v̄ = (1/N) Σ (yᵢ − ȳ)², together with the corresponding moment-ratio estimators of the sample skewness and sample excess kurtosis. The estimators G1 for sample skewness and G2 for sample kurtosis are used by DAP/SAS, PSPP/SPSS, and Excel. However, they are not used by BMDP and (according to [42]) they were not used by MINITAB in 1998. Actually, Joanes and Gill in their 1998 study[42] concluded that the skewness and kurtosis estimators used in BMDP and in MINITAB (at that time) had smaller variance and mean-squared error in normal samples, but the skewness and kurtosis estimators used in DAP/SAS, PSPP/SPSS, namely G1 and G2, had smaller mean-squared error in samples from a very skewed distribution. It is for this reason that we have spelled out "sample skewness", etc., in the above formulas, to make it explicit that the user should choose the best estimator according to the problem at hand, as the best estimator for skewness and kurtosis depends on the amount of skewness (as shown by Joanes and Gill[42]).

Q.8: Land Degradation?

Land Degradation: An Overview

H. Eswaran, R. Lal and P. F. Reich

Published in: Eswaran, H., R. Lal and P.F. Reich. 2001. Land degradation: an overview. In: Bridges, E.M., I.D. Hannam, L.R. Oldeman, F.W.T. Pening de Vries, S.J. Scherr, and S. Sompatpanit (eds.). Responses to Land Degradation. Proc. 2nd International Conference on Land Degradation and Desertification, Khon Kaen, Thailand. Oxford Press, New Delhi, India.

Land degradation will remain an important global issue for the 21st century because of its adverse impact on agronomic productivity, the environment, and its effect on food security and the quality of life. Productivity impacts of land degradation are due to a decline in land quality on site where degradation occurs (e.g. erosion) and off site where sediments are deposited. However, the on-site impacts of land degradation on productivity are easily masked by the use of additional inputs and the adoption of improved technology, and this has led some to question the negative effects of desertification. The relative magnitude of economic losses due to productivity decline versus environmental deterioration has also created a debate. Some economists argue that the on-site impacts of soil erosion and other degradative processes are not severe enough to warrant implementing any action plan at a national or an international level. Land managers (farmers), they argue, should take care of the restorative inputs needed to enhance productivity. Agronomists and soil scientists, on the other hand, argue that land is a non-renewable resource at a human time-scale and that some adverse effects of degradative processes on land quality are irreversible, e.g. reduction in effective rooting depth. The masking effect of improved technology provides a false sense of security.
The productivity of some lands has declined by 50% due to soil erosion and desertification. Yield reduction in Africa due to past soil erosion may range from 2 to 40%, with a mean loss of 8.2% for the continent. In South Asia, annual loss in productivity is estimated at 36 million tons of cereal equivalent, valued at US$5,400 million for water erosion and US$1,800 million for wind erosion. It is estimated that the total annual cost of erosion from agriculture in the USA is about US$44 billion per year, i.e. about US$247 per ha of cropland and pasture. On a global scale, the annual loss of 75 billion tons of soil costs the world about US$400 billion per year, or approximately US$70 per person per year.

Only about 3% of the global land surface can be considered as prime or Class I land, and this is not found in the tropics. Another 8% of land is in Classes II and III. This 11% of land must feed the six billion people of today and the 7.6 billion expected in 2020. Desertification is experienced on 33% of the global land surface and affects more than one billion people, half of whom live in Africa.

Land degradation, a decline in land quality caused by human activities, has been a major global issue during the 20th century and will remain high on the international agenda in the 21st century. The importance of land degradation among global issues is enhanced because of its impact on world food security and the quality of the environment. High population density is not necessarily related to land degradation; it is what a population does to the land that determines the extent of degradation. People can be a major asset in reversing a trend towards degradation. However, they need to be healthy and politically and economically motivated to care for the land, as subsistence agriculture, poverty, and illiteracy can be important causes of land and environmental degradation.

Land degradation can be considered in terms of the loss of actual or potential productivity or utility as a result of natural or anthropic factors; it is the decline in land quality or reduction in its productivity. In the context of productivity, land degradation results from a mismatch between land quality and land use (Beinroth et al., 1994). Mechanisms that initiate land degradation include physical, chemical, and biological processes (Lal, 1994). Important among physical processes are a decline in soil structure leading to crusting, compaction, erosion, desertification, anaerobism, environmental pollution, and unsustainable use of natural resources. Significant chemical processes include acidification, leaching, salinization, decrease in cation retention capacity, and fertility depletion. Biological processes include reduction in total and biomass carbon, and decline in land biodiversity. The latter comprises important concerns related to eutrophication of surface water, contamination of groundwater, and emissions of trace gases (CO2, CH4, N2O, NOx) from terrestrial/aquatic ecosystems to the atmosphere. Soil structure is the important property that affects all three degradative processes. Thus, land degradation is a biophysical process driven by socioeconomic and political causes.

Factors of land degradation are the biophysical processes and attributes that determine the kind of degradative processes, e.g. erosion, salinization, etc. These include land quality (Eswaran et al., 2000) as affected by its intrinsic properties of climate, terrain and landscape position, climax vegetation, and biodiversity, especially soil biodiversity.
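The global cost figures quoted above follow from simple arithmetic. A sketch (assuming the US$3 per ton nutrient and US$2 per ton water costs cited later in this section, and a world population of roughly 5.7 billion at the time the paper was written):

# Back-of-the-envelope check of the quoted global erosion cost figures.
soil_loss_tons = 75e9
cost_per_ton = 3 + 2                    # US$ per ton (nutrients + water)
total = soil_loss_tons * cost_per_ton   # 3.75e11, i.e. roughly US$400 billion
per_person = total / 5.7e9              # roughly US$70 per person per year
print(f"total: ${total/1e9:.0f} billion, per person: ${per_person:.0f}")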
Causes of land degradation are the agents that determine the rate of degradation. These are biophysical (land use and land management, including deforestation and tillage methods), socioeconomic (e.g. land tenure, marketing, institutional support, income and human health), and political (e.g. incentives, political stability) forces that influence the effectiveness of processes and factors of land degradation.

Depending on their inherent characteristics and the climate, lands vary from highly resistant, or stable, to those that are vulnerable and extremely sensitive to degradation. Fragility, extreme sensitivity to degradation processes, may refer to the whole land, a degradation process (e.g. erosion) or a property (e.g. soil structure). Stable or resistant lands do not necessarily resist change. They are in a stable steady state condition with the new environment. Under stress, fragile lands degrade to a new steady state, and the altered state is unfavorable to plant growth and less capable of performing environmental regulatory functions.

Effects of land degradation on productivity

Information on the economic impact of land degradation by different processes on a global scale is not available. Some information for local and regional scales is available and has been reviewed by Lal (1998). In Canada, for example, on-farm effects of land degradation were estimated to range from US$700 to US$915 million in 1984 (Girt, 1986). The economic impact of land degradation is extremely severe in densely populated South Asia and sub-Saharan Africa.

On plot and field scales, erosion can cause yield reductions of 30 to 90% in some root-restrictive shallow lands of West Africa (Mbagwu et al., 1984; Lal, 1987). Yield reductions of 20 to 40% have been measured for row crops in Ohio (Fahnestock et al., 1995) and elsewhere in the Midwest USA (Schumacher et al., 1994). In the Andean region of Colombia, workers from the University of Hohenheim, Germany (Ruppenthal, 1995), have observed severe losses due to accelerated erosion on some lands.

Few attempts have been made to assess the global economic impact of erosion. The productivity of some lands in Africa (Dregne, 1990) has declined by 50% as a result of soil erosion and desertification. Yield reduction in Africa (Lal, 1995) due to past soil erosion may range from 2 to 40%, with a mean loss of 8.2% for the continent. If accelerated erosion continues unabated, yield reductions by 2020 may be 16.5%. Annual reduction in total production for 1989 due to accelerated erosion was 8.2 million tons for cereals, 9.2 million tons for roots and tubers, and 0.6 million tons for pulses. There are also serious (20%) productivity losses caused by erosion in Asia, especially in India, China, Iran, Israel, Jordan, Lebanon, Nepal, and Pakistan (Dregne, 1992).

In South Asia, annual loss in productivity is estimated at 36 million tons of cereal equivalent, valued at US$5,400 million for water erosion and US$1,800 million for wind erosion (UNEP, 1994). It is estimated that the total annual cost of erosion from agriculture in the USA is about US$44 billion per year, about US$247 per ha of cropland and pasture. On a global scale, the annual loss of 75 billion tons of soil costs (at US$3 per ton of soil for nutrients and US$2 per ton of soil for water) the world about US$400 billion per year, or approximately US$70 per person per year (Lal, 1998).

Soil compaction is a worldwide problem, especially with the adoption of mechanized agriculture.
It has caused yield reductions of 25 to 50% in some regions of Europe (Ericksson et al., 1974) and North America, and between 40 and 90% in West African countries (Charreau, 1972; Kayombo and Lal, 1994). In Ohio, reductions in crop yields over a seven-year period were 25% in maize, 20% in soybeans, and 30% in oats (Lal, 1996). On-farm losses through land compaction in the USA have been estimated at US$1.2 billion per year (Gill, 1971).

Nutrient depletion as a form of land degradation has a severe economic impact at the global scale, especially in sub-Saharan Africa. Stoorvogel et al. (1993) have estimated nutrient balances for 38 countries in sub-Saharan Africa. Annual depletion rates of soil fertility were estimated at 22 kg N, 3 kg P, and 15 kg K per ha. In Zimbabwe, soil erosion results in an annual loss of N and P alone totaling US$1.5 billion. In South Asia, the annual economic loss is estimated at US$600 million for nutrient loss by erosion, and US$1,200 million due to soil fertility depletion (Stocking, 1986; UNEP, 1994).

An estimated 950 million ha of salt-affected lands occur in arid and semi-arid regions, nearly 33% of the potentially arable land area of the world. Productivity of irrigated lands is severely threatened by the build-up of salt in the root zone. In South Asia, the annual economic loss is estimated at US$500 million from waterlogging, and US$1,500 million due to salinization (UNEP, 1994). The potential and actual economic impact globally is not known. Nor is it known for soil acidity and the resultant toxicity of high concentrations of Al and Mn in the root zone, a serious problem in sub-humid and humid regions (Eswaran et al., 1997a).

It is in the context of these global economic and environmental impacts of land degradation, and the numerous functions of land that are of value to humans, that the concepts of land degradation, desertification, and resilience are relevant (Eswaran, 1993). They are also important in developing technologies for reversing land degradation trends and mitigating the greenhouse effect through land and ecosystem restoration. As land resources are essentially non-renewable, it is necessary to adopt a positive approach to sustainable management of these finite resources.

Views on land degradation

Land degradation has received widespread debate at the global level, as evidenced by the literature: UNEP, 1992; Johnson and Lewis, 1995; Oldeman et al., 1992; Middleton and Thomas, 1997; Dregne, 1992; Mainguet, 1994; Lal and Stewart, 1994; Lal et al., 1997. At least two distinct schools have emerged regarding the prediction, severity, and impact of land degradation. One school believes that it is a serious global threat posing a major challenge to humans in terms of its adverse impact on biomass productivity and environmental quality (Pimentel et al., 1995; Dregne and Chou, 1994). Ecologists, soil scientists, and agronomists primarily support this argument. The second school, comprising primarily economists, asks why, if land degradation is such a severe issue, market forces have not taken care of it. Its supporters argue that land managers (e.g. farmers) have a vested interest in their land and will not let it degrade to the point that it is detrimental to their profits (Crosson, 1997).

There are a number of factors that perpetuate the debate on land degradation:

1. Definition: There are numerous terms and definitions that are a source of confusion, misunderstanding, and misinterpretation.
A wide range of terms is used in the literature, often with distinct disciplinary-oriented meanings, leading to misinterpretation among disciplines. Some common terms used are soil degradation, land degradation, and desertification. While there is a clear distinction between 'soil' and 'land' (the term land refers to an ecosystem comprising land, landscape, terrain, vegetation, water, and climate), there is no clear distinction between the terms 'land degradation' and 'desertification'. Desertification refers to land degradation in arid, semi-arid, and sub-humid areas due to anthropic activities (UNEP, 1993; Darkoh, 1995). Many researchers argue that this definition of desertification is too narrow, because severe land degradation resulting from anthropic activities can also occur in the temperate humid regions and the humid tropics. The term 'degradation' or 'desertification' refers to an irreversible decline in the 'biological potential' of the land. The 'biological potential' in turn depends on numerous interacting factors and is difficult to define. The confusion is further exacerbated by the definition of 'dryland', for which different definitions are used. It is important to standardize the terminology, and to develop a precise, objective, and unambiguous definition accepted by all disciplines.

2. Extent and rate of land degradation: Because of different definitions and terminology, a large variation in the available statistics on the extent and rate of land degradation also exists. Two principal sources of data are the global estimates of desertification by Dregne and Chou (1994), and of land degradation by the International Soil Reference and Information Centre (Oldeman et al., 1992; Oldeman, 1994). Table 1 shows that degraded lands in dry areas of the world amount to 3.6 billion ha, or 70% of the 5.2 billion ha of total land area considered in these regions. In comparison, Table 2 (Oldeman, 1994) shows that the global extent of land degradation (by all processes and in all ecoregions) is about 1.9 billion ha. The principal difference between the two estimates is the status of vegetation. Although the estimates by Dregne and Chou cover only dry areas, they also include the status of vegetation on the rangeland. Therefore, the estimates in Tables 1 and 2 are not directly comparable. There is also a difference in the terminology used to express the severity of land degradation. Dregne and Chou used the terms slight, moderate, severe, and very severe to designate the severity of degradation. Oldeman used the terms light, moderate, strong, and extreme, and these terms may not be comparable to those of Dregne and Chou. Oldeman et al. (1992), on the basis of expert judgment, attempted to differentiate natural from human-induced degradation. Eswaran and Reich (1998) attempted to evaluate vulnerability to land degradation and desertification. Differences in terminology and approaches, and also in the areas included in the assessments, mean that the three sets of estimates are difficult to compare (Tables 1, 2, and 3).

Table 1. Estimates of all degraded lands (in million km²) in dry areas (Dregne and Chou, 1994).

Continent                   Total area   Degraded area†   % degraded
Africa                      14.326       10.458           73
Asia                        18.814       13.417           71
Australia and the Pacific    7.012        3.759           54
Europe                       1.456        0.943           65
North America                5.782        4.286           74
South America                4.207        3.058           73
Total                       51.597       35.922           70

† Comprises land and vegetation.
Table 2. Estimates of the global extent (in million km²) of land degradation (Oldeman, 1994).

Type                    Light   Moderate   Strong + Extreme   Total
Water erosion           3.43    5.27       2.24               10.94
Wind erosion            2.69    2.54       0.26                5.49
Chemical degradation    0.93    1.03       0.43                2.39
Physical degradation    0.44    0.27       0.12                0.83
Total                   7.49    9.11       3.05               19.65

Table 3. Vulnerability to desertification and to wind and water erosion (Eswaran and Reich, 1998). Only arid, semi-arid, and sub-humid areas (in million km²) are considered, according to the definition of UNEP; estimates of water erosion include humid areas.

Severity    Desertification   Water erosion   Wind erosion
Low         14.653            17.331           9.250
Moderate    13.668            15.373           6.308
High         7.135            10.970           7.795
Very high    7.863            12.196           9.320
Total       43.319            55.870          32.373

3. Land–vegetation relationships: The problem is further confounded by the definition of the term 'vegetation degradation'. It may imply reduction in biomass, decrease in species diversity, or decline in quality in terms of the nutritional value for livestock and wildlife. There is a need to establish distinct criteria for evaluating vegetation degradation. Table 4 shows an example of vegetation degradation in Australia, but in combination with soil erosion; the quantity and quality of vegetation are not considered.

Table 4. Vegetation degradation in pastoral areas of Australia (Woods, 1983; Mabbutt, 1992).

Type                                                    Area ('000 km²)
Total                                                   3.4
Undegraded                                              1.5
Degraded                                                1.9
  i. Vegetation degradation with little erosion         1.0
  ii. Vegetation degradation and some erosion           0.5
  iii. Vegetation degradation and substantial erosion   0.3
  iv. Vegetation degradation and severe erosion         0.1
  v. Dryland salinity                                   0.001

4. Land degradation processes: Different processes of land degradation also confound the available statistics on soil and/or land degradation. Principal processes of land degradation include erosion by water and wind, chemical degradation (comprising acidification, salinization, leaching, etc.) and physical degradation (comprising crusting, compaction, hard-setting, etc.). Some lands or landscape units are affected by more than one process, e.g. water and wind erosion, salinization, and crusting or compaction. Unless a clear distinction is made, there is a considerable chance of overlap and double accounting. Table 5 shows an example of overlapping degradative processes with increasing risks of double accounting.

Table 5. Land degradation on cropland in Australia (Woods, 1983; Mabbutt, 1992).

Type                                      Area ('000 km²)
Total                                     443
Undegraded                                142
Degraded                                  301
  i. Water erosion                        206
  ii. Wind erosion                         52
  iii. Combined water and wind erosion     42
  iv. Salinity and water erosion            0.9
  v. Others                                 0.5

5. Methods of assessment of land degradation: Global assessment of land degradation is not an easy task, and a wide range of methods is used (Lal et al., 1997). Therefore, data generated by different methods are not comparable.
Further, most statistics refer to the risks of degradation or desertification (based on climatic factors and land use) rather than the actual (present) state of the land. Table 3 shows vulnerability to desertification and erosion; these estimates are much higher than those of Dregne and Chou (Table 1) and Oldeman (Table 2), and are suggestive of the risks of land degradation. The actual degradation may not occur, because of judicious land use and advances in land management technologies. Comparison of these different estimates shows wide variations arising from the different methods and criteria used, and highlights the importance of developing uniform criteria and standardizing methods of assessment of land degradation.

6. Land degradation and productivity: A major shortcoming of the available statistics on land degradation is the lack of a cause–effect relationship between severity of degradation and productivity. Criteria for designating different classes of land degradation (e.g. low, moderate, high) are generally based on land properties rather than on their impact on productivity. In fact, assessing the productivity effects of land degradation is a challenging task (Lal, 1998). Difficulties in obtaining global estimates of the impact of land degradation on productivity have in turn created problems and raised skepticism.

Table 6, from the International Board for Soil Research and Management (IBSRAM), indicates the problems involved in relating land degradation by erosion to crop yield. The data from China show that despite significant differences in cumulative soil loss and water runoff, there were no differences in corn yield. Similar inferences can be drawn with regard to the impact of cumulative soil erosion on the yield of rice in Thailand. Whereas soil loss ranged from 330 to 1,478 t per ha, the corresponding yield of rice ranged from 4.0 to 5.3 t per ha. The lowest yield was obtained from treatments causing the least soil loss. Crop yield is an integrative effect of numerous variables. In addition, the effects of erosional (and other degradative) processes on crop yield or biomass potential depend on changes in land quality with respect to specific parameters. Table 7 shows that the yield of sisal was correlated with pH, CEC, and Al saturation, but not with soil organic C and N contents. Assessing the productivity effects of land degradation requires a thorough understanding of the processes involved in the soil–plant–atmosphere continuum. These processes are influenced strongly by land use and management.

Table 6. Cumulative soil loss and runoff in relation to crop yield in three ASIALAND Sloping Lands Network countries (Sajjapongse, 1998).

Country       Treatment                     Period    Crop   Cumulative runoff (mm)   Cumulative soil loss (Mg/ha)   Yield (Mg/ha)
China         Control†                      1992–95   Corn     122                      762                          15.3
China         Alley cropping                1992–95   Corn      59                      602                          15.9
Philippines   Control                       1990–94   Corn     341                      801                           5.6
Philippines   Alley cropping (low input)    1990–94   Corn      26                       43                          14.3
Philippines   Alley cropping (high input)   1990–94   Corn      15                       31                          18.7
Thailand      Control                       1989–95   Rice   1,478                    1,392                           4.5
Thailand      Hillside ditch                1989–95   Rice     134                      446                           4.8
Thailand      Alley cropping                1989–95   Rice     330                      538                           4.0
Thailand      Agroforestry                  1989–95   Rice     850                      872                           5.3

† Control = farmer's practice.

Table 7. Relationship between yield of sisal and soil fertility (0–20 cm depth) decline in the Tanga region of Tanzania (Hartemink, 1995).
Property                               Sisal yield (Mg ha-1)
                                       2.3      1.8      1.5
pH (1:2.5 in H2O)                      6.50     5.40     5.00
Soil organic carbon (%)                1.60     1.90     1.50
Total soil nitrogen (%)                0.11     0.16     0.12
Cation exchange capacity (cmol kg-1)   9.30     7.00     5.00
Al saturation (% ECEC)                 0        20.00    50.00

Issues and challenges

There are sufficient studies and reviews (e.g. Barrow, 1991; Blaikie and Brookfield, 1987; Johnson and Lewis, 1995) to demonstrate clearly that land degradation affects all facets of life. The issues confronting those working in the domain of land resources include technologies to reduce degradation as well as techniques to assess and monitor it. A number of questions remain unanswered, including:

- Is land degradation inevitable?
- Are there adequate early warning indicators of land degradation?
- The absence of land tenure, and the resulting lack of stewardship, detracts from adequate care for the land in some countries; how can this be resolved?
- Declining soil quality resulting largely from human-induced degradation triggers social unrest; what is the societal responsibility of soil scientists, and how can they better participate in developing public policy?
- Local actions have global impact; what are the areas for international collaboration?
- Who pays, and who wins, in the economics of land degradation?
- Degradation results in a loss of intergenerational equity and of the value of bequests; how do we quantify this and create awareness?
- Is there a link between land degradation and the health of humans and animals?
- Declining soil quality leads to diminishing economic growth in countries where wealth is largely agrarian; how can the rates of resource consumption be quantified?
- Land degradation often destroys or reduces the natural beauty of landscapes; how might the aesthetic value of land be quantified?
- How can greater awareness of the perils of land degradation be created in society and among political leadership?

There are three steps in addressing the problem: assessment, monitoring, and application of mitigating technologies. All three are in the purview of agriculturists and, specifically, soil scientists. The latter clearly have the lead responsibility, and over the past decade substantial progress has been made in communicating the dangers of land degradation; much, however, remains to be done. Soil science has made significant contributions to soil resource assessment, but its practitioners have shown little interest in the additional task of monitoring the resource base (Mermut and Eswaran, 1997). Monitoring remains a new area of investigation requiring guidelines, standards, and procedures, and the challenge is to adopt an internationally acceptable procedure for the task. Soil scientists have an obligation not only to show the spatial distribution of stressed systems but also to provide reasonable estimates of their rates of degradation. They should develop early warning indicators of degradation so that they can collaborate with others, such as social scientists, to develop and implement mitigating technologies. Soil scientists also have a role in assisting national decision-makers to develop appropriate land use policies. There are many, often confounding, reasons why land users permit their land to degrade; most are related to societal perceptions of land and the values placed on it.
Degradation is also a slow, often imperceptible process, and many people are therefore unaware that their land is degrading. Creating awareness and building a sense of stewardship are important steps in the challenge of reducing degradation; appropriate technology is consequently only a partial answer. The main solution lies in the behaviour of the farmer, who is subject to the economic and social pressures of the community and country in which he or she lives. Food security, environmental balance, and land degradation are strongly interlinked, and each must be addressed in the context of the others to have measurable impact. This is the challenge of the 21st century, for which we must be prepared.

Desertification

Desertification is a form of land degradation occurring particularly, but not exclusively, in semi-arid areas. Figure 1 indicates the areas of the world vulnerable to desertification. The semi-arid to weakly arid areas of Africa are particularly vulnerable, as they have fragile soils, localized high population densities, and a generally low-input form of agriculture. About 33% of the global land surface (42 million km2) is subject to desertification. Table 8 shows the vulnerability of land to desertification in some Asian countries: 25% of the region is affected, and if this is not addressed the quality of life of large sections of the population will suffer. Many of these countries cannot afford losses in agricultural productivity. There are no good estimates of the number of persons affected by desertification, nor of the number who directly or indirectly contribute to the process; a recent study by Reich et al. (this volume) provides some estimates for Africa.

Table 8. Estimates of vulnerability to desertification in some Asian countries. For each vulnerability class, the affected area (km2) is followed by its share of the country's land area (%).

Country            Total land area (km2)   Low                 Moderate            High               Very high
Afghanistan        647,500                 436,480 (67.41)     39,088 (6.04)       43,838 (6.77)      2,954 (0.46)
Bangladesh         133,910                 85,163 (63.60)      0 (0.00)            0 (0.00)           0 (0.00)
Bhutan             47,000                  1,407 (2.99)        0 (0.00)            0 (0.00)           0 (0.00)
Brunei             6,627                   0 (0.00)            0 (0.00)            0 (0.00)           0 (0.00)
China              9,326,410               262,410 (2.81)      239,107 (2.56)      65,638 (0.70)      72,214 (0.77)
India              2,973,190               1,277,328 (42.96)   744,148 (25.03)     206,317 (6.94)     165,912 (5.58)
Indonesia          1,826,440               29,596 (1.62)       46,290 (2.53)       232 (0.01)         5,289 (0.29)
Japan              374,744                 0 (0.00)            0 (0.00)            0 (0.00)           693 (0.18)
Cambodia           176,520                 45,731 (25.91)      118,155 (66.94)     0 (0.00)           0 (0.00)
Laos               230,800                 48,963 (21.21)      35,386 (15.33)      0 (0.00)           0 (0.00)
Malaysia           328,550                 0 (0.00)            0 (0.00)            0 (0.00)           0 (0.00)
Mongolia           1,565,000               26,345 (1.68)       40,511 (2.59)       19 (0.00)          2,104 (0.13)
Myanmar            657,740                 13,477 (2.05)       130,903 (19.90)     140,387 (21.34)    20,630 (3.14)
Nepal              136,800                 8,698 (6.36)        20,131 (14.72)      0 (0.00)           228 (0.17)
North Korea        120,410                 0 (0.00)            0 (0.00)            0 (0.00)           0 (0.00)
Pakistan           778,720                 31,474 (4.04)       39,605 (5.09)       17,032 (2.19)      0 (0.00)
Papua New Guinea   452,860                 4,892 (1.08)        8,175 (1.81)        27 (0.01)          0 (0.00)
Philippines        298,170                 20,952 (7.03)       16,621 (5.57)       1,708 (0.57)       0 (0.00)
Singapore          638                     0 (0.00)            0 (0.00)            0 (0.00)           0 (0.00)
South Korea        98,190                  0 (0.00)            0 (0.00)            0 (0.00)           0 (0.00)
Sri Lanka          64,740                  6,337 (9.79)        24,393 (37.68)      3,421 (5.28)       0 (0.00)
Taiwan             32,260                  2,902 (9.00)        201 (0.62)          277 (0.86)         0 (0.00)
Thailand           511,770                 90,241 (17.63)      320,581 (62.64)     7,265 (1.42)       0 (0.00)
Vietnam            325,360                 47,516 (14.60)      59,238 (18.21)      375 (0.12)         0 (0.00)
Total              21,115,069              2,135,245 (10.11)   1,880,584 (8.91)    371,836 (1.76)     872,843 (4.13)

As shown in Table 9, a high population density in an area that is highly vulnerable to desertification poses a very high risk of further land degradation; conversely, a low population density in an area of low vulnerability poses, in principle, a low risk. The Mediterranean countries of North Africa are very highly prone to desertification. In Morocco, for example, erosion is so extensive that the petrocalcic horizon of some Palexeralfs is exposed at the surface. In the Sahel there are pockets of very high-risk areas, and the West African countries, with their dense populations, face a major problem in containing the processes of land degradation. Table 10 gives the area in each of the classes of Table 9. About 2.5 million km2 of land are at low risk, 3.6 million km2 at moderate risk, 4.6 million km2 at high risk, and 2.9 million km2 at very high risk. The region with the highest propensity lies along the desert margins and occupies about 5% of the landmass; an estimated 22 million people (2.9% of the total population) live in this area. The low, moderate, and high vulnerability classes occupy 14, 16, and 11% respectively and together affect about 485 million people. Cumulatively, desertification affects about 500 million Africans, and although they have relatively good soil resources (Eswaran et al., 1997b,c), their productivity will be seriously undermined by land degradation and desertification.

Table 9. Matrix for risk assessment of human-induced desertification: 1 = low risk; 2, 3 = moderate risk; 4, 5, 6 = high risk; 7, 8, 9 = very high risk (after Reich et al., 1999).

                        Population density (persons km-2)
Vulnerability class     < 10   11–40   > 40
Low                     1      3       6
Moderate                2      5       8
High/Very high          4      7       9

Table 10. Land area ('000 km2) of Africa in risk classes (after Reich et al., 1999).

                        Population density (persons km-2)
Vulnerability class     < 10    11–40   > 40
Low                     2,476   1,005   750
Moderate                2,608   1,180   976
High/very high          2,643   1,074   825

A New Agenda

Although soils are an ecologically important component of the environment, the availability of research and development funds does not match their significance in terms of the cost to society if soils become degraded. The perception that enough is already known about soils, and that generalizations can therefore be made for all soils, is incorrect. There is also a failure to recognize that agriculture is one of the major 'stressors' of the environment (Virmani et al., 1994), particularly from a soil degradation point of view (Beinroth et al., 1994). If these attitudes prevail, major catastrophes become more probable in the future.
The purpose of such discussions is to emphasize that soil is virtually a non-renewable resource. It is a basic philosophy that society has an obligation to protect soil, conserve it, and even enhance its quality for future generations. The role of society in sustaining agriculture can be demonstrated, and conversely the role of soil in sustaining society. The paradigms that brought some countries of the world to agricultural affluence must be evaluated, as must the policies and practices that have contributed to land degradation and declining productivity in other countries. We need to examine current concerns and the urgent need for new paradigms for managing soil resources, as proposed for example by Sanchez (1994), that will carry us through the next few decades. Some arguments that were valid in the past have little validity in the face of contemporary environmental degradation. New concepts must be defined, research gaps identified, and the needs that will enable our institutions to meet the challenges of the next 20 years indicated.

Land degradation results from mismanagement of land and thus involves two interlocking, complex systems: the natural ecosystem and the human social system. Interactions between the two determine the success or failure of resource management programs. To avert the catastrophe that land degradation threatens in many parts of the world, the following concepts from Eswaran and Dumanski (1994) are relevant:

- Environment and agriculture are intrinsically linked, and research and development must address both.
- Land degradation is as much a socioeconomic problem as it is a biophysical one.
- Land degradation and economic growth, or the lack of it (poverty), are inextricably linked: people in the lower part of the poverty spiral are in a weak position to provide the stewardship necessary to sustain the resource base, and as a consequence they move further down the spiral; a vicious cycle is thus set in motion.
- Mitigation research to manage land degradation can only succeed if land users have control over, and commitment to, maintaining the quality of the resources.
- The focus of agricultural research should shift from increasing productivity to enhancing sustainability, recognizing that land degradation caused by agriculture can be minimized and made compatible with the environment.
- Land use must match land quality, and appropriate national policies should ensure that it does, so as to reduce land degradation; a framework for evaluating sustainable land management (Dumanski et al., 1992) is a powerful tool for assessing such discrepancies and assuring sustainability.

The thrust of a new agenda for resource assessment and monitoring with respect to land degradation (including desertification) has several components. It must be stressed that any research and development activity should be set in the larger context of the ecosystem, as addressed by Sanchez (1994) and Greenland et al. (1994). The components of a national strategy to address land degradation (and desertification) comprise:

- Studies on long-term water needs (quality and quantity).
- A network of monitoring sites to detect changes in natural resource conditions.
- Working with farmers by understanding and incorporating indigenous knowledge.
- Including land degradation aspects in research on cropping and farming systems and on soil and water management.
- Convincing decision-makers that climate change, desertification, quality of life, and sustainability are all interlinked, so that addressing one helps the others.
- Initiating research on a new paradigm that is holistic and focuses on these issues.

A 'scale-sensitive information system', or database on land degradation, must be made available to the vertical network of decision-makers to enable them to make effective policies concerning the use and management of resources. Decision-makers at all levels of society should be able to participate in the design and implementation of any tool that affects social, economic, and ecological well-being; this ensures successful implementation of the program.

Figure 1. Desertification vulnerability.

Conclusions

Agenda 21 (UNCED, 1992, Chapter 12) emphasizes land degradation through desertification, and the international community, particularly through UN organizations, has launched several activities to address it. Other aspects of land degradation receive only casual mention in Agenda 21 and are briefly considered in Chapter 10 under the general heading "Integrated Approach to the Planning and Management of Land Resources". In this sense the problem of land degradation has been diluted and has not received the global attention it deserves. Although the stated objective in Agenda 21 is "to strengthen regional and global systematic observation networks linked to the development of national systems for the observation of land degradation and desertification caused both by climate fluctuations and by human impact, and to identify priority areas for action", we believe the soil science community has yet to be mobilized to develop a proactive programme in this area.

Land degradation remains a serious global threat, but the science concerning it contains both myths and facts, and the debate is perpetuated by confusion, misunderstanding, and misinterpretation of the available information. The important challenges are:

- To mobilize the scientific community to mount an integrated programme of methods, standards, data collection, and research networks for the assessment and monitoring of soil and land degradation.
- To develop land use models that incorporate both the natural and the human-induced factors contributing to land degradation and that can be used for land use planning and management.
- To develop information systems that link environmental monitoring, accounting, and impact assessment to land degradation.
- To help develop policies that encourage sustainable land use and management, and to promote greater use of land resource information for sustainable agriculture.
- To develop economic instruments for the assessment of land degradation and for encouraging the sustainable use of land resources.
- To rationalize the wide range of terminology and definitions that carry different meanings in the different disciplines associated with land degradation.
- To standardize methods of assessing the extent of land degradation.
- To develop uniform criteria for assessing the severity of land degradation.
- To overcome the difficulty of evaluating the on-farm economic impact of land degradation on productivity.

There is an urgent need to address these issues through a multi-disciplinary approach, but the most urgent need is to develop an objective, quantifiable, and precise concept of land degradation based on scientific principles.

Q.9: Estimation and Testing of Hypothesis? (From the last lecture of the last class; take the notes from a friend. We attempt an answer here.) Answer:
3.2 Estimation and Hypothesis Testing

The logistic regression model just developed is a generalized linear model with binomial errors and link logit. We can therefore rely on the general theory developed in Appendix B to obtain estimates of the parameters and to test hypotheses. In this section we summarize the most important results needed in the applications.

3.2.1 Maximum Likelihood Estimation

Although you will probably use a statistical package to compute the estimates, here is a brief description of the underlying procedure. The likelihood function for n independent binomial observations is a product of densities given by Equation 3.3. Taking logs we find that, except for a constant involving the combinatorial terms, the log-likelihood function is

\log L(\beta) = \sum_i \{ y_i \log(\pi_i) + (n_i - y_i) \log(1 - \pi_i) \},

where \pi_i depends on the covariates x_i and a vector of p parameters \beta through the logit transformation of Equation 3.9.

At this point we could take first and expected second derivatives to obtain the score and information matrix and develop a Fisher scoring procedure for maximizing the log-likelihood. As shown in Appendix B, the procedure is equivalent to iteratively re-weighted least squares (IRLS). Given a current estimate \hat\beta of the parameters, we calculate the linear predictor \hat\eta_i = x_i' \hat\beta and the fitted values \hat\mu_i = n_i \, \mathrm{logit}^{-1}(\hat\eta_i). With these values we calculate the working dependent variable z, which has elements

z_i = \hat\eta_i + (y_i - \hat\mu_i) \frac{n_i}{\hat\mu_i (n_i - \hat\mu_i)},

where the n_i are the binomial denominators. We then regress z on the covariates, calculating the weighted least squares estimate

\hat\beta = (X'WX)^{-1} X'Wz,

where W is a diagonal matrix of weights with entries

w_{ii} = \hat\mu_i (n_i - \hat\mu_i) / n_i.

(You may be interested to know that the weight is inversely proportional to the estimated variance of the working dependent variable.) The resulting estimate of \beta is used to obtain improved fitted values, and the procedure is iterated to convergence.

Suitable initial values can be obtained by applying the link to the data. To avoid problems with counts of 0 or n_i (which is always the case with individual zero-one data), we calculate empirical logits, adding 1/2 to both the numerator and denominator, i.e. we calculate

z_i = \log \frac{y_i + 1/2}{n_i - y_i + 1/2},

and then regress this quantity on x_i to obtain an initial estimate of \beta.

The resulting estimate is consistent and its large-sample variance is given by

\mathrm{var}(\hat\beta) = (X'WX)^{-1},   (3.12)

where W is the matrix of weights evaluated in the last iteration.

Alternatives to maximum likelihood estimation include weighted least squares, which can be used with grouped data, and a method that minimizes Pearson's chi-squared statistic, which can be used with both grouped and individual data. We will not consider these alternatives further.

3.2.2 Goodness of Fit Statistics

Suppose we have just fitted a model and want to assess how well it fits the data. A measure of discrepancy between observed and fitted values is the deviance statistic, given by

D = 2 \sum_i \left\{ y_i \log \frac{y_i}{\hat\mu_i} + (n_i - y_i) \log \frac{n_i - y_i}{n_i - \hat\mu_i} \right\},   (3.13)

where y_i is the observed and \hat\mu_i the fitted value for the i-th observation. Note that this statistic is twice a sum of 'observed times log of observed over expected', where the sum is over both successes and failures (i.e. we compare both y_i and n_i - y_i with their expected values). In a perfect fit the ratio of observed to expected is one and its logarithm is zero, so the deviance is zero.
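To make the IRLS procedure and the deviance concrete, here is a minimal NumPy sketch. The helper names (irls_logit, deviance) and the packaging are ours, not the text's; in practice a statistical package would be used, and this is only an illustration of the steps described above.

```python
import numpy as np
from scipy.special import xlogy  # xlogy(0, .) = 0, which handles empty cells

def irls_logit(X, y, n, tol=1e-8, max_iter=50):
    """Fit a binomial logit model by IRLS as described in Section 3.2.1.
    X: model matrix (include a column of ones for the constant),
    y: successes, n: binomial denominators."""
    # Initial values: regress empirical logits on the covariates,
    # adding 1/2 to numerator and denominator to avoid log(0).
    z = np.log((y + 0.5) / (n - y + 0.5))
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    for _ in range(max_iter):
        eta = X @ beta                             # linear predictor
        mu = n / (1.0 + np.exp(-eta))              # fitted values n * logit^{-1}(eta)
        z = eta + (y - mu) * n / (mu * (n - mu))   # working dependent variable
        w = mu * (n - mu) / n                      # diagonal entries of W
        XtW = X.T * w                              # X'W, since W is diagonal
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    mu = n / (1.0 + np.exp(-(X @ beta)))
    var_beta = np.linalg.inv((X.T * (mu * (n - mu) / n)) @ X)  # Equation 3.12
    return beta, var_beta, mu

def deviance(y, n, mu):
    """Deviance of Equation 3.13, summing over successes and failures."""
    return 2.0 * np.sum(xlogy(y, y / mu) + xlogy(n - y, (n - y) / (n - mu)))
```

The same code covers individual zero-one data (all n_i = 1), although, as noted next, the resulting deviance cannot then be used as a goodness-of-fit test.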
In Appendix B we show that this statistic may be constructed as a likelihood ratio test that compares the model of interest with a saturated model having one parameter for each observation.

With grouped data, the distribution of the deviance statistic, as the group sizes n_i \to \infty for all i, converges to a chi-squared distribution with n - p degrees of freedom, where n is the number of groups and p is the number of parameters in the model, including the constant. Thus, for reasonably large groups, the deviance provides a goodness-of-fit test for the model. With individual data the distribution of the deviance does not converge to a chi-squared (or any other known) distribution and cannot be used as a goodness-of-fit test. We will, however, consider other diagnostic tools that can be used with individual data.

An alternative measure of goodness of fit is Pearson's chi-squared statistic, which for binomial data can be written as

\chi_P^2 = \sum_i \frac{(y_i - \hat\mu_i)^2}{\hat\mu_i (n_i - \hat\mu_i) / n_i}.   (3.14)

Note that each term in the sum is the squared difference between the observed and fitted values y_i and \hat\mu_i, divided by the variance of y_i, which is \mu_i (n_i - \mu_i)/n_i, estimated using \hat\mu_i for \mu_i. This statistic can also be derived as a sum of 'observed minus expected, squared, over expected', where the sum is over both successes and failures.

With grouped data, Pearson's statistic has approximately, in large samples, a chi-squared distribution with n - p d.f., and is asymptotically equivalent to the deviance or likelihood-ratio chi-squared statistic. The statistic cannot be used as a goodness-of-fit test with individual data, but it provides a basis for calculating residuals, as we shall see when we discuss logistic regression diagnostics.

3.2.3 Tests of Hypotheses

Let us consider the problem of testing hypotheses in logit models. As usual, we can calculate Wald tests based on the large-sample distribution of the maximum likelihood estimator, which is approximately normal with mean \beta and variance-covariance matrix as given in Equation 3.12. In particular, we can test the hypothesis H_0: \beta_j = 0 concerning the significance of a single coefficient by calculating the ratio of the estimate to its standard error,

z = \frac{\hat\beta_j}{\sqrt{\widehat{\mathrm{var}}(\hat\beta_j)}}.

This statistic has approximately a standard normal distribution in large samples. Alternatively, we can treat the square of this statistic as approximately chi-squared with one d.f.

The Wald test can be used to calculate a confidence interval for \beta_j: we can assert with 100(1 - \alpha)% confidence that the true parameter lies in the interval with boundaries

\hat\beta_j \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{var}}(\hat\beta_j)},

where z_{1-\alpha/2} is the normal critical value for a two-sided test of size \alpha. Confidence intervals for effects in the logit scale can be translated into confidence intervals for odds ratios by exponentiating the boundaries. The Wald test can also be applied to test hypotheses concerning several coefficients by calculating the usual quadratic form. This test can be inverted to obtain confidence regions for vector-valued parameters, but we will not consider this extension.
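As a concrete, hedged illustration (the function name and packaging are ours), the Wald statistic and confidence interval for a single coefficient follow directly from the output of the IRLS sketch above:

```python
import numpy as np
from scipy import stats

def wald_test(beta, var_beta, j, alpha=0.05):
    """Wald z test of H0: beta_j = 0 and a 100(1 - alpha)% confidence
    interval, using var(beta) = (X'WX)^{-1} from Equation 3.12."""
    se = np.sqrt(var_beta[j, j])                  # standard error of beta_j
    z = beta[j] / se                              # approximately N(0, 1) under H0
    p_value = 2.0 * stats.norm.sf(abs(z))         # two-sided p-value
    zc = stats.norm.ppf(1.0 - alpha / 2.0)        # normal critical value
    ci_logit = (beta[j] - zc * se, beta[j] + zc * se)
    ci_odds = tuple(np.exp(b) for b in ci_logit)  # odds-ratio interval
    return z, p_value, ci_logit, ci_odds
```

Squaring z gives the equivalent chi-squared form with one degree of freedom.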
For more general problems we consider the likelihood ratio test. The key to constructing these tests is the deviance statistic introduced in the previous subsection: in a nutshell, the likelihood ratio test comparing two nested models is based on the difference between their deviances. To fix ideas, consider partitioning the model matrix and the vector of coefficients into two components,

X = (X_1, X_2) and \beta = (\beta_1', \beta_2')',

with p_1 and p_2 elements, respectively. Consider testing the hypothesis

H_0: \beta_2 = 0,

that the variables in X_2 have no effect on the response, i.e. the joint significance of the coefficients in \beta_2. Let D(X_1) denote the deviance of a model that includes only the variables in X_1, and let D(X_1 + X_2) denote the deviance of a model that includes all the variables in X. Then the difference

\chi^2 = D(X_1) - D(X_1 + X_2)

has approximately, in large samples, a chi-squared distribution with p_2 d.f. Note that p_2 is the difference in the number of parameters between the two models being compared.

The deviance plays a role similar to the residual sum of squares; in fact, in Appendix B we show that for models with normally distributed data the deviance is the residual sum of squares. Likelihood ratio tests in generalized linear models are based on scaled deviances, obtained by dividing the deviance by a scale factor. In linear models the scale factor was \sigma^2, and we had to divide the RSS's (or their difference) by an estimate of \sigma^2 in order to calculate the test criterion. With binomial data the scale factor is one, and there is no need to scale the deviances.

The Pearson chi-squared statistic of the previous subsection, while providing an alternative goodness-of-fit test for grouped data, cannot in general be used to compare nested models. In particular, differences in deviances have chi-squared distributions, but differences in Pearson chi-squared statistics do not. This is the main reason why, in statistical modelling, we use the deviance or likelihood-ratio chi-squared statistic rather than the more traditional Pearson chi-squared of elementary statistics.
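Continuing the sketch (again with the illustrative irls_logit and deviance helpers defined above, which are our names rather than the text's), the likelihood ratio test of H_0: \beta_2 = 0 is just the difference of two deviances referred to a chi-squared distribution with p_2 degrees of freedom:

```python
import numpy as np
from scipy import stats

def lr_test(X1, X2, y, n):
    """Likelihood ratio test of H0: beta2 = 0 for nested binomial logit models."""
    _, _, mu_reduced = irls_logit(X1, y, n)        # fit with X1 only
    X_full = np.hstack([X1, X2])
    _, _, mu_full = irls_logit(X_full, y, n)       # fit with X1 and X2
    chi2 = deviance(y, n, mu_reduced) - deviance(y, n, mu_full)  # D(X1) - D(X1+X2)
    df = X2.shape[1]                               # p2 extra parameters
    p_value = stats.chi2.sf(chi2, df)
    return chi2, df, p_value
```

With binomial data the scale factor is one, so the deviances enter this comparison unscaled, exactly as described above.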
Estimation and Hypothesis Testing

Estimation vs Testing

Often the questions of interest are ones of estimation:
- What is the mean fuel economy of this engine?
- By how much does this new drug delay relapse?
- How old is the species Homo sapiens?

Hypothesis testing is concerned with questions such as:
- Does this sample come from a population with mean 0?
- Are these two samples from the same population?
- Does the new drug delay relapse?

There is a null hypothesis H_0 and usually some specification of the alternative hypothesis H_1.

Parameters and Statistics

Both estimation and testing are concerned with a parameter \theta, which should (if possible) be a meaningful quantity. We assume we have a sample x = (x_1, x_2, \ldots, x_n) from a random variable X whose distribution depends on \theta \in \Theta. A statistic t = t(x) is any number calculated from the sample. Since the sample is a random observation of X_1, X_2, \ldots, X_n, we can regard t as an observation of the random variable T = t(X). The distribution of T is called the sampling distribution.

Estimation

A statistic T is an estimator of \theta if its intention is to be close to the (unknown) value of \theta. To perform statistical inference for an estimator T of \theta we will often need to derive its distribution. Suppose the population has mean \mu and variance \sigma^2. Then we can often use

\bar X_n \sim N(\mu, \sigma^2 / n).

This holds exactly if the population is normal, and approximately otherwise (by the Central Limit Theorem).

Normal Theory

If X_1, X_2, \ldots, X_n are iid N(\mu, \sigma^2), then

\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i and S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2

are independent random variables. In addition,

\bar X_n \sim N(\mu, \sigma^2 / n) and \frac{(n-1) S^2}{\sigma^2} \sim \chi^2(n-1).

Hence, in the normal case, we know the sampling distributions of \bar X_n and S^2 if \sigma^2 is known.

Student's t Distribution

From the first result,

\frac{\bar X - \mu}{\sigma / \sqrt{n}} \sim N(0, 1).

However, in general \sigma^2 will not be known, and we cannot simply replace \sigma by its estimate S, as this would understate the variability of the left-hand side. It can be shown that

\frac{\bar X - \mu}{S / \sqrt{n}} \sim t_{n-1}.

The density function of a t distribution is similar to that of a normal distribution, but with fatter tails.
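A small numerical illustration (the data values are invented for demonstration) shows the practical consequence of the fatter tails: with \sigma unknown we use S and the t critical value, which exceeds the normal one, so the resulting confidence interval is slightly wider.

```python
import numpy as np
from scipy import stats

x = np.array([4.9, 5.3, 5.1, 4.7, 5.0, 5.4, 4.8, 5.2])  # invented sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)  # S^2 uses the n - 1 divisor

t_crit = stats.t.ppf(0.975, df=n - 1)  # about 2.365 with 7 d.f.
z_crit = stats.norm.ppf(0.975)         # about 1.960; t_crit > z_crit

half = t_crit * s / np.sqrt(n)         # half-width of the 95% interval
print(f"95% CI for the mean: ({xbar - half:.3f}, {xbar + half:.3f})")
```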