CA Foundation Statistics (Institute of Chartered Accountants of India)

STATISTICS NOTES, HARIS SIR

CONCISE NOTES – STATISTICS – CA FOUNDATION

Statistical Description of Data

• The word "statistics" comes from the Latin word status, the Italian word statista, the German word statistik and the French word statistique.
• Statistics can be defined in a singular and a plural sense. In the plural sense it means the data collected, both qualitative and quantitative. In the singular sense it refers to the methods applied to these data.
• There are 2 types of errors in statistics.
• Statistics is applied in economics, business management and commerce.
• A variable is a quantitative characteristic. It is discrete if it takes a finite or countably infinite number of isolated values, and continuous if it can assume any value in a given interval.
• Data can be primary or secondary. Data are primary if they are collected directly from the source for the first time. Data are secondary if they have been obtained from a source other than the primary source.
• Collection of primary data: interview method, mailed questionnaire method, observation method, and questionnaires filled in by enumerators.
• Interview method: personal interview, indirect interview and telephone interview. The telephone interview is a quick and inexpensive method.
• Sources of secondary data: international sources, government sources, private and quasi-government sources, and unpublished sources.
• Collected data should first be scrutinised to check that they are correct and accurate.
• Classification is the process of arranging data. It makes the data more relevant, precise and condensed, makes them comparable, and serves as a base for analysis.
• Data classified in respect of an attribute are referred to as qualitative data.
• Data can be frequency data or non-frequency data. A time series is an example of non-frequency data.
• Internal consistency of the data can be checked when a number of related series are given.

Presentation of data

• Textual presentation: presenting data in paragraphs. It is simple, but it is non-statistical and not preferred.
• Tabular presentation: presentation in tables. A table has 4 components: caption (upper part, showing column and sub-column headings), box head (the entire upper part), stub (left part, giving the description of the rows), and body (main part, containing the numbers). Footnotes are given at the bottom. In general, there are 2 types of tabulation.
• Diagrammatic presentation:
o Line diagram or historigram, e.g. a graph: shows the relationship between two variables. Ogives are also line diagrams.
o Bar diagram: horizontal for qualitative data, vertical for quantitative data.
o Divided bar charts or percentage bar diagrams: for comparing and relating different components of a variable.
o Pie chart. Circular diagrams are two-dimensional.

Frequency distribution

• A frequency distribution is a tabular representation of statistical data. When it is made for a discrete series it is a discrete frequency distribution; when it relates to continuous data it is called a grouped frequency distribution.
• Mutually exclusive classification (e.g. 10-20, 20-30) is used for continuous series. Mutually inclusive classification (e.g. 10-19, 20-29) is used for discrete series.
• Relative frequency lies between 0 and 1.
• Frequency density of a class interval = class frequency / class length.
• Class limits are the lower and upper limits of a class interval, i.e. the smallest and largest values that can be included in it.
• Class boundaries are the actual (true) class limits of a class interval.

Graphical Representation of frequency distribution

• Histogram: helps in comparing frequencies and in locating the mode. It is an area diagram, and its classes are overlapping (drawn on class boundaries). Here the width of all classes is equal.
• Frequency polygon: meant for a single frequency distribution in which all classes have equal width.
• Ogives or cumulative frequency graphs: drawn for cumulative frequency distributions and used for finding quartiles, the median, etc.
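As a rough illustration of relative frequency, frequency density and the less-than cumulative frequencies that an ogive plots, the sketch below tabulates them for a small grouped distribution. The class boundaries, the frequencies and the function name `describe_classes` are made up for illustration and are not taken from the notes.

```python
def describe_classes(boundaries, frequencies):
    """Tabulate relative frequency, frequency density and less-than
    cumulative frequency for each class of a grouped distribution."""
    total = sum(frequencies)
    rows = []
    cumulative = 0
    for (lower, upper), f in zip(boundaries, frequencies):
        cumulative += f
        rows.append({
            "class": (lower, upper),
            "frequency": f,
            "relative_frequency": f / total,          # always between 0 and 1
            "frequency_density": f / (upper - lower), # frequency / class length
            "less_than_cf": cumulative,               # points of a less-than ogive
        })
    return rows


# Hypothetical data: class boundaries and class frequencies.
boundaries = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [5, 15, 20, 10]

table = describe_classes(boundaries, frequencies)
for row in table:
    print(row)
```

Plotting `less_than_cf` against the upper class boundaries would give the less-than ogive.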
• A frequency curve is a smooth curve for which the total area is taken to be 1. It is the limiting form of a frequency polygon or histogram. Types of frequency curve:
1. Bell-shaped: the most commonly used, e.g. profits of a company
2. U-shaped
3. J-shaped
4. Mixed curve

Measure of Central Tendency and Measure of Dispersion

A measure of central tendency shows the clustering of the data around a central value; it represents the series.

Arithmetic Mean

• Sum of all observations divided by the number of observations.
• Individual series: $\bar{x} = \frac{\sum x_i}{n}$
• Frequency distribution: $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
• Grouped frequency distribution: $\bar{x} = A + \frac{\sum f_i d_i}{\sum f_i} \times C$, where A = assumed mean, $d_i = \frac{x_i - A}{C}$, C = class interval (class length) and $x_i$ = mid-point of the class interval.
• If all the observations of the series are equal to k, then the mean is also k.
• The sum of deviations from the mean is always 0: $\sum (x_i - \bar{x}) = 0$ or $\sum f_i (x_i - \bar{x}) = 0$.
• It is affected by change of origin (addition) and change of scale (multiplication): if $y = a + bx$, then $\bar{y} = a + b\bar{x}$.
• Combined mean: $\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
• It is affected by extreme values, is the best average, is based on all observations, cannot be calculated for open-end frequency distributions, and is least affected by sampling fluctuations. It represents the series, is rigidly defined, and is the most commonly used average.

Median

• A positional value that divides the series into two equal parts; it is not affected by extreme values.
• For an ordered individual series, the median is the value of the $\frac{n+1}{2}$-th observation; when the number of observations is even, the median is the simple average of the two middle values.
• For a grouped frequency distribution: $\text{Median} = L + \frac{N/2 - cf}{f} \times C$, where L = lower limit (boundary) of the median class, N = total frequency, cf = less-than cumulative frequency of the class preceding the median class, f = frequency of the median class and C = length of the median class.
• If $y = a + bx$, then $y_{me} = a + b\,x_{me}$.
• The sum of absolute deviations $\sum |x_i - A|$ is minimum when A is the median.
• It is not based on all observations, is the best average for open-end frequency distributions, and is not affected by the presence of extreme values. It is, however, more affected by sampling fluctuations than the arithmetic mean.
• Cumulative frequencies have to be calculated to find it.
• It can also be found graphically using ogives.
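The grouped mean (assumed-mean method) and the grouped median described above can be sketched in a few lines. This is a minimal illustration, assuming classes are given as (lower boundary, upper boundary) pairs of equal length; the data and the function names are hypothetical.

```python
def grouped_mean(boundaries, frequencies, assumed_mean, class_length):
    """Mean by the assumed-mean method: A + (sum(f*d) / N) * C,
    with d = (mid-point - A) / C."""
    n_total = sum(frequencies)
    mid_points = [(lower + upper) / 2 for lower, upper in boundaries]
    d = [(x - assumed_mean) / class_length for x in mid_points]
    sum_fd = sum(f * di for f, di in zip(frequencies, d))
    return assumed_mean + sum_fd / n_total * class_length


def grouped_median(boundaries, frequencies):
    """Median by interpolation: L + ((N/2 - cf) / f) * C, where cf is the
    cumulative frequency of the class before the median class."""
    half = sum(frequencies) / 2
    cf = 0
    for (lower, upper), f in zip(boundaries, frequencies):
        if f > 0 and cf + f >= half:
            return lower + (half - cf) / f * (upper - lower)
        cf += f
    raise ValueError("the distribution has no observations")


# Hypothetical grouped data.
boundaries = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [5, 15, 20, 10]

mean_value = grouped_mean(boundaries, frequencies, assumed_mean=25, class_length=10)
median_value = grouped_median(boundaries, frequencies)
print("mean:", mean_value)
print("median:", median_value)
```

Any convenient mid-point can serve as the assumed mean; the result does not depend on that choice.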
Quartiles, Deciles, Percentiles

• These are also positional averages. Quartiles, deciles and percentiles divide a distribution into 4, 10 and 100 equal parts respectively.
• There are 3 quartiles, 9 deciles and 99 percentiles.
• Q2 = D5 = P50 = Median
• They can also be found from ogives.
• For quartiles: $Q_k = L + \frac{kN/4 - cf}{f} \times C$
• For deciles: $D_k = L + \frac{kN/10 - cf}{f} \times C$
• For percentiles: $P_k = L + \frac{kN/100 - cf}{f} \times C$

Mode

• The value that occurs with the highest frequency in the series, i.e. the value repeated the greatest number of times.
• A series can also be multimodal. The mode is not uniquely defined, and it does not exist if all the observations occur with the same frequency.
• $\text{Mode} = L + \frac{f_0 - f_{-1}}{2f_0 - f_{-1} - f_1} \times C$, where L = lower boundary of the modal class, $f_0$ = frequency of the modal class, $f_{-1}$ and $f_1$ = frequencies of the preceding and succeeding classes, and C = class length.
• Where it is difficult to compute the mode with the above formula, for a moderately skewed distribution it can be calculated from Mean − Mode = 3(Mean − Median), i.e. Mode = 3 Median − 2 Mean.
• If $y = a + bx$, then $y_{mo} = a + b\,x_{mo}$.
• Graphically, it can be found from a histogram.

Geometric Mean

• The n-th root of the product of the observations: $G = (x_1 \times x_2 \times \dots \times x_n)^{1/n}$
• For a grouped frequency distribution: $G = (x_1^{f_1} \times x_2^{f_2} \times \dots \times x_n^{f_n})^{1/N}$, where $N = \sum f_i$.
• The logarithm of G of a set of observations is the AM of the logarithms of the observations, i.e. $\log G = \frac{1}{n} \sum \log x_i$.
• If all the observations of a series are k, then the GM is also k.
• The GM of the product of two variables is the product of their GMs, i.e. if z = xy, then GM of z = (GM of x) × (GM of y).
• The GM of the ratio of two variables is the ratio of their GMs.
• It can be calculated only when all the observations have the same sign and none of them is 0.
• Use: calculation of the growth rate of population.

Harmonic Mean

• It is the reciprocal of the AM of the reciprocals of the observations: $H = \frac{n}{\sum (1/x_i)}$; for a grouped frequency distribution, $H = \frac{N}{\sum (f_i / x_i)}$.
• If all the observations are k, then the HM is also k.
• Combined HM $= \frac{n_1 + n_2}{\frac{n_1}{H_1} + \frac{n_2}{H_2}}$
• Used for averaging prices and for averaging speeds.

Weighted averages

• Weighted AM $= \frac{\sum w_i x_i}{\sum w_i}$
• Weighted GM $= \text{Antilog}\left(\frac{\sum w_i \log x_i}{\sum w_i}\right)$
• Weighted HM $= \frac{\sum w_i}{\sum (w_i / x_i)}$
• Weighted averages are generally used when the observations are not all of equal importance.
• A.M. ≥ G.M. ≥ H.M. (for positive observations); for 2 numbers, AM × HM = GM².
• For finding rates, GM and HM are generally used.
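The relations AM ≥ GM ≥ HM and, for two numbers, AM × HM = GM² can be checked numerically. The sketch below uses Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later); the two observations and the weights are hypothetical.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical pair of positive observations.
values = [4, 9]

am = mean(values)            # arithmetic mean
gm = geometric_mean(values)  # geometric mean (requires positive data)
hm = harmonic_mean(values)   # harmonic mean

print(f"AM = {am:.4f}, GM = {gm:.4f}, HM = {hm:.4f}")
print("AM >= GM >= HM:", am >= gm >= hm)
print("AM * HM:", am * hm, " GM squared:", gm ** 2)

# Weighted AM = sum(w * x) / sum(w), with hypothetical weights.
weights = [3, 1]
weighted_am = sum(w * x for w, x in zip(weights, values)) / sum(weights)
print("weighted AM:", weighted_am)
```

Because of floating-point rounding, AM × HM and GM² should be compared with a small tolerance rather than with exact equality.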
• For a symmetrical distribution, mean = median = mode.

Measure of Dispersion

• Dispersion represents the scatter of the series. A measure of dispersion can be a) absolute or b) relative.
• Absolute (or first-degree) measures of dispersion: a) Range, b) Mean Deviation, c) Standard Deviation and d) Quartile Deviation.
• Relative (or second-degree) measures of dispersion: a) Coefficient of Range, b) Coefficient of MD, c) Coefficient of SD (coefficient of variation) and d) Coefficient of QD.
• Absolute v/s relative: absolute measures are expressed in the units of the data, whereas relative measures are unit free. Absolute measures are not used for comparisons; relative measures are harder to compute.

Range

• Range = L − S, where L is the largest and S the smallest observation.
• Coefficient of Range $= \frac{L - S}{L + S} \times 100$
• It is affected by the presence of extreme values, since it is based on just two observations.
• It is not affected by change of origin but is affected by change of scale: if $y = a + bx$, then $R_y = |b| \times R_x$.

Mean Deviation

• The arithmetic mean of the absolute deviations from the mean or median.
• For ungrouped data: $MD_A = \frac{1}{n} \sum |x_i - A|$
• For grouped frequency: $MD_A = \frac{1}{N} \sum f_i |x_i - A|$
• Coefficient of MD $= \frac{\text{MD about } A}{A} \times 100$, where A is the mean or median.
• MD takes its minimum value when calculated from the median.
• It is not changed by change of origin but is affected by change of scale: for $y = a + bx$, $MD_y = |b| \times MD_x$.

Standard Deviation

• The best measure of dispersion. Rigidly defined.
• For an individual series: $SD = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2}$
• For any frequency distribution: $SD = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{N}} = \sqrt{\frac{\sum f_i x_i^2}{N} - \bar{x}^2}$
• Variance = SD². The coefficient of variation represents the variation in a series: CV = (SD / AM) × 100. A higher CV means more dispersion.
• If all the observations of the series are k, then the SD is 0.
• It remains unchanged by change of origin but changes with change of scale: if $y = a + bx$, then $SD_y = |b| \times SD_x$.
• Combined standard deviation $= \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}}$, where $d_1 = \bar{x}_1 - \bar{x}$, $d_2 = \bar{x}_2 - \bar{x}$ and $\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$ is the combined mean. It is used for finding the dispersion of a group of series.
• For any two numbers, the SD is half of the range.

Quartile Deviation

• Quartile deviation or semi-interquartile range: $Q_d = \frac{Q_3 - Q_1}{2}$
• Coefficient of QD $= \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100$
• It is the best measure for open-end distributions. It is not affected by change of origin but is affected by change of scale.
• It is based on just the central 50% of the observations.

Correlation and Regression

Correlation shows the association or relation between two variables, whereas regression estimates the value of one variable based on the other.

Bivariate data

• Collected for 2 variables at the same time.
• Give rise to marginal and conditional distributions.
• For a p × q bivariate frequency distribution, the number of cells is pq. Some cell frequencies can be 0, but none can be negative.
• For a p × q distribution, the number of marginal distributions is 2 and the number of conditional distributions is p + q.

Correlation (r)

• r lies between −1 and 1. −1 means perfect negative correlation, between −1 and 0 means negative correlation, 0 means no correlation, between 0 and 1 means positive correlation, and +1 means perfect positive correlation.
• Scatter diagram: a graphical way of studying correlation; it can also reveal a non-linear relationship.
• Karl Pearson's product moment correlation coefficient:
o Used only for linear relationships between variables.
o $r = r_{xy} = \frac{Cov(x, y)}{SD_x \times SD_y}$
o $Cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n}$, $SD_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$, $SD_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n}}$
o For a bivariate frequency distribution: $Cov(x, y) = \frac{1}{N} \sum_i \sum_j f_{ij} (x_i - \bar{x})(y_j - \bar{y})$
• The coefficient of correlation is unit free.
• Its magnitude remains unchanged by change of origin or scale, but its sign depends on the signs of the scale factors. If both variables are multiplied by factors of the same sign, r remains the same; if the signs differ, the sign of r changes. See example 12.80 on page 12.18 of the module.
• Spearman's rank correlation: $r_R = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$, where $d_i = x_i - y_i$ and $x_i$, $y_i$ are the ranks given to the x and y series. Ranks are given in descending order: 1 is given to the highest value, and so on.
• In case of repeated (tied) ranks: $r_R = 1 - \frac{6\left[\sum d_i^2 + \sum_j \frac{t_j^3 - t_j}{12}\right]}{n(n^2 - 1)}$, where $t_j$ is the number of repetitions in the j-th tie; the term $\frac{t^3 - t}{12}$ is added once for each group of repeated values.

Coefficient of Concurrent Deviation

• $r_c = \pm \sqrt{\pm \frac{2c - m}{m}}$, where m = n − 1 is the number of pairs of deviations.
• The deviations in the values of x and y are said to be concurrent if both deviations have the same sign.
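Pearson's product moment coefficient and Spearman's rank coefficient, as described above, can be sketched as follows. This is a minimal illustration on hypothetical paired data; the helper names are invented here and the ranking helper assumes there are no tied values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """r = Cov(x, y) / (SD_x * SD_y), using divisor n throughout."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


def ranks_descending(values):
    """Rank 1 for the highest value. Assumes no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_r(xs, ys):
    """r_R = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = difference in ranks."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(ranks_descending(xs), ranks_descending(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r = pearson_r(xs, ys)
rho = spearman_r(xs, ys)

# Change of origin and scale with positive factors leaves r unchanged;
# flipping the sign of one variable flips the sign of r.
u = [(x - 3) / 2 for x in xs]
v = [(y + 10) / 5 for y in ys]
r_shifted = pearson_r(u, v)
r_flipped = pearson_r([-x for x in xs], ys)
print(r, rho, r_shifted, r_flipped)
```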
• In the coefficient of concurrent deviations, c denotes the number of concurrent deviations.

Regression (b)

• The regression equation of y on x is $y = a + b_{yx}\,x$, where a and b are the regression parameters, estimated by the method of least squares.
• The regression coefficient also represents the slope of the regression line.
• (a) $b_{yx} = \frac{Cov(x, y)}{SD_x^2} = r \times \frac{SD_y}{SD_x}$ and $a = \bar{y} - b_{yx}\,\bar{x}$ (where $\bar{y}$ = mean of y and $\bar{x}$ = mean of x). The normal equations for solving a and b are $\sum y_i = na + b \sum x_i$ and $\sum x_i y_i = a \sum x_i + b \sum x_i^2$.
• (b) Direct method: $b_{yx} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$
• For the regression equation of x on y, interchange x and y.
• Regression coefficients change with change of scale but not with change of origin: if $u = \frac{x - a}{p}$ and $v = \frac{y - c}{q}$, then $b_{yx} = \frac{q}{p} \times b_{vu}$.
• The two regression lines intersect at the means $(\bar{x}, \bar{y})$.
• $r = \pm \sqrt{b_{yx} \times b_{xy}}$, taking the common sign of the regression coefficients.
• Coefficient of determination: r² = Explained variance / Total variance.
• Coefficient of non-determination: 1 − r².
• The two regression lines coincide when r = −1 or 1, and are perpendicular if r = 0.
• Spurious correlation: correlation between two variables that are not causally related.
• The coefficient of correlation is unit free.
• The errors in regression can be positive, negative or zero.
• The product of the two regression coefficients cannot numerically exceed 1.

Probability and Expected value by mathematical expectation

• Probability means likelihood. The two types of probability are subjective (based on judgement) and objective.
• An experiment is something that produces results. Random experiments are those whose results depend on chance. Events are the results.
• Composite events are those that can be broken down further into simple events.
• Mutually exclusive events are those in which the occurrence of one rules out the occurrence of the others, i.e. they cannot occur together.
• Exhaustive events are those that together constitute the whole universe (sample space).
• Equally likely events are those whose chances of occurrence are equal.
• P(A) = number of favourable outcomes / total number of equally likely outcomes (classical definition, given by Bernoulli and Laplace). It always lies between 0 and 1. P(A') = 1 − P(A).
• Odds in favour = chances of happening / chances of not happening.
• Odds against = chances of not happening / chances of happening.

Formulae

• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
• $P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(C \cap A) + P(A \cap B \cap C)$
• $P(A - B) = P(A \cap B') = P(A) - P(A \cap B)$
• $P(A \cup B \cup C) = 1$ for exhaustive events
• $P(A \cap B) = 0$ for mutually exclusive events
• $P(A \cap B) = P(A) \times P(B)$ for independent events
• $P(B/A) = \frac{P(A \cap B)}{P(A)}$, conditional probability
• $\mu = E(x) = \sum p_i x_i$, expected value
• Variance $= \sum p_i (x_i - \mu)^2 = E(x^2) - \mu^2$
• If $y = a + bx$, then the expected value is $\mu_y = a + b\mu_x$ and the standard deviation is $SD_y = |b| \times SD_x$
• The expectation of a constant k is k
• The expectation of the sum of two random variables is the sum of their expectations, i.e. E(x + y) = E(x) + E(y)
• The expectation of the product of two independent random variables is the product of their expectations, i.e. E(x × y) = E(x) × E(y)

Theoretical Distribution

• A probability distribution in which the total probability is distributed over different mass points (for a discrete variable) or over different class intervals (for a continuous variable) is known as a theoretical distribution. It exists only in theory. It is used for short-term projections, and statistical analysis is possible on the basis of a theoretical distribution.
• A probability distribution can be discrete (binomial or Poisson distribution) or continuous (normal, t, F, chi-square distribution).

Binomial Distribution

A discrete probability distribution, attributed to Bernoulli. Features: each trial has two outcomes (mutually exclusive and exhaustive), the trials are independent, and the trials are finite in number. Fitting is based on the method of moments. The binomial probability function is $f(x) = {}^{n}C_x\, p^x q^{n-x}$, where n is the number of trials, p is the probability of happening, q = 1 − p is the probability of not happening, and x is the number of happenings. It is bi-parametric, i.e. it has the parameters n and p. Mean = np and variance = npq; the variance is always less than the mean. The variance is maximum when p = q = 0.5, and the maximum variance is n/4. The distribution is symmetric when p = 0.5.
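The binomial properties stated above (probabilities adding to 1, mean np, variance npq, largest variance n/4 at p = 0.5) can be checked numerically. The following is a minimal sketch with hypothetical values of n and p, using `math.comb` from the standard library (Python 3.8 or later).

```python
from math import comb


def binomial_pmf(n, p):
    """Probabilities f(x) = nCx * p^x * q^(n-x) for x = 0, 1, ..., n."""
    q = 1 - p
    return [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]


def mean_and_variance(pmf):
    """E(x) = sum(p_i * x_i) and variance = sum(p_i * (x_i - mean)^2)."""
    mean = sum(x * f for x, f in enumerate(pmf))
    variance = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))
    return mean, variance


# Hypothetical parameters.
n, p = 10, 0.3
pmf = binomial_pmf(n, p)
mean, variance = mean_and_variance(pmf)
_, max_variance = mean_and_variance(binomial_pmf(n, 0.5))

print("total probability:", sum(pmf))
print("mean:", mean, "expected np =", n * p)
print("variance:", variance, "expected npq =", n * p * (1 - p))
print("variance at p = 0.5:", max_variance, "expected n/4 =", n / 4)
```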
When p is small, the binomial distribution is skewed towards the right. It can be bi-modal or uni-modal: if (n+1)p is an integer, there are two modes, (n+1)p and (n+1)p − 1; if (n+1)p is a non-integer, the mode is the largest integer contained in (n+1)p. Additive property: if X and Y are two independent variables such that X ~ B(n1, p) and Y ~ B(n2, p), then (X + Y) ~ B(n1 + n2, p). Example: cricket matches between two nations.

Poisson Distribution

• A discrete probability distribution, attributed to Simeon Denis Poisson.
• It is a limiting form of the binomial distribution where n is very large and p is very small: B(n, p) tends to P(m), where m = np.
• Poisson probability function: $f(x) = \frac{e^{-m} m^x}{x!}$, for x = 0, 1, 2, ...
• The total probability is 1, i.e. $\sum f(x) = 1$.
• It is uniparametric: mean = µ = m = np, and variance = µ.
• It can be uni-modal or bi-modal: if m is a non-integer, the mode is the largest integer contained in m; if m is an integer, the modes are m and m − 1.
• Additive property: if X and Y are two independent variables such that X ~ P(m1) and Y ~ P(m2), then (X + Y) ~ P(m1 + m2).
• Uses: number of printing mistakes per page in a large book, number of road accidents on a busy road per minute, number of radio-active elements per minute in a fusion process, number of demands per minute for health care.
• It is approximately symmetrical when the mean value is high.

Normal or Gaussian Distribution

• A continuous probability distribution. The most important and universally accepted distribution.
• It is based on two parameters, µ and σ², and is denoted by X ~ N(µ, σ²). The probability density function is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, for $-\infty < x < \infty$, where µ and σ are constants.
• Under the normal distribution, Mean = Median = Mode. The density is highest at the mean.
• It is bell shaped and symmetric about x = µ. The area on each side of the mean is 0.5.
• It is bi-parametric, with parameters µ and σ². The mean of the normal distribution is µ and the standard deviation of the normal distribution is σ.
• For the normal distribution, mean deviation = 0.8σ.
• Quartile deviation = 0.675σ, with Q1 = µ − 0.675σ and Q3 = µ + 0.675σ.
• Points of inflexion on the curve: µ + σ and µ − σ.
• Distribution of area under the normal curve, with $z = \frac{X - \mu}{\sigma}$:
o P(µ − σ < X < µ + σ) = 0.6828 or P(−1 < z < 1) = 0.6828
o P(µ − 2σ < X < µ + 2σ) = 0.9546 or P(−2 < z < 2) = 0.9546
o P(µ − 3σ < X < µ + 3σ) = 0.9973 or P(−3 < z < 3) = 0.9973
• Additive property: if X and Y are two independent variables such that X ~ N(µ1, σ1²) and Y ~ N(µ2, σ2²), then (X + Y) ~ N(µ1 + µ2, σ1² + σ2²), i.e. the standard deviation of the sum is $(\sigma_1^2 + \sigma_2^2)^{1/2}$.
• Use: wages of workers.

Others

• The probability function is also known as the frequency distribution.
• Probability density functions are associated with continuous distributions.

Index number

• An index number is a ratio or an average of ratios expressed as a percentage, comparing two or more time periods, one of which is the base period.
• Issues in index numbers: selection of data, base period, use of weights, use of averages (the best is GM), choice of variables and formula.
• Price relative = (Pn / P0) × 100
• Methods: Simple (Aggregative and Relative) and Weighted (Aggregative and Relative)
o Simple aggregative method $= \frac{\sum P_n}{\sum P_0} \times 100$
o Simple average of relatives $= \frac{1}{N} \sum \frac{P_n}{P_0} \times 100$
• Weighted average
o Laspeyres $= \frac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100$
o Paasche $= \frac{\sum P_n Q_n}{\sum P_0 Q_n} \times 100$
o Marshall-Edgeworth $= \frac{\sum P_n (Q_0 + Q_n)}{\sum P_0 (Q_0 + Q_n)} \times 100$
o Fisher's Ideal $= (L \times P)^{1/2}$
o Bowley $= (L + P)/2$
o Weighted average of price relatives (base-year value weights) $= \frac{\sum \frac{P_n}{P_0} (P_0 Q_0)}{\sum P_0 Q_0} \times 100 = \frac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100$
(For calculating a quantity index, interchange P and Q.)
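The weighted price index formulas listed above can be compared side by side on a small example. The sketch below is only illustrative: the prices and quantities for three commodities are hypothetical, and the helper `total` is a name introduced here.

```python
from math import sqrt


def total(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


# Hypothetical base-year (0) and current-year (n) prices and quantities.
p0 = [2, 5, 4]
q0 = [10, 4, 5]
pn = [3, 6, 5]
qn = [8, 5, 6]

laspeyres = total(pn, q0) / total(p0, q0) * 100   # base-year quantities as weights
paasche = total(pn, qn) / total(p0, qn) * 100     # current-year quantities as weights

q_sum = [a + b for a, b in zip(q0, qn)]
marshall_edgeworth = total(pn, q_sum) / total(p0, q_sum) * 100

fisher = sqrt(laspeyres * paasche)                # geometric mean of L and P
bowley = (laspeyres + paasche) / 2                # arithmetic mean of L and P

for name, value in [("Laspeyres", laspeyres), ("Paasche", paasche),
                    ("Marshall-Edgeworth", marshall_edgeworth),
                    ("Fisher", fisher), ("Bowley", bowley)]:
    print(f"{name}: {value:.2f}")
```

Since Fisher's index is a geometric mean and Bowley's an arithmetic mean of the same two numbers, Fisher's value can never exceed Bowley's.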
• Chain index = (Link relative of current year × Chain index of previous year) / 100
• Link relative of current year = (Price of current year / Price of previous year) × 100
• Value index $= \frac{\sum P_n Q_n}{\sum P_0 Q_0} \times 100$
• Deflated value (real value) = Current value / Price index for the current year (with the index expressed as a ratio)
• Shifted price index = (Original price index / Price index for the year to which the base is to be shifted) × 100

Tests of Adequacy

o Unit test: the index should be free from any unit. All formulae except the simple aggregative index satisfy it.
o Time reversal test: P01 × P10 = 1
o Factor reversal test: P01 × Q01 = V01
o Only Fisher's index satisfies all of the above tests.
o Circular test: P01 × P12 × P20 = 1 (the simple geometric mean of price relatives and the weighted aggregative index with fixed weights satisfy it)

In these tests the indices are taken as ratios, i.e. without the factor 100.

Others

o The index for the base year is always 100.
o Using the GM makes an index number satisfy the time reversal test.
o P01 means the index for period 1 with period 0 as base.
o Cost of living index $= \frac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100$. It is also a weighted index.
o Group index number $= \frac{\sum (\text{price relative} \times W)}{\sum W}$
o Price index using the simple GM: $\log I_{0n} = 2 + \frac{1}{n} \sum \log \frac{P_n}{P_0}$
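The time reversal and factor reversal tests can be tried out numerically. The sketch below, on hypothetical prices and quantities, computes the indices as plain ratios (no factor 100) and shows Fisher's index passing both tests while Laspeyres' index fails time reversal on this data; the function names are introduced here for illustration.

```python
from math import sqrt


def total(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def laspeyres(p_base, q_base, p_cur, q_cur):
    return total(p_cur, q_base) / total(p_base, q_base)


def paasche(p_base, q_base, p_cur, q_cur):
    return total(p_cur, q_cur) / total(p_base, q_cur)


def fisher(p_base, q_base, p_cur, q_cur):
    return sqrt(laspeyres(p_base, q_base, p_cur, q_cur)
                * paasche(p_base, q_base, p_cur, q_cur))


# Hypothetical data for period 0 (base) and period 1 (current).
p0, q0 = [2, 5, 4], [10, 4, 5]
p1, q1 = [3, 6, 5], [8, 5, 6]

# Time reversal test: P01 * P10 should equal 1.
fisher_time = fisher(p0, q0, p1, q1) * fisher(p1, q1, p0, q0)
laspeyres_time = laspeyres(p0, q0, p1, q1) * laspeyres(p1, q1, p0, q0)

# Factor reversal test: P01 * Q01 should equal V01.
# The quantity index is obtained by interchanging prices and quantities.
value_ratio = total(p1, q1) / total(p0, q0)
fisher_factor = fisher(p0, q0, p1, q1) * fisher(q0, p0, q1, p1)

print("Fisher    P01 * P10 =", fisher_time)
print("Laspeyres P01 * P10 =", laspeyres_time)
print("Fisher    P01 * Q01 =", fisher_factor, " V01 =", value_ratio)
```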