Uploaded by Rahul B

Ca foundation statistics
ca foundation (Institute of Chartered Accountants of India)
STATISTICS NOTES
HARIS SIR
CONCISE NOTES – STATISTICS – CA FOUNDATION
Statistical Description of Data
 The word "statistics" comes from the Latin word status, the Italian word
statista, the German word Statistik and the French word statistique.
 Statistics can be defined in a singular and a plural sense. In the plural
sense it means the data collected, qualitative and quantitative, while in the
singular sense it refers to the methods applied to these data.
 There are 2 types of errors in statistics.
 Statistics is applied in economics, business management and commerce.
 A variable is a quantitative characteristic. It is discrete if it takes
only isolated (finite or countable) values and continuous if it can assume
any value in a given interval.
 Data can be primary or secondary. Data are primary if they are collected
directly from the source for the first time, and secondary if they have
been obtained from an existing source rather than from the primary source.
 Collection of primary data: interview, mailed questionnaire, observation
and questionnaires filled in by enumerators.
 Interview method: personal interview, indirect interview and telephone
interview. The telephone interview is a quick and inexpensive method.
 Sources of secondary data: international sources, government sources,
private and quasi-government sources, and unpublished sources.
 The data collected should first be scrutinised to check whether they are
correct and accurate.
 Classification refers to the process of arranging data. It makes data more
relevant, precise and condensed, makes them comparable and serves as a base
for analysis.
 Data classified in respect of an attribute are referred to as qualitative
data. Data can be frequency data or non-frequency data. A time series is an
example of non-frequency data. Internal consistency of data can be checked
when a number of related series are given.
 Presentation of data
 Textual presentation: presenting data in paragraphs. It is simple but
non-statistical, so it is not preferred.
 Tabular presentation-presentation in tables
 Table has 4 components
 Caption (upper part-showing column and sub column), box head
(entire upper part), stub (left part giving details of rows), and body
(main part that contains numbers). Footnotes are provided in the
bottom part.
 In general the number of types of tabulation is 2.
 Diagrammatic presentation
 Line diagram or historigram (e.g. time-series graphs, ogives): shows
the relationship between two variables.
 Bar diagram (horizontal for qualitative data & vertical for
quantitative data)
 Divided bar charts or percentage bar diagrams for comparing and
relating different components of a variable
 Pie chart. Circular diagrams are two-dimensional
Frequency distribution is a tabular representation of statistical data. When
made for a discrete variable it is a discrete (ungrouped) frequency
distribution, and when it relates to continuous data it is called a grouped
frequency distribution.
Mutually exclusive classification is used for continuous variables. Mutually
inclusive classification is used for discrete variables.
Relative frequency lies between 0 & 1. Frequency density of a class interval
is the ratio class frequency / class length.
Class limits are the upper and lower limits of a class interval.
Class boundaries are the actual class limits of an interval. They are included
in the class interval.
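The two ratios above can be illustrated with a small sketch. The class boundaries and frequencies below are hypothetical figures chosen only for illustration, and the function names are assumptions of this example.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency)
classes = [(0, 10, 4), (10, 20, 10), (20, 40, 6)]

total_frequency = sum(f for _, _, f in classes)

def relative_frequency(f, total):
    """Class frequency as a share of the total frequency (lies between 0 and 1)."""
    return f / total

def frequency_density(lower, upper, f):
    """Class frequency divided by class length."""
    return f / (upper - lower)

relative = [relative_frequency(f, total_frequency) for _, _, f in classes]
density = [frequency_density(lo, hi, f) for lo, hi, f in classes]

for (lo, hi, f), rf, fd in zip(classes, relative, density):
    print(f"{lo}-{hi}: frequency={f}, relative={rf}, density={fd}")
```

Note that the last class is twice as wide as the others, so its density is lower than its frequency alone would suggest.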
Graphical representation of frequency distribution
 Histogram: helps in comparison of frequencies and calculation of mode.
It is also an area diagram. Classes are of the overlapping (mutually
exclusive) type, e.g. 10-20, 20-30. The width of all classes is equal.
 Frequency polygon: meant for a single frequency distribution, all its
classes have equal width
 Ogives or cumulative frequency graphs: for cumulative distributions, used
for quartiles, median etc.
Frequency curve is a smooth curve for which the total area is 1. It is the
limiting form of the frequency polygon & histogram. Its types are:
1. Bell-shaped: most commonly used, e.g. the distribution of profits of a company
2. U-shaped
3. J-shaped
4. Mixed curve
Measure of Central Tendency and Measure of Dispersion
 A measure of central tendency shows the clustering (collection) of the data
around a central value. It represents a series by a single figure.
 Arithmetic Mean
 Sum of all observation divided by no. of observation
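Individual series: $\bar{x} = \frac{\sum x_i}{n}$
Frequency distribution: $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
Grouped frequency distribution: $\bar{x} = A + \frac{\sum f_i d_i}{\sum f_i} \times C$
A = assumed mean
$d_i = \frac{x_i - A}{C}$
C = class interval (class width)
$x_i$ = mid-point of class interval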
 If every observation of the series equals k, then the mean is also k
 The sum of deviations from the mean is always 0: $\sum (x_i - \bar{x}) = 0$
or $\sum f_i (x_i - \bar{x}) = 0$
 It is affected by change of origin (addition) and change of scale
(multiplication): if $y = a + bx$, then $\bar{y} = a + b\bar{x}$
($\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x).
 Combined mean $= \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
 Affected by extremes, the best average, based on all observations, cannot
be calculated for open-end frequency distributions, least affected by
sampling fluctuations.
 It represents a series. It is rigidly defined. Most commonly used
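A short sketch can check three of the arithmetic mean properties listed above: the deviations from the mean sum to zero, the mean of y = a + bx equals a + b times the mean of x, and the combined mean is the size-weighted mean of the group means. The data and the constants a and b are hypothetical, and `combined_mean` is a helper name assumed for this example.

```python
from statistics import mean

x = [2, 4, 6, 8, 10]   # hypothetical observations
a, b = 3, 2            # hypothetical change of origin (a) and scale (b)
y = [a + b * xi for xi in x]

x_bar = mean(x)
# Sum of deviations taken from the mean
deviation_sum = sum(xi - x_bar for xi in x)

def combined_mean(n1, mean1, n2, mean2):
    """Mean of two groups pooled together, from their sizes and means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)

print(x_bar, deviation_sum, mean(y), combined_mean(2, 10, 3, 20))
```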
 Median
 Positional value, divide a series into two parts, not affected by
extremes.
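 Median = value of the $\frac{n+1}{2}$-th term of the ordered series when
n is odd. In case of an even number of observations, the median is the
simple average of the two middle values.
 For grouped frequency, Median $= L + \frac{\frac{N}{2} - cf}{f} \times C$
L = lower limit (boundary) of the median class
N = total frequency
cf = less-than cumulative frequency of the class preceding the median class
f = frequency of the median class
C = width of the median class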
 If $y = a + bx$, then $y_{me} = a + b\,x_{me}$
 The sum of absolute deviations $\sum |x_i - A|$ is minimum when A is the
median
 Not based on all observations, best for open-end frequency
distributions, not affected by the presence of extremes, but more
affected by sampling fluctuations than the mean. Cumulative
frequencies have to be worked out for its calculation
 It can also be located graphically by the use of ogives.
 Quartiles, Deciles, Percentiles
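 These are also positional averages. Quartiles, deciles and percentiles
divide the series into 4, 10 and 100 equal parts respectively. There are
3, 9 and 99 quartiles, deciles and percentiles. Q2 = D5 = P50 = Median
 Can be calculated through ogives also.
 For quartiles, $Q_k = L + \frac{\frac{kN}{4} - cf}{f} \times C$
For deciles, $D_k = L + \frac{\frac{kN}{10} - cf}{f} \times C$
For percentiles, $P_k = L + \frac{\frac{kN}{100} - cf}{f} \times C$
where L, cf, f and C are the lower boundary, preceding cumulative
frequency, frequency and width of the class containing the required value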
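The grouped formulas for the median, quartiles, deciles and percentiles all share one interpolation pattern: lower boundary plus (required cumulative count minus preceding cumulative frequency) divided by class frequency, times class width. A minimal sketch follows, assuming classes are given as boundaries in ascending order. The function name `grouped_quantile` and the data are hypothetical.

```python
def grouped_quantile(classes, fraction):
    """Interpolated positional value for a grouped frequency distribution.

    classes:  list of (lower_boundary, upper_boundary, frequency), ascending.
    fraction: 0.5 for the median, 0.25 for Q1, 0.9 for D9, 0.35 for P35, ...
    """
    total = sum(f for _, _, f in classes)
    target = fraction * total        # e.g. N/2 for the median
    cf = 0                           # cumulative frequency of preceding classes
    for lower, upper, f in classes:
        if f > 0 and cf + f >= target:
            return lower + (target - cf) / f * (upper - lower)
        cf += f
    raise ValueError("fraction must lie between 0 and 1")

# Hypothetical distribution with N = 30
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 5)]
median = grouped_quantile(classes, 0.5)
q1 = grouped_quantile(classes, 0.25)
q3 = grouped_quantile(classes, 0.75)
print(median, q1, q3)
```

Calling the same function with 0.5 and with 2/4 gives the same value, which mirrors Q2 = D5 = P50 = Median.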
 Mode
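 The value that occurs with the highest frequency in the series. A series
can be multimodal also. It is not uniquely defined, and it does not exist
if all the observations occur with equal frequency. It represents the
number which is repeated the most times.
 Mode $= L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times C$, where L is the lower
boundary of the modal class, $f_1$ its frequency, $f_0$ and $f_2$ the
frequencies of the preceding and following classes, and C the class width
 Where it is difficult to compute the mode with the above formula, for a
moderately skewed distribution it can be calculated using the relation
Mean - Mode = 3(Mean - Median), i.e. Mode = 3 Median - 2 Mean
 If $y = a + bx$, then $y_{mo} = a + b\,x_{mo}$
 Graphically it can be located using a histogram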
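As a rough illustration of the two ways of obtaining the mode mentioned above, the sketch below implements the usual grouped-data interpolation formula (assumed here to be the one intended by the notes) and the empirical relation Mode = 3 Median - 2 Mean. Function names and figures are hypothetical.

```python
def grouped_mode(lower, width, f_prev, f_modal, f_next):
    """Mode of a grouped distribution with equal class widths.

    lower:   lower boundary of the modal class
    width:   class width
    f_prev, f_modal, f_next: frequencies of the preceding, modal and following classes
    """
    return lower + (f_modal - f_prev) / (2 * f_modal - f_prev - f_next) * width

def empirical_mode(mean, median):
    """Approximate mode for a moderately skewed distribution."""
    return 3 * median - 2 * mean

# Hypothetical modal class 20-30 with frequency 12, neighbours 8 and 5
mode_value = grouped_mode(20, 10, 8, 12, 5)
print(mode_value, empirical_mode(50, 52))
```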
 Geometric Mean
 n-th root of the product of the n observations
 $G = (x_1 \times x_2 \times \cdots \times x_n)^{1/n}$
 For grouped frequency, $G = (x_1^{f_1} \times x_2^{f_2} \times \cdots \times x_n^{f_n})^{1/N}$,
where $N = \sum f_i$
 Logarithm of G of a set of observations is the AM of the logarithms
of the observations, i.e. $\log G = \frac{1}{n} \sum \log x_i$
 If all the observations of a series are k, then the GM is also k
 GM of the product of two variables is the product of their GMs, i.e. if
z = x*y, GM of z = (GM of x)*(GM of y)
 GM of the ratio of two variables is the ratio of their GMs
 It can be calculated only if all the observations have the same sign and
none of them is 0
 Use: calculation of the growth rate of population
 Harmonic Mean
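 It is the reciprocal of the AM of the reciprocals of the observations
 $H = \frac{n}{\sum (1/x_i)}$, and for grouped frequency $H = \frac{N}{\sum (f_i/x_i)}$,
where $N = \sum f_i$
 If all the observations are k, then the HM is also k. Combined HM
$= \frac{n_1 + n_2}{\frac{n_1}{H_1} + \frac{n_2}{H_2}}$
 Used for averaging rates, e.g. average price or average speed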
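 Weighted averages are as follows
 Weighted AM $= \frac{\sum w_i x_i}{\sum w_i}$
 Weighted GM $= \text{antilog}\left(\frac{\sum w_i \log x_i}{\sum w_i}\right)$
 Weighted HM $= \frac{\sum w_i}{\sum (w_i / x_i)}$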
 Weighted averages are generally used when all the observations are not of
equal importance. For positive observations AM ≥ GM ≥ HM. For 2 numbers,
AM × HM = GM². For finding rates generally GM & HM are used.
 For a symmetrical distribution, mean = median = mode.
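The ordering AM ≥ GM ≥ HM and the two-number identity AM × HM = GM² can be checked numerically with Python's standard `statistics` module. The two observations are hypothetical, and `weighted_am` is a helper name assumed for this sketch.

```python
import math
from statistics import mean, geometric_mean, harmonic_mean

x = [4, 9]   # two hypothetical positive observations

am = mean(x)             # (4 + 9) / 2
gm = geometric_mean(x)   # square root of 4 * 9
hm = harmonic_mean(x)    # 2 / (1/4 + 1/9)

def weighted_am(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

print(am, gm, hm, weighted_am(x, [3, 1]))
```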
Measure of Dispersion
 It represents the scatter of the series. It can be a) Absolute and b) Relative.
 Absolute (or 1st measure of dispersion): a) Range, b) Mean Deviation, c)
Standard Deviation and d) Quartile Deviation
 Relative (or 2nd degree dispersion): a) Coefficient of Range, b) Coef of
MD, c) Coef of SD and d) Coef of QD.
 Absolute v/s relative: absolute measures are unit based, whereas relative
measures are unit free. Absolute measures are not suited to comparing
series in different units, and relative measures are harder to compute.
 Range
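 Range = L - S, Coef. of Range $= \frac{L - S}{L + S} \times 100$, where L and S
are the largest and smallest observations. It is affected by the presence
of extremes and is based on just two observations. It is not affected by
change of origin but is affected by change of scale.
 If $y = a + bx$, then $R_y = |b| \times R_x$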
 Mean Deviation
 Means the arithmetic mean of the absolute deviations from the mean or median
 For ungrouped frequency, MD $= \frac{1}{n} \sum |x_i - A|$
 For grouped frequency, MD $= \frac{1}{N} \sum f_i |x_i - A|$, where $N = \sum f_i$
 Coef of MD $= \frac{\text{MD about } A}{A} \times 100$
where A is the mean or median
 MD takes its minimum value when calculated from the median. It is not
changed by change of origin but is affected by change of scale: for
$y = a + bx$, $MD_y = |b| \times MD_x$
 Standard Deviation
 Best method to calculate dispersion. Rigidly defined.
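 For any frequency distribution, SD $= \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{N}}$, where $N = \sum f_i$
 For any frequency distribution, SD $= \sqrt{\frac{\sum f_i x_i^2}{N} - \bar{x}^2}$
(for ungrouped data take every $f_i = 1$ and $N = n$)
 Variance = SD². The coefficient of variation represents the relative
variation in a series: CV = (SD / AM) × 100. A higher CV means more dispersion
 If all the observations of the series are k, then SD is 0.
 Remains unchanged by change of origin but changes with change of
scale: for $y = a + bx$, $SD_y = |b| \times SD_x$
 Combined standard deviation $= \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}}$
where $d_1 = \bar{x}_1 - \bar{x}_c$
where $d_2 = \bar{x}_2 - \bar{x}_c$
where $\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$ is the combined mean and $s_1$, $s_2$ are the SDs of the two groups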
 It is used for finding dispersion of a group of series.
 For any two numbers, SD is half of the range.
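The standard deviation properties in this list can be verified with a short sketch. It uses `statistics.pstdev`, the population form (divisor n), which is assumed here to match the definition used in these notes. The data and the helper names `coefficient_of_variation` and `combined_sd` are hypothetical.

```python
import math
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """CV = SD / AM * 100 (unit-free measure of dispersion)."""
    return pstdev(data) / mean(data) * 100

def combined_sd(n1, mean1, sd1, n2, mean2, sd2):
    """SD of two groups pooled together, from their sizes, means and SDs."""
    combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1, d2 = mean1 - combined_mean, mean2 - combined_mean
    pooled = n1 * sd1 ** 2 + n2 * sd2 ** 2 + n1 * d1 ** 2 + n2 * d2 ** 2
    return math.sqrt(pooled / (n1 + n2))

g1 = [2, 4, 6]            # hypothetical group 1
g2 = [10, 12, 14, 16]     # hypothetical group 2

sd_combined = combined_sd(len(g1), mean(g1), pstdev(g1),
                          len(g2), mean(g2), pstdev(g2))
sd_two_numbers = pstdev([3, 11])   # two numbers with range 8
sd_scaled = pstdev([5 - 3 * v for v in g1])   # y = 5 - 3x
print(sd_combined, sd_two_numbers, sd_scaled, coefficient_of_variation(g2))
```

The combined figure agrees with the SD computed directly on the pooled observations, which is a convenient check when working such problems by hand.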
 Quartile Deviation
 Quartile deviation or semi-interquartile range: $QD = \frac{Q_3 - Q_1}{2}$,
Coef. of QD $= \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100$
 It is the best measure for open-end distributions. It is not affected by
change of origin but is affected by change of scale.
 It is based on only the central 50% of the observations
Correlation and Regression
 Correlation shows the association or relation between two variables,
whereas regression estimates the value of one variable based on the other
 Bivariate data: can have marginal and conditional distributions
 Collected for 2 variables at the same time
 For a p × q bivariate frequency distribution, the number of cells is pq
 Some cell frequencies can be 0, but none can be negative
 For p × q, there are 2 marginal distributions and p + q conditional
distributions
 Correlation (r)
 Lies between -1 and 1. -1 means perfect negative, between -1 and 0
means negative, 0 means no correlation, between 0 and 1 means positive
and +1 means perfect positive correlation.
 Scatter diagram: a graphical method of studying correlation. It can
reveal non-linear as well as linear relationships.
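 Karl Pearson's product moment correlation coefficient
o Used only for linearly related variables
o $r = r_{xy} = \frac{Cov(x, y)}{SD_x \times SD_y}$
o $Cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} = \frac{\sum x_i y_i}{n} - \bar{x}\,\bar{y}$
$SD_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$
$SD_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n}}$
 For a bivariate frequency distribution
 $Cov(x, y) = \frac{\sum f_{ij} x_i y_j}{N} - \bar{x}\,\bar{y}$, where $f_{ij}$ is the
frequency of cell (i, j) and $N = \sum f_{ij}$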
 The coefficient of correlation is unit free
 Its magnitude remains unchanged by change of origin or scale, but it
changes its sign with the sign of the scale factors. If the scale
factors of both variables have the same sign, r remains the same,
while if their signs differ, the sign of r also changes.
 See eg 12.80 page 12.18 or module
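A minimal sketch of the product moment formula, r = Cov(x, y) / (SDx × SDy) with divisor n throughout, together with a check of the sign rule under change of scale. The data and the function name `pearson_r` are hypothetical and not taken from the module referred to above.

```python
import math

def pearson_r(x, y):
    """Product moment correlation: Cov(x, y) / (SD_x * SD_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]   # hypothetical paired observations
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
# Same-sign scale factors leave r unchanged; opposite signs flip it
r_same_sign = pearson_r([10 + 2 * xi for xi in x], [3 * yi - 1 for yi in y])
r_opposite_sign = pearson_r([10 - 2 * xi for xi in x], [3 * yi - 1 for yi in y])
print(r, r_same_sign, r_opposite_sign)
```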
 Spearman rank correlation
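$r_R = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$
$d_i = x_i - y_i$, where $x_i$ and $y_i$ are the ranks given in the x and y series.
Ranks are given in descending order: 1 is given to the highest value and so on
In case of repeated (tied) ranks:
$r_R = 1 - \frac{6\left[\sum d_i^2 + \sum_j \frac{t_j^3 - t_j}{12}\right]}{n(n^2 - 1)}$
where $t_j$ is the number of repetitions in the j-th group of tied values. The
term $\frac{t_j^3 - t_j}{12}$ is added once for every group of repeated values.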
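The rank correlation formula for the case without repeated ranks can be sketched as follows. Ranks are assigned in descending order, with 1 for the highest value, as described above. The marks and the function names are hypothetical, and the sketch deliberately does not handle ties.

```python
def ranks_descending(values):
    """Rank 1 for the highest value, 2 for the next, and so on (no ties assumed)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(x, y):
    """Rank correlation 1 - 6 * sum(d^2) / (n * (n^2 - 1)) for untied data."""
    rank_x, rank_y = ranks_descending(x), ranks_descending(y)
    n = len(x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))

# Hypothetical marks of five candidates given by two judges
judge_1 = [80, 60, 90, 70, 50]
judge_2 = [75, 65, 85, 50, 60]

r_rank = spearman_r(judge_1, judge_2)
print(ranks_descending(judge_1), ranks_descending(judge_2), r_rank)
```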
 Coefficient of Concurrent Deviation
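$r_c = \pm\sqrt{\pm\frac{2c - m}{m}}$
m represents n - 1, the number of pairs of deviations
The deviations in the values of x and y are said to be concurrent if
both deviations have the same sign. c denotes the number of
concurrent deviations. The sign of (2c - m) is taken both inside and
outside the square root.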
 Regression (b)
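 y = a + bx, where a and b are the regression parameters. This is the
regression equation of y on x, with b = b_yx, obtained by the method of
least squares. The regression coefficient also represents the slope of
the regression line
(a) $b_{yx} = \frac{Cov(x, y)}{SD_x^2} = r \times \frac{SD_y}{SD_x}$
$a = \bar{y} - b_{yx}\,\bar{x}$ (where $\bar{y}$ = mean of y, $\bar{x}$ = mean of x)
For solving a and b (normal equations)
$\sum y_i = na + b \sum x_i$
$\sum x_i y_i = a \sum x_i + b \sum x_i^2$
(b) direct method
$b_{yx} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$
For the regression equation of x on y, interchange x and y.
 The regression coefficient changes with change of scale, but not with
change of origin: if $u = \frac{x - a}{p}$ and $v = \frac{y - c}{q}$, then
$b_{yx} = \frac{q}{p} \times b_{vu}$
 The two regression lines intersect at the point of means $(\bar{x}, \bar{y})$.
$r = \pm\sqrt{b_{yx} \times b_{xy}}$, with r taking the common sign of the two regression coefficients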
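The direct method for the regression coefficient, the intercept a = mean of y minus b times mean of x, and the relation between r and the two regression coefficients can be put together in a short sketch. The data are hypothetical and `regression_coefficient` is a name assumed for this example.

```python
import math

def regression_coefficient(dependent, independent):
    """Least-squares slope of `dependent` on `independent` (direct method)."""
    n = len(independent)
    sum_x, sum_y = sum(independent), sum(dependent)
    sum_xy = sum(xi * yi for xi, yi in zip(independent, dependent))
    sum_xx = sum(xi * xi for xi in independent)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)

x = [1, 2, 3, 4, 5]   # hypothetical paired observations
y = [2, 4, 5, 4, 5]

b_yx = regression_coefficient(y, x)   # slope of y on x
b_xy = regression_coefficient(x, y)   # slope of x on y
a = sum(y) / len(y) - b_yx * sum(x) / len(x)   # intercept of y on x

# r is the square root of the product, carrying the sign of the coefficients
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
print(b_yx, b_xy, a, r)
```

Because the line of y on x is fitted as y = a + b_yx x with a defined from the means, it passes through the point of means by construction.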
 Coefficient of determination $r^2$ = Explained variance / Total
variance
Coefficient of non-determination = $1 - r^2$
 The two regression lines coincide when r = -1 or 1, and are
perpendicular if r = 0.
 Spurious correlation: a relation between two variables which are
not causally related.
 Coefficient of correlation is unit free.
 The errors in case of regression can be positive, negative or
zero.
 The product of the regression coefficients cannot numerically
exceed 1, since it equals $r^2$.
Probability and Expected value by mathematical
expectation
 Probability means likelihood. The two types of probability are subjective
(based on judgement) and objective.
 An experiment is something which produces results. Random
experiments are those where the results depend on chance. Events are the
results. Composite events are those which can be further broken into simple
events.
 Mutually exclusive events are those in which the occurrence of one prevents
the occurrence of the others. Exhaustive events are those which together
constitute the universe. Equally likely events are those whose chances of
occurrence are equal.
 P(A) = number of outcomes favourable to A / total number of equally likely
outcomes (classical definition, given by Bernoulli and Laplace). It always
lies between 0 and 1. P(A') = 1 - P(A).
 Odds in favour = chances of happening / chances of not happening.
 Odds against = chances of not happening / chances of happening.
 Formulae
• P(A∪B) = P(A) + P(B) - P(A∩B)
• P(A∪B∪C) = P(A) + P(B) + P(C) - P(A∩B) - P(B∩C) - P(C∩A) + P(A∩B∩C)
• P(A-B) = P(A∩B') = P(A) - P(A∩B)
• P(A∪B∪C) = 1 for exhaustive events
• P(A∩B) = 0 for mutually exclusive events
• P(A∩B) = P(A)*P(B) for independent events
• P(B/A) = P(A∩B)/P(A), conditional probability (P(A) > 0)
• $\mu = E(x) = \sum p_i x_i$, expected value
• Variance $= E(x - \mu)^2 = \sum p_i (x_i - \mu)^2 = E(x^2) - \mu^2$
• If y = a + bx, then the expected value $\mu_y = a + b\mu_x$ and the standard
deviation $SD_y = |b| \times SD_x$
• Expectation of a constant k is k
• Expectation of the sum of two random variables is the sum of their
expectations, i.e. E(x+y) = E(x) + E(y)
• Expectation of the product of two independent random variables is the
product of their expectations, i.e. E(x*y) = E(x)*E(y)
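The expected value and variance formulas can be illustrated on a small discrete distribution. The example below uses the number of heads in two tosses of a fair coin. The constants a and b and the function name `expectation` are hypothetical choices for this sketch.

```python
import math

# Number of heads in two tosses of a fair coin
values = [0, 1, 2]
probabilities = [0.25, 0.5, 0.25]

def expectation(values, probabilities):
    """E(x) = sum of p_i * x_i."""
    return sum(p * x for x, p in zip(values, probabilities))

mu = expectation(values, probabilities)
# Variance via E(x^2) - mu^2
variance = expectation([x ** 2 for x in values], probabilities) - mu ** 2

# Linear transformation y = a + b x with hypothetical a and b
a, b = 1, -2
y_values = [a + b * x for x in values]
mu_y = expectation(y_values, probabilities)
variance_y = expectation([y ** 2 for y in y_values], probabilities) - mu_y ** 2
print(mu, variance, mu_y, math.sqrt(variance_y))
```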
Theoretical Distribution
• A probability distribution in which the total probability is distributed to
different mass points in case of a discrete variable and to different
class intervals in case of a continuous variable is known as a
theoretical distribution. It exists only in theory. It is used for
short-term projections, and statistical analysis is possible on the basis
of a theoretical distribution
• A probability distribution can be discrete (Binomial or Poisson
distribution) or continuous (normal, t, F, chi-square
distribution)
• Binomial Distribution
 Discrete probability distribution, introduced by Bernoulli.
Features: two outcomes (mutually exclusive and exhaustive),
independent trials, trials are finite in number.
 Fitting it to data is based on the method of moments.
 Binomial probability function $f(x) = {}^{n}C_{x}\, p^{x} q^{n-x}$ for x = 0, 1, ..., n,
where n is the number of trials, p is the probability of happening,
q = 1 - p is the probability of non-happening and x is the number of
happenings.
 It is bi-parametric, i.e. n and p. Mean = np, variance = npq. The
variance is always less than the mean. The variance is maximum when
p = q = 0.5, and the maximum variance is n/4.
 It is symmetric when p = 0.5. When p is small, the binomial
distribution is skewed towards the right
 It can be bi-modal or uni-modal: if (n+1)p is an integer, there are two
modes, (n+1)p and (n+1)p - 1. If (n+1)p is a non-integer, the mode is
the largest integer contained in (n+1)p
 Additive property: if X and Y are two independent variables
such that X ~ B(n1, p) and Y ~ B(n2, p), then (X+Y) ~ B(n1+n2, p)
 E.g. cricket matches between two nations
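The binomial results above (total probability 1, mean np, variance npq, mode as the largest integer in (n+1)p when that is not a whole number) can be checked numerically. The values n = 10 and p = 0.3 are hypothetical, and `binomial_pmf` is a name assumed for this sketch.

```python
import math
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3   # hypothetical parameters
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total_probability = sum(pmf)
mean_value = sum(x * px for x, px in enumerate(pmf))
variance = sum(x * x * px for x, px in enumerate(pmf)) - mean_value ** 2
mode = max(range(n + 1), key=lambda x: pmf[x])
print(total_probability, mean_value, variance, mode)
```

Here (n+1)p = 3.3 is not an integer, so the distribution is uni-modal with mode 3.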
 Poisson Distribution
 Discrete probability distribution, introduced by Siméon Denis Poisson.
It is a limiting form of the binomial distribution where n is very
large and p is very small: B(n, p) ≈ P(m)
 Poisson probability function: X ~ P(m), $f(x) = \frac{e^{-m} m^{x}}{x!}$ for x = 0, 1, 2, ...,
where m = np.
 The total probability is 1, i.e. $\sum f(x) = 1$. It is uniparametric, with
parameter m: mean = m = np and variance = m
 It can be uni-modal or bi-modal: the mode is the largest integer
contained in m if m is a non-integer, and there are two modes, m and
m - 1, if m is an integer.
 Additive property: if X and Y are two independent variables
such that X ~ P(m1) and Y ~ P(m2), then (X+Y) ~ P(m1+m2)
 Uses: number of printing mistakes per page in a large book, number
of road accidents on a busy road per minute, number of radio-active
elements per minute in a fusion process, number of demands per
minute for a health centre.
 It becomes approximately symmetrical when the mean value is high
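A short numerical sketch of the Poisson probability function and of its closeness to the binomial when n is large and p is small. The values n = 1000 and p = 0.002 are hypothetical, and `poisson_pmf` is a name assumed for this example.

```python
import math
from math import comb, exp, factorial

def poisson_pmf(x, m):
    """P(X = x) = e^(-m) * m^x / x!"""
    return exp(-m) * m ** x / factorial(x)

n, p = 1000, 0.002   # hypothetical: large n, small p
m = n * p            # Poisson parameter, here 2.0

binomial_p3 = comb(n, 3) * p ** 3 * (1 - p) ** (n - 3)
poisson_p3 = poisson_pmf(3, m)

# Summing far into the tail should give (almost exactly) total probability 1
total_probability = sum(poisson_pmf(x, m) for x in range(50))
mean_value = sum(x * poisson_pmf(x, m) for x in range(50))
print(binomial_p3, poisson_p3, total_probability, mean_value)
```

Since m = 2 is an integer, P(X = 1) and P(X = 2) are equal, which illustrates the bi-modal case.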
 Normal or Gaussian Distribution
 Continuous probability distribution. Most important and
universally accepted. It is based on two parameters µ and σ²,
denoted by X ~ N(µ, σ²), and its probability distribution (or
probability density function) is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$ for $-\infty < x < \infty$
where µ and σ are constants (σ > 0)
 Under the normal distribution, Mean = Median = Mode. The density
is highest at the mean.
 It is bell shaped and symmetric about the line x = µ. The area
on each side is 0.5. It is bi-parametric with µ and σ². The
mean of the normal distribution is µ and the standard deviation of
the normal distribution is σ.
 Mean deviation = 0.8 σ
 Quartile deviation = 0.675 σ, with Q1 = µ - 0.675 σ and Q3 = µ + 0.675 σ
 Points of inflexion on the curve: x = µ + σ and x = µ - σ
 Distribution of area under the normal curve
 P(µ - σ < X < µ + σ) = 0.6828 or P(-1 < z < 1) = 0.6828
 P(µ - 2σ < X < µ + 2σ) = 0.9546 or P(-2 < z < 2) = 0.9546
 P(µ - 3σ < X < µ + 3σ) = 0.9973 or P(-3 < z < 3) = 0.9973
where z = (X - µ)/σ is the standard normal variable
 Additive property: if X and Y are two independent variables
such that X ~ N(µ1, σ1²) and Y ~ N(µ2, σ2²), then (X+Y)
~ N(µ1 + µ2, σ1² + σ2²), i.e. the SD of X+Y is (σ1² + σ2²)^(1/2)
 Use: e.g. distribution of wages of workers
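The areas under the normal curve quoted above can be reproduced with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The computed values agree with the quoted table figures to about three decimal places. The helper name `central_area` is an assumption of this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def central_area(k):
    """P(-k < z < k) for the standard normal variable z."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

areas = {k: central_area(k) for k in (1, 2, 3)}

# z-value of the upper quartile, i.e. the multiplier of sigma in Q3
q3_multiplier = standard_normal.inv_cdf(0.75)

for k, area in areas.items():
    print(f"P(-{k} < z < {k}) = {area:.4f}")
print(f"Q3 = mu + {q3_multiplier:.3f} sigma")
```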
 Others
 A probability function is also known as a frequency function
 A probability density function is associated with a continuous
distribution
Index number
 It means ratio or average of ratios expressed as a percentage, having two
or more time periods, one of the time period is base period.
 Issues in index number-selection of data, base period, use of weights, use
of average- best is GM, choice of variable and formula.
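 Price relative $= \frac{P_n}{P_0} \times 100$
 Methods: Simple (Aggregative and Relative) & Weighted (Aggregative and Relative)
o Simple aggregative method $= \frac{\sum P_n}{\sum P_0} \times 100$
o Simple average of relatives $= \frac{1}{N} \sum \frac{P_n}{P_0} \times 100$, where N is the number of items
 Weighted average
o Laspeyres $= \frac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100$
o Paasche $= \frac{\sum P_n Q_n}{\sum P_0 Q_n} \times 100$
o Marshall-Edgeworth $= \frac{\sum P_n (Q_0 + Q_n)}{\sum P_0 (Q_0 + Q_n)} \times 100$
o Fisher's Ideal $= (L \times P)^{1/2}$
o Bowley $= (L + P)/2$
o Weighted average of price relatives $= \frac{\sum \frac{P_n}{P_0} (P_0 Q_0)}{\sum P_0 Q_0} \times 100$, with base-year values $P_0 Q_0$ as weights
 (For calculating a quantity index, interchange P and Q.)
 Chain index = (link relative of current year × chain index of previous year) / 100
 Link relative of current year = (price of current year / price of previous year) × 100
 Value index $= \frac{\sum P_n Q_n}{\sum P_0 Q_0} \times 100$
 Deflated value (real value) = Current value / Price index for the current year
 Shifted price index = (Original price index / Price index of the year to
which the base is to be shifted) × 100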
 Tests of Adequacy
o Unit test: the index should be free from any unit. All formulae except
the simple aggregative index satisfy it
o Time reversal test: P01 × P10 = 1
o Factor reversal test: P01 × Q01 = V01
o Only Fisher's index satisfies all the above tests.
o Circular test: P01 × P12 × P20 = 1 (simple geometric mean of price
relatives and weighted aggregative with fixed weights satisfy it)
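The weighted aggregative formulas and the time reversal test can be illustrated with a small sketch. Laspeyres uses base-year quantities as weights, Paasche uses current-year quantities, and Fisher's index is their geometric mean. All prices, quantities and function names below are hypothetical. Indices are expressed as percentages, so they are divided by 100 before applying the test P01 × P10 = 1.

```python
import math

def aggregate(prices, quantities):
    """Sum of price * quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))

def laspeyres(p0, q0, pn, qn):
    return aggregate(pn, q0) / aggregate(p0, q0) * 100

def paasche(p0, q0, pn, qn):
    return aggregate(pn, qn) / aggregate(p0, qn) * 100

def fisher(p0, q0, pn, qn):
    return math.sqrt(laspeyres(p0, q0, pn, qn) * paasche(p0, q0, pn, qn))

# Hypothetical base-year (0) and current-year (n) data for three items
p0, q0 = [10, 20, 5], [4, 3, 10]
pn, qn = [12, 25, 6], [5, 2, 12]

L = laspeyres(p0, q0, pn, qn)
P = paasche(p0, q0, pn, qn)
F = fisher(p0, q0, pn, qn)
F_reversed = fisher(pn, qn, p0, q0)   # base and current periods interchanged

print(L, P, F, (F / 100) * (F_reversed / 100))
```

With these figures the Laspeyres index alone does not satisfy the time reversal test, while the Fisher index does.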
 Others
o Index for base year is always 100
o Use of GM makes the index number satisfy the time reversal test
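o P01 means the index for period 1 on period 0 as base
o Cost of living index $= \frac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100$. It is also a weighted index
o Group index number $= \frac{\sum (\text{price relative} \times w)}{\sum w}$
o Price index using simple GM: $\log I_{0n} = 2 + \frac{1}{n} \sum \log \frac{P_n}{P_0}$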