BUSINESS STATISTICS

For B.Com. (Pass and Honours); B.A. (Economics Honours); M.B.A./M.M.S. of Indian Universities

S.C. GUPTA
M.A. (Statistics); M.A. (Mathematics); M.S. (U.S.A.)
Associate Professor in Statistics (Retired)
Hindu College, University of Delhi
Delhi-110007

Mrs.
INDRA GUPTA

© Author

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording and/or otherwise without the prior written permission of the publisher.

First Edition : June 1988
Second Edition : 2013

Published by : Mrs. Meena Pandey for Himalaya Publishing House Pvt. Ltd., “Ramdoot”, Dr. Bhalerao Marg, Girgaon, Mumbai - 400 004. Phone: 022-23860170/23863863; Fax: 022-23877178. E-mail: himpub@vsnl.com; Website: www.himpub.com

Branch Offices :
New Delhi : “Pooja Apartments”, 4-B, Murari Lal Street, Ansari Road, Darya Ganj, New Delhi - 110 002. Phone: 011-23270392/23278631; Fax: 011-23256286
Nagpur : Kundanlal Chandak Industrial Estate, Ghat Road, Nagpur - 440 018. Phone: 0712-2738731/3296733; Telefax: 0712-2721216
Bengaluru : Plot No. 91-33, 2nd Main Road, Seshadripuram, Behind Nataraja Theatre, Bengaluru - 560020. Phone: 08041138821; Mobile: 9379847017, 9379847005
Hyderabad : No. 3-4-184, Lingampally, Besides Raghavendra Swamy Matham, Kachiguda, Hyderabad - 500 027. Phone: 040-27560041/27550139
Chennai : New-20, Old-59, Thirumalai Pillai Road, T. Nagar, Chennai - 600 017. Mobile: 9380460419
Pune : First Floor, “Laksha” Apartment, No. 527, Mehunpura, Shaniwarpeth (Near Prabhat Theatre), Pune - 411 030. Phone: 020-24496323/24496333; Mobile: 09370579333
Lucknow : House No. 731, Shekhupura Colony, Near B.D. Convent School, Aliganj, Lucknow - 226 022. Phone: 0522-4012353; Mobile: 09307501549
Ahmedabad : 114, “SHAIL”, 1st Floor, Opp. Madhu Sudan House, C.G. Road, Navrang Pura, Ahmedabad - 380 009. Phone: 079-26560126; Mobile: 09377088847
Ernakulam : 39/176 (New No.: 60/251), 1st Floor, Karikkamuri Road, Ernakulam, Kochi - 682011. Phone: 0484-2378012, 2378016; Mobile: 09387122121
Bhubaneswar : 5 Station Square, Bhubaneswar - 751 001 (Odisha). Phone: 0674-2532129; Mobile: 09338746007
Kolkata : 108/4, Beliaghata Main Road, Near ID Hospital, Opp.
SBI Bank, Kolkata - 700 010. Phone: 033-32449649; Mobile: 7439040301

DTP by : Times Printographic

Dedicated to Our Parents

Preface to the Sixth Edition

The book, originally written over 20 years ago, has been revised and reprinted several times during the intervening period. It is very heartening to note that there has been an increasing response for the book from the students of B.A. (Economics Honours), B.Com. (Pass and Honours), M.B.A./M.M.S. and other management courses, in spite of the fact that the book has not been revised for quite a long time.

I take great pleasure in presenting to the readers the sixth, thoroughly revised and enlarged, edition of the book. The book has been revised in the light of the valuable criticism, suggestions and feedback received from the teachers, students and other readers of the book from all over the country.

Some salient features of the new edition are :

• The theoretical discussion throughout has been refined, restructured, rewritten and updated. During the course of rewriting, a sincere attempt has been made to retain the basic features of the earlier editions, viz., the simplicity of presentation, lucidity of style and the analytical approach, which have been appreciated by teachers and students all over India.

• Several new topics have been added at appropriate places to make the treatment of the subject matter more exhaustive and up-to-date. Some of the additions are given below :

  Remark, page 5·57 : Effect of Change of Scale on Harmonic Mean.
  Remark 4, page 6·3 : Effect of Change of Origin and Scale on Range.
  Remark 6, page 6·10 : Effect of Change of Origin and Scale on Mean Deviation about Mean.
  Remark 6, page 7·3 : Xmax and Xmin in terms of Mean and Range.
  Remark 2, page 8·12 : Some Results on Covariance.
  § 8·10, page 8·45 : Lag and Lead Correlation.
  Remark 1 and Theorem, page 9·4 : Necessary and Sufficient Condition for Minima of E.
  Remark, page 9·24 : Limits for r.
  § 11·9, page 11·54 : Time Series Analysis in Forecasting.
  § 13·9, page 13·9 : Covariance in Terms of Expectation.
  § 13·10, page 13·14 : Var(ax + by) = a² Var(x) + b² Var(y) + 2ab Cov(x, y).
  Remark and Equations (14·29e) and (14·29f), page 14·31 : Distribution of the Mean (X̄) of n i.i.d. N(μ, σ²) variates.
  § 15·11·2, pages 15·18 – 15·19 : Sampling Distribution of Mean.

• A number of solved examples, selected from the latest examination papers of various universities and professional institutes, have been added. These are bound to assist understanding and provide greater variety.

• Exercise sets containing questions and unsolved problems at the end of each Chapter have been substantially reorganised and rewritten by deleting old problems and adding new problems, selected from the latest examination papers of various universities, C.A., I.C.W.A., and other management courses. All the problems have been very carefully graded and answers to the problems are given at the end of each problem.

• An attempt has been made to rectify the errors in the last edition.

It is hoped that all these changes, additions and improvements will enhance the value of the book. We are confident that the book in its present form will prove to be of much greater utility to the students as well as teachers of the subject.

We express our deep sense of thanks and gratitude to our publishers, M/s Himalaya Publishing House, and type-setters, M/s Times Printographics, Darya Ganj, New Delhi, for their untiring efforts, unfailing courtesy, and co-operation in bringing out the book in such an elegant form.

We strongly believe that the road to improvement is never-ending. Suggestions and criticism for further improvement of the book will be very much appreciated and most gratefully acknowledged.

January, 2013
S.C. GUPTA
INDRA GUPTA

Preface to the First Edition

In ancient times Statistics was regarded only as the science of statecraft and was used to collect information relating to crimes, military strength, population, wealth, etc., for devising military and fiscal policies. But today, Statistics is not merely a by-product of the administrative set-up of the State; it embraces all sciences (social, physical and natural) and is finding numerous applications in diversified fields such as agriculture, industry, sociology, biometry, planning, economics, business, management, insurance, accountancy and auditing, and so on. Statistics (theory and methods) is used extensively by government, business and management organisations in planning future programmes and formulating policy decisions. It is almost impossible to think of any sphere of human activity where Statistics does not creep in. The subject of Statistics has made tremendous progress in the recent past, so much so that an elementary knowledge of statistical methods has become a part of the general education in the curricula of many academic and professional courses.

This book is a modest though determined bid to serve as a text-book for B.Com. (Pass and Hons.) and B.A. Economics (Hons.) courses of Indian Universities. The main aim in writing this book is to present a clear, simple, systematic and comprehensive exposition of the principles, methods and techniques of Statistics in various disciplines, with special reference to Economics and Business. The stress is on the applications of techniques and methods most commonly used by statisticians. Lucidity of style and simplicity of expression have been our twin objectives in preparing this text. Mathematical complexity has been avoided as far as possible. Wherever desirable, the notations and terminology have been clearly explained and then all the mathematical steps have been explained in detail.

An attempt has been made to start with the explanation of the elementary aspects of a topic, and then the complexities and intricacies of the advanced problems have been explained and solved in a lucid manner. A number of typical problems, mostly from various university examination papers, have been solved as illustrations so as to expose the students to different techniques of tackling the problems and enable them to have a better and thorough understanding of the basic concepts of the theory and its various applications.

At many places explanatory remarks have been given to widen readers’ horizons. Moreover, in order to enable the readers to have a proper appreciation of the subject-matter and to fortify their confidence in the understanding and application of methods, a large number of carefully graded problems, mostly drawn from various university examination papers, have been given as exercise sets in each chapter. Answers to all the problems in the exercise sets are given at the end of each problem.

The book contains 16 Chapters. We will not enumerate the topics discussed in the text since an idea of these can be obtained from a cursory glance at the table of contents. Chapters 1 to 11 are devoted to ‘Descriptive Statistics’, which consists in describing some characteristics like averages, dispersion, skewness, kurtosis, correlation, etc., of the numerical data. In spite of many recent developments in statistical techniques, the old topics like ‘Classification and Tabulation’ (Chapter 3) and ‘Diagrammatic and Graphic Representation’ (Chapter 4) have been discussed in detail, since they still constitute the bulk of statistical work in government and business organisations. The use of statistical methods as scientific tools in the analysis of economic and business data has been explained in Chapter 10 (Index Numbers) and Chapter 11 (Time Series Analysis).
Chapters 12 to 14 relate to advanced topics like Probability, Random Variable, Mathematical Expectation and Theoretical Distributions. An attempt has been made to give a detailed discussion of these topics on modern lines through the concepts of ‘Sample Space’ and ‘Axiomatic Approach’ in a very simple and lucid manner. Chapter 15 (Sampling and Design of Sample Surveys) explains the various techniques of planning and executing statistical enquiries so as to arrive at valid conclusions about the population. Chapter 16 (Interpolation and Extrapolation) deals with the techniques of estimating the value of a function y = f(x) for any given intermediate value of the variable x.

We must unreservedly acknowledge the deep debt of gratitude we owe to the numerous authors whose great and masterly works we have consulted during the preparation of the manuscript.

We take this opportunity to express our sincere gratitude to Prof. Kanwar Sen, Shri V.K. Kapoor and a number of students for their valuable help and suggestions in the preparation of this book.

Last but not least, we express our deep sense of gratitude to our publishers, M/s Himalaya Publishing House, for their untiring efforts, unfailing courtesy and co-operation in bringing out the book in time in such an elegant form.

Every effort has been made to avoid printing errors, though some might have crept in inadvertently. We shall be obliged if any such errors are brought to our notice. Valuable suggestions and criticism for the improvement of the book from our colleagues (who are teaching this course) and students will be highly appreciated and duly incorporated in subsequent editions.

June, 1988
S.C. GUPTA
Mrs. INDRA GUPTA

Contents

1. INTRODUCTION – MEANING AND SCOPE
1·1. ORIGIN AND DEVELOPMENT OF STATISTICS
1·2. DEFINITION OF STATISTICS
1·3. IMPORTANCE AND SCOPE OF STATISTICS
1·4. LIMITATIONS OF STATISTICS
1·5. DISTRUST OF STATISTICS
EXERCISE 1·1

2. COLLECTION OF DATA
2·1. INTRODUCTION
  2·1·1. Objectives and Scope of the Enquiry.
  2·1·2. Statistical Units to be Used.
  2·1·3. Sources of Information (Data).
  2·1·4. Methods of Data Collection.
  2·1·5. Degree of Accuracy Aimed at in the Final Results.
  2·1·6. Type of Enquiry.
2·2. PRIMARY AND SECONDARY DATA
  2·2·1. Choice Between Primary and Secondary Data.
2·3. METHODS OF COLLECTING PRIMARY DATA
  2·3·1. Direct Personal Investigation.
  2·3·2. Indirect Oral Investigation.
  2·3·3. Information Received Through Local Agencies.
  2·3·4. Mailed Questionnaire Method.
  2·3·5. Schedules Sent Through Enumerators.
2·4. DRAFTING OR FRAMING THE QUESTIONNAIRE
2·5. SOURCES OF SECONDARY DATA
  2·5·1. Published Sources.
  2·5·2. Unpublished Sources.
2·6. PRECAUTIONS IN THE USE OF SECONDARY DATA
EXERCISE 2·1

3. CLASSIFICATION AND TABULATION
3·1. INTRODUCTION – ORGANISATION OF DATA
3·2. CLASSIFICATION
  3·2·1. Functions of Classification.
  3·2·2. Rules for Classification.
  3·2·3. Bases of Classification.
3·3. FREQUENCY DISTRIBUTION
  3·3·1. Array
  3·3·2. Discrete or Ungrouped Frequency Distribution.
  3·3·3. Grouped Frequency Distribution.
  3·3·4. Continuous Frequency Distribution.
3·4. BASIC PRINCIPLES FOR FORMING A GROUPED FREQUENCY DISTRIBUTION
  3·4·1. Types of Classes.
  3·4·2. Number of Classes.
  3·4·3. Size of Class Intervals.
  3·4·4. Types of Class Intervals.
3·5. CUMULATIVE FREQUENCY DISTRIBUTION
  3·5·1. Less Than Cumulative Frequency.
  3·5·2. More Than Cumulative Frequency.
3·6. BIVARIATE FREQUENCY DISTRIBUTION
EXERCISE 3·1
3·7. TABULATION – MEANING AND IMPORTANCE
  3·7·1. Parts of a Table.
  3·7·2. Requisites of a Good Table.
  3·7·3. Types of Tabulation.
EXERCISE 3·2

4. DIAGRAMMATIC AND GRAPHIC REPRESENTATION
4·1. INTRODUCTION
4·2. DIFFERENCE BETWEEN DIAGRAMS AND GRAPHS
4·3. DIAGRAMMATIC PRESENTATION
  4·3·1. General Rules for Constructing Diagrams.
  4·3·2. Types of Diagrams.
  4·3·3. One-dimensional Diagrams.
  4·3·4. Two-dimensional Diagrams.
  4·3·5. Three-Dimensional Diagrams.
  4·3·6. Pictograms
  4·3·7. Cartograms
  4·3·8. Choice of a Diagram.
EXERCISE 4·1
4·4. GRAPHIC REPRESENTATION OF DATA
  4·4·1. Technique of Construction of Graphs.
  4·4·2. General Rules for Graphing.
  4·4·3. Graphs of Frequency Distributions.
  4·4·4. Graphs of Time Series or Historigrams.
  4·4·5. Semi-Logarithmic Line Graphs or Ratio Charts.
4·5. LIMITATIONS OF DIAGRAMS AND GRAPHS
EXERCISE 4·2

5. AVERAGES OR MEASURES OF CENTRAL TENDENCY
5·1. INTRODUCTION
5·2. REQUISITES OF A GOOD AVERAGE OR MEASURE OF CENTRAL TENDENCY
5·3. VARIOUS MEASURES OF CENTRAL TENDENCY
5·4. ARITHMETIC MEAN
  5·4·1. Step Deviation Method for Computing Arithmetic Mean.
  5·4·2. Mathematical Properties of Arithmetic Mean.
  5·4·3. Merits and Demerits of Arithmetic Mean.
5·5. WEIGHTED ARITHMETIC MEAN
EXERCISE 5·1
5·6. MEDIAN
  5·6·1. Calculation of Median.
  5·6·2. Merits and Demerits of Median.
  5·6·3. Partition Values.
  5·6·4. Graphic Method of Locating Partition Values.
EXERCISE 5·2
5·7. MODE
  5·7·1. Computation of Mode.
  5·7·2. Merits and Demerits of Mode.
  5·7·3. Graphic Location of Mode.
5·8. EMPIRICAL RELATION BETWEEN MEAN (M), MEDIAN (Md) AND MODE (Mo)
EXERCISE 5·3
5·9. GEOMETRIC MEAN
  5·9·1. Merits and Demerits of Geometric Mean.
  5·9·2. Compound Interest Formula.
  5·9·3. Average Rate of a Variable Which Increases by Different Rates at Different Periods.
  5·9·4. Wrong Observations and Geometric Mean.
  5·9·5. Weighted Geometric Mean.
5·10. HARMONIC MEAN
  5·10·1. Merits and Demerits of Harmonic Mean.
  5·10·2. Weighted Harmonic Mean.
5·11. RELATION BETWEEN ARITHMETIC MEAN, GEOMETRIC MEAN AND HARMONIC MEAN
5·12. SELECTION OF AN AVERAGE
5·13. LIMITATIONS OF AVERAGES
EXERCISE 5·4

6. MEASURES OF DISPERSION
6·1. INTRODUCTION AND MEANING
  6·1·1. Objectives or Significance of the Measures of Dispersion.
6·2. CHARACTERISTICS FOR AN IDEAL MEASURE OF DISPERSION
6·3. ABSOLUTE AND RELATIVE MEASURES OF DISPERSION
6·4. MEASURES OF DISPERSION
6·5. RANGE
  6·5·1. Merits and Demerits of Range.
  6·5·2. Uses.
6·6. QUARTILE DEVIATION OR SEMI INTER-QUARTILE RANGE
  6·6·1. Merits and Demerits of Quartile Deviation.
6·7. PERCENTILE RANGE
EXERCISE 6·1
6·8. MEAN DEVIATION OR AVERAGE DEVIATION
  6·8·1. Computation of Mean Deviation.
  6·8·2. Short-cut Method of Computing Mean Deviation.
  6·8·3. Merits and Demerits of Mean Deviation.
  6·8·4. Uses.
  6·8·5. Relative Measure of Mean Deviation.
EXERCISE 6·2
6·9. STANDARD DEVIATION
  6·9·1. Mathematical Properties of Standard Deviation.
  6·9·2. Merits and Demerits of Standard Deviation
  6·9·3. Variance and Mean Square Deviation.
  6·9·4. Different Formulae for Calculating Variance.
EXERCISE 6·3
6·10. STANDARD DEVIATION OF THE COMBINED SERIES
6·11. COEFFICIENT OF VARIATION
6·12. RELATIONS BETWEEN VARIOUS MEASURES OF DISPERSION
EXERCISE 6·4
6·13. LORENZ CURVE
EXERCISE 6·5
EXERCISE 6·6

7. SKEWNESS, MOMENTS AND KURTOSIS
7·1. INTRODUCTION
7·2. SKEWNESS
  7·2·1. Measures of Skewness.
  7·2·2. Karl Pearson’s Coefficient of Skewness.
EXERCISE 7·1
  7·2·3. Bowley’s Coefficient of Skewness.
  7·2·4. Kelly’s Measure of Skewness.
  7·2·5. Coefficient of Skewness based on Moments.
EXERCISE 7·2
7·3. MOMENTS
  7·3·1. Moments about Mean.
  7·3·2. Moments about Arbitrary Point A.
  7·3·3. Relation between Moments about Mean and Moments about Arbitrary Point ‘A’.
  7·3·4. Effect of Change of Origin and Scale on Moments about Mean.
  7·3·5. Sheppard’s Correction for Moments.
  7·3·6. Charlier Checks.
7·4. KARL PEARSON’S BETA (β) AND GAMMA (γ) COEFFICIENTS BASED ON MOMENTS
7·5. COEFFICIENT OF SKEWNESS BASED ON MOMENTS
7·6. KURTOSIS
EXERCISE 7·3

8. CORRELATION ANALYSIS
8·1. INTRODUCTION
  8·1·1. Types of Correlation
  8·1·2. Correlation and Causation.
8·2. METHODS OF STUDYING CORRELATION
8·3. SCATTER DIAGRAM METHOD
EXERCISE 8·1
8·4. KARL PEARSON’S COEFFICIENT OF CORRELATION (COVARIANCE METHOD)
  8·4·1. Properties of Correlation Coefficient
  8·4·2. Assumptions Underlying Karl Pearson’s Correlation Coefficient.
  8·4·3. Interpretation of r.
8·5. PROBABLE ERROR
EXERCISE 8·2
8·6. CORRELATION IN BIVARIATE FREQUENCY TABLE
EXERCISE 8·3
8·7. RANK CORRELATION METHOD
  8·7·1. Limits for ρ.
  8·7·2. Computation of Rank Correlation Coefficient (ρ).
  8·7·3. Remarks on Spearman’s Rank Correlation Coefficient
EXERCISE 8·4
8·8. METHOD OF CONCURRENT DEVIATIONS
EXERCISE 8·5
8·9. COEFFICIENT OF DETERMINATION
EXERCISE 8·6
8·10. LAG AND LEAD CORRELATION

9. LINEAR REGRESSION ANALYSIS
9·1. INTRODUCTION
9·2. LINEAR AND NON-LINEAR REGRESSION
9·3. LINES OF REGRESSION
  9·3·1. Derivation of Line of Regression of y on x.
  9·3·2. Line of Regression of x on y.
  9·3·3. Angle Between the Regression Lines.
9·4. COEFFICIENTS OF REGRESSION
  9·4·1. Theorems on Regression Coefficients
EXERCISE 9·1
9·5. TO FIND THE MEAN VALUES (X̄, Ȳ) FROM THE TWO LINES OF REGRESSION
9·6. TO FIND THE REGRESSION COEFFICIENTS AND THE CORRELATION COEFFICIENT FROM THE TWO LINES OF REGRESSION
9·7. STANDARD ERROR OF AN ESTIMATE
9·8. REGRESSION EQUATIONS FOR A BIVARIATE FREQUENCY TABLE
9·9. CORRELATION ANALYSIS vs. REGRESSION ANALYSIS
EXERCISE 9·2
EXERCISE 9·3

10. INDEX NUMBERS
10·1. INTRODUCTION
10·2. USES OF INDEX NUMBERS
10·3. TYPES OF INDEX NUMBERS
10·4. PROBLEMS IN THE CONSTRUCTION OF INDEX NUMBERS
10·5. METHODS OF CONSTRUCTING INDEX NUMBERS
  10·5·1. Simple (Unweighted) Aggregate Method.
  10·5·2. Weighted Aggregate Method.
  10·5·3. Simple Average of Price Relatives.
  10·5·4. Weighted Average of Price Relatives
EXERCISE 10·1
10·6. TESTS OF CONSISTENCY OF INDEX NUMBER FORMULAE
  10·6·1. Unit Test.
  10·6·2. Time Reversal Test.
  10·6·3. Factor Reversal Test.
  10·6·4. Circular Test.
EXERCISE 10·2
10·7. CHAIN INDICES OR CHAIN BASE INDEX NUMBERS
  10·7·1. Uses of Chain Base Index Numbers.
  10·7·2. Limitations of Chain Base Index Numbers.
EXERCISE 10·3
10·8. BASE SHIFTING, SPLICING AND DEFLATING OF INDEX NUMBERS
  10·8·1. Base Shifting.
  10·8·2. Splicing.
  10·8·3. Deflating of Index Numbers.
EXERCISE 10·4
10·9. COST OF LIVING INDEX NUMBER
  10·9·1. Main Steps in the Construction of Cost of Living Index Numbers
  10·9·2. Construction of Cost of Living Index Numbers.
  10·9·3. Uses of Cost of Living Index Numbers.
10·10. LIMITATIONS OF INDEX NUMBERS
EXERCISE 10·5

11. TIME SERIES ANALYSIS
11·1. INTRODUCTION
11·2. COMPONENTS OF A TIME SERIES
  11·2·1. Secular Trend.
  11·2·2. Short-Term Variations.
  11·2·3. Random or Irregular Variations.
11·3. ANALYSIS OF TIME SERIES
11·4. MATHEMATICAL MODELS FOR TIME SERIES
11·5. MEASUREMENT OF TREND
  11·5·1. Graphic or Free Hand Curve Fitting Method.
  11·5·2. Method of Semi-Averages.
  11·5·3. Method of Curve Fitting by the Principle of Least Squares.
  11·5·4. Conversion of Trend Equation.
  11·5·5. Selection of the Type of Trend.
EXERCISE 11·1
  11·5·6. Method of Moving Averages.
EXERCISE 11·2
11·6. MEASUREMENT OF SEASONAL VARIATIONS
  11·6·1. Method of Simple Averages.
  11·6·2. Ratio to Trend Method.
  11·6·3. ‘Ratio to Moving Average’ Method.
  11·6·4. Method of Link Relatives.
  11·6·5. Deseasonalisation of Data.
11·7. MEASUREMENT OF CYCLICAL VARIATIONS
11·8. MEASUREMENT OF IRREGULAR VARIATIONS
11·9. TIME SERIES ANALYSIS IN FORECASTING
EXERCISE 11·3
EXERCISE 11·4

12. THEORY OF PROBABILITY
12·1. INTRODUCTION
12·2. SHORT HISTORY
12·3. TERMINOLOGY
12·4. MATHEMATICAL PRELIMINARIES
  12·4·1. Set Theory.
  12·4·2. Permutation and Combination.
12·5. MATHEMATICAL OR CLASSICAL OR ‘A PRIORI’ PROBABILITY
12·6. STATISTICAL OR EMPIRICAL PROBABILITY
EXERCISE 12·1
12·7. AXIOMATIC PROBABILITY
12·8. ADDITION THEOREM OF PROBABILITY
  12·8·1. Addition Theorem of Probability for Mutually Exclusive Events.
  12·8·2. Generalisation of Addition Theorem of Probability.
12·9. THEOREM OF COMPOUND PROBABILITY OR MULTIPLICATION THEOREM OF PROBABILITY
  12·9·1. Generalisation of Multiplication Theorem of Probability.
  12·9·2. Independent Events.
  Multiplication Theorem for Independent Events.
EXERCISE 12·2
OBJECTIVE TYPE QUESTIONS
12·10. INVERSE PROBABILITY
  Bayes’s Theorem (Rule for the Inverse Probability)
EXERCISE 12·3

13. RANDOM VARIABLE, PROBABILITY DISTRIBUTIONS AND MATHEMATICAL EXPECTATION
13·1. RANDOM VARIABLE
13·2. PROBABILITY DISTRIBUTION OF A DISCRETE RANDOM VARIABLE
13·3. PROBABILITY DISTRIBUTION OF A CONTINUOUS RANDOM VARIABLE
  13·3·1. Probability Density Function (p.d.f.) of Continuous Random Variable
13·4. DISTRIBUTION FUNCTION OR CUMULATIVE PROBABILITY FUNCTION
13·5. MOMENTS
EXERCISE 13·1
13·6. MATHEMATICAL EXPECTATION
  Physical Interpretation of E(X).
13·7. THEOREMS ON EXPECTATION
13·8. VARIANCE OF X IN TERMS OF EXPECTATION
13·9. COVARIANCE IN TERMS OF EXPECTATION
13·10. VARIANCE OF LINEAR COMBINATION
13·11. JOINT AND MARGINAL PROBABILITY DISTRIBUTIONS
EXERCISE 13·2

14. THEORETICAL DISTRIBUTIONS
14·1. INTRODUCTION
14·2. BINOMIAL DISTRIBUTION
  14·2·1. Probability Function of Binomial Distribution.
  14·2·2. Constants of Binomial Distribution
  14·2·3. Mode of Binomial Distribution.
  14·2·4. Fitting of Binomial Distribution.
EXERCISE 14·1
14·3. POISSON DISTRIBUTION (AS A LIMITING CASE OF BINOMIAL DISTRIBUTION)
  14·3·1. Utility or Importance of Poisson Distribution.
  14·3·2. Constants of Poisson Distribution
  14·3·3. Mode of Poisson Distribution
  14·3·4. Fitting of Poisson Distribution.
EXERCISE 14·2
14·4. NORMAL DISTRIBUTION
  14·4·1. Equation of Normal Probability Curve.
  14·4·2. Standard Normal Distribution.
  14·4·3. Relation between Binomial and Normal Distributions.
  14·4·4. Relation between Poisson and Normal Distributions.
  14·4·5. Properties of Normal Distribution.
  14·4·6. Areas Under Standard Normal Probability Curve
  14·4·7. Importance of Normal Distribution.
EXERCISE 14·3

15. SAMPLING THEORY AND DESIGN OF SAMPLE SURVEYS
15·1. INTRODUCTION
15·2. UNIVERSE OR POPULATION
15·3. SAMPLING
15·4. PARAMETER AND STATISTIC
  15·4·1. Sampling Distribution.
  15·4·2. Standard Error.
15·5. PRINCIPLES OF SAMPLING
  15·5·1. Law of Statistical Regularity.
  15·5·2. Principle of Inertia of Large Numbers.
  15·5·3. Principle of Persistence of Small Numbers.
  15·5·4. Principle of Validity.
  15·5·5. Principle of Optimisation.
15·6. CENSUS VERSUS SAMPLE ENUMERATION
15·7. LIMITATIONS OF SAMPLING
15·8. PRINCIPAL STEPS IN A SAMPLE SURVEY
15·9. ERRORS IN STATISTICS
  15·9·1. Sampling and Non-Sampling Errors.
  15·9·2. Biased and Unbiased Errors.
  15·9·3. Measures of Statistical Errors (Absolute and Relative Errors).
15·10. TYPES OF SAMPLING
  15·10·1. Purposive or Subjective or Judgment Sampling.
  15·10·2. Probability Sampling.
  15·10·3. Mixed Sampling.
15·11. SIMPLE RANDOM SAMPLING
  15·11·1. Selection of a Simple Random Sample.
  15·11·2. Sampling Distribution of Mean
  15·11·3. Merits and Limitations of Simple Random Sampling
15·12. STRATIFIED RANDOM SAMPLING
  15·12·1. Allocation of Sample Size in Stratified Sampling.
  15·12·2. Merits and Demerits of Stratified Random Sampling.
15·13. SYSTEMATIC SAMPLING
  15·13·1. Merits and Demerits
15·14. CLUSTER SAMPLING
15·15. MULTISTAGE SAMPLING
15·16. QUOTA SAMPLING
EXERCISE 15·1

16. INTERPOLATION AND EXTRAPOLATION
16·1. INTRODUCTION
  16·1·1. Assumptions.
  16·1·2. Uses of Interpolation.
16·2. METHODS OF INTERPOLATION
16·3. GRAPHIC METHOD
16·4. ALGEBRAIC METHOD
16·5. METHOD OF PARABOLIC CURVE FITTING
16·6. METHOD OF FINITE DIFFERENCES
16·7. NEWTON’S FORWARD DIFFERENCE FORMULA
16·8. NEWTON’S BACKWARD DIFFERENCE FORMULA
EXERCISE 16·1
16·9. BINOMIAL EXPANSION METHOD FOR INTERPOLATING MISSING VALUES
EXERCISE 16·2
16·10. INTERPOLATION WITH ARGUMENTS AT UNEQUAL INTERVALS
16·11. DIVIDED DIFFERENCES
  16·11·1. Newton’s Divided Difference Formula.
16·12. LAGRANGE’S FORMULA
16·13. INVERSE INTERPOLATION
EXERCISE 16·3

17. INTERPRETATION OF DATA AND STATISTICAL FALLACIES
17·1. INTRODUCTION
17·2. INTERPRETATION OF DATA AND STATISTICAL FALLACIES – MEANING AND NEED
17·3. FACTORS LEADING TO MIS-INTERPRETATION OF DATA OR STATISTICAL FALLACIES
  17·3·1. Bias.
  17·3·2. Inconsistencies in Definitions.
  17·3·3. Faulty Generalisations.
  17·3·4. Inappropriate Comparisons.
  17·3·5. Wrong Interpretation of Statistical Measures.
  17·3·6. (a) Wrong Interpretation of Index Numbers.
  17·3·6. (b) Wrong Interpretation of Components of Time Series – (Trend, Seasonal and Cyclical Variations).
  17·3·7. Technical Errors
17·4. EFFECT OF WRONG INTERPRETATION OF DATA – DISTRUST OF STATISTICS
EXERCISE 17·1

18. STATISTICAL DECISION THEORY
18·1. INTRODUCTION
18·2. INGREDIENTS OF DECISION PROBLEM
  18·2·1. Acts.
  18·2·2. States of Nature or Events.
  18·2·3. Payoff Table.
  18·2·4. Opportunity Loss (O.L.).
  18·2·5. Decision Making Environment
  18·2·6. Decision Making Under Certainty.
  18·2·7. Decision Making Under Uncertainty.
18·3. OPTIMAL DECISION
  18·3·1. Maximax Criterion.
  18·3·2. Maximin Criterion.
  18·3·3. Minimax Criterion.
  18·3·4. Laplace Criterion of Equal Likelihoods.
  18·3·5. Hurwicz Criterion of Realism.
  18·3·6. Expected Monetary Value (EMV).
  18·3·7. Expected Opportunity Loss (EOL) Criterion.
  18·3·8. Expected Value of Perfect Information (EVPI).
18·4. DECISION TREE
  18·4·1. Roll Back Technique of Analysing a Decision Tree.
EXERCISE 18·1

19. THEORY OF ATTRIBUTES
19·1. INTRODUCTION
19·2. NOTATIONS
19·3. CLASSES AND CLASS FREQUENCIES
  19·3·1. Order of Classes and Class Frequencies.
  19·3·2. Ultimate Class Frequency.
  19·3·3. Relation Between Class Frequencies.
EXERCISE 19·1
19·4. INCONSISTENCY OF DATA
  19·4·1. Conditions for Consistency of Data.
  19·4·2. Incomplete Data.
EXERCISE 19·2
19·5. INDEPENDENCE OF ATTRIBUTES
  19·5·1. Criteria of Independence of Two Attributes.
19·6. ASSOCIATION OF ATTRIBUTES
  19·6·1. (Criterion 1). Proportion Method.
  19·6·2. (Criterion 2). Comparison of Observed and Expected Frequencies.
  19·6·3. (Criterion 3). Yule’s Coefficient of Association.
  19·6·4. (Criterion 4). Coefficient of Colligation.
EXERCISE 19·3

Appendix I : NUMERICAL TABLES
Appendix II : BIBLIOGRAPHY
INDEX

1 Introduction – Meaning & Scope

1·1. ORIGIN AND DEVELOPMENT OF STATISTICS

The subject of Statistics is not a new discipline; it is as old as human society itself. It has been in use since the earliest days of human society, although the sphere of its utility was very much restricted. In the old days, Statistics was regarded as the ‘Science of Statecraft’ and was the by-product of the administrative activity of the State. The word Statistics seems to have been derived from the Latin word ‘status’ or the Italian word ‘statista’ or the German word ‘statistik’ or the French word ‘statistique’, each of which means a political state. In ancient times the scope of Statistics was primarily limited to the collection of the following data by governments for framing military and fiscal policies :

(i) Age and sex-wise population of the country ;
(ii) Property and wealth of the country ;

the former enabling the government to have an idea of the manpower of the country (in order to safeguard itself against any outside aggression) and the latter providing it with information for the introduction of new taxes and levies.

Perhaps one of the earliest censuses of population and wealth was conducted by the Pharaohs (Emperors) of Egypt in connection with the construction of the famous ‘Pyramids’.
Such censuses were later held in England, Germany and other western countries in the middle ages. In India, an efficient system of collecting official and administrative statistics existed even 2000 years ago, in particular during the reign of Chandragupta Maurya (324 – 300 B.C.). Historical evidence of a very good system of collecting vital statistics and registering births and deaths even before 300 B.C. is available in Kautilya’s ‘Arthashastra’. The records of land, agriculture and wealth statistics were maintained by Todar Mal, the land and revenue minister in the reign of Akbar (1556 – 1605 A.D.). A detailed account of the administrative and statistical surveys conducted during Akbar’s reign is available in the book ‘Ain-e-Akbari’ written by Abul Fazl (in 1596 – 97), one of the nine gems of Akbar.

In Germany, the systematic collection of official statistics originated towards the end of the 18th century when, in order to have an idea of the relative strength of different German States, information regarding population and output (industrial and agricultural) was collected.

In England, statistics were the outcome of the Napoleonic wars. The wars necessitated the systematic collection of numerical data to enable the government to assess revenues and expenditure with greater precision and then to levy new taxes in order to meet the cost of war.

The sixteenth century saw the application of Statistics to the collection of data relating to the movements of heavenly bodies (stars and planets) to determine their positions and to predict eclipses. J. Kepler made a detailed study of the information collected by Tycho Brahe (1546 – 1601) regarding the movements of the planets and formulated his famous three laws relating to the movements of heavenly bodies. These laws paved the way for the discovery of Newton’s law of gravitation.

The seventeenth century witnessed the origin of Vital Statistics.
Captain John Graunt of London (1620 – 1674), known as the Father of Vital Statistics, was the first man to make a systematic study of birth and death statistics. Important contributions in this field were also made by prominent persons like Caspar Neumann (in 1691), Sir William Petty (1623 – 1687), James Dodson, Thomas Simpson and Dr. Price. The computation of mortality tables and the calculation of expectation of life at different ages by these persons led to the idea of ‘Life Insurance’, and a Life Insurance Institution was founded in London in 1698. William Petty wrote the book ‘Essay on Political Arithmetic’. In those days Statistics was regarded as Political Arithmetic. This concept of Statistics as Political Arithmetic continued even in the early 18th century when J.P. Süssmilch (1707 – 1767), a Prussian clergyman, formulated his doctrine that the ratio of births and deaths remains more or less constant and gave a statistical explanation to the theory of ‘Natural Order of Physiocratic School’.

The backbone of the so-called modern theory of Statistics is the ‘Theory of Probability’ or the ‘Theory of Games and Chance’, which was developed in the mid-seventeenth century. The theory of probability grew out of the prevalence of gambling among the nobles of England and France and their attempts to estimate the chances of winning or losing, the chief contributors being the mathematicians and gamblers of France, Germany and England. Two French mathematicians, Pascal (1623 – 1662) and P. Fermat (1601 – 1665), after a lengthy correspondence between themselves ultimately succeeded in solving the famous ‘Problem of Points’ posed by the French gambler Chevalier de Méré, and this correspondence laid the foundation stone of the science of probability. The next stalwart in this field was J.
Bernoulli (1654 – 1705), whose great treatise on probability ‘Ars Conjectandi’ was published posthumously in 1713, eight years after his death, by his nephew Nicolaus Bernoulli (1687 – 1759). This contained the famous ‘Law of Large Numbers’ which was later discussed by Poisson, Khinchine and Kolmogorov. De Moivre (1667 – 1754) also contributed a lot in this field; he published his famous ‘Doctrine of Chances’ in 1718 and also discovered the Normal probability curve, which is one of the most important contributions in Statistics. Other important contributors in this field are Pierre Simon de Laplace (1749 – 1827), who published his monumental work on probability, ‘Théorie Analytique des Probabilités’, in 1812; Gauss (1777 – 1855), who gave the principle of Least Squares and established the ‘Normal Law of Errors’ independently of De Moivre; L.A.J. Quetelet (1796 – 1874), who discovered the principle of ‘Constancy of Great Numbers’ which forms the basis of sampling; and Euler, Lagrange, Bayes, etc.

Russian mathematicians have also made outstanding contributions to the modern theory of probability. To mention only a few of the main contributors : Chebychev (1821 – 1894), who founded the Russian School of Statisticians ; A. Markov (Markov Chains) ; Liapounoff (Central Limit Theorem) ; A. Khinchine (Law of Large Numbers) ; A. Kolmogorov (who axiomatised the calculus of probability) ; Smirnov, Gnedenko and so on.

Modern stalwarts in the development of the subject of Statistics are Englishmen who did pioneering work in the application of Statistics to different disciplines. Francis Galton (1822 – 1911) pioneered the study of ‘Regression Analysis’ in Biometry; Karl Pearson (1857 – 1936), who founded the greatest statistical laboratory in England, pioneered the study of ‘Correlation Analysis’. His Chi-Square test (χ²-test) of Goodness of Fit is the first and most important of the tests of significance in Statistics ; W.S.
Gosset with his t-test ushered in an era of exact (small) sample tests. Perhaps most of the work in the statistical theory during the past few decades can be attributed to a single person Sir Ronald A. Fisher (1890 – 1962) who applied Statistics to a variety of diversified fields such as genetics, biometry, psychology and education, agriculture, etc., and who is rightly termed as the Father of Statistics. In addition to enhancing the existing statistical theory he is the pioneer in Estimation Theory (Point Estimation and Fiducial Inference); Exact (small) Sampling Distributions ; Analysis of Variance and Design of Experiments. His contributions to the subject of Statistics are described by one writer in the following words : “R.A. Fisher is the real giant in the development of the theory of Statistics.” It is only the varied and outstanding contributions of R.A. Fisher that put the subject of Statistics on a very firm footing and earned for it the status of a full-fledged science. Indian statisticians also did not lag behind in making significant contributions to the development of Statistics in various diversified fields. The valuable contributions of C.R. Rao (Statistical Inference); Parthasarathy (Theory of Probability); P.C. Mahalanobis and P.V. Sukhatme (Sample Surveys) ; S.N. Roy (Multivariate Analysis) ; R.C. Bose, K.R. Nair, J.N. Srivastava (Design of Experiments), to mention only a few, have placed India’s name in the world map of Statistics. 1·2. DEFINITION OF STATISTICS Statistics has been defined differently by different writers from time to time so much so that scholarly articles have collected together hundreds of definitions, emphasizing precisely the meaning, scope and limitations of the subject. 
The reasons for such a variety of definitions may be broadly classified as follows :

(i) The field of utility of Statistics has been increasing steadily and thus different people defined it differently according to the developments of the subject. In the old days, Statistics was regarded as the ‘science of statecraft’ but today it embraces almost every sphere of natural and human activity. Accordingly, the old definitions, which were confined to a very limited and narrow field of enquiry, were replaced by new definitions which are more exhaustive and elaborate in approach.

(ii) The word Statistics has been used to convey different meanings in the singular and the plural sense. When used in the plural, statistics means a set of numerical data, and when used in the singular sense it means the science of statistical methods embodying the theory and techniques used for collecting, analysing and drawing inferences from numerical data.

It is practically impossible to enumerate all the definitions given to Statistics, both as ‘Numerical Data’ and as ‘Statistical Methods’, due to limitations of space. However, we give below some selected definitions.

WHAT THEY SAY ABOUT STATISTICS – SOME DEFINITIONS
“STATISTICS AS NUMERICAL DATA”

1. “Statistics are the classified facts representing the conditions of the people in a State…specially those facts which can be stated in number or in tables of numbers or in any tabular or classified arrangement.” – Webster.
2. “Statistics are numerical statements of facts in any department of enquiry placed in relation to each other.” – Bowley.
3. “By statistics we mean quantitative data affected to a marked extent by multiplicity of causes.” – Yule and Kendall.
4. “Statistics may be defined as the aggregate of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner, for a predetermined purpose and placed in relation to each other.” – Prof. Horace Secrist.

Remarks and Comments.

1. According to Webster’s definition only numerical facts can be termed Statistics. Moreover, it restricts the domain of Statistics to the affairs of a State i.e., to social sciences. This is a very old and narrow definition and is inadequate for modern times since today Statistics embraces all sciences : social, physical and natural.

2. Bowley’s definition is more general than Webster’s since it relates to numerical data in any department of enquiry. Moreover, it also provides for comparative study of the figures, as against the mere classification and tabulation of Webster’s definition.

3. Yule and Kendall’s definition refers to numerical data affected by a multiplicity of causes. This is usually the case in social, economic and business phenomena. For example, the prices of a particular commodity are affected by a number of factors viz., supply, demand, imports, exports, money in circulation, competitive products in the market and so on. Similarly, the yield of a particular crop depends upon a multiplicity of factors like quality of seed, fertility of soil, method of cultivation, irrigation facilities, weather conditions, fertilizer used and so on.

4. Secrist’s definition seems to be the most exhaustive of all the four. Let us examine it in detail.

(i) Aggregate of Facts. Single or isolated items cannot be termed Statistics unless they are a part of an aggregate of facts relating to a particular field of enquiry. For instance, the height of an individual or the price of a particular commodity does not form Statistics, as such figures are unrelated and not comparable.
However, an aggregate of the figures of births, deaths, sales, purchases, production, profits, etc., over different times, places, etc., will constitute Statistics.

(ii) Affected by Multiplicity of Causes. Numerical figures should be affected by a multiplicity of factors. This point has already been elaborated in remark 3 above. In physical sciences, it is possible to isolate the effect of various factors on a single item but it is very difficult to do so in social sciences, particularly when the effect of some of the factors cannot be measured quantitatively. However, statistical techniques have been devised to study the joint effect of a number of factors on a single item (Multiple Correlation) or the isolated effect of a single factor on the given item (Partial Correlation), provided the effect of each of the factors can be measured quantitatively.

(iii) Numerically Expressed. Only numerical data constitute Statistics. Thus statements like ‘the standard of living of the people in Delhi has improved’ or ‘the production of a particular commodity is increasing’ do not constitute Statistics. In particular, qualitative characteristics which cannot be measured quantitatively, such as intelligence, beauty, honesty, etc., cannot be termed Statistics unless they are numerically expressed by assigning particular scores as quantitative standards. For example, intelligence is not Statistics but the intelligence quotients, which may be interpreted as a quantitative measure of the intelligence of individuals, could be regarded as Statistics.

(iv) Enumerated or Estimated According to Reasonable Standard of Accuracy. The numerical data pertaining to any field of enquiry can be obtained by completely enumerating the underlying population. In such a case the data will be exact and accurate (but for the errors of measurement, personal bias, etc.).
However, if complete enumeration of the underlying population is not possible (e.g., if population is infinite, or if testing is destructive i.e., if the item is destroyed in the course of inspection just like in testing explosives, light bulbs, etc.), and even if possible it may not be practicable due to certain reasons (such as population being very large, high cost of enumeration per unit and our resources being limited in terms of time and money, etc.), then the data are estimated by using the powerful techniques of Sampling and Estimation theory. However, the estimated values will not be as precise and accurate as the actual values. The degree of accuracy of the estimated values largely depends on the nature and purpose of the enquiry. For example, while measuring the heights of individuals accuracy will be aimed in terms of fractions of an inch whereas while measuring distance between two places it may be in terms of metres and if the places are very distant, e.g., say Delhi and London, the difference of few kilometres may be ignored. However, certain standards of accuracy must be maintained for drawing meaningful conclusions. (v) Collected in a Systematic Manner. The data must be collected in a very systematic manner. Thus, for any socio-economic survey, a proper schedule depending on the object of enquiry should be prepared and trained personnel (investigators) should be used to collect the data by interviewing the persons. An attempt should be made to reduce the personal bias to the minimum. Obviously, the data collected in a haphazard way will not conform to the reasonable standards of accuracy and the conclusions based on them might lead to wrong or misleading decisions. (vi) Collected for a Pre-determined Purpose. It is of utmost importance to define in clear and concrete terms the objectives or the purpose of the enquiry and the data should be collected keeping in view these objectives. 
An attempt should not be made to collect too many data, some of which are never examined or analysed i.e., we should not waste time in collecting information which is irrelevant to our enquiry. It should also be ensured that no essential data are omitted. For example, if the purpose of the enquiry is to measure the cost of living index for low income group people, we should select only those commodities or items which are consumed or utilised by persons belonging to this group. Thus for such an index, the collection of data on commodities like scooters, cars, refrigerators, television sets, high quality cosmetics, etc., will be absolutely useless.

(vii) Comparable. From the practical point of view, for statistical analysis the data should be comparable. They may be compared with respect to some unit, generally time (period) or place. For example, the data relating to the population of a country for different years or the population of different countries in some fixed year constitute Statistics, since they are comparable. However, the data relating to the size of the shoe of an individual and his intelligence quotient (I.Q.) do not constitute Statistics as they are not comparable. In order to make valid comparisons the data should be homogeneous i.e., they should relate to the same phenomenon or subject.

5. From the definition of Horace Secrist and its discussion in remark 4 above, we may conclude that : “All Statistics are numerical statements of facts but all numerical statements of facts are not Statistics”.

6. We give below the definitions of Statistics used in the singular sense i.e., Statistics as Statistical Methods.

WHAT THEY SAY ABOUT STATISTICS – SOME DEFINITIONS
“STATISTICS AS STATISTICAL METHODS”

1. “Statistics may be called the science of counting.” – Bowley, A.L.
2. “Statistics may rightly be called the science of averages.” – Bowley, A.L.
3. “Statistics is the science of the measurement of social organism, regarded as a whole in all its manifestations.” – Bowley, A.L.
4. “Statistics is the science of estimates and probabilities.” – Boddington
5. “The science of Statistics is the method of judging collective, natural or social phenomenon from the results obtained from the analysis or enumeration or collection of estimates.” – King
6. “Statistics is the science which deals with classification and tabulation of numerical facts as the basis for explanation, description and comparison of phenomenon.” – Lovitt
7. “Statistics is the science which deals with the methods of collecting, classifying, presenting, comparing and interpreting numerical data collected to throw some light on any sphere of enquiry.” – Seligman
8. “Statistics may be defined as the science of collection, presentation, analysis and interpretation of numerical data.” – Croxton and Cowden
9. “Statistics may be regarded as a body of methods for making wise decisions in the face of uncertainty.” – Wallis and Roberts
10. “Statistics is a method of decision making in the face of uncertainty on the basis of numerical data and calculated risks.” – Prof. Ya-Lun-Chou
11. “The science and art of handling aggregate of facts – observing, enumeration, recording, classifying and otherwise systematically treating them.” – Harlow

Some Comments and Remarks.

1. The first three definitions due to Bowley are inadequate.

2. Boddington’s definition also fails to describe the meaning and functions of Statistics since it is confined to probabilities and estimates, which form only a part of the modern statistical tools and do not describe the science of Statistics in all its manifestations.

3. King’s definition is also inadequate since it confines Statistics only to social sciences. Lovitt’s definition is fairly satisfactory, though incomplete. Seligman’s definition, though very short and simple, is quite comprehensive.
However, the best of all the above definitions seems to be one given by Croxton and Cowden. 4. Wallis and Roberts’ definition is quite modern since statistical methods enable us to arrive at valid decisions. Prof. Chou’s definition in number 10 is a modified form of this definition. 5. Harlow’s definition describes Statistics both as a science and an art—science, since it provides tools and laws for the analysis of the numerical information collected from the source of enquiry and art, since it undeniably has its basis upon numerical data collected with a view to maintain a particular balance and consistency leading to perfect or nearly perfect conclusions. A statistician like an artist will fail in his job if he does not possess the requisite skill, experience and patience while using statistical tools for any problem. 1·3. IMPORTANCE AND SCOPE OF STATISTICS In the ancient times Statistics was regarded only as the science of Statecraft and was used to collect information relating to crimes, military strength, population, wealth, etc., for devising military and fiscal policies. But with the concept of Welfare State taking roots almost all over the world, the scope of Statistics has widened to social and economic phenomenon. Moreover, with the developments in the statistical techniques during the last few decades, today, Statistics is viewed not only as a mere device for collecting numerical data but as a means of sound techniques for their handling and analysis and drawing valid inferences from them. Accordingly, it is not merely a by-product of the administrative set up of the State but it embraces all sciences—social, physical, and natural, and is finding numerous applications in various diversified fields such as agriculture, industry, sociology, biometry, planning, economics, business, management, psychometry, insurance, accountancy and auditing, and so on. It is rather impossible to think of any sphere of human activity where Statistics does not creep in. 
It will be no exaggeration to say that Statistics has assumed unprecedented dimensions these days and statistical thinking is becoming more and more indispensable every day for able citizenship. In fact, to a very striking degree, modern culture has become a statistical culture, and the subject of Statistics has made such tremendous progress in the recent past that an elementary knowledge of statistical methods has become a part of general education in the curricula of many universities all over the world. The importance of Statistics is amply explained in the following words of Carroll D. Wright (1887), United States Commissioner of the Bureau of Labour :

“To a very striking degree our culture has become a Statistical culture. Even a person who may never have heard of an index number is affected…by … of those index numbers which describe the cost of living. It is impossible to understand Psychology, Sociology, Economics, Finance or a Physical Science without some general idea of the meaning of an average, of variation, of concomitance, of sampling, of how to interpret charts and tables.”

There is no ground for misgivings regarding the practical realisation of the dream of H.G. Wells viz., “Statistical thinking will one day be as necessary for effective citizenship as the ability to read and write.” Statistics has become so indispensable in all phases of human endeavour that it is often remarked, “Statistics is what statisticians do”, and it appears that Bowley was right when he said, “A knowledge of Statistics is like a knowledge of foreign language or of algebra; it may prove of use at any time under any circumstances.”

Let us now discuss briefly the importance of Statistics in some different disciplines.

Statistics in Planning. Statistics is indispensable in planning, be it at the business, economic or government level.
The modern age is termed as the ‘age of planning’ and almost all organisations in the government or business or management are resorting to planning for efficient working and for formulating policy decisions. To achieve this end, the statistical data relating to production, consumption, prices, investment, income, expenditure and so on and the advanced statistical techniques such as index numbers, time series analysis, demand analysis and forecasting techniques for handling such data are of paramount importance. Today efficient planning is a must for almost all countries, particularly the developing economies for their economic development and in order that planning is successful, it must be based on a correct and sound analysis of complex statistical data. For instance, in formulating a five-year plan, the government must have an idea of the age and sex-wise break up of the population projections of the country for the next five years in order to develop its various sectors like agriculture, industry, textiles, education and so on. This is achieved through the powerful statistical tool of forecasting by making use of the population data for the previous years. Even for making decisions concerning the day to day policy of the country, an accurate statistical knowledge of the age and sex-wise composition of the population is imperative for the government. In India, the use of Statistics in planning was well visualised long back and the National Sample Survey (N.S.S.) was primarily set up in 1950 for the collection of statistical data for planning in India. Statistics in State. As has already been pointed out, in the old days Statistics was the science of Statecraft and its objective was to collect data relating to manpower, crimes, income and wealth, etc., for formulating suitable military and fiscal policies. 
With the inception of the idea of Welfare State and its taking deep roots in almost all the countries, today statistical data relating to prices, production, consumption, income and expenditure, investments and profits, etc., and statistical tools of index numbers, time series analysis, demand analysis, forecasting, etc., are extensively used by the governments in formulating economic policies. (For details see Statistics in Economics). Moreover as pointed out earlier (Statistics in planning), statistical data and techniques are indispensable to the government for planning future economic programmes. The study of population movement i.e., population estimates, population projections and other allied studies together with birth and death statistics according to age and sex distribution provide any administration with fundamental tools which are indispensable for overall planning and evaluation of economic and social development programmes. The facts and figures relating to births, deaths and marriages are of extreme importance to various official agencies for a variety of administrative purposes. Mortality (death) statistics serve as a guide to the health authorities for sanitary improvements, improved medical facilities and public cleanliness. The data on the incidence of diseases together with the number of deaths by age and nature of diseases are of paramount importance to health authorities in taking appropriate remedial action to prevent or control the spread of the disease. The use of statistical data and statistical techniques is so wide in government functioning that today, almost all ministries and the departments in the government have a separate statistical unit. In fact, today, in most countries the State (government) is the single unit which is the biggest collector and user of statistical data. 
In addition to the various statistical bureaux in all the ministries and the government departments in the Centre and the States, the main Statistical Agencies in India are the Central Statistical Organisation (C.S.O.) ; the National Sample Survey (N.S.S.), now called the National Sample Survey Organisation (N.S.S.O.) ; and the Registrar General of India (R.G.I.).

Statistics in Economics. In the old days, Economic Theories were based on deductive logic only. Moreover, the statistical techniques were not sufficiently advanced for applications in other disciplines. It gradually dawned upon economists of the Deductive School to use Statistics effectively by making empirical studies. In 1871, W.S. Jevons wrote that :

“The deductive science of economy must be verified and rendered useful from the purely inductive science of Statistics. Theory must be invested with the reality of life and fact.”

These views were supported by Roscher, Knies and Hildebrand of the Historical School (1843 – 1883), Alfred Marshall, Pareto and Lord Keynes. The following quotation due to Prof. Alfred Marshall in 1890 amply illustrates the role of Statistics in Economics :

“Statistics are the straws out of which I, like every other economist, have to make bricks.”

Statistics plays so vital a role in Economics that in 1926 Prof. R.A. Fisher complained of “the painful misapprehension that Statistics is a branch of Economics.”

Statistical data and advanced techniques of statistical analysis have proved immensely useful in the solution of a variety of economic problems relating to production, consumption, distribution of income and wealth, wages, prices, profits, savings, expenditure, investment, unemployment, poverty, etc. For example, the studies of consumption statistics reveal the pattern of consumption of various commodities by different sections of society and also enable us to have some idea about their purchasing capacity and their standard of living.
The studies of production statistics enable us to strike a balance between supply and demand which is provided by the laws of supply and demand. The income and wealth statistics are mainly helpful in reducing the disparities of income. The statistics of prices are needed to study the price theories and the general problem of inflation through the construction of the cost of living and wholesale price index numbers. The statistics of market prices, costs and profits of different individual concerns are needed for the studies of competition and monopoly. Statistics pertaining to some macro-variables like production, income, expenditure, savings, investments, etc., are used for the compilation of National Income Accounts which are indispensable for economic planning of a country. Exchange statistics reflect upon the commercial development of a nation and tell us about the money in circulation and the volume of business done in the country. Statistical techniques have also been used in determining the measures of Gross National Product and Input-Output Analysis. The advanced and sound statistical techniques have been used successfully in the analysis of cost functions, production functions and consumption functions. Use of Statistics in Economics has led to the formulation of many economic laws some of which are mentioned below for illustration : A detailed and systematic study of the family budget data which gives a detailed account of the family budgets showing expenditure on the main items of family consumption together with family structure and composition, family income and various other social, economic and demographic characteristics led to the famous Engel’s Law of Consumption in 1895. Vilfredo Pareto in 19th-20th century propounded his famous Law of Distribution of Income by making an empirical study of the income data of various countries of the world at different times. 
The study of the data pertaining to the actual observation of the behaviour of buyers in the market resulted in the Revealed Preference Analysis of Prof. Samuelson.

Time Series Analysis, Index Numbers, Forecasting Techniques and Demand Analysis are some of the very powerful statistical tools which are used extensively in the analysis of economic data and also for economic planning. For instance, time series analysis is used extensively in Business and Economic Statistics for the study of the series relating to prices, production and consumption of commodities, money in circulation, bank deposits and bank clearings, sales in a departmental store, etc.,
(i) to identify the forces or components at work, the net effect of whose interaction is exhibited by the movement of the time series;
(ii) to isolate, study, analyse and measure them independently.

The index numbers, which are also termed ‘economic barometers’, are numbers which reflect the changes over a specified period of time in (i) prices of different commodities, (ii) industrial/agricultural production, (iii) sales, (iv) imports and exports, (v) cost of living, etc., and are extremely useful in economic planning. For instance, the cost of living index numbers are used for (i) the calculation of real wages and for determining the purchasing power of money; (ii) the deflation of income and value series in national accounts; (iii) the grant of dearness allowance (D.A.) or bonus to workers in order to enable them to meet the increased cost of living, and so on.

The demand analysis consists in making an economic study of the market data to determine the relation between :
(i) the price of a given commodity and its absorption capacity for the market i.e., demand; and
(ii) the price of a commodity and its output i.e., supply.

Forecasting techniques based on the method of curve fitting by the principle of least squares and on exponential smoothing are indispensable tools for economic planning.
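As an illustration of the use of cost of living index numbers for the calculation of real wages mentioned above, the following is a minimal sketch in Python. It assumes the usual deflation rule, real wage = (money wage ÷ cost of living index) × 100, with the index expressed on a base of 100. The function name `real_wage` and all the wage and index figures are hypothetical and are used only for illustration.

```python
def real_wage(money_wage, cost_of_living_index, base=100.0):
    """Deflate a money wage by a cost of living index whose base period equals `base`."""
    if cost_of_living_index <= 0:
        raise ValueError("the index number must be positive")
    return money_wage * base / cost_of_living_index


# Hypothetical money wages and cost of living index numbers (base year 2010 = 100).
money_wages = {2010: 12000.0, 2011: 13200.0, 2012: 15000.0}
index_numbers = {2010: 100.0, 2011: 110.0, 2012: 125.0}

# Real wage for each year, expressed in base-year prices.
real_wages = {
    year: real_wage(money_wages[year], index_numbers[year])
    for year in money_wages
}

# Purchasing power of one unit of money relative to the base year.
purchasing_power = {year: 100.0 / index_numbers[year] for year in index_numbers}

for year in sorted(real_wages):
    print(year, round(real_wages[year], 2), round(purchasing_power[year], 3))
```

In these made-up figures the money wage rises every year, but the real wage does not change, because the index rises in the same proportion. This is the kind of comparison that deflation by a cost of living index is meant to make possible.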
The increasing interaction of mathematics and statistics with economics led to the development of a new discipline called Econometrics—and the first Econometric Society was founded in U.S.A. in 1930 for “the advancement of economic theory in its relation to mathematics and statistics…” Econometrics aimed at making Economics a more realistic, precise, logical and practical science. Econometric models based on sound statistical analysis are used for maximum exploitation of the available resources. In other words, an attempt is made to obtain optimum results subject to a number of constraints on the resources at our disposal, say, of production capacity, capital, technology, precision, etc., which are determined statistically. Statistics in Business and Management. Prior to the Industrial Revolution, when the production was at the handicraft stage, the business activities were very much limited and were confined only to small units operating in their own areas. The owner of the concern personally looked after all the departments of business activity like sales, purchase, production, marketing, finance and so on. But after the Industrial Revolution, the developments in business activities have taken such unprecedented dimensions both in the size and the competition in the market that the activities of most of the business enterprises and firms are confined not only to one particular locality, town or place but to larger areas. Some of the leading houses have the network of their business activities in almost all the leading towns and cities of the country and even abroad. Accordingly it is impossible for a single person (the owner of the concern) to look after its activities and management has become a specialised job. The manager and a team of management executives is imperative for the efficient handling of the various operations like sales, purchase, production, marketing, control, finance, etc., of the business house. 
It is here that statistical data and the powerful statistical tools of probability, expectation, sampling techniques, tests of significance, estimation theory, forecasting techniques and so on play an indispensable role. According to Wallis and Roberts : “Statistics may be regarded as a body of methods for making wise decisions in the face of uncertainty.” A refinement over this definition is provided by Prof. Ya-Lun-Chou as follows : “Statistics is a method of decision making in the face of uncertainty on the basis of numerical data and calculated risks.” These definitions reflect the applications of Statistics in business since modern business has its roots in the accuracy and precision of the estimates and statistical forecasting regarding the future demand for the product, market trends and so on. Business forecasting techniques which are based on the compilation of useful statistical information on lead and lag indicators are very useful for obtaining estimates which serve as a guide to future economic events. Wrong expectations, which might be the result of faulty and inaccurate analysis of various factors affecting a particular phenomenon, might lead the businessman to disaster. The time series analysis is a very important statistical tool which is used in business for the study of : (i) Trend (by the method of curve fitting by the principle of least squares) in order to obtain the estimates of the probable demand of the goods; and (ii) Seasonal and Cyclical movements in the phenomenon, for determining the ‘Business Cycle’ which may also be termed as the four-phase cycle composed of prosperity (period of boom), recession, depression and recovery. The upswings and downswings in business depend on the cumulative nature of the economic forces (affecting the equilibrium of supply and demand) and the interaction between them.
Most of the business and commercial series e.g., series relating to prices, production, consumption, profits, investments, wages, etc., are affected to a great extent by business cycles. Thus the study of business cycles is of paramount importance in business and a businessman who ignores the effects of booms and depressions is bound to fail since his estimates and forecasts will definitely be faulty. The studies of Economic Barometers (Index Numbers of Prices) enable the businessman to have an idea about the purchasing power of money. The statistical tools of demand analysis enable the businessman to strike a balance between supply and demand. [For details, see Statistics in Economics]. The technique of Statistical Quality Control, through the powerful tools of ‘Control Charts’ and ‘Inspection Plans’ is indispensable to any business organisation for ensuring that the quality of the manufactured product is in conformity with the consumer’s specifications. (For details see Statistics in Industry). Statistical tools are used widely by business enterprises for the promotion of new business. Before embarking upon any production process, the business house must have an idea about the quantum of the product to be manufactured, the amount of the raw material and labour needed for it, the quality of the finished product, marketing avenues for the product, the competitive products in the market and so on. Thus the formulation of a production plan is a must and this cannot be achieved without collecting the statistical information on the above items by resorting to the powerful technique of ‘Sample Surveys’. As such, most of the leading business and industrial concerns have full-fledged statistical units with trained and efficient statisticians for formulating such plans and arriving at valid decisions in the face of uncertainty with calculated risks.
These units also carry on research and development programmes for the improvement of the quality of the existing products (in the light of the competitive products in the market), introduction of new products and optimisation of the profits with existing resources at their disposal. Statistical tools of probability and expectation are extremely useful in Life Insurance which is one of the pioneer branches of Business and Commerce to use Statistics since the end of the seventeenth century. Statistical techniques have also been used very widely by business organisations in : (i) Marketing Decisions (based on the statistical analysis of consumer preference studies – demand analysis). (ii) Investment (based on sound study of individual shares and debentures). (iii) Personnel Administration (for the study of statistical data relating to wages, cost of living, incentive plans, effect of labour dispute/unrest on the production, performance standards, etc.). (iv) Credit policy. (v) Inventory Control (for co-ordination between production and sales). (vi) Accounting (for evaluation of the assets of the business concerns). (For details see Statistics in Accountancy and Auditing). (vii) Sales Control (through the statistical data pertaining to market studies, consumer preference studies, trade channel studies and readership surveys, etc.), and so on. From the above discussion it is obvious that the use of statistical data and techniques is indispensable in almost all the branches of business activity. Statistics in Accountancy and Auditing. Today, the science of Statistics has assumed such unprecedented dimensions that even the subjects like Accountancy and Auditing have not escaped its domain. 
The ever-increasing applications of the statistical data and the advanced statistical techniques in Accountancy and Auditing are well supported by the inclusion of a compulsory paper on Statistics both in the Chartered Accountants (Foundation) and Cost and Works Accountants (Intermediate) examinations curriculum. Statistics has innumerable applications in accountancy and auditing. For example the statistical data on some macro-variables like income, expenditure, investment, profits, production, savings, etc., are used for the compilation of National Income Accounts which provide information on the value added by different sectors of economy and are very helpful in formulating economic policies. The statistical study (Correlation Analysis) of profit and dividend statistics enables one to predict the probable dividends for the future years. Further, in Accountancy, the statistics of assets and liabilities, and income and expenditure are helpful to ascertain the financial results of various operations. A very important application of Statistics in accountancy is in the ‘Method of Inflation Accounting’ which consists in revaluating the accounting records based on historical costs of assets after adjusting for the changes in the purchasing power of money. This is achieved through the powerful statistical tools of Price Index Numbers and Price Deflators. The Regression Analysis theory is of immense help in Cost Accounting in forecasting cost or price for any given value of the independent variable. Suppose there exists a functional relation between cost of production (c) and the price of the product (p), of the form : c = f(p). Then with the statistical tools of regression analysis we can predict the effect of changes in future prices on the cost of production. Statistical techniques are also greatly used in forecasting profits, determination of trend, computation of financial and other ratios, cost-volume-profit analysis and so on.
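The prediction of cost from price by regression analysis can be sketched as follows. The text specifies only a general relation c = f(p); the sketch assumes, purely for illustration, that f is a straight line c = a + bp fitted by the principle of least squares. The price and cost figures are hypothetical and the function name fit_line is chosen here for illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares of deviations from the means.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx          # slope: change in cost per unit change in price
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical data: price of the product (p) and cost of production (c).
prices = [10.0, 12.0, 14.0, 16.0]
costs = [7.0, 8.0, 9.0, 10.0]

a, b = fit_line(prices, costs)
predicted_cost = a + b * 18.0  # estimated cost if the price rises to 18

print(a, b, predicted_cost)
```

Such a prediction is reliable only for prices near the range of the observed data and only so long as the assumed linear form is a reasonable description of the relation.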
The efficacy of the implementation of a new investment plan can be tested by using the statistical tests of significance. In Auditing, sampling techniques are used widely for test checking. The business transactions and the volumes of the various items comprising balances in various accounts are so heavy (voluminous) that it is practically impossible to resort to 100% examination and analysis of the records because of limitations of time, money and staff at our disposal. Accordingly, sampling techniques based on sound statistical and scientific reasoning are used effectively to examine thoroughly only a sample (fraction – 2% or 5%) of the transactions or the items comprising a balance and to draw inferences about the whole lot (data) by using statistical techniques of Estimation and Inference. Statistics in Industry. In industry, Statistics is extensively used in ‘Quality Control’. The main objective in any production process is to control the quality of the manufactured product so that it conforms to specifications. This is called process control and is achieved through the powerful technique of control charts and inspection plans. The discovery of the control charts was made by a young physicist Dr. W.A. Shewhart of the Bell Telephone Laboratories (U.S.A.) in 1924 and the following years and is based on setting the 3σ (3 – sigma) control limits which have their basis in the theory of probability and normal distribution. Inspection plans are based on a special kind of sampling technique which is a very important aspect of statistical theory. Statistics in Physical Sciences. The applications of Statistics in Astronomy, which is a physical science, have already been discussed above. In physical sciences, a large number of measurements are taken on the same item. There is bound to be variation in these measurements.
In order to have an idea about the degree of accuracy achieved, the statistical techniques (Interval Estimation – confidence intervals and confidence limits) are used to assign certain limits within which the true value of the phenomenon may be expected to lie. The desire for precision was first felt in physical sciences and this led the science to express the facts under study in quantitative form. The statistical theory with the powerful tools of sampling, estimation (point and interval), design of experiments, etc., is most effective for the analysis of the quantitative expression of all fields of study. Today, there is an increasing use of Statistics in most of the physical sciences such as astronomy, geology, engineering, physics and meteorology. Statistics in Social Sciences. According to Bowley, “Statistics is the science of the measurement of social organism, regarded as a whole in all its manifestations.” In the words of W.I. King, “The science of Statistics is the method of judging collective, natural or social phenomenon from the results obtained from the analysis or enumeration or collection of estimates.” These words of Bowley and King amply reflect upon the importance of Statistics in social sciences. Every social phenomenon is affected to a marked extent by a multiplicity of factors which bring out the variation in observations from time to time, place to place and object to object. Statistical tools of Regression and Correlation Analysis can be used to study and isolate the effect of each of these factors on the given observation. Sampling Techniques and Estimation Theory are very powerful and indispensable tools for conducting any social survey, pertaining to any strata of society and then analysing the results and drawing valid inferences. The most important application of statistics in sociology is in the field of Demography for studying mortality (death rates), fertility (birth rates), marriages, population growth and so on. 
In fact statistical data and statistical techniques have been used so frequently and in so many problems in social sciences that Croxton and Cowden have remarked : “Without an adequate understanding of the statistical methods, the investigator in the social sciences may be like the blind man groping in a dark room for a black cat that is not there. The methods of Statistics are useful in an ever-widening range of human activities in any field of thought in which numerical data may be had.” Statistics in Biology and Medical Sciences. Sir Francis Galton (1822 – 1911), a British Biometrician, pioneered the use of statistical methods with his work on ‘Regression’ in connection with the inheritance of stature. According to Prof. Karl Pearson (1857 – 1936) who pioneered the study of ‘Correlation Analysis’, the whole theory of heredity rests on statistical basis. In his Grammar of Science he says, “The whole problem of evolution is a problem of vital statistics, a problem of longevity, of fertility, of health, of disease and it is as impossible for the evolutionist to proceed without statistics as it would be for the Registrar General to discuss the national mortality without an enumeration of the population, a classification of deaths and a knowledge of statistical theory.” In medical sciences also, the statistical tools for the collection, presentation and analysis of observed factual data relating to the causes and incidence of diseases are of paramount importance. For example, the factual data relating to pulse rate, body temperature, blood pressure, heart beats, weight, etc., of the patient greatly help the doctor for the proper diagnosis of the disease; statistical papers are used to study heart beats through electro-cardiogram (E.C.G.).
Perhaps the most important application of Statistics in medical sciences lies in using the tests of significance (more precisely Student’s t-test) for testing the efficacy of a manufactured drug, injection or medicine for controlling/curing specific ailments. The testing of the effectiveness of a medicine by the manufacturing concern is a must, since it is only after the effectiveness of the medicine is established by sound statistical techniques that it will venture to manufacture it on a large scale and bring it out in the market. Comparative studies for the effectiveness of different medicines by different concerns can also be made by statistical techniques. Statistics in Psychology and Education. Statistics has been used very widely in education and psychology too e.g., in the scaling of mental tests and other psychological data; for measuring the reliability and validity of test scores ; for determining the Intelligence Quotient (I.Q.) ; in Item Analysis and Factor Analysis. The vast applications of statistical data and statistical theories have given rise to a new discipline called ‘Psychometry’. 1·4. LIMITATIONS OF STATISTICS Although Statistics is indispensable to almost all sciences (social, physical and natural) and is very widely used in almost all spheres of human activity, it is not without limitations which restrict its scope and utility. 1. Statistics does not study qualitative phenomenon. ‘Statistics are numerical statements in any department of enquiry placed in relation to each other’. Since Statistics is a science dealing with a set of numerical data, it can be applied to the study of only those phenomena which can be measured quantitatively. Thus the statements like ‘population of India has increased considerably during the last few years’ or ‘the standard of living of the people in Delhi has gone up as compared with last year’, do not constitute Statistics.
As such Statistics cannot be used directly for the study of quality characteristics like health, beauty, honesty, welfare, poverty, etc., which cannot be measured quantitatively. However, the techniques of statistical analysis can be applied to qualitative phenomena indirectly by expressing them numerically after assigning particular scores or quantitative standards. For instance, the attribute of intelligence in a group of individuals can be studied on the basis of their intelligence quotients (I.Q.’s) which may be regarded as the quantitative measure of the individuals’ intelligence. 2. Statistics does not study individuals. According to Prof. Horace Secrist, “By Statistics we mean aggregate of facts affected to a marked extent by multiplicity of factors…and placed in relation to each other.” Thus a single or isolated figure cannot be regarded as Statistics unless it is a part of the aggregate of facts relating to any particular field of enquiry. Thus statistical methods do not give any recognition to an object or a person or an event in isolation. This is a serious limitation of Statistics. For instance, the price of a single commodity, the profit of a particular concern or the production of a particular business house do not constitute statistics since these figures are unrelated and not comparable. However, the aggregate of figures relating to prices and consumption of various commodities, the sales and profits of a business house, the income, expenditure, production, etc., over different periods of time, places, etc., will be Statistics. Thus from the statistical point of view the figure of the population of a particular country in some given year is useless unless we are also given the figures of the population of the country for different years or of different countries for the same year for comparative studies. Hence Statistics is confined only to those problems where group characteristics are to be studied. 3.
Statistical laws are not exact. Since the statistical laws are probabilistic in nature, inferences based on them are only approximate and not exact like the inferences based on mathematical or scientific (physical and natural sciences) laws. Statistical laws are true only on the average. If the probability of getting a head in a single throw of a coin is 1/2, it does not imply that if we toss a coin 10 times, we shall get five heads and five tails. In 10 throws of a coin we may get 8 heads, 9 heads or all the 10 heads, or we may not get even a single head. By this we mean that if the experiment of throwing the coin is carried on indefinitely (very large number of times), then we should expect on the average 50% heads and 50% tails. 4. Statistics is liable to be misused. Perhaps the most significant limitation of Statistics is that it must be used by experts. According to Bowley, “Statistics only furnishes a tool though imperfect which is dangerous in the hands of those who do not know its use and deficiencies.” Statistical methods are the most dangerous tools in the hands of the inexperts. Statistics is one of those sciences whose adepts must exercise the self-restraint of an artist. The greatest limitation of Statistics is that it deals with figures which are innocent in themselves and do not bear on their face the label of their quality and can be easily distorted, manipulated or moulded by politicians, dishonest or unskilled workers, unscrupulous people for personal selfish motives. Statistics neither proves nor disproves anything. It is merely a tool which, if rightly used may prove extremely useful but if misused by inexperienced, unskilled and dishonest statisticians might lead to very fallacious conclusions and even prove to be disastrous. In the words of W.I.
King, “Statistics are like clay of which you can make a God or a Devil as you please.” At another place he remarks, “Science of Statistics is the useful servant but only of great value to those who understand its proper use.” Thus the use of Statistics by the experts who are well experienced and skilled in the analysis and interpretation of statistical data for drawing correct and valid inferences very much reduces the chances of mass popularity of this important science. 1·5. DISTRUST OF STATISTICS The improper use of statistical tools by unscrupulous people with an improper statistical bent of mind has led to the public distrust in Statistics. By this we mean that the public loses its belief, faith and confidence in the science of Statistics and starts condemning it. Such irresponsible, inexperienced and dishonest persons who use statistical data and statistical techniques to fulfil their selfish motives have discredited the science of Statistics with some very interesting comments, some of which are stated below : (i) An ounce of truth will produce tons of Statistics. (ii) Statistics can prove anything. (iii) Figures do not lie. Liars figure. (iv) Statistics is an unreliable science. (v) There are three types of lies – lies, damned lies and Statistics, wicked in the order of their naming ; and so on. Some of the reasons for the above remarks may be enumerated as follows : (a) Figures are innocent and believable, and the facts based on them are psychologically more convincing. But it is a pity that figures do not have the label of quality on their face. (b) Arguments are put forward to establish certain results which are not true by making use of inaccurate figures or by using incomplete data, thus distorting the truth. (c) Though accurate, the figures might be moulded and manipulated by dishonest and unscrupulous persons to conceal the truth and present a wrong and distorted picture of the facts to the public for personal and selfish motives.
Hence, if Statistics and its tools are misused, the fault does not lie with the science of Statistics. Rather, it is the people who misuse it who are to be blamed. Utmost care and precautions should be taken for the interpretation of statistical data in all its manifestations. “Statistics should not be used as a blind man uses a lamp-post for support instead of illumination.” However, there are misapprehensions about the argument that Statistics can be used effectively by expert statisticians, as is given in the following remark due to Wallis and Roberts: “He who accepts statistics indiscriminately will often be duped unnecessarily. But he who distrusts statistics indiscriminately will often be ignorant unnecessarily. There is an accessible alternative between blind gullibility and blind distrust. It is possible to interpret statistics skillfully. The art of interpretation need not be monopolized by statisticians, though, of course, technical statistical knowledge helps. Many important ideas of technical statistics can be conveyed to the non-statistician without distortion or dilution. Statistical interpretation depends not only on statistical ideas but also on ordinary clear thinking. Clear thinking is not only indispensable in interpreting statistics but is often sufficient even in the absence of specific statistical knowledge. For the statistician not only death and taxes but also statistical fallacies are unavoidable. With skill, common sense, patience and above all objectivity, their frequency can be reduced and their effects minimised. But eternal vigilance is the price of freedom from serious statistical blunders.” We give below some illustrations regarding the mis-interpretation of statistical data. 1. “The number of car accidents committed in a city in a particular year by women drivers is 10 while those committed by men drivers is 40. Hence women are safe drivers”.
The statement is obviously wrong since nothing is said about the total number of men and women drivers in the city in the given year. Some valid conclusions can be drawn if we are given the proportion of the accidents committed by male and female drivers. 2. “It has been found that the 25% of the surgical operations by a particular surgeon are successful. If he is to operate on four persons on any day and three of the operations have proved unsuccessful, the fourth must be a success.” The given conclusion is not true since statistical laws are probabilistic in nature and not exact. The conclusion that if three operations on a particular day are unsuccessful, the fourth must be a success, is not true. It may happen that the fourth operation is also unsuccessful. It may also happen that on any day two or three or even all the four operations may be successful. The statement means that as the number of operations becomes larger and larger, we should expect, on the average, 25% of the operations to be successful. 3. A report : “The number of traffic accidents is lower in foggy weather than on clear weather days. Hence it is safer to drive in fog.” The statement again is obviously wrong. To arrive at any valid conclusions we must take into account the difference between the rush of traffic under the two weather conditions and also the extra cautiousness observed when driving in bad weather. 4. “80% of the people who drink alcohol die before attaining the age of 70 years. Hence drinking is harmful for longevity of life.” This statement is also fallacious since no information is given about the number of persons who do not drink alcohol and die before attaining the age of 70 years. In the absence of the information about the proportion of such persons we cannot draw any valid conclusions. 5. Incomplete data usually leads us to fallacious conclusions. Let us consider the scores of two students Ram and Shyam in three tests during a year. 
                 1st test    2nd test    3rd test    Average Score
Ram’s Score        50%         60%         70%           60%
Shyam’s Score      70%         60%         50%           60%
If we are given the average score which is 60% in each case, we will conclude that the level of intelligence of the two students at the end of the year is same. But this conclusion is false and misleading since a careful study of the detailed marks over the three tests reveals that Ram has improved consistently while Shyam has deteriorated consistently. Remark. Numerous such examples can be constructed to illustrate the misuse of statistical methods and this is all due to their injudicious applications and interpretations for which the science of Statistics cannot be blamed. EXERCISE 1·1. 1. (a) Write a short essay on the origin and development of the science of Statistics. (b) Give the names of some of the veterans in the development of Statistics, along with their contributions. 2. (a) Discuss the utility of Statistics to the state, the economist, the industrialist and the social worker. (b) Define “Statistics” and discuss the importance of Statistics in a planned economy. 3. (a) Define the term “Statistics” and discuss its use in business and trade. Also point out its limitations. [Punjab Univ. B.Com., April 1999] (b) Define the term “Statistics” and discuss its functions and limitations. [C.S. (Foundation), June 2001] (c) Explain the importance of Statistics with respect to business and industry. [Delhi Univ. B.Com. (Pass), 2000] 4. Explain critically a few of the definitions of Statistics and state the one which you think to be the best. 5. (a) “Statistics is a method of decision-making in the face of uncertainty on the basis of numerical data and calculated risks.” Explain with suitable illustrations. (b) Discuss the scope of Statistics. [Punjab Univ. B.Com., Oct. 1998] 6.
Discuss briefly the importance of Statistics in the following disciplines : (i) Economics (ii) Business and Management (iii) Planning (iv) Accountancy and Auditing (v) Physical Sciences (vi) Industry (vii) Biology and Medical Sciences (viii) Social Sciences 7. Comment briefly on the following statements : (a) “Statistics is the science of human welfare.” (b) “To a very striking degree our culture has become a statistical culture.” (c) “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” 8. (a) Comment briefly on the following statements : (i) “Statistics can prove anything”. (ii) “Statistics affects everybody and touches life at many points. It is both a science and an art”. (b) “He who accepts statistics indiscriminately will often be duped unnecessarily, but he who distrusts statistics indiscriminately will often be ignorant unnecessarily.” Comment on the above statement. (c) “Sciences without statistics bear no fruit, statistics without sciences have no root.” Explain the above statement with necessary comments. 9. Comment on the following statements illustrating your view point with suitable examples : (a) “Knowledge of Statistics is like a knowledge of foreign language or of algebra. It may prove of use at any time under any circumstances.” (Bowley) (b) “Statistics is what statisticians do.” (c) “There are lies, damned lies and Statistics – wicked in the order of their naming.” (d) “By Statistics we mean aggregate of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a pre-determined purpose and placed in relation to each other.” (Horace Secrist) (e) “Statistics are the straws out of which I, like every other economist, have to make the bricks.” (Marshall) (f) “Statistics conceals more than it reveals.” [C.S.
(Foundation), June 2002] (g) “Statistics are like bikinis : they reveal what is interesting and conceal what is vital.” 10. (a) What do you understand by distrust of Statistics ? Is the science of Statistics to be blamed for it ? (b) Write a critical note on the limitations and distrust of Statistics. Discuss the important causes of distrust and show how Statistics could be made reliable. (c) Define ‘Statistics’ and discuss its uses and limitations. [Punjab Univ. B.Com., 1998] (d) Discuss the use of Statistics in the fields of economics, trade and commerce. What are the limitations of Statistics ? (e) Explain : “Distrust and misuse of Statistics.” [C.S. (Foundation), June 2001] (f) “Statistics widens the field of knowledge.” Elucidate the statement. [C.S. (Foundation), June 2000] 11. (a) “Statistical methods are most dangerous tools in the hands of the inexperts.” Discuss and explain the limitation of Statistics. (b) “The science of Statistics, then, is a most useful servant but only of great value to those who understand its proper use”–King. Comment on the above statement and discuss the limitations of Statistics. (c) “Statistics are like clay of which you can make a God or Devil, as you please.” In the light of this statement, discuss the uses and limitations of Statistics. (d) “All statistics are numerical statements but all numerical statements are not statistics.” Examine. [C.S. (Foundation), Dec. 2000] 12. Comment on the following statements : (a) “Statistics are like clay of which you make a God or Devil, as you please.” (b) “Statistics is the science of estimates and probabilities.” (c) “Statistics is the science of counting.” (d) “Statistics should not be used as a blind man uses a lamp post for support, instead of for illumination.” [C.S. (Foundation), Dec. 2001] 13.
Comment on the following statistical statements, bringing out in detail the fallacies, if any :
(i) “A survey revealed that the children of engineers, doctors and lawyers have high intelligence quotients (I.Q.). It further revealed that the grandfathers of these children were also highly intelligent. Hence the inference is that intelligence is hereditary.”
(ii) “The number of deaths in military in a recent war in a country was 15 out of 1,000 while the number of deaths in the capital of the country during the same period was 22 per thousand. Hence it is safer to join military service than to live in the capital city of the country.”
(iii) “The number of accidents taking place in the middle of the road is much less than the number of accidents taking place on its sides. Hence it is safer to walk in the middle of the road.”
(iv) “The frequency of divorce for couples with children is only about 1/2 of that for childless couples; therefore producing children is an effective check on divorce.”
(v) “The increase in the price of a commodity was 20%. Then the price decreased by 15% and again increased by 10%. So the resultant increase in the price was 20 – 15 + 10 = 15%.”
(vi) “Nutritious Bread Company, a private manufacturing concern, charges a lower rate per loaf than that charged by a Government of India Undertaking ‘Modern Bread.’ Thus private ownership is more efficient than public ownership.”
(vii) According to the estimate of an economist, the per capita national income of India for 1931-32 was Rs. 65. The National Income Committee estimated the corresponding figure for 1948-49 as Rs. 225. Hence in 1948-49, Indians were nearly four times as prosperous as in 1931-32 ?
14. Point out the ambiguity or mistakes found in the following statements which are made on the basis of the facts given :
(a) 80% of the people who die of cancer are found to be smokers and so it may be concluded that smoking causes cancer.
(b) The gross profit to sales ratio of a company was 15% in the year 1994 and was 10% in 1995. Hence the stock must have been undervalued.
(c) The average output in a factory was 2,500 units in January 1991 and 2,400 units in February 1991. So workers were more efficient in January 1991.
(d) The rate of increase in the number of buffaloes in India is greater than that of the population. Hence the people of India are now getting more milk per head.
15. Comment on the following :
(a) 50 boys and 50 girls took an examination. 30 boys and 40 girls got through the examination. Hence girls are more intelligent than boys.
(b) The average monthly incomes in the two cities of Hyderabad and Chennai were found to be Rs. 330. Hence, the people of both the cities have the same standard of living.
(c) A tutorial college advertised that there was 100 per cent success of the candidates who took the coaching in their institute. Hence the college has got good faculty.
16. Fill in the blanks :
(i) The word Statistics has been derived from the Latin word …… or the German word …… .
(ii) The word Statistics is used to convey different meanings in …… and …… sense.
(iii) Statistics is an …… and also a …… .
(iv) …… defined statistics as ‘numerical statement of facts’.
(v) In singular sense, Statistics means …… .
(vi) In plural sense, Statistics means …… .
(vii) Prof. Ya-Lun-Chou defined Statistics as ‘a method of ……’.
(viii) Bowley A.L. defined Statistics as …… counting.
(ix) Statistics are of …… help in human welfare.
(x) Prof. …… is the real giant in the development of the theory of Statistics.
(xi) “Statistics are the …… out of which I, like every other economist, have to make ……” Alfred Marshall.
(xii) Two Indian statisticians who have made significant contributions in the development of statistics are …… and …… .
Ans. (i) status, statistik; (ii) singular, plural; (iii) art, science; (iv) A.L.
Bowley; (v) statistical methods used for collecting, analysing and drawing inferences from the numerical data; (vi) numerical set of data; (vii) decision making in the face of uncertainty on the basis of numerical data and calculated risks; (viii) science of; (ix) great (immense); (x) R.A. Fisher; (xi) straws, bricks; (xii) P.C. Mahalanobis, C.R. Rao.
17. Indicate if the following statements are true (T) or false (F).
(i) The subject of statistics is a century old.
(ii) The word statistics seems to have been derived from Latin word status.
(iii) Statistics is of no use to humanity.
(iv) ‘To a very striking degree, our culture has become a statistical culture’.
(v) Statistics can prove anything.
Ans. (i) F; (ii) T; (iii) F; (iv) T; (v) F.

2 Collection of Data

2·1. INTRODUCTION
As pointed out in Chapter 1, Statistics are a set of numerical data. (See definitions of Secrist, Croxton and Cowden, etc.). In fact only numerical data constitute Statistics. This means that the phenomenon under study must be capable of quantitative measurement. Thus the raw material of Statistics always originates from the operation of counting (enumeration) or measurement. For any statistical enquiry, whether it is in business, economics or social sciences, the basic problem is to collect facts and figures relating to the particular phenomenon under study. The person who conducts the statistical enquiry i.e., counts or measures the characteristics under study for further statistical analysis is known as the investigator. Ideally, (though a costly presumption), the investigator should be a trained and efficient statistician. But in practice, this is not always or even usually so. The persons from whom the information is collected are known as respondents and the items on which the measurements are taken are called the statistical units. [For details see § 2·1·2].
The process of counting or enumeration or measurement together with the systematic recording of results is called the collection of statistical data. The entire structure of the statistical analysis for any enquiry is based upon systematic collection of data. On the face of it, it might appear that the collection of data is the first step for any statistical investigation. But in a scientifically prepared (efficient and well-planned) statistical enquiry, the collection of data is by no means the first step. Before we embark upon the collection of data for a given statistical enquiry, it is imperative to examine carefully the following points which may be termed as preliminaries to data collection : (i) Objectives and scope of the enquiry. (ii) Statistical units to be used. (iii) Sources of information (data). (iv) Method of data collection. (v) Degree of accuracy aimed at in the final results. (vi) Type of enquiry. We shall discuss these points briefly in the following sections. 2·1·1. Objectives and Scope of the Enquiry. The first and foremost step in organising any statistical enquiry is to define in clear and concrete terms the objectives of the enquiry. This is very essential for determining the nature of the statistics (data) to be collected and also the statistical techniques to be employed for the analysis of the data. The objectives of the enquiry would help in eliminating the collection of irrelevant information which is never used subsequently and also reflect upon the uses to which such information can be put. In the absence of the purpose of the enquiry being explicitly specified, we are bound to collect irrelevant information and also omit some important information which will ultimately lead to fallacious conclusions and wastage of resources. Further, the scope of the enquiry will also have a great bearing upon the data to be collected and also the techniques to be used for its collection and analysis. 
Scope of the enquiry relates to the coverage with respect to the type of information, subject matter and geographical area. For instance, if we want to study the cost of living index numbers, it must be specified if they relate to a particular city or state or whole of India. Further, the class of people (such as a low-income group, middle-income group, labour class, etc.), for which they are intended should also be specified clearly. Thus, if the investigation is to be on a very large scale, the sample method of enumeration and collection will have to be used. However, if the enquiry is confined only to a small group, we may undertake 100% enumeration (census method). Thus if the scope of the enquiry is very wide, it has to be of one nature and if the scope of enquiry is narrow, it has to be of a totally different nature. Thus the decision about the type of enquiry to be conducted depends largely on the objectives and scope of the enquiry. However, the organisers of the enquiry should take care that these objectives and scope are commensurate with the available resources in terms of money, manpower and time limit required for the availability of the results of the enquiry.
2·1·2. Statistical Units to be Used. A well-defined and identifiable object or a group of objects with which the measurements or counts in any statistical investigation are associated is called a statistical unit. For example, in a socio-economic survey the unit may be an individual person, a family, a household or a block of locality. A very important step before the collection of data begins is to define clearly the statistical units on which the data are to be collected. In a number of situations the units are conventionally fixed like the physical units of measurement such as metres, kilometres, kilograms, quintals, hours, days, weeks, etc., which are well defined and do not need any elaboration or explanation.
However, in many statistical investigations, particularly relating to socio-economic studies, arbitrary units are used which must be clearly defined. This is imperative since in the absence of a clear-cut and precise definition of the statistical units, serious errors in the data collection may be committed in the sense that we may collect irrelevant data on the items, which should have, in fact, been excluded and omit data on certain items which should have been included. This will ultimately lead to fallacious conclusions.
REQUISITES OF A STATISTICAL UNIT
The following points might serve as guidelines for deciding about the unit in any statistical enquiry.
1. It should be unambiguous. A statistical unit should be rigidly defined so that it does not lead to any ambiguity in its interpretation. The units must cover the entire population and they should be distinct and non-overlapping in the sense that every element of the population belongs to one and only one statistical unit.
2. It should be specific. The statistical unit must be precise and specific leaving no chance to the investigators. Quite often, in most of the socio-economic surveys the various concepts/characteristics can be interpreted in different variant forms and accordingly the variable used to measure it may be defined in several different ways. For example, in an enquiry relating to the wage level of workers in an industrial concern the wages might be weekly wages, monthly wages or might refer to those of skilled labour only or of day workers only or might include bonus payments also. Similarly prices in an enquiry might refer to cost prices, selling prices, retail prices, wholesale prices or contract prices.
Thus in a statistical enquiry it is important to distinguish between the conventional and the arbitrary definitions of the characteristics/variables, the former being the one prevalent in common use which always remains the same (fixed) for every enquiry while the latter is the one which is used in a specific sense and refers to the working or operational definition which will keep on changing from one enquiry to another enquiry.
3. It should be stable. The unit selected should be stable over a long period of time and also w.r.t. places i.e., there should not be significant fluctuations in the value of a unit at different intervals of time or at different places because in the contrary case, the data collected at different times or places will not be comparable and this would mar their utility to a great extent. The fluctuations in the value of money at different times (due to inflation) or in the measurement of weights at different places (due to height above sea level) might render the comparisons useless. Thus, the unit selected should imply, as far as possible, the same characteristics at different times or at different places.
4. It should be appropriate to the enquiry. As already pointed out, the concept and definition of arbitrary statistical units keep on changing from enquiry to enquiry. The unit selected must be relevant to the given enquiry. Thus, for studying the changes in the general price level, the appropriate unit is the wholesale prices while for constructing the cost of living indices (or consumer price indices) the appropriate unit is the retail prices.
5. It should be uniform. It is essential that the unit adopted should be homogeneous (uniform) throughout the investigation so that the measurements obtained are comparable. For example, in measuring length if we use a yard on some occasions and metre on other occasions in an investigation, the observations obtained would be confusing and misleading.
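The requirement of uniformity in point 5 can be enforced at the recording stage by converting every observation to one common unit before any comparison is made. The following is only a minimal sketch: the observations are hypothetical, and the names `TO_METRES` and `to_metres` are introduced here for illustration and do not come from the text.

```python
# Conversion factors to the common unit (metres).
# 1 yard = 0.9144 metre exactly (international yard).
TO_METRES = {"metre": 1.0, "yard": 0.9144}


def to_metres(value, unit):
    """Convert a length recorded in `unit` to metres."""
    if unit not in TO_METRES:
        raise ValueError(f"unknown unit of recording: {unit!r}")
    return value * TO_METRES[unit]


# Hypothetical observations recorded in mixed units.
observations = [(10.0, "yard"), (10.0, "metre"), (25.0, "yard")]

# Express every observation in metres before comparing them.
uniform = [round(to_metres(v, u), 4) for v, u in observations]
print(uniform)
```

Rejecting an unknown unit outright, rather than guessing, mirrors the advice that the unit must be fixed and explained before the enquiry starts.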
TYPES OF STATISTICAL UNITS The statistical units may be broadly classified as follows : (i) Units of collection. (ii) Units of analysis and interpretation. (i) Units of Collection. The units of collection may further be sub-divided into the following two classes : (a) Units of Enumeration. In any statistical enquiry, whether it is conducted by ‘sample’ method or ‘census’ method, unit of enumeration is the basic unit on which the observations are to be made and this unit is to be decided in advance before conducting the enquiry keeping in view the objectives of the enquiry. The unit of enumeration may be a person, a household, a family, a farm (in land experiments), a shop, a livestock, a firm, etc. As has been pointed out earlier, this unit should be very clearly defined in terms of shape, size, etc. For instance, for the construction of cost of living index number, the proper unit of enumeration is household. It should be explained in clear terms whether a household consists of a family comprising blood relations only or people taking food in a common kitchen or all the persons living in the house or the persons enlisted in the ration card only. The concept of the household (to be used in the enquiry) is to be decided in advance and explained clearly to the enumerators so that there are no essential omissions or irrelevant inclusions. (b) Units of Recording. The units of recording are the units in terms of which the data are recorded or in other words they are the units of quantification. For instance, in the construction of cost of living index number (consumer price index) the data to be collected from each household, among other things, include the retail prices of various commodities together with the quantities consumed by the class of people for whom the index is meant. 
The units of recording for quantity may be weight (in case of foodgrains), say, in kilograms, quintals, tons, etc., in case of clothing the unit of recording may be metres; the prices may be recorded in terms of rupees and so on. Units of measurement (recording) may be simple or composite. The units which represent only one condition without any qualification (adjective) are called simple units such as metre, rupee, ton, kilogram, pound, bale of cloth, hour, week, year, etc. Such units are generally conventional and not at all difficult to define. However, sometimes care has to be taken in their actual usage. For example, the bale of cloth must be defined in terms of length, say, 20 metres, 50 metres or 100 metres. Similarly, in case of weight it should be clearly specified whether it is net weight or gross weight. A simple unit with some qualifying words is called a composite unit. A simple unit with only one qualifying word is called a compound unit. Examples of such units are skilled worker, employed person, ton–kilometre, kilowatt hour, man hours, retail prices, monthly wages, passenger kilometres. For instance ton–kilometre means the number of tons multiplied by the number of kilometres carried; man hours implies the total number of workers multiplied by the number of hours that each worker has put in and so on. If two or more qualifying words are added to a simple unit, it is called a complex unit such as production per machine hour, output per man hour and so on. Thus as compared to simple units, composite (compound and complex) units are much more restrictive in scope and difficult to define. Such units should be defined properly and clearly as they need explanation about the unit used and also about the qualifying words. (ii) Units of Analysis and Interpretation. As the name implies, the units of analysis and interpretation are those units in the form of which the statistical data are ultimately analysed and interpreted. 
It should be decided whether the results would be expressed in absolute figures or relative figures. The units of analysis and interpretation facilitate comparisons between different sets of data with respect to time, place or environment (conditions). Generally, the units of analysis are rates, ratios and percentages, and coefficients. Rates involve the comparison between two heterogeneous quantities i.e., when the numerator and denominator are not of the same kind e.g., the mortality (death) rates, the fertility (birth) rates and so on. Rates are usually expressed per thousand. For instance, the Crude Birth Rate (C.B.R.) is the ratio of total number of live births in the given region or locality during a given period to the total population of that region or locality during the same period, multiplied by 1,000. Rate per unit is called coefficient. However, ratios and percentages are used for comparing homogeneous quantities i.e., when the numerator and denominator are of the same kind. For example, “the ratio of smokers to non-smokers in a particular locality is 1 : 3” implies that 25% of the population are smokers. From a practical point of view, for comparing data relating to different series, usually the unit of analysis is one which gives relative figures which are pure numbers independent of units of measurement. For instance, if we want to compare two series for variability (dispersion) the appropriate unit of analysis is Coefficient of Variation [See Chapter 6] and for comparing symmetry of two distributions, we study the Coefficient of Skewness [See Chapter 7].
2·1·3. Sources of Information (Data). Having decided about the objectives and scope of the enquiry and the statistical units to be used, the next problem is to decide about the sources from which the information (data) can be obtained or collected.
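As an aside on the units of analysis described under (ii) above, the three kinds of relative figures can be sketched in a few lines. This is an illustration only: all figures except the 1 : 3 smokers ratio are hypothetical, the function names are introduced here, and the coefficient of variation is assumed to follow the usual definition (100 × standard deviation / mean), since the text defers its definition to Chapter 6.

```python
from statistics import mean, pstdev


def crude_birth_rate(live_births, population):
    """Live births per 1,000 of population, same region and period."""
    return live_births * 1000 / population


def share_from_ratio(a, b):
    """Percentage of the whole formed by the first part of a ratio a : b."""
    return a * 100 / (a + b)


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (a pure number)."""
    return pstdev(values) / mean(values) * 100


# Hypothetical locality: 2,400 live births in a population of 100,000.
print(crude_birth_rate(2400, 100000))   # 24.0 births per thousand

# Smokers : non-smokers = 1 : 3, as in the text.
print(share_from_ratio(1, 3))           # 25.0 per cent smokers

# Two hypothetical series in different units, compared for variability.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]
print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(weights_kg))
```

Both series have the same standard deviation, yet the second has the larger coefficient of variation because its mean is smaller; this is why a unit-free relative figure is preferred when comparing series measured in different units.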
For any statistical enquiry, the investigator may collect the data first hand or he may use the data from other published sources such as the publications of the government/semi-government organisations, periodicals, magazines, newspapers, research journals, etc. If the data are collected originally by the investigator for the given enquiry it is termed as primary data and if he makes use of the data which had been earlier collected by some one else, it is termed as secondary data. For example, the vital rates i.e., the rates of fertility and mortality in India prepared by the office of Registrar-General of India, New Delhi, are primary data but if the same data are reproduced in the U.N. Statistical Abstract (a publication of the United Nations Organisation), it becomes a secondary data. Obviously the type of enquiry needed for primary data is bound to be of a totally different nature than the type of enquiry needed for the use of secondary data. In case of primary data, the type of enquiry requires laying down the definitions of the various terms and the statistical units used in the enquiry, keeping in mind the objectives and scope of the enquiry. However, in the use of secondary data there are no such problems since the data have already been collected under a given set of definitions of various terms and units used. However, before using secondary data for statistical investigation under study, it must be subjected to careful editing and scrutiny with respect to their reliability, suitability and adequacy. [For details see § 2·6 in this Chapter]. For a given enquiry, the use of either primary or secondary data or both may be made, depending upon the purpose and scope of the enquiry. 2·1·4. Method of Data Collection. The problem does not arise if secondary data are to be used. However, if primary data are to be collected a decision has to be taken whether (i) census method or (ii) sample technique, is to be used for data collection. 
In the census method, we resort to 100% inspection of the population and enumerate each and every unit of the population. In the sample technique we inspect or study only a selected representative and adequate fraction (finite subset) of the population and after analysing the results of the sample data we draw conclusions about the characteristics of the population. In some situations such as population being infinite or very large, census method fails. Moreover, it is not practicable if the enumeration or testing of the units (objects) is destructive e.g., for testing the breaking strength of chalk, testing the life of electric bulbs or tubes, testing of crackers and explosives, etc. Even, if practicable, it may not be feasible from considerations of time and money. Thus a choice between the sample method and census method is to be made depending upon the objectives and scope of the survey, the limitations of resources in terms of time, money, manpower, etc., and the degree of accuracy desired. In case of sample method the size of the sample and the technique of sampling like simple random sampling, stratified random sampling, systematic sampling, etc., are to be decided. [For detailed discussion, see Chapter 15 – Sampling Theory and Design of Sample Surveys]. 2·1·5. Degree of Accuracy Aimed at in the Final Results. A decision regarding the degree of accuracy or precision desired by the investigator in his estimates or results is essential before starting any statistical enquiry. An idea about the precision aimed at is extremely helpful in deciding about the method of data collection and the size of the sample (if the enquiry is to be on the basis of a sample study). The information gained from any previous completed sample study on the subject in the form of precision achieved for a given sample size may serve as a useful guide in this matter provided there is no fundamental reason to change this empirical basis. 
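The link between sample size and precision mentioned above can be made concrete under stated assumptions. The sketch below is not taken from the text: it assumes simple random sampling from a large population, a standard deviation known from a previous study, and a 95% confidence multiplier of 1.96; the function names and figures are hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of the sample mean, sigma / sqrt(n), assuming
    simple random sampling from a large population."""
    return sigma / math.sqrt(n)


def required_sample_size(sigma, margin, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) does not exceed `margin`."""
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical figure: a previous study suggests sigma = 50 (say, rupees).
sigma = 50.0

# Quadrupling the sample size only halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))

# Sample size needed to estimate the mean to within 5 rupees.
print(required_sample_size(sigma, margin=5.0))
```

Because precision improves only with the square root of the sample size, each further gain in accuracy costs progressively more, which is the trade-off between accuracy and cost that this section goes on to discuss.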
In any statistical enquiry perfect accuracy in final results is practically impossible to achieve because of the errors in measurement, collection of data, its analysis and interpretation of the results. However, even if it were attainable, it is not generally desirable in terms of time and money likely to be spent in attaining it and a reasonable degree of precision is enough to draw valid inferences. A decision regarding the precision of the results very much depends upon the objectives and scope of the enquiry. For example, if we are measuring the length of cloth for shirting or pant, a difference of a few centimetres is going to make a substantial impact. But if we are measuring the distance between two places, say, Delhi and Mumbai, a difference of a few metres may be immaterial and while measuring the distance between two distant places, say Delhi and New York (U.S.A.) a difference of a few kilometres may be immaterial. Likewise in measuring cereals (rice, wheat etc.) a difference of a few grams may not matter at all whereas in measuring gold even 1/15th or 1/20th part of a gram is going to make a lot of difference. In the words of Riggleman and Frisbee, “the necessary degree of accuracy in counting or measuring depends upon the practical value of accuracy in relation to its cost.” However it should not be misunderstood to imply that one should sacrifice accuracy to conduct the enquiry at low costs.
2·1·6. Types of Enquiry. Another important point one has to bear in mind before embarking upon the process of collection of data is to decide about the type of enquiry. The statistical enquiries may be of different types as outlined below :
(i) Official, Semi-official or Un-official.
(ii) Initial or Repetitive.
(iii) Confidential or Non-confidential.
(iv) Direct or Indirect.
(v) Regular or Ad-hoc.
(vi) Census or Sample.
(vii) Primary or Secondary.
(i) Official, Semi-official or Un-official Enquiry.
A very important factor in the collection of data is ‘the sponsoring agency of the survey or enquiry’. If an enquiry is conducted by or on behalf of the central, state or local governments it is termed as official enquiry. A semi-official enquiry is one that is conducted by organisations enjoying government patronage like the Indian Council of Agricultural Research (I.C.A.R.), New Delhi; Indian Agricultural Statistics Research Institute (I.A.S.R.I.), New Delhi; Indian Statistical Institute (I.S.I.), Calcutta and New Delhi; and so on. An un-official enquiry is one which is sponsored by private institutions like the F.I.C.C.I., trade unions, universities or the individuals. Obviously the facilities available for each type of the above enquiries differ considerably. In an official enquiry, legal or statutory compulsions can be exercised asking the public or respondents to furnish the requisite information in time and that too at their own cost. In semi-official type enquiries also, the necessary information may be obtained without much difficulty. However, in un-official enquiries the investigator is faced with serious problems in getting information from the respondents. He can only persuade and request them for information. In such enquiries there is only moral obligation and no legal compulsion on the respondents. Things are still worse if the enquiry is conducted by an individual who, at stages, has even to beg for information. Moreover, there is a lot of difference in the financial positions of these three sponsoring agencies. Obviously, the state or central governments can afford to spend much more on an enquiry as compared with private institutions, which in turn, can generally spend more than an individual. Consequently, there is bound to be difference in the types of enquiries depending upon the sponsoring agency and also its financial implications.
(ii) Initial or Repetitive Enquiry.
As the name suggests, an initial or original enquiry is one which is conducted for the first time while a repetitive enquiry is one which is carried on in continuation or repetition of some previously conducted enquiry (enquiries). In conducting an original enquiry the entire scheme of the plan starting with definitions of various terms, the units, the method of collection, etc., has to be formulated afresh whereas in repetitive enquiry there is no such problem as such a plan already exists and only the original enquiry is to be modified to suit the current situation and on the basis of the experience gained in the past enquiry. However, for making valid conclusions in a repetitive enquiry, it should be ascertained that there is not any material change in the definitions of various terms used in the original enquiry.
(iii) Confidential or Non-confidential Enquiry. In a confidential enquiry, the information collected and the results obtained are kept confidential and they are not made known to the public. The findings of such enquiries are meant only for the personal records of the sponsoring agency. The enquiries conducted by private organisations like trade unions, manufacturers’ associations, private business concerns, are usually of confidential nature. On the other hand, the types of enquiries whose results are published and made known to the general public are termed as non-confidential enquiries. Most of the enquiries conducted by the state, private bodies or even individuals are of this type.
(iv) Direct or Indirect Enquiry. An enquiry is termed as direct if the phenomenon under study is capable of quantitative measurement such as age, weight, income, prices, quantities consumed and so on. However, if the phenomenon under study is of a qualitative nature which is not capable of quantitative measurement like honesty, beauty, intelligence, etc., the corresponding enquiry is termed as indirect one.
In such an enquiry, the qualitative characteristic is converted into quantitative phenomenon by assigning an appropriate standard which may represent the given attribute (qualitative phenomenon) indirectly. For example, the study of the attribute of intelligence may be made through the Intelligence Quotient (I.Q.) score of a group of individuals in a given test.
(v) Regular or Ad-hoc Enquiry. If the enquiry is conducted periodically at equal intervals of time (monthly, quarterly, yearly, etc.), it is said to be a regular enquiry. For example, the census is conducted in India periodically every 10 years. Similarly a number of enquiries are conducted by the Central Statistical Organisation (C.S.O.) and their results are published periodically such as Monthly Abstract of Statistics, Monthly Statistics of Production of Selected Industries of India, Statistical Abstract, India (Annually) ; Statistical Pocket Book, India (Annually). On the other hand, if an enquiry is conducted as and when necessary without any regularity or periodicity, it is termed as ad-hoc. For instance C.S.O. and N.S.S.O. (National Sample Survey Organisation) conduct a number of ad-hoc enquiries.
(vi) and (vii). Enquiries of type (vi) viz., Census or Sample* and (vii) Primary or Secondary have been discussed later in this chapter.
In any statistical enquiry, after deciding about the factors or problems enumerated above from § 2·1·1 to § 2·1·6, we are now all set for the process of actual collection of data relating to the given enquiry. In the following sections, we will discuss the methods of data collection.
2·2. PRIMARY AND SECONDARY DATA
After going through the preliminaries discussed in the above section, we come to the problem of data collection. The most important factor in any statistical enquiry is that the original collection of data is correct and proper.
If there are inadequacies, shortcomings or pitfalls at the very source of the data, no useful and valid conclusions can be drawn even after applying the best and sophisticated techniques of data analysis and presentation of the results. In this context, it may be interesting to quote the remarks made by a judge on Indian Statistics. “Cox, when you are a bit older you will not quote Indian statistics with that assurance. The governments are very keen on amassing statistics; they collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But what you must never forget is that every one of those figures comes in the first instance from the ‘Chowkidar’ (i.e., the village watchman) who just puts down what he damn pleases.”** It may be remarked that this quotation applies to India of very old days when no definite statistical set up existed in India. Today in India, we have a fairly sound and systematic method of data collection on almost all problems relating to various diversified fields such as economics, business, industry, demography, social, physical and natural sciences.
As already pointed out, the data may be obtained from the following two sources : (i) The investigator or the organising agency may conduct the enquiry originally or (ii) He may obtain the necessary data for his enquiry from some other sources (or agencies) who had already collected the data on that subject.
* For details see § 2·1·4 and Chapter 15.
** The earliest use of this story seems to have been made in Sir Josiah Stamp, ‘Some Economic Factors in Modern Life’, P.S. King and Son, London, 1929, p. 258-259.
The data which are originally collected by an investigator or agency for the first time for any statistical investigation and used by them in the statistical analysis are termed as primary data. On the other hand, the
COLLECTION OF DATA 2·7 data (published or unpublished) which have already been collected and processed by some agency or person and taken over from there and used by any other agency for their statistical work are termed as secondary data as far as second agency is concerned. The second agency if and when it publishes and files such data becomes the secondary source to any one who later uses these data. In other words secondary source is the agency who publishes or releases for use by others the data which was not originally collected and processed by it. It may be observed that the distinction between primary and secondary data is a matter of degree or relativity only. The same set of data may be secondary in the hands of one and primary in the hands of others. In general, the data are primary to the source who collects and processes them for the first time and are secondary for all sources who later use such data. For instance, the data relating to mortality (death rates) and fertility (birth rates) in India published by the Office of Registrar General of India, New Delhi are primary whereas the same reproduced by the United Nations Organisation (U.N.O.) in its U.N. Statistical Abstract become secondary in a far as later agency (U.N.O.) is concerned. For this data, the office of Registrar General of India, is the primary source while U.N.O. is the secondary source. Likewise, the data collected by C.S.O. and N.S.S.O. for various surveys are primary as far as these departments are concerned but they become secondary if such data are used by other departments or organisations. 2·2·1. Choice Between Primary and Secondary Data. Obviously, there is lot of difference in the method of collection of primary and secondary data. 
In the case of primary data which is to be collected originally, the entire scheme of the plan starting with the definitions of various terms used, units to be employed, type of enquiry to be conducted, extent of accuracy aimed at, etc., is to be formulated whereas the collection of secondary data is in the form of mere compilation of the existing data. A proper choice between the type of data (primary or secondary) needed for any particular statistical investigation is to be made after taking into consideration the nature, objective and scope of the enquiry; the time and finances (money) at the disposal of the agency; the degree of precision aimed at and the status of the agency (whether government – state or central – or private institution or an individual). Remarks 1. In using the secondary data it is best to obtain the data from the primary source as far as possible. By doing so, we would at least save ourselves from the errors of transcription (if any) which might have inadvertently crept in the secondary source. Moreover, the primary source will also provide us with detailed discussion about the terminology used, statistical units employed, size of the sample and the technique of sampling (if the sample method was used), methods of data collection and analysis of results and we can ascertain ourselves if these suit our purpose. 2. It may be pointed out that today, in a large number of statistical enquiries secondary data are generally used because fairly reliable published data on a large number of diverse fields are now available in publications of the governments (state or centre), private organisations and research institutions, international agencies, periodicals and magazines, etc. In fact primary data are collected only if there do not exist any secondary data suited to the investigation under study. In some of the investigations both primary as well as secondary data may be used. 3. Internal and External Data. 
Some statisticians differentiate between primary and secondary data in the form of internal or external data. Internal data of an organisation (a business or economic concern or firm) are those which are collected by the organisation from its own internal operations like production, sales, profits, loans, imports and exports, capital employed, etc., and used by it for its own purposes. On the other hand, external data are those which are obtained from the publications of some other agencies like governments (central or state), international bodies, private research institutions, etc., for use by the given organisation.

2·3. METHODS OF COLLECTING PRIMARY DATA

The methods commonly used for the collection of primary data are enumerated below :
(i) Direct personal investigation.
(ii) Indirect oral interviews.
(iii) Information received through local agencies.
(iv) Mailed questionnaire method.
(v) Schedules sent through enumerators.

2·3·1. Direct Personal Investigation. This method consists in the collection of data personally by the investigator (organising agency) from the sources concerned. In other words, the investigator has to go to the field personally for making enquiries and soliciting information from the informants or respondents. This nature of investigation very much restricts the scope of the enquiry. Obviously, this technique is suited only if the enquiry is intensive rather than extensive. In other words, this method should be used only if the investigation is generally local – confined to a single locality, region or area. Since such investigations require the personal attention of the investigator, they are not suitable for extensive studies where the scope of investigation is very wide. Obviously, the information gathered from such investigation is original in nature.

Merits.
(i) The first-hand information obtained by the investigator himself is bound to be more reliable and accurate since the investigator can extract the correct information by removing the doubts, if any, in the minds of the respondents regarding certain questions. In case the investigator suspects foul play on the part of the respondent(s) in supplying wrong information on certain items, he can check it by some intelligent cross-questioning.
(ii) The data obtained from such investigation are generally reliable if the type of enquiry is intensive in nature and if time and money do not pose any problems for the investigator.
(iii) When the audience is approached personally by the investigator, the response is likely to be more encouraging.
(iv) Different persons have their own ideas, likes and dislikes and their opinions on some of the questions may be coloured by their own prejudices and vision and as such some of them might react very sharply to certain sensitive questions posed to them. The investigator, being on the spot, can handle such a delicate situation creditably and effectively by his skill, intelligence and insight either by changing the topic or, if need be, by explaining to the respondent in polite words the objectives of the survey in detail.
(v) The investigator can extract proper information from the respondents by talking to them at their educational level and, if need be, asking them questions in their language of communication and using local connotations, if any, for the words used.

Demerits. (i) As already pointed out, this type of investigation is restrictive in nature and is suited only for intensive studies and not for extensive enquiries. This method is thus not suitable if the field of investigation is too wide in terms of the number of persons to be interviewed or the area to be covered.
(ii) This type of investigation is handicapped due to lack of time, money and manpower (labour).
It is particularly time consuming since the informants can be approached only at their convenience and, in the case of the working class, this restricts the contact of the investigator with the informants to the evenings or the weekends, and consequently the investigation has to be spread over a long period.
(iii) The greatest drawback of this enquiry is that it is absolutely subjective in nature. The success of the investigation largely depends upon the intelligence, skill, tact, insight, diplomacy and courage of the investigator. If the investigator lacks these qualities and is not properly trained, the results of the enquiry cannot be taken as satisfactory or reliable. Moreover, the personal biases, prejudices and whims of the investigator may, in certain cases, adversely affect the findings of the enquiry.
(iv) Further, if the investigator is not intelligent, tactful or skilful enough to understand the psychology and customs of the audience being interviewed, the results obtained from such an investigation will not be reliable.

2·3·2. Indirect Oral Investigation. When the ‘direct personal investigation’ is not practicable either because of the unwillingness or reluctance of the persons to furnish the requisite information or due to the extensive nature of the enquiry or due to the fact that direct sources of information do not exist or are unreliable, an indirect oral investigation is carried out. For example, if we want to solicit information on certain social evils, such as whether a person is addicted to drinking, gambling or smoking, etc., the person will be reluctant to furnish correct information or he may give wrong information. The information on the gambling, drinking or smoking habits of an individual can best be obtained by interviewing his personal friends, relatives or neighbours who know him thoroughly well.
In these types of enquiries factual data on different problems are collected by interviewing persons who are directly or indirectly concerned with the subject matter of the enquiry and who are in possession of the requisite information. The method consists in the collection of the data through enumerators appointed for this purpose. A small list of questions pertaining to the subject matter of the enquiry is prepared. These questions are then put to the persons, known as witnesses or informants, who are in possession of such information and their replies are recorded. Such a procedure for the collection of factual data on different problems is usually adopted by the Enquiry Committees or Commissions appointed by the government (State or Central).

Merits. (i) Since the enumerators contact the informants personally, as discussed in the first method, they can exercise their intelligence, skill, tact, etc., to extract correct and relevant information by cross-examination of the informants, if necessary.
(ii) As compared with the method of ‘direct personal investigation’, this method is less expensive and requires less time for conducting the enquiry.
(iii) If necessary, the expert views and suggestions of the specialists on the given problem can be obtained in order to formulate and conduct the enquiry more effectively and efficiently.

Demerits. (i) Due to lack of direct supervision and personal touch the investigator (sponsoring agency) has to rely entirely on the information supplied by the enumerators. The success of the method lies in the intelligence, skill, insight and efficiency of the enumerators and also on the fact that they are honest persons with high integrity and without any selfish motives. It should be ascertained that the enumerators are properly trained and tactful enough to elicit proper and correct response from the informants.
Moreover, it should be seen that the personal biases due to the prejudices, and whims of the enumerators do not enter or at least they are minimised. (ii) The accuracy of the data collected and the inferences drawn, depend to a large extent on the nature and quality of the witnesses from whom the information is obtained. A wrong and improper choice of the witnesses will give biased results which may adversely affect the findings of the enquiry. It is, therefore, imperative : (a) To ascertain the reliability and integrity of the persons (witnesses) selected for interrogation. In other words, it should be ascertained that the witnesses are unbiased persons without any selfish motives and that they are not prejudiced in favour of or against a particular view point. (b) That the findings of the enquiry are not based on the information supplied by a single person alone. Rather, a sufficient number of persons should be interviewed to find out the real position. (c) That the witnesses really possess the knowledge about the problem under study i.e., they are aware of the full factors of the problem under investigation and are in a position to give a clear, detailed and correct account of the problem. (d) That a proper allowance about the pessimism or optimism of the witnesses depending upon their inherent psychology should be made. 2·3·3. Information Received Through Local Agencies. In this method the information is not collected formally by the investigator or the enumerators. This method consists in the appointment of local agents (commonly called correspondents) by the investigator in different parts of field of enquiry. These correspondents or agencies in different regions collect the information according to their own ways, fashions, likings and decisions and then submit their reports periodically to the central or head office where the data are processed for final analysis. 
This technique of data collection is usually employed by newspaper or periodical agencies who require information in different fields like sports, riots, strikes, accidents, economic trends, business, stock and share market, policies and so on. This method is also used by the various departments of the government (state or central) where the information is desired periodically (at regular intervals of time) from a wide area. This method is particularly useful in obtaining the estimates of agricultural crops which may be submitted to the government by the village school teachers. A more refined and sophisticated way of using this technique is the registration method in which any event, say, birth, death, incidence of disease, etc., is to be reported to the appropriate authority appointed by the government, like the Sarpanch or Patwari in the village, or Block Development Officers (B.D.O.’s), civil hospitals or the health departments in the district headquarters, etc., as and when or immediately after it occurs. Vital statistics, i.e., the data relating to mortality (deaths) and fertility (births), are usually collected in India through the registration technique.

Merits. This method works out to be very cheap and economical for extensive investigations, particularly if the data are obtained through part-time correspondents or agents. Moreover, the required information can be obtained expeditiously since only rough estimates are required.

Demerits. Since the different correspondents collect the information in their own fashion and style, the results are bound to be biased due to the personal prejudices and whims of the correspondents in different fields of the enquiry and consequently the data so obtained will not be very reliable. Hence, this technique of data collection is suited only if the purpose of investigation is to obtain rough and approximate estimates and where a high degree of accuracy is not desired.
In particular, the registration method suffers from the drawback that many persons do not report the events and thus neglect to register them. This usually results in under-estimation. For an effective and efficient system of registration, there should be legal compulsion for the registration of events and also sanctions for the enforcement of the obligation.

2·3·4. Mailed Questionnaire Method. This method consists in preparing a questionnaire (a list of questions relating to the field of enquiry, with space provided for the answers to be filled in by the respondents) which is mailed to the respondents with a request for quick response within the specified time. A very polite covering note, explaining in detail the aims and objectives of collecting the information and also the operational definitions of various terms and concepts used in the questionnaire, is attached. Respondents are also requested to extend their full co-operation by furnishing the correct replies and returning the questionnaire, duly filled in, in time. Respondents are also taken into confidence by assuring them that the information supplied by them in the questionnaire will be kept strictly confidential and secret. In order to ensure quick and better response the return postage expenses are usually borne by the investigator by sending a self-addressed stamped envelope. This method is usually used by research workers, private individuals, non-official agencies and sometimes even by the government (central or state).

In this method, the questionnaire is the only medium of communication between the investigator and the respondents. Consequently, the most important factor for the success of the ‘mailed questionnaire method’ is the skill, efficiency, care and wisdom with which the questionnaire is framed. The questions asked should be clear, brief, corroborative, non-offending, courteous in tone, unambiguous and to the point so that not much scope for guessing is left on the part of the respondent.
Moreover, while framing the questions the knowledge, understanding and general educational level of the respondents should be taken into consideration.

Remark. “Drafting or framing the questionnaire” is of paramount importance and is discussed in detail in § 2·4 after ‘Schedules sent through enumerators’.

Merits. (i) Of all the methods of collecting information, the ‘mailed questionnaire method’ is by far the most economical method in terms of time, money and manpower (labour), provided the respondents supply the information in time.
(ii) This method is used for extensive enquiries covering a very wide area.
(iii) Errors due to the personal biases of the investigators or enumerators are completely eliminated as the information is supplied directly by the person concerned in his own handwriting. The information so obtained is original and much more authentic.

Demerits. (i) The most serious drawback of this method is that it can be used effectively only if the respondents are educated (literate) and can understand the questions well and reply to them in their own handwriting. Obviously, this method is not practicable if the people are illiterate. Even if they are educated, there may be a number of persons who are not interested in the particular enquiry being conducted and as such they adopt an attitude of indifference towards the enquiry, which results in their questionnaires finding a place in the waste paper baskets. Of those who return the questionnaires after filling them in, a number supply haphazard, vague, incomplete and unintelligible information which does not serve much purpose. Thus, this method generally suffers from a high degree (i.e., a very large proportion) of non-response and consequently the results based on the information supplied by a very small proportion of the selected individuals cannot be regarded as reliable.
(ii) Quite often people might suppress correct information and furnish wrong replies.
We cannot verify the accuracy and reliability of the information received. In general, this method also suffers from the low degree of reliability of the information supplied by the respondents.
(iii) Another limitation of this method is that at times, informants are not willing to give written information in their own handwriting on certain personal questions like income, property, personal habits and so on.
(iv) Since the questionnaires are filled in by the respondents personally, there is no scope for asking supplementary questions for cross-checking the information supplied by them. Moreover, the doubts in the minds of the informants, if any, on certain questions cannot be dispelled.

2·3·5. Schedules Sent Through Enumerators. Before discussing this method it is desirable to make a distinction between a questionnaire and a schedule. As already explained, a questionnaire is a list of questions which are answered by the respondent himself in his own handwriting, while a schedule is the device of obtaining answers to the questions in a form which is filled in by the interviewers or enumerators (the field agents who put these questions) in a face-to-face situation with the respondents. The most widely used method of collection of primary data is the ‘schedules sent through the enumerators’. This is so because this method is free from certain shortcomings inherent in the methods discussed so far. In this method the enumerators go to the respondents personally with the schedule (list of questions), ask them the questions therein and record their replies. This method is generally used by big business houses, large public enterprises and research institutions like the National Council of Applied Economic Research (NCAER), the Federation of Indian Chambers of Commerce and Industry (FICCI) and so on, and even by the governments – state or central – for certain projects and investigations where a high degree of response is desired.
Population censuses all over the world are conducted by this technique.

Merits. (i) The enumerators can explain in detail the objectives and aims of the enquiry to the informants and impress upon them the need and utility of furnishing the correct information. Being on the spot, the enumerators can dispel the doubts, if any, of certain people about certain questions by explaining to them the implications of certain definitions and concepts used in the questionnaire.
(ii) This technique is very useful in extensive enquiries and generally yields fairly dependable and reliable results due to the fact that the information is recorded by highly trained and educated enumerators. Moreover, since the enumerators personally call on the respondents to obtain information there is very little non-response, which occurs only if it is not possible to contact the respondents even after repeated calls or if the respondent is unwilling to furnish the requisite information. Thus, this method removes both the drawbacks of the ‘mailed questionnaire method’, viz., a very large proportion of non-response and a fairly low degree of reliability of the information.
(iii) Unlike the ‘mailed questionnaire method’, this technique can be used with advantage even if the respondents are illiterate.
(iv) As already pointed out in the ‘direct personal investigation’, due to personal likes and dislikes, different people react differently to different questions and as such some people might react very sharply to certain sensitive and personal questions. In that case the enumerators, by their tact, skill, wisdom and calibre, can handle the situation very effectively by changing the topic of discussion, if need be. Moreover, the enumerators can effectively check the accuracy of the information supplied by some intelligent cross-questioning, i.e., by asking some supplementary questions.

Demerits.
(i) It is a fairly expensive method since the team of enumerators has to be paid for their services, and as such it can be used only by those bodies or institutions which are financially sound.
(ii) It is also more time consuming as compared with the ‘mailed questionnaire method’.
(iii) The success of the method largely depends upon the efficiency and skill of the enumerators who collect the information. Thus the choice of enumerators is of paramount importance. The enumerators have to be trained properly in the art of collecting correct information by their intelligence, insight, patience and perseverance, diplomacy and courage. They should clearly understand the aims and objectives of the enquiry and also the implications of the various terms, definitions and concepts used in the questionnaire. They should be provided with adequate guidelines so that their personal biases do not enter the final results of the enquiry. They should be honest persons with high integrity and should not have any personal axe to grind. They should be well versed in the local language, customs and traditions. If the enumerators are biased they may suppress or even twist the information supplied by the respondents. Inefficiency on the part of the enumerators coupled with personal biases due to their prejudices and whims will lead to false conclusions and may even adversely affect the results of the enquiry.
(iv) Due to inherent variation in the individual personalities of the enumerators there is bound to be variation, though not so obvious, in the information recorded by different enumerators. An attempt should be made to minimise this variation.
(v) The success of this method also depends to a great extent on the efficiency and wisdom with which the schedule is prepared or drafted. If the schedule is framed haphazardly and incompetently, the enumerators will find it very difficult to get the complete and correct desired information from the respondents.

Remarks. 1.
In the last two methods, viz., the ‘mailed questionnaire method’ and the ‘schedules sent through enumerators’, it is desirable to scrutinise the questionnaires or schedules duly filled in for detecting any apparent inconsistency in the information supplied by the respondents or recorded by the enumerators.
2. If resources (time, money and manpower) permit, two sets of enumerators may be used for recording information for the enquiry under investigation and their findings may be compared. This will, incidentally, provide a check on the honesty and integrity of the enumerators and will also reflect upon personal bias due to the prejudices and whims of the individual personalities (of the enumerators). However, this technique is not practicable in the case of interviewing individuals, who might get irritated, annoyed or confused when approached for the second time.

2·4. DRAFTING OR FRAMING THE QUESTIONNAIRE

As has been pointed out earlier, the questionnaire is the only medium of communication between the investigator and the respondents and as such the questionnaire should be designed or drafted with utmost care and caution so that all the relevant and essential information for the enquiry may be collected without any difficulty, ambiguity or vagueness. Drafting a good questionnaire is a highly specialised job and requires great care, skill, wisdom, efficiency and experience. No hard and fast rules can be laid down for designing or framing a questionnaire. However, in this connection, the following general points may be borne in mind :

1. The size of the questionnaire should be as small as possible. The number of questions should be restricted to the minimum, keeping in view the nature, objectives and scope of the enquiry. In other words, the questionnaire should be concise and should contain only those questions which would furnish all the necessary information relevant for the purpose.
Respondents’ time should not be wasted by asking irrelevant and unimportant questions. A large number of questions would involve more work for the investigator and thus result in delay on his part in collecting and submitting the information. These may, in addition, also unnecessarily annoy or tire the respondents. A reasonable questionnaire should contain from 15 to 25 questions. If a still larger number of questions is a must in any enquiry, then the questionnaire should be divided into various sections or parts.

2. The questions should be clear, brief, unambiguous, non-offending, courteous in tone, corroborative in nature and to the point so that not much scope for guessing is left on the part of the respondents.

3. The questions should be arranged in a natural logical sequence. For example, to find if a person owns a refrigerator the logical order of questions would be :
Do you own a refrigerator ?
When did you buy it ?
What is its make ?
How much did it cost you ?
Is its performance satisfactory ?
Have you ever got it serviced ?
The logical arrangement of questions, in addition to facilitating tabulation work, would leave no chance for omissions or duplication.

4. The usage of vague and ‘multiple meaning’ words should be avoided. Vague words like good, bad, efficient, sufficient, prosperity, rarely, frequently, reasonable, poor, rich, etc., should not be used since these may be interpreted differently by different persons and as such might give unreliable and misleading information. Similarly, words with multiple meanings like price, assets, capital, income, household, democracy, socialism, etc., should not be used unless a clarification of these terms is given in the questionnaire.

5. Questions should be so designed that they are readily comprehensible and easy to answer for the respondents. They should not be tedious nor should they tax the respondents’ memory.
Further, questions involving mathematical calculations like percentages, ratios, etc., should not be asked.

6. Questions of a sensitive and personal nature should be avoided. Questions like ‘How much money do you owe to private parties ?’ or ‘Do you clean your utensils yourself ?’ which might hurt the sentiments, pride or prestige of an individual should not be asked, as far as possible. It is also advisable to avoid questions on which the respondent may be reluctant or unwilling to furnish information. For example, the questions pertaining to income, savings, habits, addiction to social evils, age (particularly, in case of ladies), etc., should be asked very tactfully.

7. Types of Questions. Under this head, the questions in the questionnaire may be broadly classified as follows :

(a) Shut Questions. In such questions possible answers are suggested by the framers of the questionnaire and the respondent is required to tick one of them. Shut questions can further be sub-divided into the following forms :

(i) Simple Alternate Questions. In such questions, the respondent has to choose between two clear cut alternatives like ‘Yes or No’ ; ‘Right or Wrong’ ; ‘Either, Or’ and so on. For instance, ‘Do you own a refrigerator ?’ : Yes or No. Such questions are also called dichotomous questions. This technique can be applied with elegance to situations where two clear cut alternatives exist.

(ii) Multiple Choice Questions. Quite often, it is not possible to define a clear cut alternative and accordingly in such a situation either the first method (Alternate Questions) is not used or additional answers between Yes and No like Do not know, No opinion, Occasionally, Casually, Seldom, etc., are added. For instance, to find if a person smokes or drinks, the following multiple choice answers may be used :
Do you smoke ?
Yes (Regularly) ■   No (Never) ■   Occasionally ■   Seldom ■

Similarly, to get information regarding the mode of cooking in a household, the following multiple choice answers may be suggested.
Which of the following modes of cooking do you use ?
Gas ■   Coal (Coke) ■   Power (Electricity) ■   Wood ■   Stove (Kerosene) ■

As another illustration, to find what conveyance an individual uses to go from his house to the place of his duty, the following question with multiple answers may be framed :
How do you go to your place of duty ?
By bus ■
By three wheeler scooter ■
By your own cycle ■
By taxi ■
By your own scooter/Motor cycle ■
On foot ■
By your own car ■
Any other ■

Multiple choice questions are very easy and convenient for the respondents to answer. Such questions save time and also facilitate tabulation. This method should be used only if a selected few alternative answers exist to a particular question. Sometimes, a last alternative under the category ‘Others’ or ‘Any other’ may be added. However, multiple answer questions cannot be used with advantage if it is possible to construct a fairly large number of alternative answers of relatively equal importance to a given question.

(b) Open Questions. Open questions are those in which no alternative answers are suggested and the respondents are at liberty to express their frank and independent opinions on the problem in their own words. For instance, ‘What are the drawbacks in our examination system ?’ ; ‘What solution do you suggest to the housing problem in Delhi ?’ ; ‘Which programme on Delhi TV do you like best ?’ are some open questions. Since the views of the respondents in the open questions might differ widely, it is very difficult to tabulate the diverse opinions and responses.

Remark. Sometimes a combination of both shut questions and open questions might be used.
For instance, an open question, ‘When did you buy the car ?’, may be followed by a multiple choice question as to whether its performance is (i) extremely good, (ii) satisfactory, (iii) poor, (iv) needs improvement.
8. Leading questions should be avoided. For example, the question ‘Why do you use a particular brand of blades, say, Erasmic blades ?’ should preferably be framed into two questions :
(i) Which blade do you use ?
(ii) Why do you prefer it ?
Gives a smooth shave ■   Gives more shaves ■   Price is less (Cheaper) ■   Readily available in the market ■   Any other ■
9. Cross Checks. The questionnaire should be so designed as to provide internal checks on the accuracy of the information supplied by the respondents by including some connected questions, at least with respect to matters which are fundamental to the enquiry. For example, in a social survey for finding the age of the mother, the question ‘What is your age ?’ can be supplemented by additional questions ‘What is your date of birth ?’ or ‘What is the age of your eldest child ?’ Similarly, the question ‘Age at marriage’ can be supplemented by the question ‘The age of the first child’.
10. Pre-testing the Questionnaire. From a practical point of view it is desirable to try out the questionnaire on a small scale (i.e., on a small cross-section of the population for which the enquiry is intended) before using it for the given enquiry on a large scale. This testing on a small scale (called pre-test) has been found to be extremely useful in practice. The given questionnaire can be improved or modified in the light of the drawbacks, shortcomings and problems faced by the investigator in the pre-test. Pre-testing also helps to decide upon the effective methods of asking questions for soliciting the requisite information.
11. A Covering Letter.
A covering letter from the organisers of the enquiry should be enclosed along with the questionnaire for the following purposes :
(i) It should clearly explain in brief the objectives and scope of the survey to evoke the interest of the respondents and impress upon them to render their full co-operation by returning the schedule/questionnaire duly filled in within the specified period.
(ii) It should contain a note regarding the operational definitions of the various terms and concepts used in the questionnaire, the units of measurement to be used and the degree of accuracy aimed at.
(iii) It should take the respondents into confidence and assure them that the information furnished by them will be kept completely secret and that they will not be harassed in any way later.
(iv) In the case of the mailed questionnaire method, a self-addressed stamped envelope should be enclosed to enable the respondents to return the questionnaire after completing it.
(v) To ensure quick and better response, the respondents may be offered awards/incentives in the form of free gifts, coupons, etc.
(vi) A copy of the survey report may be promised to the interested respondents.
12. Mode of tabulation and analysis, viz., hand operated, machine tabulation or computerisation, should also be kept in mind while designing the questionnaire.
13. Lastly, the questionnaire should be made attractive by proper layout and appealing get up.
We give below two specimen questionnaires for illustration.
MODEL I
Questionnaire for Collecting Information (Covering Production, Employment etc.) Relating to an Industrial Concern
1. Name of the concern . . . . . .
2. (a) Name of the Proprietor/Managing Director . . . . . .
(b) Qualifications : (i) Academic . . . . . . (ii) Technical/Professional . . . . . .
3. (a) Location : (i) Factory . . . . . . (ii) Office . . . . . .
(b) Telephone Number : (i) Factory . . . . . . (ii) Office . . . . . .
(c) e-mail Address : (i) Factory . . . . . . (ii) Office . . . . . .
4.
Factory Registration Number (with date) . . . . . .
5. Total Capital employed/Assets (approximately) . . . . . .
6. Number of Shifts . . . . . .
7. Whether the machinery used is indigenous ? Yes ■ No ■ Other ■
8. What is the approximate value of the imported machinery ? . . . . . .
9. Whether the raw material used is available in domestic market ? Yes ■ No ■
10. From which country is the raw material imported ? . . . . . .
11. What is the approximate annual consumption of the raw material ? . . . . . .
12. Expenditure in foreign currency : (i) Foreign Travel . . . . . . ; (ii) Technical know-how . . . . . . ; (iii) Material and goods . . . . . .
13. Employment (Pay-roll) :
S. No. | Categories | Working hours per week | No. employed, 2001 | No. employed, 2002 | Salaries paid in ’000 Rs., 2001 | Salaries paid in ’000 Rs., 2002
(i) | Management | | | | |
(ii) | Supervisory/Technical Personnel | | | | |
(iii) | Skilled workers | | | | |
(iv) | Unskilled workers | | | | |
(v) | Non-technical office staff | | | | |
14. Production :
S. No. | Items (Production) | Installed Capacity | Actual Production, 2001 | Actual Production, 2002
15. Market :
(a) Gross value of sales (’000 Rs.) : 2001 . . . . . . ; 2002 . . . . . .
(b) Who are : (i) Immediate purchasers . . . . . . (ii) End users . . . . . .
(c) The extent of market is : Local ■ National ■ International ■
(d) If the extent of the market is international :
(i) What are the approximate foreign exchange earnings (annually) ?
(ii) Which countries are the chief importers of the product ?
(e) Total Sales (’000 Rs.) : National market : . . . . . . ; International market : . . . . . .
(f) Are the present conditions of the market satisfactory/not satisfactory/poor ? . . . . . .
16. Financial Highlights (Rupees ’000s) :
Items | 2001 | 2002
Sales | |
Profits | |
Dividends | |
Capital expenditure | |
Fixed assets | |
Shareholders’ funds | |
MODEL II
We give below the 1971 Census – Individual Slip which was used for a general purpose survey to collect :
(i) Social and cultural data like nationality, religion, literacy, mother tongue, etc. ;
(ii) Exhaustive economic data like occupation, industry, class of worker and activity, if not working ;
(iii) Demographic data like relation to the head of the house, sex, age, marital status, birth place, births and deaths and the fertility of women to assess in particular the performance of the family planning programme.
1971 CENSUS – INDIVIDUAL SLIP
1. Name . . . . . .
2. Relationship to the head of the family . . . . . .
3. Sex . . . . . . ; 4. Age . . . . . . ; 5. Marital status . . . . . .
6. For currently married women only : (a) Age at marriage . . . . . . (b) Any child born in the last one year . . . . . .
7. Birth place : (i) Place of birth . . . . . . (ii) Rural or urban . . . . . . (iii) District . . . . . . (iv) State/country . . . . . .
8. Last Residence : (i) Place of last residence . . . . . . (ii) Rural/urban . . . . . . (iii) District . . . . . . (iv) State/country . . . . . .
9. Duration of present residence . . . . . . 10. Religion . . . . . .
11. Scheduled Caste or Tribe . . . . . . 12. Literacy . . . . . .
13. Educational level . . . . . . 14. Mother tongue . . . . . .
15. Other languages, if any . . . . . .
16. Main activity : (a) Broad category : (i) Worker (C, AL, HHI, OW)* (ii) Non-worker (H, ST, R, D.B.I.O.)** (b) Place of work (Name of Village/Town) . . . . . . (c) Name of establishment . . . . . . (d) Name of Industry, Trade, Profession or Service . . . . . . (e) Description of work . . . . . . (f) Class of worker . . . . . .
17. Secondary work : (a) Broad category (C, AL, HHI, OW) (b) Place of work . . . . . . (c) Name of establishment . . . . . . (d) Nature of Industry, Trade, Profession or Service . . . . . . (e) Description of work . . . . . . (f) Class of worker . . . . . .
* C : Cultivator ; AL : Agricultural Labour ; HHI : Household Industries ; OW : Other Works.
** H : Household Duties ; ST : Student ; R : Retired person or Rentier ; D.B.I.O. : Dependent, Beggar, Institutions, Others.
2·5. SOURCES OF SECONDARY DATA
The chief sources of secondary data may be broadly classified into the following two groups : (i) Published sources. (ii) Unpublished sources.
2·5·1. Published Sources. There are a number of national (government, semi-government and private) organisations and also international agencies which collect statistical data relating to business, trade, labour, prices, consumption, production, industries, agriculture, income, currency and exchange, health, population and a number of socio-economic phenomena and publish their findings in statistical reports on a regular basis (monthly, quarterly, annually, ad-hoc). These publications of the various organisations serve as a very powerful source of secondary data. We give below a brief summary of these sources.
1. Official Publications of Central Government. The following are various government organisations along with the year of their establishment [given in bracket ( )] which collect, compile and publish statistical data on a number of topics of current interest – prices, wages, population, production and consumption, labour, trade, army, etc.
(1) Office of the Registrar General and Census Commissioner of India, New Delhi (1949).***
(2) Directorate-General of Commercial Intelligence and Statistics – Ministry of Commerce (1895).
(3) Labour Bureau – Ministry of Labour (1946).
(4) Directorate of Economics and Statistics – Ministry of Agriculture and Irrigation (1948).
(5) The Indian Army Statistical Organisation (I.A.S.O.) – Ministry of Defence (1947).
(6) National Sample Survey Organisation (N.S.S.O.), Department of Statistics, Ministry of Planning (1950).****
(7) Central Statistical Organisation (C.S.O.) – Department of Statistics, Ministry of Planning (1951).
Some of the main publications of the above government agencies are :
(a) Monthly Abstract of Statistics ; Monthly Statistics of Production of Selected Industries in India ; Statistical Pocket Book, India ; Annual Survey of Industries – General Review ; Sample Surveys of Current Interest in India (all published annually) ; Statistical Systems of India ; National Income Statistics – Estimates of Savings in India (1960-61 to 1965-66) ; National Income Statistics – Estimates of Capital Formation in India (1960-61 to 1965-66) (Ad-hoc publications) ; all published by the Central Statistical Organisation (C.S.O.), New Delhi.
(b) Census data in various census reports ; Vital Statistics of India (Annual), Indian Population Bulletin (Biennial) – all published by Registrar-General of India (R.G.I.).
(c) Various statistical reports on phenomena relating to socio-economic and demographic conditions, prices, area and yield of different crops, as a result of the various surveys conducted in different rounds by National Sample Survey Organisation (N.S.S.O.).
In addition to the above organisations, a number of departments in the State and Central Governments like Income Tax Department, Directorate General of Supplies and Disposals, Railways, Post and Telegraphs, Central Board of Revenues, Textile Commissioner’s Office, Central Excise Commissioner’s Office, Iron and Steel Controller’s Office and so on, publish statistical reports on current problems, and the information supplied by them is, in general, more authentic and reliable than that obtained from other sources on the same subject.
2. Publications of Semi-Government Statistical Organisations.
Very useful information is provided by the publications of the semi-government statistical organisations, some of which are enumerated below :
(i) Statistics department of the Reserve Bank of India (Mumbai), which brings out an Annual Report of the Bank, Currency and Finance ; Reserve Bank of India Bulletin (monthly) and various monthly and quarterly reports.
(ii) Economic department of Reserve Bank of India.
(iii) The Institute of Economic Growth, Delhi.
(iv) Gokhale Institute of Politics and Economics, Poona.
(v) The Institute of Foreign Trade, New Delhi.
Moreover, the statistical material published by institutions like Municipal and District Boards, Corporations, Block and Panchayat Samitis on Vital Statistics (births and deaths), health, sanitation and other related subjects provides fairly reliable and useful information.
*** In India, the census has been carried out every ten years since 1881. Prior to the establishment of this organisation, the census was conducted by a temporary cell in the Ministry of Home Affairs.
**** National Sample Survey (N.S.S.) was set up in 1950 in the Ministry of Finance. In 1957, N.S.S. was transferred to the Cabinet Secretariat and named National Sample Survey Organisation (N.S.S.O.).
3. Publications of Research Institutions. Individual research scholars, the different departments in the various universities of India and various research organisations and institutes like Indian Statistical Institute (I.S.I.), Kolkata and Delhi ; Indian Council of Agricultural Research (I.C.A.R.), New Delhi ; Indian Agricultural Statistics Research Institute (I.A.S.R.I.), New Delhi ; National Council of Educational Research and Training (N.C.E.R.T.), New Delhi ; National Council of Applied Economic Research, New Delhi ; The Institute of Applied Man Power Research, New Delhi ; The Institute of Labour Research, Mumbai ; Indian Standards Institute, New Delhi ; and so on publish the findings of their research programmes in the form of research papers, monographs or journals, which are a constant source of secondary data on the subjects concerned.
4. Publications of Commercial and Financial Institutions. A number of private commercial and trade institutions like Federation of Indian Chambers of Commerce and Industry (FICCI), Institute of Chartered Accountants of India, Trade Unions, Stock Exchanges, Bank Bodies, Co-operative Societies, etc., publish reports and statistical material on current economic, business and other phenomena.
5. Reports of Various Committees and Commissions appointed by the Government. The reports of the survey and enquiry commissions and committees appointed by the Central and State Governments to obtain expert views on important matters relating to economic and social phenomena like wages, dearness allowance, prices, national income, taxation, land, education, etc., are an invaluable source of secondary information. Instances are the Simon-Kuznet Committee report on National Income in India, Wanchoo Commission report on Taxation, Kothari Commission report on Educational Reforms, Pay Commissions Reports, Land Reforms Committee report, Gupta Commission report on Maruti Affairs, etc.
6. Newspapers and Periodicals.
Statistical material on a number of important current socio-economic problems can be obtained from the numerical data collected and published by some reputed magazines, periodicals and newspapers like Eastern Economist, Economic Times, The Financial Express, Indian Journal of Economics, Commerce, Capital, Transport, Statesman’s Year Book and The Times of India Year Book, etc.
7. International Publications. The publications of a number of foreign governments or international agencies provide invaluable statistical information on a variety of important economic and current topics. The publications of the United Nations Organisation (U.N.O.) like U.N.O. Statistical Year Book, U.N. Statistical Abstract, Demographic Year Book, etc., and of its subsidiaries like World Health Organisation (W.H.O.) on contagious diseases ; annual reports of International Labour Organisation (I.L.O.) ; International Monetary Fund (I.M.F.) ; World Bank ; Economic and Social Commission for Asia and Pacific (ESCAP) ; International Finance Corporation (I.F.C.) ; International Statistical Education Institute and so on are highly valued sources of secondary data.
Remark. It may be pointed out that the various publications enumerated above vary as regards their periodicity. Some are published at regular intervals of time (such as weekly, monthly, quarterly or annually) whereas others are ad-hoc publications which do not have any specific periodicity.
2·5·2. Unpublished Sources. The statistical data need not always be published. There are various sources of unpublished statistical material such as the records maintained by private firms or business enterprises which may not like to release their data to any outside agency ; the various departments and offices of the Central and State Governments ; the researches carried out by individual research scholars in the universities or research institutes.
Remark.
In some of the socio-economic surveys the information is gathered from the respondents with the promise that it is exclusively meant for research programmes and will be kept strictly confidential. Such data are not published. In case they are published, it is done with a brief note, namely, “Source : Confidential.”
2·6. PRECAUTIONS IN THE USE OF SECONDARY DATA
Secondary data should be used with extra caution. Before using such data, the investigator must be satisfied regarding the reliability, accuracy, adequacy and suitability of the data to the given problem under investigation. Proper care should be taken to edit it so that it is free from inconsistencies, errors and omissions. In the words of L.R. Connor, “Statistics, especially other people’s statistics, are full of pitfalls for the user” and therefore, secondary data should not be used before subjecting it to a thorough and careful scrutiny. Prof. A.L. Bowley also remarks, “It is never safe to take the published statistics at their face value without knowing their meaning and limitations and it is always necessary to criticise the arguments that can be based upon them.”
In using secondary data we should take a special note of the following factors.
1. The Reliability of Data. In order to know about the reliability of the data, we should satisfy ourselves about : (i) the reliability, integrity and experience of the collecting organisation, (ii) the reliability of the source of information and (iii) the methods used for the collection and analysis of the data. It should be ascertained that the collecting agency was unbiased in the sense that it had no personal motives and that, right from the collection and compilation of the data to the presentation of results in the final form in the selected source, the data was thoroughly scrutinised and edited so as to make it free from errors as far as possible.
Moreover, it should also be verified that the data relates to normal times free from periods of economic boom or depression or calamities like famines, floods, earthquakes, wars, etc., and is still relevant for the purpose in hand. If the data were collected on the basis of a sample we should satisfy ourselves that :
(i) The sample was adequate (not too small).
(ii) It was representative of the characteristics of the population, i.e., it was selected by proper sampling technique.
(iii) The data were collected by trained, experienced and unbiased investigators under the proper supervisory checks on the field work so that sampling errors were minimised.
(iv) Proper estimation techniques were used for estimating the parameters of the population.
(v) The desired degree of accuracy was achieved by the compiler.
Remark. A source note, giving in detail the sources from which the data were obtained, is imperative for the validity of the secondary data since, to the learned users of statistics, the reputations of the sources may vary greatly from one agency to another.
2. The Suitability of Data. Even if the data are reliable in the sense discussed above, they should not be used without confirming that they are suitable for the purpose of the enquiry under investigation. For this, it is important :
(i) To observe and compare the objectives, nature and scope of the given enquiry with the original investigation.
(ii) To confirm that the various terms and units were clearly defined and uniform throughout the earlier investigation and that these definitions are suitable for the present enquiry also. For instance, a unit like household, wages, prices, farm, etc., may be defined in many different ways. If the units are defined in the original investigation differently from what we want, the secondary data will be termed as unsuitable for the present enquiry.
For example, if we want to construct the cost of living indices, it must be ensured that the original data relating to prices was obtained from retail shops, co-operative stores or super bazars and not from the wholesale market.
(iii) To take into account the difference in the timings of collection and homogeneity of conditions for the original enquiry and the investigation in hand.
3. Adequacy of Data. Even if the secondary data are reliable and suitable in terms of the discussion above, it may not be adequate for the purpose of the given enquiry. This happens when the coverage of the original enquiry was narrower or wider than what is desired in the current enquiry or, in other words, when the original data refers to an area or a period which is much larger or smaller than the required one. For instance, if the original data relate to the consumption pattern of the various commodities by the people of a particular State, say, Maharashtra, then it will be inadequate if we want to study the consumption pattern of the people for the whole country. Similarly, if the original data relate to yearly figures of a particular phenomenon, it will be inadequate if we are interested in the monthly study. This is so because of the fluctuations in the phenomenon in different regions or periods.
Another important factor in deciding about the adequacy of the available data for the given investigation is the time period for which the data are available. For example, if we are given the values of a particular phenomenon (say, profits of a business concern or production of a particular commodity) for the last 3-4 years, it will be inadequate for studying the trend pattern, for which the values for the last 8-10 years will be required.
Hence, in order to arrive at conclusions free from limitations and inaccuracies, the published data, i.e., the secondary data, must be subjected to thorough scrutiny and editing before it is accepted for use.
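The three adequacy questions discussed above (area covered, frequency of the figures, length of the period) can be written down as a simple checklist. The following is only a minimal sketch in Python under stated assumptions: the class `SecondarySource`, the function `adequacy_problems`, the field names and the sample values are all hypothetical and introduced here for illustration, and the threshold of 8 years is merely the lower end of the 8-10 years mentioned above.

```python
from dataclasses import dataclass


@dataclass
class SecondarySource:
    """Hypothetical description of an available secondary data series."""
    region: str            # area covered by the original enquiry
    frequency: str         # "yearly", "quarterly" or "monthly"
    years_available: int   # length of the series in years


# Assumption: finer figures can be summed up into coarser ones, not the reverse.
FREQUENCY_RANK = {"yearly": 0, "quarterly": 1, "monthly": 2}


def adequacy_problems(source, needed_region, needed_frequency, min_years):
    """Return the list of reasons why the source is inadequate (empty if none)."""
    problems = []
    if source.region != needed_region:
        problems.append("coverage")      # e.g. one State versus the whole country
    if FREQUENCY_RANK[source.frequency] < FREQUENCY_RANK[needed_frequency]:
        problems.append("frequency")     # e.g. yearly figures for a monthly study
    if source.years_available < min_years:
        problems.append("time span")     # e.g. 3-4 years for a trend study
    return problems


# Made-up example: yearly figures for one State over 4 years,
# wanted for a monthly, country-wide trend study.
state_series = SecondarySource(region="Maharashtra", frequency="yearly", years_available=4)
issues = adequacy_problems(state_series, "India", "monthly", min_years=8)
print(issues)
```

An empty list only means that none of these three checks failed; reliability and suitability still have to be judged separately, as described above.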
EXERCISE 2·1.
1. (a) What do you mean by a statistical enquiry ? Describe the main stages in a statistical enquiry.
(b) Describe the process of planning a statistical enquiry, with special reference to its scope and purpose, choice between sample and census approaches, accuracy and analysis of data.
2. If you are appointed to conduct a statistical enquiry, describe in general, what steps will you be taking from the stage of appointment till the presentation of your report.
3. (a) Distinguish between (i) primary and secondary data (ii) sampling and census method.
(b) Distinguish between primary data and secondary data and discuss the various methods of collecting primary data. [C.S. (Foundation), Dec. 2000]
(c) Explain the different methods of collecting primary data. (Punjab Univ. B.Com., 1996)
4. (a) What are the various methods of collecting statistical data ? Which of these is most reliable and why ?
(b) Describe the methods generally employed in the collection of statistical data, stating briefly their merits and demerits.
5. (a) Distinguish between Primary and Secondary data. Give a brief account of the chief methods of collecting Primary Data and bring out their merits and defects.
(b) Discuss the various methods of collecting ‘primary data’. State the methods you would employ to collect information about utilisation of plant capacity in small-scale sector in the Union Territory of Delhi.
6. Distinguish between ‘Primary Data’ and ‘Secondary Data’. State the chief sources of Secondary Data. What precautions are to be observed when such data are to be used for any investigation ?
7. A firm’s own records are internal data. What is meant by external data, a primary data, a primary source and secondary source ? Which is preferred, primary sources or secondary sources, and why ? Why do you suppose secondary sources are so often used ?
8. (a) What are consumer primary and secondary data ?
State those factors which should be kept in mind while using secondary data for the investigation.
(b) Distinguish between primary and secondary data. What precautions should be taken in the use of secondary data ?
(c) “Statistics, especially other people’s statistics are full of pitfalls for the user unless used with caution.” Elucidate the statement and mention what are the sources of the secondary data. [Delhi Univ. B.Com. (Pass), 2001]
9. (a) “In collection of statistical data common sense is the chief requisite and experience the chief teacher”. Discuss the above statement with comments.
(b) “It is never safe to take published statistics at their face value without knowing their meanings and limitations and it is always necessary to criticise the arguments that can be based on them.” (Bowley). Elucidate.
10. (a) Define a statistical unit and explain what should be the essential requirements of a good statistical unit.
(b) What are the essential points to be remembered in the choice of statistical units ?
(c) What is a statistical unit ? What do you mean by units of collection and units of analysis ? Discuss their relative uses.
(d) Giving appropriate reasons, state what units can be used for the following : (i) Production of cotton in textile industry. (ii) Labour employed in industry. (iii) Consumption of electricity.
11. What are the essentials of a questionnaire ? Draft a questionnaire not exceeding ten questions to study the views on educational programmes of television and indicate an outline of the design of the survey.
12. What do you mean by a questionnaire ? What is the difference between a questionnaire and a schedule ? State the essential points to be remembered in drafting a questionnaire.
13. Discuss the essentials of a good questionnaire. “It is proposed to conduct a sample survey to obtain information on the study habits of University students in Chandigarh and the facilities available to them”. Explain how you will plan the survey.
Draft a suitable questionnaire for this purpose.
14. What are the different methods of collection of data ? Why are personal interviews usually preferred to questionnaires ? Under what conditions may a questionnaire prove as satisfactory as a personal interview ?
15. You are the Sales Promotion Officer of Delta Cosmetics Co. Ltd. Your company is about to market a new product. Design a suitable questionnaire to conduct a consumer survey before the product is launched. State the various types of persons that may be approached for replying to the questionnaire.
16. It is required to collect information on the economic conditions of textile mill workers in Mumbai. Suggest a suitable method for collection of primary data. Draft a suitable questionnaire of about ten questions for collecting this information. Also suggest how you will proceed to carry out statistical analysis of the information collected.
17. What are the essentials of a good questionnaire ? Draft a suitable questionnaire to enable you to study the effects of super markets on prices of essential consumer goods.
18. What are the chief features of a good questionnaire ? What precautions do you take while drafting a questionnaire ?
19. How would you organise an enquiry into the cost of living of the student community in a city ? Draw up a blank form to obtain the required information.
20. Fill in the blanks :
(i) There are . . . . . . methods of collecting primary data.
(ii) Data are classified into . . . . . . and . . . . . .
(iii) . . . . . . is a suitable method of collecting data in cases where the informants are literate and spread over a vast area.
(iv) Data originally collected for any investigation is called . . . . . .
(v) The . . . . . . data should be used after careful scrutiny.
Ans. (i) Four ; (ii) Primary and Secondary ; (iii) Mailed questionnaire method ; (iv) Primary data ; (v) Secondary.
21.
What methods would you employ in collection of data considering accuracy, time and cost involved when the field of enquiry is : (i) small ; (ii) fairly large ; (iii) very large.
Ans. (i) Direct personal interview ; (ii) and (iii) Mailed questionnaire method.
22. Assume that you employ the following data while conducting a statistical investigation :
(i) Estimate of personal income taken from R.B.I. bulletin.
(ii) Financial data of Indian companies taken from the annual reports of the Ministry of Law and Company Affairs.
(iii) Tabulation from schedules used in interviews that you yourself conducted.
(iv) Data collected by the National Sample Survey.
Which of the above is Primary Data ?
Ans. Only (iii).
23. Explain the necessity of editing primary and secondary data and briefly discuss points to be considered while editing such data.
3
Classification and Tabulation
3·1. INTRODUCTION – ORGANISATION OF DATA
In the last chapter we described the various methods of collecting data for any enquiry. Unfortunately the data collected in any statistical investigation, known as raw data, are so voluminous and huge that they are unwieldy and incomprehensible. So, having collected and edited the data, the next important step is to organise it, i.e., to present it in a readily comprehensible condensed form which will highlight the important characteristics of the data, facilitate comparisons and render it suitable for further processing (statistical analysis) and interpretations. The presentation of the data is broadly classified into the following two categories : (i) Tabular Presentation. (ii) Diagrammatic or Graphic Presentation.
A statistical table is an orderly and logical arrangement of data into rows and columns and it attempts to present the voluminous and heterogeneous data in a condensed and homogeneous form.
But before tabulating the data, generally, systematic arrangement of the raw data into different homogeneous classes is necessary to sort out the relevant and significant features (details) from the irrelevant and insignificant ones. This process of arranging the data into groups or classes according to resemblances and similarities is technically called classification. Thus, classification of the data is preliminary to its tabulation. It is the first step in tabulation because the items with similarities must be brought together before the data are presented in the form of a ‘table’. On the other hand, the diagrams and graphs are pictorial devices for presenting the statistical data. However, in this chapter, we shall discuss only ‘Classification and Tabulation’ of the data while ‘Diagrammatic and Graphic Presentation’ of the data will be discussed in the next chapter (Chapter 4).
3·2. CLASSIFICATION
It is of interest to give below the following definitions of Classification :
“Classification is the process of arranging data into sequences and groups according to their common characteristics, or separating them into different but related parts.” – Secrist.
“A classification is a scheme for breaking a category into a set of parts, called classes, according to some precisely defined differing characteristics possessed by all the elements of the category.” – Tuttle A.M.
Thus classification emphasises the arrangement of the data into different classes, which are to be determined depending upon the nature, objectives and scope of the enquiry. For instance, the number of students registered in Delhi University during the academic year 2002-03 may be classified on the basis of any of the following criteria :
(i) Sex
(ii) Age
(iii) The state to which they belong
(iv) Religion
(v) Different faculties, like Arts, Science, Humanities, Law, Commerce, etc.
(vi) Heights or weights
(vii) Institutions (Colleges) and so on.
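As a rough illustration of how one set of records can be grouped on different bases, the following minimal Python sketch counts a handful of student records by a chosen criterion. The records, the field names and the helper `classify` are assumptions made up for this illustration; they are not actual Delhi University data.

```python
from collections import Counter

# Hypothetical student records, invented purely for illustration.
students = [
    {"sex": "F", "faculty": "Arts", "state": "Delhi"},
    {"sex": "M", "faculty": "Science", "state": "Punjab"},
    {"sex": "F", "faculty": "Science", "state": "Delhi"},
    {"sex": "M", "faculty": "Arts", "state": "Delhi"},
    {"sex": "F", "faculty": "Arts", "state": "Haryana"},
]


def classify(records, criterion):
    """Count how many records fall into each class of the given criterion."""
    return Counter(record[criterion] for record in records)


# The same five records give different classes for different criteria.
by_sex = classify(students, "sex")
by_faculty = classify(students, "faculty")
by_state = classify(students, "state")

print(dict(by_sex))
print(dict(by_faculty))
print(dict(by_state))
```

Only the criterion passed to `classify` changes between the three calls; the underlying records stay the same, which is the point made in the list above.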
Thus the same set of data can be classified into different groups or classes in a number of ways based on any recognisable physical, social or mental characteristic which exhibits variation among the different elements of the given data. The facts in one class will differ from those of another class w.r.t. some characteristic called the basis or criterion of classification. As an illustration, the data relating to socio-economic enquiry e.g., the family budget data relating to nature, quality and quantity of the commodities consumed by the group of people together with expenditure on different items of consumption may be classified under the following major heads : (i) Food (ii) Clothing (iii) Fuel and Lighting (iv) House rent (v) Miscellaneous (including items like education, recreation, medical expenses, gifts, newspaper, washerman, etc.). Each of the above groups or classes may further be divided into sub-groups or sub-classes. For example, ‘Food’ may be sub-divided into Cereals (rice, wheat, maize, pulses, etc.); Vegetables; Milk and milk products ; Oil and ghee ; Fruits and Miscellaneous. Thus it may be understood that to analyse any statistical data, classification may not be limited to one criterion or basis only. We might classify the given data w.r.t. two or more criteria or bases simultaneously. This technique of dividing the given data into different classes w.r.t. more than one basis simultaneously is called cross-classification and this process of further classification may be carried on as long as there are possible bases for classification. For instance, the students in the university may be simultaneously classified w.r.t. sex and faculty or w.r.t. age, sex and religion (three criteria) simultaneously and so on. 3·2·1. Functions of Classification. The functions of classification may be briefly summarised as follows : (i) It condenses the data. 
Classification presents the huge unwieldy raw data in a condensed form which is readily comprehensible to the mind and attempts to highlight the significant features contained in the data. (ii) It facilitates comparisons. Classification enables us to make meaningful comparisons depending on the basis or criterion of classification. For instance, the classification of the students in the university according to sex enables us to make a comparative study of the prevalence of university education among males and females. (iii) It helps to study the relationships. The classification of the given data w.r.t. two or more criteria, say, the sex of the students and the faculty they join in the university will enable us to study the relationship between these two criteria. (iv) It facilitates the statistical treatment of the data. The arrangement of the voluminous heterogeneous data into relatively homogeneous groups or classes according to their points of similarities introduces homogeneity or uniformity amidst diversity and makes it more intelligible, useful and readily amenable for further processing like tabulation, analysis and interpretation of the data. 3·2·2. Rules for Classification. Although classification is one of the most important techniques for the statistical treatment and analysis of numerical data, no hard and fast rules can be laid down for it. Obviously, a technically sound classification of the data in any statistical investigation will primarily depend on the nature of the data and the objectives of the enquiry. However, consistent with the nature and objectives of the enquiry, the following general guiding principles may be observed for good classification : (i) It should be unambiguous. The classes should be rigidly defined so that they should not lead to any ambiguity. In other words, there should not be any room for doubt or confusion regarding the placement of the observations in the given classes. 
For example, if we have to classify a group of individuals as ‘employed’ and ‘unemployed’, or ‘literate’ and ‘illiterate’, it is imperative to define in clear-cut terms what we mean by an employed person and an unemployed person ; by a literate person and an illiterate person.
(ii) It should be exhaustive and mutually exclusive. The classification must be exhaustive in the sense that each and every item in the data must belong to one of the classes. A good classification should be free from residual classes like ‘others’ or ‘miscellaneous’ because such classes do not reveal the characteristics of the data completely. However, if the classes are very large in number, as is the case in classifying various commodities consumed by people in a certain locality, it becomes necessary to introduce this ‘residual class’, otherwise the purpose of classification viz., condensation of the data will be defeated. Further, the various classes should be mutually disjoint or non-overlapping so that an observed value belongs to one and only one of the classes. For instance, if we classify the students in a college by sex i.e., as males and females, the two classes are mutually exclusive. But if the same group is classified as males, females and addicts to a particular drug then the classification is faulty because the group “addicts to a particular drug” includes both males and females. In such a case, a proper classification will be w.r.t. two criteria viz., w.r.t. sex (males and females) and further dividing the students in each of these two classes into ‘addicts’ and ‘non-addicts’ to the given drug.
(iii) It should be stable. In order to have meaningful comparisons of the results, an ideal classification must be stable i.e., the same pattern of classification should be adopted throughout the analysis and also for further enquiries on the same subject. For instance, in the 1961 census, the population was classified w.r.t.
profession in the four classes viz. (i) working as cultivator, (ii) working as agricultural labourer, (iii) working in household industry, and (iv) others. However, in the 1971 census, the classification w.r.t. profession was as under :
(a) Main Activity : (i) Worker [Cultivator (C), Agricultural labourer (AL), Household industries (HHI), Other works (OW)].
(b) Broad Category. Non-worker [Household duties (H) ; Student (ST) ; Rentier or Retired person (R) ; Dependents, Beggars, Institutions and Others (DBIO)].
Consequently the results obtained in the two censuses cannot be compared meaningfully. Hence, having decided about the basis of classification in an enquiry, we should stick to it for other related matters in order to have meaningful comparisons.
(iv) It should be suitable for the purpose. The classification must be in keeping with the objectives of the enquiry. For instance, if we want to study the relationship between university education and sex, it will be futile to classify the students w.r.t. age and religion.
(v) It should be flexible. A good classification should be flexible in that it should be adjustable to new and changed situations and conditions. No classification is good enough to be used for ever ; changes here and there become necessary with the passage of time and changed circumstances. However, flexibility should not be interpreted as instability of classification. The classification can be kept flexible by classifying the given population into some major groups which more or less remain stable and allowing for adjustment due to changed circumstances or conditions by sub-dividing these major groups into sub-groups or sub-classes which can be made flexible. Hence, the classification can maintain the character of flexibility along with stability.
Also see § 3·4·2 (Number of Classes) and § 3·4·3 (Size of Class Intervals).

3·2·3. Bases of Classification. The bases or the criteria w.r.t.
which the data are classified primarily depend on the objectives and the purpose of the enquiry. Generally, the data can be classified on the following four bases :
(i) Geographical i.e., Area-wise or Regional.
(ii) Chronological i.e., w.r.t. occurrence of time.
(iii) Qualitative i.e., w.r.t. some character or attribute.
(iv) Quantitative i.e., w.r.t. numerical values or magnitudes.
In the following section we shall briefly discuss them one by one.
(i) Geographical Classification. As the name suggests, in this classification the basis of classification is the geographical or locational differences between the various items in the data like States, Cities, Regions, Zones, Areas, etc. For example, the yield of agricultural output per hectare for different countries in some given period or the density of the population (per square km) in different cities of India, is given in the following tables.

TABLE 3·1
AGRICULTURAL OUTPUT OF DIFFERENT COUNTRIES (In Kg Per Hectare)
Country      Average Output
India        123·7
USA          581·0
Pakistan     257·5
USSR         733·5
China        269·2
Syria        615·2
Sudan        339·8
UAR          747·5

TABLE 3·2
DENSITY OF POPULATION IN DIFFERENT CITIES OF INDIA (Per Square Kilometre)
City         Density of Population
Kolkata      685
Mumbai       654
Delhi        423
Chennai      205
Chandigarh   48

Source : Yojana; Vol. XV, No. 18. 19th Sept. 1971, page 22.

Geographical classifications are usually presented either in an alphabetical order (which is generally the case in reference tables*) or according to size or values to lay more emphasis on the important area or region as is done in Table 3·2 (and this is generally the case in summary tables*).
(ii) Chronological Classification.
Chronological classification is one in which the data are classified on the basis of differences in time, e.g., the production of an industrial concern for different periods ; the profits of a big business house over different years ; the population of any country for different years. We give in Table 3·3, the population of India for different decades.

TABLE 3·3
POPULATION OF INDIA (In Crores)
Year    Population
1901    23·8
1911    25·0
1921    25·2
1931    27·9
1941    31·9
1951    36·1
1961    43·9
1971    54·8
1981    68·3
1991    84·4

The time series data, which are quite frequent in Economic and Business Statistics are generally classified chronologically, usually starting with the first period of occurrence.
(iii) Qualitative Classification. When the data are classified according to some qualitative phenomena which are not capable of quantitative measurement like honesty, beauty, employment, intelligence, occupation, sex, literacy, etc., the classification is termed as qualitative or descriptive or w.r.t. attributes. In qualitative classification the data are classified according to the presence or absence of the attributes in the given units. If the data are classified into only two classes w.r.t. an attribute like its presence or absence among the various units, the classification is termed as simple or dichotomous. Examples of such classification are classifying a given population of individuals as honest or dishonest ; male or female ; employed or unemployed ; beautiful or not beautiful and so on. However, if the given population is classified into more than two classes w.r.t. a given attribute, it is said to be manifold classification.
For example, for the attribute intelligence the various classes may be, say, genius, very intelligent, average intelligent, below average and dull as given below :

Population
    Genius
    Highly Intelligent
    Average Intelligent
    Below Average
    Dull

* For detailed discussion on reference tables and summary tables, see § 3·7·3, on Types of Tabulation in this chapter.

Moreover, if the given population is divided into classes on the basis of simultaneous study of more than one attribute at a time, the classification is again termed as manifold classification. As an illustration, suppose we classify the population by sex into two classes, males and females and each of these two classes is further divided into two classes w.r.t. another attribute, say, smoking i.e., smokers and non-smokers, thus giving us four classes in all. Each of these four classes may further be divided w.r.t. a third attribute, say, religion into two classes Hindu, Non-Hindu and so on. The scheme is explained below.

Population
    Male
        Smoker
            Hindu
            Non-Hindu
        Non-Smoker
            Hindu
            Non-Hindu
    Female
        Smoker
            Hindu
            Non-Hindu
        Non-Smoker
            Hindu
            Non-Hindu

(iv) Quantitative Classification.
If the data are classified on the basis of a phenomenon which is capable of quantitative measurement like age, height, weight, prices, production, income, expenditure, sales, profits, etc., it is termed as quantitative classification. The quantitative phenomenon under study is known as variable and hence this classification is also sometimes called classification by variables. For example, the earnings of different stores may be classified as given in Table 3·4.

TABLE 3·4
DAILY EARNINGS (IN ’00 RUPEES) OF 60 DEPARTMENTAL STORES
Daily earnings    Number of stores
Up to 100         6
101—200           14
201—300           8
301—400           10
401—500           8
501—600           6
601—700           4
701—800           4

In the above classification, the daily earnings of the stores are termed as variable and the number of stores in each class as the frequency. The above classification is termed as grouped frequency distribution.
Variable. As already pointed out, the quantitative phenomenon under study, like marks in a test, heights or weights of the students in a class, wages of workers in a factory, sales in a departmental store, etc., is termed a variable or a variate. It may be noted that different variables are measured in different units e.g., age is measured in years, height in inches or cms ; weight in lbs or kgs, income in rupees and so on. Variables are of two kinds :
(i) Continuous variable.
(ii) Discrete variable (Discontinuous variable).
Those variables which can take all the possible values (integral as well as fractional) in a given specified range are termed as continuous variables. For example, the age of students in a school (Nursery to Higher Secondary) is a continuous variable because age can take all possible values (as it can be measured to the nearest fraction of time : years, months, days, minutes, seconds, etc.), in a certain range, say, from 3 years to 20 years. Some other examples of continuous variable are height (in cms), weight (in lbs), distance (in kms).
More precisely a variable is said to be continuous if it is capable of passing from any given value to the next value by infinitely small gradations. On the other hand those variables which cannot take all the possible values within a given specified range are termed as discrete (discontinuous) variables. For example, the marks in a test (out of 100) of a group of students is a discrete variable since in this case marks can take only integral values from 0 to 100 (or halves or quarters also, if such fractional marks are given ; usually, fractional marks, if any, are rounded to the nearest integer). It cannot take all the values (integral as well as fractional) from 0 to 100. Some other examples of discrete variable are family size (members in a family), the population of a city, the number of accidents on the road, the number of typing mistakes per page and so on. A discrete variable is, thus, characterised by jumps and gaps between one value and the next. Usually, it takes integral values in a given range which depends on the variable under study. A detailed discussion of the quantitative classification of a series or set of observations is given in § 3·3, “Frequency Distributions”.
Remark. Values of the variable in a ‘given specified range’ are determined by the nature of the phenomenon under study. In case of heights of students in a college the range may be 4′ 6′′ (i.e., 137 cms) to 6′ 3′′ (i.e., 190 cms) ; in case of weights, it may be from 100 lbs to 200 lbs, say ; in case of marks in a test out of 25, it will be from 0 to 25 and so on.

3·3. FREQUENCY DISTRIBUTION
The organisation of the data pertaining to a quantitative phenomenon involves the following four stages :
(i) The set or series of individual observations - unorganised (raw) or organised (arrayed) data.
(ii) Discrete or ungrouped frequency distribution.
(iii) Grouped frequency distribution.
(iv) Continuous frequency distribution.
We shall explain the various stages by means of a numerical illustration. Let us consider the following distribution of marks of 200 students in an examination, arranged serially in order of their roll numbers.

TABLE 3·5. MARKS OF 200 STUDENTS
70 55 51 42 57 45 60 47 63 53 33 65 39 78 55 64 58 61 65 42
50 25 65 75 52 36 45 42 53 59 49 41 45 63 54 52 45 39 64 35
30 20 41 35 40 30 15 53 43 48 46 46 26 18 41 49 37 22 42 48
42 38 41 25 45 36 40 50 52 30 41 32 17 38 28 29 49 46 46 31
46 48 32 40 17 40 50 31 50 42 32 43 42 27 57 34 55 34 47 35
44 43 34 34 38 42 61 43 33 21 33 42 34 51 17 16 42 37 71 37
22 42 39 67 31 37 44 62 38 42 39 42 39 51 49 16 53 52 17 41
53 33 32 50 38 48 37 24 26 40 21 35 38 15 37 28 29 38 23 40
54 39 32 44 17 35 41 33 34 33 39 48 24 33 46 31 53 43 47 36
48 34 39 42 23 38 27 53 53 19 59 57 54 37 47 51 27 59 39 29

The data in the above form are called the raw or disorganised data. In the raw form the data are so unwieldy and scattered that even after a very careful perusal, the various details contained in them remain obscure and incomprehensible. The above presentation of the data in its raw form does not give us any useful information and is rather confusing to the mind. Our objective will be to express the huge mass of data in a suitable condensed form which will highlight the significant facts and comparisons and furnish more useful information without sacrificing any information of interest about the important characteristics of the distribution.
3·3·1. Array. A better presentation of the above raw data would be to arrange them in an ascending or descending order of magnitude which is called the ‘arraying’ of the data. However, this presentation (arraying), though better than the raw data, does not reduce the volume of the data.
3·3·2. Discrete or Ungrouped Frequency Distribution.
A much better way of representing the data is to express them in the form of a discrete or ungrouped frequency distribution where we count the number of times each value of the variable (marks in the above illustration) occurs in the above data. This is facilitated through the technique of Tally Marks or Tally Bars as explained below :
In the first column we place all the possible values of the variable (marks in the above case). In the second column a vertical bar (|) called the Tally Mark is put against the number (value of the variable) whenever it occurs. After a particular value has occurred four times, for the fifth occurrence we put a cross tally mark (a stroke drawn across the first four tally marks ||||) to give us a block of 5. When it occurs for the 6th time we put another tally mark against it (after leaving some space from the first block of 5) and for the 10th occurrence we again put a cross tally mark on the 6th to 9th tally marks to get another block of 5 and so on. This technique of putting cross tally marks at every 5th repetition (giving groups of 5 each) facilitates the counting of the number of occurrences of the value at the end. In the absence of such cross tally marks we shall get continuous tally bars like ||||||||… and there may be confusion in counting and we are liable to commit mistakes also. Thus, the 2nd column consists of tally marks or tally bars. After putting tally marks for all the values in the data, we count the number of times each value is repeated and write it against the corresponding number (value of the variable) in the third column, entitled frequency. This type of representation of the data is called discrete or ungrouped frequency distribution. The marks (which vary from student to student) are called the variable under study and the number of students against the corresponding marks (which tell us how frequently the marks occur) is called the frequency (f) of the variable.
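The tally procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list `marks` is a small hypothetical set of marks out of 10 (not the data of Table 3·5), the function names are invented for the example, and a block of five is printed as `||||/` as a plain-text stand-in for four bars with a cross tally mark.

```python
from collections import Counter

def tally_bars(count):
    """Render a count as tally bars; every full block of five is shown
    as '||||/' (four bars plus a cross stroke)."""
    blocks, rest = divmod(count, 5)
    parts = ["||||/"] * blocks
    if rest:
        parts.append("|" * rest)
    return " ".join(parts)

def ungrouped_frequency(values):
    """Return (value, frequency) pairs sorted by value."""
    return sorted(Counter(values).items())

# Hypothetical marks out of 10 for 16 students (illustrative only).
marks = [4, 7, 5, 7, 9, 7, 5, 4, 7, 10, 7, 5, 7, 9, 7, 8]

table = ungrouped_frequency(marks)
print(f"{'Marks':>5}  {'Tally Bars':<12} {'Frequency':>9}")
for value, freq in table:
    print(f"{value:>5}  {tally_bars(freq):<12} {freq:>9}")
```

The frequencies always add up to the number of observations, which is a convenient check on any tally sheet prepared by hand.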
Table 3·6 below gives the ungrouped frequency distribution of the data in Table 3·5, along with the tally marks.

TABLE 3·6. FREQUENCY DISTRIBUTION OF MARKS OF 200 STUDENTS
(||||/ denotes a block of five : four tally bars with a cross tally mark.)
Marks  Tally Bars          Frequency
15     ||                  2
16     ||                  2
17     ||||/               5
18     |                   1
19     |                   1
20     |                   1
21     ||                  2
22     ||                  2
23     ||                  2
24     ||                  2
25     ||                  2
26     ||                  2
27     |||                 3
28     ||                  2
29     |||                 3
30     |||                 3
31     ||||                4
32     ||||/               5
33     ||||/ ||            7
34     ||||/ ||            7
35     ||||/               5
36     |||                 3
37     ||||/ ||            7
38     ||||/ |||           8
39     ||||/ ||||          9
40     ||||/ |             6
41     ||||/ ||            7
42     ||||/ ||||/ ||||    14
43     ||||/               5
44     |||                 3
45     ||||/               5
46     ||||/ |             6
47     ||||                4
48     ||||/ |             6
49     ||||                4
50     ||||/               5
51     ||||                4
52     ||||                4
53     ||||/ |||           8
54     |||                 3
55     |||                 3
57     |||                 3
58     |                   1
59     |||                 3
60     |                   1
61     ||                  2
62     |                   1
63     ||                  2
64     ||                  2
65     |||                 3
67     |                   1
70     ||                  2
75     |                   1
78     |                   1

From the frequency table given above, we observe that there are 8 students getting 38 marks, 14 students getting 42 marks, only 1 student getting 75 marks and so on. The presentation of the data in the form of an ungrouped frequency distribution as given above is better than ‘arraying’ but still it does not condense the data much and is quite cumbersome to grasp and comprehend. The ungrouped frequency distribution is quite handy (i) if the values of the variable are largely repeated, otherwise there will be hardly any condensation, or (ii) if the variable (X) under consideration takes only a few values, say, if X were the marks out of 10 in a test given to 200 students, then X is a variable taking the values in the range 0 to 10 and can be conveniently represented by an ungrouped frequency distribution. However, if the variable takes the values in a wide (large) range as in the above illustration in Table 3·6, the data still remain unwieldy and need further processing for statistical analysis.
3·3·3. Grouped Frequency Distribution.
If the identity of the units (students in our example) about whom a particular information is collected (marks in the above illustration) is not relevant, nor is the order in which the observations occur, then the first real step of condensation consists in classifying the data into different classes (or class intervals) by dividing the entire range of the values of the variable into a suitable number of groups called classes and then recording the number of observations in each group (or class). Thus, in the above data of Table 3·6, if we divide the total range of the values of the variable viz., 78 – 15 = 63 into groups of size 5 each, then we shall get 63/5 = 12·6 i.e., 13 groups and the distribution of marks is then given by the following grouped frequency distribution.

TABLE 3·7. DISTRIBUTION OF MARKS OF 200 STUDENTS
Marks (X)    No. of Students (f)
15—19        11
20—24        9
25—29        12
30—34        26
35—39        32
40—44        35
45—49        25
50—54        24
55—59        10
60—64        8
65—69        4
70—74        2
75—79        2

The various groups into which the values of the variable are classified are known as classes or class intervals ; the length of the class interval (which is 5 in the above case) is called the width or magnitude of the classes. The two values specifying the class are called the class limits ; the larger value is called the upper class limit and the smaller value is called the lower class limit.
3·3·4. Continuous Frequency Distribution. While dealing with a continuous variable it is not desirable to present the data in a grouped frequency distribution of the type given in Table 3·7. For example, if we consider the ages of a group of students in a school, then the grouped frequency distribution into the classes 4—6, 7—9, 10—12, 13—15, etc., will not be correct, because this classification does not take into consideration the students with ages between 6 and 7 years i.e., 6 < X < 7 ; between 9 and 10 years i.e., 9 < X < 10 and so on.
In such situations we form continuous class intervals (without any gaps) of the following type :
Age in years :
Below 6
6 or more but less than 9
9 or more but less than 12
12 or more but less than 15
and so on, which takes care of all the students with any fractions of age. The presentation of the data into continuous classes of the above type along with the corresponding frequencies is known as continuous frequency distribution. [For further detailed discussion, see Types of Classes—Inclusive and Exclusive—§ 3·4·4].

3·4. BASIC PRINCIPLES FOR FORMING A GROUPED FREQUENCY DISTRIBUTION
In spite of the great importance of classification in statistical analysis, no hard and fast rules can be laid down for it. A statistician uses his discretion for classifying a frequency distribution, and sound experience, wisdom, skill and aptness are required for an appropriate classification of the data. However, the following general guidelines may be borne in mind for a good classification of the frequency data.
3·4·1. Types of Classes. The classes should be clearly defined and should not lead to any ambiguity. Further, they should be exhaustive and mutually exclusive (i.e., non-overlapping) so that any value of the variable corresponds to one and only one of the classes. In other words, every value of the variable is assigned to a unique class.
3·4·2. Number of Classes. Although no hard and fast rule exists, a choice about the number of classes (class intervals) into which a given frequency distribution can be divided primarily depends upon :
(i) The total frequency (i.e., total number of observations in the distribution),
(ii) The nature of the data i.e., the size or magnitude of the values of the variable,
(iii) The accuracy aimed at, and
(iv) The ease of computation of the various descriptive measures of the frequency distribution such as mean, variance, etc., for further processing of the data.
However, from practical point of view the number of classes should neither be too small nor too large. If too few classes are used, the classification becomes very broad and rough in the sense that too many frequencies will be concentrated or crowded in a single class. This might obscure some important features and characteristics of the data, thereby resulting in loss of information. Moreover, with too few classes the basic assumption that class marks (i.e., mid-values of the classes) are representative of the class for computation of further descriptive measures of distribution like mean, variance, etc., will not be valid, and the so-called grouping error will be larger in such cases. Consequently, in general, the accuracy of the results decreases as the number of classes becomes smaller and smaller. On the other hand, too many classes i.e., large number of classes will result in too few frequencies in each class. This might give irregular pattern of frequencies in different classes thus making the frequency distribution (frequency polygon) irregular. Moreover a large number of classes will render the distribution too unwieldy to handle, thus defeating the very purpose (or aim viz., summarisation of the data) of classification. Further the computational work for further processing of the data will unnecessarily become quite tedious and time consuming without any proportionate gain in the accuracy of the results. However, a balance should be struck between these two factors viz., the loss of information in the first case (i.e., too few classes) and irregularity of frequency distribution in the second case (i.e., too many classes) to arrive at a pleasing compromise, giving the optimum number of classes in the view of the statistician. 
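The trade-off between too few and too many classes can be seen by grouping one set of marks with different class widths. The following Python sketch is only illustrative: the 20 marks are hypothetical, the function name `grouped_counts` is invented for the example, and the classes are of the inclusive integer type starting at the smallest mark.

```python
from collections import Counter

def grouped_counts(values, start, width):
    """Count integer observations (all >= start) in the inclusive classes
    start..start+width-1, start+width..start+2*width-1, and so on."""
    counts = Counter((v - start) // width for v in values)
    n_classes = max(counts) + 1
    return [((start + j * width, start + (j + 1) * width - 1), counts.get(j, 0))
            for j in range(n_classes)]

# Hypothetical marks of 20 students (illustrative only).
marks = [15, 18, 22, 27, 31, 33, 34, 36, 38, 39,
         41, 42, 42, 44, 47, 48, 52, 53, 58, 63]

for width in (2, 10, 50):
    table = grouped_counts(marks, start=15, width=width)
    print(f"width {width}: {len(table)} classes")
    for (lower, upper), freq in table:
        print(f"  {lower}-{upper}: {freq}")
```

With width 50 all 20 marks fall in a single class and every detail is lost; with width 2 there are 25 classes, most of them holding one observation or none; width 10 gives five classes whose frequencies rise to a single peak and then fall.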
Ordinarily, the number of classes should not be greater than 20 and should not be less than 5, of course keeping in view the points (i) to (iv) given above together with the magnitude of class interval, since the number of classes is inversely proportional to the magnitude of the class interval. A number of rules of thumb have been proposed for calculating the proper number of classes. However, an elegant, though approximate, formula seems to be one given by Prof. Sturges known as Sturges’ rule, according to which
$k = 1 + 3{\cdot}322 \log_{10} N$ …(3·1)
where k is the number of class intervals (classes) and N is the total frequency i.e., total number of observations in the data. The value obtained in (3·1) is rounded to the nearest integer. Since the common logarithm of a one-digit number is of the form 0·(…), that of a two-digit number 1·(…), that of a three-digit number 2·(…) and that of a four-digit number 3·(…), the use of formula (3·1) restricts the value of k, the number of classes, to be fairly reasonable. For example :
If N = 10, $k = 1 + 3{\cdot}322 \log_{10} 10 = 4{\cdot}322 \simeq 4$ [$\because \log_a a = 1$]
If N = 100, $k = 1 + 3{\cdot}322 \log_{10} 100 = 1 + 3{\cdot}322 \times 2 \log_{10} 10 = 1 + 6{\cdot}644 = 7{\cdot}644 \simeq 8$
If N = 500, $k = 1 + 3{\cdot}322 \log_{10} 500 = 1 + 3{\cdot}322 \times 2{\cdot}6990 = 1 + 8{\cdot}966 = 9{\cdot}966 \simeq 10$
If N = 1000, $k = 1 + 3{\cdot}322 \log_{10} 1000 = 1 + 3{\cdot}322 \times 3 \log_{10} 10 = 1 + 9{\cdot}966 = 10{\cdot}966 \simeq 11$
If N = 10000, $k = 1 + 3{\cdot}322 \times 4 = 1 + 13{\cdot}288 = 14{\cdot}288 \simeq 14$
Accordingly, the Sturges’ formula (3·1) very ingeniously restricts the number of classes between 4 and 20, which is a fairly reasonable number from practical point of view. The rule, however, fails if the number of observations is very large or very small.
Remarks. 1. The number of class intervals should be such that they usually give uniform and unimodal distribution in the sense that the frequencies in the given classes first increase steadily, reach a maximum and then decrease steadily.
There should not be any sudden jumps or falls which result in the so-called irregular distribution. The maximum frequency should not occur in the very beginning or at the end of the distribution nor should it (maximum frequency) be repeated in which cases we shall get an irregular distribution.
2. The number of classes should be a whole number (integer) preferably 5 or some multiple of 5 viz., 10, 15, 20, 25, etc., which are readily perceptible to the mind and are quite convenient for numerical computations in the further processing (statistical analysis) of the data. Uncommon figures like 3, 7, 11, etc., should be avoided as far as possible.
3·4·3. Size of Class Intervals. Since the size of the class interval is inversely proportional to the number of classes (class intervals) in a given distribution, from the above discussion it is obvious that a choice about the size of the class interval will also largely depend on the sound subjective judgement of the statistician keeping in mind other considerations like N (total frequency), nature of the data, accuracy of the results and computational ease for further processing of the data. Here an approximate value of the magnitude (or width) of the class interval, say, ‘i’ can be obtained by using Sturges’ rule (3·1) which gives :
$i = \dfrac{\text{Range}}{\text{Number of Classes}} = \dfrac{\text{Range}}{1 + 3{\cdot}322 \log_{10} N}$ [Using (3·1)]
where Range of the distribution is given by the difference between the largest (L) and the smallest (S) value in the distribution, i.e.,
$\text{Range} = X_{\max} - X_{\min} = L - S$ …(3·2)
$\therefore\ i = \dfrac{\text{Range}}{k} = \dfrac{L - S}{1 + 3{\cdot}322 \log_{10} N}$ …(3·3)
Another ‘rule of thumb’ for determining the size of the class interval is that : “The length of the class interval should not be greater than one-fourth of the estimated population standard deviation.”* Thus, if $\hat{\sigma}$ is the estimate of the population standard deviation then the length of class interval is given by
$i \le \hat{\sigma}/4 = A$, (say).
Remarks 1.
From (3·3), we get :
$k = \dfrac{\text{Range}}{i} \ge \dfrac{\text{Range}}{A}$ …(3·4) [$\because\ i \le A \Rightarrow \dfrac{1}{i} \ge \dfrac{1}{A}$]
Thus, (3·4) also enables us to have an idea about the minimum number of classes (k) which will be given by :
$k = \dfrac{\text{Range}}{A}$ …(3·5)
where range is defined in (3·2) and $A = \hat{\sigma}/4$.
If we consider a hypothetical frequency distribution of the life time of 400 radio bulbs tested at a certain company with the result that minimum life time is 340 hours and maximum life time is 1300 hours such that, in usual notations : N = 400, L = 1300 hrs, S = 340 hrs, then using (3·3), we get
$i = \dfrac{1300 - 340}{1 + 3{\cdot}322 \log_{10} 400} = \dfrac{960}{1 + 3{\cdot}322 \times 2{\cdot}6021} = \dfrac{960}{1 + 8{\cdot}644} = \dfrac{960}{9{\cdot}644} = 99{\cdot}54 \simeq 100$ …(*)
If the magnitude of the class interval is taken as 100, then the number of classes will be 10 [which is nothing but the value $9{\cdot}644 \simeq 10$ in the denominator of (*)].
2. Like the number of classes, as far as possible, the size of class intervals should also be taken as 5 or some multiple of 5 viz., 10, 15, 20, etc., for facilitating computations of the various descriptive measures of the frequency distribution like mean ($\bar{x}$), standard deviation ($\sigma$), moments, etc.
3. Class intervals should be so fixed that each class has a convenient mid-point about which all the observations in the class cluster or concentrate. In other words, this amounts to saying that the entire frequency of the class is concentrated at the mid-value of the class. This assumption will be true only if the frequencies of the different classes are uniformly distributed in the respective class intervals. This is a very fundamental assumption in the statistical theory for the computation of various statistical measures, like mean, standard deviation, etc.
4. From the point of view of practical convenience, as far as possible, it is desirable to take the class intervals of equal or uniform magnitude throughout the frequency distribution. This will facilitate the
This will facilitate the computations of various statistical measures and also result in meaningful comparisons between different classes and different frequency distributions. Further, frequency distributions with equal classes can be represented diagrammatically with greater ease and utility whereas in the case of classes with unequal widths the diagrammatic representation might give a distorted picture and thus lead to fallacious interpretations. However, it may be neither practicable nor desirable to keep the magnitudes of the class intervals equal if there are very wide gaps in the observed data e.g., in the frequency distribution of incomes, wages, profits, savings, etc. For example, in the frequency distribution of income, larger class intervals would obscure (sacrifice) all the details about the smaller incomes and smaller classes would give quite an unwieldy frequency distribution. Such distributions are quite common in many economic and medical data, where we have to be content with classes of unequal width.

* For detailed discussion on “Standard Deviation” see Chapter 6.

3·4·4. Types of Class Intervals. As already stated, each class is specified by two extreme values called the class limits, the smaller one being termed as the lower limit and the larger one the upper limit of the class. The classification of a frequency distribution into various classes is of the following types :

(a) Inclusive Type Classes. The classes of the type 30—39, 40—49, 50—59, 60—69, etc., in which both the upper and lower limits are included in the class are called “inclusive classes”. For instance, the class interval 40—49 includes all the values from 40 to 49, both inclusive. The next value viz., 50 is included in the next class 50—59 and so on.
However, the fractional values between 49 and 50 cannot be accounted for in such a classification. Hence, ‘Inclusive Type’ of classification may be used for a grouped frequency distribution for discrete variables like marks in a test, number of accidents on the road, etc., where the variable takes only integral values. It cannot be used with advantage for the frequency distribution of continuous variables like age, height, weight, etc., where all values (integral as well as fractional) are permissible.

(b) Exclusive Type Classes. Let us consider the distribution of ages of a group of persons into classes 15—19, 20—24, 25—29, etc., each of magnitude 5. This classification of ‘inclusive type’ for ages is defective in the sense that it does not account for the individuals with ages more than 19 years but less than 20 years. In such a situation (where the variable is continuous), the classes have to be made without any gaps as given below :

    15 years and over but under 20
    20 years and over but under 25        …(*)
    25 years and over but under 30

and so on ; each class in this case also being of magnitude 5. More precisely the above classes can be written as :

    15—20 i.e., 15 ≤ X < 20
    20—25 i.e., 20 ≤ X < 25        …(**)
    25—30 i.e., 25 ≤ X < 30

and so on, where it should be clearly understood that in the above classes, the upper limits of each class are excluded from the respective classes. Such classes in which upper limits are excluded from the respective classes and are included in the immediate next class are termed as ‘exclusive classes’.

Remarks 1. For ‘exclusive classes’ the presentation given in (*) is preferred since it does not lead to any confusion. However, if presentation (**) is used, there is slight confusion about the overlapping values viz., 20, 25, 30, etc., but whenever such presentation is used (which is extensively done in practice) it should be clearly understood that the upper limit of the class is to be excluded from that class. From the above discussion it is also clear that a choice between the ‘inclusive method’ or ‘exclusive method’ of classification will depend on the nature of the variable under study. For a discrete variable, the ‘inclusive classes’ may be used while for continuous variable the ‘exclusive classes’ are to be used.

2. However, sometimes, even for a continuous random variable the classification may be given to be of ‘inclusive type’. As an illustration, let us consider the frequency distribution of age of a group of 50 individuals given in Table 3·8.

TABLE 3·8
Age (on last birthday)    No. of persons (f)
20—24                      6
25—29                     10
30—34                     14
35—39                      9
40—44                      6
45—49                      5

Although the variable (age) X is a continuous variable, here inclusive type of classes are used since we are recording the age as on last birthday and consequently it becomes a discrete variable taking only integral values. Since age is a continuous variable, we might like to convert this ‘inclusive type’ classification into ‘exclusive type’ classification.

Since the ages are recorded as on last birthday, they are recorded almost one year younger (prior). For example, in the age group 20—24, there may be a person (or many persons) with ages 24·1, 24·2, …, up to 24·99. Thus all these persons who have not yet completed 25 years will be taken in the age group 20—24. Hence for obtaining ‘exclusive classes’ we can make a correction in the above distribution by converting 24 to 25. Accordingly for continuous representation of data (exclusive type), all the upper class limits in Table 3·8 will have to be increased by 1, thereby giving the (exclusive type) distribution, as given in Table 3·9.

TABLE 3·9
Age (on last birthday) (X)    No. of persons (f)
20—25                          6
25—30                         10
30—35                         14
35—40                          9
40—45                          6
45—50                          5

However, if the variable X is taken to denote the ‘age on next birthday’, then it would imply that the ages are recorded one year in advance (i.e., one year older than the existing one). This will mean that the class 20—24 may include person(s) with ages just higher than 19 also. As such for continuity of the data the lower limit will have to be reduced by 1. Hence to obtain the ‘exclusive type’ classification for this case (X = age on next birthday i.e., coming birthday), we shall have to subtract 1 from the lower limit of each class in Table 3·8 to get the distribution as given in Table 3·10.

TABLE 3·10
Age (on next birthday) (X)    No. of persons (f)
19—24                          6
24—29                         10
29—34                         14
34—39                          9
39—44                          6
44—49                          5

3. As far as possible the class limits should start with zero or some convenient multiple of 5. As an illustration if we want to form a frequency distribution of wages in a factory with class interval of 10 and the lowest value of wages (per week) is given to be Rs. 43, then instead of having classes 43—53, 53—63, …, etc., a proper classification should be 40—50, 50—60, etc.

4. Class Boundaries. If in a grouped frequency distribution there are gaps between the upper limit of any class and lower limit of the succeeding class (as in the case of inclusive type of classification), there is need to convert the data into a continuous distribution by applying a correction for continuity for determining new classes of exclusive type. The upper and lower class limits of the new ‘exclusive type’ classes are called class boundaries. If d is the gap between the upper limit of any class and lower limit of the succeeding class, the class boundaries for any class are then given by :

$\left.\begin{aligned} \text{Upper class boundary} &= \text{Upper class limit} + \tfrac{1}{2} d \\ \text{Lower class boundary} &= \text{Lower class limit} - \tfrac{1}{2} d \end{aligned}\right\}$   …(3·6)

d/2 is called the correction factor.
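The correction for continuity in (3·6) can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the original text: the function name class_boundaries is hypothetical, and it assumes that the gap d is the same between every pair of consecutive classes. The classes used are the inclusive classes 30 to 39, 40 to 49, 50 to 59 and 60 to 69 quoted in § 3·4·4 (a).

```python
def class_boundaries(classes):
    """Apply (3·6) to a list of (lower_limit, upper_limit) pairs.

    Assumption: the classes are in ascending order and the gap d between
    consecutive classes is constant, so it can be read off the first two.
    """
    # d = lower limit of the second class minus upper limit of the first class
    d = classes[1][0] - classes[0][1]
    half = d / 2  # the correction factor d/2
    return [(lower - half, upper + half) for lower, upper in classes]


# Inclusive classes quoted in § 3·4·4 (a); here d = 40 - 39 = 1
inclusive = [(30, 39), (40, 49), (50, 59), (60, 69)]
boundaries = class_boundaries(inclusive)

for (lo, hi), (lb, ub) in zip(inclusive, boundaries):
    print(f"limits {lo}-{hi}  ->  boundaries {lb}-{ub}")
```

The upper boundary of each class coincides with the lower boundary of the next, which is what makes the converted distribution continuous.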
As an illustration, consider the following distribution of marks :

TABLE 3·11
Marks      Class Boundary
20—24      20 – 0·5, 24 + 0·5   i.e., 19·5, 24·5
25—29      25 – 0·5, 29 + 0·5   i.e., 24·5, 29·5
30—34      30 – 0·5, 34 + 0·5   i.e., 29·5, 34·5
35—39      35 – 0·5, 39 + 0·5   i.e., 34·5, 39·5
40—44      40 – 0·5, 44 + 0·5   i.e., 39·5, 44·5

Here, d = 25 – 24 = 30 – 29 = 35 – 34 = 1 ⇒ d/2 = 0·5.

This technique enables us to convert a grouped frequency distribution (inclusive type) into continuous frequency distribution and is extensively helpful in computing certain statistical measures like mode, median, etc., [See Chapter 5] which require the distribution to be continuous.

Thus, in Table 3·11, the lower class limits are 20, 25, 30, …, 40 and the upper class limits are 24, 29, …, 44, while the lower class boundaries are 19·5, 24·5, …, 39·5 and the upper class boundaries are 24·5, 29·5, …, 44·5.

5. Mid-value or Class Mark. As the name suggests, the mid-value or the class-mark is the value of the variable which is exactly at the middle of the class. The mid-value of any class is obtained on dividing the sum of the upper and lower class limits (or class boundaries) by 2. In other words :

$\left.\begin{aligned} \text{Mid-value of a class} &= \tfrac{1}{2}\,[\text{Lower class limit} + \text{Upper class limit}] \\ &= \tfrac{1}{2}\,[\text{Lower class boundary} + \text{Upper class boundary}] \end{aligned}\right\}$   …(3·7)

In Table 3·11 of Remark 4, it may be seen that the mid-values of various classes are 22, 27, 32, 37, 42 respectively, as given below :

    Using class limits          Using class boundaries
    ½ (20 + 24) = 22      or    ½ (19·5 + 24·5) = 22
    ½ (25 + 29) = 27      or    ½ (24·5 + 29·5) = 27
    ½ (30 + 34) = 32      or    ½ (29·5 + 34·5) = 32
    ½ (35 + 39) = 37      or    ½ (34·5 + 39·5) = 37
    ½ (40 + 44) = 42      or    ½ (39·5 + 44·5) = 42

It may be noted that whether we use class limits or class boundaries, the mid-values remain the same.

Important Note.
For fixing the class limits the most important factor to be kept in mind is as given below: “The class limits should be chosen in such a manner that the observations in any class are evenly distributed throughout the class interval so that the actual average of the observations in any class is very close to the mid-value of the class. In other words, this amounts to saying that the observations are concentrated at the mid points of the classes.” This is a very fundamental assumption in preparing a grouped or continuous frequency distribution for computation of various statistical measures like mean, variance, moments, etc. [See Chapters 5, 6, 7] for further analysis of the data. If this assumption is not true then the classification will not reveal the main characteristics and thus give a distorted picture of the distribution. The deviation from this assumption introduces the so-called ‘grouping error’. (c) Open End Classes. The classification is termed as ‘open end classification’ if the lower limit of the first class or the upper limit of the last class are not specified and such classes in which one of the limits is missing are called ‘open end classes’. For example, the classes like the marks less than 20; age above 60 years, salary not exceeding Rupees 100 or salaries over Rupees 200, etc., are ‘open end classes’ since one of the class limits (lower or upper) is not specified in them. As far as possible, open end classes should be avoided since in such classes the mid-value or class-mark cannot be accurately obtained and this poses problems in the computation of various statistical measures for further processing of the data. Moreover, open end classes present problems in graphic presentation of the data also. 
However, the use of open end classes is inevitable or unavoidable in a number of practical situations, particularly relating to economic and medical data where there are a few observations with extremely small or large values while most of the other observations are more or less concentrated in a narrower range. Thus, we have to resort to open end classes for the frequency distribution of income, wages, profits, payment of income-tax, savings, etc.

Remark. In case of open end classes, it is customary to estimate the class-mark or mid-value for the first class with reference to the succeeding class (i.e., 2nd class). In other words, we assume that the magnitude of the first class is same as that of second class. Similarly, the mid-value of the last class is determined with reference to the preceding class i.e., last but one class. This assumption will, of course, introduce some error in the calculation of further statistical measures (averages, dispersion, etc.; see Chapters 5, 6). However, if only a few items fall in the open end classes then :
(i) there won’t be much loss in information in further processing of data as a consequence of open end classes, and
(ii) the open end classes will not seriously reduce the utility of graphic presentation of the data.

Example 3·1. Form a frequency distribution from the following data by Inclusive Method, taking 4 as the magnitude of class-intervals :

    17, 10, 15, 22, 11, 16, 19, 24, 29, 18,
    25, 26, 32, 14, 20, 17, 23, 27, 30, 12,
    15, 18, 24, 36, 18, 15, 21, 28, 38, 33,
    34, 13, 10, 16, 20, 22, 29, 19, 23, 31.

Solution. Since the minimum value of the variable is 10 which is a very convenient figure for taking the lower limit of the first class and the magnitude of the class intervals is given to be 4, the classes for preparing frequency distribution by the ‘Inclusive Method’ will be 10—13, 14—17, 18—21, 22—25, …, 34—37, 38—41, the last class being 38—41, because the maximum value in the distribution is 38.

To prepare the frequency distribution, since the first value 17 occurs in class 14—17 we put a tally mark against it, for the value 10 we put a tally mark against the class 10—13 ; for the value 15 we put a tally mark against the class 14—17 and so on. The final frequency distribution along with the tally marks is given in Table 3·12.

TABLE 3·12
FREQUENCY DISTRIBUTION
Class Interval    Tally Marks    Frequency (f)
10—13             ||||            5
14—17             |||| |||        8
18—21             |||| |||        8
22—25             |||| ||         7
26—29             ||||            5
30—33             ||||            4
34—37             ||              2
38—41             |               1
Total                            40

Example 3·2. Following figures relate to the weekly wages of workers in a factory.

Wages (in ’00 Rs.)
    100   75   79   80  110   93  109   84   95   77
    100   89   84   81  106   96   94   83   95   78
    101   99   83   89  102   97   93   82   97   80
    102   96   87   99  107   99   97   80   98   93
    106   86   94   93   88   89  104  100  103  101
    100  102   98   99   84   86  100  105   96   97
     82   92   75  103  101  103  100   88  106   98
     87   90   76  104  101  107   97   91  103   98
    109   86   76  107   86  107   88   93   85   98
    104   78   79  110   94  108   86   95   84   87

Prepare a frequency table by taking a class interval of 5.

Solution. In the above distribution, the minimum value of the variable X (wages in ’00 Rupees) is 75 and the maximum value is 110. Moreover, the magnitude of the class intervals is given to be 5. Since ‘wages’ is a continuous variable, the frequency distribution with ‘Exclusive Method’ would be appropriate. Since the minimum value 75 is a convenient figure to be taken as the lower limit of the first class, the class intervals may be taken as 75—80, 80—85, 85—90, …, 110—115, the upper limit of each class being included in the next class. The frequency distribution is given in Table 3·13.
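The tallying carried out by hand in Example 3·1 above can be mirrored in a short Python sketch. It is an illustration only, not the authors' procedure: the function name inclusive_frequencies is hypothetical, and the sketch assumes integer observations and equal class widths, so that integer division by the width gives the index of the class in which each value falls.

```python
from collections import Counter

# Data of Example 3·1
data = [17, 10, 15, 22, 11, 16, 19, 24, 29, 18,
        25, 26, 32, 14, 20, 17, 23, 27, 30, 12,
        15, 18, 24, 36, 18, 15, 21, 28, 38, 33,
        34, 13, 10, 16, 20, 22, 29, 19, 23, 31]


def inclusive_frequencies(values, start, width):
    """Tally integer values into inclusive classes of equal width.

    Class k covers start + k*width up to start + (k+1)*width - 1.
    Returns a list of (lower_limit, upper_limit, frequency) tuples.
    """
    counts = Counter((v - start) // width for v in values)
    table = []
    for k in range(max(counts) + 1):
        lower = start + k * width
        table.append((lower, lower + width - 1, counts.get(k, 0)))
    return table


table = inclusive_frequencies(data, start=10, width=4)
for lower, upper, f in table:
    print(f"{lower}-{upper}: {f}")
print("Total:", sum(f for _, _, f in table))
```

The same class index, computed with start = 75 and width = 5, would place the wages of Example 3·2 in the exclusive classes 75 to 80, 80 to 85 and so on, because a value equal to an upper limit such as 80 is assigned to the next class.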
TABLE 3·13
FREQUENCY DISTRIBUTION OF WAGES OF WORKERS IN A FACTORY
Weekly Wages (in ’00 Rs.) (X)    Tally Marks              No. of Workers (f)
75—80                            |||| ||||                  9
80—85                            |||| |||| ||              12
85—90                            |||| |||| ||||            15
90—95                            |||| |||| |               11
95—100                           |||| |||| |||| ||||       20
100—105                          |||| |||| |||| ||||       20
105—110                          |||| |||| |               11
110—115                          ||                         2
Total                                                     100

Example 3·3. Prepare a frequency distribution of the number of letters in a word from the following excerpt (ignore punctuation marks).

“In the beginning”, said a Persian Poet, “Allah took a rose, a lily, a dove, a serpent, a little honey, a Dead Sea Apple and a handful of clay. When he looked at the amalgam – it was a woman.”

Also obtain (i) the number of words with 6 letters or more, (ii) the proportion of words with 5 letters or less, and (iii) the percentage of words with number of letters between 2 and 8 (i.e., more than 2 but less than 8).

Solution. Let X denote the number of letters in each word in the excerpt given above. We note that in the above excerpt there are words with number of letters ranging from 1 to 9. Hence X takes the values from 1 to 9. For example, in the first word ‘In’ there are 2 letters ; in the second word ‘the’ there are 3 letters ; in the third word ‘beginning’ there are 9 letters and so on. Thus, the corresponding values of the variable X in the above excerpt are as given below :

    2, 3, 9, 4, 1, 7, 4, 5, 4, 1, 4, 1, 4
    1, 4, 1, 7, 1, 6, 5, 1, 4, 3, 5, 3, 1
    7, 2, 4, 4, 2, 6, 2, 3, 7, 2, 3, 1, 5

The frequency distribution along with the tally marks is given in Table 3·14.

TABLE 3·14
FREQUENCY DISTRIBUTION OF NUMBER OF LETTERS IN A WORD
Number of letters in a word (X)    Tally Marks    Frequency (f)
1                                  |||| ||||       9
2                                  ||||            5
3                                  ||||            5
4                                  |||| ||||       9
5                                  ||||            4
6                                  ||              2
7                                  ||||            4
8                                  —               0
9                                  |               1
Total                                             39

(i) The number of words with 6 letters or more = 2 + 4 + 0 + 1 = 7.

(ii) The proportion of words with 5 letters or less is given by :

    (39 – 7)/39 = 32/39 = 0·82    or    (9 + 5 + 5 + 9 + 4)/39 = 32/39 = 0·82

(iii) The percentage of words with the number of letters between 2 and 8 is :

    (5 + 9 + 4 + 2 + 4)/39 × 100 = 24/39 × 100 = 61·54

Example 3·4. In a survey, it was found that 64 families bought milk in the following quantities (litres) in a particular week.

    19  16  22   9  22  12  39  19
    14  23   6  24  16  18   7  17
    20  25  28  18  10  24  20  21
    10   7  18  28  24  20  14  23
    25  34  22   5  33  23  26  29
    13  36  11  26  11  37  30  13
     8  15  22  21  32  21  31  17
    16  23  12   9  15  27  17  21

Using Sturges’ rule, convert the above data into a frequency distribution by ‘Inclusive Method’.

Solution. Here the total frequency is N = 64. By Sturges’ rule, the number of classes (k) is given by :

$k = 1 + 3·322 \log_{10} 64 = 1 + 3·322 \times 1·8062 = 1 + 6·0002 = 7·0002 \simeq 7$

Range = Maximum value – Minimum value = 39 – 5 = 34

Hence, the magnitude (i) of the class is given by

$i = \dfrac{\text{Range}}{\text{Number of classes } (k)} = \dfrac{34}{7} = 4·857 \simeq 5$.

Hence taking the magnitude of each class interval as 5, we shall get 7 classes. Since the lowest value is 5, which is quite a convenient figure for being taken as the lower limit of the first class, the various classes by the inclusive method would be 5—9, 10—14, 15—19, 20—24, 25—29, 30—34, 35—39. Using tally marks, the required frequency distribution is obtained and is given in Table 3·15.

TABLE 3·15
FREQUENCY DISTRIBUTION OF THE MILK PER WEEK AMONG 64 FAMILIES
Milk quantity (litres) (C.I.)    Tally Marks            Number of families (f)
5—9                              |||| ||                  7
10—14                            |||| ||||               10
15—19                            |||| |||| |||           13
20—24                            |||| |||| |||| |||      18
25—29                            |||| |||                 8
30—34                            ||||                     5
35—39                            |||                      3
Total                                                    64

Example 3·5. A college management wanted to give scholarships to B. Com.
students securing 60 per cent and above marks in the following manner :

Percentage of Marks    Monthly Scholarship (in Rs.)
60—65                  250
65—70                  300
70—75                  350
75—80                  400
80—85                  450

The marks of 25 students who were eligible for scholarship are given below :

74, 62, 84, 72, 61, 83, 72, 81, 64, 71, 63, 61, 60, 67, 74, 66, 64, 79, 73, 75, 76, 69, 68, 78 and 67.

Calculate the monthly scholarship paid to the students.

Solution. As we are given the amount of scholarships according to the percentage of marks of the students within classes 60—65, 65—70, …, 80—85, we shall convert the given distribution of marks into frequency distribution with these classes as obtained in Table 3·16.

TABLE 3·16
FREQUENCY DISTRIBUTION OF MARKS OF 25 STUDENTS
Percentage of marks    Tally Marks    No. of Students (f)    Scholarship (in Rs.) (X)    Total Amount (f X)
60—65                  |||| ||         7                      250                         1750
65—70                  ||||            5                      300                         1500
70—75                  |||| |          6                      350                         2100
75—80                  ||||            4                      400                         1600
80—85                  |||             3                      450                         1350
Total                                  ∑f = 25                                            ∑fX = 8,300

Total monthly scholarship paid to the students is : ∑ f X = Rs. 8,300.

Example 3·6. If the class mid-points in a frequency distribution of age of a group of persons are 25, 32, 39, 46, 53 and 60, find :
(a) the size of the class interval,
(b) the class boundaries, and
(c) the class limits, assuming that the age quoted is the age completed last birthday.

Solution. (a) The size (i) of the class interval is given by :

i = Difference between the mid-values of any two consecutive classes = 7
[Since 32 – 25 = 39 – 32 = … = 60 – 53 = 7]

(b) Since the magnitude of the class is 7 and the mid-values of the classes are 25, 32, …, 60, the corresponding class boundaries for different classes are obtained on adding (for upper class boundaries) and subtracting (for lower class boundaries) half the magnitude of the class interval viz., (7/2) = 3·5, from the mid-value respectively.
For example the class boundaries for the first class will be (25 – 3·5, 25 + 3·5) i.e., (21·5, 28·5) ; for the second class will be (32 – 3·5, 32 + 3·5) i.e., (28·5, 35·5) and so on. Thus the various classes (Exclusive Type) with class boundaries are as given in Table 3·17.

TABLE 3·17
Class          Mid-value
21·5—28·5      25
28·5—35·5      32
35·5—42·5      39
42·5—49·5      46
49·5—56·5      53
56·5—63·5      60

(c) Assuming the age quoted (X) is the age completed on last birthday then X will be a discrete variable which can take only integral values. Hence the given distribution can be expressed in an ‘inclusive type’ of classes with class interval of magnitude 7, as given in Table 3·18. (For details see Remark 2 § 3·4·4).

TABLE 3·18
Age (on last birthday)    Mid-point
22—28                     25
29—35                     32
36—42                     39
43—49                     46
50—56                     53
57—63                     60

Example 3·7. The following table shows the distribution of the life time of 350 radio tubes.

Lifetime (in hours)    Number of tubes
300—400                  6
400—500                 18
500—600                 73
600—700                165
700—800                 62
800—900                 22
900—1000                 4

Stating clearly the assumptions involved, obtain the percentage of tubes that have life time :
(a) Greater than 760 hours ;
(b) Between 650 and 850 hours ; and
(c) Less than 530 hours.

Solution. Under the assumption that the class frequencies are uniformly distributed within the corresponding classes, we obtain by simple interpolation technique :

(a) Number of tubes with the lifetime over 760 hours

    = 4 + 22 + (62/100 × 40) = 26 + 24·8 = 50·8 ≃ 51, since number of tubes cannot be fractional.

Hence, required percentage of tubes = 51/350 × 100 = 14·57.

(b) Number of tubes with lifetime over 650 hours

    = 4 + 22 + 62 + (165/100 × 50) = 88 + 82·5 = 170·5 ≃ 171

Number of tubes with lifetime over 850 hours = 4 + (22/100 × 50) = 4 + 11 = 15

Hence the number of tubes with lifetime between 650 hours and 850 hours is 171 – 15 = 156.
The required percentage of tubes = 156/350 × 100 = 44·57

(c) Number of tubes with life less than 530 hours = 6 + 18 + (73/100 × 30) = 6 + 18 + 21·9 = 45·9 ≃ 46

Hence required percentage of tubes = 46/350 × 100 = 13·14.

3·5. CUMULATIVE FREQUENCY DISTRIBUTION

A frequency distribution simply tells us how frequently a particular value of the variable (class) is occurring. However, if we want to know the total number of observations getting a value ‘less than’ or ‘more than’ a particular value of the variable (class), this frequency table fails to furnish the information as such. This information can be obtained very conveniently from the ‘cumulative frequency distribution’, which is a modification of the given frequency distribution and is obtained on successively adding the frequencies of the values of the variable (or classes) according to a certain law. The frequencies so obtained are called the cumulative frequencies abbreviated as c.f. The laws used are of ‘less than’ and ‘more than’ type giving rise to the ‘less than cumulative frequency distribution’ and the ‘more than cumulative frequency distribution’. We shall explain the construction of such distributions by means of a numerical illustration.

3·5·1. Less Than Cumulative Frequency. Let us consider the distribution of marks of 70 students in a test as given in Table 3·19.

TABLE 3·19
Marks      No. of students
30—35       5
35—40      10
40—45      15
45—50      30
50—55       5
55—60       5
Total      70

Less than cumulative frequency for any value of the variable (or class) is obtained on adding successively the frequencies of all the previous values (or classes), including the frequency of variable (class) against which the totals are written, provided the values (classes) are arranged in ascending order of magnitude. For instance, in the above illustration, the total number of students with marks less than, say, 40 is 5 + 10 = 15 ; ‘less than 50’ is the sum of all the previous frequencies upto and including the class 45—50 i.e., 5 + 10 + 15 + 30 = 60 and so on. The final distribution is given in Table 3·20.

TABLE 3·20
‘LESS THAN’ CUMULATIVE FREQUENCY DISTRIBUTION OF MARKS OF 70 STUDENTS
Marks      Frequency (f)    ‘Less than’ c.f.
30—35       5                5
35—40      10                5 + 10 = 15
40—45      15               15 + 15 = 30
45—50      30               30 + 30 = 60
50—55       5               60 + 5 = 65
55—60       5               65 + 5 = 70

The ‘less than’ cumulative frequency distribution of Table 3·20 can also be written as given in Table 3·20(a).

TABLE 3·20 (a)
LESS THAN c.f. DISTRIBUTION
Marks            Frequency
Less than 30      0
Less than 35      5
Less than 40     15
Less than 45     30
Less than 50     60
Less than 55     65
Less than 60     70

3·5·2. More Than Cumulative Frequency. The ‘more than cumulative frequency’ is obtained similarly by finding the cumulative totals of frequencies starting from the highest value of the variable (class) to the lowest value (class). Thus in the above illustration the number of students with marks ‘more than 50’ is 5 + 5 = 10, and ‘more than 40’ is 15 + 30 + 5 + 5 = 55 and so on. The complete ‘more than’ type cumulative frequency distribution for this data is given in Table 3·21.

TABLE 3·21
‘MORE THAN’ CUMULATIVE FREQUENCY DISTRIBUTION OF MARKS OF 70 STUDENTS
Marks      Frequency (f)    ‘More than’ cumulative frequency (c.f.)
30—35       5               65 + 5 = 70
35—40      10               55 + 10 = 65
40—45      15               40 + 15 = 55
45—50      30               10 + 30 = 40
50—55       5                5 + 5 = 10
55—60       5                5

The ‘more than’ c.f. distribution of Table 3·21 can also be expressed as given in Table 3·21 (a).

TABLE 3·21 (a)
MORE THAN FREQUENCY DISTRIBUTION
Marks            No. of students
More than 30     70
More than 35     65
More than 40     55
More than 45     40
More than 50     10
More than 55      5
More than 60      0

Remarks 1.
In fact ‘less than’ and ‘more than’ words also include the equality sign i.e., ‘less than a given value’ means ‘less than or equal to that value’ and ‘more than a given value’ means ‘more than or equal to that value’.

2. Cumulative frequency distribution is of particular importance in the computation of median, quartiles and other partition values of a given frequency distribution. [For details see Chapter 5, Averages].

3. In ‘less than’ cumulative frequency distribution, the c.f. refers to the upper limit of the corresponding class and in ‘more than’ cumulative frequency distribution, the c.f. refers to the lower limit of the corresponding class.

Example 3·8. Convert the following distribution into ‘more than’ frequency distribution.

Weekly wages (less than ’00 Rs.) :    20     40     60     80    100
Number of workers :                   41     92    156    194    201

Solution. Here we are given the ‘less than’ cumulative frequency distribution. To obtain the ‘more than’ cumulative frequency distribution, we shall first convert it into continuous frequency distribution as shown in Table 3·22.

TABLE 3·22
MORE THAN c.f. DISTRIBUTION
Weekly wages (in ’00 Rs.)    No. of workers (f)    ‘More than’ c.f.
0—20                         41                    160 + 41 = 201
20—40                        92 – 41 = 51          109 + 51 = 160
40—60                        156 – 92 = 64          45 + 64 = 109
60—80                        194 – 156 = 38          7 + 38 = 45
80—100                       201 – 194 = 7           7

From Table 3·22, we obtain the ‘more than’ frequency distribution as given in Table 3·22 (a).

TABLE 3·22 (a)
‘MORE THAN’ FREQUENCY DISTRIBUTION
Weekly wages more than (’00 Rs.)    No. of workers
0                                   201
20                                  160
40                                  109
60                                   45
80                                    7
100                                   0

Example 3·9. The credit office of a departmental store gave the following statements for payment due to 40 customers. Construct a frequency table of the balances due taking the class intervals as Rs. 50 and under Rs. 200, Rs. 200 and under Rs. 350, etc. Also find the percentage cumulative frequencies and interpret these values.

Balances due in Rs.
    337, 570,  99, 759, 487, 352, 115,  60, 521,  95
    563, 399, 625, 215, 360, 178, 827, 301, 501, 199
    110, 501, 201,  99, 637, 328, 539, 150, 417, 250
    451, 595, 422, 344, 186, 681, 397, 790, 272, 514

Solution. Taking the class intervals as 50—200, 200—350, …, and using tally marks, we obtain the following distribution of the balance due (in Rs.) from 40 customers.

TABLE 3·23
FREQUENCY TABLE OF BALANCE DUE (IN RUPEES) TO 40 CUSTOMERS
Balance due (in Rs.)    Tally Marks    No. of customers (f)    Less than c.f.    Percentage c.f.*
50—200                  |||| ||||      10                      10                 25·0
200—350                 |||| |||        8                      10 + 8 = 18        45·0
350—500                 |||| |||        8                      18 + 8 = 26        65·0
500—650                 |||| ||||      10                      26 + 10 = 36       90·0
650—800                 |||             3                      36 + 3 = 39        97·5
800—950                 |               1                      39 + 1 = 40       100·0
Total                                  N = ∑f = 40

* Percentage c.f. = (Less than c.f. / N) × 100

The last column of the percentage cumulative frequencies shows that 25% of the customers have to pay less than Rs. 200 ; 45% of the customers have to pay less than Rs. 350 ; 65% of the customers have to pay less than Rs. 500 ; 90% of the customers have to pay less than Rs. 650 ; 97·5% of the customers have to pay less than Rs. 800 and the balance due is less than Rs. 950 from each of the 40 customers i.e., no customer has to pay more than Rs. 950.

3·6. BIVARIATE FREQUENCY DISTRIBUTION

So far our study was confined to frequency distribution of a single variable only. Such frequency distributions are also called univariate frequency distributions. Quite often we are interested in simultaneous study of two variables for the same population. This amounts to classifying the given population w.r.t. two bases or criteria simultaneously. For example, we may study the weights and heights of a group of individuals, the marks obtained by a group of individuals on two different tests or subjects, income and expenditure of a group of individuals, ages of husbands and wives for a group of couples, etc.
The data so obtained as a result of this cross classification give rise to the so-called bivariate frequency distribution and it can be summarised in the form of two-way table called the bivariate frequency table or commonly called the correlation table. Here also the values of each variable are grouped into various classes (not necessarily the same for each variable) keeping in view the same considerations of classification as for a univariate distribution. If the data corresponding to one variable, say, X is grouped into m classes and the data corresponding to the other variable, say, Y is grouped into n classes then the bivariate table will consist of m × n cells. By going through the different pairs of the values (x, y) of the variables and using tally marks we can find the frequency for each cell and thus obtain the bivariate frequency table. The format of a bivariate frequency table is given in Table 3·24.

TABLE 3·24
BIVARIATE FREQUENCY TABLE

                                 X Series →
                                 Classes
Y Series ↓                       Mid Points :  x1   x2   ...   x   ...   xm     Total of frequencies of Y
Classes    Mid Points
           y1
           y2
           ...
           y                                             f (x, y)               fy
           ...
           yn
Total of frequencies of X                                fx                     Total ∑ fx = ∑ fy = N

Here f (x, y) is the frequency of the pair (x, y).

Remarks 1. The bivariate frequency table gives a general visual picture of the relationship between the two variables under consideration. However, a quantitative measure of the linear relationship between the variables is given by the correlation coefficient (See Chapter 8, Correlation Analysis).

2. Marginal Distributions of X and Y. The frequency distribution of the values of the variable X together with their frequency totals as given by fx in the above table is called the marginal frequency distribution of X. Similarly, the frequency distribution of the values of the variable Y together with the total frequencies fy in the above table, is known as the marginal frequency distribution of Y.

3. Conditional Distributions of X and Y.
The conditional frequency distribution of X for a given value of Y is obtained by the values of X together with their frequencies corresponding to the fixed values of Y. Similarly, we may obtain the conditional frequency distribution of Y for given values of X.

We shall now explain the technique of constructing bivariate frequency table and obtaining the marginal and conditional distributions of X and Y by means of numerical illustrations.

Example 3·10. (a) Prepare a bivariate frequency distribution for the following data for 20 students :

Marks in Law           10   11   10   11   11   14   12   12   13   10
Marks in Statistics    20   21   22   21   23   23   22   21   24   23
Marks in Law           13   12   11   12   10   14   14   12   13   10
Marks in Statistics    24   23   22   23   22   22   24   20   24   23

(b) Also obtain the marginal frequency distributions of the marks in Law and marks in Statistics and the conditional frequency distribution of marks in Law when marks in Statistics are 23 and the conditional distribution of marks in Statistics when marks in Law are 12.

Solution. (a) Let us denote the marks in Law by the variable X and the marks in Statistics by the variable Y. Then X takes the values from 10 to 14 i.e., 5 values in all, and Y takes the values from 20 to 24 i.e., 5 values in all. Thus the two-way table will consist of 5 × 5 = 25 cells.

To prepare the bivariate frequency table, we observe that the first student gets 10 marks in Law and 20 marks in Statistics. Therefore, we put a tally mark in the cell where the column corresponding to X = 10 intersects the row corresponding to Y = 20. Proceeding similarly we put tally marks for each pair of values (x, y) for all the 20 candidates. The total frequency for each cell is given in small brackets ( ), after the tally marks. Now count all the frequencies in each row and write the total in the extreme right column. Similarly count all the frequencies in each column and write the total in the bottom row. The bivariate frequency distribution so obtained is given in Table 3·25.
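The cell-by-cell tallying described in this solution can also be sketched in Python. This is an illustrative sketch rather than the authors' method: the variable names are hypothetical, collections.Counter stands in for the tally marks, and the first student's marks in Statistics are taken as 20, as stated in the solution.

```python
from collections import Counter

# Marks of the 20 students of Example 3·10, in the order given
law = [10, 11, 10, 11, 11, 14, 12, 12, 13, 10,
       13, 12, 11, 12, 10, 14, 14, 12, 13, 10]
stats = [20, 21, 22, 21, 23, 23, 22, 21, 24, 23,
         24, 23, 22, 23, 22, 22, 24, 20, 24, 23]

cells = Counter(zip(law, stats))  # f(x, y): one tally per pair (x, y)
fx = Counter(law)                 # marginal frequencies of X (column totals)
fy = Counter(stats)               # marginal frequencies of Y (row totals)

xs, ys = sorted(fx), sorted(fy)

# Two-way table: rows are values of Y, columns are values of X
print("Y/X " + " ".join(f"{x:>3}" for x in xs) + "   fy")
for y in ys:
    row = " ".join(f"{cells.get((x, y), 0):>3}" for x in xs)
    print(f"{y:>3} {row} {fy[y]:>4}")
print(" fx " + " ".join(f"{fx[x]:>3}" for x in xs) + f" {len(law):>4}")

# Conditional distributions: fix one variable and read off the other
x_given_y23 = {x: cells.get((x, 23), 0) for x in xs}
y_given_x12 = {y: cells.get((12, y), 0) for y in ys}
print("X when Y = 23:", x_given_y23)
print("Y when X = 12:", y_given_x12)
```

The marginal distributions are simply the row and column totals of the two-way table, and each conditional distribution is a single row or column of it.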
TABLE 3·25 BIVARIATE FREQUENCY TABLE SHOWING MARKS OF 20 STUDENTS IN LAW AND STATISTICS

| Marks in Statistics (Y) ↓ / Marks in Law (X) → | 10 | 11 | 12 | 13 | 14 | Total (fy) |
|---|---|---|---|---|---|---|
| 20 | \| (1) | | \| (1) | | | 2 |
| 21 | | \|\| (2) | \| (1) | | | 3 |
| 22 | \|\| (2) | \| (1) | \| (1) | | \| (1) | 5 |
| 23 | \|\| (2) | \| (1) | \|\| (2) | | \| (1) | 6 |
| 24 | | | | \|\|\| (3) | \| (1) | 4 |
| Total (fx) | 5 | 4 | 5 | 3 | 3 | 20 |

(b) The marginal frequency distributions of X and Y are given in Table 3·25(a).

TABLE 3·25 (a) MARGINAL DISTRIBUTIONS

Marginal Distribution of X

| X | 10 | 11 | 12 | 13 | 14 | Total |
|---|---|---|---|---|---|---|
| f | 5 | 4 | 5 | 3 | 3 | 20 |

Marginal Distribution of Y

| Y | 20 | 21 | 22 | 23 | 24 | Total |
|---|---|---|---|---|---|---|
| f | 2 | 3 | 5 | 6 | 4 | 20 |

The conditional distribution of marks in Law (X) when marks in Statistics are 23, i.e., Y = 23, and the conditional distribution of marks in Statistics (Y) when marks in Law are 12, i.e., X = 12, are given in Table 3·25(b).

TABLE 3·25 (b) CONDITIONAL DISTRIBUTIONS

Conditional Distribution of X when Y = 23

| X | 10 | 11 | 12 | 13 | 14 | Total |
|---|---|---|---|---|---|---|
| Frequency | 2 | 1 | 2 | 0 | 1 | 6 |

Conditional Distribution of Y when X = 12

| Y | 20 | 21 | 22 | 23 | 24 | Total |
|---|---|---|---|---|---|---|
| Frequency | 1 | 1 | 1 | 2 | 0 | 5 |

Example 3·11. The following figures give the ages in years of newly married husbands and wives. Represent the data by a frequency distribution.
Age of Husband : 24 26 27 25 28 24 27 28 25 26
Age of Wife : 17 18 19 17 20 18 18 19 18 19
Age of Husband : 25 26 27 25 27 26 25 26 26 26
Age of Wife : 17 18 19 19 20 19 17 20 17 18
[Delhi Univ. B.Com. (Hons.), 1975]

Solution. Let us denote the age (in years) of the husbands by the variable X and the age (in years) of the wives by the variable Y. Then we observe that the variable X takes the values from 24 to 28 and Y takes the values from 17 to 20. Proceeding exactly as in Example 3·10, we obtain the bivariate frequency distribution given in Table 3·26.
TABLE 3·26 FREQUENCY DISTRIBUTION OF THE AGES (IN YEARS) OF NEWLY MARRIED HUSBANDS AND WIVES

| Age of Wife (Y) ↓ / Age of Husband (X) → | 24 | 25 | 26 | 27 | 28 | Total (fy) |
|---|---|---|---|---|---|---|
| 17 | \| (1) | \|\|\| (3) | \| (1) | | | 5 |
| 18 | \| (1) | \| (1) | \|\|\| (3) | \| (1) | | 6 |
| 19 | | \| (1) | \|\| (2) | \|\| (2) | \| (1) | 6 |
| 20 | | | \| (1) | \| (1) | \| (1) | 3 |
| Total (fx) | 2 | 5 | 7 | 4 | 2 | 20 |

EXERCISE 3·1.

1. (a) What do you mean by classification of data ? Discuss in brief the modes of classification. [Delhi Univ. B.Com. (Pass), 1996]
(b) Briefly explain the principles of Classification. [Delhi Univ. B.Com. (Pass), 2000]
(c) What do you understand by classification of data ? What are its objectives ?
(d) What are different types of classification ? Illustrate by suitable examples.

2. “Classification is the process of arranging things (either actually or notionally) in groups or classes according to their resemblances and affinities giving expression to the unity of attributes that may subsist amongst a diversity of individuals”. Elucidate the above statement.

3. (a) What is meant by ‘classification’ ? State its important objectives. Briefly explain the different methods of classifying statistical data. [C.A. (Foundation), June 1993]
(b) What are the purposes of classification of data ? State the primary rules to be observed in classification. [C.A. (Foundation), Nov. 1995]

4. (a) What are the advantages of data classification ? What primary rules should ordinarily be followed for classification ? [C.A. (Foundation), Nov. 1997]
(b) What are different kinds of classification ? State how the classification of data is useful. [C.A. (Foundation), Nov. 2000]
(c) State the principles underlying classification of data.

5. (a) What are grouped and ungrouped frequency distributions ? What are their uses ? What are the considerations that one has to bear in mind while forming the frequency distribution ?
(b) Discuss the problems in the construction of a frequency distribution from raw data, with particular reference to the choice of number of classes and the class limits.
6. (a) What are the principles governing the choice of : (i) Number of class intervals. (ii) The length of the class interval. (iii) The mid point of class interval ?
(b) What are the general rules of forming a frequency distribution with particular reference to the choice of class-interval and number of classes ? Illustrate with examples. [Calicut Univ., B.Com., 1999]

7. What do you mean by an inclusive series ? How can an inclusive series be converted to an exclusive series ? Illustrate with the help of an example. [Delhi Univ. B.Com. (Pass), 2002]

8. Prepare a frequency distribution from the following figures relating to bonus paid to factory workers :
BONUS PAID TO WORKERS (in Rs.)
86 62 58 73 101 90 84 90 76 61
84 63 56 72 102 56 83 92 87 60
83 69 57 71 103 57 87 93 88 59
76 70 54 70 104 58 88 94 89 57
74 86 55 60 105 59 89 84 90 74
67 84 82 70 60 60 90 81 91 76
60 83 78 80 65 61 96 82 93 102
60 91 76 90 70 63 94 83 94 101
70 100 74 101 80 67 92 85 96 92
Take a class-interval of 5.
Ans. Frequencies of classes 50—54, 55—59,…, 100—104, 105—109 are respectively 1, 10, 11, 4, 11, 5, 13, 9, 15, 2, 8 and 1.

9. Describe in brief the practical problems of frequency distribution in classification of data on a variable according to class intervals, giving the definition of frequency distribution. [C.A. (Foundation), Nov. 2001]

10. (a) For the following raw data prepare a frequency distribution with the starting class as 5—9 and all classes with the same width 5.
Marks in English
12 36 40 16 10 10 19 20 28 30
19 27 15 21 33 45 7 19 20 26
26 37 6 5 20 30 37 17 11 20
Ans.
Marks : 5—9 10—14 15—19 20—24 25—29 30—34 35—39 40—44 45—49
Frequency : 3 4 6 5 4 3 3 1 1
(b) Classify the following data by taking class intervals such that their mid-values are 17, 22, 27, 32, and so on.
30 42 30 54 40 48 15 17 51 42 25 41 30
27 42 36 28 26 37 54 44 31 36 40 36 22
30 31 19 48 16 42 32 21 22 46 33 41 21
[Madurai-Kamaraj Univ. B.Com., 1995]
Ans.
Class : 15—19 20—24 25—29 30—34 35—39 40—44 45—49 50—54
Frequency : 4 4 4 8 4 9 3 3

11. (a) The following are the weights in kilograms of a group of 55 students.
42 74 40 60 82 115 41 61 75 83 63
53 110 76 84 50 67 78 77 63 65 95
68 69 104 80 79 79 54 73 59 81 100
66 49 77 90 84 76 42 64 69 70 80
72 50 79 52 103 96 51 86 78 94 71
Prepare a frequency table taking the magnitude of each class-interval as 10 kg. and the first class-interval as equal to 40 and less than 50.
Ans. Frequencies of classes 40—50, 50—60,…, 110—120 are 5, 7, 11, 15, 8, 4, 3, 2 respectively.
(b) Prepare a statistical table from the following data taking the class width as 7 by inclusive method.
24 26 28 32 37 5 1 7 9 11 15
13 14 18 29 31 32 6 4 2 9 18
27 36 3 9 15 21 27 33 4 8 12
16 20 5 10 3 8 1 6 4 9 2
7 12 18 27 23 21 29 22 15 17 28
Ans. Frequencies of classes 1—7, 8—14, 15—21,…, 36—42 are respectively 15, 12, 11, 9, 6, 2.

12. Using Sturges’ Rule k = 1 + 3·322 log N, where k is the number of class intervals and N is the total number of observations, classify in equal intervals the following data of hours worked by 50 piece rate workers for a period of a month in a certain factory :
110, 175, 161, 157, 155, 108, 164, 128, 114, 178, 165, 133, 195, 151, 71, 94, 97, 42, 30, 62, 138, 156, 167, 124, 164, 146, 116, 149, 104, 141, 103, 150, 162, 149, 79, 113, 69, 121, 93, 143, 140, 144, 187, 184, 197, 87, 40, 122, 203, 148.
Ans. Using Sturges’ rule we get k (No. of classes) = 7, Magnitude of class = (Range/k) = (174/7) ≈ 25. Classes are 30—55, 55—80, 80—105,…, 180—205. The corresponding frequencies are 3, 4, 6, 9, 12, 11, 5.

13. Two dice are thrown at random. Obtain the frequency distribution of the sum of the numbers which appear on them.
Hint and Ans.
Total possible pairs of numbers on the two dice are as given below :
(1, 1), (1, 2),…, (1, 6) ; (2, 1), (2, 2),…, (2, 6) ; … ; (6, 1), (6, 2),…, (6, 6)
If X denotes the sum of the numbers on the two dice then X is a discrete variable which can take the values 2, 3,…, 12.
Sum (X) : 2 3 4 5 6 7 8 9 10 11 12
Frequency : 1 2 3 4 5 6 5 4 3 2 1

14. If the class mid-points in a frequency distribution of a group of persons are : 125, 132, 139, 146, 153, 160, 167, 174, 181 pounds, find (i) size of the class intervals, (ii) the class boundaries, and (iii) the class limits, assuming that the weights are measured to the nearest pound. [Delhi Univ. B.Com. (Hons.), 2007]
Ans. (i) 7 (ii) 121·5—128·5, 128·5—135·5,…, 177·5—184·5 (iii) 122—128, 129—135,…, 178—184.

15. With the help of suitable examples, distinguish between :
(i) Continuous and Discrete variable.
(ii) Exclusive and Inclusive class intervals.
(iii) ‘More than’ and ‘Less than’ frequency tables.
(iv) Simple and Bivariate frequency tables.

16. What do you mean by a cumulative frequency (c.f.) distribution, and by ‘More than’ and ‘Less than’ type c.f. distributions ? Illustrate by an example.

17. The weekly observations on cost of living index in a certain city for the year 2000-01 are given below :
Cost of living index : 140—150 150—160 160—170 170—180 180—190 190—200
No. of weeks : 5 10 20 9 6 2
Prepare ‘less than’ and ‘more than’ cumulative frequency distributions.

18. (a) Convert the following into an ordinary frequency distribution :
5 students get less than 3 marks ; 12 students get less than 6 marks ; 25 students get less than 9 marks ; 33 students get less than 12 marks. [Delhi Univ. B.Com. (Pass), 2001]
Ans.
Marks : 0—3 3—6 6—9 9—12
No. of students : 5 7 13 8
(b) Following is a cumulative frequency table showing the number of packages and the number of times a given number of packages was received by a post office in 60 days :
No. of packages below : 10 20 30 40 50 60
No.
of times received in 60 days : 17 22 29 37 50 60
Obtain the frequency table from it. Also prepare ‘more than’ cumulative frequency table.

19. (a) What is the difference between continuous and discrete variables ?
(b) Are the following variables discrete or continuous ? Give your answer with reason.
(i) Age on last birthday.
(ii) Temperature of the patient.
(iii) Length of a room.
(iv) Number of shareholders in a company.
Ans. (ii) and (iii) continuous ; (i) and (iv) discrete.
(c) State with reasons which of the following represent discrete data and which represent continuous data :
(i) Number of table fans sold each day at a Departmental Store.
(ii) Temperature recorded every half an hour of a patient in a hospital.
(iii) Life of television tubes produced by Electronics Ltd.
(iv) Yearly income of school teachers.
(v) Lengths of 1,000 bolts produced in a factory.

20. Complete the table showing the frequencies with which words of different number of letters occur in the extract reproduced below (omitting punctuation marks) treating as the variable the number of letters in each word :
“Her eyes were blue ; blue as autumn distance—blue as the blue we see between the retreating mouldings of hills and woody slopes on a sunny September morning : a misty and shady blue, that had no beginning or surface, and was looked into rather than at.”
Ans.
X : 1 2 3 4 5 6 7 8 9 10
f : 2 8 9 10 5 4 3 1 3 1

21. Prepare a frequency distribution of the words in the following extract according to their length (number of letters) omitting punctuation marks. Also give (i) the number of words with 7 letters or less ; (ii) the proportion of words with 5 letters or more ; (iii) the percentage of words with not less than 4 and not more than 7 letters.
“Success in the examination confers no absolute right to appointment, unless Government is satisfied, after such enquiry as may be considered necessary, that the candidate is suitable in all respects for appointment to the public service.”
Ans.
X : 2 3 4 5 6 7 8 9 10 11
f : 9 6 2 2 2 4 3 3 2 3
(i) 25, (ii) (19/36) = 0·5278, (iii) (10/36) × 100 = 27·78.

22. A company wants to pay daily bonus to its employees. The bonus is to be paid as under :
Daily Salary (Rs.) : 100—200 200—300 300—400 400—500 500—600 600—700
Daily Bonus (Rs.) : 10 20 30 40 50 60
Actual daily salaries of the employees, in Rupees, are as under :
175, 225, 375, 478, 525, 650, 570, 451, 382, 280
375, 465, 530, 480, 320, 515, 225, 345, 471, 450
Find out the total daily bonus paid to the employees.
Ans. Total daily bonus paid to the employees = Rs. 720.

23. From the following data construct a bivariate frequency distribution :
Age of husbands (in years) (x) : 28 26 27 25 28 28 27 27 26 27 27 26 25 26 27
Age of wives (in years) (y) : 22 21 21 20 22 21 21 20 20 19 21 19 19 20 21
Ans.
x : 25 26 27 28
fx : 2 4 6 3
y : 19 20 21 22
fy : 3 4 6 2

24. The data given below relate to the heights and weights of 20 persons. You are required to form a two-way frequency table with classes 62″ to 64″, 64″ to 66″ and so on, and 115 to 125 lbs., 125 to 135 lbs. and so on.
S. No. : 1 2 3 4 5 6 7 8 9 10
Weight : 170 135 136 137 148 124 117 128 143 129
Height : 70 65 65 64 69 63 65 70 71 62
S. No. : 11 12 13 14 15 16 17 18 19 20
Weight : 163 139 122 134 140 132 120 148 129 152
Height : 70 67 63 68 67 69 66 68 67 67
Ans. [Frequencies : (W) = 4, 5, 6, 3, 1, 1 ; (H) = 3, 4, 5, 4, 4.]

25. The following figures are income (x) and percentage expenditure on food (y) in 25 families. Construct a bivariate frequency table classifying x into intervals 200—300, 300—400, …, and y into 10—15, 15—20, … .
Write down the marginal distributions of x and y and the conditional distribution of x when y lies between 15 and 20.

| x | y | x | y | x | y | x | y | x | y |
|---|---|---|---|---|---|---|---|---|---|
| 550 | 12 | 225 | 25 | 680 | 13 | 202 | 29 | 689 | 11 |
| 623 | 14 | 310 | 26 | 330 | 25 | 255 | 27 | 523 | 12 |
| 310 | 18 | 640 | 20 | 425 | 16 | 492 | 18 | 317 | 18 |
| 420 | 16 | 512 | 18 | 555 | 15 | 587 | 21 | 384 | 17 |
| 600 | 15 | 690 | 12 | 325 | 23 | 643 | 19 | 400 | 19 |

26. What is meant by a bi-variate series ? Taking 15 imaginary figures, construct a series of this type on discrete pattern. [Delhi Univ. B.Com. (Pass), 1998]

27. 30 pairs of values of two variables X and Y are given below. Form a two-way table :
X : 14 20 33 25 41 18 24 29 38 45 23 32 37 19 28
Y : 148 242 296 312 518 196 214 340 492 568 282 400 288 292 431
X : 34 38 29 44 40 22 39 43 44 12 27 39 38 17 26
Y : 440 500 512 415 514 282 481 516 598 122 200 451 387 245 413
Take class intervals of X as 10 to 20, 20 to 30 etc., and that of Y as 100 to 200, 200 to 300, etc. [Osmania Univ. B.Com., 1996]

28. Following are the marks obtained by 24 students in English (X) and Economics (Y) in a test.
(15, 13), (0, 1), (1, 2), (3, 7), (16, 8), (2, 9), (18, 12), (5, 9), (4, 17), (17, 16), (6, 6), (19, 18), (14, 11), (9, 3), (8, 5), (13, 4), (10, 10), (13, 11), (11, 14), (11, 7), (12, 18), (18, 15), (9, 15), (17, 3).
Taking class-intervals as 0—4, 5—9, etc., for X and Y both, construct :
(i) Bivariate frequency table.
(ii) Marginal frequency tables of X and Y.

29. Fill in the blanks :
(i) Variables are of two kinds … … and … …
(ii) … … is the process of arranging data into groups according to their common characteristics.
(iii) In chronological classification, the data are classified on the basis of … …
(iv) … … classification means the classification of data according to location.
(v) Class-mark (mid-point) is the value lying half-way between … …
(vi) According to Sturges’ rule, the number of classes (k) is given by : k = … …
(vii) The magnitude of the class (i) is given by : i = … …
(viii) … … of data is a function very similar to that of sorting letters in a post office.
(ix) Different bases of classification of data are … …
(x) The data can be classified into … … and … … type classes.
(xi) While forming a grouped frequency distribution, the number of classes should usually be between … …
(xii) In exclusive type classes, the upper limit of the class is … …
(xiii) In the continuous classes 0—5, 5—10, 10—15, 15—20 and so on, the class 15—20 means that the variable X takes the values … …
(xiv) Two examples of discrete variable are … … and … … and continuous variable are … … and … …
(xv) The classes in which the lower limit or the upper limit are not specified, are known as … …
(xvi) The difference between the upper and the lower limits of a class gives … … of the class.
(xvii) The number of observations in a particular class is called the … … of the class.
(xviii) If the data values are classified into the classes 0—9, 10—19, 20—29, and so on and the frequency of the class 20—29 is 12, it means that … … .
(xix) If the mid-points of the classes are 16, 24, 32, 40, and so on, then the magnitude of the class intervals is … … .
(xx) In (xix), the class boundaries are … … .
Ans. (i) discrete, continuous. (ii) classification. (iii) time. (iv) geographical. (v) the upper and the lower limits of the class. (vi) k = 1 + 3·322 log10 N ; N is total frequency. (vii) i = (upper limit – lower limit) of the class. (viii) classification. (ix) geographical, chronological, qualitative and quantitative. (x) inclusive, exclusive. (xi) 5 and 15. (xii) is not included in the class. (xiii) 15 and more but less than 20 i.e., 15 ≤ X < 20. (xiv) marks in a test, number of
(xvi) the width or the magnitude. (xvii) frequency. (xviii) there are 12 observations taking values between 20 and 29, both inclusive i.e., 20 ≤ X ≤ 29. (xix) 8. (xx) 12—20, 20—28, 28—36, 36—44 and so on.

3·7. TABULATION – MEANING AND IMPORTANCE

By tabulation we mean the systematic presentation of the information contained in the data, in rows and columns, in accordance with some salient features or characteristics. Rows are horizontal arrangements and columns are vertical arrangements. In the words of A.M. Tuttle :
“A statistical table is the logical listing of related quantitative data in vertical columns and horizontal rows of numbers with sufficient explanatory and qualifying words, phrases and statements in the form of titles, headings and notes to make clear the full meaning of data and their origin.”
Professor Bowley in his manual of Statistics refers to tabulation as “the intermediate process between the accumulation of data in whatever form they are obtained, and the final reasoned account of the result shown by the statistics.”

Tabulation is one of the most important and ingenious devices for presenting the data in a condensed and readily comprehensible form. It attempts to furnish the maximum information contained in the data in the minimum possible space, without sacrificing the quality and usefulness of the data. It is an intermediate process between the collection of the data on the one hand and statistical analysis on the other. In fact, tabulation is the final stage in the collection and compilation of the data and forms the gateway for further statistical analysis and interpretation. Tabulation makes the data comprehensible and facilitates comparisons (by classifying data into suitable groups) and the work of further statistical analysis, averaging, correlation, etc. It makes the data suitable for further diagrammatic and graphic representation.
If the information contained in the data is expressed as running text, it is quite time-consuming to comprehend, because in order to understand every minute detail one has to go through all the paragraphs, which usually contain a large amount of repetition. Tabulation overcomes the drawback of the repetition of explanatory phrases and headings and presents the data in a neat, readily comprehensible form and in true perspective, thus highlighting the significant and relevant details and information. Tabulated data have an attractive get-up and leave a more lasting impression on the mind than data in textual form. Tabulation also facilitates the detection of errors and omissions in the data, and it enables us to draw the attention of the observer to specific items by means of comparisons, emphasis and arrangement of the layout.

No hard and fast rules can be laid down for tabulating statistical data. To prepare a first class table one must have a clear idea about the facts to be presented and stressed, the points on which emphasis is to be laid, and familiarity with the technique of preparing a table. The arrangement of data in a table requires considerable thought, so that the table shows the relationship between the data of one or more series as well as the significance of all the figures given in the classification adopted. The facts, comparisons and contrasts, and emphasis vary from one table to another. Accordingly, a good table (the requirements of which are given below) can only be obtained through the skill, expertise, experience and common sense of the tabulator, keeping in view the nature, scope and objectives of the enquiry. This bears testimony to the following words of A.L. Bowley :
“In the tabulation of the data common sense is the chief requisite and experience is the chief teacher.”

3·7·1. Parts of a Table.
The various parts of a table vary from problem to problem depending upon the nature of the data and the purpose of the investigation. However, the following are a must in a good statistical table :
(i) Table number
(ii) Title
(iii) Head notes or Prefatory notes
(iv) Captions and Stubs
(v) Body of the table
(vi) Foot-note
(vii) Source note

1. Table Number. If a book or an article or a report contains more than one table, then all the tables should be numbered in a logical sequence for proper identification and for easy and ready reference in future. The table number may be placed at the top of the table, either in the centre above the title or at the side of the title.

2. Title. Every table must be given a suitable title, which usually appears at the top of the table (below the table number or next to the table number). A title is meant to describe in brief and concise form the contents of the table and should be self-explanatory. It should precisely describe the nature of the data (criteria of classification, if any) ; the place (i.e., the geographical or political region or area to which the data relate) ; the time (i.e., period to which the data relate) and the source of the data. The title should be brief, but not incomplete and not at the cost of clarity. It should be unambiguous and properly worded and punctuated. Sometimes it becomes desirable to use long titles for the sake of clarity. In such a situation a ‘catch title’ may be given above the ‘main title’. Of all the parts of the table, the title should be most prominently lettered.

3. Head Notes (or Prefatory Notes). If need be, a head note is given just below the title in a prominent type, usually centred and enclosed in brackets, for further description of the contents of the table. It is a sort of supplement to the title and provides an explanation concerning the entire table or its major parts, like captions or stubs.
For instance, the units of measurement are usually expressed as head notes such as ‘in hectares’, ‘in millions’, ‘in quintals’, ‘in Rupees’, etc.

4. Captions and Stubs. Captions are the headings or designations for vertical columns and stubs are the headings or designations for the horizontal rows. They should be brief, concise and self-explanatory. Captions are usually written in the middle of the columns in small letters to economise space. If the same unit is used for all the entries in the table then it may be given as a head note along with the title. However, if the items in different columns or rows are measured or expressed in different units, then the corresponding units should also be indicated in the columns or rows. Relative units like ratios, percentages, etc., if any, should also be specified in the respective rows or columns. For instance, the columns may constitute the population (in millions) of different countries and rows may indicate the different periods (years).
Quite often two or more columns or rows corresponding to similar classifications (or with the same headings) may be grouped together under a common heading to avoid repetition, and may be given what are called sub-captions or sub-stubs. It is also desirable to number each column and row for reference and to facilitate comparisons.

5. Body of the Table. The arrangement of the data according to the descriptions given in the captions (columns) and stubs (rows) forms the body of the table. It contains the numerical information which is to be presented to the readers and forms the most important part of the table. Undesirable information, and information irrelevant to the enquiry, should be avoided. To increase the usefulness of the table, totals must be given for each separate class/category immediately below the columns or against the rows. In addition, the grand totals for all the classes for rows/columns should also be given.

6. Foot Note.
When some characteristic or feature or item of the table has not been adequately explained and needs further elaboration, or when some additional information is required for its complete description, foot-notes are used for this purpose. As the name suggests, foot-notes, if any, are placed at the bottom of the table directly below the body of the table. Foot-notes may be attached to the title, captions, stubs or any part of the body of the table. Foot-notes are identified by the symbols *, **, ***, †, @, etc.

7. Source Note. If the source of the table is not explicitly contained in the title, it must be given at the bottom of the table, below the foot-note, if any. The source note is required if secondary data are used. If the data are taken from a research journal or periodical, then the source note should give the name of the journal or periodical along with the date of publication, its volume number, table number (if any), page number, etc., so that anybody who uses the data may satisfy himself (if need be) about the accuracy of the figures given in the table by referring to the original source. The source note will also enable the user to decide about the reliability of the data, since to the learned users of Statistics the reputations of the sources may vary greatly from one agency to another.

The format of a blank table is given in Table 3·27.

TABLE 3·27. FORMAT OF A BLANK TABLE

TITLE
[Head Note or Prefatory Note (if any)]

| Stub Heading ↓ / Caption → | Sub-Head : Column Head | Sub-Head : Column Head | Sub-Head : Column Head | Sub-Head : Column Head | Total |
|---|---|---|---|---|---|
| Stub entries | Body | | | | |
| Total | | | | | |

Foot Note :
Source Note :

Remarks 1. A table should be so designed that it is neither too long and narrow nor too short and broad. It should be of reasonable size adjusted to the space at our disposal and should have an attractive get-up. If the data are very large they should not be crowded in a single table which would become unwieldy and difficult to comprehend.
In such a situation it is desirable to split the large table into a number of tables of reasonable size and shape. Each table should be complete in itself. 2. If the figures corresponding to certain items in the table are not available due to certain reasons, then the gaps arising therefrom should be filled by writing N.A. which is used as an abbreviation for ‘not available’. 3·7·2. Requisites of a Good Table. As pointed out earlier, no hard and fast rules can be laid down for preparing a statistical table. Preparation of a good statistical table is a specialised job and requires great skill, experience and common sense on the part of the tabulator. However, commensurate with the objectives and scope of the enquiry, the following points may be borne in mind while preparing a good statistical table. (i) The table should be simple and compact so that it is readily comprehensible. It should be free from all sorts of overlappings and ambiguities. (ii) The classification in the table should be so arranged as to focus attention on the main comparisons and exhibit the relationship between various related items and facilitate statistical analysis. It should highlight the relevant and desired information needed for further statistical investigation and emphasise the important points in a compact and concise way. Different modes of lettering (in italics, bold or antique type, capital letters or small letters of the alphabet, etc.), may be used to distinguish points of special emphasis. (iii) A table should be complete and self-explanatory. It should have a suitable title, head note (if necessary), captions and stubs, and footnote (if necessary). If the data are secondary, the source note should also be given. [For details see § 3·7·1]. The use of dash (—) and ditto marks (,,) should be avoided. Only accepted common abbreviations should be used. 
(iv) A table should have an attractive get-up which is appealing to the eye and the mind so that the reader may grasp it without any strain. This necessitates special attention to the size of the table and proper spacing of rows and columns.

(v) Since a statistical table forms the basis for statistical analysis and computation of various statistical measures like averages, dispersion, skewness, etc., it should be accurate and free from all sorts of errors. This necessitates checking and re-checking of the entries in the table at each stage because even a minor error of tabulation may lead to very fallacious conclusions and misleading interpretations of the results.

(vi) The classification of the data in the table should be in alphabetical, geographical or chronological order or in order of magnitude or importance to facilitate comparisons.

(vii) A summary table [See § 3·7·3] should have adequate interpretative figures like totals, ratios, percentages, averages, etc.

3·7·3. Types of Tabulation. Statistical tables are constructed in many ways. Their choice basically depends upon :
(i) Objectives and scope of the enquiry.
(ii) Nature of the enquiry (primary or secondary).
(iii) Extent of coverage given in the enquiry.

The following scheme displays the various forms of tables commonly used in practice.

Types of Tables
(a) On the basis of objectives or purpose : General Purpose or Reference Table ; Special Purpose or Summary Table.
(b) On the basis of nature of enquiry : Original or Primary Table ; Derived or Derivative Table.
(c) On the basis of coverage : Simple Table ; Complex Table.

General Purpose (or Reference) and Special Purpose (or Summary) Tables.
General purpose tables, which are also known as reference tables or sometimes informative tables, provide a convenient way of compiling and presenting systematically arranged data, usually in chronological order, in a form which is suitable for ready reference and record, without any intention of comparative study or of bringing out the relationship or significance of the figures. Most of the tables prepared by government agencies, e.g., the detailed tables in the census reports, are of this kind. These tables are of a repository nature, are mainly designed for use by research workers and statisticians, and are generally given at the end of the report in the form of an appendix. Examples of such tables are : age and sex-wise distribution of the population of a particular region, community or country ; payrolls of a business house ; sales orders for different products manufactured by a concern ; the distribution of students in a university according to age, sex and the faculty they join ; and so on.

As distinct from the general purpose or reference tables, the special purpose or summary tables (also sometimes called interpretative tables) are of an analytical nature and are prepared with the idea of making comparative studies and studying the relationship and the significance of the figures provided by the data. These are generally constructed to emphasise some facts or relationships pertaining to a particular or specific purpose. In such tables interpretative figures like ratios, percentages, etc., are used in order to facilitate comparisons. Summary tables are sometimes called derived or derivative tables (discussed below) as they are generally derived from the general purpose tables.

Original and Derived Tables. On the basis of the nature or originality of the data, the tables may be classified into two classes : (i) Primary tables (ii) Derived or Derivative tables. In a primary table, the statistical facts are expressed in the original form.
It, therefore, contains absolute and actual figures and not rounded numbers or percentages. On the other hand, a derived or derivative table is one which contains figures and results derived from the original or primary data. It expresses the information in terms of ratios, percentages, aggregates or statistical measures like average, dispersion, skewness, etc. For instance, the time series data is expressed in a primary table but a table expressing the trend values and seasonal and cyclic variations is a derived table. In practice, mixtures of primary and derived tables are generally used, an illustration being given below :

TABLE 3·28. LOAD CARRIED BY RAILWAYS AND ROAD TRANSPORT FOR DIFFERENT YEARS
(In Billion Tonne Km)
                                             Percentage Share
Year        Railways    Road Transport    Railways    Road Transport
1960-61        88            17             83·8          16·2
1965-66       117            34             77·5          22·5
1968-69       125            40             75·8          24·2
1973-74       122            65             65·2          34·8
1974-75       134            80             62·6          37·4
1975-76       148            81             64·6          35·4

Simple and Complex Tables. In a simple table the data are classified w.r.t. a single characteristic and accordingly it is also termed as one-way table. On the other hand, if the data are grouped into different classes w.r.t. two or more characteristics or criteria simultaneously, then we get a complex or manifold table. In particular, if the data are classified w.r.t. two (three) characteristics simultaneously we get a two-way (three-way) table.

Simple Table. As already stated, a simple table furnishes information about only one single characteristic of the data. For instance, Table 3·1 (on page 3·4) relating to agricultural output of different countries (in kg per hectare) ; Table 3·2 (on page 3·4) giving the density of population (per square kilometre) in different cities of India ; Table 3·3 (on page 3·4) giving the population of India (in crores) for different years, are all simple tables. As another illustration, the Table 3·29 giving the imports from principal countries (by sea, air and land) for the year 1975-76 is a simple table.

TABLE 3·29. IMPORTS FROM PRINCIPAL COUNTRIES BY SEA, AIR AND LAND FOR 1975-76
(Rupees in lakhs)
Country             Imports
Australia            10,167
Canada               23,201
France               19,653
(West) Germany       36,996
Japan                36,118
UK                   28,400
USA                1,28,522
USSR                 30,978

Two-way Table. However, if the caption or stub is classified into two sub-groups, which means that the data are classified w.r.t. two characteristics, we get a two-way table. Thus a two-way table furnishes information about two inter-related characteristics of a particular phenomenon. For example, the distribution of the number of students in a college w.r.t. age (1st characteristic) and sex (2nd characteristic) gives a two-way table. As another illustration, Table 3·30 which gives the load/distance by Railways and Road Transport for different years, is a two-way table.

TABLE 3·30. LOAD CARRIED BY RAILWAYS AND ROAD TRANSPORT FOR DIFFERENT YEARS
(In Billion Tonne Km)
Years       Railways    Road Transport
1960-61        88            17
1965-66       117            34
1968-69       125            40
1973-74       122            65
1974-75       134            80
1975-76       148            81

Three-way Tables. If the data are classified simultaneously w.r.t. three characteristics, we get a three-way table. Thus a three-way table gives us information regarding three inter-related characteristics of a particular phenomenon. For example, the classification of a given population w.r.t. age, sex and literacy, or the classification of the students in a university w.r.t. sex, faculty (Arts, Sciences, Commerce) and the class (1st year, 2nd year, 3rd year of the under-graduate courses) will give rise to three-way tables.
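Illustration. The distinction between one-way, two-way and three-way tables described above may be sketched in a few lines of Python. The records, the helper name `tabulate` and the variable names are hypothetical and are introduced only for illustration; the figures are invented and do not come from any table in this chapter. The same set of records is counted first w.r.t. one characteristic, then two, then three.

```python
from collections import Counter

# Hypothetical records, one per person: (age group, sex, literacy).
# The figures are invented for illustration only.
RECORDS = [
    ("0-20", "M", "Literate"),
    ("0-20", "F", "Literate"),
    ("0-20", "F", "Illiterate"),
    ("20-40", "M", "Literate"),
    ("20-40", "M", "Literate"),
    ("20-40", "F", "Literate"),
    ("20-40", "M", "Illiterate"),
    ("40-60", "F", "Illiterate"),
]


def tabulate(records, *positions):
    """Count the records cross-classified by the characteristics at `positions`."""
    return Counter(tuple(record[p] for p in positions) for record in records)


one_way = tabulate(RECORDS, 0)          # age only: a simple (one-way) table
two_way = tabulate(RECORDS, 0, 1)       # age and sex: a two-way table
three_way = tabulate(RECORDS, 0, 1, 2)  # age, sex and literacy: a three-way table

# Print the cells of the three-way table in sorted order.
for cell, count in sorted(three_way.items()):
    print(cell, count)
```

Each additional characteristic splits the existing cells further, but the grand total stays the same, which is why the sub-totals and totals of a higher order table can always be checked against those of the simpler tables.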
The tables given in Examples 3·12 to 3·19 are three-way tables. As another illustration, the following Table 3·31 representing the distribution of population of a city according to different age-groups (say, five age groups from 0 to 100 years), sex and literacy is a three-way table.

TABLE 3·31. DISTRIBUTION OF POPULATION (IN ’000) OF A CITY w.r.t. AGE, SEX AND LITERACY
                    Literates                       Illiterates                     Total
Age Group       Males  Females  Sub-totals     Males  Females  Sub-totals     Males  Females  Row-totals
0–20
20–40
40–60
60–80
80–100
Column Totals

Higher Order or Manifold Tables. These tables give the information on a large number of inter-related problems or characteristics of a given phenomenon. For example, the distribution of students in a college according to faculty, class, sex and year (Example 3·20) or the distribution of employees in a business concern according to sex, age-groups, years and grades of salary (Example 3·21) gives rise to manifold tables. Manifold or higher order tables are commonly used in presenting population census data.

Remark. It may be pointed out that as the order of the table goes on increasing, the table becomes more and more difficult to comprehend and might even become confusing. In practice, in a single table only up to three or sometimes four characteristics are represented simultaneously. If the study involves more than four characteristics at a time then it is desirable to represent the data in more than one table for depicting the relationship between different characteristics.

Example 3·12. Present the following information in a suitable tabular form, supplying the figures not directly given :
In 1995 out of total 2000 workers in a factory, 1550 were members of a trade union. The number of women workers employed was 250, out of which 200 did not belong to any trade union.
In 2000, the number of union workers was 1725 of which 1600 were men. The number of non-union workers was 380, among which 155 were women.
Solution.
TABLE 3·32. COMPARATIVE STUDY OF THE MEMBERSHIP OF TRADE UNION IN A FACTORY IN 1995 AND 2000

Year 1995
Trade Union ↓     Males                    Females              Total
Members           1,550 – 50 = 1,500       250 – 200 = 50       1,550*
Non-members       1,750 – 1,500 = 250      200*                 2,000 – 1,550 = 450
Total             2,000 – 250 = 1,750      250*                 2,000*

Year 2000
Trade Union ↓     Males                    Females              Total
Members           1,600*                   1,725 – 1,600 = 125  1,725*
Non-members       380 – 155 = 225          155*                 380*
Total             1,600 + 225 = 1,825      125 + 155 = 280      1,725 + 380 = 2,105

Note. The figures marked * are the given figures. The other values are obtained by appropriate additions or subtractions, since the totals are fixed.

Example 3·13. In a sample study about coffee habit in two towns, the following information was received :
Town A : Females were 40% ; Total coffee drinkers were 45% and Males non-coffee drinkers were 20%.
Town B : Males were 55% ; Males non-coffee drinkers were 30% and Females coffee drinkers were 15%.
Present the above data in a tabular form. [C.A. (Foundation), May 1997]
Solution.

TABLE 3·33
Town A                  Males            Females         Total
Coffee drinkers         60 – 20 = 40     45 – 40 = 5     45*
Non-coffee drinkers     20*              40 – 5 = 35     100 – 45 = 55
Total                   100 – 40 = 60    40*             100
Note. The figures marked * are the given figures.

TABLE 3·33(a)
Town B                  Males            Females         Total
Coffee drinkers         55 – 30 = 25     15*             25 + 15 = 40
Non-coffee drinkers     30*              60 – 30 = 30    100 – 40 = 60
Total                   55*              100 – 55 = 45   100
Note. The figures marked * are the given figures.

The information in Tables 3·33 and 3·33(a) can be expressed in a single table as given in Table 3·33(b).

TABLE 3·33(b). SEX-WISE PERCENTAGE OF COFFEE DRINKERS IN TOWNS A AND B
                             Town A                      Town B
                        Males  Females  Total       Males  Females  Total
Coffee drinkers           40      5       45          25     15       40
Non-coffee drinkers       20     35       55          30     30       60
Total                     60     40      100          55     45      100

Example 3·14. Tabulate the following :
Out of a total number of 10,000 candidates who applied for jobs in a government department, 6,854 were males, 3,146 were graduates and others, non-graduates.
The number of candidates with some experience was 2,623 of whom 1,860 were males. The number of male graduates was 2,012. The number of graduates with experience was 1,093 that includes 323 females.
Solution. We are given that the total number of :
Applicants = 10,000 ; Males = 6,854 ; Graduates = 3,146 ; Experienced = 2,623.
Total number of Females = 10,000 – 6,854 = 3,146
Total number of Non-graduates = 10,000 – 3,146 = 6,854
Total number of Inexperienced persons = 10,000 – 2,623 = 7,377
The above and the remaining given information can be summarised in the following Table 3·34.

TABLE 3·34. DISTRIBUTION OF CANDIDATES FOR GOVERNMENT JOBS SEX-WISE, EDUCATION-WISE AND EXPERIENCE-WISE
               Graduates                              Non-graduates                          Total
Sex ↓      Experienced  Inexperienced  Total      Experienced  Inexperienced  Total      Experienced  Inexperienced  Total
Male           770         1242        2012*         1090         3752        4842          1860*        4994        6854*
Female         323*         811        1134           440         1572        2012           763         2383        3146
Total         1093*        2053        3146*         1530         5324        6854          2623*        7377       10000*

The figures marked * are the given figures. The remaining values have been obtained by minor calculations (additions or subtractions), as the totals are fixed.

Example 3·15. A survey of 370 students from Commerce Faculty and 130 students from Science Faculty revealed that 180 students were studying for only C.A. Examinations, 140 for only Costing Examinations and 80 for both C.A. and Costing Examinations. The rest had offered part-time Management Courses. Of those studying for Costing only, 13 were girls and 90 boys belonged to Commerce Faculty. Out of 80 studying for both C.A. and Costing, 72 were from Commerce Faculty amongst which 70 were boys. Amongst those who offered part-time Management Courses, 50 boys were from Science Faculty and 30 boys and 10 girls from Commerce Faculty. In all there were 110 boys in Science Faculty.
Present the above information in a tabular form. Find the number of students from Science Faculty studying for part-time Management Courses.
Solution.
We are given that :
Total number of Commerce students = 370
Total number of Science students = 130
∴ Total number of all the students = 370 + 130 = 500
We are also given that out of these 500 students, the number of students studying :
For C.A. only = 180 ; For Costing only = 140 ; For both Costing and C.A. = 80
∴ Number of students studying for part-time Management courses = 500 – (180 + 140 + 80) = 100.
The above information and the remaining given information is summarised in the Table 3·35. The figures marked * are the given figures.

TABLE 3·35. FACULTY, SEX AND COURSE-WISE DISTRIBUTION OF STUDENTS
Faculty     Sex      Part-time Management   C.A. only   Costing only   C.A. and Costing   Total
Commerce    Boys     30*                                90*            70*
            Girls    10*                                               72 – 70 = 2
            Total    30 + 10 = 40                                      72*                370*
Science     Boys     50*                                                                  110*
            Girls    10                                                                   130 – 110 = 20
            Total    100 – 40 = 60                                     80 – 72 = 8        130*
Total       Boys     30 + 50 = 80
            Girls    10 + 10 = 20
            Total    100                    180*        140*           80*                500
Also given : 13* girls among those studying for Costing only.

Number of students from Science Faculty studying part-time Management courses is 60.

Example 3·16. In 1990, out of a total of 2,000 students in a college 1,400 were for Graduation and the rest for Post-Graduation (P.G.). Out of 1,400 Graduate students 100 were girls. However, in all there were 600 girls in the college. In 1995, number of graduate students increased to 1,700, out of which 250 were girls, but the number of P.G. students fell to 500 of which only 50 were boys. In 2000, out of 800 girls, 650 were for Graduation, whereas the total number of graduates was 2,200. The number of boys and girls in P.G. classes was equal. Represent the above information in tabular form. Also calculate the percentage increase in the number of graduate students in 2000 as compared to 1990. [C.A. (Foundation), Nov. 2001]
Solution.
The distribution of the number of students with respect to level of education and sex is obtained as follows :

Year 1990           Girls                Boys                  Total
Graduation          100*                 1400 – 100 = 1300     1400*
Post-Graduation     600 – 100 = 500      600 – 500 = 100       2000 – 1400 = 600
Total               600*                 2000 – 600 = 1400     2000*

Year 1995           Girls                Boys                  Total
Graduation          250*                 1700 – 250 = 1450     1700*
Post-Graduation     500 – 50 = 450       50*                   500*
Total               250 + 450 = 700      1450 + 50 = 1500      1700 + 500 = 2200

Year 2000           Girls                Boys                  Total
Graduation          650*                 2200 – 650 = 1550     2200*
Post-Graduation     800 – 650 = 150      150                   150 + 150 = 300
Total               800*                 1550 + 150 = 1700     2500

Note. The figures marked * are the given figures.

The information in the above three tables can be expressed in a single table as given in Table 3·36.

TABLE 3·36. DISTRIBUTION OF STUDENTS ACCORDING TO DEGREE AND SEX FOR YEARS 1990 TO 2000
Degrees →       Graduation                   Post-graduation              Total
Year ↓       Boys   Girls   Total (a)     Boys   Girls   Total (b)     (a) + (b)
1990         1300    100     1400          100    500      600           2000
1995         1450    250     1700           50    450      500           2200
2000         1550    650     2200          150    150      300           2500
Total        4300   1000     5300          300   1100     1400           6700

Percentage increase in the number of graduate students in 2000 as compared to 1990 is :
[(2200 – 1400) / 1400] × 100 = 57·14%.

Example 3·17. Out of a total number of 1,807 women who were interviewed for employment in a textile factory of Mumbai, 512 were from textile areas and the rest from the non-textile areas. Amongst the married women who belonged to textile areas, 247 were experienced and 73 inexperienced, while for non-textile areas, the corresponding figures were 49 and 520. The total number of inexperienced women was 1,341 of whom 111 resided in textile areas. Of the total number of women, 918 were unmarried and of these the number of experienced women in the textile and non-textile areas was 154 and 16 respectively. Tabulate.
Solution.
Total number of women interviewed = 1,807
No.
of women from textile areas = 512
∴ Number of women from non-textile areas = 1,807 – 512 = 1,295
Total number of married women in textile areas = 247 + 73 = 320
Total number of married women in non-textile areas = 49 + 520 = 569
Total number of inexperienced women = 1,341
∴ Total number of experienced women = 1,807 – 1,341 = 466
Total number of unmarried women = 918
∴ Total number of married women = 1,807 – 918 = 889
Total number of unmarried experienced women in textile areas = 154
and Total number of unmarried experienced women in non-textile areas = 16
After filling this information in the table, the remaining entries in the table of the experience, marital status and area-wise distribution of the number of women can now be completed by subtraction/addition, wherever necessary and is given in Table 3·37.

TABLE 3·37. TABLE SHOWING THE NUMBER OF WOMEN INTERVIEWED FOR EMPLOYMENT IN A TEXTILE FACTORY ACCORDING TO THEIR MARITAL STATUS, EXPERIENCE AND THE AREA TO WHICH THEY BELONG
               Textile Areas                          Non-textile Areas                      Total
            Experienced  Inexperienced  Total     Experienced  Inexperienced  Total     Experienced  Inexperienced  Total
Married         247           73         320           49          520         569          296          593         889
Unmarried       154           38         192           16          710         726          170          748         918
Total           401          111         512           65        1,230       1,295          466        1,341       1,807

Example 3·18. Draw up a blank table to show the number of candidates sex-wise, appearing in the Pre-university, First Year, Second Year and Third Year examinations of a university in the faculties of Arts, Science and Commerce in the year 2002.
Solution.

TABLE 3·38. DISTRIBUTION OF CANDIDATES APPEARING IN THE UNIVERSITY EXAMINATIONS w.r.t. FACULTY, SEX AND EXAMINATION IN 2002
Faculty →            Arts                  Science               Commerce              Total
Sex →             M   F   Sub-Total     M   F   Sub-Total     M   F   Sub-Total     M   F   Row Totals
Examination ↓
Pre-university
First Year
Second Year
Third Year
Column Totals
Note. M indicates Male ; F indicates Female.

Example 3·19.
Prepare a blank table to show the exports of three companies A, B, C to five countries U.K., U.S.A., U.S.S.R., France and West Germany, in each of the years 1995 to 1999.
Solution.

TABLE 3·39. EXPORTS OF THE COMPANIES A, B AND C TO FIVE COUNTRIES FROM 1995 TO 1999 (IN MILLION RUPEES)
Year →           1995       1996       1997       1998       1999       Total
Company →       A  B  C    A  B  C    A  B  C    A  B  C    A  B  C    A  B  C
Countries ↓
UK
USA
USSR
France
West Germany
Total

Example 3·20. Draw a blank table to present the following information regarding the college students according to :
(a) Faculty : Social Sciences, Commercial Sciences.
(b) Class : Under-graduate and Post-graduate classes.
(c) Sex : Male and Female.
(d) Years : 1998 to 2002.
Solution. Please see Table 3·40 below.

Example 3·21. Prepare a blank table showing the number of employees in a big business concern according to :
(a) Sex : Males and Females.
(b) Five age-groups : Below 25 years, 25 to 35 years, 35 to 45 years, 45 to 55 years, 55 years and over.
(c) Two years : 2000 and 2001.
(d) Three grades of weekly salary : Below Rs. 4000 ; Rs. 4000 to 7000 ; Rs. 7000 and above.
Solution. Please see Table 3·41 below.

TABLE 3·40. DISTRIBUTION OF COLLEGE STUDENTS ACCORDING TO SEX, CLASS AND FACULTY FOR 1998 TO 2002

Faculty : Social Sciences
Class →        Under-Graduate                   Post-Graduate                    Total
Sex →          M      F      S.T.               M      F      S.T.               M            F            S.T.
Column →       (i)    (ii)   (iii) = (i)+(ii)   (iv)   (v)    (vi) = (iv)+(v)    (i) + (iv)   (ii) + (v)   (iii) + (vi)
Years ↓
1998
1999
2000
2001
2002

Faculty : Commercial Sciences
Class →        Under-Graduate                      Post-Graduate                    Total
Sex →          M       F        S.T.               M      F      S.T.               M            F             S.T.
Column →       (vii)   (viii)   (ix) = (vii)+(viii)   (x)    (xi)   (xii) = (x)+(xi)   (vii) + (x)   (viii) + (xi)   (ix) + (xii)
Years ↓
1998
1999
2000
2001
2002

Note : M indicates Males ; F indicates Females ; S.T. indicates sub-totals.
Remark : In the caption, we may have another sub-group named ‘Total’ of Social Sciences and Commercial Sciences as in Example 3·22.

TABLE 3·41. DISTRIBUTION OF THE NUMBER OF EMPLOYEES OF A BUSINESS HOUSE ACCORDING TO SEX, AGE AND SALARY FOR 2000 AND 2001.
Sex →                          Male                                     Female                                   Total
Weekly salary (’00 Rs.) →      0–40  40–70  70 and over  Sub-total      0–40  40–70  70 and over  Sub-total      0–40  40–70  70 and over  Total
Years ↓    Age in years ↓
2000       0–25
           25–35
           35–45
           45–55
           55 and over
           Sub-total
2001       0–25
           25–35
           35–45
           45–55
           55 and over
           Sub-total

EXERCISE 3·2.
1. (a) Explain the terms ‘classification’ and ‘tabulation’ and point out their importance in a statistical investigation. What precautions would you take in tabulating statistical data ?
(b) What are the chief functions of tabulation ? What precautions would you take in tabulating statistical data ?
(c) Explain the purpose of classification and tabulation of data. State the rules that serve as a guide in tabulation of data.
2. (a) What do you mean by tabulation of data ? What precautions would you take while tabulating data ?
(b) Distinguish between classification and tabulation of statistical data. Mention the requisites of a good statistical table. [Himachal Pradesh Univ. B.Com., 1998]
(c) Distinguish between classification and tabulation. What precautions would you take in tabulating data ?
3. (a) What do you understand by tabulation ? State any six points that should be kept in mind while tabulating the data. [C.A. (Foundation), May 1995]
(b) Which important points should be kept in mind while preparing a good statistical table ? [C.A. (Foundation), May 1999]
(c) What are the rules to be followed in preparing a statistical table ?
4. (a) Briefly discuss the essential parts of a statistical table. [C.A. (Foundation), May 2001]
(b) Draw a specimen table showing the various parts of a table. [Osmania Univ. B.Com., 1997]
5. Comment on the statement : “In collection and tabulation of data common sense is the chief requisite and experience the chief teacher”.
6.
“The statistical table is a systematic arrangement of numerical data presented in columns and rows for purposes of comparison.” Explain and discuss the various types of tables used in statistical investigation after the data have been collected.
7. In a trip organised by a college there were 80 persons each of whom paid Rs. 15·50 on an average. There were 60 students each of whom paid Rs. 16. Members of the teaching staff were charged at a higher rate. The number of servants was 6 (all males) and they were not charged anything. The number of ladies was 24% of the total of which one was a lady staff member. Tabulate the above information.
8. Tabulate the following data :
A survey was conducted amongst one lakh spectators visiting on a particular day cinema houses showing criminal, social, historical, comic and mythological films. The proportion of male to female spectators under survey was three to two. It indicated that while the respective percentages of spectators seeing criminal, social and historical films were sixteen, twenty-six and eighteen, the actual number of female viewers seeing these types was four thousand six hundred, twelve thousand two hundred, and seven thousand eight hundred respectively. The remaining two types of films, namely, comic and mythological, were seen by forty per cent and one per cent of the male spectators. The number of female spectators seeing mythological films was four thousand four hundred.
9. Present the following information in a suitable tabular form.
“In 1990, out of a total of 1750 workers of a factory, 1200 were members of a trade union. The number of women employees was 200 of which 175 did not belong to a trade union. In 1995, the number of union workers increased to 1580 of which 1290 were men. On the other hand, the number of non-union workers fell to 208, of which 180 were men. In 2000, there were 1800 employees who belonged to a trade union and 50 who did not belong to a trade union.
Of all the employees in 2000, 300 were women, of whom only 8 did not belong to a trade union”.
10. Present the following information in a tabular form :
In 2001, out of a total of 4,000 workers in a factory, 3,300 were members of a trade union. The number of women workers employed was 500 out of which 400 did not belong to the union. In 2000, the number of workers in the union was 3,450 of which 3,200 were men. The number of non-union workers was 760 of which 330 were women.
11. A classification of the population of India by livelihood categories (agricultural and non-agricultural) according to the 1951 census showed that out of total of 356,628 thousand persons, 249,075 thousand persons belonged to agricultural category. In the agricultural category 71,049 thousand persons were self-supporting, 31,069 thousand were earning dependents and the rest were non-earning. The number of non-earning persons and self-supporting persons in the non-agricultural category were 67,335 thousand and 33,350 thousand respectively. The others were earning dependents.
Tabulate the above information expressing all figures in millions (1 million = 1,000 thousand).
12. A survey was conducted among 1,00,000 music listeners who were asked to indicate their preference for classical music, light music, folk songs, film songs and pop varieties of music. The male listeners interviewed were as many as female listeners. The survey indicated that while the percentage of listeners who preferred classical music, light music and folk songs were eight, thirteen and four respectively; the actual number of females for each of the first two kinds were six thousand. Of the listeners who liked folk songs, the number of male listeners was same as that of female listeners. While film songs were liked by number one and half times that for all other varieties put together, the number for pop music were only a fourth of the number of film song listeners.
Sixty per cent of the listeners of pop music were females. Prepare a table showing the distribution of music listeners according to sex and type of music.
13. What are different parts of a table ? What points should be borne in mind while arranging the items in a table ?
An investigation conducted by the education department in a public library revealed the following facts. You are required to tabulate the information as neatly and clearly as you can :
“In 1990, the total number of readers was 46,000 and they borrowed some 16,000 volumes. In 2000, the number of books borrowed increased by 4,000 and the borrowers by 5%.
The classification was on the basis of three sections : literature, fiction and illustrated news. There were 10,000 and 30,000 readers in the sections literature and fiction respectively in the year 1990. In the same year 2,000 and 10,000 books were lent in the sections illustrated news and fiction respectively. Marked changes were seen in 2000. There were 7,000 and 42,000 readers in the literature and fiction sections respectively. So also 4,000 and 13,000 books were lent in the sections illustrated news and fiction respectively.”
14. What is tabulation ? What are its uses ? Mention the items that a good statistical table should contain. [C.A. (Foundation), Nov. 1996]
15. Out of total number of 2,807 women, who were interviewed for employment in a textile factory, 912 were from textile areas and the rest from non-textile areas. Amongst the married women, who belonged to textile areas, 347 were having some work experience and 173 did not have work experience, while for non-textile areas the corresponding figures were 199 and 670 respectively. The total number of women having no experience was 1,841 of whom 311 resided in textile areas. Of the total number of women, 1,418 were unmarried and of these the number of women having experience in the textile and non-textile areas was 254 and 166 respectively. Tabulate the above information. [C.A.
(Foundation), May 1998]
16. In 1995 out of a total of 4,000 workers in a factory 3,300 were members of a trade union. The number of women workers was 500 out of which 400 did not belong to the union. In 1994, the number of workers in the union was 3,450 of which 3,200 were men. The number of workers not belonging to the union was 760 of which 300 were women. Present data in a suitable tabular form. [C.A. (Foundation), May 2000]
17. What are the considerations to be taken into account in the construction of a table ? Construct a table for showing the profits of a company for a period of 5 years with imaginary figures. [Madras Univ. B.Com., 1998]
18. (a) What are the components of a good table ?
(b) Construct a blank table in which could be shown, at two different dates and in five industries, the average wages of the four groups, males and females, eighteen years and over, and under eighteen years. Suggest a suitable title.
19. State briefly the requirements of a good statistical table. Prepare a blank table to show the distribution of population of various States and Union Territories of India according to sex and literacy.
20. Draft a blank table to show the distribution of personnel working in an office according to (i) sex, (ii) three grades of monthly salary : below Rs. 10,000 ; Rs. 10,000 to Rs. 20,000 ; above Rs. 20,000, (iii) age groups : below 25 years, 25–40 and 40–60 and (iv) 3 years : 1995-96, 1996-97, 1997-98.
21. Draw up a blank table to show five categories of skilled and unskilled workers, i.e., regular, seasonal, casual, clerical and supervisory; further divided into family members and paid workers with monthly/daily rate and piece rate.
22. Draw up in detail, with proper attention to spacing, double lines, etc., and showing all sub-totals, a blank table in which could be entered the numbers occupied in six industries on two dates, distinguishing males from females, and among the latter single, married and widowed.
23.
Draft a blank table to show the following information for the country A to cover the years 1974, 1989, 1999 and 2002.
(a) Population ; (b) Income-tax collected ; (c) Tobacco duties collected ; (d) Spirits and beer duties collected ; (e) Other taxation.
Arrange for suitable columns to show also the “per capita” figures for (b), (c), (d), (e). Suggest a suitable title.
24. Fill in the blanks :
(i) … … is the first step in tabulation.
(ii) A … … is the systematic arrangement of data in rows and columns.
(iii) The numerical information in a statistical table is called the … … of the table.
(iv) In a statistical table, … … refer to the row headings and … … refer to column headings.
(v) In the collection and tabulation, … … is the chief requisite and … … is the chief teacher.
(vi) In a statistical table, the principal basis for the arrangement of captions and stubs in a systematic order are … … , … … , and … …
(vii) Classification is the … … step in … …
(viii) In a statistical table, captions refer to the … … headings and stubs refer to the … … headings.
(ix) In a statistical table, the data are arranged in … … and … …
(x) In a statistical table, … … should be avoided, especially in titles and headings.
Ans. (i) classification ; (ii) table ; (iii) body ; (iv) stubs, captions ; (v) commonsense, experience ; (vi) alphabetical, chronological and geographical ; (vii) first, tabulation ; (viii) column, row ; (ix) rows and columns ; (x) abbreviations.

4
Diagrammatic and Graphic Representation

4·1. INTRODUCTION
In Chapter 3, we discussed that classification and tabulation are the devices of presenting the statistical data in neat, concise, systematic and readily comprehensible and intelligible form, thus highlighting the salient features. Another important, convincing, appealing and easily understood method of presenting the statistical data is the use of diagrams and graphs.
They are nothing but geometrical figures like points, lines, bars, squares, rectangles, circles, cubes, etc., pictures, maps or charts. Diagrammatic and graphic presentation has a number of advantages, some of which are enumerated below :
(i) Diagrams and graphs are visual aids which give a bird’s eye view of a given set of numerical data. They present the data in simple, readily comprehensible form.
(ii) Diagrams are generally more attractive, fascinating and impressive than the set of numerical data. They are more appealing to the eye and leave a more lasting impression on the mind as compared to the dry and uninteresting statistical figures. Even a layman, who has no statistical background, can understand them easily.
(iii) They are more eye-catching and as such are extensively used to present statistical figures and facts in most of the exhibitions, trade or industrial fairs, public functions, statistical reports, etc. The human mind has a natural craving and love for beautiful pictures and this psychology of the human mind is extensively exploited by the modern advertising agencies who give their advertisements in the shape of attractive and beautiful pictures. Accordingly diagrams and graphs have universal applicability.
(iv) They register a meaningful impression on the mind almost before we think. They also save a lot of time as very little effort is required to grasp them and draw meaningful inferences from them. An individual may not like to go through a set of numerical figures but he may pause for a while to have a glance at the diagrams or pictures. It is for this reason that diagrams, graphs and charts find a place almost daily in financial/business columns of the newspapers, economic and business journals, annual reports of the business houses, etc.
(v) When properly constructed, diagrams and graphs readily show information that might otherwise be lost amid the details of numerical tabulations.
They highlight the salient features of the collected data, facilitate comparisons among two or more sets of data and enable us to study the relationship between them more readily.
(vi) Graphs reveal the trends, if any, present in the data more vividly than the tabulated numerical figures and also exhibit the way in which the trends change. Although this information is inherent in a table, it may be quite difficult and time-consuming (and sometimes may be impossible) to determine the existence and nature of trends from a tabulation of data.

4·2. DIFFERENCE BETWEEN DIAGRAMS AND GRAPHS
No hard and fast rules exist to distinguish between diagrams and graphs but the following points of difference may be observed :
(i) In the construction of a graph, generally graph paper is used which helps us to study the mathematical relationship (though not necessarily functional) between the two variables. On the other hand, diagrams are generally constructed on plain paper and are used for comparisons only and not for studying the relationship between the variables. In diagrams data are presented by devices such as bars, rectangles, squares, circles, cubes, etc., while in graphic mode of presentation points or lines of different kinds (dots, dashes, dot-dash, etc.) are used to present the data.
(ii) Diagrams furnish only approximate information. They do not add anything to the meaning of the data and, therefore, are not of much use to a statistician or research worker for further mathematical treatment or statistical analysis. On the other hand, graphs are more obvious, precise and accurate than the diagrams and are quite helpful to the statistician for the study of slopes, rates of change and estimation (interpolation and extrapolation), wherever possible. In fact, today, graphic work is almost a must in any research work pertaining to the analysis of economic, business or social data.
(iii) Diagrams are useful in depicting categorical and geographical data but they fail to present data relating to time series and frequency distributions. In fact, graphs are used for the study of time series and frequency distributions.
(iv) Construction of graphs is easier as compared to the construction of diagrams.
In the following sections we shall first discuss the various types of diagrams and then the different modes of graphic presentation.

4·3. DIAGRAMMATIC PRESENTATION
4·3·1. General Rules for Constructing Diagrams
1. Neatness. As already pointed out, diagrams are visual aids for presentation of statistical data and are more appealing and fascinating to the eye and leave a lasting impression on the mind. It is, therefore, imperative that they are made very neat, clean and attractive by proper size and lettering, and by the use of appropriate devices like different colours, different shades (light and dark), dots, dashes, dotted lines, broken lines, dot-and-dash lines, etc., for filling the space within the bars, rectangles, circles, etc., and their components. Some of the commonly used devices are given in Fig. 4·1.
2. Title and Footnotes. As in the case of a good statistical table, each diagram should be given a suitable title to indicate the subject-matter and the various facts depicted in the diagram. The title should be brief, self-explanatory, clear and unambiguous. However, brevity should not be attempted at the cost of clarity. The title should be neatly displayed either at the top of the diagram or at its bottom. If necessary, footnotes may be given at the bottom left of the diagram to explain certain points or facts not otherwise covered in the title.
3. Selection of Scale. One of the most important factors in the construction of diagrams is the choice of an appropriate scale.
The same set of numerical data, if plotted on different scales, may give diagrams differing widely in size and at times might lead to wrong and misleading interpretations. Hence, the scale should be selected with great caution. Unfortunately, no hard and fast rules are laid down for the choice of scale. As a guiding principle the scale should be selected consistent with the size of the paper and the size of the observations to be displayed so that the diagram obtained is neither too small nor too big. The size of the diagram should be reasonable so as to focus attention on the salient features and important characteristics of the data. The scale showing the values should be in even numbers or multiples of 5 or 10. The scale(s) used on both the horizontal and vertical axes should be clearly indicated. For comparative study of two or more diagrams, the same scale should be adopted to draw valid conclusions.
4. Proportion Between Width and Height. A proper proportion between the dimensions (height and width) of the diagram should be maintained, consistent with the space available. Here again no hard and fast rules are laid down. In this regard Lutz in his book Graphic Presentation has suggested a rule called the ‘root two’ rule, viz., the ratio 1 to √2 or 1 to 1·414 between the smaller side and the larger side respectively. The diagram should generally be displayed in the middle (centre) of the page.
5. Choice of a Diagram. A large number of diagrams (discussed below) are used to present statistical data. The choice of a particular diagram to present a given set of numerical data is not an easy one. It primarily depends on the nature of the data, magnitude of the observations and the type of people for whom the diagrams are meant and requires great amount of expertise, skill and intelligence.
An inappropriate choice of the diagram for the given set of data might give a distorted picture of the phenomenon under study and might lead to wrong and fallacious interpretations and conclusions. Hence, the choice of a diagram to present the given data should be made with utmost caution and care.
6. Source Note and Number. As in the case of tables, a source note, wherever possible, should be appended at the bottom of the diagram. This is necessary because, to the learned audience of Statistics, the reliability of the information varies from source to source. Each diagram should also be given a number for ready reference and comparative study.
7. Index. A brief index explaining the various types of shades, colours, lines and designs used in the construction of the diagram should be given for clear understanding of the diagram.
8. Simplicity. Lastly, diagrams should be as simple as possible so that they are easily understood even by a layman who does not have any mathematical or statistical background. If too much information is presented in a single complex diagram it will be difficult to grasp and might even become confusing to the mind. Hence, it is advisable to draw several simple diagrams rather than one or two complex diagrams.

4·3·2. Types of Diagrams. A large variety of diagrammatic devices are used in practice to present statistical data. However, we shall discuss here only some of the most commonly used diagrams which may be broadly classified as follows :
(1) One-dimensional diagrams viz., line diagrams and bar diagrams.
(2) Two-dimensional diagrams such as rectangles, squares, and circles or pie diagrams.
(3) Three-dimensional diagrams such as cubes, spheres, prisms, cylinders and blocks.
(4) Pictograms.
(5) Cartograms.

4·3·3. One-dimensional Diagrams
A. LINE DIAGRAM
This is the simplest of all the diagrams. It consists in drawing vertical lines, the height of each vertical line being equal to the frequency.
The variate (x) values are presented on a suitable scale along the X-axis and the corresponding frequencies are presented on a suitable scale along the Y-axis. Line diagrams facilitate comparisons though they are not attractive or appealing to the eye.
Remark. Even time series data may be presented by a line diagram, by taking time factors along the X-axis and the variate values along the Y-axis.
Example 4·1. The following data show the number of accidents sustained by 314 drivers of a public utility company over a period of five years.
Number of accidents :   0   1   2   3   4   5   6   7   8   9  10  11
Number of drivers   :  82  44  68  41  25  20  13   7   5   4   3   2
Represent the data by a line diagram.
Solution.
[Fig. 4·2. LINE DIAGRAM: number of drivers (vertical axis) against number of accidents (horizontal axis).]

B. BAR DIAGRAM
Bar diagrams are one of the easiest and the most commonly used devices of presenting most of the business and economic data. These are especially satisfactory for categorical data or series. They consist of a group of equidistant rectangles, one for each group or category of the data in which the values or the magnitudes are represented by the length or height of the rectangles, the width of the rectangles being arbitrary and immaterial. These diagrams are called one-dimensional because in such diagrams only one dimension viz., height (or length) of the rectangles is taken into account to present the given values. The following points may be borne in mind to draw bar diagrams.
(i) All the bars drawn in a single study should be of uniform (though arbitrary) width depending on the number of bars to be drawn and the space available.
(ii) Proper but uniform spacing should be given between different bars to make the diagram look more attractive and elegant.
(iii) The height (length) of the rectangles or bars is taken proportional to the magnitude of the observations, the scale being selected keeping in view the magnitude of the largest observation.
(iv) All the bars should be constructed on the same base line.
(v) It is desirable to write the figures (magnitudes) represented by the bars at the top of the bars to enable the reader to have a precise idea of the value without looking at the scale.
(vi) Bars may be drawn vertically or horizontally. However, in practice, vertical bars are generally used because they give an attractive and appealing get up.
(vii) Wherever possible the bars should be arranged from left to right (from top to bottom in case of horizontal bars) in order of magnitude to give a pleasing effect.
Types of Bar Diagrams. The following are the various types of bar diagrams in common use :
(a) Simple bar diagram.
(b) Sub-divided or component bar diagram.
(c) Percentage bar diagram.
(d) Multiple bar diagram.
(e) Deviation or Bilateral bar diagram.

(a) SIMPLE BAR DIAGRAM
Simple bar diagram is the simplest of the bar diagrams and is used frequently in practice for the comparative study of two or more items or values of a single variable or a single classification or category of data. For example, the data relating to sales, profits, production, population, etc., for different periods may be presented by bar diagrams. As already pointed out, the magnitudes of the observations are represented by the heights of the rectangles.
Remark. If there are a large number of items or values of the variable under study, then instead of a bar diagram, a line diagram may be drawn.
Example 4·2. The following data relating to the strength of the Indian Merchant Shipping Fleet give the Gross Registered Tonnage (GRT) as on 31st December, for different years.
Year        :  1961   1966   1971   1975   1976
GRT in ’000 :   901  1,792  2,500  4,464  5,115
Source : Ministry of Shipping and Transport.
Represent the data by a suitable bar diagram.
Solution.
[Fig. 4·3. STRENGTH OF INDIAN MERCHANT SHIPPING FLEET: simple bar diagram of gross registered tonnage in ’000 for 1961, 1966, 1971, 1975 and 1976.]

(b) SUB-DIVIDED OR COMPONENT BAR DIAGRAM
A very serious limitation of the simple bar diagram is that it studies only one characteristic or classification at a time. For example, the total number of students in a college for the last 5 years can be conveniently expressed by simple bar diagrams but it cannot be used if we have also to depict the faculty-wise or sex-wise distribution of students. In such a situation, a sub-divided or component bar diagram is used. Sub-divided bar diagrams are useful not only for presenting several items of a variable or a category graphically but also enable us to make a comparative study of different parts or components among themselves and also to study the relationship between each component and the whole. In general, sub-divided or component bar diagrams are to be used if the total magnitude of the given variable is to be divided into various parts or sub-classes or components. First of all a bar representing the total is drawn. Then it is divided into various segments, each segment representing a given component of the total. Different shades or colours, crossing or dotting, or designs are used to distinguish the various components and a key or index is given along with the diagram to explain these differences.
In addition to the general rules for constructing bar diagrams, the following points may be kept in mind while constructing sub-divided or component bar diagrams :
(i) To facilitate comparisons the order of the various components in different bars should be the same. It is customary to show the largest component at the base of the bar and the smallest component at the top so that the various components appear in the order of their magnitude.
(ii) As already pointed out, an index or key showing the various components represented by different shades, dottings, colours, etc., should be given.
(iii) The use of a sub-divided bar diagram is not suggested if the number of components exceeds 10, because in that case the diagram is loaded with too much information and is not easy to understand and interpret. Pie or circle diagram (discussed later) is appropriate in such a situation. The comparison of the various components in different bars is quite tedious, as they do not have a common base, and requires great skill and expertise.
Example 4·3. Represent the following data by a suitable diagram :
Items of Expenditure     Family A (Income Rs. 500)     Family B (Income Rs. 300)
Food                              150                           150
Clothing                          125                            60
Education                          25                            50
Miscellaneous                     190                            70
Saving or Deficit                 +10                           –30
Solution. The data can be represented by a sub-divided bar diagram as shown below :
[Fig. 4·4. SUB-DIVIDED BAR DIAGRAM SHOWING EXPENDITURE OF TWO FAMILIES: one bar each for Family A and Family B, divided into Food, Clothing, Education and Misc., with saving or deficit shown separately.]

(c) PERCENTAGE BAR DIAGRAM
Sub-divided or component bar diagrams presented graphically on a percentage basis give percentage bar diagrams. They are specially useful for the diagrammatic portrayal of the relative changes in the data. A percentage bar diagram is used to highlight the relative importance of the various component parts to the whole. The total for each bar is taken as 100 and the value of each component or part is expressed as a percentage of the respective total. Thus, in a percentage bar diagram, all the bars will be of the same height, viz., 100, while the various segments of the bar representing the different components will vary in height depending on their percentage values to the total. Percentage bars are quite convenient and useful for comparing two or more sets of data.
Example 4·4.
The following table gives the break-up of the expenditure of a family on different items of consumption. Draw a percentage bar diagram to represent the data.
Item                   Expenditure (Rs.)
Food                        240
Clothing                     66
Rent                        125
Fuel and Lighting            57
Education                    42
Miscellaneous               190
Solution. First of all we convert the given figures into percentages of the total expenditure as detailed below.
Item                  Expenditure (Rs.)   %                              Cumulative %
Food                       240            (240/720) × 100 = 33·33           33·33
Clothing                    66            (66/720) × 100 = 9·17             42·50
Rent                       125            (125/720) × 100 = 17·36           59·86
Fuel and lighting           57            (57/720) × 100 = 7·92             67·78
Education                   42            (42/720) × 100 = 5·83             73·61
Miscellaneous              190            (190/720) × 100 = 26·39          100·00
Total                      720            100
The percentage bar diagram is given in Fig. 4·5.
[Fig. 4·5. DIAGRAM SHOWING EXPENDITURE OF FAMILY ON DIFFERENT ITEMS OF CONSUMPTION: a single percentage bar with segments Food 33·33%, Clothing 9·17%, Rent 17·36%, Fuel and Lighting 7·92%, Education 5·83%, Miscellaneous 26·39%.]
Example 4·5. Draw a bar chart for the following data showing the percentage of total population in villages and towns :
                                 Percentage of total population in
Category                           Villages      Towns
Infants and young children           13·7         12·9
Boys and girls                       25·1         23·2
Young men and women                  32·3         36·5
Middle-aged men and women            20·4         20·1
Elderly persons                       8·5          7·3
Solution.
CALCULATIONS FOR PERCENTAGE BAR DIAGRAMS
                                      Villages                 Towns
Category                          %     Cumulative %       %     Cumulative %
Infants and young children      13·7       13·7           12·9       12·9
Boys and girls                  25·1       38·8           23·2       36·1
Young men and women             32·3       71·1           36·5       72·6
Middle-aged men and women       20·4       91·5           20·1       92·7
Elderly persons                  8·5      100·0            7·3      100·0
Total                          100                        100
[Fig. 4·6. PERCENTAGE BAR DIAGRAM SHOWING TOTAL POPULATION IN VILLAGES AND TOWNS BY DIFFERENT CATEGORIES: one percentage bar each for Villages and Towns.]
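The percentage and cumulative percentage columns used in Examples 4·4 and 4·5 can be checked mechanically. What follows is only a minimal Python sketch and is not part of the original text; the function name percentage_breakdown is an illustrative assumption, and the figures are those of Example 4·4.

```python
def percentage_breakdown(items):
    """Return (name, percent, cumulative percent) rows for component values.

    `items` maps each component name to its value; percentages are taken
    of the grand total and rounded to two decimals, as in the worked table.
    """
    total = sum(items.values())
    rows = []
    running = 0
    for name, value in items.items():
        running += value
        rows.append((name,
                     round(100 * value / total, 2),
                     round(100 * running / total, 2)))
    return rows


# Data of Example 4·4 (expenditure in Rs.)
expenditure = {
    "Food": 240,
    "Clothing": 66,
    "Rent": 125,
    "Fuel and lighting": 57,
    "Education": 42,
    "Miscellaneous": 190,
}

rows = percentage_breakdown(expenditure)
for name, pct, cum in rows:
    print(f"{name:<18}{pct:>8.2f}{cum:>9.2f}")
```

The cumulative column is the one actually needed for drawing, since it gives the height at which each segment of the percentage bar ends.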
(d) MULTIPLE BAR DIAGRAM
A limitation of the simple bar diagram is that it can be used to portray only a single characteristic or category of the data. If two or more sets of inter-related phenomena or variables are to be presented graphically, multiple bar diagrams are used. The technique of drawing a multiple bar diagram is basically the same as that of drawing a simple bar diagram. In this case, a set of adjacent bars (one for each variable) is drawn. Proper and equal spacing is given between different sets of the bars. To distinguish between the different bars in a set, different colours, shades, dottings or crossings may be used and a key or index to this effect may be given.
Example 4·6. The data below give the yearly profits (in thousands of rupees) of two companies A and B.
                     Profits (in ’000 rupees)
Year             Company A      Company B
1994-95             120             90
1995-96             135             95
1996-97             140            108
1997-98             160            120
1998-99             175            130
Represent the data by means of a suitable diagram.
Solution. The data can be suitably represented by a multiple bar diagram as shown below.
[Fig. 4·7. YEARLY PROFITS IN ’000 RUPEES: multiple bar diagram with adjacent bars for Company A and Company B, 1994-95 to 1998-99.]
Remark. A careful examination of the above figures of profits for the two companies A and B reveals that in all the years from 1994-95 to 1998-99, company A shows higher profits than company B. In such a situation, when the values of one concern or unit show an increase over the values of the other concern or unit for all the periods under consideration, the data can be elegantly represented by a special type of sub-divided bar diagram, in which the total refers to the values of the concern or unit with higher values and the lower portion (shaded) of the bar shows the values of the other concern.
The remaining portion (blank) shows the balance (excess) of the two concerns or units. We represent in Fig. 4·8 the above data in this manner.
[Fig. 4·8. YEARLY PROFITS IN ’000 RUPEES: sub-divided bars for 1994-95 to 1998-99, each showing Company B (shaded) and the Excess of Company A over Company B (30, 40, 32, 40 and 45 respectively).]
Example 4·7. The following data show the students in millions on rolls at school/university stage in India according to different class groups and sex for the year 1970-71 as on 31st March.
Stage                   Boys     Girls     Total
Class I to V           35·74    21·31     57·05
Class VI to VIII        9·43     3·89     13·32
Class IX to XI          4·87     1·71      6·58
University/College      2·17     0·64      2·81
Represent the data by (i) Component bar diagram and (ii) Multiple bar diagram.
Solution.
[Fig. 4·9. (i) COMPONENT BAR DIAGRAM SHOWING STUDENTS ON ROLL AT SCHOOL/UNIVERSITY STAGE ACCORDING TO SEX IN 1970-71: number of students in millions, one bar per stage divided into Boys and Girls.]
[Fig. 4·10. (ii) MULTIPLE BAR DIAGRAM SHOWING STUDENTS ON ROLL AT SCHOOL/UNIVERSITY STAGE ACCORDING TO SEX IN 1970-71: number of students in millions, adjacent bars for Boys and Girls at each stage.]

(e) DEVIATION BARS
Deviation bars are specially useful for graphic presentation of net quantities viz., surplus or deficit, e.g., net profit or loss, net of imports and exports, which have both positive and negative values. The positive deviations (e.g., profits, surplus) are represented by bars above the base line while negative deviations (loss, deficit) are represented by bars below the base line. The following example will illustrate the point.
Remark. Deviation bars are also sometimes known as Bilateral Bar Diagrams and are used to depict plus (surplus) and minus (deficit) directions from the point of reference.
Example 4·8. For the following data prepare a suitable diagram showing Balance of Trade :
Years                       :  1994   1995   1996   1997   1998   1999
Exports (In Rs. Million)    :    24    115     84    110    130    162
Imports (In Rs. Million)    :     9     92     92    120    183    187
[Delhi Univ. B.Com. (Pass), 2001]
Solution.
Year     Exports (In Rs. Million)     Imports (In Rs. Million)     Balance of Trade (In Million Rs.)
1994              24                            9                      24 – 9 = 15
1995             115                           92                      115 – 92 = 23
1996              84                           92                      84 – 92 = –8
1997             110                          120                      110 – 120 = –10
1998             130                          183                      130 – 183 = –53
1999             162                          187                      162 – 187 = –25
The deviation bar diagram showing balance of trade is given in Fig. 4·11 :
[Fig. 4·11. DEVIATION BAR DIAGRAM, BALANCE OF TRADE (In Million Rs.): bars above the base line for 1994 (15) and 1995 (23), and below it for 1996 (–8), 1997 (–10), 1998 (–53) and 1999 (–25).]
For more illustrations, see Examples 4·13 and 4·14.

(f) BROKEN BARS
Broken bars are used for graphic presentation of data which contain very wide variations in the values, i.e., data which contain very large observations along with small observations. In this case the squeezing of the vertical scale will not be of much help because it will make the small bars look too small and clumsy and thus will not reveal the true characteristics of the data. In order to provide adequate and reasonable shape for smaller bars, the larger (or largest) bar(s) may be broken at the top, as illustrated in Examples 4·9 and 4·10.
Remark. However, if all the observations are fairly large so that all the bars have a broken vertical axis, then instead they can be drawn with a false base line for the vertical axis. [For false base line, see § 4·4·2, Graphic Presentation.]
Example 4·9. The following data relate to the imports of foreign merchandise and exports (including re-exports) of Indian merchandise (in million rupees) for some countries for the year 1975-76.
Country             Imports    Exports        Country             Imports    Exports
Burma                   53         89         Germany (F.R.)       3,566      1,173
Czechoslovakia         522        343         Iran                 4,593      2,708
Canada               2,278        424         United Kingdom       2,683      4,020
Australia            1,015        477         USSR                 2,958      4,128
Italy                  799        785         Japan                1,852        835
France               3,548      4,263         USA                 12,699      5,054
Represent the data by a suitable diagram.
Solution. The above data are represented by bar diagrams as shown below.
[Fig. 4·12. FOREIGN TRADE BY COUNTRIES 1975-76: horizontal bars for IMPORTS and EXPORTS (Including Re-Exports) by country, in rupees ten million; the U.S.A. imports bar is labelled 1,270.]
Example 4·10. Represent the following data relating to the military statistics at the border during the war between the two countries A and B in 1999 by a multiple bar diagram.
Category              Country A      Country B
Army Divisions             4              20
Semi-Army Units           50               -
Fighter Planes            75             700
Tanks                     50             300
Total Troops         100,000         170,000
Solution.
[Fig. 4·13. MILITARY STATISTICS AT THE BORDER: multiple bar diagram for Country A and Country B by category, with the Total Troops bars (100,000 and 170,000) labelled.]

4·3·4. Two-dimensional Diagrams. Line or bar diagrams discussed so far are one-dimensional diagrams since the magnitudes of the observations are represented by only one of the dimensions viz., height (length) of the bars while the width of the bars is arbitrary and uniform. However, in two-dimensional diagrams, the magnitudes of the given observations are represented by the area of the diagram. Thus, in the case of two-dimensional diagrams, the length as well as the width of the bars will have to be considered. Two-dimensional diagrams are also known as area diagrams or surface diagrams. Some of the commonly used two-dimensional diagrams are :
(A) Rectangles.
(B) Squares.
(C) Circles.
(D) Angular or pie diagrams.
(A) Rectangles. A “rectangle” is a two-dimensional diagram because it is based on the area principle.
Since the area of a rectangle is given by the product of its length and breadth, in a rectangle diagram both the dimensions viz., length (height) and width of the bars are taken into consideration. Just like bars, the rectangles are placed side by side, proper and equal spacing being given between different rectangles. In fact, rectangle diagrams are a modified form of bar diagrams and give more detailed information than bar diagrams.
Like sub-divided bars, we also have sub-divided rectangles for depicting the total and its break-up into various components. Likewise, a percentage rectangle diagram may be used to portray the relative magnitudes of two or more sets of data and their components making up the total. We give below a few illustrations.
Example 4·11. Prepare a rectangular diagram from the following particulars relating to the production of a commodity in a factory.
Units produced             1,000
Cost of raw materials      Rs. 5,000
Direct expenses            Rs. 2,000
Indirect expenses          Rs. 1,000
Profit                     Rs. 1,000
Solution. First of all we will find the cost of material, expenses and profits per unit as given below :
Cost of raw material per unit = Rs. $\frac{5000}{1000}$ = Rs. 5
Direct expenses per unit = Rs. $\frac{2000}{1000}$ = Rs. 2
Indirect expenses per unit = Rs. $\frac{1000}{1000}$ = Re. 1
Profit per unit = Rs. $\frac{1000}{1000}$ = Re. 1
[Fig. 4·14. DIAGRAM SHOWING COST AND PROFIT FOR A COMMODITY IN A FACTORY: a rectangle of width 1000 units produced, with cost and profit per unit (in Rs.) on the vertical axis, divided into Raw Material, Direct Expenses, Indirect Expenses and Profit.]
Example 4·12. The following data relate to the monthly expenditure (in Rs.) of two families A and B.
                              Expenditure (in Rs.)
Item of Expenditure        Family A      Family B
Food                          160           120
Clothing                       80            32
Rent                           60            48
Light and fuel                 20            16
Miscellaneous                  80            24
Total                         400           240
Represent it by a suitable percentage diagram.
Solution.
Since the total expenses of the two families are different, an appropriate percentage diagram for the above data will be a rectangular diagram on percentage basis. The percentage bar diagram will not be able to reflect the inherent differences in the total expenditures of the two families. The widths of the rectangles will be taken in the ratio of the total expenses of the two families viz., 400 : 240 i.e., 5 : 3.
CALCULATIONS FOR PERCENTAGE RECTANGULAR DIAGRAM
                                   Family A                          Family B
Item of Expenditure       Rs.      %     Cumulative %       Rs.      %      Cumulative %
Food                      160     40         40             120     50          50
Clothing                   80     20         60              32     13·33       63·33
Rent                       60     15         75              48     20          83·33
Light and fuel             20      5         80              16      6·67       90
Miscellaneous              80     20        100              24     10         100
Total                     400    100                        240    100
[Fig. 4·15. PERCENTAGE RECTANGLE DIAGRAM SHOWING MONTHLY EXPENDITURE OF TWO FAMILIES A AND B: rectangles of height 100 for Family A and Family B, divided into Food, Clothing, Rent, Light and Fuel, and Misc.]
Example 4·13. Represent the following data by a percentage sub-divided bar diagram.
Item of Expenditure       Family A (Income Rs. 500)     Family B (Income Rs. 300)
Food                               150                           150
Clothes                            125                            60
Education                           25                            50
Miscellaneous                      190                            70
Savings or Deficit                 +10                           –30
Solution. Since the total incomes of the two families are different, an appropriate percentage diagram for the above data will be a rectangular diagram on percentage basis. The percentage bar diagram will not be able to reflect the inherent differences in the total incomes of the two families. The widths of the rectangles will be taken in the ratio of the total incomes of the families viz., 500 : 300 i.e., 5 : 3.
CALCULATIONS FOR PERCENTAGE RECTANGULAR DIAGRAM
                                       Family A                                  Family B
Item of Expenditure       Expenditure (Rs.)    %    Cumulative %     Expenditure (Rs.)     %     Cumulative %
Food                            150           30        30                 150            50          50
Clothes                         125           25        55                  60            20          70
Education                        25            5        60                  50            16·7        86·7
Miscellaneous                   190           38        98                  70            23·3       110·0
Savings or Deficit              +10            2       100                 –30           –10         100
Total                           500                                        300
[Fig. 4·16. PERCENTAGE DIAGRAM SHOWING MONTHLY INCOME AND EXPENDITURE OF TWO FAMILIES A AND B: percentage expenditure on the vertical axis, rectangles for Family A (income Rs. 500) and Family B (income Rs. 300) divided into Food, Clothes, Education and Misc., with Saving (2) for Family A and Deficit (10) for Family B.]
Example 4·14. Draw a suitable diagram to represent the following information.
                                      Factory X      Factory Y
Selling price per unit (in Rs.)          400            600
Quantity sold                             20             30
Total Cost (in Rs.) :
   Wages                               3,200          6,000
   Materials                           2,400          6,000
   Misc.                               1,600          9,000
   Total                               7,200         21,000
Show also the profit or loss as the case may be.
Solution. First of all we shall calculate the cost (wages, materials, misc.) and profit per unit as given in the following table.
                                      Factory X              Factory Y
Selling price per unit (in Rs.)          400                    600
Quantity sold                             20                     30
Cost per unit (in Rs.) :
   Wages                                 160                    200
   Materials                             120                    200
   Misc.                                  80                    300
   Total                                 360                    700
Profit per unit (in Rs.)           400 – 360 = 40         600 – 700 = –100
Note. Negative profit is regarded as loss.
An appropriate diagram for representing this data would be the ‘Rectangles’ whose widths are in the ratio of the quantities sold i.e., 20 : 30 i.e., 2 : 3. Selling prices would be represented by the corresponding heights of the rectangles with various factors of cost (wages, materials, misc.) and profit or loss represented by the various divisions of the rectangles as shown in the following diagram (Fig. 4·17).
Remark. In the case of profit i.e., when selling price (S.P.) is greater than cost price (C.P.), the entire rectangle will lie above the X-axis, the segment just above the X-axis showing profit. But in case of loss i.e., when S.P. is less than C.P., we will have the rectangle with a portion lying below the X-axis which will reflect the loss incurred i.e., the cost not recovered through sales.
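The per-unit figures of Example 4·14 can be verified with a few lines of code. The following is a minimal Python sketch under stated assumptions: the function name per_unit_breakdown and the dictionary layout are illustrative choices, not something given in the text. A negative result for profit per unit indicates a loss.

```python
def per_unit_breakdown(selling_price, quantity, total_costs):
    """Return (cost per unit by head, total cost per unit, profit per unit).

    `total_costs` maps each cost head to its total cost in rupees;
    a negative profit per unit is a loss per unit.
    """
    per_unit = {head: cost / quantity for head, cost in total_costs.items()}
    cost_per_unit = sum(per_unit.values())
    return per_unit, cost_per_unit, selling_price - cost_per_unit


# Data of Example 4·14
factory_x = per_unit_breakdown(400, 20, {"Wages": 3200, "Materials": 2400, "Misc.": 1600})
factory_y = per_unit_breakdown(600, 30, {"Wages": 6000, "Materials": 6000, "Misc.": 9000})

print("Factory X:", factory_x)
print("Factory Y:", factory_y)
```

For Factory Y the total cost per unit (Rs. 700) exceeds the selling price (Rs. 600), so the last value returned is negative, which is why part of that rectangle lies below the X-axis.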
The values of each component are given by the product of the base with the corresponding height of the component (rectangle). For example, for the factory X, the area of the component for wages is 20 × 160 = 3200, which is the given cost.
[Fig. 4·17. SUB-DIVIDED RECTANGLE SHOWING COST, SALES AND PROFIT OR LOSS PER UNIT: sales (in rupees) on the vertical axis, rectangles for Factory X (20 units) and Factory Y (30 units) divided into Wages, Materials, Misc. and Profits, with the Loss of Factory Y shown below the X-axis.]

(B) Square Diagrams. Among the two-dimensional diagrams, squares are specially useful if it is desired to compare graphically values or quantities which differ widely from one another, e.g., the population of different countries at a given time or of the same country at different times, or the imports or exports of different countries. In such a situation, bar diagrams are not suitable, since they will give very disproportionate bars i.e., the bars corresponding to smaller quantities would be comparatively too small and those corresponding to bigger values would be too big. In particular, if two values are in the proportion of 1 : 25 and we draw a bar diagram, the bigger bar will be 25 times as high as the smaller one. In such a situation, square diagrams give a better presentation. Like the rectangle diagram, the square diagram is a two-dimensional diagram in which the given values are represented by the area of the square. Since the area of a square is given by the square of its side, the side of the square will be in proportion to the square root of the given observation. Thus if two observations are in the ratio of 1 : 25, the sides of the squares will be in the ratio of their square roots viz., 1 : 5.
Construction of square diagrams is quite simple. First of all we obtain the square roots of the given observations and then squares are drawn with sides proportional to these square roots, on an appropriate scale which must be specified.
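The square-root rule just described can be sketched in a few lines. This is only an illustrative Python sketch: the function name square_sides and the convention of giving the smallest observation a side of 1 unit are assumptions made here, and the yields used are those that appear in Example 4·15.

```python
import math


def square_sides(values, base_side=1.0):
    """Return the side of each square so that areas are proportional to values.

    The smallest observation gets a side of `base_side`; every other side
    is scaled by the square root of the ratio of the two observations.
    """
    smallest = min(values.values())
    return {name: base_side * math.sqrt(value / smallest)
            for name, value in values.items()}


# Two observations in the ratio 1 : 25 give sides in the ratio 1 : 5
print(square_sides({"small": 1, "large": 25}))

# Yields (kg per hectare) of the three countries in Example 4·15
sides = square_sides({"A": 350, "B": 647, "C": 1120})
for country, side in sides.items():
    print(f"{country}: side = {side:.2f} cm")
```

With the smallest square drawn on a side of 1 cm, one square centimetre stands for the smallest value itself, which is the scale that has to be stated on the diagram.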
Remarks 1. The squares may be drawn horizontally (on the same base line) or vertically one below the other to facilitate comparisons. However, in practice, the first method viz., horizontal presentation is generally used since it economises space.
2. Although the square diagram is a two-dimensional diagram, it is used to depict only a single magnitude or value.
Example 4·15. Draw a square diagram to represent the following data.
Country                          :    A      B       C
Yield in (kg.) per hectare       :   350    647    1,120
Solution. The square roots of the given yields in (kg.) per hectare give the proportion of the sides of the corresponding squares. The calculations are shown in the following table :
Country     Yield in (kg.) per hectare     Square root     Ratio of the sides of the squares
A                     350                    18·7083                    1
B                     647                    25·4362                    1·36
C                   1,120                    33·4664                    1·79
The square diagram is given in Fig. 4·18.
[Fig. 4·18. YIELD in kg. (Per hectare): squares for A, B and C with sides 1 cm, 1·36 cm and 1·79 cm. SCALE : 1 Square cm = 350 kg.]
Remark. In the above table, the ratio of the sides of the squares has been obtained on dividing the square roots of the values for B and C by the square root of the value for A. This is easy to do on a calculator but without a calculator it is quite time-consuming. However, things can be simplified to a great extent by dividing the square roots of the values of A, B and C by a whole number, say, 15 or 18. Division by, say, 15, gives the sides of the squares for A, B and C as 1·25, 1·70 and 2·23 respectively. Squares can now be constructed by taking an appropriate scale which must be specified in the diagram. If we construct squares with sides 1·25 cms., 1·70 cms., and 2·23 cms., the scale will be obtained as follows :
Area of square for A is $(1·25)^2 = 1·5625$ square cms. This area represents the value 350 kg. Thus,
1·5625 sq. cms. = 350 kg. ⇒ 1 sq. cm. = $\frac{350}{1·5625}$ = 224 kg.
(C) Circle Diagrams.
Circle diagrams are an alternative to square diagrams and are used for the same purpose, viz., for diagrammatic presentation of values differing widely in magnitude. The area of the circle which represents a given value is $\pi r^2$, where r is the radius of the circle and π is taken as 22/7 (approximately). In other words, the area of the circle is proportional to the square of its radius and consequently, in the construction of a circle diagram, the radius of each circle is taken proportional to the square root of the given magnitude. Accordingly, the lengths which were taken as the sides of the squares may also be taken as the radii of the circles representing the given magnitudes.

Remarks 1. Circle diagrams are more attractive and appealing than square diagrams. Since both require more or less the same amount of work, viz., computing the square roots of the given magnitudes (if anything, circles are easier to draw), circle diagrams are generally preferred to square diagrams.
2. Since square and circle diagrams are to be compared on an area basis, it is difficult to judge the relative magnitudes with precision, particularly for a layman without any mathematical or statistical background. Accordingly, proper care should be taken in interpreting them. They are also more difficult to construct than rectangle diagrams.
3. Scale. The scale to be used for constructing circle diagrams can be calculated as follows. For a given magnitude ‘a’ represented by a circle of radius r, we have
Area = $\pi r^2$ square units = a ⇒ 1 square unit = $\dfrac{a}{\pi r^2}$.

Example 4·16. Represent the data of Example 4·15 by a circular diagram.
Solution. The data of Example 4·15 can be represented by a circular diagram by taking the lengths of the sides of the squares used in Example 4·15 as the radii of the corresponding circles. In this case, however, the scale has to be modified accordingly. Since the circle for A has radius 1 cm and represents 350 kg., we get
Scale : 1 sq. cm. = 350/π = 2450/22 = 111·36 kg.

[Fig. 4·19. Yield in kg. (per hectare) : circles A, B and C with radii 1 cm, 1·36 cm and 1·79 cm. Scale : 1 sq. cm. = 111·36 kg.]
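The scale calculation of Remark 3 and Example 4·16 can be checked with a short Python sketch. The function name `circle_scale` and its `pi` parameter are hypothetical and introduced only for illustration ; the default uses the exact value of π, while the call below passes the text's approximation 22/7.

```python
import math

def circle_scale(value, radius, pi=math.pi):
    """Data units represented by one square unit of circle area: a / (pi * r**2)."""
    return value / (pi * radius ** 2)

# Example 4·16: country A (350 kg.) is drawn with radius 1 cm, taking pi = 22/7.
scale = circle_scale(350, 1.0, pi=22 / 7)
print(round(scale, 2))  # 111.36
```

With the exact value of π the scale comes out slightly different (about 111·41 kg. per sq. cm.), so the approximation used for π should be kept consistent between the scale and the drawing.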
(D) Angular or Pie Diagram. Just as sub-divided and percentage bars or rectangles are used to represent a total magnitude and its various components, a circle (representing the total) may be divided into sectors, each representing the proportion or percentage of one component part to the total. Such a sub-divided circle diagram is known as an angular or pie diagram, so named because the sectors resemble slices cut from a pie.

Steps for Construction of Pie Diagram
1. Express each of the component values as a percentage of the respective total.
2. Since the angle at the centre of the circle is 360°, the total magnitude of the various components is taken to be equal to 360° and each component part is expressed proportionately in degrees. Since 1 per cent of the total value is equal to 360/100 = 3·6°, the percentages of the component parts obtained in step 1 can be converted to degrees by multiplying each of them by 3·6.
3. Draw a circle of appropriate radius using an appropriate scale depending on the space available. If only one set of data is to be presented, the circle may be drawn with any radius. However, if two or more sets of data are to be presented simultaneously for comparative studies, then the radii of the corresponding circles are to be proportional to the square roots of their total magnitudes.
4. Having drawn the circle, draw any radius (preferably horizontal). With this radius as the base line, draw an angle at the centre (with the help of a protractor) equal to the degrees represented by the first component ; the second arm of this angle is a radius meeting the circumference. The sector so obtained represents the proportion of the first component. From this second line as base, now draw another angle at the centre equal to the degrees represented by the second component, to give the sector representing the proportion of the second component.
Proceeding similarly, all the sectors representing the different component parts can be constructed.
5. The sectors representing the various component parts should be distinguished from one another by using different shades, dottings, colours, etc., or by giving them explanatory or descriptive labels, either inside the sector (if possible) or just outside the circle with proper identification.

Remarks 1. The degrees represented by the various component parts of a given magnitude can be obtained directly, without computing their percentages of the total value, as follows :
Degree of any component part = (Component value / Total value) × 360°.
2. Pie diagrams are also called circular diagrams.
3. Since pie diagrams are compared on the basis of the areas of the circles and of the various sectors, which are difficult to ascertain visually with precision, sub-divided or percentage bars or rectangles are generally preferred to pie diagrams for studying the changes in the total and the component parts. Moreover, pie diagrams are more difficult to construct than bar or rectangle diagrams. However, if the number of component parts is more than 10, a pie chart is preferred to a bar or rectangle diagram, which becomes rather confusing in such a situation.

We give below some illustrations of pie diagrams.

Example 4·17. Draw a pie diagram to represent the following data of proposed expenditure by a State Government for the year 1997-98.

Items                                  Proposed Expenditure (in million Rs.)
Agriculture & Rural Development        4,200
Industries & Urban Development         1,500
Health & Education                     1,000
Miscellaneous                          500
[Delhi Univ. B.Com. (Pass), 1997]

Solution.
CALCULATIONS FOR PIE CHART
Items                                  Proposed expenditure      Angle at the centre
                                       (in million Rs.)
(1)                                    (2)                       (3) = (2)/7200 × 360°
Agriculture and Rural Development      4,200                     42/72 × 360° = 210°
Industries and Urban Development       1,500                     15/72 × 360° = 75°
Health and Education                   1,000                     10/72 × 360° = 50°
Miscellaneous                          500                       5/72 × 360° = 25°
Total                                  7,200                     360°

[Fig. 4·20. Pie diagram representing proposed expenditure by State Government on different items for 1997-98 ; sectors : Agriculture and Rural Development, Industries and Urban Development, Health and Education, Miscellaneous.]

Example 4·18. The following data show the expenditure on various heads in the first three five-year plans (in crores of rupees).

                                       Expenditure (in crores Rs.)
Subject                                First Plan    Second Plan    Third Plan
Agriculture and C.D.                   361           529            1068
Irrigation and Power                   561           865            1662
Village and Small Industries           173           176            264
Industry and Minerals                  292           900            1520
Transport and Communications           497           1300           1486
Social Services and Miscellaneous      477           830            1500
Total                                  2361          4600           7500

Represent the data by angular (pie) diagrams.

Solution.
CALCULATIONS FOR PIE DIAGRAMS
For each plan, Degrees = (Expenditure / Total expenditure of the plan) × 360° ; e.g., for Irrigation and Power in the First Plan, 561/2361 × 360° = 85·5°.

                                       Expenditure (in crores Rupees)
                                       First Plan          Second Plan         Third Plan
Items of Expenditure                   Rs.      Degrees    Rs.      Degrees    Rs.      Degrees
Agriculture and C.D.                   361      55·1°      529      41·4°      1,068    51·2°
Irrigation and Power                   561      85·5°      865      67·7°      1,662    79·8°
Village and Small Industries           173      26·4°      176      13·8°      264      12·7°
Industry and Minerals                  292      44·5°      900      70·4°      1,520    73·0°
Transport and Communications           497      75·8°      1,300    101·7°     1,486    71·3°
Social Services and Miscellaneous      477      72·7°      830      65·0°      1,500    72·0°
Total                                  2,361    360°       4,600    360°       7,500    360°
Square root of total                   48·59               67·82               86·60
Radii of circles                       1·0                 1·4                 1·8

[Fig. 4·21. Expenditure on various heads in first three five-year plans : three pie diagrams (First Plan, Second Plan, Third Plan) with sectors for Agriculture and C.D., Irrigation and Power, Village and Small Industries, Industry and Minerals, Transport and Communications, Social Services and Miscellaneous.]

4·3·5. Three-Dimensional Diagrams. Three-dimensional diagrams, also termed volume diagrams, are those in which three dimensions, viz., length, breadth and height, are taken into account. They are constructed so that the given magnitudes are represented by the volumes of the corresponding diagrams. The common forms of such diagrams are cubes, spheres, cylinders, blocks, etc. These diagrams are especially useful if there are very wide variations between the smallest and the largest magnitudes to be represented. Of the various three-dimensional diagrams, ‘cubes’ are the simplest and most commonly used devices for diagrammatic presentation of data.

Cubes. For instance, if the smallest and the largest magnitudes to be presented are in the ratio 1 : 1000, bar diagrams cannot be used because the height of the biggest bar would be 1000 times the height of the smallest bar, and the bars would look very disproportionate and clumsy. On the other hand, if square or circle diagrams are used, then the sides (radii) of the squares (circles) will be in the ratio of the square roots, viz., $1 : \sqrt{1000}$, i.e., 1 : 31·62, i.e., 1 : 32 (approx.), which will again give quite disproportionate diagrams. However, if cubes are used to present these data then, since the volume of a cube of side x is $x^3$, the sides of the cubes will be in the ratio of the cube roots, viz., $1 : \sqrt[3]{1000}$, i.e., 1 : 10, which will give reasonably proportionate diagrams as compared with one-dimensional or two-dimensional diagrams.

Construction of a Cube of Side ‘x’. The various steps are outlined below (see Fig. 4·22) :
1. Construct a square ABCD of side x.
2. Draw EF as the right bisector, i.e., perpendicular bisector, of AB such that EF = AB and half of it is above AB and half of it is below AB. [This is done by finding the mid-point of AB and then drawing the perpendicular to the line AB at that point.]
3. Join AE, CF and EF.
4. Through B draw a line BG parallel to AE (i.e., BG || AE) such that BG = AE.
5. Join EG and through G draw a line GH || EF such that GH = EF.
6. Join D and H.
7. Rub off the lines CF, EF and FH.
Now CDHGEAC is the required cube.

[Fig. 4·22. Construction of a cube of side x, showing the points A, B, C, D, E, F, G and H.]

Remarks 1. As already discussed, three-dimensional diagrams are used with advantage over one- or two-dimensional diagrams if the range, i.e., the gap between the smallest and the largest observations to be presented, is very large. Moreover, they are more attractive and appealing to the eye than bars, rectangles, squares or circles. However, since three-dimensional diagrams are quite difficult to construct and comprehend as compared with one- or two-dimensional diagrams, they are not very popular. Further, as the magnitudes are represented by the volumes of the cubes (in general, by the volumes of the three-dimensional diagrams), it is very difficult to visualise and hence to interpret them with precision.
2. Cylinders, spheres and blocks are quite difficult to construct and are, therefore, not discussed here.
3. It is worthwhile pointing out here that nowadays projection techniques are used to present even one-dimensional diagrams as three-dimensional diagrams in order to give them an attractive get-up.

Example 4·19. The following table gives the population of India on the basis of religion.

Religion         Number (in lakhs)
Hinduism         2031·9
Islam            354·0
Christianity     81·6
Sikhism          62·2
Others           36·3

Represent the data by cubes.
Solution. The sides of the cubes will be proportional to the cube roots of the magnitudes they represent.
Note.
To compute the cube root of a, viz., $\sqrt[3]{a}$ or $a^{1/3}$, let $y = \sqrt[3]{a} = a^{1/3}$. Taking logarithms of both sides :
$\log_{10} y = \tfrac{1}{3}\log_{10} a$ ⇒ $y = \text{Antilog}\left[\tfrac{1}{3}\log_{10} a\right]$

COMPUTATION OF CUBE ROOTS
a          log10 a    (1/3) log10 a    Antilog [(1/3) log10 a]    Ratio of sides
2031·9     3·3079     1·1026           12·67                      3·83 ≈ 3·8
354·0      2·5490     0·8497           7·074                      2·14 ≈ 2·1
81·6       1·9117     0·6372           4·337                      1·31 ≈ 1·3
62·2       1·7938     0·5979           3·962                      1·19 ≈ 1·2
36·3       1·5599     0·5199           3·311                      1

Now we can express the data diagrammatically by drawing cubes with sides proportional to the values given in the last column.

[Fig. 4·23. Cubes with sides in the ratio 3·8 (Hinduism) : 2·1 (Islam) : 1·3 (Christianity) : 1·2 (Sikhism) : 1 (Others) ; the cube of side 1 (Others) represents 36·3 lakhs.]

4·3·6. Pictograms. The pictogram is a technique of presenting statistical data through appropriate pictures and is one of the most popular devices, particularly when the statistical facts are to be presented to a layman without any mathematical background. In this technique, the magnitudes of the particular phenomenon under study are presented through appropriate pictures, the number of pictures drawn or the size of the pictures being proportional to the values of the different magnitudes to be presented. Pictures are more attractive and appealing to the eye and leave a lasting impression on the mind. Accordingly, they are extensively used by government and private institutions for diagrammatic presentation of data relating to a variety of social, business or economic phenomena, primarily for display to the general public in fairs and exhibitions.

Remark. Pictograms have their limitations also. They are difficult and time-consuming to construct. In a pictogram, each pictorial symbol represents a fixed number of units such as thousands, millions or crores. For instance, in Example 4·20, which displays the number of vessels in the Indian Merchant Shipping fleet, one ship symbol represents 50 vessels and it is really a problem to represent and read fractions of 50.
For example, 174 vessels will be represented in the pictogram by 3 ships and about a half more ; 231 vessels will be represented by 4 ships and a proportionate fraction of the 5th. This proportionate representation introduces error and is quite difficult to visualise with precision.

We give below some illustrations of pictograms.

The following table gives the number of students studying in schools/colleges for different years in India.

STUDENTS ON ROLL AT THE SCHOOL / UNIVERSITY STAGE (As on 31st March)
(In Million)
Stage                  1960-61    1965-66    1970-71    1974-75    1975-76
Class I to XI          44·73      66·29      76·95      87·30      89·46
University/College     0·73       1·24       2·81       2·94       3·21
Source : Ministry of Education and Social Welfare.

The data can be represented in diagrammatic form by a pictogram (pictures) as given below :

[Fig. 4·24. Students on roll at school/university stage, 1960-61 to 1975-76. Class I-XI : one symbol = 10 million ; University/College : one symbol = 0·5 million.]

Data pertaining to the population of a country in different age groups are usually represented through pictures by means of so-called pyramids. The pyramid in Fig. 4·25 exhibits the population of India in various age-groups according to the 1971 census, as given in the following table.

DISTRIBUTION OF POPULATION BY AGE GROUPS (1971 Census)
Age-group       Population (in ’000)
Under 1         16,519
1-4             63,040
5-14            1,50,776
15-24           90,569
25-34           77,010
35-44           61,186
45-54           43,416
55-64           27,202
65-74           12,880
75 and over     5,446
Source : Registrar General of India.

[Fig. 4·25. Age pyramids, India, Census 1971 : horizontal bars for the age-groups 0-1 to 75 and over ; horizontal axis : population in millions.]

Example 4·20. The following table gives the number of vessels, as on 31st December, in the Indian Merchant Shipping fleet for different years.

Year     No. of vessels
1961     174
1966     231
1971     255
1975     330
1976     359

Represent the data by pictogram.
Solution.

[Fig. 4·26. Number of vessels in Indian Merchant Shipping fleet, 1961 to 1976 ; one ship symbol = 50 vessels.]

4·3·7. Cartograms. In cartograms, statistical facts are presented through maps accompanied by various types of diagrammatic representation. They are especially used to depict quantitative facts on a regional or geographical basis ; e.g., the population density of different states in a country or of different countries in the world, or the distribution of rainfall in different regions of a country, can be shown with the help of maps or cartograms. The different regions or geographical zones are depicted on a map and the quantities or magnitudes in the regions may be shown by dots, different shades or colours, etc., or by placing bars or pictograms in each region, or by writing the magnitudes to be represented in the respective regions. Cartograms are simple and elementary forms of visual presentation and are easy to understand. They are generally used when regional or geographic comparisons are to be highlighted.

4·3·8. Choice of a Diagram. In the previous sections we have described the types of diagrams which can be used to present a given set of numerical data and have briefly discussed their relative merits and demerits. No single diagram is suited to all practical situations. The choice of a particular diagram for visual presentation of a given set of data is not an easy one and requires great skill, intelligence and expertise. The choice will primarily depend upon the nature of the data and the object of the presentation, i.e., the type of audience to whom the diagrams are to be presented, and it should be made with utmost care and caution. A wrong or injudicious selection of the diagram will distort the true characteristics of the phenomenon to be presented and might lead to very wrong and misleading interpretations.
Some special types of data, viz., the data relating to frequency curves and time series, are best represented by means of graphs, which we will discuss in the following sections.

EXERCISE 4·1

1. (a) What are the merits and limitations of diagrammatic representation of statistical data ?
(b) Describe the advantages of diagrammatic representation of statistical data. Name the different types of diagrams commonly used and mention the situations where the use of each type of diagram would be appropriate.
2. (a) What are the different types of diagrams which are used in statistics to show the salient characteristics of group and series ? Illustrate your answer.
(b) Discuss the usefulness of diagrammatic representation of facts.
3. “Diagrams do not add anything to the meaning of Statistics but when drawn and studied intelligently, they bring to view the salient characteristics of the data”. Explain.
4. “Diagrams help us visualize the whole meaning of a numerical complex data at a single glance.” Explain the statement. [Delhi Univ. B.Com. (Pass), 1999]
5. The merits of diagrammatic presentation of data are classified under three heads : attraction, effective impression and comparison. Explain and illustrate these points.
6. (a) State the different methods used for diagrammatic representation of statistical data and indicate briefly the advantages and disadvantages of each one of them.
(b) Point out the usefulness of diagrammatic representation of facts and explain the construction of any of the different forms of diagrams you know.
(c) A Bar diagram and a Rectangular diagram have the same appearance. Do they belong to the same category of diagrams ? Explain. [Delhi Univ. B.Com. (Pass), 1998]
7. What types of mistakes are commonly committed in the construction of diagrams ? What precautions are necessary in this connection ?
8. (a) Draw a bar chart to represent the following information :

Year                    1952    1957    1962    1967    1972    1977
No. of women M.P.’s     22      27      34      31      22      19

(b) In a recent study on causes of strikes in mills, an experimenter collected the following data.

Causes :                          Economic    Personal    Political    Rivalry    Others
Occurrences (in percentage) :     58          16          10           6          10

Represent the data by bar chart.
9. Represent the following data by a percentage sub-divided bar-diagram :

Item of Expenditure    Family A (Income Rs. 500)    Family B (Income Rs. 300)
Food                   150                          150
Clothes                125                          60
Education              25                           50
Miscellaneous          190                          70
Saving or Deficit      +10                          –30

10. (a) Construct a multiple bar graph to represent Imports and Exports of a country for the following years :

Year       Imports (In billion rupees)    Exports (In billion rupees)
1992-93    19                             20
1993-94    30                             25
1994-95    45                             33
1995-96    53                             40
1996-97    51                             51

(b) Draw a suitable diagram to present the following data :

                            1998    1999
I Division                  16      12
II Division                 40      44
III Division                60      72
Failures                    44      34
Total No. of candidates     160     162

11. Represent the following data by a sub-divided bar diagram :

             No. of Students
College      Arts    Science    Commerce    Agriculture    Total
A            1200    800        600         400            3000
B            750     500        300         450            2000

12. Draw a rectangular diagram to represent the following information :

                        Factory A      Factory B
Price per unit          Rs. 15·00      Rs. 12·00
Units produced          1000 Nos.      1200 Nos.
Raw material/unit       Rs. 5·00       Rs. 5·00
Other expenses/unit     Rs. 4·00       Rs. 3·00
Profit/unit             Rs. 6·00       Rs. 4·00

13. Represent the following data by a deviation bar diagram :

Years                              1994    1995    1996    1997    1998    1999
Income (in crores of Rs.)          15      16      17      18      19      20
Expenditure (in crores of Rs.)     18      17      16      20      17      18
[Delhi Univ. B.Com. (Pass), 2002]

14. (a) What do you mean by two-dimensional diagrams ? Under what situations are they preferred to one-dimensional diagrams ?
(b) Describe the (i) square and (ii) circle diagrams. Discuss their merits and demerits.
(c) What do you understand by Pie diagrams ? Discuss the technique of constructing such diagrams.
15. The following table gives the average approximate yield of rice in kg. per acre in three different countries. Draw square diagrams to represent the data :

Country                  A      B      C
Yield in kg. per acre    350    647    1120

16. Represent the following data on production of Tea, Cocoa and Coffee by means of a pie diagram.

Tea       3,260 tons
Cocoa     1,850 tons
Coffee    900 tons
Total     6,010 tons

17. (a) Point out the usefulness of diagrammatic representation of facts and explain the construction of volume and pie diagrams.
(b) A Rupee spent on ‘Khadi’ is distributed as follows :

                                  Paise
Farmer                            19
Carder and Spinner                35
Weaver                            28
Washerman, Dyer and Printer       8
Administrative Agency             10
Total                             100

Present the data in the form of a pie diagram.
(c) Draw a pie diagram for the following data of Sixth Five-Year Plan Public Sector outlays :

Agriculture and Rural Development     12·9%
Irrigation, etc.                      12·5%
Energy                                27·2%
Industry and Minerals                 15·4%
Transport, Communication, etc.        15·9%
Social Services and Others            16·1%
(Bangalore Univ. B.Com., 1997)

18. Draw a Pie diagram to represent the distribution of a certain blood group ‘O’ among Gypsies, Indians and Hungarians.

                   Frequency
Blood group ‘O’    Gypsies    Indians    Hungarians    Total
                   343        313        344           1000

19. (a) Represent the following data by means of circular diagrams.

        No. of Employed
Year    Men         Women       Children    Total
1951    1,80,000    1,10,000    70,000      3,60,000
1961    3,50,000    2,10,000    1,60,000    7,20,000

(b) Represent the following data by Pie diagram.

                         Expenditure (in Rs.)
Items of Expenditure     Family A    Family B
Food                     150         120
Clothing                 100         80
Rent and Education       120         80
Fuel and Electricity     80          40
Others                   90          40

20. The areas of the various continents of the world in millions of square miles are presented below :

AREAS OF CONTINENTS OF THE WORLD
Continent          Area (Millions of square miles)
Africa             11·7
Asia               10·4
Europe             1·9
North America      9·4
Oceania            3·3
South America      6·9
U.S.S.R.           7·9
Total              51·5

Represent the data by a Pie diagram.

4·4. GRAPHIC REPRESENTATION OF DATA
The difference between diagrams and graphs has been discussed in § 4·2. To summarise, diagrams are useful for visual presentation of categorical and geographical data, while data relating to time series and frequency distributions are best represented through graphs. Diagrams are primarily used for comparative studies and cannot be used to study the relationship (not necessarily functional) between the variables under study. This is done through graphs. Diagrams furnish only approximate information and are not of much utility to a statistician from the point of view of analysis. On the other hand, graphs are more obvious, precise and accurate than diagrams and can be effectively used for further statistical analysis, viz., to study slopes and rates of change and for forecasting wherever possible. Graphs are drawn on a special type of paper, known as graph paper. The advantages of graphic representation of a set of numerical data have also been discussed in § 4·1.
Like diagrams, a large number of graphs are used in practice. But they can be broadly classified under the following two heads :
(i) Graphs of Frequency Distributions.
(ii) Graphs of Time Series.
Before discussing these graphs we shall briefly describe the technique of constructing graphs and the general rules for drawing graphs.

4·4·1. Technique of Construction of Graphs.
Graphs are drawn on a special type of paper known as graph paper, which has a fine network of horizontal and vertical lines : the thick lines for each division of a centimetre or an inch measure and thin lines for small parts of the same. In a graph of any size, two simple lines are drawn at right angles to each other, intersecting at the point ‘O’ which is known as the origin or zero of reference. The two lines are known as co-ordinate axes. The horizontal line is called the X-axis and is denoted by X′OX. The vertical line is called the Y-axis and is usually denoted by YOY′. Thus, the graph is divided into four sections, known as the four quadrants, but in practice only the first quadrant is generally used unless negative magnitudes are to be displayed. Along the X-axis, the distances measured towards the right of the origin, i.e., towards the right of the line YOY′, are positive and the distances measured towards the left of the origin, i.e., towards the left of the line YOY′, are negative, the origin showing the value zero. Along the Y-axis, the distances above the origin, i.e., above the line X′OX, are positive and the distances below the origin, i.e., below the line X′OX, are negative. Any pair of values of the variables is represented by a point (x, y) ; x usually represents the value of the independent variable and is shown along the X-axis, and y represents the value of the dependent variable and is shown along the Y-axis. The four quadrants, along with the signs of the x and y values in each, are shown in Fig. 4·27.

[Fig. 4·27. The four quadrants formed by the axes X′OX and YOY′. Quadrant I : X-positive, Y-positive (+x, +y). Quadrant II : X-negative, Y-positive (–x, +y). Quadrant III : X-negative, Y-negative (–x, –y). Quadrant IV : X-positive, Y-negative (+x, –y).]
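The sign conventions of the four quadrants can be summarised in a small Python sketch. The function name `quadrant` and the treatment of points lying on an axis are assumptions made here for illustration ; the sample points are arbitrary.

```python
def quadrant(x, y):
    """Return the quadrant of the point (x, y) using the sign conventions above.

    Points lying on either axis belong to no quadrant (an assumption made
    explicit here; the text only describes the four open quadrants).
    """
    if x == 0 or y == 0:
        return "on an axis"
    if x > 0 and y > 0:
        return "I"      # X-positive, Y-positive
    if x < 0 and y > 0:
        return "II"     # X-negative, Y-positive
    if x < 0 and y < 0:
        return "III"    # X-negative, Y-negative
    return "IV"         # X-positive, Y-negative

for point in [(4, 2), (-3, 4), (-2, -3), (3, -2)]:
    print(point, quadrant(*point))
# (4, 2) I
# (-3, 4) II
# (-2, -3) III
# (3, -2) IV
```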
In any pair (a, b), the first coordinate, viz., ‘a’, always refers to the X-coordinate, which is also known as the abscissa, and the second coordinate, viz., ‘b’, always refers to the Y-coordinate, which is also known as the ordinate. As an illustration, the points P(4, 2), Q(–3, 4), R(3, –2) and S(–2, –3) are displayed in Fig. 4·28.

[Fig. 4·28. The points P(4, 2), Q(–3, 4), R(3, –2) and S(–2, –3) plotted with reference to the axes X′OX and YOY′, each axis marked from –4 to +4.]

In a graph on a natural or arithmetic scale, equal magnitudes of the values of the variables are represented by equal distances along both the axes, though the scales along the X-axis and the Y-axis may be different depending on the nature of the phenomenon under consideration.

4·4·2. General Rules for Graphing. The following guidelines (some of which have already been discussed in § 4·3·1 for diagrammatic representation of data) may be kept in mind for drawing effective and accurate graphs :
1. Neatness. (For details see § 4·3·1.)
2. Title and Footnote. (For details see § 4·3·1.)
3. Structural Framework. The position of the axes should be so chosen that the graph has an attractive and proportionate get-up. It should be kept in mind that for each and every value of the independent variable, there is a corresponding value of the dependent variable. In drawing the graph it is customary to plot the independent variable along the X-axis and the dependent variable along the Y-axis. For instance, if the data pertaining to the prices of a commodity and the quantity demanded or supplied at different prices are to be plotted, then the dependent variable, viz., price (which depends on the independent forces of supply and demand), is taken along the Y-axis while the independent variable, viz., quantity demanded or supplied, is taken along the X-axis.
Similarly, in the case of time series data, the time factor is taken along the X-axis and the phenomenon which changes with time, e.g., the population of a country in different years, the production of a particular commodity for different periods, etc., is taken along the Y-axis.
4. Scale. This point has also been discussed in § 4·3·1. It may further be added that the scales along both the axes (X-axis and Y-axis) should be so chosen that the entire data can be accommodated in the available space without crowding. In this connection, it is worthwhile to quote the words of A. L. Bowley :
“It is difficult to lay down rules for the proper choice of scales by which the figures should be plotted out. It is only the ratio between the horizontal and vertical scales that need to be considered. The figure must be sufficiently small for the whole of it to be visible at once : if the figure is complicated, related to long series of years and varying numbers, minute accuracy must be sacrificed to this consideration. Supposing the horizontal scale is decided, the vertical scale must be chosen so that the part of the line which shows the greatest rate of increase is well inclined to the vertical which can be managed by making the scale sufficiently small; and on the other hand, all important fluctuations must be clearly visible for which the scale may need to be decreased. Any scale which satisfies both these conditions will fulfill its purpose.”
5. False Base Line. The fundamental principle of drawing a graph is that the vertical scale must start with zero. If the fluctuations in the values of the dependent variable (to be shown along the Y-axis) are very small relative to their magnitudes, and if the minimum of these values is very distant from (far greater than) zero, the point of origin, then for an effective portrayal of these fluctuations the vertical scale is stretched by using a false base line.
In such a situation the vertical scale is broken and the space between the origin ‘O’ and the minimum value (or some convenient value near it) of the dependent variable is omitted by drawing two zig-zag horizontal lines above the base line. The scale along the Y-axis is then framed accordingly. The false base line technique is quite extensively used for magnifying the minor fluctuations in time series data. It also economises space, because if such data are graphed without using a false base line, the plotted data will lie at the top of the graph. This will give a very clumsy look and also result in wastage of space. However, proper care should be taken in interpreting graphs in which a false base line is used. As illustrations, see Examples 4·31 and 4·32 in § 4·4·4.
6. Ratio or Logarithmic Scale. In order to display proportional or relative changes in the magnitudes, the ratio or logarithmic scale should be used instead of the natural or arithmetic scale, which is used to display absolute changes. The ratio scale is discussed in detail in § 4·4·5.
7. Line Designs. If more than one variable is to be depicted on the same graph, the different graphs so obtained should be distinguished from each other by the use of different lines, viz., dotted lines, broken lines, dash-dot lines, thin or thick lines, etc., and an index to identify them should be given. [See Examples 4·31 and 4·32.]
8. Source Note and Number. For details see § 4·3·1.
9. Index. For details see § 4·3·1.
10. Simplicity. For details see § 4·3·1.
Remark. For a detailed discussion of items 1, 2, 4, 8, 9 and 10, see § 4·3·1, replacing the word ‘diagram’ by ‘graph’.

4·4·3. Graphs of Frequency Distributions. The reasons and the guiding principles for the graphic representation of frequency distributions are precisely the same as for the diagrammatic and graphic representation of other types of data. The so-called frequency graphs are designed to reveal clearly the characteristic features of frequency data.
Such graphs are more appealing to the eye than the tabulated data and are readily perceptible to the mind. They facilitate comparative study of two or more frequency distributions regarding their shape and pattern. The most commonly used graphs for charting a frequency distribution for the general understanding of the details of the data are : (A) Histogram. (B) Frequency Polygon. (C) Frequency Curve. (D) “Ogive” or Cumulative Frequency Curve. The choice of a particular graph for a given frequency distribution largely depends on the nature of the frequency distribution, viz., discrete or continuous. In the following sections we shall discuss them in details, one by one. A. HISTOGRAM It is one of the most popular and commonly used devices for charting continuous frequency distribution. It consists in erecting a series of adjacent vertical rectangles on the sections of the horizontal axis (X-axis), with bases (sections) equal to the width of the corresponding class intervals and heights are so taken that the areas of the rectangles are equal to the frequencies of the corresponding classes. Construction of Histogram. The variate values are taken along the X-axis and the frequencies along the Y-axis. Case (i) Histogram with equal classes. In the case, if classes are of equal magnitude throughout, each class interval is drawn on the X-axis by a section or base (of the rectangle) which is equal (or proportional) to the magnitude of the class interval. On each class interval (as base) erect a rectangle with the height proportional to the corresponding frequency of the class. The series of adjacent rectangles (one for each class), so formed gives the histogram of the frequency distribution and its area represents the total frequency of the distribution as distributed throughout the different classes. The procedure is explained in Example 4.21. Case (ii) Histogram with unequal classes. 
If all the classes are not uniform throughout, then, as in case (i), the different classes are represented on the X-axis by sections or bases which are equal (or proportional) to the magnitudes of the corresponding classes, and the heights of the corresponding rectangles are adjusted so that the area of each rectangle is equal to the frequency of the corresponding class. This adjustment can be done by taking the height of each rectangle proportional (equal) to the corresponding frequency density of each class, which is obtained on dividing the frequency of the class by its magnitude, viz.,

$$\text{Frequency Density (of a class)} = \frac{\text{Frequency of the class}}{\text{Magnitude of the class}}.$$

Instead of finding the frequency density, a more convenient way (from the practical point of view) is to make all the class intervals equal and then adjust the corresponding frequency by using the basic assumption that all the frequencies are distributed uniformly throughout the class. This consists in taking the lowest class interval as standard one with unit length on the X-axis. The adjusted frequencies of the different classes are obtained on dividing the frequency of the given class by the corresponding Adjustment Factor (A.F.) which is given by :

$$\text{A.F. for any class} = \frac{\text{Magnitude of the class}}{\text{Lowest class interval}}.$$

Thus, if the magnitude of any class interval is twice (three times) the lowest class interval, the adjustment factor is 2 (3) and the height of the rectangle, which is represented by the adjusted frequency, will be $\frac{1}{2}$ $\left(\frac{1}{3}\text{rd}\right)$ of the corresponding class frequency, and so on. This is illustrated in Example 4·22. This adjustment gives the rectangles whose areas are equal to the frequencies of the corresponding classes.

Remarks 1. Grouped (Not Continuous) Frequency Distribution. It should be clearly understood that histogram can be drawn only if the frequency distribution is continuous.
In case of grouped frequency distribution, if classes are not continuous, they should be made continuous by changing the class limits into class boundaries and then rectangles should be erected on the continuous classes so obtained. As an illustration, see Example 4·24. 2. Mid-points given. Sometimes, only the mid-values of different classes are given. In such a case, the given distribution is converted into continuous frequency distribution with exclusive type classes by ascertaining the upper and lower limits of the various classes under the assumption that the class frequencies are uniformly distributed throughout each class. (See Example 4·25.) 3. Discrete Frequency Distribution. Histograms, may sometimes also be used to represent discrete frequency distribution by regarding the given values of the variable as the mid-points of continuous classes and then proceeding as explained in Remark 2 above. 4. Difference between Histogram and Bar Diagram. (i) A histogram is a two-dimensional (area) diagram where both the width (base) and the length (height of the rectangle) are important whereas bar diagram is one-dimensional diagram in which only length (height of the bar) matters while width is arbitrary. (ii) In a histogram, the bars (rectangles) are adjacent to each other whereas in bar diagram proper spacing is given between different bars. (iii) In a histogram, the class frequencies are represented by the area of the rectangles while in a bar diagram they are represented by the heights of the corresponding bars. 5. Open-end classes. Histograms can’t be constructed for frequency distributions with open end classes unless we assume that the magnitude of the first open class is same as that of the succeeding (second) class and the magnitude of the last open class is same as that of the preceding (i.e., last but one) class. 6. Histogram may be used for the graphic location of the value of Mode (See Chapter 5). Example 4·21. 
Draw histogram for the following frequency distribution.

Variable  :  10—20  20—30  30—40  40—50  50—60  60—70  70—80
Frequency :    12     30     35     65     45     25     18

Solution.

[Fig. 4·29. HISTOGRAM: VARIABLE along X-axis, FREQUENCY along Y-axis; heights of the rectangles 12, 30, 35, 65, 45, 25 and 18.]

Example 4·22. Represent the following data by means of a histogram.

Weekly Wages (’00 Rs.) :  10—15  15—20  20—25  25—30  30—40  40—60  60—80
No. of Workers         :     7     19     27     15     12     12      8

Solution. Since the class intervals are of unequal magnitude, the corresponding frequencies have to be adjusted to obtain the so-called ‘frequency density’ so that the area of the rectangle erected on the class interval is equal to the class frequency. We observe that first four classes are of magnitude 5, the class 30—40 is of magnitude 10 and the last two classes 40—60 and 60—80 are of magnitude 20. Since 5 is the minimum class interval, the frequency of the class 30—40 is divided by 2 and the frequencies of classes 40—60 and 60—80 are to be divided by 4 as shown in the following table.

Weekly Wages (’00 Rs.)   No. of Workers (f)   Magnitude of Class   Height of Rectangle
10—15                      7                     5                    7
15—20                     19                     5                   19
20—25                     27                     5                   27
25—30                     15                     5                   15
30—40                     12                    10                   (12/2) = 6
40—60                     12                    20                   (12/4) = 3
60—80                      8                    20                   (8/4) = 2

[Fig. 4·30. HISTOGRAM: WEEKLY WAGES (IN ’00 Rs.) along X-axis, NUMBER OF WORKERS along Y-axis; heights of the rectangles 7, 19, 27, 15, 6, 3 and 2.]

B. FREQUENCY POLYGON

Frequency polygon is another device of graphic presentation of a frequency distribution (continuous, grouped or discrete). In case of discrete frequency distribution, frequency polygon is obtained on plotting the frequencies on the vertical axis (Y-axis) against the corresponding values of the variable on the horizontal axis (X-axis) and joining the points so obtained by straight lines. [As an illustration, see Example 4·23.]
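The adjustment for unequal classes described under Case (ii) of the histogram and applied in Example 4·22 can also be checked numerically. The following Python sketch is only an illustration and is not part of the original text; the function name `adjusted_heights` is hypothetical, while the class limits and frequencies are those of Example 4·22.

```python
def adjusted_heights(classes, freqs):
    """Return rectangle heights for a histogram with unequal classes.

    classes: list of (lower, upper) class limits
    freqs:   class frequencies, in the same order
    Each frequency is divided by its adjustment factor
    A.F. = (magnitude of the class) / (lowest class interval).
    """
    widths = [upper - lower for lower, upper in classes]
    smallest = min(widths)
    return [f / (w / smallest) for f, w in zip(freqs, widths)]


# Data of Example 4.22: weekly wages ('00 Rs.) and number of workers.
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 40), (40, 60), (60, 80)]
freqs = [7, 19, 27, 15, 12, 12, 8]

heights = adjusted_heights(classes, freqs)
print(heights)

# Area check: height x (width measured in units of the smallest class)
# should give back the original class frequency.
widths = [upper - lower for lower, upper in classes]
areas = [h * (w / min(widths)) for h, w in zip(heights, widths)]
print(areas)
```

The heights agree with the last column of the table in Example 4·22, and the area check confirms that each rectangle again represents its class frequency.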
In case of grouped or continuous frequency distribution, frequency polygon may be drawn in two ways. Case (i) From Histogram. First draw the histogram of the given frequency distribution as explained in § 4·4·3 (A). Now join the mid-points of the tops (upper horizontal sides) of the adjacent rectangles of the histogram by straight line graph. The figures so obtained is called a frequency polygon. (Polygon is a figure with more than four sides). It may be noted that when the frequency polygon is constructed as explained above it cuts off a triangular strip (which lies outside the frequency polygon) from each rectangle of the histogram. But, at the same time, another triangular strip of the same area which is outside the histogram is included under the polygon, as shown by shaded area in the diagram of Example 4·28. [This, however, is not true in the case of unequal class intervals]. In order that the area of the frequency polygon is equal to the area of the corresponding histogram of the frequency distribution, it is necessary to close the polygon at both ends by extending them to the base line such that it meets the X-axis at the mid-points of two hypothetical classes, viz., the class before the first class and the class after the last class, at both the ends each with frequency zero [See Examples 4·24 and 4·25]. Case (ii) Without Constructing Histogram. Frequency polygon of a grouped or continuous frequency distribution is a straight line graph which can also be constructed directly without drawing the histogram. This consists in plotting the frequencies of different classes (along Y-axis) against the mid-values of the corresponding classes (along X-axis). The points so obtained are joined by straight lines to obtain the frequency polygon. 
As in Case (i), the frequency polygon so obtained should be extended to the base at both ends by joining the extreme points (first and last point) to the mid-points of the two hypothetical classes (before the first class and after the last class) assumed to have zero frequencies. The figure of the frequency polygon so obtained would be exactly same as in Case (i) except for the histogram. This point can be elaborated mathematically as follows. Let $x_1, x_2, \ldots, x_n$ be the mid-values of $n$ classes with frequencies $f_1, f_2, \ldots, f_n$ respectively. We plot the points $(x_1, f_1), (x_2, f_2), \ldots, (x_n, f_n)$ on the co-ordinate axes, taking mid-values along X-axis and frequencies along Y-axis, and join them by straight lines. The first point $(x_1, f_1)$ is joined to the point $(x_0, 0)$ and the last point $(x_n, f_n)$ to the point $(x_{n+1}, 0)$ by straight lines, where $x_0$ and $x_{n+1}$ are the mid-values of the two hypothetical classes, and the required frequency polygon is obtained.

Remarks 1. Frequency polygon can be drawn directly without the histogram (as explained above) if only the mid-points of the classes are given, without forming the continuous frequency distribution which is desirable in the case of histogram.

2. Frequency Polygon Vs. Histogram : (i) Histogram is a two-dimensional figure, viz., a collection of adjacent rectangles whereas frequency polygon is a line graph. (ii) Frequency polygon can be used more effectively for comparative study of two or more frequency distributions because frequency polygons of different distributions can be drawn on the same single graph. This is not possible in the case of histogram where we need separate histograms for each of the frequency distributions. However, for studying the relationship of the individual class frequencies to the total frequency, histogram gives a better picture and is accordingly preferred to the frequency polygon.
(iii) In the construction of frequency polygon we come across same difficulties as in the construction of histograms, viz., (a) It cannot be constructed for frequency distributions with open end classes; and (b) Suitable adjustments, as in the case of histogram, are required for frequency distributions with unequal classes. (iv) Unlike histogram, frequency polygon is a continuous curve and therefore possesses all the distinct advantages of graphic representation, viz., it may be used to determine the slope, rate of change, estimates (interpolation and extrapolation), etc., wherever admissible.

Example 4·23. The following data show the number of accidents sustained by 313 drivers of a public utility company over a period of 5 years.

Number of accidents :   0   1   2   3   4   5   6   7   8   9  10  11
Number of drivers   :  80  44  68  41  25  20  13   7   5   4   3   2

Draw the frequency polygon.

Solution. See Fig. 4·31.

[Fig. 4·31. FREQUENCY POLYGON OF NUMBER OF ACCIDENTS: NUMBER OF ACCIDENTS along X-axis, NUMBER OF DRIVERS along Y-axis.]

Example 4·24. The following table gives the frequency distribution of the weekly wages (in ’00 Rs.) of 100 workers in a factory.

Weekly Wages (’00 Rs.) :  20—24  25—29  30—34  35—39  40—44  45—49  50—54  55—59  60—64  Total
Number of Workers      :     4      5     12     23     31     10      8      5      2    100

Draw the histogram and frequency polygon of the distribution.

Solution. Since all the classes are of equal magnitude, i.e., 5, for the construction of the histogram, the heights of the rectangles to be erected on the classes will be proportional to their respective frequencies. However, since the classes are not continuous, the given distribution is to be converted into a continuous frequency distribution, with exclusive type classes, before erecting the rectangles, as given in the following table.

Weekly Wages (’00 Rs.)   Number of Workers (f)
19·5—24·5                  4
24·5—29·5                  5
29·5—34·5                 12
34·5—39·5                 23
39·5—44·5                 31
44·5—49·5                 10
49·5—54·5                  8
54·5—59·5                  5
59·5—64·5                  2

As usual, frequency polygon is obtained from histogram by joining the mid-points of the tops of the rectangles by straight lines, and extended both ways to the mid-points of the classes 14·5—19·5 and 64·5—69·5 on the X-axis, as shown in the Fig. 4·32.

[Fig. 4·32. HISTOGRAM AND FREQUENCY POLYGON: WEEKLY WAGES along X-axis (14·5 to 69·5), NUMBER OF WORKERS along Y-axis.]

Remark. It may be pointed out that frequency polygon can be drawn straight away by plotting the frequencies against the mid-points of the corresponding classes without converting the given distribution into a continuous one and joining these points by straight lines.

Example 4·25. Draw the histogram and frequency polygon for the following frequency distribution.

Mid-value of class interval :  2·5  7·5  12·5  17·5  22·5  27·5  32·5  37·5
Frequency                   :   7   10    20    13    17    10    14     9

Solution. Since we are given the mid-values of the class intervals, for the construction of histogram, the distribution is to be transformed into continuous class intervals each of magnitude 5 (under the assumption that the frequencies are uniformly distributed throughout the class intervals), as given in the following table.

Class     :  0—5  5—10  10—15  15—20  20—25  25—30  30—35  35—40
Frequency :   7    10     20     13     17     10     14      9

The histogram and frequency polygon are shown in the Fig. 4·33.

[Fig. 4·33. HISTOGRAM AND FREQUENCY POLYGON: CLASSES along X-axis, FREQUENCY along Y-axis.]

Remark. It may be noted that frequency polygon can be drawn even without converting the given distribution into classes. The frequencies are plotted against the corresponding mid-points (given) and joined by straight lines.

C. FREQUENCY CURVE

A frequency curve is a smooth free hand curve drawn through the vertices of a frequency polygon.
The object of smoothing of the frequency polygon is to eliminate, as far as possible, the random or erratic fluctuations that might be present in the data. The area enclosed by the frequency curve is same as that of the histogram or frequency polygon but its shape is smooth one and not with sharp edges. Frequency curve may be regarded as a limiting form of the frequency polygon as the number of observations (total frequency) becomes very large and the class intervals are made smaller and smaller.

Remarks 1. Smoothing should be done very carefully so that the curve looks as regular as possible and sudden and sharp turns should be avoided. In case of the data pertaining to natural phenomenon like tossing of a coin or throwing of a dice the smoothing can be conveniently done because such data generally give rise to symmetrical curves. However, for the data relating to social, economic or business phenomenon, smoothing cannot be done effectively as such data usually give rise to skewed (asymmetrical) curves. [For details see § 4·4·3 C (b) page 4·36]. In fact, it is desirable to attempt a frequency curve if we have sufficient reasons to believe that the frequency distribution under study is fairly regular. It is futile to attempt a frequency curve for an irregular distribution. In general, frequency curves should be attempted (i) for frequency distribution based on the samples, and (ii) when the distribution is continuous.

2. We have already seen that a frequency polygon can be drawn with or without a histogram. However, to obtain an ideal frequency curve for a given frequency distribution, it is desirable to proceed in a logical sequence, viz., first draw a histogram, then a frequency polygon and finally a frequency curve, because in the absence of a histogram the smoothing of the frequency polygon cannot be done properly.
As discussed in frequency polygon, the frequency curve should also be extended to the base on both sides of the histogram so that the area under the frequency curve represents the total frequency of the distribution.

3. A frequency curve can be used with advantage for interpolation [i.e., estimating the frequencies for given value of the variable or in a given interval (within the given range of the variable)], provided it rises gradually to the highest point and then falls more or less in the same manner. It can also be used to determine the rates of increase or decrease in the frequencies. It also enables us to have an idea about the Skewness and Kurtosis of the distribution [See Chapter 7].

Example 4·26. Draw a frequency curve for the following distribution :

Age (Yrs.)      :  17—19  19—21  21—23  23—25  25—27  27—29  29—31
No. of Students :     7     13     24     30     22     15      6

Solution. See Fig. 4·34.

[Fig. 4·34. FREQUENCY CURVE, drawn over the histogram: AGE (YEARS) along X-axis (15 to 33), NUMBER OF STUDENTS along Y-axis.]

Types of Frequency Curves. Though different types of data may give rise to a variety of frequency curves, we shall discuss below only some of the important curves which, in general, describe most of the data observed in practice, viz., the data relating to natural, social, economic and business phenomena.

(a) Curves of Symmetrical Distributions. In a symmetrical distribution, the class frequencies first rise steadily, reach a maximum and then diminish in the same identical manner. If a curve can be folded about a vertical line (corresponding to the maximum frequency) so that the two halves of the figure coincide, it is called a symmetrical curve. It has a single smooth hump in the middle and tapers off gradually at either end and is bell-shaped. The following hypothetical distribution of marks in a test will give a symmetrical frequency distribution.

Marks     :  0—10  10—20  20—30  30—40  40—50  50—60  60—70  70—80  80—90
Frequency :   40     70    120    160    180    160    120     70     40

If the data are presented graphically, we shall obtain a frequency curve which is symmetrical. The most commonly and widely used symmetrical curve in Statistics is the Normal frequency curve which is given in Fig. 4·35. (For details, see Chapter 14 on Theoretical Probability Distributions).

[Fig. 4·35. NORMAL PROBABILITY CURVE, symmetrical about X = Mean.]

Normal curve generally describes the data relating to natural phenomenon like tossing of a coin, throwing of a dice, etc. Most of the data relating to psychological and educational statistics also give rise to normal curve. However, the data relating to social, business and economic phenomena generally do not conform to normal curve. They usually give moderately asymmetrical (slightly skewed) curves discussed below.

(b) Moderately Asymmetrical (Skewed) Frequency Curves. A frequency curve is said to be skewed (asymmetrical) if it is not symmetrical. Moderately asymmetrical curves are commonly observed in social, economic and business phenomena. Such curves are stretched more to one side than to the other. If the curve is stretched more to the right (i.e., it has a longer tail towards the right), it is said to be positively skewed and if it is stretched more to the left (i.e., has a longer tail towards the left), it is said to be negatively skewed. Thus, in a positively skewed distribution, most of the frequencies are associated with smaller values of the variable and in a negatively skewed distribution most of the frequencies are associated with larger values of the variable. The following figures show positively skewed and negatively skewed distributions.

[Fig. 4·36. POSITIVELY SKEWED DISTRIBUTION and NEGATIVELY SKEWED DISTRIBUTION, with the Mode marked on each curve.]

(c) Extremely Asymmetrical or J-Shaped Curves.
The distributions in which the value of the variable corresponding to the maximum frequency is at one end of the range (and not in the middle as in the case of symmetrical distributions), give rise to highly skewed curves. When plotted, they give a J-shaped or inverted J-shaped curve and accordingly such curves are also called J-shaped curves. In a J-shaped curve, the distribution starts with low frequencies in the lower classes and then frequencies increase steadily as the variable value increases and finally the maximum frequency is attained in the last class, thus exhibiting a peak at the extreme right end of the distribution. Such curves are not regular curves but become unavoidable in certain situations. For example, the distribution of mortality (death) rates (along Y-axis) w.r.t. age (along X-axis) after ignoring the accidental deaths; or the distribution of persons travelling in local state buses (e.g., DTC in Delhi or BEST in Mumbai) w.r.t. time from morning hours, say, 7 A.M. to peak traffic hours, say, 10 A.M., will give rise to a J-shaped distribution [Fig. 4·37(a)]. Similarly, in an inverted J-shaped curve the frequency decreases continuously with the increase in the variate values, the maximum frequency being attained in the beginning of the distribution. For example, the distribution of the quantity demanded w.r.t. the price; or the number of depositors w.r.t. their saving in a bank, or the number of persons w.r.t. their wages or incomes in a city, will give a reverse J-shaped curve [Fig. 4·37(b)].

[Fig. 4·37(a). J-SHAPED CURVE: Age along X-axis, Mortality Rate along Y-axis. Fig. 4·37(b). INVERTED J-SHAPED CURVE: Price along X-axis, Quantity Demanded along Y-axis.]

(d) U-Curve. The frequency distributions in which the maximum frequency occurs at the extremes (i.e., both ends) of the range and the frequency keeps on falling symmetrically (about the middle), the minimum frequency being attained at the centre, give rise to a U-shaped curve. In this type of distribution, most of the frequencies are associated with the values of the variable at the extremes, i.e., with smaller and larger values, whereas smaller frequencies are associated with the intermediate values, the central value having the minimum frequency. Such distributions are generally observed in the behaviour of total costs where the curve initially falls steadily and after attaining the optimum level (in the middle), it starts rising steadily again. As another illustration, the distribution of persons travelling in local state buses between morning and evening peak hours will give, more or less, a U-shaped curve shown in Fig. 4·38.

[Fig. 4·38. U-SHAPED CURVE: Time along X-axis (from morning peak hours to evening peak hours), No. of persons along Y-axis.]

(e) Mixed Curves. In the curves discussed so far, we have seen that the highest concentration of the values lies at the centre (symmetrical curve), or near around the centre (moderately asymmetrical curve), or at the extremes (J-shaped and U-shaped curves). But sometimes, though very rarely, we come across certain distributions in which maximum frequency is attained at two or more points in an irregular manner as shown in Fig. 4·39.

[Fig. 4·39. Bi-Modal Curve and Tri-Modal Curve: Variable along X-axis, Frequency along Y-axis.]

Such curves are obtained in a distribution where as the value of the variable increases, the frequencies increase and decrease, then again increase and decrease in an irregular manner; the phenomenon may be repeated twice or thrice as shown in the above diagrams or even more than that. The distributions with two humps are called bi-modal distributions and those with three humps are called tri-modal distributions while those with more than three humps are termed as multi-modal distributions. Such distributions are rarely observed in practice and should be avoided as far as possible because they cannot be usefully employed for the computation of various statistical measures and for statistical analysis.

D.
OGIVE OR CUMULATIVE FREQUENCY CURVE Ogive, pronounced as ojive, is a graphic presentation of the cumulative frequency (c.f.) distribution [See Chapter 3] of continuous variable. It consists in plotting the c.f. (along the Y-axis) against the class boundaries (along X-axis). Since there are two types of cumulative frequency distributions viz., ‘less than’ c.f. and ‘more than’ c.f. we have accordingly two types of ogives, viz., (i) ‘Less than’ ogive. (ii) ‘More than’ ogive. ‘Less Than’ Ogive. This consists in plotting the ‘less than’ cumulative frequencies against the upper class boundaries of the respective classes. The points so obtained are joined by a smooth freehand curve to give ‘less than’ ogive. Obviously, ‘less than’ ogive is an increasing curve, sloping upwards from left to right and has the shape of an elongated S. Remark. Since the frequency below the lower limit of the first class (i.e., upper limit of the class preceding the first class) is zero, the ogive curve should start on the left with a cumulative frequency zero at the lower boundary of the first class. ‘More Than’ Ogive. Similarly, in ‘more than’ ogive, the ‘more than’ cumulative frequencies are plotted against the lower class boundaries of the respective classes. The points so obtained are joined by a smooth freehand curve to give ‘more than’ ogive. ‘More than’ ogive is a decreasing curve and slopes downwards from left to right and has the shape of an elongated S, upside down. Remarks 1. We may draw both the ‘less than’ ogive and ‘more than’ ogive on the same graph. If done so, they intersect at a point. The foot of the perpendicular from their point of intersection on the X-axis gives the value of median. [See Example 4·28]. 2. Ogives are particularly useful for graphic computation of partition values, viz., Median, Quartiles, Deciles, Percentiles, etc. [For details, see Chapter 5]. 
They can also be used to determine graphically the number or proportion of observations below or above a given value of the variable or lying between certain interval of the values of the variable.

3. Ogives can be used with advantage over frequency curves for comparative study of two or more distributions because like frequency curves, for each of the distributions different ogives can be constructed on the same graph and they are generally less overlapping than the corresponding frequency curves.

4. If the class frequencies are large, they can be expressed as percentages of the total frequency. The graph of the cumulative percentage frequency is called ‘percentile curve’.

Example 4·27. Draw a less than cumulative frequency curve for the following data and find from the graph the value of seventh decile.

Monthly income :  0—100  100—200  200—300  300—400  400—500  500—600  600—700  700—800  800—900  900—1000
No. of workers :    12      28       35       65       30       20       20       17       13       10

Solution. Less than cumulative frequency curve is obtained on plotting the ‘less than’ c.f. against the upper limit of the corresponding class and joining the points so obtained by a smooth free hand curve as shown in Fig. 4·40.

‘LESS THAN’ CUMULATIVE FREQUENCY TABLE

Monthly Income   No. of workers (f)   Less than c.f.
0—100              12                   12
100—200            28                   40
200—300            35                   75
300—400            65                  140
400—500            30                  170
500—600            20                  190
600—700            20                  210
700—800            17                  227
800—900            13                  240
900—1000           10                  250
Total              ∑ f = 250

To obtain the value of seventh decile from the graph, at frequency $\frac{7N}{10} = \frac{7}{10} \times 250 = 175$, draw a line parallel to the X-axis meeting the ‘less than’ c.f. curve at point P. From P draw PM perpendicular to X-axis meeting it at M. Then the value of seventh decile is (Fig. 4·40) :

D7 = OM = Rs. 545 [From the graph]

[Fig. 4·40. ‘Less than’ cumulative frequency curve: MONTHLY INCOME along X-axis, NO. OF WORKERS along Y-axis; the line at 7N/10 meets the curve at P and PM meets the X-axis at M = 545.]

Example 4·28. The following table gives the distribution of monthly income of 600 families in a certain city.

Monthly Income (’00 Rs.) :  Below 75  75—150  150—225  225—300  300—375  375—450  450 and over
No. of Families          :     60      170      200       60       50       40        20

Draw a ‘less than’ and a ‘more than’ ogive curve for the above data on the same graph and from these read the median income.

Solution. For drawing the ‘less than’ and ‘more than’ ogive we convert the given distribution into ‘less than’ and ‘more than’ cumulative frequencies (c.f.) as given in the following table.

Monthly Income (’00 Rs.)   No. of Families (f)   Less than c.f.   More than c.f.
Below 75                     60                     60               600
75—150                      170                    230               540
150—225                     200                    430               370
225—300                      60                    490               170
300—375                      50                    540               110
375—450                      40                    580                60
450 and over                 20                    600                20

As already explained, for drawing ‘less than’ ogive, we plot ‘less than’ c.f. against the upper limit of the corresponding class intervals and join the points so obtained by smooth freehand curve. Similarly, ‘more than’ ogive is obtained on joining the points obtained on plotting the ‘more than’ c.f. against the lower limit of the corresponding class by smooth freehand curve. From the point of intersection of these two ogives, draw a line perpendicular to the X-axis (monthly incomes). The abscissa (x-coordinate) of the point where this perpendicular meets the X-axis gives the value of median. The ‘more than’ and ‘less than’ ogives and the value of median are shown in the Fig. 4·41.

From Fig. 4·41, Median = OM = 176 (approximately)

Hence, the median monthly income is Rs. 17,600.

[Fig. 4·41. ‘LESS THAN’ AND ‘MORE THAN’ OGIVE: MONTHLY INCOME along X-axis, NUMBER OF FAMILIES along Y-axis; MEDIAN = 176 (APPROX.).]

Example 4·29. Draw a percentile curve for the following distribution of marks obtained by 700 students at an examination.

Marks           :  0—10  10—19  20—29  30—39  40—49  50—59  60—69  70—79  80—89
No. of Students :    9     42     61    140    250    102     71     23      2

Find from the graph (i) the marks at the 20th percentile, and (ii) the percentile equivalent to a mark of 65.

Solution. A percentile curve is obtained on expressing the ‘less than’ cumulative frequencies as percentage of the total frequency and then plotting these cumulative percentage frequencies (P) against the upper class boundaries (x) of the corresponding classes. These points are then joined by a smooth freehand curve.

COMPUTATION OF CUMULATIVE PERCENTAGE FREQUENCY DISTRIBUTION

Marks         Frequency (f)   ‘Less than’ c.f.   Percentage ‘less than’ c.f. (P)
— 9·5            9               9                  1·3
9·5—19·5        42              51                  7·3
19·5—29·5       61             112                 16·0
29·5—39·5      140             252                 36·0
39·5—49·5      250             502                 71·7
49·5—59·5      102             604                 86·3
59·5—69·5       71             675                 96·4
69·5—79·5       23             698                 99·7
79·5—89·5        2             700                100·0

$$\text{Percentage ‘less than’ c.f.} = \frac{\text{‘Less than’ c.f.}}{\text{Total frequency}} \times 100 = \frac{\text{c.f.}}{700} \times 100 = \frac{\text{c.f.}}{7}$$

(i) To find marks (x) corresponding to the 20th percentile, at P = 20, draw a line parallel to X-axis, meeting the percentile curve at A. Draw AM perpendicular to X-axis, meeting X-axis at M. Then OM = 31·5 gives the marks at the 20th percentile.

(ii) To find percentile equivalent to mark x = 65, at x = 65, draw perpendicular to X-axis meeting the percentile curve at B. From B draw a line parallel to X-axis meeting the Y-axis at N. Then ON = 92 is the percentile equivalent to score of 65.

[Fig. 4·42. PERCENTILE CURVE: MARKS (X) along X-axis, PERCENTAGE LESS THAN c.f. (P) along Y-axis, with points A, M, B and N as described above.]

4·4·4.
Graphs of Time Series or Historigrams. A time series is an arrangement of statistical data in chronological order, i.e., with respect to the time of occurrence. The time period may be a year, quarter, month, week, day, hour and so on. Most of the series relating to economic and business data are time series, such as population of a country, money in circulation, bank deposits and clearings, production and price of commodities, sales and profits of a departmental store, imports and exports of a country, etc. Thus in time series data there are two variables: the independent variable is time and the other (dependent) variable is the phenomenon under study.

Time series data are represented geometrically by means of a Time Series Graph, which is also known as a Historigram. The independent variable, viz., time, is taken along the X-axis and the dependent variable is taken along the Y-axis. The various points so obtained are joined by straight lines to get the time series graph. If the actual time series data are graphed, the historigram is called an Absolute Historigram. However, the graph obtained on plotting the index numbers of the given values is called an Index Historigram and it depicts the percentage changes in the values of the phenomenon as compared to some fixed base period. Historigrams are extensively used in practice. They are easy to draw and understand and do not require much skill or expertise to construct and interpret.

Remark. Time series graphs can be drawn on a natural (arithmetic) scale or on a ratio (semi-logarithmic or logarithmic) scale, the former reflecting the absolute changes from one period to another and the latter depicting the relative changes or rates of change. In the following sections we shall study the time series graphs on a natural scale. Ratio scale graphs are discussed in § 4·4·5.
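As a minimal sketch of the difference between an absolute and an index historigram, the following Python snippet converts a series of absolute values into index numbers on a fixed base period. The function name `index_numbers` and the choice of 1990 as the base year are assumptions made purely for illustration; the yield figures are those of Example 4·30, given later in this section.

```python
def index_numbers(values, base_period):
    """Express each value as a percentage of the value in the base period."""
    base = values[base_period]
    return {period: round(100 * value / base, 1) for period, value in values.items()}

# Yield (in million tons) from Example 4·30; taking 1990 as base is an assumption.
yield_by_year = {1990: 12.8, 1991: 13.9, 1992: 12.8, 1993: 13.9,
                 1994: 13.4, 1995: 6.5, 1996: 2.9, 1997: 14.8}

index_1990 = index_numbers(yield_by_year, base_period=1990)
for year, index in index_1990.items():
    print(year, index)
```

Plotting `yield_by_year` against the years would give the absolute historigram, while plotting `index_1990` would give the index historigram with base 1990 = 100.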
The various types of time series graphs are :
(i) Horizontal Line Graphs or Historigrams
(ii) Silhouette or Net Balance Graphs
(iii) Range or Variation Graphs
(iv) Components or Band Graphs
Now we shall discuss them briefly, one by one.

A. HORIZONTAL LINE GRAPHS OR HISTORIGRAMS
In such a graph only one variable is to be represented graphically. As already explained, the desired graph (historigram) is obtained on plotting the time variable along the X-axis and the other variable, viz., the magnitudes of the phenomenon under consideration, along the Y-axis on a suitable scale and joining the points so obtained by straight lines. An illustration is given in Example 4·30.

Example 4·30. Draw the graph of the following :

Year                       1990   1991   1992   1993   1994   1995   1996   1997
Yield (in million tons)    12·8   13·9   12·8   13·9   13·4    6·5    2·9   14·8

Solution. Taking the scale along the X-axis as 1 cm = 1 year and along the Y-axis as 1 cm = 2 million tons, the required graph is as shown in Fig. 4·43.

Fig. 4·43. YIELD (IN MILLION TONS) FOR DIFFERENT YEARS (X-axis: YEARS; Y-axis: YIELD (IN MILLION TONS))

FALSE BASE LINE. As already explained [§ 4·4·2 (Item 5)], if the fluctuations in the values of the variable (to be shown along the Y-axis) are small as compared to their magnitudes and if the minimum value of the variable is very distant from the origin, i.e., zero, then the technique of false base line is used to highlight these fluctuations. Illustrations are given in Examples 4·31 and 4·32.

HISTORIGRAM: TWO OR MORE VARIABLES
The time series data relating to two or more related variables, i.e., phenomena measured in the same unit and belonging to the same time period, can be displayed together in the same graph using the same scale for all the variables along the vertical axis and the same scale for time along the X-axis for each variable. The method for drawing such graphs is the same as that of the historigram for one variable.
Thus we shall get a number of curves, one for each variable. They should be distinguished from each other by the use of different types of lines, viz., thin and thick lines, dotted lines, dash lines, dash-dot lines, etc., and an index to this effect should be given for proper identification of the curves. The following illustration will clarify the point.

Example 4·31. The following table gives the index numbers of industrial production for India.

INDEX NUMBER OF INDUSTRIAL PRODUCTION (Base : 1970 = 100)

Item              1971    1972    1973    1974    1975    1976
Cement           107·0   113·1   107·6   102·6   116·7   133·9
Iron and steel   100·6   112·0    96·1   100·2   121·3   145·0
General Index    104·2   110·2   112·0   114·3   119·3   131·2

Represent them on the same graph paper.

Solution. As usual, we take time (years) along the X-axis and the index numbers along the Y-axis. Using a false base line (for the vertical axis) at 95, the graph is shown in Fig. 4·44.

Fig. 4·44. INDEX NUMBER OF INDUSTRIAL PRODUCTION, Base : 1970 = 100 (curves: Cement, Iron & Steel, General Index; X-axis: YEARS; Y-axis: INDEX NUMBERS)

Remarks 1. The technique of drawing two or more historigrams on the same graph facilitates comparisons between the related phenomena. However, its use is not recommended if the number of variables is large, say, more than 4. In such a case the different line graphs, which may intersect each other, become quite confusing and difficult to understand and interpret.

2. The graph obtained on plotting the index numbers is known as an index historigram and it represents the relative changes in the values of the variables under consideration. Two or more index historigrams on the same graph obviously facilitate comparisons. However, in order to arrive at any valid conclusion, the index numbers for all the variables should be computed with respect to the same base period.

3. Graphs of two variables measured in different units.
The time series data relating to two related phenomena which are measured in different units, e.g., imports (quantity in million tons) and imports (value in crores of Rupees), but pertaining to the same time period can also be displayed on the same graph. This is done by using two different vertical scales (one for each variable), one on the left and the other on the right, the scales for each variable being so selected that the two historigrams so obtained are close to each other. This objective can be achieved by taking the scale for each variable proportional to its average value, i.e., the average value of each variable is kept in or near the middle of the vertical scale in the graph and the scale for each is selected accordingly. We explain this point by the following illustration.

Example 4·32. Plot a graph to represent the following data in a suitable manner.

Year                      1990   1991   1992   1993   1994   1995   1996   1997
Imports (million tons)     400    450    560    620    580    460    500    540
Imports (million Rs.)      220    235    385    420    420    380    360    400

Solution. The time variable (Year) is recorded along the X-axis with scale 1 cm = 1 year and the variate values, imports (volume in million tons) and imports (value in million Rs.), are recorded along the Y-axis with scales :
1 cm = 20 million tons (for imports-quantity)
1 cm = Rs. 20 million (for imports-value)
and the false base line is selected at 400 million tons (for quantity) and Rs. 220 million (for value). The graph is shown in Fig. 4·45.

Fig. 4·45. VOLUME AND VALUE OF IMPORTS (1990–1997) (curves: Volume, Value; X-axis: YEARS; Y-axes: IMPORTS (MILLION TONS) and IMPORTS (MILLION RS.))

B.
SILHOUETTE OR NET BALANCE GRAPH
This graph is specially used to highlight the difference or the net balance between the values of two variables along the vertical axis, e.g., the difference between imports and exports of a country in different years, sales and purchases of a business concern for different periods, the income and expenditure of a family in different months and so on. This can be done in any one of the following two ways :

Method 1. Obtain the net balance, viz., the difference between the two sets of values of the phenomena (variables) for different periods. Some of these differences may be negative also. Now, in addition to the two historigrams, one for each variable, draw a third historigram for the net balance on the same graph. A portion of this graph (corresponding to the negative values of the net balance) will be below the X-axis.

Method 2. Draw two historigrams, one for each phenomenon (variable), on the same graph. The net balance between the variables is depicted by proper filling or shading of the space between the two historigrams, depicting clearly the positive and negative balance.

Both these methods are explained in the following illustration.

Example 4·33. India’s overall balance of payment situation (Billions of Rupees) is given below :

Years       Credits    Debits    Balance (Credit – Debit)
1970–71      18·9       22·2          –3·3
1971–72      20·9       24·9          –4·0
1972–73      24·2       26·7          –2·5
1973–74      46·1       33·0          13·1
1974–75      40·7       47·2          –6·5

Represent the above data on the same graph.

Solution. The above data can best be represented by the Silhouette or Net Balance Graph. The graphs obtained on using both the methods are given below.
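Before drawing either graph, the net balance required by Method 1 can be recomputed from the credits and debits. The following is a minimal Python sketch using the figures of Example 4·33; the variable names are illustrative only, and the labels "favourable" and "unfavourable" are used here simply for a positive and a negative balance respectively.

```python
years   = ["1970-71", "1971-72", "1972-73", "1973-74", "1974-75"]
credits = [18.9, 20.9, 24.2, 46.1, 40.7]   # billions of rupees
debits  = [22.2, 24.9, 26.7, 33.0, 47.2]   # billions of rupees

# Net balance = credit - debit, rounded to one decimal place as in the table.
balance = [round(c - d, 1) for c, d in zip(credits, debits)]

for year, b in zip(years, balance):
    status = "favourable" if b >= 0 else "unfavourable"
    print(year, b, status)
```

The negative entries of `balance` are the points that fall below the X-axis in the Method 1 graph.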
Figs. 4·46 and 4·47. Method 1 : INDIA’S OVERALL BALANCE OF PAYMENTS (curves: Credits, Debits, Balance). Method 2 : CREDITS AND DEBITS DURING 1970–1975 (ALONG WITH BALANCE OF PAYMENTS) (shading: Favourable, Unfavourable). In both graphs, X-axis: YEARS; Y-axis: RUPEES (IN BILLIONS).

C. RANGE GRAPH OR ZONE GRAPH
By range we mean the deviation, i.e., the difference between the two extreme values, viz., the maximum and minimum values of the variable under consideration. The range graph, also sometimes known as the zone graph, is used to depict and emphasize the range of variation of a phenomenon for each period. For instance, for highlighting the range of variation of :
(i) the temperature on different days,
(ii) the blood pressure readings of an individual on different days,
(iii) the price of a commodity over different periods of time, etc.,
the range or zone graph is the most appropriate and helps us to have an idea of the likely fluctuations in the magnitudes of the phenomenon under study. The range chart can be drawn in any of the following ways.

Method 1. For each time period, plot the maximum and minimum values of the variable and join them by straight lines to get the range lines. Plot the mid-point (average value of the variable) for each period. Join these points by straight lines to get the range graph. The range graph thus depicts the maximum, the minimum and the average value of the phenomenon for each period.

Method 2. This method consists in plotting two historigrams, one corresponding to the maximum values of the phenomenon for different periods and the other corresponding to the minimum values. The space between the two historigrams depicts the range of the variation and is prominently displayed by proper filling or shading.

Both these methods are explained in the following illustration.

Example 4·34.
The following are the share price quotations of a firm for five consecutive weeks. Present the data by an appropriate diagram.

Week    High    Low
1       102     100
2       103     101
3       107     103
4       106     105
5       105     104

Solution. Since the maximum and minimum price quotations of a firm for 5 consecutive weeks are given, the most appropriate graph is the zone or range graph. Taking weeks along the X-axis and share price quotations along the Y-axis and using the false base line at 100 for the vertical scale, the range chart as obtained by Method 1 is drawn in Fig. 4·48. The range chart may also be drawn as in Fig. 4·49 (cf. Method 2).

Fig. 4·48. RANGE CHART FOR SHARE PRICE QUOTATIONS OF A FIRM (Highest, Average, Lowest; X-axis: WEEKS; Y-axis: SHARE-PRICE QUOTATIONS)
Fig. 4·49. RANGE CHART FOR SHARE PRICE QUOTATIONS OF A FIRM (Highest Price, Lowest Price; X-axis: WEEKS; Y-axis: SHARE-PRICE QUOTATIONS)

D. BAND GRAPH
Like the sub-divided bar diagram or pie diagram, the band graph, also known as the component part line chart, is a line graph used to display the total value or the magnitude of a variable and its break up into different components for each period. The construction of such a chart, which is used only for time series, is quite simple and involves the following steps :
(i) For each period, arrange the break up of the value of the variable into various components in the same order.
(ii) Draw the historigram for the first component.
(iii) Over this historigram draw another historigram for the 2nd component. This is done by drawing the 2nd historigram for the cumulative totals of the first two components.
(iv) Over the 2nd historigram draw another historigram for the third component. This is done by drawing the historigram for the cumulative totals of the first three components.
This technique of drawing historigrams, one over the other, is continued till all the components are exhausted.
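The stacking described in steps (ii) to (iv) amounts to taking running totals of the components within each period. The following is a minimal Python sketch, assuming three components and using the first three periods of Example 4·35; the function name `band_boundaries` is hypothetical.

```python
from itertools import accumulate

def band_boundaries(components):
    """Return the successive historigrams of a band graph.

    `components` is a list of series, one per component, in a fixed order.
    The k-th series returned holds, period by period, the cumulative total
    of the first k components.
    """
    per_period = zip(*components)                         # components grouped by period
    stacked = [list(accumulate(p)) for p in per_period]   # running totals within a period
    return [list(series) for series in zip(*stacked)]     # regroup by boundary line

# First three periods (1988-89 to 1990-91) of Example 4·35.
material = [37, 25, 35]
labour   = [10, 8, 11]
overhead = [13, 10, 15]

boundaries = band_boundaries([material, labour, overhead])
print(boundaries)
```

Each inner list of `boundaries` is one line of the band graph; the band for a component is the space between its line and the line below it.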
The last historigram thus corresponds to the total value of the variable. The space between different historigrams, in the form of different bands or belts, one for each component, is prominently displayed by different types of lines, viz., dash lines, dot lines, dash-dot lines, etc. This chart is specially useful to display the division of the total costs, total sales, total production, etc., into various component parts for different periods.

Remark. Just like the percentage bar diagram or percentage rectangular diagram, the band chart can also be used for time series where data are expressed in percentage form. In such a situation, the total value of the variable for each period is taken as 100 and the bands will depict the percentages that different components bear to the total.

Example 4·35. The following table gives the cost of production (in arbitrary units) of a factory in biennial averages :

Items       1988–89  1989–90  1990–91  1991–92  1992–93  1993–94  1994–95  1995–96  1996–97  1997–98
Material       37       25       35       36       35       38       22       17       26       20
Labour         10        8       11       11       11       12        7        5        8        9
Overhead       13       10       15       16       17       20       12        9       12       15
Total          60       43       61       63       63       70       41       31       46       44

Represent the above data by a band graph.

Solution.

Fig. 4·50. COST OF PRODUCTION OF A FACTORY (1988–1998) (bands: Materials, Labour, Overhead; X-axis: YEARS; Y-axis: COST OF PRODUCTION)

4·4·5. Semi-Logarithmic Line Graphs or Ratio Charts. In the graphs discussed so far, we have used the arithmetic, i.e., natural scale, in which equal distances represent equal absolute magnitudes on both the axes. Such graphs can be used with advantage if we are interested in displaying the absolute changes in the value of a phenomenon and the variations in the magnitudes are such that they can be plotted in the available space on the graph paper.
But quite often, particularly in the case of phenomena pertaining to growth like population, production, sales, profits, etc., the increase or decrease in the value of the variable is very rapid. In such a situation we are primarily interested in studying the relative changes rather than the absolute changes in the value of the phenomenon and the arithmetic scale is not of much use. In such cases we use the semi-logarithmic or logarithmic or ratio scale, which is basically used to highlight or emphasize relative or proportionate or percentage changes in the values of a phenomenon over different periods of time. Since
$$\log\left(\frac{a}{b}\right) = \log a - \log b,$$
on a logarithmic scale, equal distances will represent equal proportionate changes. There are two ways of using the logarithmic scale :

Semi-logarithmic Line Graph. In such a graph, the time variable along the X-axis is expressed on a natural scale and the logarithms of the values of the phenomenon under study for different periods of time are plotted on the vertical axis on a natural scale. The points so obtained are joined by straight lines to give the desired curve. Since in this type of curve the logarithms are taken along only one axis, it is known as a semi-logarithmic graph and it is specially useful for studying the rates of change in the dependent variable (phenomenon under study) for different periods of time in a time series.

Logarithmic Line Graph. In this graph, both the variables along the horizontal and vertical axes are plotted on a logarithmic scale. For instance, for time series data, the logarithms of the time values are plotted along the horizontal axis and the logarithms of the values of the variable are plotted along the vertical axis, each on a natural scale. The required graph is obtained on joining the points so obtained by straight lines. However, it is very difficult to interpret such a graph and in practice, mostly the semi-logarithmic graph is used.

Remarks 1.
In a semi-logarithmic graph, almost always, the vertical scale or Y-scale is a logarithmic scale. Since a semi-logarithmic graph is useful for studying the relative changes or rates and ratios of increase or decrease over different periods of time, it is also called a Ratio Graph or Ratio Chart and the logarithmic scale is also called the ratio scale.

2. For practical purposes, semi-logarithmic graph papers (in which the vertical scale is a logarithmic scale, i.e., log Y is marked along the Y-axis, and the horizontal scale is natural, i.e., the values of X are marked on an arithmetical scale), analogous to ordinary graph paper, are available. The use of such a semi-logarithmic graph paper relieves us of the problem of looking up and plotting the logarithms of the values of a variable on a natural scale. A specimen of a semi-logarithmic graph paper is given below.

Fig. 4·51. SEMI-LOGARITHMIC GRAPH PAPER (vertical scale marked 1, 2, 4, 6, 8, 10, 20, 40, 60, 80, 100)

3. The following diagram (Fig. 4·52) displays the arithmetic and logarithmic scales.

Number       1        2        3        4        5        6        7        8        9        10
Logarithm  ·00000   ·30103   ·47712   ·60206   ·69897   ·77815   ·84510   ·90309   ·95424   1·00000

Fig. 4·52. Arithmetic scale, logarithms on arithmetic scale, and logarithmic scale.

The reason why the logarithmic scale shows smaller and smaller distances as we move towards higher and higher magnitudes from 1 to 10 is that a series which is increasing by equal absolute amounts (on an arithmetic scale) is increasing at a diminishing rate.

Arithmetic Scale Graphs Vs. Ratio Scale Graphs
1. A line graph on an arithmetic scale depicts the absolute changes from one period to another whereas on a ratio scale it reflects the rate of change between any two points of time. Thus the graph drawn on a natural scale will not be able to reflect the relative or percentage changes or the rate of change of the phenomenon for any two points of time.
In most of the problems of growth, e.g., data relating to population, production, sales or profits of a business concern, national income, etc., absolute changes, if shown on the graph on a natural scale, are often misleading. As an illustration, let us consider the following hypothetical figures relating to the profits of a business concern.

                                    Increase over profits of preceding year
Year    Profits (Rs. in lacs)    Absolute (Rs. in lacs)    Percentage
1990            15                       –                      –
1991            30                      15                   100·0
1992            50                      20                    66·7
1993            75                      25                    50·0
1994           105                      30                    40·0
1995           140                      35                    33·3

Thus in the above table, although the absolute increase shows a steady increase in the profits, the percentage or relative increase registers a steady decline. It is surprising to note that the smallest percentage increase (for the year 1995) corresponds to the greatest absolute increase, a fact which is prominently displayed on a semi-logarithmic graph by the flattening of the slope of the curve. Hence, if the primary objective is to study the rate of change in the magnitudes of a phenomenon, the data plotted on a natural scale will give quite wrong and misleading conclusions. In such a case the ratio or semi-logarithmic scale is the appropriate one.

2. On an arithmetic or natural scale equal absolute amounts (along the vertical axis) are represented by equal distances whereas on a ratio scale equal distances represent equal proportionate movements or equal relative rates of change or equal percentage changes. Thus on a natural scale, the readings are in arithmetical progression while on a ratio scale they are in geometric progression as exhibited in Fig. 4·53.

Fig. 4·53. Natural scale (0, 1, 2, 3, 4, 5) and ratio scales (1, 2, 4, 8, 16, 32 ; 10, 20, 40, 80, 160, 320 ; 100, 200, 400, 800, 1600, 3200).
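The contrast between absolute and relative changes in the profit figures tabulated under point 1 can be verified with a short calculation. The following Python sketch (variable names are illustrative) recomputes both columns of the table and also the differences of the common logarithms, which are the vertical steps that a semi-logarithmic graph of the same data would show.

```python
import math

years   = [1990, 1991, 1992, 1993, 1994, 1995]
profits = [15, 30, 50, 75, 105, 140]          # Rs. in lacs

rows = []
for prev, curr in zip(profits, profits[1:]):
    absolute   = curr - prev                          # absolute increase
    percentage = round(100 * absolute / prev, 1)      # increase relative to preceding year
    log_step   = math.log10(curr) - math.log10(prev)  # vertical step on a ratio scale
    rows.append((absolute, percentage, log_step))

for year, (absolute, percentage, log_step) in zip(years[1:], rows):
    print(year, absolute, percentage, round(log_step, 3))
```

The absolute increases rise from year to year while the logarithmic steps shrink, which is the flattening of the semi-logarithmic curve referred to in the text.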
Thus on a logarithmic scale, the distances between the points on the vertical scale represent the distances of the logarithms of the numbers and not the distances of the numbers themselves.

3. On a natural or arithmetic scale, the vertical scale must start with zero. Since the logarithm of zero is minus infinity, i.e., since log (0) = – ∞, in a ratio graph there is no zero base line. Thus, in a ratio graph, the vertical scale starts with a positive number. Further, since log (1) = 0, the value 1 is placed at a zero distance from the origin, i.e., at the origin itself. Hence, in a ratio chart, the origin along the vertical scale is at 1.

4. In case the magnitudes of the phenomenon under consideration have a very wide range, i.e., the values differ widely in magnitude, then the ratio graph is more appropriate than the graph on an arithmetic scale.

5. In interpreting the graphs drawn on a natural scale, the relative position of the curve on the graph is very significant. But in interpreting a ratio graph it is the shape, direction and degree of steepness of the graph (i.e., straight line or a curve sloping upwards or downwards) that matters and not its position. Accordingly, on a semi-logarithmic scale, the different graphs can be moved up and down without changing their meaning (interpretation). Hence the ratio graph can be effectively used to graph, for purposes of comparison, two or more phenomena (variables) which differ widely in their magnitudes or which are measured even in different units. For instance, for charting the data relating to population growth, agricultural or industrial output, prices, profits, sales, etc., the ratio graph or semi-logarithmic graph is more appropriate. Such a comparison, however, might be misleading on a natural scale.

Uses of Semi-Logarithmic Scale or Ratio Scale. From the above discussion, the uses of the ratio or semi-logarithmic scale may be summarised as follows :
1.
For studying the rates of change (increase or decrease) or the relative or percentage changes in the values of a phenomenon like population, production, sales, profit, income, etc.
2. For charting two or more phenomena differing very widely in their magnitudes.
3. For charting and comparative study of two or more phenomena measured in different units.
4. When we are interested in proportionate or percentage changes rather than absolute changes.

Limitations of Semi-Logarithmic Scale or Ratio Scale
1. Since log (0) = – ∞ and the logarithm of a negative quantity is not defined, the ratio scale cannot be used to plot zero or negative values. Accordingly, it cannot be used to represent the ‘Net Balance’ or ‘Balance of Trade’ on the graph.
2. Another limitation of the ratio scale is that it cannot be used to study the total magnitude and its break up into component parts of any given phenomenon.
3. It cannot be used to study absolute variations.
4. Lastly, it is quite difficult for a layman to draw and interpret ratio charts. The interpretation of a semi-logarithmic graph requires great skill and expertise. This is a great handicap to the mass popularity of ratio or semi-logarithmic graphs.

Shape of the Curve on Semi-Logarithmic Scale and Natural Scale
1. The values of a phenomenon increasing by a constant amount will give a straight line rising upward on an arithmetic scale, while on a semi-logarithmic scale they will give an upward rising curve with its slope steadily declining (which implies a steadily decreasing rate). In other words, it will be a curve concave to the base. This is so because values increasing by a constant absolute amount increase at a declining rate. This is shown in the following diagram.

Fig. 4·54. SERIES INCREASING BY A CONSTANT ABSOLUTE AMOUNT (panels: ARITHMETIC SCALE, LOGARITHMIC SCALE; X-axis: years 1990 to 1999)

2.
A time series increasing at a constant rate will give a curve convex to the base (i.e., a curve rising upwards towards the right with its slope gradually increasing) on a natural scale. However, on a ratio scale, it will give an upward rising straight line graph as shown in the following diagram.

Fig. 4·55. SERIES INCREASING AT A CONSTANT RATE (panels: ARITHMETIC SCALE, LOGARITHMIC SCALE; X-axis: years 1990 to 1999)

3. If the time series values decrease by a constant absolute amount, the graphs on the two scales will be like the mirror images of the graphs in case 1, in the reverse order (as shown in Fig. 4·56), i.e., on a natural scale it will give a straight line moving downwards (rapidly declining) and on a ratio scale it will give a curve falling to the right with its slope increasing.

Fig. 4·56. SERIES DECREASING BY CONSTANT ABSOLUTE AMOUNT (panels: ARITHMETIC SCALE, LOGARITHMIC SCALE; X-axis: years 1990 to 1999)

4. Similarly, for a time series decreasing at a constant rate, the graphs on the two scales will be the mirror images of the graphs in case 2, in the reverse order (as shown in the following diagram), i.e., on an arithmetic scale we shall get a curve moving downwards with a declining slope and on a semi-logarithmic scale we shall get a straight line moving downwards.

Fig. 4·57. SERIES DECREASING BY CONSTANT RATE (panels: ARITHMETIC SCALE, LOGARITHMIC SCALE; X-axis: years 1990 to 1999)

Interpretation of Semi-Logarithmic or Ratio Curves
1. If the curve is rising upwards, the rate of growth or increase is positive, and a curve falling downwards indicates a negative rate of growth, i.e., a decrease.
2. If the curve is nearly a straight line which is ascending, it represents a series increasing, more or less, at a constant rate.
Similarly, a nearly straight line curve which is descending, i.e., moving downwards, represents a series which is decreasing, more or less, at a uniform rate.
3. If the curve rises (falls) more steeply at one point of time than at another, it depicts a more rapid rate of increase (decrease) at that point than at the other point.
4. If two curves on the same semi-logarithmic graph are parallel to each other, they represent equal percentage rates of change for each phenomenon.
5. If one curve is steeper than the other on the same ratio chart, it implies that the first is changing at a faster rate than the second.

We now give below some illustrations of the use of ratio or semi-logarithmic graphs.

Example 4·36. A firm reported that its net worth (in lacs Rs.) in the years 1990-91 to 1994-95 was as follows :

Year         1990-91   1991-92   1992-93   1993-94   1994-95
Net worth      100       112       120       133       147

Plot the above data in the form of a semi-logarithmic graph. Can you say anything about the approximate rate of growth of its net worth ?

Solution. To plot the above data on a semi-logarithmic scale we plot the logarithms of the dependent variable (Net Worth) along the vertical axis on a natural scale. The horizontal axis, as usual, will represent the time variable on a natural scale.

Year       Net Worth (Y) (in lacs Rs.)    log (Y)
1990-91            100                     2·00
1991-92            112                     2·05
1992-93            120                     2·08
1993-94            133                     2·12
1994-95            147                     2·17

The graph is shown in Fig. 4·58.

Fig. 4·58. SEMI-LOG GRAPH (NET WORTH OF A FIRM) (X-axis: YEARS; Y-axis: LOGARITHMS)

Comments. Since the graph is ascending throughout, it reflects a positive rate of growth of the net worth for the entire period. However, since the graph is steepest for the period 1990-91 to 1991-92, it represents the highest rate of positive growth (increase) during this period.
Then, however, there is a slight decline in the rate of increase for the period 1991-92 to 1992-93. There is again an increase in the rate of growth for the period 1992-93 to 1994-95 over the period 1991-92 to 1992-93. Further, since the graph for the period 1992-93 to 1994-95 is almost a straight line, it represents a constant rate of increase during this period.

Example 4·37. The following table gives the population of India at intervals of 10 years :

Year     Population
1931     27,88,67,430
1941     31,85,39,060
1951     36,09,50,365
1961     43,90,72,582
1971     54,79,49,809

Plot the data on a graph paper. From your graph determine the decade in which the rate of growth of population was
(i) the slowest,
(ii) the fastest.

Solution. Since we are interested in determining the rate of growth for different decades, the appropriate graph will be obtained on plotting the data on a semi-logarithmic or ratio scale. Logarithms of the population values are plotted along the vertical axis on a natural scale and the time variable (decades) is plotted along the horizontal axis on a natural scale.

Year     Population (Y) (in lakhs)    log (Y)
1931            2789                   3·45
1941            3185                   3·50
1951            3610                   3·56
1961            4391                   3·64
1971            5479                   3·74

The graph is shown in Fig. 4·59.

Fig. 4·59. SEMI-LOG GRAPH (POPULATION OF INDIA AT INTERVALS OF 10 YEARS) (X-axis: YEARS; Y-axis: LOGARITHMS)

Comments. Since the graph is ascending throughout, it reflects a positive rate of population growth throughout the entire period. Further, since the graph is steepest during the period 1961–1971, the rate of growth is maximum during this decade. Again, since the graph is least steep for the period 1941–1951, the rate of growth is minimum during this decade.

4·5.
LIMITATIONS OF DIAGRAMS AND GRAPHS Diagrams and graphs are very powerful and effective visual statistical aids for presenting the set of numerical data but they have their limitations some of which are outlined below : (i) Diagrams and graphs help in simplifying the textual and tabulated facts and thus may be regarded as supplementary to statistical tables. They should not be regarded as substitutes for classification, tabulation and some other forms of presentation of a set of numerical data under all circumstances and for all purposes. Julin has very elegantly stated this limitation in the following words : “Graphic statistic has a role to play of its own ; it is not the servant of numerical statistics but it cannot pretend, on the other hand, to precede or displace it”. (ii) They give only general idea of data so as to make it readily intelligible and thus furnish only limited and approximate information. For detailed and precise information we have to refer back to the original statistical tables. Accordingly, diagrams and graphs should be used to explain and impress the significance of statistical facts to the general public who find it difficult to understand and follow the numerical figures. They are, therefore, appealing to a layman who does not have any statistical background but not to a statistician because they are not amenable to further mathematical treatment and hence are not of much use to him from analysis point of view. (iii) They are subjective in character and therefore, may be interpreted differently by different people. If the same set of data are presented diagrammatically (graphically) on two different scales, the sizes of the diagrams (graphs) so obtained might differ widely and thus generally, might create wrong and misleading impressions on the minds of the people. Hence, they are likely to be mis-used by unscrupulous and dishonest people to serve their selfish motives during advertisements, publicity, etc. 
Hence, they should not be accepted on their face value without proper scrutiny and caution. (iv) All the diagrams and graphs are not easy to construct. Two and three-dimensional diagrams, and ratio graphs require more time and great amount of expertise and skill for their construction and interpretation and are not readily perceptible to non-mathematical person. (v) In case of large figures (observations), such a presentation fails to reveal small differences in them. (vi) The choice of a particular diagram or graph to present a given set of data requires great expertise, skill and intelligence on the part of the statistician or the concerned agency engaged in the work. A wrong type of diagram/graph may lead to very fallacious and misleading conclusions. In this context C.W. Lowe writes : “The important point that must be borne in mind at all times is that the pictorial presentation, chosen for any situation, must depict the true relationship and point out the proper conclusion. Use of an inappropriate chart may distort the facts and mislead the reader. Above all, the chart must be honest.” (vii) Diagrammatic presentation should be used only for comparison of different sets of data which relate either to the same phenomenon or different phenomena which are capable of measurement in the same unit. They are not useful, if absolute information is to be represented. EXERCISE 4·2 1. (a) Discuss the utility and limitations of graphic method of presenting statistical data. (b) Discuss the advantages and limitations of representing statistical data by diagrams (including graphs). (c) What are the general rules of graphical presentation of data ? [C.S. (Foundation), June 2001] (d) Explain the advantages of graphic representation of statistical data. 2. (a) What are various types of graphs used for presenting a frequency distribution. Discuss briefly their (i) construction and (ii) relative merits and demerits. 
(b) Explain briefly the various methods that are used for graphical representation of frequency distribution. 4·54 BUSINESS STATISTICS 3. Give an illustration each of the type of data for which you would expect the frequency curve to be : (i) fairly symmetrical, (ii) positively skewed, (iii) negatively skewed, (iv) J-shaped, (v) U-shaped. 4. Comment on the following : (a) “The wandering of a line is more powerful in its effect on the mind than a tabulated statement; it shows what is happening and what is likely to take place just as quickly as the eye is capable of working.” —Boddington (b) “Graphs are dynamic, dramatic. They may epitomise an epoch, each dot a fact, each slope an event, each curve a history ; wherever there are data to record, inferences to draw, or facts to tell, graphs furnish the unrivalled means whose power we are just beginning to realise and apply.” —Hubbard 5. Explain clearly the distinctions between “natural scale” and “semi-logarithmic scale” used in the graphical presentation of data. 6. (a) What do you mean by a false base line ? Explain its utility in graphic representation of statistical data. (b) “A false base line of a graph is a wrong base line.” Comment. (c) What is false base line ? Under what circumstances should it be used ? 7. Describe briefly the construction of histogram and frequency polygon of a frequency distribution and state their uses. Prepare a Histogram and a Frequency Polygon from the following data : Class : 0—6 6—12 12—18 18—24 24—30 30—36 f : 4 8 15 20 12 6 8. Draw a histogram of the following distribution : Life of Electric Lamp (in hours) (Mid-values) : 1010 1030 1050 1070 1090 Firm A (No. of lamps) : 10 130 482 360 18 9. From the following data draw (i) Histogram, (ii) Frequency polygon and (iii) Frequency curve: Class 0—10 10—20 20—30 30—40 40—50 50—60 60—70 70—80 80—90 Frequency 4 6 7 14 16 14 8 6 5 [C.S. (Foundation), June 2000] 10. Draw a histogram from the following data : Daily wages (Rs.) 
: 10—20 20—30 30—40 40—60 60—80 80—110
No. of employees : 5 10 12 28 20 24
[C.S. (Foundation), June 2002]

11. Draw a histogram to represent the following distribution :

Monthly income        No. of families
0—750                 15
750—1000              51
1000—1500             199
1500—2000             240
2000—3000             324
3000—5000             309
5000—7500             162
7500—10000            66
10000—15000           50
15000—25000           27
Total                 1443

How many families can be expected to have monthly income between 3500 and 4250 rupees ?
Hint : $\frac{4250 - 3500}{2000} \times 309$ = 115·88 ≈ 116.

12. (a) What are cumulative frequencies ? How do you present them diagrammatically for discrete and continuous distributions ?
(b) What is a cumulative frequency curve ? Mention its kinds. Take an example to illustrate them.
(c) Explain the difference between a histogram and frequency polygon. What is an ogive curve ? State the purpose for which it is used.

13. Draw the ‘less than’ and ‘more than’ ogive curves from the data given below :
Weekly Wages (’00 Rs.) : 0—20 20—40 40—60 60—80 80—100
No. of Workers : 10 20 40 20 10
[C.S. (Foundation), Dec. 2000]

14. For the following distribution of wages, draw ogive and hence find the value of median.

Monthly Wages        Frequency
12·5—17·5            2
17·5—22·5            22
22·5—27·5            10
27·5—32·5            14
32·5—37·5            3
37·5—42·5            4
42·5—47·5            6
47·5—52·5            1
52·5—57·5            1
Total                63

Ans. Md. = 26 (approx.)

15. Age distribution of 200 employees of a firm is given below. Construct a less than ogive curve and hence or otherwise calculate the semi-inter-quartile range $(Q_3 - Q_1)/2$ of the distribution.
Age in years (less than) : 25 30 35 40 45 50 55
No. of employees : 10 25 75 130 170 189 200

16. The following table gives the distribution of the wages of 65 employees in a factory :
Wages in Rs.
(Equal to or more than) : 50 60 70 80 90 100 110 120
Number of Employees : 65 57 47 31 17 7 2 0
Draw a ‘less than’ ogive curve from the above data, and estimate the number of employees earning at least Rs. 63 but less than Rs. 75.
Ans. 15

17. Draw a less than cumulative frequency curve of the following distribution and find the limits for the central 60% of the distribution from the graph.
x (less than) : 5 10 15 20 25 30 35 40 45
Frequency : 2 11 29 45 69 83 90 96 100
Ans. 13 to 29·8.

18. Draw a less than ogive from the following data :
Weekly Income (Rs.) (equal to or more than) : 12,000 11,000 10,000 8,000 6,000 4,000 3,000 2,000 1,000
No. of Families : 0 6 14 26 42 54 62 70 80
From the graph estimate the number of families in the income range Rs. 2,400 to Rs. 10,500. Also find the maximum income of the lowest 25% of the families.
[Delhi Univ., B.Com. (Hons.), 2006]
Ans. The corresponding frequency distribution is :
Income (in Rs. ’00) : 10—20 20—30 30—40 40—60 60—80 80—100 100—110 110—120
No. of families : 10 8 8 12 16 12 8 6
57 (Approximately); Q1 = Rs. 3,250.

19. The monthly profits (Rs. lakhs) earned by 100 companies during the financial year 2002-03 are given in the table below :
Monthly Profit (Rs. lakhs) : 20—30 30—40 40—50 50—60 60—70 70—80 80—90 90—100
Number of companies : 4 8 18 30 15 10 8 7
Draw the OGIVE by “less than method” and “more than method.”
[Delhi Univ., (FMS), M.B.A., March 2004]

20. Construct a frequency table for the following data regarding annual profits, in thousands of rupees, in 50 firms, taking 25—34, 35—44, etc., as class intervals.

28 35 61 29 36 48 57 67 69 50
48 40 47 42 41 37 51 62 63 33
31 32 35 40 38 37 60 51 54 56
37 46 42 38 61 59 58 44 39 57
38 44 45 45 47 38 44 47 47 64

Construct a less than ogive and find :
(i) Number of firms having profit between Rs. 37,000 and Rs. 58,000.
(ii) Profit above which 10% of the firms will have their profits.
(iii) Middle 50% profit group.
Ans. (i) 30, (ii) Rs. 62,000, (iii) Rs. 39,000 to Rs. 56,000.

21.
What do you mean by a historigram ? How does it differ from histogram ?

22. (a) What are different types of graphs commonly used to present a time series data ? Bring out their salient features.
(b) Describe briefly : (i) Historigram, (ii) Silhouette or Net Balance Graph, (iii) Range Graph, (iv) Band Graph, for presenting time series data.

23. Represent the data relating to consolidated budgetary position of states in India as given below, on a graph paper.

(Rs. crore)
Year           Revenue        Expenditure        Surplus or Deficit
1955-56        560·1          626·4              –66·3
1956-57        577·0          654·3              –77·3
1957-58        705·6          677·3              +28·3
1958-59        742·1          745·8              –3·7
1959-60        833·9          829·9              +4·0

Also depict graphically, the net balance of trade.

24. Represent the following data by means of a time series graph.
Year : 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
Export (Rs. ’000) : 267 269 263 275 270 280 282 272 265 266
Import (Rs. ’000) : 307 310 280 260 275 271 280 280 260 265
Show also the net balance of trade.

25. (a) What is a false base line ? When is it used on an arithmetic line graph ?
[Delhi Univ. B.Com. (Pass), 2001]
(b) Prepare a graph of the following data by using a false base line.

Consumer Price Index Numbers (1960 = 100)
Centre           1969    1970    1971    1972    1973    1974
All India        177     186     192     207     250     360
Delhi            185     199     211     222     265     337

26. Present the following hypothetical data graphically.
AREA AND PRODUCTION OF RICE IN INDIA
Year : 1987 1988 1989 1990 1991 1992
Area (Million Acres) : 174·1 177·3 176·1 177·9 179·3 179·1
Production (Million Tonnes) : 72·5 77·8 74·8 77·2 78·0 74·8

27. Marks obtained by 100 students in Economics are given below. Draw an appropriate graph to represent them :

Marks          Males        Females        Total
30—40          8            6              14
40—50          6            10             16
50—60          14           6              20
60—70          13           12             25
70—80          12           13             25
Total          53           47             100

[C.S. (Foundation), Dec. 1999]
Hint : Draw three graphs for males, females and total on the same graph paper.

28.
Present the following data about India by a suitable graph :

PRODUCTION IN MILLION TONS
Year        Rice        Wheat        Pulses        Other Cereals        Total
1962        30          10           10            14                   64
1963        32          11           8             18                   69
1964        33          8·5          11·5          20                   73
1965        35          12           11            20                   78
1966        36          10           10            22                   78
1967        38          11           9             23                   81

Hint : Band Graph.

29. Present the following data by a suitable graph :

MINIMUM AND MAXIMUM PRICE OF GOLD FOR 10 GMS. FOR THE YEAR 1967
Months           Highest Price (Rs.)        Lowest Price (Rs.)
January          160·0                      152·0
February         162·2                      156·0
March            165·0                      160·3
April            166·5                      162·4
May              168·2                      160·5
June             170·0                      161·9
July             175·0                      163·2
August           175·8                      160·0
September        172·2                      165·0
October          178·0                      168·0
November         171·0                      165·0
December         175·5                      167·0

Hint : Range Graph.

30. (a) Differentiate between the natural scale and logarithmic scale used in graphic presentation of data. In which cases should the latter scale be used ?
(b) Explain what is meant by semi-logarithmic diagram and discuss its advantages over the natural scale diagram.
(c) Explain briefly how you will interpret the graphs drawn on a semi-logarithmic scale.
(d) What do you understand by a ratio-scale ? Under what situations should ratio charts be drawn ?

31. The following table shows the total sales of Gold Bonds by the Reserve Bank of India :

Month            Year        Rs. (’000)
October          1965        15,560
November         1965        13,170
December         1965        18,740
January          1966        12,450
February         1966        8,320
March            1966        7,540
April            1966        3,250
May              1966        3,570
June             1966        3,620
July             1966        3,140
August           1966        2,580
September        1966        2,540

Represent the data graphically on the logarithmic scale.

32. Plot the following data graphically on the logarithmic scale.

Year           Total notes issued (in crores Rs.)        Total notes in circulation (in crores Rs.)
1965—66        2890                                      2866
1966—67        3065                                      3020
1967—68        3242                                      3194
1968—69        3536                                      3497
1969—70        3866                                      3843

33.
Present the following data graphically and comment on the features thus revealed :

Production of steel plates (in thousand tons)
Year        Unit A        Unit B
1990        30            40
1992        29            20
1994        31            10
1996        30            20
1998        30            30
2000        30            40
2002        30            60

What will the graph look like if the data are plotted on a semi-logarithmic scale ?

5
Averages or Measures of Central Tendency

5·1. INTRODUCTION
One of the important objectives of statistical analysis is to determine various numerical measures which describe the inherent characteristics of a frequency distribution. The first of such measures is the average. Averages are measures which condense a huge unwieldy set of numerical data into single numerical values which are representative of the entire distribution. In the words of Prof. R.A. Fisher, “The inherent inability of the human mind to grasp in its entirety a large body of numerical data compels us to seek relatively few constants that will adequately describe the data”. Averages are among these few constants. They provide the gist and give a bird’s eye view of the huge mass of unwieldy numerical data.

Averages are the typical values around which other items of the distribution congregate. They are values which lie between the two extreme observations (i.e., the smallest and the largest observations) of the distribution and give us an idea about the concentration of the values in the central part of the distribution. Accordingly, they are also sometimes referred to as the Measures of Central Tendency.

Averages are very useful :
(i) For describing the distribution in a concise manner.
(ii) For comparative study of different distributions.
(iii) For computing various other statistical measures such as dispersion, skewness, kurtosis and various other basic characteristics of a mass of data.

Remark. Averages are also sometimes referred to as Measures of Location since they enable us to locate the position or place of the distribution in question.
We give below some definitions of an average as given by different statisticians from time to time. WHAT THEY SAY ABOUT AVERAGES — SOME DEFINITIONS “Averages are statistical constants which enable us to comprehend in a single effort the significance of the whole.”—A.L. Bowley “An average is a single value selected from a group of values to represent them in some way, a value which is supposed to stand for whole group of which it is part, as typical of all the values in the group.”—A.E. Waugh “An average value is a single value within the range of the data that is used to represent all of the values in the series. Since an average is somewhere within the range of the data, it is sometimes called a measure of central value.”—Croxton and Cowden “An average is sometimes called a measure of central tendency because individual values of the variable usually cluster around it. Averages are useful, however, for certain types of data in which there is little or no central tendency.”—Crum and Smith “Statistical analysis seeks to develop concise summary figures which describe a large body of quantitative data. One of the most widely used set of summary figures is known as measures of location, which are often referred to as averages, measures of central tendency or central location. The purpose for computing an average value for a set of observations is to obtain a single value which is representative of all the items and which the mind can grasp simply and quickly. The single value is the point or location around which the individual items cluster.” —Lawrence J. Kaplan 5·2 BUSINESS STATISTICS 5·2. REQUISITES OF A GOOD AVERAGE OR MEASURE OF CENTRAL TENDENCY According to Prof. Yule, the following are the desideratta (requirements) to be satisfied by an ideal average or measure of central tendency : (i) It should be rigidly defined i.e., the definition should be clear and unambiguous so that it leads to one and only one interpretation by different persons. 
In other words, the definition should not leave anything to the discretion of the investigator or the observer. If it is not rigidly defined then the bias introduced by the investigator will make its value unstable and render it unrepresentative of the distribution. (ii) It should be easy to understand and calculate even for a non-mathematical person. In other words, it should be readily comprehensible and should be computed with sufficient ease and rapidity and should not involve heavy arithmetical calculations. However, this should not be accomplished at the expense of accuracy or some other advantages which an average may possess. (iii) It should be based on all the observations. Thus, in the computation of an ideal average the entire set of data at our disposal should be used and there should not be any loss of information resulting from not using the available data. Obviously, if the whole data is not used in computing the average, it will be unrepresentative of the distribution. (iv) It should be suitable for further mathematical treatment. In other words, the average should possess some important and interesting mathematical properties so that its use in further statistical theory is enhanced. For example, if we are given the averages and sizes (frequencies) of a number of different groups then for an ideal average we should be in a position to compute the average of the combined group. If an average is not amenable to further algebraic manipulation, then obviously its use will be very much limited for further applications in statistical theory. (v) It should be affected as little as possible by fluctuations of sampling. By this we mean that if we take independent random samples of the same size from a given population and compute the average for each of these samples then, for an ideal average, the values so obtained from different samples should not vary much from one another. 
The difference in the values of the average for different samples is attributed to the so-called fluctuations of sampling. This property is also explained by saying that an ideal average should possess sampling stability.
(vi) It should not be affected much by extreme observations. By extreme observations we mean very small or very large observations. Thus a few very small or very large observations should not unduly affect the value of a good average.

5·3. VARIOUS MEASURES OF CENTRAL TENDENCY
The following are the five measures of central tendency or measures of location which are commonly used in practice.
(i) Arithmetic Mean or simply Mean
(ii) Median
(iii) Mode
(iv) Geometric Mean
(v) Harmonic Mean
In the following sections we shall discuss them in detail one by one.

5·4. ARITHMETIC MEAN
Arithmetic mean of a given set of observations is their sum divided by the number of observations. For example, the arithmetic mean of 5, 8, 10, 15, 24 and 28 is

\[ \frac{5 + 8 + 10 + 15 + 24 + 28}{6} = \frac{90}{6} = 15 \]

In general, if $X_1, X_2, \ldots, X_n$ are the given n observations, then their arithmetic mean, usually denoted by $\bar{X}$, is given by :

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n} \tag{5·1} \]

where $\sum X$ is the sum of the observations.

In case of frequency distribution :

X :   $X_1$   $X_2$   $X_3$   …   $X_n$
f :   $f_1$   $f_2$   $f_3$   …   $f_n$

the arithmetic mean $\bar{X}$ is given by :

\[ \bar{X} = \frac{(X_1 + X_1 + \cdots\ f_1\ \text{times}) + (X_2 + X_2 + \cdots\ f_2\ \text{times}) + \cdots + (X_n + X_n + \cdots\ f_n\ \text{times})}{f_1 + f_2 + \cdots + f_n} \]

\[ \bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum fX}{\sum f} = \frac{\sum fX}{N} \tag{5·2} \]

where $N = \sum f$ is the total frequency.

In case of continuous or grouped frequency distribution, the value of X is taken as the mid-value of the corresponding class.

Remark. The symbol ∑ is the capital letter sigma of the Greek alphabet and is used in mathematics to denote the sum of values.

Steps for the Computation of Arithmetic Mean
1.
Multiply each value of X or the mid-value of the class (in case of grouped or continuous frequency distribution) by the corresponding frequency f.
2. Obtain the total of the products obtained in step 1 above to get ∑fX.
3. Divide the total obtained in step 2 by N = ∑f, the total frequency. The resulting value gives the arithmetic mean.

Example 5·1. The intelligence quotients (I.Q.’s) of 10 boys in a class are given below :
70, 120, 110, 101, 88, 83, 95, 98, 107, 100
Find the mean I.Q.

Solution. Mean I.Q. ($\bar{X}$) of the 10 boys is given by :

\[ \bar{X} = \frac{\sum X}{n} = \frac{1}{10}\,(70 + 120 + 110 + 101 + 88 + 83 + 95 + 98 + 107 + 100) = \frac{972}{10} \]

i.e., $\bar{X}$ = 97·2.

Example 5·2. The following is the frequency distribution of the number of telephone calls received in 245 successive one-minute intervals at an exchange :
Number of Calls : 0 1 2 3 4 5 6 7
Frequency : 14 21 25 43 51 40 39 12
Obtain the mean number of calls per minute.

Solution. Let the variable X denote the number of calls received per minute at the exchange.

COMPUTATION OF MEAN NUMBER OF CALLS
No. of Calls (X)        Frequency (f)        fX
0                       14                   0
1                       21                   21
2                       25                   50
3                       43                   129
4                       51                   204
5                       40                   200
6                       39                   234
7                       12                   84
Total                   N = 245              ∑fX = 922

Mean number of calls per minute at the exchange is given by :

\[ \bar{X} = \frac{\sum fX}{N} = \frac{922}{245} \]

i.e., $\bar{X}$ = 3·763.

5·4·1. Step Deviation Method for Computing Arithmetic Mean. It may be pointed out that the formula (5·2) can be used conveniently if the values of X or/and f are small. However, if the values of X or/and f are large, the calculation of mean by the formula (5·2) is quite tedious and time consuming. In such a case the calculations can be reduced to a great extent by using the step deviation method, which consists in taking the deviations (differences) of the given observations from any arbitrary value A. Let

\[ d = X - A \tag{5·3} \]

then,

\[ \bar{X} = A + \frac{\sum fd}{N} \tag{5·4} \]

This formula is much more convenient to use for numerical problems than the formula (5·2).
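As a rough check that formulas (5·2) and (5·4) agree, the sketch below recomputes the mean of Example 5·2 both ways. The function names and the choice A = 4 are illustrative assumptions and do not come from the text; exact fractions are used so that the two results can be compared without rounding.

```python
from fractions import Fraction

# Frequency distribution of Example 5·2: calls per minute and their frequencies.
calls = [0, 1, 2, 3, 4, 5, 6, 7]
freq = [14, 21, 25, 43, 51, 40, 39, 12]


def mean_direct(values, freqs):
    """Formula (5·2): sum of f*X divided by N = sum of f."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    return Fraction(total_fx, sum(freqs))


def mean_by_deviations(values, freqs, a):
    """Formula (5·4): A + (sum of f*d) / N, where d = X - A."""
    total_fd = sum(f * (x - a) for x, f in zip(values, freqs))
    return a + Fraction(total_fd, sum(freqs))


direct = mean_direct(calls, freq)
# A = 4 is an arbitrary (hypothetical) choice; any other number gives the same mean.
shortcut = mean_by_deviations(calls, freq, a=4)
print(float(direct), float(shortcut))
```

Both calls return the same exact fraction, 922/245, which is the sense in which A is "arbitrary": it changes the size of the intermediate products but not the result.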
In case of grouped or continuous frequency distribution, with class intervals of equal magnitude, the calculations are further simplified by taking :

\[ d = \frac{X - A}{h} \tag{5·5} \]

where X is the mid-value of the class and h is the common magnitude of the class intervals. Then

\[ \bar{X} = A + h\,\frac{\sum fd}{N} \tag{5·6} \]

Steps for Computation of Mean by Step Deviation Method in (5·6)
Step 1. Compute d = (X – A)/h, A being any arbitrary number and h the common magnitude of the classes. Algebraic signs + or – are to be taken with the deviations.
Step 2. Multiply d by the corresponding frequency f to get fd.
Step 3. Find the sum of the products obtained in step 2 to get ∑fd.
Step 4. Divide the sum obtained in step 3 by N, the total frequency.
Step 5. Multiply the value obtained in step 4 by h.
Step 6. Add A to the value obtained in step 5. The resulting value gives the arithmetic mean of the given distribution.

Remarks 1. If we take h = 1, then formula (5·6) reduces to formula (5·4).
2. Any number can serve the purpose of the arbitrary constant ‘A’ used in (5·4) and (5·6) but generally the value of X corresponding to the middle part of the distribution will be more convenient. In fact, ‘A’ need not necessarily be one of the values of X.

Example 5·3. Calculate the mean for the following frequency distribution :
Marks : 0—10 10—20 20—30 30—40 40—50 50—60 60—70
Number of students : 6 5 8 15 7 6 3
(i) By the direct formula. (ii) By the step deviation method.

Solution.
COMPUTATION OF ARITHMETIC MEAN
Marks        Mid-value (X)        Number of Students (f)        fX        d = (X – 35)/10        fd
0—10         5                    6                             30        –3                     –18
10—20        15                   5                             75        –2                     –10
20—30        25                   8                             200       –1                     –8
30—40        35                   15                            525       0                      0
40—50        45                   7                             315       1                      7
50—60        55                   6                             330       2                      12
60—70        65                   3                             195       3                      9
Total                             N = ∑f = 50                   ∑fX = 1670                       ∑fd = –8

(i) Direct Formula :

\[ \text{Mean } (\bar{X}) = \frac{\sum fX}{\sum f} = \frac{1670}{50} \]

i.e., $\bar{X}$ = 33·4 marks.

(ii) Step Deviation Method : In the usual notations we have A = 35 and h = 10.

\[ \therefore\quad \bar{X} = A + \frac{h \sum fd}{N} = 35 + \frac{10 \times (-8)}{50} \]

i.e., $\bar{X}$ = 35 – 1·6 = 33·4 marks.

5·4·2.
Mathematical Properties of Arithmetic Mean. Arithmetic mean possesses some very interesting and important mathematical properties as given below :

Property 1. The algebraic sum of the deviations of the given set of observations from their arithmetic mean is zero. Mathematically,

\[ \sum (X - \bar{X}) = 0, \tag{5·7} \]

or for a frequency distribution :

\[ \sum f(X - \bar{X}) = 0 \tag{5·7a} \]

Proof.

\[ \sum f(X - \bar{X}) = \sum (fX - f\bar{X}) = \sum fX - \sum f\bar{X} = \sum fX - \bar{X} \sum f = \sum fX - \bar{X} \cdot N \]

[since $\bar{X}$ is a constant and $\sum f = N$]

\[ \therefore\quad \sum f(X - \bar{X}) = N\bar{X} - \bar{X} \cdot N = 0 \]

[since $\bar{X} = \frac{1}{N} \sum fX \Rightarrow \sum fX = N\bar{X}$]

Remarks 1. In computing the algebraic sum of deviations, we take into consideration the plus and minus signs of the deviations $(X - \bar{X})$, as against the absolute deviations (cf. Mean Deviation in Chapter 6) where we ignore the signs of the deviations.
2. Verification of Property 1 from the data of Example 5·3.

ALGEBRAIC SUM OF DEVIATIONS FROM MEAN
Marks        X         f         X – X̄ = X – 33·4        f (X – X̄)
0—10         5         6         –28·4                    –170·4
10—20        15        5         –18·4                    –92·0
20—30        25        8         –8·4                     –67·2
30—40        35        15        1·6                      24·0
40—50        45        7         11·6                     81·2
50—60        55        6         21·6                     129·6
60—70        65        3         31·6                     94·8
Total                                                     ∑f(X – X̄) = 0

Thus $\sum f(X - \bar{X}) = 0$, as required.

When the values of the variable are given without frequencies, the property takes the form $\sum (X - \bar{X}) = 0$. As a simple illustration, let us consider the following case.

X :                  1    2    3    4    5    6    7        ∑X = 28
X – X̄ = X – 4 :     –3   –2   –1   0    1    2    3        ∑(X – X̄) = 0

\[ \bar{X} = \frac{\sum X}{n} = \frac{28}{7} = 4 \]

Hence $\sum (X - \bar{X}) = 0$.

Property 2. Mean of the Combined Series. If we know the sizes and means of two component series, then we can find the mean of the resultant series obtained on combining the given series.
If $n_1$ and $n_2$ are the sizes and $\bar{X}_1$, $\bar{X}_2$ are the respective means of two groups, then the mean $\bar{X}$ of the combined group of size $n_1 + n_2$ is given by :

\[ \bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} \tag{5·8} \]

Proof.
We know that if $\bar{X}$ is the mean of n observations, then

\[ \bar{X} = \frac{\sum X}{n} \quad\Rightarrow\quad \sum X = n\bar{X} \]

i.e., Sum of n observations = n × Arithmetic Mean …(*)

If $\bar{X}_1$ is the mean of $n_1$ observations of the first group and $\bar{X}_2$ is the mean of $n_2$ observations of the second group, then on using (*), we get
The sum of $n_1$ observations of the first group = $n_1 \times \bar{X}_1 = n_1 \bar{X}_1$
The sum of the other $n_2$ observations of the second group = $n_2 \times \bar{X}_2 = n_2 \bar{X}_2$
∴ The sum of $(n_1 + n_2)$ observations of the combined group = $n_1 \bar{X}_1 + n_2 \bar{X}_2$ …(**)

Hence, the mean $\bar{X}$ of the combined group of $n_1 + n_2$ observations is given by :

\[ \bar{X} = \frac{\text{Sum of } (n_1 + n_2) \text{ observations}}{n_1 + n_2} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} \qquad [\text{From (**)}] \]

Remarks 1. Some writers use the notation $\bar{X}_{12}$ for the combined mean of two groups and thus we may write :

\[ \bar{X}_{12} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} \tag{5·8a} \]

2. Generalisation. In general, if $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$ are the arithmetic means of k groups with $n_1, n_2, \ldots, n_k$ observations respectively, then we can similarly prove that the mean $\bar{X}$ of the combined group of size $n_1 + n_2 + \cdots + n_k$ is given by :

\[ \bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \cdots + n_k \bar{X}_k}{n_1 + n_2 + \cdots + n_k} \tag{5·8b} \]

Property 3. The sum of the squares of deviations of the given set of observations is minimum when taken from the arithmetic mean. Mathematically, for a given frequency distribution, the sum

\[ S = \sum f(X - A)^2, \tag{5·9} \]

which represents the sum of the squares of deviations of given observations from any arbitrary value ‘A’, is minimum when $A = \bar{X}$.
This means that, if for any set of data, we compute :
$S_1$ = Sum of squared deviations from mean = $\sum (X - \bar{X})^2$, and
$S$ = Sum of squared deviations from any arbitrary point A = $\sum (X - A)^2$ ; $A \neq \bar{X}$,
then $S_1$ is always less than S, i.e., $S_1 < S$.

Illustration of Property 3. Let us consider the values of the variable X as 1, 2, 3, 4, 5, 6, 7.

TABLE 1.
SUM OF SQUARED DEVIATIONS FROM MEAN
X :                  1    2    3    4    5    6    7        Total 28
X – X̄ = X – 4 :     –3   –2   –1   0    1    2    3        ∑(X – X̄) = 0
(X – X̄)² :          9    4    1    0    1    4    9        ∑(X – X̄)² = 28

TABLE 2. SUM OF SQUARED DEVIATIONS ABOUT ARBITRARY POINT A = 5
X :                  1    2    3    4    5    6    7        Total 28
X – A = X – 5 :      –4   –3   –2   –1   0    1    2
(X – A)² :           16   9    4    1    0    1    4        ∑(X – A)² = 35

\[ \text{Mean } (\bar{X}) = \frac{\sum X}{7} = \frac{28}{7} = 4 \]

The sum of the squared deviations of the given observations from their mean, in this case, is $\sum (X - \bar{X})^2 = 28$. (From Table 1)
If, instead, we take the deviations of the values X from any arbitrary point A ($A \neq \bar{X}$) and compute the sum of squared deviations about A, viz., $\sum (X - A)^2$, then this sum will be greater than 28 for every such A. Let us in particular take A = 5 (not equal to the mean $\bar{X} = 4$). Thus

\[ \sum (X - A)^2 = \sum (X - 5)^2 = 35, \qquad (\text{From Table 2}) \]

which is greater than the sum of squared deviations about the mean, viz., 28.

Property 4. We have :

\[ \bar{X} = \frac{\sum fX}{N} \quad\Rightarrow\quad \sum fX = N\bar{X} \tag{5·10} \]

Result (5·10) is quite useful in the following problems :
(a) If we are given the mean wages ($\bar{X}$) of a number of workers (N) in a factory, then using (5·10) we can determine the total wage bill of the factory.
(b) Wrong Observations. Suppose we compute the mean $\bar{X}$ of N observations and later on it is found that one, two or more of the observations were wrongly copied down. It is now required to compute the corrected mean by replacing the wrong observations by the correct ones. By using (5·10), we can obtain the uncorrected sum of the observations, which is given by $N\bar{X}$. From this, if we subtract the wrong observations, say, $X_1'$ and $X_2'$, and add the corresponding correct observations, say, $X_1$ and $X_2$, we obtain the corrected sum of the observations, which is given by

\[ N\bar{X} - (X_1' + X_2') + X_1 + X_2 \]

Dividing this by N, we get the corrected mean.
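The correction just described can be sketched in a few lines. The function name `corrected_mean` and the figures in the usage line are hypothetical, chosen only for illustration; they are not taken from the text or from Example 5·11.

```python
def corrected_mean(n, reported_mean, wrong, correct):
    """Corrected mean after replacing miscopied observations.

    Uses result (5·10), sum = N * mean: the wrong values are removed from
    the uncorrected sum, the correct values are added, and the corrected
    sum is divided by N again.
    """
    if len(wrong) != len(correct):
        raise ValueError("each wrong value needs exactly one correct value")
    corrected_sum = n * reported_mean - sum(wrong) + sum(correct)
    return corrected_sum / n


# Hypothetical figures: 20 observations with reported mean 50, where the
# values 35 and 48 had been copied down as 53 and 84.
print(corrected_mean(20, 50, wrong=[53, 84], correct=[35, 48]))
```

Because the function takes lists, the same sketch covers one, two or more miscopied values, provided each wrong value is paired with its correct replacement.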
In general, if r observations are misread as $X_1', X_2', \ldots, X_r'$ while the correct observations are $X_1, X_2, \ldots, X_r$, then the corrected sum of observations is given by

\[ N\bar{X} - (X_1' + X_2' + \cdots + X_r') + (X_1 + X_2 + \cdots + X_r) \]

Dividing this sum by N, we get the corrected mean. For numerical illustration see Example 5·11.

Remark. If a constant β is added to, subtracted from, multiplied into or divided into every observation of a series, the mean is also increased by, decreased by, multiplied by or divided by the same constant (with β ≠ 0 in the case of division).
Let the given observations of the series be $X_1, X_2, \ldots, X_n$ with mean $\bar{X}$, given by

\[ \bar{X} = \frac{1}{n}\,(X_1 + X_2 + \cdots + X_n) \qquad \ldots(*) \]

Let the new series obtained on adding, subtracting, multiplying and dividing each observation of the given series by a constant β be denoted by the variables U, V, W and Z respectively, so that :

\[ U = X + \beta, \qquad V = X - \beta, \qquad W = \beta X, \qquad Z = \frac{X}{\beta} \]

Then

\[ \bar{U} = \bar{X} + \beta, \qquad \bar{V} = \bar{X} - \beta, \qquad \bar{W} = \beta \bar{X}, \qquad \bar{Z} = \frac{1}{\beta}\,\bar{X} \]

Illustration. Let the mean of 10 observations be 35. If each observation is increased by 5, then the new mean is also increased by 5, i.e., it becomes 35 + 5 = 40. Similarly, if each observation is decreased by 5, the new mean will be 35 – 5 = 30. Further, if each observation is multiplied by 2, the new mean will be 2 × 35 = 70.

5·4·3. Merits and Demerits of Arithmetic Mean
Merits. In the light of the properties laid down by Prof. Yule for an ideal measure of central tendency, arithmetic mean possesses the following merits :
(i) It is rigidly defined.
(ii) It is easy to calculate and understand.
(iii) It is based on all the observations.
(iv) It is suitable for further mathematical treatment. The mean of the combined series is given by (5·8) or (5·8a). Moreover, it possesses many important mathematical properties (Properties 1 to 4 as discussed earlier) because of which it has very wide applications in statistical theory.
(v) Of all the averages, arithmetic mean is affected least by fluctuations of sampling.
This property is explained by saying that arithmetic mean is a stable average. Demerits. (i) The strongest drawback of arithmetic mean is that it is very much affected by extreme observations. Two or three very large values of the variable may unduly affect the value of the arithmetic mean. Let us consider an industrial complex which houses the workers and some big officials like general manager, chief engineer, architect etc. The average salary of the workers (skilled and unskilled) is, say, Rs. 8,000 per month. If the salaries of the few big bosses (who draw very high salaries) are also included, the average wage per worker comes out to be Rs. 12,000 say. Thus, if we say that the average salary of the workers in the factory is Rs. 12,000 p.m. it gives a very good impression and one is tempted to think that the workers are well paid and their standard of living is good. But the real picture is entirely different. Thus, in the case of extreme observations, the arithmetic mean gives a distorted picture and is no longer representative of the distribution and quite often leads to very misleading conclusions. Thus, while dealing with extreme observations, arithmetic mean should be used with caution. (ii) Arithmetic mean cannot be used in the case of open end classes such as less than 10, more than 70, etc., since for such classes we cannot determine the mid-value X of the class intervals unless (i) we estimate the end intervals or (ii) we are given the total value of the variable in the open end classes. In such cases mode or median (discussed latter) may be used. (iii) It cannot be determined by inspection nor can it be located graphically. (iv) Arithmetic mean cannot be used if we are dealing with qualitative characteristics which cannot be measured quantitatively such as intelligence, honesty, beauty, etc. In such cases median (discussed later) is the only average to be used. 
(v) Arithmetic mean cannot be obtained if a single observation is missing or lost or is illegible unless we drop it out and compute the arithmetic mean of the remaining values.
(vi) In an extremely asymmetrical (skewed) distribution, arithmetic mean is usually not representative of the distribution and hence is not a suitable measure of location.
(vii) Arithmetic mean may lead to wrong conclusions if the details of the data from which it is obtained are not available. In this connection it is worthwhile to quote the words of H. Secrist : “If an average is taken as a substitute for the details, then the arithmetic mean, in spite of the simplicity and ease of calculation, has little to recommend when series are non-homogeneous.” The following example will illustrate this viewpoint. Let us consider the following marks obtained by two students A and B in three tests, viz., terminal test, half-yearly examination and annual examination respectively.

Marks in :        Student A    Student B
I Test            55%          65%
II Test           60%          60%
III Test          65%          55%
Average marks     60%          60%

The average marks obtained by each of the two students at the end of the year are 60%. If we are given the average marks alone, we conclude that the level of intelligence of both the students at the end of the year is the same. This is a fallacious conclusion since we find from the data that student A has improved consistently while student B has deteriorated consistently.
(viii) Arithmetic mean may not be one of the values which the variable actually takes and is then termed a fictitious average. Sometimes, it may give meaningless results. In this context it is interesting to quote the remarks of the ‘Punch’ journal : “The figure of 2·2 children per adult female was felt to be in some respects absurd, and a Royal Commission suggested that middle classes be paid money to increase the average to a rounder and more convenient number”.

Example 5·4.
The numbers 3·2, 5·8, 7·9 and 4·5 have frequencies x, (x + 2), (x – 3) and (x + 6) respectively. If the arithmetic mean is 4·876, find the value of x.

Solution.

COMPUTATION OF MEAN

Number (X)    Frequency (f)    fX
3·2           x                3·2x
5·8           x + 2            5·8(x + 2)
7·9           x – 3            7·9(x – 3)
4·5           x + 6            4·5(x + 6)
Total         4x + 5           21·4x + 14·9

We have :

$$\sum f = x + (x + 2) + (x - 3) + (x + 6) = 4x + 5$$

$$\sum fX = 3·2x + 5·8(x + 2) + 7·9(x - 3) + 4·5(x + 6) = (3·2 + 5·8 + 7·9 + 4·5)x + 11·6 - 23·7 + 27·0 = 21·4x + 14·9$$

$$\text{Mean} = \frac{\sum fX}{\sum f} = \frac{21·4x + 14·9}{4x + 5} = 4·876 \quad \text{(Given)}$$

∴ 21·4x + 14·9 = 4·876(4x + 5)
⇒ 21·4x + 14·9 = 19·504x + 24·380
⇒ (21·400 – 19·504)x = 24·380 – 14·900
⇒ 1·896x = 9·480
⇒ x = 9·480 / 1·896 = 5

Example 5·5. In the following grouped data, X are the mid-values of the class intervals and c is a constant. If the arithmetic mean of the original distribution is 35·84, find its class intervals.

X – c :   –21   –14   –7    0    7    14    21    Total
f     :     2    12   19   29   20    13     5    100

[Delhi Univ. B.Com. (Hons.) (External), 2007]

Solution. Here X – c is the deviation d from the arbitrary point c, i.e., d = X – c. Hence, the mean of the distribution is given by :

$$\bar{X} = c + \frac{\sum fd}{N} = c + \frac{\sum f(X - c)}{N} \qquad \ldots(*)$$

where N = ∑f.

COMPUTATION OF MEAN AND CLASS INTERVALS

(X – c)    f          f(X – c)            X     Class interval
–21        2          –42                 14    10·5–17·5
–14        12         –168                21    17·5–24·5
–7         19         –133                28    24·5–31·5
0          29         0                   35    31·5–38·5
7          20         140                 42    38·5–45·5
14         13         182                 49    45·5–52·5
21         5          105                 56    52·5–59·5
Total      N = 100    ∑f(X – c) = 84

Using (*) we get

$$\bar{X} = c + \frac{\sum f(X - c)}{N} = 35·84 \quad \text{(Given)}$$

∴ c + 84/100 = 35·84 ⇒ c = 35·84 – 0·84 = 35

Also, X – c = 0 ⇒ X = c = 35.

Thus the mid-value of the class corresponding to the value X – c = 0 is X = 35. Further, since the magnitude of the class interval is 7, the corresponding class interval is obtained on subtracting (7/2) = 3·5 from 35 and adding it to 35, and is given by (35 – 3·5, 35 + 3·5), i.e., 31·5–38·5. The other mid-values follow from X = 35 + (X – c), and the class intervals are given in the last column of the above table.

Example 5·6.
Find the class intervals if the arithmetic mean of the following distribution is 33 and assumed mean 35 :

Step deviation :   –3   –2   –1    0   +1   +2
Frequency      :    5   10   25   30   20   10

Solution. Here the given step deviation is the deviation d, where d = (X – A)/h. Hence

$$\bar{X} = A + \frac{h\sum fd}{N}\,; \qquad N = \sum f, \quad A = 35 \text{ (given)} \qquad \ldots(*)$$

COMPUTATION OF CLASS INTERVALS

Step deviation (d)    Frequency (f)    fd            X     Class interval
–3                    5                –15           5     0–10
–2                    10               –20           15    10–20
–1                    25               –25           25    20–30
0                     30               0             35    30–40
1                     20               20            45    40–50
2                     10               20            55    50–60
Total                 N = 100          ∑fd = –20

Using (*),

$$\bar{X} = A + \frac{h\sum fd}{N} = 33 \quad \text{(Given)}$$

⇒ 33 = 35 – 20h/100 = 35 – 0·2h
⇒ 0·2h = 2 ⇒ h = 10

Also d = (X – A)/h ⇒ X = A + hd = 35 + 10d

Hence, we can calculate different values of X corresponding to the given values of d. Further, since the magnitude of each class interval is h = 10, we obtain the required class intervals by subtracting 10/2 = 5 from, and adding 5 to, each mid-value (X), as shown in the last column of the above table.

Example 5·7. From the following data of income distribution calculate the arithmetic mean. It is given that (i) the total income of persons in the highest group is Rs. 435, and (ii) none is earning less than Rs. 20.

Income (Rs.)     No. of persons
Below 30         16
Below 40         36
Below 50         61
Below 60         76
Below 70         87
Below 80         95
80 and over      5

Solution. The open class “Income below 30” includes the persons with income less than Rs. 30. But since we are given that none is earning less than Rs. 20, this class will be 20–30. Moreover, we are given the cumulative frequency distribution, which has to be converted into the ordinary frequency distribution as given in the following table :

COMPUTATION OF ARITHMETIC MEAN

Income         Mid-value (X)    No. of persons (f)    fX
20–30          25               16                    400
30–40          35               36 – 16 = 20          700
40–50          45               61 – 36 = 25          1125
50–60          55               76 – 61 = 15          825
60–70          65               87 – 76 = 11          715
70–80          75               95 – 87 = 8           600
80 and over    –                5                     435 *
Total                           ∑f = 100              ∑fX = 4,800

* It is given that the total income in the highest group is Rs.
435.

$$\therefore \quad \text{Arithmetic Mean} = \frac{\sum fX}{\sum f} = \frac{4,800}{100} = \text{Rs. } 48.$$

Example 5·8. An investor buys Rs. 1,200 worth of shares in a company each month. During the first 5 months he bought the shares at a price of Rs. 10, Rs. 12, Rs. 15, Rs. 20 and Rs. 24 per share. After 5 months what is the average price paid for the shares by him ?

Solution. Let X denote the price (in Rupees) of a share. Then the distribution of shares purchased during the first five months is as follows :

Month    Price per share (X)    Total cost (fX) (Rs.)    No. of shares bought (f)
1st      10                     1200                     1200/10 = 120
2nd      12                     1200                     1200/12 = 100
3rd      15                     1200                     1200/15 = 80
4th      20                     1200                     1200/20 = 60
5th      24                     1200                     1200/24 = 50
Total                           ∑fX = Rs. 6000           ∑f = 410

Hence, the average price paid per share for the first five months is

$$\bar{X} = \frac{\sum fX}{\sum f} = \frac{6000}{410} = \text{Rs. } 14·63.$$

Remark. For an alternative solution of this problem see Harmonic Mean, Example 5·58.

Example 5·9. For a certain frequency table which has only been partly reproduced here, the mean was found to be 1·46.

No. of accidents           :    0    1    2    3    4    5    Total
Frequency (No. of days)    :   46    ?    ?   25   10    5    200

Calculate the missing frequencies.

Solution. Let X denote the number of accidents and let the missing frequencies corresponding to X = 1 and X = 2 be f1 and f2 respectively.

COMPUTATION OF ARITHMETIC MEAN

No. of accidents (X)    Frequency (f)           fX
0                       46                      0
1                       f1                      f1
2                       f2                      2f2
3                       25                      75
4                       10                      40
5                       5                       25
Total                   86 + f1 + f2 = 200      f1 + 2f2 + 140

We have

200 = 86 + f1 + f2 ⇒ f1 + f2 = 200 – 86 = 114 …(*)

$$\bar{X} = \frac{\sum fX}{\sum f} = \frac{f_1 + 2f_2 + 140}{200} = 1·46 \quad \text{(Given)}$$

⇒ f1 + 2f2 + 140 = 1·46 × 200 = 292
⇒ f1 + 2f2 = 292 – 140 = 152 …(**)

Subtracting (*) from (**), we get f2 = 152 – 114 = 38.

Substituting in (*), we get f1 = 114 – f2 = 114 – 38 = 76.

Example 5·10.
The following are the hourly salaries in rupees of 20 employees of a firm :

130   62   145   118   125   76   151   142   110   65
116   100  103   71    85    80   122   132   98    95

The firm gives bonuses of Rs. 10, 15, 20, 25 and 30 for individuals in the respective salary groups exceeding Rs. 60 but not exceeding Rs. 80, exceeding Rs. 80 but not exceeding Rs. 100, and so on up to exceeding Rs. 140 but not exceeding Rs. 160. Find the average hourly bonus paid per employee.

Solution. First we shall express the given data in the form of a grouped frequency distribution with salaries (in Rupees) in the class intervals 61–80, 81–100, 101–120, 121–140 and 141–160. The first value in the above data is 130, so we put a tally mark against the class interval 121–140; the next value is 62, so we put a tally mark against the class 61–80, and so on. Thus the grouped frequency distribution is as follows :

COMPUTATION OF AVERAGE HOURLY BONUS PER EMPLOYEE

Salary (in Rs.)    Tally marks    Frequency (f)    Bonus (in Rs.) (X)    fX
61–80              |||||          5                10                    50
81–100             ||||           4                15                    60
101–120            ||||           4                20                    80
121–140            ||||           4                25                    100
141–160            |||            3                30                    90
Total                             ∑f = 20                                ∑fX = 380

$$\therefore \quad \text{Average hourly bonus paid per employee} = \frac{\sum fX}{\sum f} = \frac{380}{20} = \text{Rs. } 19.$$

Example 5·11. The mean salary paid to 1,000 employees of an establishment was found to be Rs. 180·40. Later on, after disbursement of salary, it was discovered that the salaries of two employees were wrongly entered as Rs. 297 and Rs. 165. Their correct salaries were Rs. 197 and Rs. 185. Find the correct arithmetic mean.

Solution. Let the variable X denote the salary (in rupees) of an employee. Then we are given :

$$\bar{X} = \frac{\sum X}{1000} = 180·40 \quad \Rightarrow \quad \sum X = 180400 \qquad \ldots(*)$$

Thus the total salary disbursed to all the employees in the establishment is Rs. 1,80,400. After incorporating the corrections we have :

Corrected ∑X = 180400 – (sum of wrong salaries) + (sum of correct salaries)
             = 180400 – (297 + 165) + (197 + 185)
             = 180400 – 462 + 382 = 180320

∴ Corrected mean salary = 180320/1000 = Rs. 180·32.

Example 5·12.
The table below shows the number of skilled and unskilled workers in two small communities, together with their average hourly wages :

                     Ram Nagar                    Shyam Nagar
Worker category      Number    Wage per hour      Number    Wage per hour
Skilled              150       Rs. 180            350       Rs. 175
Unskilled            850       Rs. 130            650       Rs. 125

Determine the average hourly wage for each community. Also give reasons why the results show that the average hourly wage in Shyam Nagar exceeds the hourly wage in Ram Nagar even though in Shyam Nagar the average hourly wage of both categories of workers is lower.

Solution. Let $n_1$ and $n_2$ denote the numbers, and $\bar{X}_1$ and $\bar{X}_2$ the wages (in rupees) per hour, of the skilled and unskilled workers respectively in the community. Let $\bar{X}$ be the mean wage of all the workers in the community.

Ram Nagar. We have : $n_1 = 150$, $\bar{X}_1 = $ Rs. 180 ; $n_2 = 850$, $\bar{X}_2 = $ Rs. 130

$$\therefore \quad \bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{150 \times 180 + 850 \times 130}{150 + 850} = \frac{27000 + 110500}{1000} = \frac{137500}{1000} = \text{Rs. } 137·50$$

Shyam Nagar. We have : $n_1 = 350$, $\bar{X}_1 = $ Rs. 175 ; $n_2 = 650$, $\bar{X}_2 = $ Rs. 125

$$\therefore \quad \bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{350 \times 175 + 650 \times 125}{350 + 650} = \frac{61250 + 81250}{1000} = \frac{142500}{1000} = \text{Rs. } 142·50$$

Thus, we see that the average wage per hour for all the workers combined is higher in Shyam Nagar than in Ram Nagar, although the average hourly wages of both types of workers are lower in Shyam Nagar. The reasons for this somewhat strange looking result may be assigned as follows :

The difference in the average hourly wages in Ram Nagar and Shyam Nagar :
(a) For skilled workers is Rs. (180 – 175) = Rs. 5
(b) For unskilled workers is Rs. (130 – 125) = Rs. 5

Thus, although the difference between the two communities in the wages of each category of workers is the same, viz., Rs.
5, the number of skilled workers, who get relatively higher wages than the unskilled workers, is much larger in Shyam Nagar than in Ram Nagar, and the number of unskilled workers, who get relatively lower wages, is much smaller in Shyam Nagar than in Ram Nagar. In fact, the ratio of skilled workers to unskilled workers in Ram Nagar is 150 : 850, i.e., 3 : 17, while in Shyam Nagar it is 350 : 650, i.e., 7 : 13.

Example 5·13. The mean of marks in Statistics of 100 students in a class was 72. The mean of marks of boys was 75, while their number was 70. Find out the mean marks of girls in the class.

Solution. In the usual notations we are given :

$n_1 = 70$, $\bar{x}_1 = 75$ ; $n_1 + n_2 = 100$, $\bar{x} = 72$ ; ∴ $n_2 = 100 - 70 = 30$. We want $\bar{x}_2$.

We have

$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2} \quad \Rightarrow \quad 72 = \frac{70 \times 75 + 30\,\bar{x}_2}{100}$$

∴ 72 × 100 = 5250 + 30 $\bar{x}_2$

$$\Rightarrow \quad \bar{x}_2 = \frac{7200 - 5250}{30} = \frac{1950}{30} = 65$$

Hence, the mean of marks of girls in the class is 65.

Example 5·14. The average daily wage of all workers in a factory is Rs. 444. If the average daily wages paid to male and female workers are Rs. 480 and Rs. 360 respectively, find the percentage of male and female workers employed by the factory.

Solution. Let $n_1$ and $n_2$ denote respectively the number of male and female workers in the factory and $\bar{X}_1$ and $\bar{X}_2$ denote respectively their average daily wage (in Rupees). Let $\bar{X}$ denote the average daily wage of all the workers in the factory. Then we are given that :

$\bar{X}_1 = 480$, $\bar{X}_2 = 360$ and $\bar{X} = 444$

We have

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} \quad \Rightarrow \quad (n_1 + n_2)\,\bar{X} = n_1\bar{X}_1 + n_2\bar{X}_2$$

⇒ 444 (n1 + n2) = 480n1 + 360n2
⇒ (480 – 444)n1 = (444 – 360)n2
⇒ 36n1 = 84n2

$$\therefore \quad \frac{n_1}{n_2} = \frac{84}{36} = \frac{7}{3}$$

Hence, the male workers in the factory are : 7/(7 + 3) × 100 = (7/10) × 100 = 70%

and the female workers in the factory are : 3/(7 + 3) × 100 = (3/10) × 100 = 30%.

Example 5·15. The arithmetic mean height of 50 students of a college is 5′8″. The height of 30 of these is given in the frequency distribution below.
Find the arithmetic mean height of the remaining 20 students.

Height      :   5′4″   5′6″   5′8″   5′10″   6′0″
Frequency   :   4      12     4      8       2

Solution. Let the variable X denote the height of the students in inches.

COMPUTATION OF MEAN HEIGHT OF 30 STUDENTS

Height in inches (X)    d = (X – 68)/2    Frequency (f)    fd
64                      –2                4                –8
66                      –1                12               –12
68                      0                 4                0
70                      1                 8                8
72                      2                 2                4
Total                                     ∑f = 30          ∑fd = –8

$\bar{X}_1$ = Mean height (in inches) of $n_1 = 30$ students

$$\bar{X}_1 = A + \frac{h\sum fd}{\sum f} = 68 + \frac{2 \times (-8)}{30} = \frac{2040 - 16}{30} = \frac{2024}{30} \text{ inches} \qquad \ldots(*)$$

Let $\bar{X}_2$ denote the mean height of the remaining $n_2 = 50 - 30 = 20$ students. If $\bar{X}$ is the mean height of the 50 students, then we are given that :

$$\bar{X} = 5'8'' = 68'' \qquad \text{and} \qquad \bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$$

$$\therefore \quad 68 = \frac{30 \times \dfrac{2024}{30} + 20\,\bar{X}_2}{50} \qquad [\text{Using } (*)]$$

⇒ 2024 + 20 $\bar{X}_2$ = 68 × 50 = 3400

$$\Rightarrow \quad \bar{X}_2 = \frac{3400 - 2024}{20} = \frac{1376}{20} = 68·8'' = 5'\,8·8''.$$

5·5. WEIGHTED ARITHMETIC MEAN

The formulae discussed so far in (5·1) to (5·6) for computing the arithmetic mean are based on the assumption that all the items in the distribution are of equal importance. However, in practice, we might come across situations where the relative importance of all the items of the distribution is not the same. If some items in a distribution are more important than others, then this point must be borne in mind in order that the average computed is representative of the distribution. In such cases, proper weightage is to be given to the various items, the weight attached to each item being proportional to the importance of the item in the distribution. For example, if we want to have an idea of the change in the cost of living of a certain group of people, then the simple mean of the prices of the commodities consumed by them will not do, since all the commodities are not equally important, e.g., wheat, rice, pulses, housing, fuel and lighting are more important than cigarettes, tea, confectionery, cosmetics, etc.
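The effect of attaching weights can be illustrated with a minimal Python sketch. The function name `weighted_mean` is an assumption made here for illustration and is not part of the text; the marks and weights are those of Example 5·16, which follows. The function divides the sum of the products of weights and values by the sum of the weights.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Marks (%) and subject weights taken from Example 5.16.
marks = [60, 75, 63, 59, 55]
weights = [1, 2, 1, 3, 3]

simple = sum(marks) / len(marks)          # every subject counted equally
weighted = weighted_mean(marks, weights)  # subjects counted by importance

print(simple, weighted)  # 62.4 61.5
```

The weighted figure falls below the simple one because the heavier weights sit on the lower marks (Physics and Chemistry). Multiplying every weight by the same non-zero factor leaves the result unchanged, since the factor cancels between numerator and denominator.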
Let $W_1, W_2, \ldots, W_n$ be the weights attached to the variable values $X_1, X_2, \ldots, X_n$ respectively. Then the weighted arithmetic mean, usually denoted by $\bar{X}_w$, is given by :

$$\bar{X}_w = \frac{W_1X_1 + W_2X_2 + \cdots + W_nX_n}{W_1 + W_2 + \cdots + W_n} = \frac{\sum WX}{\sum W} \qquad \ldots(5·11)$$

This is precisely the same as formula (5·2) with f replaced by W.

In the case of a frequency distribution, if $f_1, f_2, \ldots, f_n$ are the frequencies of the variable values $X_1, X_2, \ldots, X_n$ respectively, then the weighted arithmetic mean is given by :

$$\bar{X}_w = \frac{W_1(f_1X_1) + W_2(f_2X_2) + \cdots + W_n(f_nX_n)}{W_1 + W_2 + \cdots + W_n} = \frac{\sum W(fX)}{\sum W} \qquad \ldots(5·12)$$

where $W_1, W_2, \ldots, W_n$ are the respective weights of $X_1, X_2, \ldots, X_n$.

Example 5·16. A candidate obtained the following percentages of marks in an examination : English 60 ; Hindi 75 ; Mathematics 63 ; Physics 59 ; Chemistry 55. Find the candidate’s weighted arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.

Solution. Let the variable X denote the percentage of marks in the examination.

COMPUTATION OF WEIGHTED MEAN

Subject        Marks (%) (X)    Weight (W)    WX
English        60               1             60
Hindi          75               2             150
Mathematics    63               1             63
Physics        59               3             177
Chemistry      55               3             165
Total                           ∑W = 10       ∑WX = 615

$$\therefore \quad \text{Weighted Arithmetic Mean (in \%)} = \frac{\sum WX}{\sum W} = \frac{615}{10} = 61·5.$$

Example 5·17. Comment on the performance of the students in three universities given below, using simple and weighted averages :

                   Bombay                     Calcutta                   Madras
Course of study    % of pass    Students     % of pass    Students     % of pass    Students
M.A.               71           3            82           2            81           2
M.Com.             83           4            76           3            76           3·5
B.A.               73           5            73           6            74           4·5
B.Com.             74           2            76           7            58           2
B.Sc.              65           3            65           3            70           7
M.Sc.              66           3            60           7            73           2

(Students = No. of students in ’00s.)

Solution.

COMPUTATION OF SIMPLE AND WEIGHTED AVERAGES

(X = pass percentage ; W = No. of students in ’00s)

                   Bombay                    Calcutta                  Madras
Course of study    X1     W1    W1X1         X2     W2    W2X2         X3     W3     W3X3
M.A.               71     3     213          82     2     164          81     2      162
M.Com.             83     4     332          76     3     228          76     3·5    266
B.A.               73     5     365          73     6     438          74     4·5    333
B.Com.             74     2     148          76     7     532          58     2      116
B.Sc.              65     3     195          65     3     195          70     7      490
M.Sc.              66     3     198          60     7     420          73     2      146
Total              432    20    1451         432    28    1977         432    21     1513

University    Simple Average         Weighted Average
Bombay        ∑X1/6 = 432/6 = 72     ∑W1X1/∑W1 = 1451/20 = 72·55
Calcutta      ∑X2/6 = 432/6 = 72     ∑W2X2/∑W2 = 1977/28 = 70·61
Madras        ∑X3/6 = 432/6 = 72     ∑W3X3/∑W3 = 1513/21 = 72·05

On the basis of the simple arithmetic mean, which comes out to be the same for each University, viz., 72, we cannot distinguish between the pass percentages of the students in the three Universities. However, the weighted averages show that the results are the best in Bombay University (which has the highest weighted average of 72·55), followed by Madras University (which has the weighted average 72·05), while Calcutta University shows the lowest performance.

Example 5·18. From the results of two colleges A and B below state which of them is better and why ?

                       College A              College B
Name of Examination    Appeared    Passed     Appeared    Passed
M.A.                   300         250        1000        800
M.Com.                 500         450        1200        950
B.A.                   2000        1500       1000        700
B.Com.                 1200        750        800         500
Total                  4000        2950       4000        2950

Solution.

College A

Name of Examination    Appeared (WA)    Passed    Pass % age (XA)
M.A.                   300              250       (250/300) × 100 = 83·33
M.Com.                 500              450       (450/500) × 100 = 90
B.A.                   2000             1500      (1500/2000) × 100 = 75
B.Com.                 1200             750       (750/1200) × 100 = 62·5
Total                  4000             2950      (2950/4000) × 100 = 73·75

College B

Name of Examination    Appeared (WB)    Passed    Pass % age (XB)
M.A.                   1000             800       (800/1000) × 100 = 80
M.Com.                 1200             950       (950/1200) × 100 = 79·17
B.A.                   1000             700       (700/1000) × 100 = 70
B.Com.                 800              500       (500/800) × 100 = 62·5
Total                  4000             2950      (2950/4000) × 100 = 73·75

On the basis of the given information, it is not possible to decide which college is better, since the criterion for ‘better college’ is not defined. Let us try to solve this problem by taking the ‘Higher Pass Percentage’ as criterion for ‘better college’.
From the calculation table, we find that the pass percentage in M.A., M.Com. and B.A. is better in college A than in college B, and in B.Com. the pass percentage is the same in both the colleges. The simple arithmetic mean of pass percentages in all the four courses is :

$$\bar{X}_A = \frac{83·33 + 90 + 75 + 62·50}{4} = \frac{310·83}{4} = 77·71 \,; \qquad \bar{X}_B = \frac{80 + 79·17 + 70 + 62·50}{4} = \frac{291·67}{4} = 72·92$$

Since the mean pass percentage is higher for college A than for college B, we are tempted to conclude that college A is better than college B. However, this conclusion is not valid since the average pass percentage is affected by the number of students appearing in the examination in different courses. An appropriate average would be the weighted average of these pass percentages in different courses, the corresponding weights being the number of students appearing in the examination. The weighted means are :

$$\bar{X}_W(A) = \frac{\sum W_AX_A}{\sum W_A} = \frac{\text{Total number of students passed in college A}}{\text{Total number of students appeared in college A}} \times 100 = \frac{2950}{4000} \times 100 = 73·75$$

$$\bar{X}_W(B) = \frac{\sum W_BX_B}{\sum W_B} = \frac{\text{Total number of students passed in college B}}{\text{Total number of students appeared in college B}} \times 100 = \frac{2950}{4000} \times 100 = 73·75$$

On comparing the weighted means, we conclude that both the colleges A and B are equally good on the basis of the criterion of higher pass percentage for all the students taken together.

Example 5·19. (a) Show that the weighted arithmetic mean of the first n natural numbers whose weights are equal to the corresponding numbers is equal to (2n + 1)/3. (b) Also obtain the simple arithmetic mean.

Solution. The first n natural numbers are 1, 2, 3, …, n. We know that :

$$1 + 2 + 3 + \cdots + n = \frac{n(n + 1)}{2}\,, \qquad 1^2 + 2^2 + 3^2 + \cdots + n^2 = \frac{n(n + 1)(2n + 1)}{6} \qquad \ldots(*)$$

COMPUTATION OF WEIGHTED A.M.

X      W      WX
1      1      1²
2      2      2²
3      3      3²
⋮      ⋮      ⋮
n      n      n²

(a) Weighted arithmetic mean is given by :

$$\bar{X}_w = \frac{\sum WX}{\sum W} = \frac{1^2 + 2^2 + 3^2 + \cdots + n^2}{1 + 2 + 3 + \cdots + n} \qquad [\text{From above Table}]$$

$$= \frac{n(n + 1)(2n + 1)}{6} \cdot \frac{2}{n(n + 1)} = \frac{2n + 1}{3} \qquad [\text{From } (*)]$$

(b) Simple A.M. of the first n natural numbers is

$$\bar{X} = \frac{\sum X}{n} = \frac{1 + 2 + 3 + \cdots + n}{n} = \frac{n(n + 1)}{2n} = \frac{n + 1}{2} \qquad [\text{From } (*)]$$

EXERCISE 5·1

1. (a) What is a statistical average ? What are the desirable properties for an average to possess ? Mention different types of averages and state why the arithmetic mean is the most commonly used amongst them.
(b) State two important objects of measures of central value. [Delhi Univ. B.Com. (Pass), 1997]
Hint. (i) To obtain a single figure which is representative of the distribution. (ii) To facilitate comparisons.

2. What are ‘measures of location’ ? In what circumstances would you consider them as the most suitable measures for describing the central tendency of a frequency distribution ?

3. (a) Explain the properties of a good average. In the light of these properties which average do you think is the best and why ?
(b) What are the criteria of a satisfactory measure of the central tendency ? Discuss the standard measures of central tendency and say which of these satisfy your criteria.
(c) What do you mean by an ‘Average’ in Statistics ? Mention the essentials of a good average.

4. What do you understand by arithmetic mean ? Discuss its merits and demerits. Also state its important properties.

5. “The figure of 2·2 children per adult female was felt to be in some respects absurd and the Royal Commission suggested that the middle class be paid money to increase the average to a rounder and more convenient number.” (Punch)
Commenting on the above statement, discuss the limitations of the arithmetic average. Also point out the characteristics of a good measure of central tendency.

6. Calculate the average bonus paid per member from the following data :

Bonus (in Rs.)    :   50   60   70   80   90   100   110
No. of persons    :    1    3    5    7    6     2     1

Ans. Rs. 79·60.

7. (a) Peter travelled by car for 4 days. He drove 10 hours each day.
He drove : first day at the rate of 45 km. per hour, second day at the rate of 40 km. per hour, third day at the rate of 38 km. per hour and fourth day at the rate of 37 km. per hour. What was his average speed ?
Ans. 40 km. p.h.
(b) Typist A can type a letter in 5 minutes, typist B in 10 minutes and typist C in 15 minutes. What is the average number of letters typed per hour per typist ?
Ans. Required average = (12 + 6 + 4)/3 = 7·33.

8. (a) A taxi ride in a city costs one rupee for the first kilometre and sixty paise for each additional kilometre. The cost for each kilometre is incurred at the beginning of the kilometre, so that the rider pays for a whole kilometre. What is the average cost per kilometre for a ride of 2¾ kilometres ?
Ans. Average cost per km for 2¾ kms = (100 + 60 + 60) × (4/11) Paise = 80 Paise.
(b) The mean weight of a student in a group of 6 students is 119 lbs. The individual weights of five of them are 115, 109, 129, 117 and 114 lbs. What is the weight of the sixth student ?
Ans. 130 lbs.

9. (a) Average marks in Statistics of 10 students of a class was 68. A new student took admission with 72 marks whereas two existing students left the college. If the marks of these students were 40 and 39, find the average marks of the remaining students. [Delhi Univ. B.Com. (Pass), 2000]
Hint.
$$\bar{x} = \frac{(68 \times 10) + 72 - 40 - 39}{10 + 1 - 2} = 74·78 \text{ marks (approx.)}$$
(b) Shri Narendra Kumar has invested his capital in three securities, namely RELIANCE Ltd., TISCO and SATYAM : Rs. 40,000 ; Rs. 50,000 and Rs. 80,000 respectively. If he collects dividends of Rs. 10,000 from each company, compute his average return from three securities. [Delhi Univ. B.Com. (Pass), 2000]
Hint.
$$\text{Average rate of return} = \frac{\text{Total return}}{\text{Total investment}} = \frac{3 \times 10,000}{40,000 + 50,000 + 80,000}$$
Ans. 17·65%.

10. (a) Twelve persons gambled on a certain night. Seven of them lost at an average rate of Rs. 10·50 while the remaining five gained at an average of Rs. 13·00.
Is the information given above correct ? If not, why ?
Ans. Information is incorrect.
(b) Goals scored by a hockey team in successive matches are 5, 7, 4, 2, 4, 0, 5, 5 and 3. What is the number of goals the team must score in the 10th match in order that the average comes to 4 goals per match ?
Ans. 5.
(c) The sum of deviations of a certain number of observations measured from 4 is 72 and the sum of the deviations of the same values from 7 is –3. Find the number of observations and their mean. [Delhi Univ. B.Com. (Hons.), 1997]
Hint. Let n be the number of observations. If d = X – A, then
$$\bar{X} = A + \frac{\sum d}{n}\,; \qquad \therefore \quad \bar{X} = 4 + \frac{72}{n} = 7 + \frac{(-3)}{n}$$
Solving, we get n = 25, $\bar{X}$ = 6·88.
(d) The daily average sales of a store were Rs. 2,750 for the month of Feb. 1996. During the month, the highest and the lowest sales were Rs. 8,950 and Rs. 580 respectively. Find the average daily sales if the highest and the lowest sales are not taken into account. [Delhi Univ. B.Com. (Hons.), 1997]
Hint and Ans. n = No. of days in the month of February of 1996 (Leap Year) = 29
$$\text{Revised mean} = \text{Rs. } \frac{1}{27}\,[\sum X - 8,950 - 580] = \text{Rs. } \frac{1}{27}\,[29 \times 2,750 - 8,950 - 580] = \text{Rs. } 2,600·74$$
(e) Two variables x and y are related by : y = (x – 5)/10 and each of them has 5 observations. If the mean of x is 45, find the mean of y. [I.C.W.A. (Foundation), Dec. 2006]
Ans. $\bar{y} = (\bar{x} - 5)/10 = 4$.

11. (a) The following are the daily salaries in rupees of 30 employees of a firm :
91, 139, 126, 119, 100, 87, 65, 77, 99, 95, 108, 127, 86, 148, 116, 76, 69, 88, 112, 118, 89, 116, 97, 105, 95, 80, 86, 106, 93, 135.
The firm gave bonus of Rs. 10, 15, 20, 25, 30, 35, 40, 45 and 50 to employees in the respective salary groups : exceeding 60 but not exceeding 70, exceeding 70 but not exceeding 80 and so on up to exceeding 140 but not exceeding 150. Construct a frequency distribution and find out the average daily bonus paid per employee.
Ans. Average daily bonus = Rs. 27·50.
(b) The management of a college decides to give scholarship to the students who have scored marks 70 and above in Business Statistics. The following are the marks scored by II B.Com. students :

71   74   73   74   74   76   85   93   86   91
88   94   91   96   94   98   96   88   99   94

The scholarship payable is given below :

Marks                       :   70–75   75–80   80–85   85–90   90–95   95–100
Scholarship amount (Rs.)    :   100     200     300     400     500     600

Estimate the total scholarship payable and the average scholarship payable. (Bangalore Univ. B.Com., 1999)

12. A certain number of salesmen were appointed in different territories and the following data were compiled from their sales reports :

Sales (’000 Rs.)    :   4–8   8–12   12–16   16–20   20–24   24–28   28–32   32–36   36–40
No. of salesmen     :   11    13     16      14      ?       9       17      6       4

If the average sales is believed to be Rs. 19,920, find the missing information.
Ans. Missing Frequency = 10.

13. The mean of the following frequency distribution is 50. But the frequencies f1 and f2 in classes 20–40 and 60–80 are missing. Find the missing frequencies.

Class        :   0–20   20–40   40–60   60–80   80–100   Total
Frequency    :   17     f1      32      f2      19       120

[Delhi Univ. B.Com. (Pass), 1997]
Ans. f1 = 28, f2 = 24.

14. (a) The average salary of 49 out of 50 employees in a firm is Rs. 100. The salary of the 50th employee is Rs. 97·50 more than the average salary of all the 50 workers. Find the mean salary of all the employees in the firm.
(b) The mean of 99 items is 55. The value of the 100th item is 99 more than the mean of 100 items. What is the value of the 100th item ? [Delhi Univ. B.Com. (Hons.), 2001]
Ans. (a) Rs. 101·99, (b) 155.

15. (a) The mean of 200 items was 50. Later on it was discovered that two items were wrongly read as 92 and 8 instead of 192 and 88. Find out the correct mean.
Ans. 50·9.
(b) The average daily income for a group of 50 persons working in a factory was calculated to be Rs. 169. It was later discovered that one figure was misread as 134 instead of the correct value 143.
Calculate the correct average income.
Ans. Rs. 169·18.
(c) The average marks of 80 students were found to be 40. Later, it was discovered that a score of 54 was misread as 84. Find the correct mean of 80 students. [C.S. (Foundation), June 2001]
Ans. 39·625.

16. 100 students appeared for an examination. The results of those who failed are given below :

Marks              :   5   10   15   20   25   30   Total
No. of Students    :   4    6    8    7    3    2   30

If the average marks of all students were 68·6, find out the average marks of those who passed. [Delhi Univ. B.Com. (Hons.), 2008]
Ans. $n_1 + n_2 = 100$, $n_1 = 30$ ⇒ $n_2 = 70$ ;
$$\bar{X}_1 = \text{Mean marks of failed students} = \frac{\sum fX}{\sum f} = \frac{475}{30}$$
$$\frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{475 + 70\,\bar{X}_2}{100} = 68·6 \quad \Rightarrow \quad \bar{X}_2 = 91·21$$

17. Fifty students appeared in an examination. The results of the passed students are given in the table below. The average marks of all the students is 52. Find the average marks of the students who failed in the examination. [I.C.W.A. (Foundation), Dec. 2006]

Marks              :   40   50   60   70   80   90
No. of students    :    6   14    7    5    4    4

Ans. 21.

18. Out of 50 examinees, those passing the examination are shown below. If the average marks of all the examinees is 5·16, what would be the average marks of examinees having failed in it ?

Marks obtained                        :   4    5    6   7   8   9
No. of students passing the Exam.     :   8   10    9   6   4   3

[C.S. (Foundation), June 2002]
Ans. 2·1.

19. (a) The mean age of a combined group of men and women is 30 years. If the mean age of the group of men is 32 and that of the group of women is 27, find out the percentage of the men and women in the group.
Ans. Men = 60%, Women = 40%.
(b) The mean annual salary of all employees in a company is Rs. 25,000. The mean salary of male and female employees is Rs. 27,000 and Rs. 17,000 respectively. Find the percentage of males and females employed by the company. [C.A. (Foundation), Nov. 1995]
Ans. Males = 80%, Females = 20%.
(c) If the means of two groups of m and n observations are 40 and 50 respectively, and the combined mean of the two groups is 42, find the ratio m : n. [I.C.W.A. (Foundation), June 2007]
Ans. m : n = 4 : 1.

20. (a) The mean marks obtained by 300 students in the subject of Statistics are 45. The mean of the top 100 of them was found to be 70 and the mean of the last 100 was known to be 20. What is the mean of the remaining 100 students ?
Ans. 45.
(b) The mean hourly wage of 100 labourers working in a factory, running two shifts of 60 and 40 workers respectively, is Rs. 38. The mean hourly wage of 60 labourers working in the morning shift is Rs. 40. Find the mean hourly wage of 40 labourers working in the evening shift. [Delhi Univ. B.Com. (Pass), 1996]
Ans. Rs. 35.

21. (a) There are three sections in B.Com. 1st year in a certain college. The number of students in each section and the average marks obtained by them in the Statistics paper in the annual examination are as follows :

Section    Average marks in Statistics    No. of Students
A          75                             50
B          60                             60
C          55                             50

Find the average marks obtained by the students of all the sections taken together.
Ans. 63·125.
(b) B.Com. (Pass) III year has three Sections A, B and C with 50, 40, 60 students respectively. The mean marks for the three sections were determined as 85, 60 and 65 respectively. However, marks of a student of section A were wrongly recorded as 50 instead of zero. Determine the mean marks of all the three sections put together. [Delhi Univ. B.Com. (Pass), 1995]
Hint.
$$\text{Corrected } \bar{X}_A = \frac{50 \times 85 - 50 + 0}{50} = 84 \,; \qquad \therefore \quad \text{Combined mean } (\bar{x}) = \frac{50 \times 84 + 40 \times 60 + 60 \times 65}{50 + 40 + 60} = 70$$

22. The mean monthly salary paid to 77 employees in a company was Rs. 78. The mean salary of 32 of them was Rs. 75 and that of other 25 was Rs. 82. What was the mean salary of the remaining ?
Ans. Rs. 77·80.

23. Define the weighted arithmetic mean of a set of numbers.
Show that it is unaffected if all the weights are multiplied by some common factor.
24. A contractor employs three types of workers : male, female and children. To a male worker he pays Rs. 16 per hour, to a female worker Rs. 13 per hour and to a child worker Rs. 10 per hour. What is the average wage per hour paid by the contractor if the number of males, females and children is 20, 15 and 5 respectively ?
Ans. Rs. 14·12.
25. Define a ‘weighted mean’. Under what circumstances would you prefer it to an unweighted mean ? Calculate the weighted mean price of a table from the following data, assuming that weights are proportional to the number of tables sold :
Price per table (Rs.) : 3600 4000 4400 4800
No. of tables sold : 14 11 9 6
Ans. Rs. 4070.
26. Compute the weighted arithmetic mean of the index number from the data below :
Group : Food, Fuel and Light, Clothing, House Rent, Miscellaneous
Index No. : 125 141 133 173 182
Weight : 7 4 5 1 3
Ans. 141·15.
27. The following table gives the distribution of 100 accidents during seven days of the week of a given month. During the particular month there are 5 Mondays, Tuesdays and Wednesdays and only four each of the other days. Calculate the average number of accidents per day.
Days : Sunday Monday Tuesday Wednesday Thursday Friday Saturday
No. of Accidents : 26 16 12 10 8 10 18
Ans. 14·13 ≃ 14.
28. To produce a scooter of a certain make, labour of different kinds is required in quantities as follows :
Skilled labour : 50 hours
Semi-skilled labour : 100 hours
Unskilled labour : 300 hours
If hourly wage rates for these three kinds of labour are Rs. 100, Rs. 70 and Rs. 20 respectively, what is the average labour cost per hour in producing the scooter ? [Delhi Univ. B.A. (Econ. Hons.), 1990]
Hint. Use weighted arithmetic mean.
Ans. Rs. 40 per hour.
29.
A candidate obtained the following percentages of marks in different subjects in the Half-Yearly Examination :
Subject : English, Statistics, Cost Accountancy, Economics, Income Tax
Marks : 46% 67% 72% 58% 53%
It is agreed to give double weights to marks in English and Statistics as compared to other subjects. What is the simple and weighted arithmetic mean ? [Delhi Univ. B.Com. (Pass), 2002]
Ans. X̄ = 59·2% and X̄W = 58·43%
30. Calculate simple and weighted arithmetic averages from the following data and comment on them :
Designation               Daily salary (in Rs.)   Strength of the cadre
Class I Officers          1,500                   10
Class II Officers         800                     20
Subordinate staff         500                     70
Clerical staff            250                     100
Lower staff               100                     150
Ans. Simple Arithmetic Mean = Rs. 630 ; Weighted Arithmetic Mean = Rs. 302·86.
31. Comment on the performance of the students of three Universities given below using an appropriate average :
                    University A              University B              University C
Course of Study     % of Pass   No. of        % of Pass   No. of        % of Pass   No. of
                                students                  students                  students
                                (hundreds)                (hundreds)                (hundreds)
M.A.                81          2             82          2             71          3
M.Com.              76          3·5           76          3             83          4
M.Sc.               73          2             60          7             66          3
B.Com.              58          2             76          7             74          2
B.Sc.               70          7             65          3             65          3
B.A.                74          4·5           73          6             73          5
Ans. Simple average (A.M.) of pass percentage is 72% in each case; we are unable to distinguish between the performance of students in the three universities. However, on the basis of weighted average of pass percentage, University C (72·55%) is the best followed by University A (72·05%) and University B (70·61%).
32. From the results of two colleges A and B given below, state which of them is better and why ?
Name of Examination     College A               College B
                        Appeared    Passed      Appeared    Passed
M.A.                    60          50          200         160
M.Com.                  100         90          240         190
B.A.                    400         300         200         140
B.Com.                  240         150         160         100
Total                   800         590         800         590
Hint and Ans. Find the weighted average of percentage of passed students (X), the corresponding weights (W) being the number of students appeared.
X̄W (A) = ∑WA XA / ∑WA = (590/800) × 100 = 73·75 ; X̄W (B) = ∑WB XB / ∑WB = (590/800) × 100 = 73·75
Taking ‘higher pass percentage’ as the criterion for better college, both the colleges A and B are equally good.
33. A travelling salesman made five trips in two months. The record of sales is given below :
Trip     No. of days   Value of sales (in ’00 Rs.)   Sales per day (in ’00 Rs.)
1        5             3,000                         600
2        4             1,600                         400
3        3             1,500                         500
4        7             3,500                         500
5        6             4,200                         700
Total    25            13,800                        2,700
The sales manager criticised the salesman’s performance as not very good since his mean daily sales were only Rs. 54,000 (2,70,000/5). The salesman called this an unfair statement for his daily mean sales were as high as Rs. 55,200 (13,80,000/25). What does each average mean here ? Which average seems to be more appropriate in this case ?
Ans. The Manager obtained the simple arithmetic mean of the sales per day, while the salesman obtained the weighted arithmetic mean. The latter (weighted average) seems to be more appropriate.

5·6. MEDIAN
In the words of L.R. Connor : “The median is that value of the variable which divides the group in two equal parts, one part comprising all the values greater and the other, all the values less than median”. Thus median of a distribution may be defined as that value of the variable which exceeds and is exceeded by the same number of observations i.e., it is the value such that the number of observations above it is equal to the number of observations below it.
Thus, we see that as against arithmetic mean which is based on all the items of the distribution, the median is only a positional average i.e., its value depends on the position occupied by a value in the frequency distribution.
5·6·1. Calculation of Median. Case (I) : Ungrouped Data.
If the number of observations is odd, then the median is the middle value after the observations have been arranged in ascending or descending order of magnitude. For example, the median of 5 observations 35, 12, 40, 8, 60 i.e., 8, 12, 35, 40, 60, is 35. In case of even number of observations median is obtained as the arithmetic mean of the two middle observations after they are arranged in ascending or descending order of magnitude. Thus, if one more observation, say, 50 is added to the above five observations then the six observations in ascending order of magnitude are : 8, 12, 35, 40, 50, 60. Thus,
Median = Arithmetic mean of two middle terms = ½ (35 + 40) = 37·5.
Remark. It should be clearly understood that in case of even number of observations, in fact, any value lying between the two middle values can serve as a median but it is a convention to estimate median by taking the arithmetic mean of the two middle values.
Case (II) : Frequency Distribution. In case of frequency distribution where the variable takes the values X1, X2,…, Xn with respective frequencies f1, f2,…, fn with ∑f = N, total frequency, median is the size of the (N + 1)/2th item or observation. In this case the use of cumulative frequency (c.f.) distribution facilitates the calculations. The steps involved are :
(i) Prepare the ‘less than’ cumulative frequency (c.f.) distribution.
(ii) Find N/2.
(iii) See the c.f. just greater than N/2.
(iv) The corresponding value of the variable gives median.
The example given below illustrates the method.
Example 5·20. Eight coins were tossed together and the number of heads (X) resulting was noted. The operation was repeated 256 times and the frequency distribution of the number of heads is given below :
No. of heads (X) : 0 1 2 3 4 5 6 7 8
Frequency (f) : 1 9 26 59 72 52 29 7 1
Calculate median.
Solution. Here N = ∑f = 256 ⇒ N/2 = 128.
COMPUTATION OF MEDIAN
X     f      Less than c.f.
0     1      1
1     9      1 + 9 = 10
2     26     10 + 26 = 36
3     59     36 + 59 = 95
4     72     95 + 72 = 167
5     52     167 + 52 = 219
6     29     219 + 29 = 248
7     7      248 + 7 = 255
8     1      255 + 1 = 256
The cumulative frequency (c.f.) just greater than 128 is 167 and the value of X corresponding to 167 is 4. Hence, median number of heads is 4.
Case (III) : Continuous Frequency Distribution. As before, median is the size (value) of the (N + 1)/2th observation. Steps involved for its computation are :
(i) Prepare ‘less than’ cumulative frequency (c.f.) distribution.
(ii) Find N/2.
(iii) See c.f. just greater than N/2.
(iv) The corresponding class contains the median value and is called the median class.
The value of median is now obtained by using the interpolation formula :
\[ \text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right) \] …(5·13)
where l is the lower limit of the median class, f is the frequency of the median class, h is the magnitude or width of the median class, N = ∑f is the total frequency, and C is the cumulative frequency of the class preceding the median class.
Remarks 1. The interpolation formula (5·13) is based on the following assumptions :
(i) The distribution of the variable under consideration is continuous with exclusive type classes without any gaps.
(ii) There is an orderly and even distribution of observations within each class.
However, if the data are given as a grouped frequency distribution where classes are not continuous, then it must be converted into a continuous frequency distribution before applying the formula. This adjustment will affect only the value of l in (5·13).
2. Median will be abbreviated by the symbol Md.
3. The sum of absolute deviations of a given set of observations is minimum when taken from median. By absolute deviation we mean the deviation after ignoring the algebraic sign.
Thus, if we take the deviation of the given values of the variable X from an assumed mean A, then X – A may be positive or negative but its absolute value denoted by | X – A |, read as (X – A) modulus or (X – A) mod, is never negative and we have
∑ f | X – A | > ∑ f | X – Md | or ∑ f | X – Md | < ∑ f | X – A | ; A ≠ Md.
i.e., the sum of the absolute deviations about any arbitrary point A is always greater than the sum of the absolute deviations about the median. For further discussion, see Mean Deviation in Chapter 6 on Dispersion.
5·6·2. Merits and Demerits of Median.
Merits. (i) It is rigidly defined.
(ii) Median is easy to understand and easy to calculate for a non-mathematical person.
(iii) Since median is a positional average, it is not affected at all by extreme observations and as such is very useful in the case of skewed distributions (cf. Chapter 7), J-shaped or inverted J-shaped distributions (cf. Chapter 4) such as the distribution of wages, incomes and wealth. So in case of extreme observations, median is a better average to use than the arithmetic mean since the latter gives a distorted picture of the distribution.
(iv) Median can be computed while dealing with a distribution with open end classes.
(v) Median can sometimes be located by simple inspection and can also be computed graphically. (See Ogive discussed in § 5·6·4.)
(vi) Median is the only average to be used while dealing with qualitative characteristics which cannot be measured quantitatively but can still be arranged in ascending or descending order of magnitude e.g., to find the average intelligence, average beauty, average honesty, etc., among a group of people.
Demerits. (i) In case of even number of observations for an ungrouped data, median cannot be determined exactly. We merely estimate it as the arithmetic mean of the two middle terms. In fact any value lying between the two middle observations can serve the purpose of median.
(ii) Median, being a positional average, is not based on each and every item of the distribution. It depends on all the observations only to the extent whether they are smaller than or greater than it ; the exact magnitude of the observations being immaterial. Let us consider a simple example. The median value of 35, 12, 8, 40 and 60 i.e., 8, 12, 35, 40, 60 is 35. Now if we replace the values 8 and 12 by any two values which are less than 35 and the values 40 and 60 by any two values greater than 35 the median is unaffected. This property is sometimes described by saying that median is insensitive.
(iii) Median is not suitable for further mathematical treatment i.e., given the sizes and the median values of different groups, we cannot compute the median of the combined group.
(iv) Median is relatively less stable than mean, particularly for small samples since it is affected more by fluctuations of sampling as compared with arithmetic mean.
Example 5·21. (a) In a batch of 15 students, 5 students failed in a test. The marks of 10 students who passed were 9, 6, 7, 8, 8, 9, 6, 5, 4, 7. What was the median of all the 15 students ?
(b) If the relation between two variables x and y is 2x + 3y = 7, and the median of y is 2, find the median of x. [I.C.W.A. (Foundation), Dec. 2005]
Solution. (a) The marks of 10 students who passed when arranged in ascending order of magnitude are : 4, 5, 6, 6, 7, 7, 8, 8, 9, 9.
Since the five students who failed must have scored less than 4 marks, the marks of 15 students when arranged in ascending order are :
., ., ., ., ., 4, 5, 6, 6, 7, 7, 8, 8, 9, 9. …(1)
Here N = 15. Hence, the median value is the middle value viz., 8th value in the series (1). Hence, median is 6.
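The rules of § 5·6·1 for ungrouped data and for a discrete frequency distribution can be checked with a short Python sketch. The function names `median_ungrouped` and `median_discrete` are illustrative choices, not part of the text, and the failing marks tried for Example 5·21(a) are arbitrary guesses below 4, since the actual marks are not given.

```python
from itertools import accumulate


def median_ungrouped(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_discrete(xs, freqs):
    """Value of X whose 'less than' c.f. is the first to exceed N/2.

    Assumes xs is already arranged in ascending order.
    """
    half = sum(freqs) / 2
    for x, cf in zip(xs, accumulate(freqs)):
        if cf > half:
            return x
    raise ValueError("total frequency must be positive")


# Case (I): the observations used in the text.
print(median_ungrouped([35, 12, 40, 8, 60]))        # 35
print(median_ungrouped([35, 12, 40, 8, 60, 50]))    # 37.5

# Case (II): Example 5.20 (number of heads in 256 tosses of eight coins).
print(median_discrete(range(9), [1, 9, 26, 59, 72, 52, 29, 7, 1]))  # 4

# Example 5.21(a): the five failing marks are unknown; any values below 4
# (two illustrative guesses here) leave the median unchanged.
passed = [9, 6, 7, 8, 8, 9, 6, 5, 4, 7]
for failed in ([0, 1, 2, 3, 3], [0, 0, 0, 0, 0]):
    print(median_ungrouped(passed + failed))        # 6
```

The last loop also illustrates demerit (ii) : only the position of the failing marks relative to the median matters, not their magnitude.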
(b) We are given :
Median (y) = 2 …(*)
and 2x + 3y = 7 ⇒ x = ½ (7 – 3y) …(**)
Since a change of origin and scale in the observations leaves the middle position unchanged (the order of the observations is either preserved or exactly reversed), we get from (**) and (*),
Median (x) = ½ [7 – 3 Median (y)] = ½ (7 – 3 × 2) = ½.
Example 5·22. The following table shows the age distribution of persons in a particular region.
Age (years)     No. of persons (in thousands)
Below 10        2
Below 20        5
Below 30        9
Below 40        12
Below 50        14
Below 60        15
Below 70        15·5
70 and over     15·6
(i) Find the median age.
(ii) Why is the median a more suitable measure of central tendency than the mean in this case ?
Solution. (i) First of all we shall convert the given distribution into the continuous frequency distribution as given in the following table and then compute the median.
COMPUTATION OF MEDIAN
Age (in years)   Number of persons in ’000 (f)   c.f. (less than)
0—10             2                               2
10—20            5 – 2 = 3                       5
20—30            9 – 5 = 4                       9
30—40            12 – 9 = 3                      12
40—50            14 – 12 = 2                     14
50—60            15 – 14 = 1                     15
60—70            15·5 – 15 = 0·5                 15·5
70 and over      15·6 – 15·5 = 0·1               15·6
                 N = ∑f = 15·6
Here N/2 = 15·6/2 = 7·8. Cumulative frequency (c.f.) just greater than 7·8 is 9. Thus the corresponding class 20—30 is the median class. Hence, using the median formula (5·13), we get
Median = 20 + (10/4) (7·8 – 5) = 20 + (5/2) × 2·8 = 20 + 5 × 1·4 = 27
Hence, median age is 27 years.
(ii) In this case median is a more suitable measure of central tendency than mean because the last class viz., 70 and over is an open end class and as such we cannot obtain the class mark for this class and hence arithmetic mean cannot be computed.
Example 5·23. The frequency distribution of weight in grams of mangoes of a given variety is given below. Calculate the arithmetic mean and the median.
Weight in grams : 410—419 420—429 430—439 440—449 450—459 460—469 470—479
Number of mangoes : 14 20 42 54 45 18 7
Solution. Since the interpolation formula for median is based on continuous frequency distribution we shall first convert the given inclusive class interval series into exclusive class interval series.
CALCULATIONS FOR MEAN AND MEDIAN
Weight in grams       No. of Mangoes   Mid-value   d = (X – 444·5)/10   fd      (Less than) c.f.
(Class boundaries)    (f)              (X)
409·5—419·5           14               414·5       –3                   –42     14
419·5—429·5           20               424·5       –2                   –40     34
429·5—439·5           42               434·5       –1                   –42     76
439·5—449·5           54               444·5       0                    0       130
449·5—459·5           45               454·5       1                    45      175
459·5—469·5           18               464·5       2                    36      193
469·5—479·5           7                474·5       3                    21      200
Total                 ∑f = 200 = N                                      ∑fd = –22
Mean (X̄) = A + h∑fd/N = 444·5 + 10 × (–22)/200 = 444·5 – 1·1 = 443·4 gms.
N/2 = 100. The c.f. just greater than 100 is 130. Hence, the corresponding class 439·5—449·5 is the median class. Using the median formula, we get
Md = l + (h/f) (N/2 – C) = 439·5 + (10/54) (100 – 76) = 439·5 + (10 × 24)/54 = 439·50 + 4·44 = 443·94 gms.
Example 5·24. Find the missing frequency from the following distribution of daily sales of shops, given that the median sale of shops is Rs. 2,400.
Sale in hundred Rs. : 0—10 10—20 20—30 30—40 40—50
No. of shops : 5 25 — 18 7
Solution. Let the missing frequency be ‘a’.
CALCULATIONS FOR MEDIAN
Sales in hundred Rs.   No. of shops (f)   Cumulative frequency (c.f.)
0—10                   5                  5
10—20                  25                 30
20—30                  a                  30 + a
30—40                  18                 48 + a
40—50                  7                  55 + a
                       N = 55 + a
Since median sales is Rs. 2,400 (24 hundred), 20—30 is the median class. Using median formula, we get
24 = 20 + (10/a) [(55 + a)/2 – 30] ⇒ 4 = (10/a) [(55 + a – 60)/2] = 5(a – 5)/a
∴ 4a = 5a – 25 ⇒ a = 25.
Hence, the missing frequency is 25.
Example 5·25. In the frequency distribution of 100 families given below, the number of families corresponding to expenditure groups 20—40 and 60—80 are missing from the table. However, the median is known to be 50.
Find the missing frequencies.
Expenditure : 0—20 20—40 40—60 60—80 80—100
No. of families : 14 ? 27 ? 15
Solution. Let the missing frequencies for the classes 20—40 and 60—80 be f1 and f2 respectively.
COMPUTATION OF MEDIAN
Expenditure (in Rupees)   No. of families (f)        c.f. (Less than)
0—20                      14                         14
20—40                     f1                         14 + f1
40—60                     27                         41 + f1
60—80                     f2                         41 + f1 + f2
80—100                    15                         56 + f1 + f2
                          N = 100 = 56 + f1 + f2
From the above table, we have
∑f = 56 + f1 + f2 = 100 (Given) ⇒ f1 + f2 = 100 – 56 = 44 …(*)
Since median is given to be 50, which lies in the class 40—60, therefore, 40—60 is the median class. Using the median formula, we get :
50 = 40 + (20/27) [50 – (14 + f1)] ⇒ 50 – 40 = (20/27) [36 – f1]
∴ 10 = (20/27) (36 – f1) ⇒ 27 = 2(36 – f1) = 72 – 2f1
⇒ 2f1 = 72 – 27 = 45 ⇒ f1 = 45/2 = 22·5 ≃ 23. [Since frequency can’t be fractional]
Substituting in (*), we get f2 = 44 – f1 = 44 – 23 = 21.
5·6·3. Partition Values. The values which divide the series into a number of equal parts are called the partition values. Thus median may be regarded as a particular partition value which divides the given data into two equal parts.
Quartiles. The values which divide the given data into four equal parts are known as quartiles. Obviously there will be three such points Q1, Q2 and Q3 such that Q1 ≤ Q2 ≤ Q3, termed as the three quartiles. Q1, known as the lower or first quartile, is the value which has 25% of the items of the distribution below it and consequently 75% of the items are greater than it. Incidentally Q2, the second quartile, coincides with the median and has an equal number of observations above it and below it. Q3, known as the upper or third quartile, has 75% of the observations below it and consequently 25% of the observations above it.
The working principle for computing the quartiles is basically the same as that of computing the median.
To compute Q1, the following steps are required :
(i) Find N/4, where N = ∑f is the total frequency.
(ii) See the (less than) cumulative frequency (c.f.) just greater than N/4.
(iii) The corresponding value of X gives the value of Q1. In case of continuous frequency distribution, the corresponding class contains Q1 and the value of Q1 is obtained by the interpolation formula :
\[ Q_1 = l + \frac{h}{f}\left(\frac{N}{4} - C\right) \] …(5·14)
where l is the lower limit, f is the frequency, and h is the magnitude of the class containing Q1, and C is the cumulative frequency (c.f.) of the class preceding the class containing Q1.
Similarly to compute Q3, see the (less than) c.f. just greater than 3N/4. The corresponding value of X gives Q3. In case of continuous frequency distribution, the corresponding class contains Q3 and the value of Q3 is given by the formula :
\[ Q_3 = l + \frac{h}{f}\left(\frac{3N}{4} - C\right) \] …(5·15)
where l is the lower limit, h is the magnitude, and f is the frequency of the class containing Q3, and C is the c.f. of the class preceding the class containing Q3.
Deciles. Deciles are the values which divide the series into ten equal parts. Obviously there are nine deciles D1, D2, D3,…, D9, (say), such that D1 ≤ D2 ≤ … ≤ D9. Incidentally D5 coincides with the median. The method of computing the deciles Di, (i = 1, 2,…, 9) is the same as discussed for Q1 and Q3. To compute the ith decile Di, (i = 1, 2,…, 9) see the c.f. just greater than i × N/10. The corresponding value of X is Di. In case of continuous frequency distribution the corresponding class contains Di and its value is obtained by the formula :
\[ D_i = l + \frac{h}{f}\left(\frac{i \times N}{10} - C\right), \quad (i = 1, 2, \ldots, 9) \] …(5·16)
where l is the lower limit, f is the frequency and h is the magnitude of the class containing Di, and C is the c.f. of the class preceding the class containing Di.
Percentiles. Percentiles are the values which divide the series into 100 equal parts. Obviously, there are 99 percentiles P1, P2,…, P99 such that P1 ≤ P2 ≤ … ≤ P99.
The ith percentile Pi, (i = 1, 2,…, 99) is the value of X corresponding to c.f. just greater than i × N/100. In case of continuous frequency distribution, the corresponding class contains Pi and its value is obtained by the interpolation formula :
\[ P_i = l + \frac{h}{f}\left(\frac{i \times N}{100} - C\right), \quad (i = 1, 2, \ldots, 99) \] …(5·17)
where l is the lower limit, f is the frequency and h is the magnitude of the class containing Pi, and C is the c.f. of the class preceding the class containing Pi.
In particular, we shall have :
P25 = Q1, P50 = D5 = Q2, P75 = Q3, D1 = P10, D2 = P20, D3 = P30,…, D9 = P90.
Remark. Importance of partition values. Partition values, particularly the percentiles, are specially useful in the scaling and ranking of test scores in psychological and educational statistics. In the data relating to business and economic statistics, these partition values, specially quartiles, are useful in personnel work and productivity ratings.
5·6·4. Graphic Method of Locating Partition Values. The various partition values viz., quartiles, deciles and percentiles can be easily located graphically with the help of a curve called the cumulative frequency curve or Ogive. The procedure involves the following steps :
Less Than Ogive
Steps 1. Represent the given distribution in the form of a less than cumulative frequency distribution.
2. Take the values of the variable (in the case of frequency distribution) and the class intervals (in the case of continuous frequency distribution) along the horizontal scale (X-axis) and the cumulative frequency along the vertical scale (Y-axis).
3. Plot the c.f. against the corresponding value of the variable (in the case of frequency distribution) and against the upper limit of the corresponding class (in the case of continuous frequency distribution).
4. The smooth curve obtained by joining the points so obtained by means of a free-hand drawing is called ‘less than’ cumulative frequency curve or ‘less than’ ogive.
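A numerical counterpart of Steps 1 to 4 can be sketched in Python. The sketch joins the plotted points by straight lines, which is an assumption made here (the text draws a smooth free-hand curve); under that assumption, reading the ogive at the height p × N gives the same value as the interpolation formulae (5·13) to (5·17). The function names `ogive_points` and `read_partition_value` are hypothetical, and the data are those of Example 5·23.

```python
from itertools import accumulate


def ogive_points(boundaries, freqs):
    """Points of the 'less than' ogive: each class boundary paired with the
    cumulative frequency below it (0 at the lower limit of the first class)."""
    cfs = [0, *accumulate(freqs)]
    return list(zip(boundaries, cfs))


def read_partition_value(boundaries, freqs, fraction):
    """Value of X at which the straight-line ogive reaches fraction * N."""
    points = ogive_points(boundaries, freqs)
    target = fraction * points[-1][1]
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c1 > target:  # class whose c.f. is just greater than the target
            return x0 + (x1 - x0) * (target - c0) / (c1 - c0)
    return points[-1][0]  # fraction = 1: upper limit of the last class


# Example 5.23 (weights of mangoes), with class boundaries 409.5, 419.5, ..., 479.5.
boundaries = [409.5 + 10 * i for i in range(8)]
freqs = [14, 20, 42, 54, 45, 18, 7]

print(round(read_partition_value(boundaries, freqs, 0.50), 2))  # 443.94, the median found in Example 5.23
print(round(read_partition_value(boundaries, freqs, 0.25), 2))  # first quartile
print(round(read_partition_value(boundaries, freqs, 0.75), 2))  # third quartile
```

Any decile or percentile is read the same way by passing the fraction i/10 or i/100.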
The various partition values can be easily obtained from this ogive as illustrated in Example 5·32.
More Than Ogive. In this case we form the ‘more than’ cumulative frequency distribution and plot it against the corresponding value of the variable or against the lower limit of the corresponding class (in case of continuous frequency distribution). The curve obtained on joining the points so obtained by smooth freehand drawing is called ‘more than’ cumulative frequency curve or ‘more than’ ogive.
Remark. If we draw a perpendicular from the point of intersection of the two ogives on the x-axis, the foot of the perpendicular gives the value of median.
Example 5·26. The following data gives the distribution of marks of 100 students. Calculate the most suitable average, giving the reason for your choice. Also obtain the values of quartiles, 6th decile and 70th percentile from the following data.
Marks            No. of students     Marks            No. of students
Less than 10     5                   Less than 50     60
Less than 20     13                  Less than 60     80
Less than 30     20                  Less than 70     90
Less than 40     32                  Less than 80     100
Solution. We are given ‘less than’ cumulative frequency distribution. We shall first convert it into a grouped frequency distribution. Since ‘marks’ is a discrete random variable taking only integral values, the classes are : Less than 10, 10—19,…, 70—79. Further, since the formulae for median, quartiles and percentiles are based on continuous frequency distribution, we convert the distribution into exclusive type classes with class boundaries below 9·5, 9·5—19·5,…, 69·5—79·5 as given in the following table.
COMPUTATIONS FOR MEDIAN, QUARTILES AND PERCENTILES
Class            Frequency (f)     Less than c.f.   Class Boundaries
Less than 10     5                 5                Below 9·5
10—19            13 – 5 = 8        13               9·5—19·5
20—29            20 – 13 = 7       20               19·5—29·5
30—39            32 – 20 = 12      32               29·5—39·5
40—49            60 – 32 = 28      60               39·5—49·5
50—59            80 – 60 = 20      80               49·5—59·5
60—69            90 – 80 = 10      90               59·5—69·5
70—79            100 – 90 = 10     100 = N          69·5—79·5
Since the first class ‘less than 10’ is an open end class, we cannot compute any of the mathematical averages like mean, geometric mean or harmonic mean. The only averages we can compute in this case are median and mode. We compute below the median of the above distribution.
Median. N/2 = 100/2 = 50. The c.f. just greater than 50 is 60. Hence, the corresponding class 39·5—49·5 is the median class.
∴ Median = 39·5 + (10/28) (50 – 32) = 39·5 + (10 × 18)/28 = 39·50 + 6·43 = 45·93
Hence, median marks are 45·93.
Quartiles. N/4 = 100/4 = 25 and 3N/4 = (3 × 100)/4 = 75. The c.f. just greater than N/4 is 32. Hence, the corresponding class 29·5—39·5 contains Q1 which is given by :
Q1 = 29·5 + (10/12) (25 – 20) = 29·5 + (10 × 5)/12 = 29·50 + 4·17 = 33·67
The c.f. just greater than 3N/4 = 75 is 80. Hence, the corresponding class 49·5—59·5 contains Q3 which is given by :
Q3 = 49·5 + (10/20) (75 – 60) = 49·50 + (10 × 15)/20 = 49·5 + 7·5 = 57·0.
6th Decile. 6N/10 = (6 × 100)/10 = 60. The c.f. just greater than 60 is 80. Hence, the corresponding class 49·5—59·5 contains D6 which is given by :
D6 = 49·5 + (10/20) (60 – 60) = 49·5
70th Percentile. 70N/100 = (70 × 100)/100 = 70. The c.f. just greater than 70 is 80. Hence, the corresponding class 49·5—59·5 contains P70 which is given by :
P70 = 49·5 + (10/20) (70 – 60) = 49·5 + (10 × 10)/20 = 49·5 + 5 = 54·5.
Example 5·27. Comment on the following statement :
“The median of a distribution is N/2, the lower quartile is N/4 and the upper quartile is 3N/4”. (Here N denotes the total frequency.)
Solution. The statement is wrong. The median of a distribution is not N/2 but it is the value of the variable X which divides the distribution into two equal parts i.e., median is the value of the variable X such that N/2 (i.e., 50%) of the observations are less than it and N/2 observations exceed it.
The lower quartile Q1 is not N/4 but it is the value of the variable such that N/4 (i.e., 25%) of the observations are less than Q1. Similarly the upper quartile Q3 is not 3N/4 but it is the value of the variable such that 3N/4 (i.e., 75%) of the observations are less than it.
Example 5·32. The following are the marks obtained by the students in Statistics :
Marks               Number of students     Marks               Number of students
10 marks or less    4                      40 marks or less    40
20 marks or less    10                     50 marks or less    47
30 marks or less    30                     60 marks or less    50
Draw a ‘less than’ ogive curve on the graph paper and show therein :
(i) The range of marks obtained by middle 80% of the students.
(ii) The median.
Also verify your results by direct formula calculations.
Solution. The above data can be arranged in the form of a continuous frequency distribution as given in the following table.
Marks     Frequency (f)     (Less than) c.f.
0—10      4                 4
10—20     10 – 4 = 6        10
20—30     30 – 10 = 20      30
30—40     40 – 30 = 10      40
40—50     47 – 40 = 7       47
50—60     50 – 47 = 3       N = ∑f = 50
Less Than Ogive. Plot the less than c.f. against the corresponding value of the variable in the original table (or against the upper limit of the corresponding class in the above table) and join these points by a smooth free hand curve to obtain ogive. [See Fig. 5·1]
(i) At the frequency N/2 = 25 (along the Y-axis), draw a line parallel to the x-axis meeting the ogive at point P. Draw PM perpendicular to the x-axis. Then OM = 27·5 is the median marks.
(ii) The range of the marks obtained by the middle 80% of the students is given by P90 – P10. To find P90 and P10 graphically, at the frequencies (90/100) N = 45 and (10/100) N = 5, draw lines parallel to the x-axis meeting the (less than) ogive at Q and R respectively. Draw QN and RL perpendicular to the x-axis. Then
P90 = ON = 47 (app.) and P10 = OL = 11·7 (app.)
∴ Required range of marks = P90 – P10 = 47 – 11·7 = 35·3.
[Fig. 5·1. OGIVE (less than) : marks along the X-axis (0 to 60), cumulative frequency along the Y-axis (0 to 50), with the points P, Q, R on the curve and the feet M, N, L on the x-axis ; readings M = 27·5, P10 = 11·7 (app.), P90 = 47 (app.).]
Values by Direct Calculations
Median. Here N/2 = 25. The c.f. just greater than 25 is 30. Thus the corresponding class 20—30 is the median class. Using median formula, we get
Median = 20 + (10/20) (50/2 – 10) = 20 + ½ × 15 = 20 + 7·5 = 27·5
P10 and P90. (10/100) N = (10/100) × 50 = 5. The c.f. just greater than 5 is 10. Hence, P10 lies in the corresponding class 10—20.
∴ P10 = 10 + (10/6) (5 – 4) = 10 + 10/6 = 10 + 1·67 = 11·67
(90/100) N = (90/100) × 50 = 45. The c.f. just greater than 45 is 47. Hence, the corresponding class 40—50 contains P90 and
P90 = 40 + (10/7) (45 – 40) = 40 + (10 × 5)/7 = 40 + 7·14 = 47·14
Hence, the range of the marks obtained by the middle 80% of the students is
P90 – P10 = 47·14 – 11·67 = 35·47.
Example 5·28. For a group of 5000 workers, the hourly wages vary from Rs. 20 to Rs. 80. The wages of 4 per cent of the workers are under Rs. 25 and those of 10 per cent are under Rs. 30; 15 per cent of the workers earn Rs. 60 and over, and 5 per cent of them get Rs. 70 and over. The quartile wages are Rs. 40 and Rs. 54, and the sixth decile is Rs. 50. Put this information in the form of a frequency table.
Solution. We are given : N = 5000.
(a) Q1 = Rs. 40 ⇒ 25% i.e., (25/100) × 5000 = 1250 workers earn less than Rs. 40.
(b) D6 = Rs. 50 ⇒ 60% i.e., (60/100) × 5000 = 3000 workers earn below Rs. 50.
(c) Q3 = Rs. 54 ⇒ 75% i.e., (75/100) × 5000 = 3750 workers earn below Rs. 54.
Further, we are given that :
(i) 4% i.e., (4/100) × 5000 = 200 workers earn under Rs. 25.
(ii) 10% i.e., (10/100) × 5000 = 500 workers earn under Rs. 30.
(iii) 15% i.e., (15/100) × 5000 = 750 workers earn Rs. 60 and over, and
(iv) 5% i.e., (5/100) × 5000 = 250 workers earn Rs. 70 and over.
Using the above information, we can compute the frequencies for the following class intervals :
Wages in Rs. : Under 25, 25—30, 30—40, 40—50, 50—54, 54—60, 60—70, 70 and over,
as given in the following table :
Hourly Wages (in Rs.)   Less than c.f.          No. of workers
Under 25                200                     200
25—30                   500                     500 – 200 = 300
30—40                   1250                    1250 – 500 = 750
40—50                   3000                    3000 – 1250 = 1750
50—54                   3750                    3750 – 3000 = 750
54—60                   5000 – 750 = 4250       4250 – 3750 = 500
60—70                   —                       750 – 250 = 500
70 and over             —                       250
Since the number of workers with wages under Rs. 25 is 200 and further, since it is given that the wages vary from Rs. 20 to Rs. 80, the first class viz., under 25 can be taken as 20—25 and the last class, viz., 70 and over can be taken as 70—80.
In the above table, the various classes are of unequal widths. Rearranging and combining them to have classes with equal magnitude of 10 each, the final frequency distribution of wages of 5000 workers is as shown in the following table.
FREQUENCY DISTRIBUTION OF WAGES OF WORKERS
Hourly wages (in Rs.)   No. of workers (f)
20—30                   200 + 300 = 500
30—40                   750
40—50                   1750
50—60                   750 + 500 = 1250
60—70                   500
70—80                   250

EXERCISE 5·2
1. Define median and discuss its relative merits and demerits.
2. The mean is the most common measure of central tendency of the data. It satisfies almost all the requirements of a good average. The median is also an average, but it does not satisfy all the requirements of a good average. However, it carries certain merits and hence is useful in particular fields. Critically examine both the averages.
3. What do you understand by central tendency ? Under what conditions is median more suitable than other measures of central tendency ?
4. In each of the following cases, explain whether the description applies to mean, median or both :
(i) Can be calculated from a frequency distribution with open end classes.
(ii) The values of all items are taken into consideration in the calculation.
(iii) The values of extreme items do not influence the average.
(iv) In a distribution with a single peak and moderate skewness to the right, it is closer to the concentration of the distribution.
Ans. (i) median, (ii) mean, (iii) median, (iv) median.
5. Find the medians of the following two series :
(i) 38 34 39 35 32 31 37 30 41
(ii) 30 31 36 33 29 28 35 36
Ans. (i) 35, (ii) 32.
6. What are the properties of median ? Following are the marks obtained by a batch of 10 students in a certain class test in Statistics (X) and Accountancy (Y).
Roll No. :   1   2   3   4   5   6   7   8   9  10
X        :  63  64  62  32  30  60  47  46  35  28
Y        :  68  66  35  42  26  85  44  80  33  72
In which subject is the level of knowledge of the students higher ?
Ans. Md (X) = 46·5, Md (Y) = 55. Level of knowledge of students is higher in Accountancy.
7. Find mean and median from the data given below :
Marks obtained  : 0—10  10—20  20—30  30—40  40—50  50—60
No. of students :   12     18     27     20     17      6
Ans. Mean = 28, Median = 27·41
8. Calculate arithmetic mean and median from the following series :
Income (Rs.) : 0—5  5—10  10—15  15—20  20—25  25—30
Frequency    :   5     7     10      8      6      4
[C.S. (Foundation), Dec. 2000]
Ans. Arithmetic mean = 14·375 ; Median = 14.
9. For the data given below, find the missing frequency if the Arithmetic Mean is Rs. 33. Also find the median of the series :
Loss per shop (Rs.) : 0—10  10—20  20—30  30—40  40—50  50—60
No. of shops        :   10     15     30      —     25     20
[C.A. (Foundation), Nov. 2000]
Ans. Missing frequency = 25 ; Median = 33
10. Given below is the distribution of marks obtained by 140 students in an examination.
Marks           : 10—19  20—29  30—39  40—49  50—59  60—69  70—79  80—89  90—99
No. of students :     7     15     18     25     30     20     16      7      2
Find the median of the distribution. [C.A. PEE-1, May 2004]
Ans. 51·167.
11. Compute median from the following data :
Mid-value : 115  125  135  145  155  165  175  185  195
Frequency :   6   25   48   72  116   60   38   22    3
Hint. The class intervals are : 110—120, 120—130, ……, 190—200
Ans. Median = 153·79.
12.
You are given below a certain statistical distribution :
Value     : Less than 100  100—200  200—300  300—400  400 and above  Total
Frequency :            40       89      148       64             39    380
Calculate the most suitable average giving reasons for your choice.
Ans. Md = 241·22.
13. The following table gives the distribution of marks secured by some students in a certain examination :
Marks           : 0—20  21—30  31—40  41—50  51—60  61—70  71—80
No. of Students :   42     38    120     84     48     36     31
Find : (i) Median marks. (ii) The percentage of failure if minimum for a pass is 35 marks.
Ans. (i) Md = 40·46 (ii) 31·58%.
14. Calculate the median from the following data :
Weight (in gms.) : 410—419  420—429  430—439  440—449  450—459  460—469  470—479
No. of Apples    :      14       20       42       54       45       18        7
[Andhra Pradesh Univ. B.Com., 1999]
Ans. Median = 443·94 gms
15. Given below is the distribution of 140 candidates obtaining marks X or higher in a certain examination (all marks are given in whole numbers)
Marks (More than) :  10   20   30   40   50   60   70   80   90  100
Frequency         : 140  133  118  100   75   45   25    9    2    0
Calculate the mean and median marks obtained by the candidates.
Ans. Mean = 50·714, Median = 51·167.
16. The following table gives the weekly wages in rupees in a certain commercial organisation.
Weekly wages (’00 Rs.) : 30—  32—  34—  36—  38—  40—  42—  44—  46—  48—50
Frequency              :   3    8   24   31   50   61   38   21   12      2
Find : (i) the median and the first quartile, (ii) the number of wage earners receiving between Rs. 3700 and Rs. 4700 per week.
Ans. (i) Md = Rs. 4029·51 ; Q1 = Rs. 3777·42 ; (ii) 191.
17. Define a percentile. Find the 45th and 57th percentiles for the following data on marks obtained by 100 students :
Marks           : 20—25  25—30  30—35  35—40  40—45  45—50
No. of Students :    10     20     20     15     15     20
[C.A. (Foundation), May 1996]
Ans. P45 = 33·75 ; P57 = 37·33.
18. Find : (a) the 2nd decile, (b) the 4th decile, (c) the 90th percentile, and (d) the 68th percentile for the data given below, interpreting clearly the significance of each.
Age of Head of Family (years)    Number (in millions)
Under 25                          2·22
25—29                             4·05
30—34                             5·08
35—44                            10·45
45—54                             9·47
55—64                             6·63
65—74                             4·16
75 and over                       1·66
Total                            43·72
Ans. D2 = 31·94 years, D4 = 40·38 years, P90 = 67·98 years, P68 = 52·87 years.
19. Find the (i) Lower quartile, (ii) Upper quartile, (iii) 7th decile, and (iv) 60th percentile, for the following frequency distribution :
Wages (Rs.)    : 30—40  40—50  50—60  60—70  70—80  80—90  90—100
No. of Persons :     1      3     11     21     43     32       9
Ans. (i) Rs. 67·14, (ii) Rs. 83·44, (iii) Rs. 81·56, (iv) Rs. 78·37.
20. Draw an ogive for the data given below and show how the value of median can be read off from this graph. Verify your result.
Class Interval : 0—5  5—10  10—15  15—20  20—25  25—30
Frequency      :   5    10     15      8      7      5
Ans. Median = 13·5 (approx.); By formula, Md = 13·33.
21. Draw a ‘less than ogive’ from the following data and hence find out the value of lower quartile.
Class Interval : 0—5  5—10  10—20  20—30  30—40  40—50
Frequency      :   5     7     15     20      8      5
Ans. Q1 = 12.
22. The frequency distribution of heights of 100 college students is as follows :
Height (cms.) : 141—150  151—160  161—170  171—180  181—190  Total
Frequency     :       5       16       56       19        4    100
Draw an ogive (less than or more than type) of this distribution and from the ogive find (i) the first quartile, (ii) the median, (iii) the third quartile, and (iv) Inter-quartile Range.
Ans. Q1 = 161·2 cms, Q3 = 170·1 cms, Median = 165·7 cms, I.Q. Range = 8·9 cms.
23. The monthly salary distribution of 250 families in a certain locality in Agra is given below :
Monthly Salary (Rs.)    No. of Families
More than 0             250
More than 500           200
More than 1,000         120
More than 1,500          80
More than 2,000          55
More than 2,500          30
More than 3,000          15
More than 3,500           5
Draw a ‘less than’ ogive for the data given above and hence find out : (i) Limits of the income of middle 50% of the families ; and (ii) If income-tax is to be levied on families whose income exceeds Rs. 1,800 p.m., calculate the percentage of families, which will be paying income-tax. [Delhi Univ. B.Com. (Hons.), 2007]
Ans. (i) Q1 = Rs. 578 (approx.) ; Q3 = Rs. 1850
(ii) Number of families with income exceeding Rs. 1,800 = [25/(2000 – 1500)] × (2000 – 1800) + 25 + 15 + 10 + 5 = 65
∴ Percentage of families paying income tax = (65/250) × 100 = 26%.
24. Draw a ‘less than’ and ‘more than’ ogive curve for the following data and find median value :
No. of Children :   0   1   2   3   4   5   6
No. of Families : 150  72  50  28  12   8   5
[Delhi Univ. B.Com. (Pass), 1999]
Hint. Since the number of children is a discrete random variable which can take only positive integer values, the given frequency distribution can be expressed as grouped frequency distribution with exclusive type classes as given below.
Variable  : 0—1  1—2  2—3  3—4  4—5  5—6  6—7
Frequency : 150   72   50   28   12    8    5
Ans. Median from ogive = 1·1 (approx.).
25. With the help of given data, find : (i) Value of middle 50% items; (ii) Value of exactly 50% item; (iii) The value of P40 and D6; (iv) Graphically with the help of ogive curve, the values of Q1, Q3, median, P40 and D6 :
Class Interval : 10—14  15—19  20—24  25—29  30—34  35—39  Total
Frequencies    :     5     10     15     20     10      5     65
[Delhi Univ. B.Com. (Hons.), 2008]
Ans. (i) Q3 – Q1 = 29·19 – 19·92 = 9·27 ; (ii) Md = Q2 = 25·13 ; (iii) P40 = 23·17, D6 = 26·75
26. One hundred and twenty students appeared for a certain test and the following marks distribution was obtained :
Marks    : 0—20  20—40  40—60  60—80  80—100
Students :   10     30     36     30      14
Find :
(i) The limits of marks of middle 30% students.
(ii) The percentage of students getting marks more than 75.
(iii) The number of students who fail, if 35 marks are required for passing.
Ans.
(i) P35 = 41·1 ; P65 = 61·3 ; (ii) (100/120) × [(30/20) × 5 + 14] = 17·9% ; (iii) 10 + (15/20) × 30 = 32·5 ≈ 33.
27. The expenditure of 1,000 families is given as under :
Expenditure (in Rs.) : 40—59  60—79  80—99  100—119  120—139
No. of families      :    50      ?    500        ?       50
The median for the distribution is Rs. 87. Calculate the missing frequencies.
Ans. 262·5, 137·5 ≈ 263, 137.
28. An incomplete frequency distribution is given as follows :
Variable  : 10—20  20—30  30—40  40—50  50—60  60—70  70—80  Total
Frequency :    12     30      ?     65      ?     25     19    230
You are given that median value is 46.
(a) Using the median formula, fill up the missing frequencies.
(b) Calculate the Arithmetic Mean of the completed table.
Ans. (a) 34, 45 (b) 45·96
29. An incomplete distribution is given below :
Variable  : 0—10  10—20  20—30  30—40  40—50  50—60  60—70
Frequency :   10     20      ?     40      ?     25     15
(i) You are given that the median value is 35. Find out the missing frequencies (given the total frequency = 170).
(ii) Calculate the arithmetic mean of the completed table.
[Himachal Pradesh Univ. B.Com., 1999; Kerala Univ. B.Com., 1999]
Ans. (i) 35, 25 (ii) 35·88.
30. The data in the table below represent travel expenses (other than transportation) for 7 trips made during November by a salesman for a small firm. An auditor criticised these expenses as excessive, asserting that the average expense per day is Rs. 10 (Rs. 70 divided by 7). The salesman replied that the average is only Rs. 4·20 (Rs. 105 divided by 25) and that in any event the median is the appropriate measure and is only Rs. 3. The auditor rejoined that the arithmetic mean is the appropriate measure, but that the median is Rs. 6. You are required to :
Trip     Days     Expenses (Rs.)     Expenses per day (Rs.)
1         0·5      13·50              27
2         2·0      12·00               6
3         3·5      17·50               5
4         1·0       9·00               9
5         9·0      27·00               3
6         0·5       9·00              18
7         8·5      17·00               2
Total    25·0     105·00              70
(i) Explain the proper interpretation of each of the four averages mentioned.
(ii) Which average seems appropriate to you ?
31. For a certain class of workers, numbering 700, hourly wages vary between Rs. 30 and Rs. 75. 12% of the workers are earning less than Rs. 35 while 13% are getting equal to or more than Rs. 60, out of which 6% are earning between Rs. 70 and Rs. 75. The first quartile and median wages are, respectively, Rs. 40 and Rs. 47. The 40th and 65th percentiles are Rs. 43 and Rs. 53 respectively. You are required to put the above information in the form of a frequency distribution and estimate the mean wages of the workers.
Ans.
Hourly wages (Rs.) : 30—35  35—  40—  43—  47—  53—  60—
No. of workers     :    84   91  105   70  105   91   49
X̄ = Rs. 48·33.
32. For a certain group of saree weavers of Varanasi, the median and quartile earnings per hour are Rs. 44·3, Rs. 43·0 and Rs. 45·9 respectively. The earnings for the group range between Rs. 40 and Rs. 50. Ten per cent of the group earn under Rs. 42; 13% earn Rs. 47 and over, and 6% Rs. 48 and over. Put these data in the form of a frequency distribution and obtain the value of the mean wage.
Ans.
Hourly Wages (Rs.) : 40—42  42—  43—  44·3—  45·9—  47—  48—50
No. of workers     :    10   15   25     25     12    7      6
Mean = Rs. 44·50.

5·7. MODE
Mode is the value which occurs most frequently in a set of observations and around which the other items of the set cluster densely. In other words, mode is the value of a series which is predominant in it. In the words of Croxton and Cowden, “The mode of a distribution is the value at the point around which the items tend to be most heavily concentrated. It may be regarded as the most typical of a series of values.” According to A.M.
Tuttle, ‘Mode is the value which has the greatest frequency density in its immediate neighbourhood’. Accordingly mode may also be termed as the fashionable value (a derivation of the French word ‘la Mode’) of the distribution.
In the following statements :
(i) average size of the shoe sold in a shop is 7,
(ii) average height of an Indian (male) is 5 feet 6 inches (1·68 metres approx.),
(iii) average collar size of the shirt sold in a ready-made garment shop is 35 cms,
(iv) average student in a professional college spends Rs. 2,500 per month;
the average referred to is neither mean nor median but mode, the most frequent value in the distribution. For example, by the first statement we mean that there is maximum demand for the shoe of size No. 7.

5·7·1. Computation of Mode. In case of a frequency distribution, mode is the value of the variable corresponding to the maximum frequency. This method can be applied with ease and simplicity if the distribution is ‘unimodal’, i.e., if it has only one mode. In other words, this method can be used with convenience if there is only one value with highest concentration of observations. For example, in the distribution :
X :  1   2   3   4   5   6   7   8   9
f :  3   1  18  25  40  30  22  10   6
the maximum frequency is 40 and therefore, the corresponding value of X viz., 5 gives the value of mode. In case of a frequency curve (see Fig. 5·2) mode corresponds to the peak of the curve.

[Fig. 5·2. Frequency curve (X on the horizontal axis, frequency on the vertical axis) with the mode marked under the peak.]

In the case of continuous frequency distribution, the class corresponding to the maximum frequency is called the modal class and the value of mode is obtained by the interpolation formula :
$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{(f_1 - f_0) - (f_2 - f_1)} = l + \frac{h\,(f_1 - f_0)}{2f_1 - f_0 - f_2} \qquad \ldots(5\cdot18)$$
where
l is the lower limit of the modal class,
f1 is the frequency of the modal class,
f0 is the frequency of the class preceding the modal class,
f2 is the frequency of the class succeeding the modal class, and
h is the magnitude of the modal class.
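As a rough illustration, formula (5·18) can be sketched in Python as below. The function names `interpolated_mode` and `mode_of_distribution` are hypothetical and not part of the text. The sketch assumes a unimodal distribution with equal, continuous classes, and it simply takes the first class with the largest frequency as the modal class. The figures are those of Example 5·30.

```python
def interpolated_mode(l, h, f0, f1, f2):
    """Formula (5.18): Mode = l + h*(f1 - f0) / (2*f1 - f0 - f2)."""
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        # The interpolation formula breaks down when 2*f1 - f0 - f2 = 0.
        raise ValueError("2*f1 - f0 - f2 = 0, formula (5.18) cannot be used")
    return l + h * (f1 - f0) / denominator


def mode_of_distribution(lower_limits, h, freqs):
    """Pick the class with the maximum frequency and apply formula (5.18).

    Assumption of this sketch: classes are continuous and of equal width h,
    and the maximum frequency is not at either end of the distribution.
    """
    k = max(range(len(freqs)), key=lambda i: freqs[i])
    if k == 0 or k == len(freqs) - 1:
        raise ValueError("modal class lies at an end of the distribution")
    return interpolated_mode(lower_limits[k], h,
                             freqs[k - 1], freqs[k], freqs[k + 1])


# Data of Example 5.30: lower class boundaries 92.5, 97.5, ..., width 5.
lower_limits = [92.5, 97.5, 102.5, 107.5, 112.5, 117.5, 122.5, 127.5]
freqs = [3, 5, 12, 17, 14, 6, 3, 1]
print(mode_of_distribution(lower_limits, 5, freqs))  # 110.625
```

The two error branches reflect situations in which the formula is not applicable; they are a design choice of this sketch rather than a rule stated in the text.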
The symbols f0, f1 and f2 can be explained easily as follows :
f0 : Frequency of preceding class,
f1 : Maximum frequency (Frequency of Modal class),
f2 : Frequency of succeeding class.

Remarks 1. It may be pointed out that the formula (5·18) for computing mode is based on the following assumptions :
(i) The frequency distribution must be continuous with exclusive type classes without any gaps. If the data are not given in the form of continuous classes, they must first be converted into continuous classes before applying formula (5·18).
(ii) The class intervals must be uniform throughout i.e., the width of all the class intervals must be the same. In case of the distribution with unequal class intervals, they should be made equal under the assumption that the frequencies are uniformly distributed over all the classes, otherwise the value of mode computed from (5·18) will give misleading results.

2. However, the above technique of locating mode is not practicable in the following situations :
(i) If the maximum frequency is repeated or approximately equal concentration is found in two or more neighbouring values.
(ii) If the maximum frequency occurs either in the very beginning or at the end of the distribution.
(iii) If there are irregularities in the distribution i.e., the frequencies of the variable increase or decrease in a haphazard way.
In the above situations mode (or modal class in the case of continuous frequency distribution) is located by the method of grouping as discussed in Examples 5·31 and 5·32.

3. If the method of grouping gives the modal class which does not correspond to the maximum frequency f1 i.e., the frequency of modal class is not the maximum frequency, then in some situations we may get 2f1 – f0 – f2 = 0. [This will not be possible if f1 is maximum and f0 and f2 are less than f1.] In such a situation viz., 2f1 – f0 – f2 = 0, the value of mode cannot be computed by the formula
$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{2f_1 - f_0 - f_2}$$
as it gives Mode = l + ∞ = ∞. [∵ 2f1 – f0 – f2 = 0]
In such cases, the value of mode can be obtained by the formula :
$$\text{Mode} = l + \frac{h\,|f_1 - f_0|}{|f_1 - f_0| + |f_1 - f_2|} \qquad \ldots(5\cdot18a)$$
where | A | represents the absolute (positive) value of A.
Formula (5·18a) is only an approximate formula and does not give a very accurate result, because further grouping of classes, say, 4 at a time may give a different modal class and as such a different result.
As an illustration, for the following data :
X : 10—20  20—30  30—40  40—50  50—60  60—70  70—80  80—90  90—100  100—110
f :     4      6      5     10     20     22     24      6       2        1
the usual method of grouping (up to 3 classes at a time) will give 60—70 as the modal class such that : f1 = 22, f0 = 20, f2 = 24 and therefore, 2f1 – f0 – f2 = 44 – 20 – 24 = 0. Hence, the usual formula for mode cannot be applied. Using (5·18a), an approximate value of mode may be obtained as :
$$\text{Mo} = 60 + \frac{10\,|22 - 20|}{|22 - 20| + |22 - 24|} = 60 + \frac{10 \times 2}{2 + 2} = 60 + 5 = 65$$

5·7·2. Merits and Demerits of Mode.
Merits. (i) Mode is easy to calculate and understand. In some cases it can be located merely by inspection. It can also be estimated graphically from a histogram (c.f. § 5·7·3).
(ii) Mode is not at all affected by extreme observations and as such is preferred to arithmetic mean while dealing with extreme observations.
(iii) It can be conveniently obtained in the case of open end classes which do not pose any problems here.
Demerits. (i) Mode is not rigidly defined. It is ill-defined if the maximum frequency is repeated or if the maximum frequency occurs either in the very beginning or at the end of the distribution; or if the distribution is irregular. In these cases, its value is located by the method of grouping (c.f. Example 5·31). If the grouping method also gives two values of mode, then the distribution is called bi-modal distribution (c.f. Example 5·32).
We may also come across distributions with more than two modes, in which case it is called multimodal distribution. In case of bimodal or multimodal distributions, mode is not a representative measure of location and its estimate is obtained by the empirical relation : Mode = 3 Median – 2 Mean, discussed in § 5·8.
(ii) Since mode is the value of X corresponding to the maximum frequency, it is not based on all the observations of the series. Even in the case of the continuous frequency distribution [c.f. Formula (5·18)], mode depends only on the frequencies of the modal class and the classes preceding and succeeding it.
(iii) Mode is not suitable for further mathematical treatment. For example, from the modal values and the sizes of two or more series, we cannot find the mode of the combined series.
(iv) As compared with mean, mode is affected to a greater extent by the fluctuations of sampling.
Uses. Being the point of maximum density, mode is specially useful in finding the most popular size in studies relating to marketing, trade, business and industry. It is the appropriate average to be used to find the ideal size e.g., in business forecasting, in the manufacture of shoes or readymade garments, in sales, in production, etc.

5·7·3. Graphic Location of Mode. Mode can be located graphically from the histogram of frequency distribution by making use of the rectangles erected on the modal, pre-modal and post-modal classes. The method consists of the following steps :
(i) Join the top right corner of the rectangle erected on the modal class with the top right corner of the rectangle erected on the preceding class by means of a straight line.
(ii) Join the top left corner of the rectangle erected on the modal class with the top left corner of the rectangle erected on the succeeding class by a straight line.
(iii) From the point of intersection of the lines in steps (i) and (ii) above, draw a perpendicular to the X-axis (the horizontal scale).
The abscissa (X-coordinate) of the point where this perpendicular meets the X-axis gives the modal value.

5·8. EMPIRICAL RELATION BETWEEN MEAN (M), MEDIAN (Md) AND MODE (Mo)
In case of a symmetrical distribution mean, median and mode coincide i.e., Mean = Median = Mode (c.f. Chapter 7 on Skewness). However, for a moderately asymmetrical (non-symmetrical or skewed) distribution, mean and mode usually lie on the two ends and median lies in between them and they obey the following important empirical relationship, given by Prof. Karl Pearson.
Mode = Mean – 3 (Mean – Median) …(5·19)
⇒ Mean – Mode = 3 (Mean – Median)
⇒ Mean – Median = (1/3) (Mean – Mode) …(5·19a)
Thus we see that the difference between mean and mode is three times the difference between mean and median. In other words, median is closer to mean than mode. The above relation between mean (M), median (Md) and mode (Mo) can be exhibited diagrammatically as follows (Fig. 5·3) :

[Fig. 5·3. Relationship between arithmetic mean, median and mode : on the frequency curve the mode (Mo) lies under the peak of the curve, the median (Md) divides the area in halves and the mean (M) is at the centre of gravity.]

Remarks. 1. Equation (5·19) may be rewritten to give :
Mode = Mean – 3 Mean + 3 Median ⇒ Mode = 3 Median – 2 Mean …(5·20)
This formula is specially useful to determine the value of mode in case it is ill-defined, e.g., in the case of bimodal or multimodal distributions [c.f. Example 5·39].
2. If we know any two of the three values M, Md and Mo, the third can be estimated by using (5·20). The value so computed will be more or less the same as that obtained by using the exact formula provided the distribution is moderately asymmetrical.
3. For a positively skewed distribution [c.f. Chapter 7], mean will be greater than median and median will be greater than mode i.e.,
M > Md > Mo ⇒ Mo < Md < M
However, in a negatively skewed distribution the order of the magnitudes of the three averages will be reversed i.e., for negatively skewed distribution, we have
Mo > Md > M ⇒ M < Md < Mo

Example 5·29.
(a) Find the mode of the following distribution :
7, 4, 3, 5, 6, 3, 3, 2, 4, 3, 4, 3, 3, 4, 4, 2, 3
[I.C.W.A. (Foundation), June 2005]
(b) If the relation between two variables x and y be 2x + 5y = 24 and mode of y be 4, find the mode of x.
[I.C.W.A. (Foundation), June 2006]
Solution. (a) The frequency distribution of the variable (x) is obtained as given below.
x    Tally Marks    Frequency (f)
2    ||             2
3    |||| ||        7
4    ||||           5
5    |              1
6    |              1
7    |              1
Since the maximum frequency (7) corresponds to x = 3, the value of mode is 3.
(b) We are given :
Mode (y) = 4 …(*)
and 2x + 5y = 24 ⇒ x = (1/2) (24 – 5y) …(**)
If x and y are connected by the relation y = ax + b, then the frequencies of the variable y are the same as the frequencies of the corresponding values of the variable x. Hence, the modes of the variables x and y are also connected by the same equation, i.e.,
y = ax + b ⇒ Mode (y) = Mode (ax + b) = a [Mode (x)] + b …(***)
Hence on using (***), we get from (**),
Mode (x) = Mode [(1/2) (24 – 5y)] = 12 – (5/2) · Mode (y) = 12 – (5/2) × 4 = 2 [From (*)]

Example 5·30. Find the value of mean, mode and median from the data given below :
Weight (in kg.) : 93—97  98—102  103—107  108—112  113—117  118—122  123—127  128—132
No. of students :     3       5       12       17       14        6        3        1
Solution. Since the formula for mode requires the distribution to be continuous with ‘exclusive type’ classes we first convert the classes into class boundaries as given in the following table :

COMPUTATION OF MEAN, MODE AND MEDIAN
Weight (in kg)   Class boundaries   Mid-value (X)   No. of students (f)   d = (X – 110)/5   fd     Less than c.f.
93—97            92·5—97·5           95              3                    –3                 –9      3
98—102           97·5—102·5         100              5                    –2                –10      8
103—107          102·5—107·5        105             12                    –1                –12     20
108—112          107·5—112·5        110             17                     0                  0     37
113—117          112·5—117·5        115             14                     1                 14     51
118—122          117·5—122·5        120              6                     2                 12     57
123—127          122·5—127·5        125              3                     3                  9     60
128—132          127·5—132·5        130              1                     4                  4     61
                                                     N = ∑f = 61                             ∑fd = 8

Mean.
$$\text{Mean} = A + \frac{h \sum fd}{N} = 110 + \frac{5 \times 8}{61} = 110.66 \text{ kgs.}$$
Mode. Here maximum frequency is 17.
The corresponding class 107·5—112·5 is the modal class. Using the mode formula, we get
$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{2f_1 - f_0 - f_2} = 107.5 + \frac{5 \times (17 - 12)}{2 \times 17 - 12 - 14} = 107.5 + \frac{25}{8} = 107.5 + 3.125 = 110.625 \text{ kgs.}$$
Median. (N/2) = (61/2) = 30·5. The c.f. just greater than 30·5 is 37. Hence, the corresponding class 107·5—112·5 is the median class. Using the median formula, we get
$$\text{Md} = 107.5 + \frac{5}{17}\,(30.5 - 20) = 107.5 + \frac{5 \times 10.5}{17} = 107.50 + 3.09 = 110.59 \text{ kg.}$$

Example 5·31. Construct a frequency distribution showing the frequencies with which words of different number of letters occur in the extract reproduced below (omitting punctuation marks, treating as the variable the number of letters in each word) and obtain the median and the mode of the distribution.
A candidate at the time of applying for registration as a student of the institute should be not less than eighteen years of age and have passed the intermediate examination of a university constituted by law in India or an examination recognized by Central Government as equivalent thereto, or the National Diploma in Commerce Examination or the Diploma in Rural Service Examination conducted by the National Council of Rural Higher Education.
Solution. Here the variable X represents the number of letters in each word. For example, in the word ‘candidate’ there are 9 letters c, a, n, d, i, d, a, t and e. Hence X, corresponding to the word ‘candidate’ is 9. Thus replacing each word by the number of letters in it, the distribution of the number of letters in each word in the given paragraph is as follows :
1, 4, 2, 8, 9, 4, 5, 11,
2, 8, 2, 2, 3, 5, 2, 3,
4, 2, 11, 7, 2, 3, 10, 2,
8, 3, 2, 5, 3, 4, 3, 7,
12, 6, 7, 11, 2, 3, 10, 8,
1, 12, 2, 2, 7, 11, 10, 3,
2, 2, 7, 8, 3, 1, 2, 7,
9, 10, 3, 2, 6, 11, 8, 5,
2, 2, 7, 6, 3, 3, 2, 9
The above data can be arranged in the form of a frequency distribution as given in the following table.

COMPUTATION OF MEDIAN
No. of letters in a word (X)   Tally Marks            Frequency (f)   (Less than) c.f.
1                              |||                     3               3
2                              |||| |||| |||| ||||    19              22
3                              |||| |||| ||           12              34
4                              ||||                    4              38
5                              ||||                    4              42
6                              |||                     3              45
7                              |||| ||                 7              52
8                              |||| |                  6              58
9                              |||                     3              61
10                             ||||                    4              65
11                             ||||                    5              70
12                             ||                      2              72

Median. Here N/2 = 72/2 = 36. Since c.f. just greater than 36 is 38, the corresponding value of X is median, which is 4.
Mode : Since the above frequency distribution is not regular, we cannot say that the value of mode is 2, which corresponds to the maximum frequency 19. Here we locate mode by the method of grouping as explained in the table below.

COMPUTATION OF MODE : GROUPING TABLE
Column   Frequencies combined              Group totals, in order of X (* = maximum)
(1)      original frequencies              3, 19*, 12, 4, 4, 3, 7, 6, 3, 4, 5, 2
(2)      two by two, from X = 1            22*, 16, 7, 13, 7, 7
(3)      two by two, from X = 2            31*, 8, 10, 9, 9
(4)      three by three, from X = 1        34*, 11, 16, 11
(5)      three by three, from X = 2        35*, 14, 13
(6)      three by three, from X = 3        20*, 16, 12

The frequencies in column (1) are the original frequencies. Column (2) is obtained by combining the frequencies two by two in column (1). Column (3) is obtained on combining the frequencies two by two in column (1) after leaving the first frequency. If we leave the first two frequencies and combine the frequencies two by two in column (1), we shall get a repetition of values obtained in column (2). Hence, we proceed to combine the frequencies in column (1) three by three to get column (4). The combination of frequencies three by three after leaving the first frequency and first two frequencies in column (1) results in columns (5) and (6) respectively. If we combine the frequencies three by three after leaving the first three frequencies in column (1), we get a repetition of values obtained in column (4). The maximum frequency in each column is marked with an asterisk (*).
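The grouping procedure, together with the count of how often each value of X falls in a column's maximum group, can be sketched in Python as follows. This is only an illustrative sketch: the function name `grouping_mode` is hypothetical, and the choice to count every tied maximum group in a column is an assumption of this sketch. The data are the frequencies of Example 5·31.

```python
from collections import Counter


def grouping_mode(values, freqs):
    """Locate the mode by the method of grouping (columns (1) to (6)).

    Returns the value(s) occurring most often in the maximum groups,
    together with the full count for every value.
    """
    n = len(freqs)
    votes = Counter()
    # (group size, number of leading frequencies left out) for columns (1)-(6)
    layouts = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for size, skip in layouts:
        # Only complete groups are formed; leftover frequencies are ignored.
        groups = [
            (sum(freqs[i:i + size]), values[i:i + size])
            for i in range(skip, n - size + 1, size)
        ]
        best = max(total for total, _ in groups)
        for total, members in groups:
            if total == best:  # ties: every maximum group is counted
                votes.update(members)
    top = max(votes.values())
    return [v for v in values if votes[v] == top], votes


# Frequencies of word lengths 1 to 12 from Example 5.31.
values = list(range(1, 13))
freqs = [3, 19, 12, 4, 4, 3, 7, 6, 3, 4, 5, 2]
modes, votes = grouping_mode(values, freqs)
print(modes)  # [2]
```

If the returned list holds more than one value, the grouping method has not singled out a mode.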
For computing the value of mode we prepare the following analysis table :

ANALYSIS TABLE
Column No. in above Table (I)   Maximum frequency (II)   Value or combination of values of X corresponding to maximum frequency in column (II) (III)
(1)                             19                       2
(2)                             22                       1, 2
(3)                             31                       2, 3
(4)                             34                       1, 2, 3
(5)                             35                       2, 3, 4
(6)                             20                       3, 4, 5
Frequency of the variable (X) :  X = 1 : 2 ;  X = 2 : 5 ;  X = 3 : 4 ;  X = 4 : 2 ;  X = 5 : 1

Since the value 2 is repeated maximum number (5) of times, the mode is 2.

Example 5·32. Calculate mode from the following data :
Marks             No. of Students
Below 10            4
Below 20            6
Below 30           24
Below 40           46
Below 50           67
Below 60           86
Below 70           96
Below 80           99
Below 90          100
Solution. Since we are given the cumulative frequency distribution of marks, first we shall convert it into the frequency distribution as given in the following table.
Marks     Frequency (f)
0—10      4
10—20     6 – 4 = 2
20—30     24 – 6 = 18
30—40     46 – 24 = 22
40—50     67 – 46 = 21
50—60     86 – 67 = 19
60—70     96 – 86 = 10
70—80     99 – 96 = 3
80—90     100 – 99 = 1
Further, since the frequencies first decrease, then increase and again decrease, the distribution is irregular and hence the modal class is located by the method of grouping as explained in the table given below.

GROUPING TABLE
Column   Frequencies combined               Group totals, in order of classes
(1)      original frequencies               4, 2, 18, 22, 21, 19, 10, 3, 1
(2)      two by two, from 0—10              6, 40, 40, 13
(3)      two by two, from 10—20             20, 43, 29, 4
(4)      three by three, from 0—10          24, 62, 14
(5)      three by three, from 10—20         42, 50
(6)      three by three, from 20—30         61, 32

For computing modal class, we prepare the analysis table as given below :

ANALYSIS TABLE
Column No. in above Table (I)   Maximum frequency (II)   Class(es) corresponding to maximum frequency in (II) (III)
(1)                             22                       30—40
(2)                             40                       20—30, 30—40 ; 40—50, 50—60
(3)                             43                       30—40, 40—50
(4)                             62                       30—40, 40—50, 50—60
(5)                             50                       40—50, 50—60, 60—70
(6)                             61                       20—30, 30—40, 40—50
Number of times the class occurs :  20—30 : 2 ;  30—40 : 5 ;  40—50 : 5 ;  50—60 : 3 ;  60—70 : 1

In the above table there are two classes viz., 30—40 and 40—50 which are repeated maximum number (5) of times and as such we cannot decide about the modal class.
Thus, even the method of grouping fails to give the modal class. We say that in the above example mode is ill-defined and we locate it by the empirical formula :
Mo = 3Md – 2M …(*)
For computation of Mean (M) and Median (Md), see the calculation table below.
$$\text{Mean} = A + \frac{h \sum fd}{N} = 45 + \frac{10 \times (-28)}{100} = 45 - 2.8 = 42.2$$
Here N/2 = 100/2 = 50. Since c.f. just greater than 50 is 67, the corresponding class 40—50 is the median class. Hence, using the median formula, we get
$$\text{Median} = 40 + \frac{10}{21}\left(\frac{100}{2} - 46\right) = 40 + \frac{10 \times 4}{21} = 40 + 1.9 = 41.9$$

COMPUTATION OF ARITHMETIC MEAN AND MEDIAN
Marks     Mid-value (X)   Frequency (f)   Less than c.f.   d = (X – 45)/10   fd
0—10       5               4                4              –4                –16
10—20     15               2                6              –3                 –6
20—30     25              18               24              –2                –36
30—40     35              22               46              –1                –22
40—50     45              21               67               0                  0
50—60     55              19               86               1                 19
60—70     65              10               96               2                 20
70—80     75               3               99               3                  9
80—90     85               1              100               4                  4
                          ∑f = 100                                           ∑fd = –28

Substituting the values of M and Md in (*), we get
Mode = 3 × 41·9 – 2 × 42·2 = 125·7 – 84·4 = 41·3.

Example 5·33. In 500 small scale units, the return on investment ranged from 0 to 30 per cent, no unit sustaining any loss. Five per cent of the units had returns ranging from zero per cent to 5 per cent, 15 per cent of the units earned returns between 5 per cent and 10 per cent. The median rate of return was 15 per cent and the upper quartile was 20 per cent. The uppermost layer of returns of 25—30 per cent was earned by 50 units. Put this information in the form of a frequency table and find the rate of return around which there is maximum concentration of units.
[Delhi Univ. B.Com. (Hons.) (External), 2007]
Solution.
On the basis of the given information we have : N = Total number of units = 500
(1) Number of units with returns ranging from 0 to 5% = 5% of 500 = (5/100) × 500 = 25
(2) Number of units with returns between 5% and 10% = 15% of 500 = (15/100) × 500 = 75
(3) Median rate of return = 15% ⇒ 50% of N = 500/2 = 250 units have return ≤ 15% …(i)
∴ Number of units with return exceeding 10% but not exceeding 15% = 250 – (25 + 75) = 150
Upper Quartile (Q3) = 20% ⇒ 3N/4 = (3 × 500)/4 = 375 units have returns ≤ 20%
∴ Number of units with returns exceeding 15% but ≤ 20% = 375 – 250 = 125 [using (i)]
Number of units with returns between 25% and 30% = 50 (Given)
Hence, by the residual balance, the number of units with returns between 20% and 25%
= 500 – [25 + 75 + 150 + 125 + 50] = 500 – 425 = 75
Thus, the given information can be summarised in the form of a frequency distribution as given in the following table.
Return in %    No. of units (f)
0—5             25
5—10            75   (f0)
10—15          150   (f1)   Modal Class
15—20          125   (f2)
20—25           75
25—30           50
The rate of return about which there is maximum concentration of units is given by the Mode of the rate of returns. Since maximum frequency is 150, the modal class is 10—15.
$$\therefore\ \text{Mode} = l + \frac{h\,(f_1 - f_0)}{2f_1 - f_0 - f_2} = 10 + \frac{5\,(150 - 75)}{300 - 75 - 125} = 10 + \frac{5 \times 75}{100} = 10 + 3.75 = 13.75$$
Hence, the rate of return around which there is maximum concentration of units is 13·75%.

Example 5·34. Below is given the frequency distribution of weights of a group of 60 students of a class in a school :
Weight in kg.      : 30—34  35—39  40—44  45—49  50—54  55—59  60—64
Number of students :     3      5     12     18     14      6      2
(a) Draw histogram for this distribution and find the modal value.
(b) (i) Prepare the cumulative frequency (both less than and more than types) distribution, and (ii) represent them graphically on the same graph paper. Hence, find the (iii) median, and (iv) co-efficient of quartile deviation.
(c) With the modal and the median values as obtained in (a) and (b), use an appropriate empirical formula to find the arithmetic mean of this distribution.
(d) If students with weight below 40 kg. are eliminated from the frequency distribution, what will be the revised mean ? [Calculate the mean of the two rejected classes only and use the result obtained in (c).]
Solution. (a) To draw the histogram and cumulative frequency curves (both less than and more than types) we first convert the distribution into continuous class intervals as given in the table below.

Weight in kgs.    Number of students (f)    Less than c.f.    More than c.f.
29·5–34·5                  3                      3                 60
34·5–39·5                  5                      8                 57
39·5–44·5                 12                     20                 52
44·5–49·5                 18                     38                 40
49·5–54·5                 14                     52                 22
54·5–59·5                  6                     58                  8
59·5–64·5                  2                     60                  2

(a) Histogram. Histogram is obtained on erecting rectangles on the class intervals with heights proportional to the corresponding class frequencies. [See Fig. 5·4.]
Mode = OA = 48.
(b) (i) The ‘less than’ and ‘more than’ cumulative frequency distributions are given in the Table in Part (a).
(ii) The ‘less than’ and ‘more than’ ogives are drawn in Fig. 5·5.

[Fig. 5·4. Histogram : frequency (Y-axis) against class intervals 29·5 to 64·5 (X-axis) ; the mode is read at the point A on the X-axis.]
[Fig. 5·5. Ogives : ‘less than’ and ‘more than’ ogives intersecting at P, with horizontal lines at N/4, N/2 and 3N/4 ; the points L, Q and M on the X-axis mark Q1 = 42·9, Q2 = 47·8 and Q3 = 52·0 (approx.).]

(iii) From the point of intersection P of the two curves (ogives), draw a perpendicular on the X-axis meeting it at Q. Then OQ = 47·8 kgs. gives the median weight.
(iv) Draw lines parallel to the X-axis at frequencies equal to N/4 and 3N/4, meeting the less than ogive at points R and S respectively. From R and S draw perpendiculars to the X-axis meeting OX at L and M respectively. Then Q1 = OL = 42·9 kgs. and Q3 = OM = 51·75 kgs.
The Coefficient of Quartile Deviation is given by :
$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{51.75 - 42.90}{51.75 + 42.90} = \frac{8.85}{94.65} = 0.0935$$
(c) The empirical relation between mean, median and mode is given by : Mo = 3Md – 2M
$$\therefore\ \text{Mean } (M) = \frac{3\,\text{Md} - \text{Mo}}{2} = \frac{3 \times 47.8 - 48}{2} = \frac{143.4 - 48}{2} = \frac{95.4}{2} = 47.700 \text{ kgs.}$$
(d) Let $\bar{X}_1$ be the mean of the two classes with weight below 40 kgs.

Weight in kgs.    Mid-value (X)    (f)    fX
30–34                  32           3      96
35–39                  37           5     185
Total                        ∑f = 8 = n1 (say)    ∑fX = 281

$$\therefore\ \bar{X}_1 = \frac{\sum fX}{\sum f} = \frac{281}{8} = 35.125 \text{ kgs.}$$
Let $\bar{X}_2$ be the mean of the distribution obtained on eliminating the first two classes (i.e., classes with weight below 40 kgs.). Then in the usual notations, we have
$n_1 = 8$, $\bar{X}_1 = 35.125$ ; $n_2 = 60 - 8 = 52$, $\bar{X}_2 = ?$ ; $\bar{X} = 47.700$ [From Part (c)]
Using the formula $\bar{X} = \dfrac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$, we get
$$60 \times 47.700 = 8 \times 35.125 + 52\,\bar{X}_2 \ \Rightarrow\ \bar{X}_2 = \frac{2862 - 281}{52} = \frac{2581}{52} = 49.635 \text{ kgs.}$$

EXERCISE 5·3
1. What do you understand by mode ? Discuss its relative merits and demerits as a measure of central tendency. Also give two practical situations where you will recommend the use of mode.
2. What are the desiderata of a good average ? Compare the mean, the median and the mode in the light of these desiderata. Why are averages called measures of central tendency ?
3. How would you account for the predominant choice of arithmetic mean of statistical data as a measure of central tendency ? Under what circumstances would it be appropriate to use mode or median ?
4. Point out the merits and demerits of the mean, the median and the mode as measures of central tendency of numerical data.
5.
Compare, giving illustrations, the arithmetic mean, the median and the mode in regard to : (a) the effect of extreme items in computation, (b) ease in computation, (c) stability in sampling situations, (d) existence of the average as an actual case, and (e) popular use.
6. (a) Define mode. When is mode said to be ill-defined ? [Delhi Univ. B.Com. (Pass), 1997]
(b) A stockist of readymade garments should follow which type of average and why ? [Delhi Univ. B.Com. (Pass), 2001]
7. The Bharat Ball Bearings Ltd., has collected the following data.
12, 19, 21, 30, 13, 19, 22, 31, 17, 20, 24, 31, 18, 21, 27, 31.
(i) Compute the arithmetic mean, the median and the mode using the sixteen observations given.
(ii) Why is the mode said to be an erratic measure of central tendency ?
(iii) Why is the median called a position average ?
Ans. A.M. = 22·25, Md = 21, Mo = 31
8. Calculate mean, median and mode from the following data of the heights in inches of a group of students :
61, 62, 63, 61, 63, 64, 64, 60, 65, 63, 64, 65, 66, 64
Now suppose that a group of students whose heights are 60, 66, 59, 68, 67, and 70 inches, is added to the original group. Find mean, median and mode of the combined group.
Ans. First group : M = 63·2, Md = 63·5, Mo = 64
Combined group : M = 63·75, Md = 64, Mo = 64.
9. Atul gets a pocket money allowance of Rs. 12 per day. Thinking that this was rather less, he asked his friends about their allowances and obtained the following data, which includes his allowance also (amounts in Rs.) :
12, 18, 10, 5, 25, 20, 20, 22, 15, 10, 10, 15, 13, 20, 18, 10, 15, 10, 18, 15, 12, 15, 10, 15, 10, 12, 18, 20, 5, 8.
He presented these data to his father and asked for an increase in his allowance as he was getting less than the average amount. His father, a statistician, countered pointing out that Atul’s allowance was actually more than the average amount. Reconcile these statements.
Ans. Atul computed A.M. and his father computed Mode.
10.
The number of fully formed apples on 100 plants were counted with following results :

No. of apples :   0   1   2   3   4   5   6   7   8   9   10
No. of plants :   2   5   7  11  18  24  12   8   6   4    3

(i) How many apples were there in all ? (ii) What was the average number of apples per plant ? (iii) What was the modal number of apples ? [Delhi Univ., B.Com., 1989, Allahabad Univ. B.Com., 1996]
Ans. (i) 486, (ii) $\bar{X}$ = 4·86, (iii) Mo = 5.
11. Given below is the frequency distribution of marks obtained by 90 students. Compute the arithmetic mean, median and mode.

Marks      No. of students      Marks      No. of students
15–19             6             45–49             9
20–24            14             50–54            10
25–29            12             55–59             5
30–34            10             60–64             4
35–39            10             65–69             1
40–44             9

Ans. Mean = 37·17, Md = 36, Mo = 23·5.
12. Find out the median and mode from the following table :

No. of days absent    No. of students      No. of days absent    No. of students
Less than 5                 29             Less than 30               644
Less than 10               224             Less than 35               650
Less than 15               465             Less than 40               653
Less than 20               582             Less than 45               655
Less than 25               634

Ans. Md = 12·75, Mo = 11·35.
13. Find out the Mean, Median and the Mode in the following series :

Size (below) :   5   10   15   20   25   30   35
Frequency    :   1    3   13   17   27   36   38

(Andhra Pradesh Univ. B.Com., 1998)
Ans. Mean = 19·74, Md = 21, Mo = 24·3.
14. In 500 small scale industrial units, the return on investment ranged from 0 to 30%, no unit sustaining any loss. 5% of industrial units had returns exceeding 0% but not exceeding 5%. 15% of units had returns exceeding 5% but not exceeding 10%. Median and upper quartile rate of return was 15% and 20% respectively. The uppermost layer of returns exceeding 25% but not exceeding 30% was earned by 25%. Present this information in the form of frequency table with intervals as follows :
Exceeding 0% but not exceeding 5% ; Exceeding 5% but not exceeding 10% ;
Exceeding 10% but not exceeding 15% ; Exceeding 15% but not exceeding 20% ;
Exceeding 20% but not exceeding 25% ; Exceeding 25% but not exceeding 30%.
Use N/4, 2N/4, 3N/4 as ranks of lower, middle and upper quartiles respectively. Find the rate of return around which there is maximum concentration of units. [Delhi Univ. B.Com. (Hons.), 2008]
Ans.
Return in %   :   0–5   5–10   10–15   15–20   20–25   25–30
No. of units  :    25    75     150     125      0      125
Mode = 13·75 ; Rate of return around which there is maximum concentration of units is 13·75%.
15. Calculate the arithmetic mean and the median of the frequency distribution given below. Hence calculate the mode using the empirical relation between the three.

Class limits :   130–134   135–139   140–144   145–149   150–154   155–159   160–164
Frequency    :      5         15        28        24        17        10         1

Ans. M = 145·35, Md = 144·92, Mo = 144·06.
16. (a) Briefly explain the role of grouping and analysis table in calculation of mode. [Delhi Univ. B.Com. (Pass), 1999]
(b) From the following data of weight of 122 persons determine the modal weight by the method of grouping.

Weight (in lbs.) :   100–110   110–120   120–130   130–140   140–150   150–160   160–170   170–180
No. of persons   :      4         6        20        32        33        17         8         2

[Osmania Univ. B.Com. 1998]
Hint. Method of grouping gives two modal classes 130–140 and 140–150 i.e., the distribution is bimodal. Locate the value of mode by using the empirical relation Mo = 3Md – 2M.
Ans. Mean (M) = 139·51 ; Median (Md) = 139·69 ; Mode (Mo) = 140·05.
17. Calculate the Mode, Median and Arithmetic average from the following data.

Class :   0–2   2–4   4–10   10–15   15–20   20–25   25–30   30–40   40–50   50–60   60–80   80–100
f     :    8    12     20      10      16      25      45      60      20      13      15       4

Hint. Rewrite the frequency distribution with classes of equal magnitude 10.
Ans. Mo = 28·15, Md = 28·29, Mean = 30·08.
18. In the following data, two class frequencies are missing.

Class     :   100–110   110–120   120–130   130–140   140–150   150–160   160–170   170–180   180–190   190–200
Frequency :      4         7        15         ?        40         ?        16        10         6         3

However, it was possible to ascertain that the total number of frequencies was 150 and that the median has been correctly found to be 146·25. You are required to find out with the help of the information given :
(i) Two missing frequencies.
(ii) Having found the missing frequencies, calculate arithmetic mean.
(iii) Without using the direct formula, find the value of the mode.
Ans. (i) 24, 25 ; (ii) $\bar{X}$ = 147·33 ; (iii) Mode = 144·08
19. The median and mode of the following hourly wage distribution are known to be Rs. 33·5 and Rs. 34 respectively. Three frequency values from the table are, however, missing. You are required to find out those values.

Wages in Rs.   :   0–10   10–20   20–30   30–40   40–50   50–60   60–70   Total
No. of persons :     4      16      ?       ?       ?       6       4      230

Ans. 60, 100, 40.
20. You are given the following incomplete frequency distribution. It is known that the total frequency is 1000 and that the median is 413·11. Estimate by calculation the missing frequencies and find the value of the mode.

Value (X)     :   300–325   325–350   350–375   375–400   400–425   425–450   450–475   475–500
Frequency (f) :      5        17        80         ?        326        ?        88         9

Ans. Missing frequencies are 227 and 248 respectively. Mo = 413·98.
21. “Hari put the jar of water and the packet of sweets on the ground and sat down in the shade of the tree and waited.”
Prepare a frequency distribution for the words in the above sentence taking the number of letters in words as the variable. Calculate the mean, median and mode.
Ans. Mean = 3·56, Median = Mode = 3.
22. Treating the number of letters in each word in the following passage as the variable x, prepare the frequency distribution table and obtain its mean, median, mode.
“The reliability of data must always be examined before any attempt is made to base conclusions upon them. This is true of all data, but particularly so of numerical data, which do not carry their quality written large on them.
It is a waste of time to apply the refined theoretical methods of Statistics to data which are suspect from the beginning.”
Ans. Mean = 4·565, Median = 4, Mode = 3.
23. The frequency distribution of marks obtained by 60 students of a class in a college is given below :

Marks           :   30–34   35–39   40–44   45–49   50–54   55–59   60–64
No. of Students :     3       5      12      18      14       6       2

(i) Draw Histogram for this distribution and find the modal value.
(ii) Draw a cumulative frequency curve and find the marks limits of the middle 50% students. [Delhi Univ. B.Com. (Hons.), 1991]
Ans. (i) Mode = 47·5 marks, (ii) Q1 = 42·5 marks, Q3 = 52 marks.
24. Determine the values of Median and Mode of the following distribution graphically. Verify the results by actual calculations. After verifying, calculate the value of Mean and sketch a curve indicating the general shape of the distribution and comment.

Size      :   10–19   20–29   30–39   40–49   50–59   60–69   70–79   80–89   90–99
Frequency :    11      19      21      16      10       8       6       3       1

[Delhi Univ. B.Com. (Hons.), 2009]
Hint. Change classes into class boundaries for Md and Mode. Use Ogive for Md and Histogram for Mode graphically.
Using Formula ; Md = 37·83, Mo = 32·35 ; Mean = [(3Md – Mo)/2] = 40·57
M > Md > Mo ⇒ Distribution is positively skewed.
25. In a moderately skewed distribution :
(a) Arithmetic mean = 24·6 and the mode = 26·1. Find the value of the median and explain the reason for the method employed.
(b) In a moderately asymmetrical distribution the value of median is 42·8 and the value of mode is 40. Find the mean.
(c) In a moderately asymmetrical distribution the value of mean is 75 and value of mode is 60. Find the value of median. [Delhi Univ. B.Com. (Pass), 1996]
Ans. (a) Median = 25·1, (b) Mean = 44·2, (c) Median = 70.
26. Find out the missing figures :
(a) Mean = ? (3 Median – Mode) ;
(b) Mean – Mode = ? (Mean – Median) ;
(c) Median = Mode + ? (Mean – Mode) ;
(d) Mode = Mean – ? (Mean – Median).
Ans. (a) 1/2, (b) 3, (c) 2/3, (d) 3.
27.
(a) Which average would you use in the following situations :
(i) Sale of shirts : 16″, 15½″, 15″, 15″, 14″, 13″, 15″.
(ii) Marks obtained : 10, 8, 12, 4, 7, 11 and X, (X < 5).
Justify your answer.
Ans. (i) Mode, (ii) Median
(b) A.M. and Median of 50 items are 100 and 95 respectively. At the time of calculations two items 180 and 90 were wrongly taken as 100 and 10. What are the correct values of Mean and Median ?
Ans. Mean = 103·2 ; Median is same viz., 95.
(c) Can the values of mean, mode and median be same ? If yes, state the situation.
Ans. M = Md = Mo, for symmetrical distribution.
28. (a) Find out the missing figure ; Mean = ? (3 Median – Mode)
Ans. 1/2.
(b) In a moderately asymmetrical distribution, the values of mode and median are 20 and 24 respectively. Locate the value of mean.
Ans. 26 [since Mean = (3 Median – Mode)/2 = (72 – 20)/2].
29. Fill in the blanks :
(i) …… can be calculated from a frequency distribution with open end classes.
(ii) In the calculation of …… , all the observations are taken into consideration.
(iii) …… is not affected by extreme observations.
(iv) Average rainfall of a city from Monday to Saturday is 0·3 inch. Due to heavy rainfall on Sunday, the average rainfall for the week increases to 0·5 inch. The rainfall on Sunday was ……
(v) The sum of squared deviations is minimum when taken from ……
(vi) The sum of absolute deviations is minimum when taken from ……
(vii) Median = …… Quartile.
(viii) Mean is …… by extreme observations.
(ix) Median is the average suited for …… classes.
(x) For studying phenomenon like intelligence and honesty …… is a better average to be used while for phenomenon like size of shoes or readymade garments the average to be preferred is …… .
(xi) Typist A can type a sheet in 5 minutes, typist B in 6 minutes and typist C in 8 minutes. The average number of sheets typed per hour per typist is ……
(xii) The mean of 10 observations is 20 and median is 15. If 5 is added to each observation, the new mean is …… and median is ……
(xiii) A distribution with two modes is called …… and with more than two modes is called …… .
(xiv) Average suited for a qualitative phenomenon is …… .
(xv) If 25% of the observations lie above 80, 40% of the observations are less than 50 and 70% are greater than 40, then …… = 80 ; …… = 50 ; …… = 40
(xvi) Relationship between Md, Q1, Q2 and Q3 is ……
(xvii) D5, P80, Md, D7 and P50 are related by ……
(xviii) Relationship between D4, Q2, P60, P75 and Q3 is ……
(xix) The empirical relationship between mean, median and mode for a moderately asymmetrical distribution is ……
(xx) If the maximum frequency is repeated then mode is located by the method of ……
Ans. (i) Md or Mo (ii) Mean (iii) Md or Mo (iv) 1·7″ (v) Mean (vi) Median (vii) Second (viii) Very much affected (ix) Open end (x) Median, Mode (xi) 9·83 [the typists type 12, 10 and 7·5 sheets per hour] (xii) 25, 20 (xiii) Bi-modal, Multi-modal (xiv) Median (xv) Q3 = 80, P40 = D4 = 50, P30 = 40 (xvi) Q1 ≤ Q2 = Md ≤ Q3 (xvii) D5 = P50 = Md ≤ D7 ≤ P80 (xviii) D4 ≤ Q2 ≤ P60 ≤ P75 = Q3 (xix) Mo = 3Md – 2M (xx) Grouping.
30. State, giving reasons, the average to be used in the following situations :
(i) To determine the average size of the shoe sold in a shop.
(ii) To determine the size of agricultural holdings.
(iii) To determine the average wages in an industrial concern.
(iv) To find the per capita income in different cities.
(v) To find the average beauty among a group of students in a class.
Ans. (i) Mode ; (ii), (iii) and (v) Median ; (iv) Mean.

5·9. GEOMETRIC MEAN
The geometric mean (usually abbreviated as G.M.) of a set of n observations is the nth root of their product. Thus if X1, X2, …, Xn are the given n observations then their G.M. is given by
$$\text{G.M.} = \sqrt[n]{X_1 \times X_2 \times X_3 \times \cdots \times X_n} = (X_1 \cdot X_2 \cdots X_n)^{1/n} \qquad \ldots(5\cdot21)$$
If n = 2, i.e., if we are dealing with two observations only, then the G.M. can be computed by taking the square root of their product. For example, the G.M. of 4 and 16 is $\sqrt{4 \times 16} = \sqrt{64} = 8$.
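The definition in (5·21) can be checked numerically. The following is only a minimal Python sketch under our own assumptions: the function name `geometric_mean` is illustrative and not part of the text, and the observations are assumed to be strictly positive.

```python
import math


def geometric_mean(values):
    """Return the nth root of the product of n positive observations, as in (5.21)."""
    values = list(values)
    if not values:
        raise ValueError("at least one observation is required")
    if any(v <= 0 for v in values):
        # Assumption of this sketch: zero or negative values are rejected.
        raise ValueError("this sketch assumes strictly positive observations")
    return math.prod(values) ** (1.0 / len(values))


# The two-observation case from the text: the square root of 4 x 16.
print(geometric_mean([4, 16]))  # 8.0
```

For long series the direct product may overflow or lose precision, which is one practical reason to prefer a logarithmic computation.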
But if n, the number of observations, is greater than 2, then the computation of the nth root is very tedious. In such a case the calculations are facilitated by making use of logarithms. Taking logarithm of both sides in (5·21), we get
$$\log(\text{G.M.}) = \frac{1}{n}\log(X_1 X_2 \cdots X_n) = \frac{1}{n}\left(\log X_1 + \log X_2 + \cdots + \log X_n\right) = \frac{1}{n}\sum \log X \qquad \ldots(5\cdot21a)$$
Thus we see that the logarithm of the G.M. of a set of observations is the arithmetic mean of their logarithms.
Taking Antilog of both sides in (5·21a), we finally obtain,
$$\text{G.M.} = \text{Antilog}\left[\frac{1}{n}\sum \log X\right] \qquad \ldots(5\cdot21b)$$
In case of frequency distribution (Xi, fi) ; i = 1, 2, …, n, where the total number of observations is N = ∑f,
$$\text{G.M.} = \left[(X_1 \times X_1 \times \cdots\ f_1 \text{ times}) \times (X_2 \times X_2 \times \cdots\ f_2 \text{ times}) \times \cdots \times (X_n \times X_n \times \cdots\ f_n \text{ times})\right]^{1/N} = \left(X_1^{f_1} \times X_2^{f_2} \times \cdots \times X_n^{f_n}\right)^{1/N} \qquad \ldots(5\cdot22)$$
Taking logarithm of both sides in (5·22), we get
$$\log \text{G.M.} = \frac{1}{N}\log\left(X_1^{f_1} \cdot X_2^{f_2} \cdots X_n^{f_n}\right) = \frac{1}{N}\left[\log X_1^{f_1} + \log X_2^{f_2} + \cdots + \log X_n^{f_n}\right]$$
$$= \frac{1}{N}\left[f_1 \log X_1 + f_2 \log X_2 + \cdots + f_n \log X_n\right] = \frac{1}{N}\sum f \log X \qquad \ldots(5\cdot22a)$$
$$\Rightarrow\ \text{G.M.} = \text{Antilog}\left[\frac{1}{N}\sum f \log X\right] \qquad \ldots(5\cdot22b)$$
In the case of grouped or continuous frequency distributions, the values of X are the mid-values of the corresponding classes.
Steps for the Computation of G.M. in (5·22b)
1. Find log X, where X is the value of the variable or the mid-value of the class (in case of grouped or continuous frequency distribution).
2. Compute f × log X, i.e., multiply the values of log X obtained in step 1 by the corresponding frequencies.
3. Obtain the sum of the products f log X obtained in step 2 to get ∑ f log X.
4. Divide the sum obtained in step 3 by N, the total frequency.
5. Take the Antilog of the value obtained in step 4. The resulting figure gives the value of G.M.
5·9·1. Merits and Demerits of Geometric Mean.
Merits : (i) Geometric mean is rigidly defined.
(ii) It is based on all the observations.
(iii) It is suitable for further mathematical treatment.
If G1 and G2 are the geometric means of two groups of sizes n1 and n2 respectively, then the geometric mean G of the combined group of size n1 + n2 is given by
$$\log G = \frac{n_1 \log G_1 + n_2 \log G_2}{n_1 + n_2} \qquad \ldots(5\cdot23)$$
Remark. The result in (5·23) can be easily generalised to the case of k groups as follows :
If G1, G2, …, Gk are the geometric means of the k groups of sizes n1, n2, …, nk respectively, then the geometric mean G of the combined group of size n1 + n2 + … + nk is given by :
$$\log G = \frac{n_1 \log G_1 + n_2 \log G_2 + \cdots + n_k \log G_k}{n_1 + n_2 + \cdots + n_k} \qquad \ldots(5\cdot23a)$$
(iv) Unlike the arithmetic mean, which has a bias for higher values, the geometric mean has a bias for smaller observations and as such is quite useful for phenomena (such as prices) which have a lower limit (prices cannot go below zero) but no such upper limit.
(v) As compared with the mean, G.M. is affected to a lesser extent by extreme observations.
(vi) It is not affected much by fluctuations of sampling.
Demerits. (i) Because of its abstract mathematical character, geometric mean is not easy to understand and to calculate for a non-mathematical person.
(ii) If any one of the observations is zero, geometric mean becomes zero, and if any one of the observations is negative, geometric mean becomes imaginary regardless of the magnitude of the other items.
Uses. In spite of its merits and limitations, geometric mean is specially useful in averaging ratios, percentages, and rates of increase between two periods. For example, G.M. is the appropriate average to be used for computing the average rate of growth of population or average increase in the rate of profits, sales, production, etc., or the rate of money.
Geometric mean is used in the construction of Index Numbers. Irving Fisher’s ideal index number is based on geometric mean [See Chapter 10 on Index Numbers].
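The pooling result (5·23a) given under merit (iii) can be sketched in Python as below. This is an illustrative sketch only: the function name `combined_geometric_mean` and its argument names are our own assumptions, and common (base 10) logarithms are used to mirror the Antilog computations of the text.

```python
import math


def combined_geometric_mean(sizes, means):
    """Pool group geometric means G_i with group sizes n_i, following (5.23a)."""
    sizes = list(sizes)
    means = list(means)
    if len(sizes) != len(means) or not sizes:
        raise ValueError("sizes and means must be non-empty and of equal length")
    if any(g <= 0 for g in means):
        raise ValueError("this sketch assumes positive geometric means")
    total = sum(sizes)
    # log G = (n1 log G1 + ... + nk log Gk) / (n1 + ... + nk)
    log_g = sum(n * math.log10(g) for n, g in zip(sizes, means)) / total
    return 10 ** log_g  # the Antilog step


# Three groups of sizes 8, 7 and 5 with geometric means 8.52, 10.12 and 7.75.
print(round(combined_geometric_mean([8, 7, 5], [8.52, 10.12, 7.75]), 2))
```

The sizes act as weights on the logarithms, so a larger group pulls the pooled value towards its own geometric mean.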
While dealing with data pertaining to economic and social sciences, we usually come across situations where it is desired to give more weightage to smaller items and less weightage to larger items. G.M. is the most appropriate average to be used in such cases.
5·9·2. Compound Interest Formula. Let us suppose that P0 is the initial value of the variable (i.e., the value of the variable in the beginning), Pn its value at the end of period n, and r the rate of growth per unit per period.
Since r is the rate of growth per unit per period, growth for period 1 is P0 r and thus the value of the variate at the end of period 1 is
$$P_1 = P_0 r + P_0 = P_0(1 + r).$$
For the 2nd period the initial value of the variable becomes P0(1 + r). The growth for the 2nd period is P0(1 + r) r and consequently the value of the variable at the end of the 2nd period is
$$P_2 = P_0(1 + r) + P_0(1 + r)\,r = P_0(1 + r)[1 + r] = P_0(1 + r)^2$$
Similarly proceeding, the value of the variable at the end of period 3 is
$$P_3 = P_0(1 + r)^2 + P_0(1 + r)^2\,r = P_0(1 + r)^2[1 + r] = P_0(1 + r)^3$$
and finally, its value at the end of period n will be given by
$$P_n = P_0(1 + r)^n, \qquad \ldots(5\cdot24)$$
which is the compound interest formula for money.
Equation (5·24) involves four quantities :
Pn : The value at the end of period n ;
n : The length of the period ;
P0 : The value in the beginning ;
r : The rate per unit per period.
If we are given P0, r and n we can compute Pn by using (5·24) directly. However, (5·24) can be used to obtain any one of the four values when the remaining three values are given. For example, for given Pn, r and n we have :
$$P_0 = \frac{P_n}{(1 + r)^n} \qquad \ldots(5\cdot24a)$$
5·9·3. Average Rate of a Variable Which Increases by Different Rates at Different Periods. Let us suppose that instead of the values of the variable increasing at a constant rate in each period, the rate per unit per period is different, say, r1, r2, …, rn for the 1st, 2nd, … and nth period respectively.
Then, as discussed in the previous section we shall get :
P1 = The value at the end of 1st period = $P_0(1 + r_1)$
P2 = The value at the end of 2nd period = $P_0(1 + r_1)(1 + r_2)$
Pn = The value at the end of period n = $P_0(1 + r_1)(1 + r_2)\cdots(1 + r_n)$ …(*)
If r is assumed to be the constant rate of growth per unit per period, then we get
$$P_n = P_0(1 + r)^n \qquad \ldots(**)$$
Hence, equating the values of Pn in (*) and (**), the average rate of growth over the period n is given by :
$$(1 + r)^n = (1 + r_1)(1 + r_2)\cdots(1 + r_n) \ \Rightarrow\ 1 + r = \left[(1 + r_1)(1 + r_2)\cdots(1 + r_n)\right]^{1/n} \qquad \ldots(5\cdot25)$$
If r1, r2, …, rn denote the percentage growth per unit per period for the n periods respectively then we have
$$1 + \frac{r}{100} = \left[\left(1 + \frac{r_1}{100}\right)\left(1 + \frac{r_2}{100}\right)\cdots\left(1 + \frac{r_n}{100}\right)\right]^{1/n} \qquad \ldots(5\cdot26)$$
where r is the average percentage growth rate over n periods.
$$\therefore\ 100 + r = \left[(100 + r_1)(100 + r_2)\cdots(100 + r_n)\right]^{1/n} \ \Rightarrow\ r = \left[(100 + r_1)(100 + r_2)\cdots(100 + r_n)\right]^{1/n} - 100 \qquad \ldots(5\cdot26a)$$
Thus we see that if rates are given as percentages, then the average percentage growth rate can be obtained on subtracting 100 from the G.M. of (100 + r1), (100 + r2), …, (100 + rn).
Remark. It should be clearly understood that the average percentage growth rate is given by (5·26) and not by the geometric mean of r1, r2, …, rn.
5·9·4. Wrong Observations and Geometric Mean. Let us suppose that the value of the geometric mean computed from n observations, say, X1, X2, …, Xn is G. On checking, it is found that some of the observations, say, X1, X2 and X3 were wrongly copied instead of the correct observations X1′, X2′ and X3′. We are interested in computing the correct value of the geometric mean.
$$G = \text{Geometric Mean of } X_1, X_2, \ldots, X_n = (X_1 \cdot X_2 \cdot X_3 \cdots X_n)^{1/n} \qquad \ldots(***)$$
On replacing the wrong observations X1, X2 and X3 by the correct values, the corrected value of the geometric mean, say, G′ is given by :
$$G' = (X_1' X_2' X_3' X_4 \cdots X_n)^{1/n} = \left[\frac{X_1'}{X_1} \cdot \frac{X_2'}{X_2} \cdot \frac{X_3'}{X_3} \cdot X_1 X_2 X_3 \cdots X_n\right]^{1/n}$$
$$= (X_1 X_2 X_3 \cdots X_n)^{1/n} \left(\frac{X_1' X_2' X_3'}{X_1 X_2 X_3}\right)^{1/n} = G \cdot \left(\frac{X_1' X_2' X_3'}{X_1 X_2 X_3}\right)^{1/n} \quad [\text{From } (***)] \qquad \ldots(5\cdot26b)$$
The result in (5·26b) can be generalised to the case of more than three observations. For illustration, see Example 5·37.

Example 5·35. (a) Find the Geometric Mean of 2, 4, 8, 12, 16 and 24.
(b) If the observations 2, 4, 8 and 16 occur with frequencies 4, 3, 2 and 1 respectively, find their geometric mean. [I.C.W.A. (Foundation), Dec. 2005]
Solution. (a)

X     :     2        4        8        12       16       24      Total
log X :  0·3010   0·6021   0·9031   1·0792   1·2041   1·3802    5·4697

$$\log(\text{G.M.}) = \frac{1}{n}\sum \log X = \frac{5.4697}{6} = 0.9116 \quad [\text{Using } (5\cdot21a)]$$
∴ G.M. = Antilog (0·9116) = 8·158
Aliter. $\text{G.M.} = (2 \times 4 \times 8 \times 12 \times 16 \times 24)^{1/6} = (294912)^{1/6}$
$$\therefore\ \log \text{G.M.} = \frac{1}{6}\log 294912 = \frac{5.4698}{6} = 0.9116 \ \Rightarrow\ \text{G.M.} = \text{Antilog}(0.9116) = 8.158.$$
(b) We are given :

x :   2   4   8   16
f :   4   3   2    1      N = ∑f = 10

$$\text{G.M.} = (2^4 \times 4^3 \times 8^2 \times 16^1)^{1/10} = \left[2^4 \times (2^2)^3 \times (2^3)^2 \times 2^4\right]^{1/10} = \left[2^4 \times 2^6 \times 2^6 \times 2^4\right]^{1/10} = \left[2^{4+6+6+4}\right]^{1/10} = 2^{20 \times \frac{1}{10}} = 2^2 = 4$$

Example 5·36. Find the geometric mean for the following distribution :

Marks           :   0–10   10–20   20–30   30–40   40–50
No. of students :     5       7      15      25       8

Solution.

Marks     Mid-Point (X)   No. of Students (f)    log X     f log X
0–10            5                  5             0·6990     3·4950
10–20          15                  7             1·1761     8·2327
20–30          25                 15             1·3979    20·9685
30–40          35                 25             1·5441    38·6025
40–50          45                  8             1·6532    13·2256
Total                           N = 60                     84·5243

$$\text{Geometric mean} = \text{Antilog}\left[\frac{\sum f \log X}{N}\right] = \text{Antilog}\left[\frac{84.5243}{60}\right] = \text{Antilog}\,[1.40874] = 25.64 \text{ marks.}$$

Example 5·37. The geometric mean of 10 observations on a certain variable was calculated as 16·2. It was later discovered that one of the observations was wrongly recorded as 12·9 ; in fact it was 21·9. Apply appropriate correction and calculate the correct geometric mean.
Solution.
Geometric mean G of n observations is given by :
$$G = (X_1 X_2 \cdots X_n)^{1/n} \ \Rightarrow\ G^n = X_1 X_2 \cdots X_n \qquad \ldots(*)$$
Thus the product of the numbers is given by :
$$X_1 X_2 \cdots X_n = G^n = (16.2)^{10} \quad [\text{Given } n = 10,\ G = 16.2] \qquad \ldots(**)$$
If the wrong observation 12·9 is replaced by the correct value 21·9, then the corrected value of the product of 10 numbers is obtained on dividing the expression in (**) by the wrong observation and multiplying by the correct observation. Thus,
$$\text{Corrected product } (X_1 X_2 \cdots X_n) = \frac{(16.2)^{10} \times 21.9}{12.9}$$
Hence, the corrected value of the geometric mean, G′ (say), is given by :
$$G' = \left[\frac{(16.2)^{10} \times 21.9}{12.9}\right]^{1/10}$$
$$\Rightarrow\ \log G' = \frac{1}{10}\left[\log (16.2)^{10} + \log 21.9 - \log 12.9\right] = \frac{1}{10}\left[10 \log 16.2 + \log 21.9 - \log 12.9\right]$$
$$= \frac{1}{10}\left[10 \times 1.2095 + 1.3404 - 1.1106\right] = \frac{1}{10}\left[12.0950 + 1.3404 - 1.1106\right] = \frac{12.3248}{10} = 1.2325$$
⇒ G′ = Antilog (1·2325) = 17·08
Aliter. Using (5·26b) directly, we get :
$$G' = G \cdot \left(\frac{X_1'}{X_1}\right)^{1/10} = 16.2 \times \left(\frac{21.9}{12.9}\right)^{1/10}$$

Example 5·38. Three groups of observations contain 8, 7 and 5 observations. Their geometric means are 8·52, 10·12 and 7·75 respectively. Find the geometric mean of the 20 observations in the single group formed by pooling the three groups.
Solution. In the usual notations, we are given that :
n1 = 8, n2 = 7, n3 = 5 ; G1 = 8·52, G2 = 10·12, G3 = 7·75
The geometric mean G of the combined group of size N = n1 + n2 + n3 = 8 + 7 + 5 = 20, is given by :
$$\log G = \frac{1}{N}\left[n_1 \log G_1 + n_2 \log G_2 + n_3 \log G_3\right] = \frac{1}{20}\left[8 \log 8.52 + 7 \log 10.12 + 5 \log 7.75\right]$$
$$= \frac{1}{20}\left[8 \times 0.9304 + 7 \times 1.0052 + 5 \times 0.8893\right] = \frac{1}{20}\left[7.4432 + 7.0364 + 4.4465\right] = \frac{18.9261}{20} = 0.9463$$
∴ G = Antilog (0·9463) = 8·837

Example 5·39. Find the missing information in the following table :

                       Groups
                   A      B      C     Combined
Number            10      8      –        24
Mean              20      –      6        15
Geometric Mean    10      7      –       8·397

[Delhi Univ. B.Com. (Hons.), 1998]
Solution.
Taking the groups A, B and C as groups 1, 2 and 3 respectively, in the usual notations, we are given :

                    Group A      Group B      Group C      Combined
Number :            n1 = 10      n2 = 8       n3 = ?       n1 + n2 + n3 = 24     …(i)
Mean :              x̄1 = 20      x̄2 = ?       x̄3 = 6       x̄ = 15                …(ii)
Geometric Mean :    G1 = 10      G2 = 7       G3 = ?       G = 8·397             …(iii)

From (i), we get n3 = 24 – n1 – n2 = 24 – 10 – 8 = 6 …(iv)
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3} = \frac{10 \times 20 + 8\bar{x}_2 + 6 \times 6}{24} = 15 \quad (\text{Given})$$
$$\Rightarrow\ 8\bar{x}_2 = 15 \times 24 - 200 - 36 = 124 \ \Rightarrow\ \bar{x}_2 = \frac{124}{8} = 15.5 \qquad \ldots(v)$$
$$\log G = \frac{n_1 \log G_1 + n_2 \log G_2 + n_3 \log G_3}{n_1 + n_2 + n_3} \ \Rightarrow\ \frac{10 \log 10 + 8 \log 7 + 6 \log G_3}{24} = \log (8.397)$$
$$\Rightarrow\ 10 \times 1 + 8 \times 0.8451 + 6 \log G_3 = 24 \times 0.9242$$
$$\therefore\ G_3 = \text{Antilog}\left(\frac{22.1808 - 10 - 6.7608}{6}\right) = \text{Antilog}\left(\frac{5.42}{6}\right) = \text{Antilog}\,(0.9033) = 8.004 \simeq 8 \text{ (approx.)}.$$

Example 5·40. Find the average rate of increase in population which in the first decade had increased by 20%, in the next by 30% and in the third by 40%.
Solution. Since we are dealing with rate of increase in population, the appropriate average to be computed is the geometric mean and not the arithmetic mean.

CALCULATIONS FOR GEOMETRIC MEAN
Decade    Rate of growth of population    Population at the end of the decade (X)    log X
1                    20%                                  120                         2·0792
2                    30%                                  130                         2·1139
3                    40%                                  140                         2·1461
Total                                                                         ∑ log X = 6·3392

$$\text{G.M.} = \text{Antilog}\left(\frac{1}{n}\sum \log X\right) = \text{Antilog}\left(\frac{6.3392}{3}\right) = \text{Antilog}\,(2.1131) = 129.7$$
Hence, the average percentage rate of increase in the population per decade over the entire period is : 129·7 – 100 = 29·7.

Example 5·41. Under what condition is geometric mean indeterminate ? If the price of a commodity doubles in a period of 4 years, what is the average annual percentage increase ?
Solution. Geometric mean is indeterminate if any one of the given observations is negative. In this case G.M. becomes imaginary. In general, G.M. is indeterminate (imaginary) if an odd number of the given observations are negative.
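As an aside on the growth-rate averaging used in Example 5·40, the rule in (5·26a) can be sketched in Python as below. This is a hedged illustration only: the function name `average_growth_rate` is our own, and the rates are assumed to be percentages greater than –100 so that every factor 100 + r is positive.

```python
import math


def average_growth_rate(percent_rates):
    """Average percentage growth per period: G.M. of (100 + r_i) minus 100, as in (5.26a)."""
    factors = [100 + r for r in percent_rates]
    if not factors:
        raise ValueError("at least one rate is required")
    if any(f <= 0 for f in factors):
        # The G.M. is not usable for zero or negative factors.
        raise ValueError("each rate must exceed -100 per cent")
    log_mean = sum(math.log10(f) for f in factors) / len(factors)
    return 10 ** log_mean - 100


# Decade-wise growth of 20%, 30% and 40%, as in Example 5.40.
print(round(average_growth_rate([20, 30, 40]), 1))
```

The result lies slightly below the arithmetic mean of the three rates (30), which is the point of the Remark following (5·26a).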
Also, if any one of the given observations is zero, G.M. becomes zero irrespective of the size of the other observations.
In the usual notations we are given : P0 = Rs. x, (say) ; Pn = Rs. 2x ; n = 4.
If r is the average annual rate of increase in price (per unit) over this period then we have :
$$P_n = P_0(1 + r)^n \ \Rightarrow\ 2x = x(1 + r)^4 \ \Rightarrow\ (1 + r)^4 = 2 \ \Rightarrow\ 1 + r = 2^{1/4}$$
$$\Rightarrow\ 1 + r = \text{Antilog}\left(\frac{1}{4}\log 2\right) = \text{Antilog}\left(\frac{0.3010}{4}\right) = \text{Antilog}\,(0.07525) = 1.189$$
∴ r = 1·189 – 1 = 0·189
Hence, the average annual percentage increase is about 19%.

Example 5·42. An assessee depreciated the machinery of his factory by 10% each in the first two years and by 40% in the third year and thereby claimed 21% average depreciation relief from the taxation department, but the I.T.O. objected and allowed only 20%. Show which of the two is right.
Solution.

COMPUTATION OF A.M. AND G.M.
Year    Rate of depreciation    Value at the end of the year, per 100 at its start (X)    log X
1              10%                                    90                                  1·9542
2              10%                                    90                                  1·9542
3              40%                                    60                                  1·7782
Total                                              ∑X = 240                       ∑ log X = 5·6866

The average value (arithmetic mean) at the end of three years is :
$$\bar{X} = \frac{\sum X}{3} = \frac{240}{3} = 80$$
Hence, the average rate of depreciation per annum for the entire period of 3 years is 100 – 80 = 20%.
This can be computed otherwise by taking the average of 10%, 10%, 40% which is :
$$\frac{10 + 10 + 40}{3} = \frac{60}{3} = 20\%$$
The geometric mean is given by :
$$\text{G.M.} = \text{Antilog}\left(\frac{1}{3}\sum \log X\right) = \text{Antilog}\left(\frac{5.6866}{3}\right) = \text{Antilog}\,(1.8955) = 78.61$$
Hence, the average (geometric mean) rate of depreciation per annum for the entire period of three years is 100 – 78·61 = 21·39% ≃ 21%.
The assessee had claimed 21% depreciation using G.M. while the I.T.O. objected and allowed 20% depreciation using A.M. Since we are dealing with rates, the arithmetic mean does not depict the average depreciation correctly.
Geometric mean is the correct average to be used. Hence, I.T.O. was wrong in not allowing 21% depreciation as claimed by the assessee.

Example 5·43. (a) Show that in finding the A.M. of a set of readings on a thermometer, it does not matter whether we measure the temperature in Centigrade (C) or Fahrenheit (F) degrees. [Delhi Univ. B.A. (Econ. Hons.) 2000]
(b) However, in computing Geometric Mean, it does matter which readings we use.

Solution. If F and C be the readings in Fahrenheit and Centigrade respectively then we have the relation :
$\dfrac{F - 32}{180} = \dfrac{C}{100}$ ⇒ $F = 32 + \dfrac{9}{5} C$
Thus the Fahrenheit equivalents of $C_1, C_2, \dots, C_n$ are $32 + \frac{9}{5} C_1,\ 32 + \frac{9}{5} C_2,\ \dots,\ 32 + \frac{9}{5} C_n$ respectively.
Hence, the arithmetic mean of the readings in Fahrenheit is
$= \dfrac{1}{n}\left\{\left(32 + \tfrac{9}{5} C_1\right) + \left(32 + \tfrac{9}{5} C_2\right) + \dots + \left(32 + \tfrac{9}{5} C_n\right)\right\} = \dfrac{1}{n}\left\{32n + \tfrac{9}{5}(C_1 + C_2 + \dots + C_n)\right\}$
$= 32 + \dfrac{9}{5}\left(\dfrac{C_1 + C_2 + \dots + C_n}{n}\right) = 32 + \dfrac{9}{5}\bar{C}$, which is the Fahrenheit equivalent of $\bar{C}$.
Hence, in finding the arithmetic mean of a set of n readings on a thermometer, it is immaterial whether we measure temperature in Centigrade or Fahrenheit.

Geometric mean G, of n readings in Centigrade is : $G = (C_1 C_2 \dots C_n)^{1/n}$
Geometric mean $G_1$, (say), of Fahrenheit equivalents of $C_1, C_2, \dots, C_n$ is
$G_1 = \left\{\left(32 + \tfrac{9}{5} C_1\right)\left(32 + \tfrac{9}{5} C_2\right) \dots \left(32 + \tfrac{9}{5} C_n\right)\right\}^{1/n}$
which is not equal to the Fahrenheit equivalent of G viz., $\left\{\tfrac{9}{5}(C_1 C_2 \dots C_n)^{1/n} + 32\right\}$.
Hence, in finding the geometric mean of the n readings on a thermometer, the scale (Centigrade or Fahrenheit) is important.

Example 5·44. If the arithmetic mean of two unequal positive real numbers ‘a’ and ‘b’, (a > b), be twice as great as their geometric mean, show that
$a : b = (2 + \sqrt{3}) : (2 - \sqrt{3})$ [I.C.W.A. (Foundation), June 2001]

Solution. The arithmetic mean (A.M.) and the geometric mean (G.M.) of two unequal positive real numbers a and b, (a > b), are given by :
A.M. $= \dfrac{a + b}{2}$ and G.M. $= \sqrt{ab}$.
We are given : A.M. = 2 G.M.
⇒ $\dfrac{a + b}{2} = 2\sqrt{ab}$ ⇒ $a + b = 4\sqrt{ab}$ …(i)
Also $(a - b)^2 = (a + b)^2 - 4ab = 16ab - 4ab = 12ab$ [From (i)]
⇒ $(a - b) = \pm 2\sqrt{3}\sqrt{ab}$ ⇒ $a - b = 2\sqrt{3}\sqrt{ab}$ (∵ a > b) …(ii)
Adding and subtracting (i) and (ii), we get respectively :
$2a = 2\sqrt{ab}\,(2 + \sqrt{3})$ …(iii) and $2b = 2\sqrt{ab}\,(2 - \sqrt{3})$ …(iv)
Dividing (iii) by (iv), we get :
$\dfrac{a}{b} = \dfrac{2 + \sqrt{3}}{2 - \sqrt{3}}$ ⇒ $a : b = (2 + \sqrt{3}) : (2 - \sqrt{3})$

5·9·5. Weighted Geometric Mean. If the different values $X_1, X_2, \dots, X_n$ of the variable are not of equal importance and are assigned different weights, say, $W_1, W_2, \dots, W_n$ respectively according to their degree of importance then their weighted geometric mean G.M. (W) is given by
G.M. (W) $= \left(X_1^{W_1} \times X_2^{W_2} \times \dots \times X_n^{W_n}\right)^{1/N}$ …(5·27)
where $N = W_1 + W_2 + \dots + W_n = \sum W$, is the sum of weights. Taking logarithm of both sides in (5·27), we get
$\log [\text{G.M. (W)}] = \dfrac{1}{N}\left[W_1 \log X_1 + W_2 \log X_2 + \dots + W_n \log X_n\right] = \dfrac{1}{N}\sum W \log X$
⇒ G.M. (W) $= \text{Antilog}\left[\dfrac{1}{N}\sum W \log X\right] = \text{Antilog}\left[\dfrac{\sum W \log X}{\sum W}\right]$ …(5·27a)

Example 5·45. The weighted geometric mean of the four numbers 8, 25, 19 and 28 is 22·15. If the weights of the first three numbers are 3, 5, 7 respectively, find the weight (positive integer) of the fourth number.

Solution. Let the weight of the fourth number be w.

COMPUTATION OF WEIGHTED G.M.
X         log X      W         W log X
8         0·9031     3         2·7093
25        1·3979     5         6·9895
19        1·2788     7         8·9516
28        1·4472     w         1·4472w
Total                15 + w    18·6504 + 1·4472w

Weighted Geometric Mean (G) = 22·15 (Given)
Also $\log G = \dfrac{\sum W \log X}{\sum W}$ ⇒ $\log (22·15) = \dfrac{18·6504 + 1·4472w}{15 + w}$
⇒ $(15 + w) \times 1·3454 = 18·6504 + 1·4472w$
⇒ $15 \times 1·3454 + 1·3454w = 18·6504 + 1·4472w$
⇒ $20·1810 + 1·3454w = 18·6504 + 1·4472w$
⇒ $1·4472w - 1·3454w = 20·1810 - 18·6504 = 1·5306$
⇒ $0·1018w = 1·5306$
⇒ $w = \dfrac{1·5306}{0·1018} = 15$ approx.

5·10.
HARMONIC MEAN
If $X_1, X_2, \dots, X_n$ is a given set of n observations, then their harmonic mean, abbreviated as H.M. or simply H is given by :
$H = \dfrac{1}{\dfrac{1}{n}\left[\dfrac{1}{X_1} + \dfrac{1}{X_2} + \dots + \dfrac{1}{X_n}\right]} = \dfrac{1}{\dfrac{1}{n}\sum\left(\dfrac{1}{X}\right)} = \dfrac{n}{\sum (1/X)}$ …(5·28)
In other words, Harmonic Mean is the reciprocal of the arithmetic mean of the reciprocals of the given observations.
In case of frequency distribution, we have
$\dfrac{1}{H} = \dfrac{1}{N}\left[\dfrac{f_1}{X_1} + \dfrac{f_2}{X_2} + \dots + \dfrac{f_n}{X_n}\right] = \dfrac{1}{N}\sum\left(\dfrac{f}{X}\right)$ ⇒ $H = \dfrac{N}{\sum (f/X)}$ …(5·28a)
where N = ∑f, is the total frequency, X is the value of the variable or the mid-value of the class (in case of grouped or continuous frequency distribution) and f is the corresponding frequency of X.

Remark. Effect of Change of Scale on Harmonic Mean. Let $x_1, x_2, \dots, x_n$ be the given set of n observations. If the variable x is transformed to the new variable u by change of scale :
$u = kx,\ k \neq 0$ …(*)
then, by definition
$\dfrac{1}{\text{H.M.}(u)} = \dfrac{1}{n}\left[\dfrac{1}{u_1} + \dfrac{1}{u_2} + \dots + \dfrac{1}{u_n}\right] = \dfrac{1}{n}\left[\dfrac{1}{kx_1} + \dfrac{1}{kx_2} + \dots + \dfrac{1}{kx_n}\right]$ [Using (*)]
$= \dfrac{1}{k} \cdot \dfrac{1}{n}\left[\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dots + \dfrac{1}{x_n}\right] = \dfrac{1}{k} \cdot \dfrac{1}{\text{H.M.}(x)}$
⇒ H.M. (u) = k × H.M. (x)
Hence, we have the following result :
If u = kx, k ≠ 0, then H.M. (u) = k × H.M. (x) …(5·28b)

5·10·1. Merits and Demerits of Harmonic Mean.
Merits :
(i) Harmonic mean is rigidly defined.
(ii) It is based on all the observations.
(iii) It is suitable for further mathematical treatment. If $H_1$ and $H_2$ are the harmonic means of two groups of sizes $N_1$ and $N_2$ respectively, then the harmonic mean H of the combined group of size $N_1 + N_2$ is given by :
$\dfrac{1}{H} = \dfrac{1}{N_1 + N_2}\left[\dfrac{N_1}{H_1} + \dfrac{N_2}{H_2}\right]$ …(5·28c)
(iv) Since the reciprocals of the values of the variable are involved, it gives greater weightage to smaller observations and as such is not very much affected by one or two big observations.
(v) It is not affected very much by fluctuations of sampling.
(vi) It is particularly useful in averaging special types of rates and ratios where time factor is variable and the act being performed remains constant.
Demerits.
(i) It is not easy to understand and calculate.
(ii) Its value cannot be obtained if any one of the observations is zero.
(iii) It is not a representative figure of the distribution unless the phenomenon requires greater weightage to be given to smaller items. As such, it is hardly used in business problems.

Uses. As has been pointed out in merit (vi), harmonic mean is specially useful in averaging rates and ratios where time factor is variable and the act being performed e.g., distance is constant. The following examples will clarify the point.

Example 5·46. The following table gives the weights of 31 persons in a sample enquiry. Calculate the mean weight using (i) Geometric mean and (ii) Harmonic mean.
Weight (lbs.) :     130    135    140    145    146    148    149    150    157
No. of persons :    3      4      6      6      3      5      2      1      1

Solution.
COMPUTATION OF G.M. AND H.M.
Weight (lbs.) (X)    No. of persons (f)    log X     f log X     1/X        f/X
130                  3                     2·1139    6·3417      0·00769    0·02307
135                  4                     2·1303    8·5212      0·00741    0·02964
140                  6                     2·1461    12·8766     0·00714    0·04284
145                  6                     2·1614    12·9684     0·00690    0·04140
146                  3                     2·1644    6·4932      0·00685    0·02055
148                  5                     2·1703    10·8515     0·00676    0·03380
149                  2                     2·1732    4·3464      0·00671    0·01342
150                  1                     2·1761    2·1761      0·00667    0·00667
157                  1                     2·1959    2·1959      0·00637    0·00637
Total                ∑f = N = 31                     ∑f log X = 66·7710     ∑(f/X) = 0·21776

G.M. $= \text{Antilog}\left(\dfrac{1}{N}\sum f \log X\right) = \text{Antilog}\left(\dfrac{66·7710}{31}\right) = \text{Antilog}\,(2·1539) = 142·5$
H.M. $= \dfrac{N}{\sum (f/X)} = \dfrac{31}{0·21776} = 142·36$
Hence, the mean weight of 31 persons using (i) geometric mean is 142·5 lbs. and (ii) harmonic mean is 142·36 lbs.

Example 5·47. If 2u = 5x, is the relation between two variables u and x, and harmonic mean of x is 0·4, find the harmonic mean of u. [I.C.W.A. (Foundation), June 2005]

Solution. 2u = 5x ⇒ $u = \dfrac{5}{2} x$ …(*)
Also H.M. (x) = 0·4 (Given) …(**)
If $u_1, u_2, \dots, u_n$ are n observations corresponding to $x_1, x_2, \dots, x_n$ respectively, obtained by the transformation (*) so that
$u_i = \dfrac{5}{2} x_i$, (i = 1, 2, …, n), …(***)
then, by definition H.M.
(u) $= \dfrac{1}{\dfrac{1}{n}\left[\dfrac{1}{u_1} + \dfrac{1}{u_2} + \dots + \dfrac{1}{u_n}\right]} = \dfrac{1}{\dfrac{1}{n}\left[\dfrac{2}{5} \cdot \dfrac{1}{x_1} + \dfrac{2}{5} \cdot \dfrac{1}{x_2} + \dots + \dfrac{2}{5} \cdot \dfrac{1}{x_n}\right]}$ [From (***)]
$= \dfrac{5}{2} \cdot \dfrac{1}{\dfrac{1}{n}\left[\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dots + \dfrac{1}{x_n}\right]} = \dfrac{5}{2}\,\text{H.M.}(x) = \dfrac{5}{2} \times 0·4 = 1$ [From (**)]

Remark. We may state and use the result (5·28b) directly, to get from (*)
H.M. (u) $= \dfrac{5}{2}\,\text{H.M.}(x) = \dfrac{5}{2} \times 0·4 = 1$

Example 5·48. Find the harmonic mean of $\dfrac{1}{2}, \dfrac{2}{3}, \dfrac{3}{4}, \dots, \dfrac{n}{n+1}$ occurring with frequencies 1, 2, 3, …, n respectively. [I.C.W.A. (Foundation), June 2006]

Solution.
x (1)           f (2)      1/x (3)        f/x (4) = (2) × (3)
1/2             1          2              2
2/3             2          3/2            3
3/4             3          4/3            4
⋮               ⋮          ⋮              ⋮
(n – 1)/n       n – 1      n/(n – 1)      n
n/(n + 1)       n          (n + 1)/n      n + 1

The harmonic mean (H) is given by :
$\dfrac{1}{H} = \dfrac{\sum (f/x)}{\sum f} = \dfrac{2 + 3 + 4 + \dots + n + (n + 1)}{1 + 2 + 3 + \dots + n} = \dfrac{(1 + 2 + 3 + \dots + n) + n}{1 + 2 + 3 + \dots + n}$
$= 1 + \dfrac{n}{1 + 2 + 3 + \dots + n} = 1 + \dfrac{n}{n(n + 1)/2} = 1 + \dfrac{2}{n + 1} = \dfrac{n + 3}{n + 1}$
∴ H = (n + 1)/(n + 3)

Example 5·49. A cyclist pedals from his house to his college at a speed of 10 km. p.h. and back from the college to his house at 15 km. p.h. Find the average speed.

Solution. Let the distance from the house to the college be x kms. In going from house to college, the distance (x kms) is covered in x/10 hours, while in coming from college to house, the distance is covered in x/15 hours. Thus a total distance of 2x kms is covered in $\left(\dfrac{x}{10} + \dfrac{x}{15}\right)$ hours.
Hence, average speed $= \dfrac{\text{Total distance travelled}}{\text{Total time taken}} = \dfrac{2x}{\left(\dfrac{x}{10} + \dfrac{x}{15}\right)} = \dfrac{2}{\left(\dfrac{1}{10} + \dfrac{1}{15}\right)} = 12$ km. p.h.

Remarks. 1. In this case the average speed is given by the harmonic mean of 10 and 15 and not by the arithmetic mean.
2. If equal distances are covered (travelled) with speeds equal to $V_1, V_2, \dots, V_n$ per unit of time, say, then the average speed is given by the harmonic mean of $V_1, V_2, \dots, V_n$ i.e.,
Average speed $= \dfrac{n}{\left(\dfrac{1}{V_1} + \dfrac{1}{V_2} + \dots + \dfrac{1}{V_n}\right)} = \dfrac{n}{\sum\left(\dfrac{1}{V}\right)}$

Example 5·50. A vehicle when climbing up a gradient, consumes petrol at the rate of 1 litre per 8 kms. while coming down it gives 12 kms. per litre.
Find its average consumption for to and fro travel between two places situated at the two ends of a 25 km. long gradient. Verify your answer. [Delhi Univ. B.A. (Econ. Hons.), 1994]

Solution. Since the consumption of petrol is different for upward and downward journeys (at a constant distance of 25 km.), the appropriate average consumption for to and fro journey is given by the harmonic mean of 8 km. and 12 km.
∴ Average consumption for to and fro journey
$= \dfrac{1}{\dfrac{1}{2}\left(\dfrac{1}{8} + \dfrac{1}{12}\right)} = \dfrac{2}{\left(\dfrac{3 + 2}{24}\right)} = \dfrac{48}{5} = 9·6$ km. per litre …(*)

Verification. Consumption of petrol for upward journey of 25 km. @ 1 litre per 8 kms. $= \dfrac{25}{8}$ litres.
Consumption of petrol for downward journey of 25 km. @ 1 litre per 12 kms. $= \dfrac{25}{12}$ litres.
∴ Total petrol consumed for total journey of 25 + 25 = 50 km. is
$\left(\dfrac{25}{8} + \dfrac{25}{12}\right)$ litres $= 25\left(\dfrac{1}{8} + \dfrac{1}{12}\right)$ litres $= \dfrac{25}{24}(3 + 2)$ litres $= \dfrac{125}{24}$ litres
∴ Average consumption for to and fro journey
$= \dfrac{\text{Total distance}}{\text{Total petrol used}} = \dfrac{50 \times 24}{125} = \dfrac{48}{5} = 9·6$ km. per litre, which is same as in (*).

Example 5·51. In a certain office, a letter is typed by A in 4 minutes. The same letter is typed by B, C and D in 5, 6, 10 minutes respectively. What is the average time taken in completing one letter ? How many letters do you expect to be typed in one day comprising of 8 working hours ? [Delhi Univ. B.A. (Econ. Hons.), 1996; 1995]

Solution. The average time (in minutes) taken by each of A, B, C and D in completing one letter is the harmonic mean of 4, 5, 6 and 10 given by :
$\dfrac{4}{\left(\dfrac{1}{4} + \dfrac{1}{5} + \dfrac{1}{6} + \dfrac{1}{10}\right)} = \dfrac{4}{\left(\dfrac{15 + 12 + 10 + 6}{60}\right)} = \dfrac{240}{43} = 5·5814$ minutes per letter.
Hence, expected number of letters typed by each of A, B, C and D is $\dfrac{43}{240}$ letters per minute.
Hence, in a day comprising of 8 hours = 8 × 60 minutes, the total number of letters typed by all of them (A, B, C and D) is :
$\dfrac{43}{240} \times 4 \times 8 \times 60 = 344$.

Example 5·52. An investor buys Rs. 1,200 worth of shares in a company each month. During the first 5 months he bought the shares at a price of Rs. 10, Rs. 12, Rs.
15, Rs. 20, and Rs. 24 per share. After 5 months what is the average price paid for the shares by him ?

Solution. Since a fixed sum (Rs. 1,200) is invested after each fixed unit of time (1 month) while the share value changes from month to month, the required average price per share is the harmonic mean of 10, 12, 15, 20 and 24 and is given by :
$\dfrac{5}{\left(\dfrac{1}{10} + \dfrac{1}{12} + \dfrac{1}{15} + \dfrac{1}{20} + \dfrac{1}{24}\right)} = \dfrac{5}{\left(\dfrac{12 + 10 + 8 + 6 + 5}{120}\right)} = \dfrac{5 \times 120}{41}$ = Rs. 14·63.
Note. For an alternate solution, see Example 5·8.

5·10·2. Weighted Harmonic Mean. Instead of fixed (constant) distance being travelled with varying speeds (c.f. Remark 2, Example 5·49), let us now suppose that different distances are travelled with corresponding different speeds. In that case what is going to be the average speed ?
Let us suppose that distances $s_1, s_2, \dots, s_n$ are travelled with speeds $v_1, v_2, \dots, v_n$ per unit of time. If $t_1, t_2, \dots, t_n$ are the respective times taken to cover these distances then we have :
$t_1 = \dfrac{s_1}{v_1},\ t_2 = \dfrac{s_2}{v_2},\ \dots,\ t_n = \dfrac{s_n}{v_n}$ …(*)
∴ Average speed $= \dfrac{\text{Total distance travelled}}{\text{Total time taken}} = \dfrac{s_1 + s_2 + \dots + s_n}{t_1 + t_2 + \dots + t_n}$
$= \dfrac{s_1 + s_2 + \dots + s_n}{\left[\dfrac{s_1}{v_1} + \dfrac{s_2}{v_2} + \dots + \dfrac{s_n}{v_n}\right]}$ [From (*)]
$= \dfrac{\sum s}{\sum\left(\dfrac{s}{v}\right)} = \dfrac{1}{\left(\dfrac{1}{\sum s}\right)\sum\left(\dfrac{s}{v}\right)}$ …(5·30)
which is the weighted harmonic mean of the speeds, the corresponding weights being the distances covered.
Hence, if different distances are travelled with corresponding different speeds, then the average speed is given by the weighted harmonic mean of the speeds, the corresponding weights being the distances covered.

Example 5·53. You make a trip which entails travelling 900 kms. by train at an average speed of 60 km. p.h.; 3000 kms. by boat at an average of 25 km. p.h.; 400 kms. by plane at 350 km. p.h., and finally, 15 kms. by taxi at 25 km. p.h. What is your average speed for the entire distance ?

Solution. Since different distances are covered with varying speeds, the required average speed is given by the weighted harmonic mean of the speeds (in km.
p.h.) 60, 25, 350 and 25; the corresponding weights being the distances covered (in kms.) viz., 900, 3000, 400 and 15 respectively.

COMPUTATION OF WEIGHTED H.M.
X         W               W/X
60        900             15
25        3000            120
350       400             1·14
25        15              0·60
Total     ∑W = 4315       ∑(W/X) = 136·74

∴ Average Speed $= \dfrac{\sum W}{\sum (W/X)} = \dfrac{4315}{136·74} = 31·56$ km. p.h.

5·11. RELATION BETWEEN ARITHMETIC MEAN, GEOMETRIC MEAN AND HARMONIC MEAN
The arithmetic mean (A.M.), the geometric mean (G.M.) and the harmonic mean (H.M.) of a series of n positive observations are connected by the relation :
A.M. ≥ G.M. ≥ H.M. …(5·31)
the sign of equality holding if and only if all the n observations are equal.

Remark. For two numbers we also have
$G^2 = A \times H$ …(5·32)
where A, G and H represent arithmetic mean, geometric mean and harmonic mean respectively.
Proof : Let a and b be two real positive numbers i.e., a > 0, b > 0. Then
A.M. $= \dfrac{a + b}{2}$ ; G.M. $= \sqrt{ab}$ ; H.M. $= \dfrac{1}{\dfrac{1}{2}\left(\dfrac{1}{a} + \dfrac{1}{b}\right)} = \dfrac{2ab}{a + b}$ …(*)
∴ $A \times H = \dfrac{a + b}{2} \cdot \dfrac{2ab}{a + b} = ab = G^2$

Example 5·54. “H.M., A.M. and G.M. of a set of 5 observations are 10·2, 16 and 14 respectively.” Comment.

Solution. We are given : n = 5 ; A.M. = 16 ; G.M. = 14 and H.M. = 10·2. Since A.M. > G.M. > H.M., the above statement is correct.

Example 5·55. The arithmetic mean of two observations is 127·5 and their geometric mean is 60. Find (i) their harmonic mean and (ii) the two observations. [Delhi Univ. B.Com. (Hons.), (External), 2007]

Solution. (i) Let the two observations be a and b. Then we are given :
Arithmetic Mean $= \dfrac{a + b}{2} = 127·5$ ⇒ $a + b = 255$ …(*)
G.M. $= \sqrt{a \times b} = 60$ ⇒ $ab = 3600$ …(**)
Harmonic mean of two numbers a and b is given by :
H.M. $= \dfrac{2ab}{a + b} = \dfrac{2 \times 3600}{255} = \dfrac{480}{17} = 28·24$
Aliter.
For two numbers, we have :
$G^2 = AH$ ⇒ $H = \dfrac{G^2}{A} = \dfrac{60^2}{127·5} = \dfrac{480}{17} = 28·24$

(ii) Using a + b = 255 and ab = 3600, we have
$(a - b)^2 = (a + b)^2 - 4ab = (255)^2 - 4 \times 3600 = 65025 - 14400 = 50625$
∴ $a - b = \pm\sqrt{50625} = \pm 225$
Case 1 : a + b = 255 and a – b = 225. Adding, we get 2a = 480 ⇒ a = 480/2 = 240 ∴ b = 255 – a = 255 – 240 = 15
Case 2 : a + b = 255 and a – b = – 225. Adding, we get 2a = 30 ⇒ a = 30/2 = 15 ∴ b = 255 – a = 255 – 15 = 240
Hence, the two observations are 240 and 15.

5·12. SELECTION OF AN AVERAGE
From the discussion of the merits and demerits of the various measures of central tendency in the preceding sections, it is obvious that no single average is suitable for all practical problems. Each of the averages has its own merits and demerits and consequently its own field of importance and utility. For example, arithmetic mean is not to be recommended while dealing with frequency distribution with extreme observations or open end classes. Median and mode are the averages to be used while dealing with open end classes. In case of qualitative data which cannot be measured quantitatively (e.g., for finding average intelligence, honesty, beauty, etc.), median is the only average to be used. Mode is particularly used in business and geometric mean is to be used while dealing with rates and ratios. Harmonic mean is to be used in computing special types of average rates or ratios where time factor is variable and the act being performed e.g., distance, is constant. Hence, the averages cannot be used indiscriminately. For sound statistical analysis, a judicious selection of the average depends upon :
(i) the nature and availability of the data,
(ii) the nature of the variable involved,
(iii) the purpose of the enquiry,
(iv) the system of classification adopted, and
(v) the use of the average for further statistical computations required for the enquiry in mind.
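The point that the averages cannot be used indiscriminately can be checked numerically. The following is only an illustrative Python sketch, not part of the original text; it assumes Python 3.8 or later (for `statistics.geometric_mean`), and the helper name `three_means` is hypothetical. It computes the three means for the two speeds of Example 5·49 and for the two observations obtained in Example 5·55.

```python
import math
import statistics


def three_means(values):
    """Return (A.M., G.M., H.M.) of a list of positive observations."""
    am = statistics.mean(values)
    gm = statistics.geometric_mean(values)
    hm = statistics.harmonic_mean(values)
    return am, gm, hm


# Speeds of Example 5.49 (km. p.h.): equal distances, so H.M. is the
# appropriate average; the A.M. would overstate the average speed.
am, gm, hm = three_means([10, 15])
print("Speeds 10, 15 -> A.M. =", am, " G.M. =", round(gm, 3), " H.M. =", hm)

# Observations of Example 5.55: check A.M. >= G.M. >= H.M. and G^2 = A x H.
a2, g2, h2 = three_means([240, 15])
print("Pair 240, 15  -> A.M. =", a2, " G.M. =", round(g2, 2), " H.M. =", round(h2, 2))
assert a2 >= g2 >= h2
assert math.isclose(g2 ** 2, a2 * h2)
```

For the speeds the three means differ (12·5, about 12·25 and 12), which is why the choice of average has to follow from the nature of the problem.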
However, since arithmetic mean :
(i) satisfies almost all the properties of an ideal average as laid down by Prof. Yule,
(ii) is quite familiar to a layman, and
(iii) has very wide applications in statistical theory at large,
it may be regarded as the best of all the averages.

5·13. LIMITATIONS OF AVERAGES
In spite of their very wide applications in statistical analysis, the averages have the following limitations :
1. Since average is a single numerical figure representing the characteristics of a given distribution, proper care should be taken in interpreting its value otherwise it might lead to very misleading conclusions. In this context, it might be appropriate to quote a classical joke regarding average about a village school teacher who had to cross a river along with his family. On enquiry he was given to understand that the average depth of the river was 3 feet. He measured the heights of the members of the family (himself, his wife, 2 daughters and 3 sons) and found that their average (mean) height was 3½ feet. Since the average height of the family came out to be higher than the average depth of the river, he ordered his family to cross the river. But when he reached the other side of the river, three of his children were missing. He again checked his arithmetical calculations which still gave the same result and was wondering as to what and where was the mistake. He wrote a couplet in Urdu, reading :
‘Arba jyon ka tyon
Kunba dooba kyon’
(Arba means arithmetic or calculations and Kunba means family).
In fact, the teacher had mistaken the average depth of the river for a uniform depth; the river was very shallow in the beginning but became deeper and deeper and in the middle it was as deep as 4 feet or so. Accordingly, the members of the family with height below 4 feet were drowned.
2. A proper and judicious choice of an average for a particular problem is very important.
A wrong choice of the average might give wrong and fallacious conclusions.
3. An average fails to give the complete picture of a distribution. We might come across a number of distributions having the same average but differing widely in their structure and constitution. To form a complete idea about the distribution, the measures of central tendency are to be supplemented by some more measures such as dispersion, skewness and kurtosis.
4. In certain types of distributions like U-shaped or J-shaped distributions, an average (which is only a single point of concentration) fails to represent the entire series [c.f. Chapter 4].
5. Sometimes an average might give very absurd results. For instance, the average size of a family might come out in fractions which is obviously absurd. In this context we might quote the following :
“The figure of 2·2 children per adult female is felt in some respects to be absurd and the Royal Commission suggested that the middle classes be paid money to increase the average to a rounder and more convenient number”.

EXERCISE 5·4
1. Define Geometric Mean and discuss its merits and demerits. Give two practical situations where you will recommend its use.
2. (a) “It is said that the choice of an average depends on the particular problem in hand”. Examine the above statement and give at least one instance each for the use of Mode and Geometric Mean.
(b) Discuss the strong and weak points of various measures of central tendency.
3. (a) What are the advantages and disadvantages of the chief averages used in Statistics ? Indicate their special uses if any.
(b) What are the desiderata for a satisfactory average ? Examine the geometric mean in the light of these desiderata and bring out the special properties of this average which lend to its use in intercensal population counts and in the construction of index numbers.
4. (a) “Each average has its own special features and it is difficult to say which one is the best”.
Explain and illustrate.
(b) Why is arithmetic mean generally preferred over median as the measure of central tendency ? What is the relation between arithmetic mean and geometric mean ? When is the latter preferred over the former ?
5. Explain the relative merits of geometric mean over other measures of central tendency.
6. Give a specific example of an instance in which :
(a) The median would be used in preference to arithmetic mean,
(b) The arithmetic mean would not be as satisfactory as the geometric mean, and
(c) Mode would be used in preference to the median.
7. (a) Find the G.M. of 1, 2, 3, $\dfrac{1}{2}$, $\dfrac{1}{3}$. What will be the geometric mean if ‘0’ is added to this set of values ? [I.C.W.A. (Foundation), June 2003]
Ans. $G = \left(1 \times 2 \times 3 \times \dfrac{1}{2} \times \dfrac{1}{3}\right)^{1/5} = 1^{1/5} = 1$ ; G = 0
(b) Find the geometric mean of : 1, 7, 18, 65, 91 and 103.
Ans. 20·62.
(c) Calculate geometric mean of the data : 1, 7, 29, 92, 115 and 375.
Ans. 30·50
8. Calculate arithmetic mean and geometric mean of the following distribution
x :    2    3    4    5    6    7    8
f :    2    4    6    2    3    2    1
Ans. A.M. = 4·5 ; G.M. = 4·192.
9. If the population has doubled itself in twenty years, is it correct to say that the rate of growth has been 5% per annum ?
Ans. No. r = 3·5%.
10. The population of a city was 1,00,000 in 1975 and 1,44,000 a decade later. Estimate the population at the middle of the decade. [Delhi Univ. B.A. (Econ. Hons.) 1996]
Hint and Ans. r : Rate of growth per annum. Then
$1,44,000 = 1,00,000\,(1 + r)^{10}$ ⇒ $(1 + r)^{10} = \dfrac{144}{100} = \left(\dfrac{12}{10}\right)^2$ ⇒ $(1 + r)^5 = \dfrac{12}{10} = 1·2$.
Estimated population at the middle of the decade $= 1,00,000\,(1 + r)^5 = 1,00,000 \times 1·2 = 1,20,000$.
11. The population of India in 1951 and 1961 were 361 and 439 million respectively.
(i) What was the average percentage increase per year during the period ?
(ii) If the average rate of increase from 1961 to 1971 remains the same, what would be the population in 1971 ?
Ans. (i) 2%, (ii) 533·85 million.
12.
The population of a country increased by 20 per cent in the first decade and by 30 per cent in the second decade and by 45 per cent in the third decade. Determine the average decennial growth rate of population.
Ans. 31·3%
13. A machine depreciates by 40% in the first year, by 25% in the second year and by 10% per annum for the next three years, each percentage being calculated on the diminishing value. What is the average percentage of depreciation for the entire period ? [Delhi Univ. B.Com. (Hons.), 1994]
Ans. 20%.
14. (a) An income-tax assessee depreciated the machinery of his factory by 20 per cent in each of the first two years and 40 per cent in the third year. How much average depreciation relief should he claim from the taxation department ? [Delhi Univ. B.A. (Econ. Hons.), 1999; Bangalore Univ. B.Com., 1998]
Ans. 27·32%.
(b) A businessman depreciated the machinery of his factory by 20% in the first two years and 40% in the third year. What is the average depreciation for the three years ? [Delhi Univ. B.A. (Econ. Hons.), 2004]
Ans. G.M. = 27·32%.
15. (a) An economy grows at the rate of 2% in the first year, 2·5% in the second year, 3% in the third, 4% in the fourth, … and 10% in the tenth year. What is the average rate of growth of the economy ? (Nagarjuna Univ. B.Com., 1996)
Ans. 5·6% p.a.
(b) The annual rates of growth achieved by a nation for 5 years are 5%, 7·5%, 2·5%, 5% and 10% respectively. What is the compound rate of growth for the 5 year period ? [Delhi Univ. B.A. (Eco. Hons.), 1993]
Ans. 5·9%.
16. The number of divorces per 1,000 marriages in a big city in India increased from 96 in 1980 to 120 in 1990. Find the annual rate of increase of the divorce rate for the period 1980 to 1990. [Delhi Univ. B.Com. (Hons.), 1994]
Hint. $120 = 96\left(1 + \dfrac{r}{100}\right)^{10}$ ⇒ $1 + \dfrac{r}{100} = \left(\dfrac{120}{96}\right)^{1/10}$
Ans. r = 2·26%.
17. If arithmetic mean and geometric mean of two values are 10 and 8 respectively, find the values.
Ans.
16, 4.
18. A man gets three successive annual rises in salary of 20%, 30% and 25% respectively, each percentage being reckoned on his salary at the end of the previous year. How much better or worse would he have been if he had been given three annual rises of 25% each, reckoned in the same way ? [Delhi Univ. B.A. (Econ. Hons.), 2006]
Ans. The man would be better in the second case by 0·31% of his starting salary in the 1st year.
19. The geometric mean of 4 items is 100 and of another 8 items is 3·162. Find the geometric mean of the 12 items.
Ans. 10.
20. (a) Geometric mean of n observations is found to be G. How will you find the correct value of the Geometric Mean if some of the values used in its calculation are found to be wrong and should be replaced by correct values ?
(b) Geometric mean of 2 numbers is 15. If by mistake one figure is taken as 5, instead of 3, find correct geometric mean. [Delhi Univ. B.Com. (Hons.) 1993]
Hint. Let the two numbers be a and b. $\sqrt{ab} = 15$ ⇒ $ab = 225$
Wrong observation, (say), a = 5 ∴ $b = \dfrac{225}{a} = \dfrac{225}{5} = 45$
Correct value of a = 3 (Given) ∴ Correct G.M. $= \sqrt{3 \times 45} = 11·62$
Or Use Formula (5·26b)
(c) The geometric mean of four values was calculated as 16. It was later discovered that one of the values was recorded wrongly as 32 when, in fact, it was 162. Calculate the correct geometric mean. [Delhi Univ. B.Com. (Hons.), 2004]
Ans. Correct G.M. $= 16 \times \left(\dfrac{162}{32}\right)^{1/4} = 16 \times \dfrac{3}{2} = 24$ [Using (5·26b)]
21. (a) Define simple and weighted geometric mean of a given distribution. The weighted geometric mean of three numbers 229, 275 and 125 is 203. The weights for 1st and 2nd numbers are 2 and 4 respectively. Find the weight of the third.
Ans. 3.
(b) The weighted geometric mean of the four numbers 9, 25, 17 and 30 is 15·3. If the weights of the first three numbers are 5, 3 and 4 respectively, find the weight of the fourth number.
Ans. 2 (approx.).
22. (a) Define Harmonic Mean and discuss its merits and demerits.
Under what situations would you recommend its use ?
(b) Find the geometric and harmonic mean from the following data.
Items :    1     2      3       4      5       6        7      8       9       10
Value :    15    250    15·7    157    1·57    105·7    105    1·06    25·7    0·257
Ans. GM = 16·04 ; HM = 1·7637.
23. (a) Find the harmonic mean of the numbers $\dfrac{1}{5}, \dfrac{1}{4}, \dfrac{1}{3}, \dfrac{1}{2}, 1$. [I.C.W.A. (Foundation), June 2004, Dec. 2002]
Hint. $\dfrac{1}{H} = \dfrac{1}{5}\sum\left(\dfrac{1}{x}\right) = \dfrac{1}{5}(5 + 4 + 3 + 2 + 1)$ ⇒ $H = \dfrac{1}{3}$
(b) If each of 3, 48 and 96 occurs once and 6 occurs thrice, verify that geometric mean is greater than harmonic mean. [I.C.W.A. (Foundation), Dec., June 2004]
Ans. G.M. = 12 ; H.M. = 6·94 ; G.M. > H.M.
24. What do you mean by Weighted Harmonic Mean ? When will you use it instead of Simple Harmonic Mean ? Explain by a practical situation.
25. It is said that “Choice of an average depends on the particular problem in hand.” Examine the statement and give at least one instance each for the use of mode, geometric mean and harmonic mean. [Delhi B.Com. (Hons.), 2007]
26. From the following statements select any two which are correct and any three which are incorrect. In respect of each of such statements selected by you, give your comments explaining briefly why you consider the statement correct or incorrect :
(i) The median may be considered more typical than the mean because the median is not affected by the size of the extremes ;
(ii) In a frequency distribution the true value of the mode cannot be calculated exactly ;
(iii) The Geometric Mean cannot be used in the averaging of index numbers because it gives undue importance to small numbers ;
(iv) The Harmonic Mean of a series of fractions is the same as the reciprocal of the arithmetic mean of the series.
Ans. (i) T, (ii) T, (iii) F, (iv) F.
27. Show that the weighted harmonic mean of the first n natural numbers, where the weights are equal to the corresponding numbers, is given by (n + 1)/2. [Delhi Univ. B.A. (Econ. Hons.), 2003]
28.
(a) An aeroplane flies around a square the sides of which measure 100 km. each. The aeroplane covers the first side at a speed of 100 km. per hour, the second side at 200 km. per hour, the third side at 300 km. per hour and the fourth side at 400 km. per hour. Use the correct mean to find the average speed around the square.
Ans. 192 km. p.h.
(b) Four factories emit a kilogram of pollutant each in 4, 5, 8 and 12 days respectively. What is the average rate of pollutant discharge ? Use your answer to calculate the total pollutant discharged by the four factories in one week. [Delhi Univ. B.A. (Econ. Hons.), 2005]
Hint. Find H.M. of 4, 5, 8, 12.
Ans. 1 kg. pollutant in $\dfrac{480}{79}$ days per factory.
Total pollutant discharged by four factories per week $= \dfrac{79}{480} \times 4 \times 7 = \dfrac{553}{120} = 4·608$ kg.
29. (a) A railway train runs for 30 minutes at a speed of 40 miles an hour and then, because of repairs of the track runs for 10 minutes at a speed of 8 miles an hour, after which it resumes its previous speed and runs for 20 minutes except for a period of 2 minutes when it had to run over a bridge with a speed of 30 miles per hour. What is the average speed ? [Delhi Univ. B.Com. (Hons.), 2009]
Hint. Average Speed = (Total distance covered) ÷ (Total time taken)
$= \left[\left(\dfrac{40}{60} \times 30\right) + \left(\dfrac{8}{60} \times 10\right) + \left(\dfrac{40}{60} \times 18\right) + \left(\dfrac{30}{60} \times 2\right)\right]$ miles ÷ $\dfrac{30 + 10 + 20}{60}$ hours
Ans. 34·33 m.p.h.
(b) A train runs 25 miles at a speed of 30 m.p.h.; another 50 miles at a speed of 40 m.p.h.; then due to repairs of the track runs for 6 minutes at a speed of 10 m.p.h. and finally covers the remaining distance of 24 miles at a speed of 24 m.p.h. What is the average speed in miles per hour ? [Punjab Univ. B.Com., 1994]
Ans. 31·41 m.p.h.
30. (a) A cyclist covers his first three kilometres at an average speed of 8 kms. per hour, another 2 kms. at 9 kms. per hour and the last 2 kms. at 4 kms. per hour. Find the average speed for the entire journey.
Ans. 6·38 kms. per hour.
(b) If X travels 8 km. at 4 km. per hour; 6 km. at 3 km.
per hour and 4 km. at 2 km. per hour, what would be the average rate per hour at which he travelled ? [Delhi Univ. B.Com. (Pass), 1998]
Ans. Weighted H.M. = 3 km. p.h.
31. (a) A man travelled by car for 3 days. He covered 480 km. each day. On the first day he drove for 10 hours at 48 km. an hour, on the second day he drove for 12 hours at 40 km. an hour and on the last day he drove for 15 hours at 32 km. an hour. What was his average speed ? [Bombay Univ. B.Com., 1996]
Ans. 38·919 km. p.h.
(b) Kishore travels 900 kms. by train at an average speed of 60 kms. per hour; 3,000 kms. by steamship at an average of 25 kms. per hour; 400 kms. by aeroplane at 350 kms. per hour; and finally 15 kms. by bus at 25 kms. per hour. Calculate his average speed for the entire journey. [C.S. (Foundation), Dec. 2001]
Ans. 31·556 km. p.h.
32. A man travels from Agra to Dehradun covering 204 miles at a mileage rate of 10 miles per gallon of petrol and via Ghaziabad with an additional journey of 40 miles at the rate of 15 miles per gallon. Find the average mileage per gallon.
Ans. 10·58 miles per gallon.
33. The consumption of petrol by a motor was a gallon for 20 miles while going up from plains to hill station and a gallon for 24 miles while coming down. What particular average would you consider appropriate for finding the average consumption in miles per gallon for up and down journey, and why ?
Ans. Harmonic Mean = 21·82 m.p. gallon.
34. A man having to drive 90 kilometres wishes to achieve an average speed of 30 kilometres per hour. For the first half of the journey he averages only 20 km. p.h. What must be his average for the second half of the journey if his overall average is to be 30 km. p.h. ?
Ans. 60 km. p.h.
35. An aeroplane travels distances of $d_1$, $d_2$ and $d_3$ kms. at speeds $V_1$, $V_2$ and $V_3$ km. per hour respectively. Show that the average speed (V) is given by :
$\dfrac{d_1 + d_2 + d_3}{V} = \dfrac{d_1}{V_1} + \dfrac{d_2}{V_2} + \dfrac{d_3}{V_3}$ [Delhi Univ. B.A. (Econ.
Hons.), 1993]

36. (a) A person purchases one kilogram of cabbage from each of the four places at the rate of 20 kg., 16 kg., 12 kg. and 10 kg. per 100 rupees respectively. On the average how many kg. of cabbage has he purchased per 100 rupees ? [Delhi Univ. B.Com. (Pass), 2001]
(b) If you spend Rs. 100 per week on apples and the price of apples for three weeks is Rs. 25, Rs. 20 and Rs. 10 per kilogram, what is the average price of apples for you ? [Delhi Univ. B.A. (Econ. Hons.), 2002]
Ans. Rs. 15·79 per kg.

37. In a certain office a letter is typed by A in 4 minutes. The same letter is typed by B, C and D in 5, 6, 10 minutes respectively. What is the average time taken in completing one letter ? How many letters do you expect to be typed in one day comprising 8 working hours ?
Ans. H.M. = 5·58 minutes per letter ; Letters typed in 8 hours (480 minutes) $= \frac{480}{5.58} \simeq 86$.

38. A scooterist purchased petrol at the rate of Rs. 24, Rs. 29.50 and Rs. 36.85 per litre during three successive years. Calculate the average price of petrol, (i) if he purchased 150, 180 and 195 litres of petrol in the respective years and (ii) if he spent Rs. 3,850, Rs. 4,675 and Rs. 5,825 in three years. Give support to your answer. [Delhi Univ. B.Com. (Hons.), 2005]
Hint.
$$\text{Average price of petrol per litre} = \frac{\text{Total money spent on petrol}}{\text{Total petrol consumed in litres}}$$
(i) Weighted A.M. of prices, the weights being the quantities of petrol purchased.
(ii) Weighted H.M. of prices, the weights being the money spent on petrol.
Ans. (i) Rs. 30·65/litre, (ii) Rs. 30/litre (approx.)

39. (a) Define Arithmetic Mean, Harmonic Mean and Geometric Mean for a set of n observations and state the relationship between them.
Ans. A ≥ G ≥ H ; the sign of equality holds if and only if all the observations are equal.
(b) Show the relationship between arithmetic mean and harmonic mean for the variable X, which can take the values a and b such that a, b are non-negative integers. [Delhi Univ. B.A. (Econ.
Hons.), 2007]
Ans.
$$A \times H = \left(\frac{a+b}{2}\right)\left(\frac{2ab}{a+b}\right) = ab = G^2$$
(c) If for two numbers, the arithmetic mean is 25 and the harmonic mean is 9, what is the geometric mean of the series ? [C.A. (Foundation), May 2001]
Ans. G.M. = 15.
(d) If A.M. of two numbers is 17 and G.M. is 15, find the H.M. of these numbers. [Delhi Univ. B.A. (Econ. Hons.), 2000]
Ans. 13·24

40. (a) Comment on the following : “The G.M. and A.M. of a distribution are 27 and 30. Then H.M. is 26.” [Delhi Univ. B.Com. (Hons.), 2009]
Ans. Since A.M. ≥ G.M. ≥ H.M., the statement is correct.
(b) State, giving reasons, which average will be more appropriate in the following cases :
(i) The distribution has open-end classes.
(ii) The distribution has wide range of variations.
(iii) When depreciation is charged by diminishing balance method and an average rate of depreciation is to be calculated.
(iv) The distance covered is fixed but speeds are varying and an average speed is to be calculated. [Delhi Univ. B.Com. (Pass), 1999]
Ans. (i) Md or Mo, (ii) Md, (iii) G.M., (iv) H.M.

6 Measures of Dispersion

6·1. INTRODUCTION AND MEANING

Averages or the measures of central tendency give us an idea of the concentration of the observations about the central part of the distribution. In spite of their great utility in statistical analysis, they have their own limitations. If we are given only the average of a series of observations, we cannot form a complete idea about the distribution, since there may exist a number of distributions whose averages are the same but which may differ widely from each other in a number of ways. The following example will illustrate this viewpoint.

Let us consider the following three series A, B and C of 9 items each.

| Series | Observations | Total | Mean |
|---|---|---|---|
| A | 15, 15, 15, 15, 15, 15, 15, 15, 15 | 135 | 15 |
| B | 11, 12, 13, 14, 15, 16, 17, 18, 19 | 135 | 15 |
| C | 3, 6, 9, 12, 15, 18, 21, 24, 27 | 135 | 15 |

All the three series A, B and C, have the same size (n = 9) and same mean viz., 15.
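The illustration above can be checked with a short Python sketch. The names `series` and `summarize` are ours and purely illustrative; the gap between the largest and smallest value is used here only as a rough indication of spread.

```python
from statistics import mean

# The three series from the illustration above (variable names are illustrative).
series = {
    "A": [15] * 9,                 # 15 repeated nine times
    "B": list(range(11, 20)),      # 11, 12, ..., 19
    "C": list(range(3, 28, 3)),    # 3, 6, ..., 27
}

def summarize(values):
    """Return (total, mean, largest minus smallest) for a list of observations."""
    return sum(values), mean(values), max(values) - min(values)

summary = {name: summarize(values) for name, values in series.items()}

for name, (total, average, spread) in summary.items():
    print(f"Series {name}: total = {total}, mean = {average}, spread = {spread}")
```

All three series report the same total and the same mean, while the spread differs from one series to the next, which is exactly the information the average alone does not convey.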
Thus, if we are given that the mean of a series of 9 observations is 15, we cannot determine if we are talking of the series A, B or C. In fact, any series of 9 items with total 135 will give mean 15. Thus, we may have a large number of series with entirely different structures and compositions but having the same mean.

From the above illustration it is obvious that the measures of central tendency are inadequate to describe the distribution completely. In the words of George Simpson and Fritz Kafka :
“An average does not tell the full story. It is hardly fully representative of a mass unless we know the manner in which the individual items scatter around it. A further description of the series is necessary if we are to gauge how representative the average is.”

Thus the measures of central tendency must be supported and supplemented by some other measures. One such measure is Dispersion.

The literal meaning of dispersion is “scatteredness”. We study dispersion to have an idea of the homogeneity (compactness) or heterogeneity (scatter) of the distribution. In the above illustration, we say that the series A is stationary, i.e., it is constant and shows no variability. Series B is slightly dispersed and series C is relatively more dispersed. We say that series B is more homogeneous (or uniform) as compared with series C, or that series C is more heterogeneous than series B.

We give below some definitions of dispersion as given by different statisticians from time to time.

WHAT THEY SAY ABOUT DISPERSION : SOME DEFINITIONS
“Dispersion is the measure of the variation of the items.” – A.L. Bowley
“Dispersion is a measure of the extent to which the individual items vary.” – L.R. Connor
“Dispersion or spread is the degree of the scatter or variation of the variables about a central value.” – B.C. Brooks and W.F.L.
Dick
“The degree to which numerical data tend to spread about an average value is called the variation or dispersion of the data.” – Spiegel
“The term dispersion is used to indicate the facts that within a given group, the items differ from one another in size or in other words, there is lack of uniformity in their sizes.” – W.I. King

6·1·1. Objectives or Significance of the Measures of Dispersion. The main objectives of studying dispersion may be summarised as follows :

1. To find out the reliability of an average. The measures of variation enable us to find out if the average is representative of the data. As stated earlier, dispersion gives us an idea about the spread of the observations about an average value. If the dispersion is small, it means that the given data values are closer to the central value (average) and hence the average may be regarded as reliable in the sense that it provides a fairly good estimate of the corresponding population average. If the dispersion is large, then the data values deviate more from the central value, thereby implying that the average is not representative of the data and hence not quite reliable.

2. To control the variation of the data from the central value. The measures of variation help us to determine the causes and the nature of variation, so as to control the variation itself. They help to measure the extent of variation from the standard quality of various works carried out in industries. For example, we use 3-sigma (3-σ) control limits to determine if a manufacturing process is in control or not. This helps us to identify the causes of variation in the manufactured product and accordingly take corrective and remedial measures. [For detailed discussion, see Chapter 21, Statistical Quality Control.] The Government can also take suitable policy decisions to remove the inequalities in the distribution of income and wealth, after careful study of the dispersion of the income and wealth.

3.
To compare two or more sets of data regarding their variability. The relative measures of dispersion may be used to compare two or more distributions, even if they are measured in different units, as regards their variability or uniformity. For detailed discussion, see § 6·12 Coefficient of Variation.

4. To obtain other statistical measures for further analysis of data. The measures of variation are used for computing other statistical measures which are used extensively in Correlation Analysis (Chapter 8), Regression Analysis (Chapter 9), Theory of Estimation and Testing of Hypothesis (Chapter 16), Statistical Quality Control (Chapter 21) and so on.

6·2. CHARACTERISTICS FOR AN IDEAL MEASURE OF DISPERSION

The desiderata for an ideal measure of dispersion are the same as those for an ideal measure of central tendency, viz. :
(i) It should be rigidly defined.
(ii) It should be easy to calculate and easy to understand.
(iii) It should be based on all the observations.
(iv) It should be amenable to further mathematical treatment.
(v) It should be affected as little as possible by fluctuations of sampling.
(vi) It should not be affected much by extreme observations.
All these properties have been explained in Chapter 5 on Measures of Central Tendency.

6·3. ABSOLUTE AND RELATIVE MEASURES OF DISPERSION

The measures of dispersion which are expressed in terms of the original units of a series are termed as Absolute Measures. Such measures are not suitable for comparing the variability of two distributions which are expressed in different units of measurement. On the other hand, Relative Measures of dispersion are obtained as ratios or percentages and are thus pure numbers independent of the units of measurement. For comparing the variability of two distributions (even if they are measured in the same units), we compute the relative measures of dispersion instead of the absolute measures of dispersion.

6·4.
MEASURES OF DISPERSION

The various measures of dispersion are :
(i) Range.
(ii) Quartile deviation or Semi-Interquartile range.
(iii) Mean deviation.
(iv) Standard deviation.
(v) Lorenz curve.
The first two measures viz., Range and Quartile deviation are termed as positional measures since they depend upon the values of the variable at particular positions of the distribution. The last measure viz., Lorenz curve is a graphic method of studying variability. In the following sections we shall discuss these measures in detail one by one.

6·5. RANGE

The range is the simplest of all the measures of dispersion. It is defined as the difference between the two extreme observations of the distribution. In other words, range is the difference between the greatest (maximum) and the smallest (minimum) observation of the distribution. Thus
$$\text{Range} = X_{\max} - X_{\min}$$ …(6·1)
where Xmax = L is the greatest observation and Xmin = S is the smallest observation of the variable values.

In case of the grouped frequency distribution (for discrete values) or the continuous frequency distribution, range is defined as the difference between the upper limit of the highest class and the lower limit of the lowest class.

Remarks 1. In case of a frequency distribution, the frequencies of the various values of the variable (or classes) are immaterial since range depends only on the two extreme observations.

2. Absolute and Relative Measures of Range. Range as defined in (6·1) is an absolute measure of dispersion and depends upon the units of measurement. Thus, if we want to compare the variability of two or more distributions with the same units of measurement, we may use (6·1). However, to compare the variability of the distributions given in different units of measurement, we cannot use (6·1) but we need a relative measure which is independent of the units of measurement.
This relative measure, called the coefficient of range, is defined as follows :
$$\text{Coefficient of Range} = \frac{X_{\max} - X_{\min}}{X_{\max} + X_{\min}}$$ …(6·2)
where the symbols have already been explained. In other words, coefficient of range is the ratio of the difference between two extreme observations (the biggest and the smallest) of the distribution to their sum. It is a common practice to use coefficient of range even for the comparison of variability of distributions given in the same units of measurement.

3. Effect of Change of Origin and Scale on Range. Let the given variable X be transformed to the new variable U by changing the origin and scale in X as follows :
$$U = AX + B$$ …(6.3)
Taking A > 0, this gives
$$U_{\max} = A \cdot X_{\max} + B \quad \text{and} \quad U_{\min} = A \cdot X_{\min} + B$$
$$\therefore\ \text{Range}(U) = U_{\max} - U_{\min} = A\,(X_{\max} - X_{\min}) \ \Rightarrow\ \text{Range}(U) = A \cdot \text{Range}(X)$$ …(6.3a)
Hence, range is independent of change of origin but not of scale. (If A is negative, the largest and the smallest values change places and the factor A in (6.3a) is replaced by | A |.)

4. When is dispersion (variation) zero ? A crude measure of dispersion is :
Range = Largest sample observation – Smallest sample observation.
Range = 0, if Largest sample observation = Smallest sample observation.
This is possible only if the variable takes a constant value i.e., if all the observations in the sample have the same value. For example, if a variable takes 5 values 8, 8, 8, 8, 8, then :
Largest value (L) = 8 and Smallest value (S) = 8
∴ Range = L – S = 8 – 8 = 0.

6·5·1. Merits and Demerits of Range. Range is the simplest though crude measure of dispersion. It is rigidly defined, readily comprehensible and is perhaps the easiest to compute, requiring very little calculation. However, it does not satisfy the properties (iii) to (vi) for an ideal measure of dispersion. We give below its limitations and drawbacks.
(i) Range is not based on the entire set of data. It is based only on two extreme observations, which themselves are subject to chance fluctuations. As such, range cannot be regarded as a reliable measure of variability.
(ii) Range is very much affected by fluctuations of sampling. Its value varies very widely from sample to sample.
(iii) If the smallest and the largest observations of a distribution are unaltered and all other values are replaced by a set of observations within these values i.e., between Xmax and Xmin, the range of the distribution remains the same. Moreover, if any item is added or deleted on either side of the extreme value, the value of the range is changed considerably, though its effect is not so pronounced if we use the coefficient of range. Thus range does not take into account the composition of the series or the distribution of the observations within the extreme values. Consequently, it is fairly unreliable as a measure of dispersion of the values within the distribution.
(iv) Range cannot be used if we are dealing with open end classes.
(v) Range is not suitable for mathematical treatment.
(vi) Another shortcoming of the range, though less important, is that it is very sensitive to the size of the sample. As the sample size increases, the range tends to increase though not proportionately.
(vii) In the words of W.I. King : “Range is too indefinite to be used as a practical measure of dispersion.”

6·5·2. Uses.
(1) In spite of the above limitations and shortcomings, range, as a measure of dispersion, has its applications in a number of fields where the data have small variations like the stock market fluctuations, the variations in money rates and rate of exchange.
(2) Range is used in industry for the statistical quality control of the manufactured product by the construction of R-chart i.e., the control chart for range.
(3) Range is by far the most widely used measure of variability in our day-to-day life. For example, the answer to problems like ‘daily sales in a departmental store’, ‘monthly wages of workers in a factory’ or ‘the expected return of fruits from an orchard’ is usually provided by the probable limits in the form of a range.
(4) Range is also used as a very convenient measure by meteorological department for weather forecasts since the public is primarily interested to know the limits within which the temperature is likely to vary on a particular day.

Example 6·1. Calculate the range and the coefficient of range of A’s monthly earnings for a year.

| Month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Monthly earnings (in ’00 Rs.) | 139 | 150 | 151 | 151 | 157 | 158 | 160 | 161 | 162 | 162 | 173 | 175 |

Solution. Largest earnings (L) = Rs. 17,500 ; Smallest earnings (S) = Rs. 13,900
∴ Range = L – S = 17,500 – 13,900 = Rs. 3,600.
$$\text{Coefficient of range} = \frac{L - S}{L + S} = \frac{17{,}500 - 13{,}900}{17{,}500 + 13{,}900} = \frac{36}{314} = 0.115$$

Example 6·2. (a) The following table gives the age distribution of a group of 50 individuals.

| Age (in years) | 16 – 20 | 21 – 25 | 26 – 30 | 31 – 35 |
|---|---|---|---|---|
| No. of persons | 10 | 15 | 17 | 8 |

Calculate range and the coefficient of range.
(b) If the variables x and y are related by 3x – 2y + 5 = 0 and the range of x is 8, find the range of y. [I.C.W.A. (Foundation), Dec. 2005]

Solution. (a) Since age is a continuous variable we should first convert the given classes into continuous classes. The first class will then become 15·5–20·5 and the last class will become 30·5–35·5.
Largest value = 35·5 ; Smallest value = 15·5
∴ Range = 35·5 – 15·5 = 20 years
$$\text{Coefficient of range} = \frac{35.5 - 15.5}{35.5 + 15.5} = \frac{20}{51} = 0.39$$
(b) We are given : Range (x) = 8 and
$$3x - 2y + 5 = 0 \ \Rightarrow\ y = \tfrac{1}{2}(3x + 5) = \tfrac{3}{2}x + \tfrac{5}{2}$$ …(1)
Hence, using (6.3a), we get
$$\text{Range}(y) = \tfrac{3}{2}\,\text{Range}(x) = \tfrac{3}{2} \times 8 = 12$$ [From (1)]

6·6. QUARTILE DEVIATION OR SEMI INTER-QUARTILE RANGE

It is a measure of dispersion based on the upper quartile Q3 and the lower quartile Q1.
$$\text{Inter-quartile Range} = Q_3 - Q_1$$ …(6·4)
Quartile deviation is obtained from inter-quartile range on dividing by 2 and hence is also known as semi inter-quartile range. Thus
$$\text{Quartile Deviation (Q.D.)} = \frac{Q_3 - Q_1}{2}$$ …(6·5)
Q.D.
as defined in (6·5) is only an absolute measure of dispersion. For comparative studies of variability of two distributions we need a relative measure which is known as Coefficient of Quartile Deviation and is given by :
$$\text{Coefficient of Q.D.} = \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$ …(6·6)

Remarks 1. The quartile deviation gives the average amount by which the two quartiles differ from median. For a symmetrical distribution we have (c.f. Chapter 7) :
$$Q_3 - Md = Md - Q_1 \ \Rightarrow\ Md = \frac{Q_1 + Q_3}{2}$$ …(*)
i.e., median lies half way on the scale from Q1 to Q3. Thus for a symmetrical distribution we have :
$$\text{Q.D.} + Q_1 = \frac{Q_3 - Q_1}{2} + Q_1 = \frac{Q_3 + Q_1}{2} = Md$$ [From (*)]
and
$$Q_3 - \text{Q.D.} = Q_3 - \frac{Q_3 - Q_1}{2} = \frac{Q_3 + Q_1}{2} = Md$$ [From (*)]
In other words, for a symmetrical distribution we have :
$$Q_1 = Md - \text{Q.D.} \quad \text{and} \quad Q_3 = Md + \text{Q.D.}$$ …(**)
Since in a distribution, 25% of the observations lie below Q1 and 25% of the observations lie above Q3, 50% of the observations lie between Q1 and Q3. Therefore, using (**) we conclude that for a symmetrical distribution Md ± Q.D. covers exactly 50% of the observations.

2. Rigorously speaking, quartile deviation is only a positional average and does not exhibit any scatter around an average. As such some statisticians prefer to call it a measure of partition rather than a measure of dispersion.

6·6·1. Merits and Demerits of Quartile Deviation.
Merits. Quartile deviation is quite easy to understand and calculate. It has a number of obvious advantages over range as a measure of dispersion. For example :
(a) As against range which was based on two observations only, Q.D. makes use of 50% of the data and as such is obviously a better measure than range.
(b) Since Q.D. ignores 25% of the data from the beginning of the distribution and another 25% of the data from the top end, it is not affected at all by extreme observations.
(c) Q.D. can be computed from the frequency distribution with open end classes. In fact, Q.D.
is the only measure of dispersion which can be obtained while dealing with a distribution having open end classes.
Demerits. (i) Q.D. is not based on all the observations since it ignores 25% of the data at the lower end and 25% of the data at the upper end of the distribution. Hence, it cannot be regarded as a reliable measure of variability.
(ii) Q.D. is affected considerably by fluctuations of sampling.
(iii) Q.D. is not suitable for further mathematical treatment.
Thus quartile deviation is not a reliable measure of variability, particularly for distributions in which the variation is considerable.

6·7. PERCENTILE RANGE

This is a measure of dispersion based on the difference between certain percentiles. If Pi is the ith percentile and Pj is the jth percentile then the so-called i-j percentile range is given by :
$$i\text{-}j \text{ Percentile Range} = P_j - P_i, \quad (i < j)$$ …(6·7)
Thus the i-j Semi-percentile Range is given by :
$$\frac{P_j - P_i}{2}, \quad (i < j)$$ …(6·7a)
The commonly used percentile range is the one which corresponds to the 10th and 90th percentiles. Thus taking i = 10 and j = 90 in (6·7), we get
$$\text{10-90 Percentile Range} = P_{90} - P_{10}$$ …(6·8)
and
$$\text{10-90 Semi-percentile Range} = \frac{P_{90} - P_{10}}{2}$$ …(6·8a)
The above measures are absolute measures only. The relative measure of variability based on percentiles is given by :
$$\text{Coefficient of 10-90 percentile} = \frac{(P_{90} - P_{10})/2}{(P_{90} + P_{10})/2} = \frac{P_{90} - P_{10}}{P_{90} + P_{10}}$$ …(6·9)
Theoretically, 10-90 percentile range should serve as a better measure of dispersion than Q.D. since it is based on 80% of the data. However, in practice it is not commonly used.

Example 6·3. Find : (i) Inter-quartile Range; (ii) Quartile Deviation; (iii) Coefficient of Quartile Deviation, for the following distribution :

| Class Interval | 0–15 | 15–30 | 30–45 | 45–60 | 60–75 | 75–90 | 90–105 |
|---|---|---|---|---|---|---|---|
| f | 8 | 26 | 30 | 45 | 20 | 17 | 4 |

Solution. COMPUTATION OF QUARTILES
Here N/4 = 37·5. The c.f. just greater than 37·5 is 64. Hence, Q1 lies in the corresponding class 30–45.
$$\therefore\ Q_1 = l + \frac{h}{f}\left(\frac{N}{4} - C\right) = 30 + \frac{15}{30}\,(37.5 - 34) = 30 + \frac{3.5}{2} = 30 + 1.75 = 31.75$$
3N/4 = 112·5. The c.f. just greater than 112·5 is 129. Hence, Q3 lies in the corresponding class 60–75.
$$\therefore\ Q_3 = l + \frac{h}{f}\left(\frac{3N}{4} - C\right) = 60 + \frac{15}{20}\,(112.5 - 109) = 60 + \frac{3 \times 3.5}{4} = 60 + 2.625 = 62.625$$

The cumulative frequencies used above are :

| Class Interval | f | (Less than) c.f. |
|---|---|---|
| 0–15 | 8 | 8 |
| 15–30 | 26 | 34 |
| 30–45 | 30 | 64 |
| 45–60 | 45 | 109 |
| 60–75 | 20 | 129 |
| 75–90 | 17 | 146 |
| 90–105 | 4 | 150 |
| Total | N = 150 | |

(i) Inter-quartile Range = Q3 – Q1 = 62·625 – 31·750 = 30·875
(ii) $\text{Quartile Deviation} = \frac{Q_3 - Q_1}{2} = \frac{30.875}{2} = 15.44$ [From Part (i)]
(iii) Using the results in Part (i), we get :
$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{62.63 - 31.75}{62.63 + 31.75} = \frac{30.88}{94.38} = 0.33$$

Example 6·4. (a) Evaluate an appropriate measure of dispersion for the following data :

| Income (in Rs.) | Less than 50 | 50–70 | 70–90 | 90–110 | 110–130 | 130–150 | Above 150 |
|---|---|---|---|---|---|---|---|
| No. of persons | 54 | 100 | 140 | 300 | 230 | 125 | 51 |

(b) Comment on the following : If the coefficient of quartile deviation (Q.D.) is 0.6 and Q.D. = 15, then Q1 = 10 and Q3 = 40. [Delhi Univ. B.Com. (Hons.), 2009]

Solution. (a) Since we are given the classes with open end intervals, the only measure of dispersion that we can compute is the quartile deviation.

COMPUTATION OF QUARTILE DEVIATION

| Income (in Rs.) | No. of persons (f) | Less than c.f. |
|---|---|---|
| Less than 50 | 54 | 54 |
| 50–70 | 100 | 154 |
| 70–90 | 140 | 294 |
| 90–110 | 300 | 594 |
| 110–130 | 230 | 824 |
| 130–150 | 125 | 949 |
| Above 150 | 51 | 1000 |

Here N = 1000, N/4 = 1000/4 = 250 and 3N/4 = 750.
Since the c.f. just greater than 250 is 294, Q1 (first quartile) lies in the corresponding class 70–90. Similarly, since the c.f. just greater than 750 is 824, the corresponding class 110–130 contains Q3. Hence,
$$Q_1 = 70 + \frac{20}{140}\,(250 - 154) = 70 + \frac{96}{7} = 70 + 13.714 = 83.714$$
$$Q_3 = 110 + \frac{20}{230}\,(750 - 594) = 110 + \frac{2 \times 156}{23} = 110 + 13.565 = 123.565$$
$$\therefore\ \text{Quartile deviation (Q.D.)} = \frac{Q_3 - Q_1}{2} = \frac{123.565 - 83.714}{2} = \frac{39.851}{2} = 19.925$$

(b) We are given :
$$\text{Q.D.} = \frac{Q_3 - Q_1}{2} = 15 \ \Rightarrow\ Q_3 - Q_1 = 30$$ …(1)
$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = 0.6 \ \Rightarrow\ Q_3 + Q_1 = \frac{30}{0.6} = \frac{30 \times 10}{6} = 50$$ …(2)
Adding (1) and (2), we get 2Q3 = 80 ⇒ Q3 = 40.
Subtracting (1) from (2), we get 2Q1 = 20 ⇒ Q1 = 10.
Hence, the given statement is true.

EXERCISE 6·1

1. (a) “Frequency distributions may either differ in the numerical size of their averages though not necessarily in their formations or they may have the same values of their averages yet differ in their respective formations.” Explain and illustrate how the measures of dispersion afford a supplement to the information about the frequency distributions given by the averages.
(b) Discuss the validity of the statement : “An average, when published, should be accompanied by a measure of dispersion, for significant interpretation.”

2. (a) Find the range and the coefficient of range for the following observations : 65, 70, 82, 59, 81, 76, 57, 60, 55 and 50. [C.A. PEE-1, Nov. 2003]
Ans. 32 ; 0.2424
(b) From the monthly income of 10 families given below, calculate (a) the median, (b) the geometric mean, (c) the coefficient of range.

| S. No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Income in Rs. | 145 | 367 | 268 | 73 | 185 | 619 | 280 | 115 | 870 | 315 |

Ans. (a) Md = Rs. 274, (b) G = Rs. 252·4, (c) Coefficient of Range = 0·84.

3. The index numbers of prices of cotton shares (I1) and coal shares (I2) in a given year are as under :

| Month | Jan. | Feb. | March | April | May | June | July | August | Sept. | Oct. | Nov. | Dec. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I1 | 188 | 178 | 173 | 164 | 172 | 183 | 184 | 185 | 211 | 217 | 232 | 240 |
| I2 | 131 | 130 | 130 | 129 | 129 | 120 | 127 | 127 | 130 | 137 | 140 | 142 |

Calculate range for each share. Hence, discuss which share you consider more variable in price.
Ans. Range (I1) = 76, Coefficient of Range (I1) = 0·19 ; Range (I2) = 22, Coefficient of Range (I2) = 0·084.
Cotton shares are more variable in prices.

4. Find the value of third quartile if the values of first quartile and quartile deviation are 104 and 18 respectively. [Delhi Univ. B.Com. (Pass), 2002]
Ans. Q3 = 140.

5. Age distribution of 200 employees of a firm is given below. Construct a ‘less than’ ogive curve, and hence or otherwise calculate the semi-interquartile range $\frac{Q_3 - Q_1}{2}$ of the distribution :

| Age in years (less than) | 25 | 30 | 35 | 40 | 45 | 50 | 55 |
|---|---|---|---|---|---|---|---|
| No. of employees | 10 | 25 | 75 | 130 | 170 | 189 | 200 |

Ans. Q1 = 33·5 years, Q3 = 43 years, $\frac{Q_3 - Q_1}{2}$ = 4·75 years

6. Find the mode, median, lower quartile (Q1) and upper quartile (Q3) and Coeff. of Q.D. from the following data :

| Wages | 0–10 | 10–20 | 20–30 | 30–40 | 40–50 |
|---|---|---|---|---|---|
| No. of workers | 22 | 38 | 46 | 35 | 20 |

[Maharishi Dayanand Univ. B.Com., 1997]
Ans. Mode = 24·21 ; Median = 24·46, Q1 = 14·803, Q3 = 34·21 ; Coeff. of Q.D. = 0·396.

7. Compute the Coefficient of Quartile Deviation of the following data :

| Size | 4–8 | 8–12 | 12–16 | 16–20 | 20–24 | 24–28 | 28–32 | 32–36 | 36–40 |
|---|---|---|---|---|---|---|---|---|---|
| Frequency | 6 | 10 | 18 | 30 | 15 | 12 | 10 | 6 | 2 |

Ans. Q1 = 14·5, Q3 = 24·92, Coefficient of Q.D. = 0·2643.

8. Find (i) Inter-quartile range, (ii) Semi-inter-quartile range, and (iii) Coefficient of quartile deviation, from the following frequency distribution :

| Marks | 10–20 | 20–30 | 30–40 | 40–50 | 50–60 | 60–70 | 70–80 | 80–90 |
|---|---|---|---|---|---|---|---|---|
| No. of students | 60 | 45 | 120 | 25 | 90 | 80 | 120 | 60 |

[C.A. (Foundation), Dec. 1993]
Ans. (i) 38·75, (ii) 19·375, (iii) 0·3647.

9. From the following data, (i) calculate the ‘percentage’ of workers getting wages : (a) more than Rs. 44 ; (b) between Rs. 22 and Rs. 58. (ii) Find the quartile deviation.

| Wages (Rs.) | 0–10 | 10–20 | 20–30 | 30–40 | 40–50 | 50–60 | 60–70 | 70–80 |
|---|---|---|---|---|---|---|---|---|
| No. of workers | 20 | 45 | 85 | 160 | 70 | 55 | 35 | 30 |

Ans. (i) (a) 32·4%, (b) 68·4% ; (ii) Q1 = 27·06, Q3 = 49·29, Q.D. = 11·115.

10. Calculate the appropriate measure of dispersion from the following data :

| Wages in Rs. per week | Less than 35 | 35–37 | 38–40 | 41–43 | Over 43 |
|---|---|---|---|---|---|
| No. of wage earners | 14 | 62 | 99 | 18 | 7 |

Ans. Coefficient of Q.D. = 0·046.

11.
Find out middle 50%, middle 80% and coefficient of Q.D. from the following table :

| Size of items | 2 | 4 | 6 | 8 | 10 | 12 |
|---|---|---|---|---|---|---|
| Frequency | 3 | 5 | 10 | 12 | 6 | 4 |

Ans. Quartile range = 4 ; Percentile range = 8, Coefficient of Q.D. = 0·25.

6·8. MEAN DEVIATION OR AVERAGE DEVIATION

As already pointed out, the two measures of dispersion discussed so far viz., range and quartile deviation are not based on all the observations and also they do not exhibit any scatter of the observations from an average and thus completely ignore the composition of the series. Mean Deviation or the Average Deviation overcomes both these drawbacks. As the name suggests, this measure of dispersion is obtained on taking the average (arithmetic mean) of the deviations of the given values from a measure of central tendency. According to Clark and Schkade :
“Average deviation is the average amount of scatter of the items in a distribution from either the mean or the median, ignoring the signs of the deviations. The average that is taken of the scatter is an arithmetic mean, which accounts for the fact that this measure is often called the mean deviation.”

6·8·1. Computation of Mean Deviation. If X1, X2, …, Xn are n given observations then the mean deviation (M.D.) about an average A, say, is given by :
$$\text{M.D. (about an average } A) = \frac{1}{n} \sum |X - A| = \frac{1}{n} \sum |d|$$ …(6·10)
where | d | = | X – A |, read as mod (X – A), is the modulus value or absolute value of the deviation d = X – A (after ignoring the negative sign), ∑ | d | is the sum of these absolute deviations and A is any one of the averages Mean (M), Median (Md) and Mode (Mo).

Steps for Computation of Mean Deviation
1. Calculate the average A of the distribution by the usual methods.
2. Take the deviation d = X – A of each observation from the average A.
3. Ignore the negative signs of the deviations, taking all the deviations to be positive to obtain the absolute deviations, | d | = | X – A |.
4.
Obtain the sum of the absolute deviations obtained in step 3.
5. Divide the total obtained in step 4 by n, the number of observations. The result gives the value of the mean deviation about the average A.

In the case of frequency distribution or grouped or continuous frequency distribution, mean deviation about an average A is given by :
$$\text{M.D. (about the average } A) = \frac{1}{N} \sum f\,|X - A| = \frac{1}{N} \sum f\,|d|$$ …(6·11)
where X is the value of the variable or it is the mid-value of the class interval (in the case of grouped or continuous frequency distribution), f is the corresponding frequency, N = ∑ f is the total frequency and | X – A | is the absolute value of the deviation d = (X – A) of the given values of X from the average A (Mean, Median or Mode).

Steps for Computation of Mean Deviation for Frequency Distribution
Steps 1, 2 and 3 are same as given above.
4. Multiply the absolute deviations | d | = | X – A | by the corresponding frequency f to get f | d |.
5. Take the total of products in step 4 to obtain ∑ f | d |.
6. Divide the total in step 5 by N, the total frequency. The resulting value is the mean deviation about the average A.

Remarks 1. Usually, we obtain the mean deviation (M.D.) about any one of the three averages mean (M), median (Md) or mode (Mo). Thus
$$\left.\begin{aligned} \text{M.D. (about mean)} &= \frac{1}{N} \sum f\,|X - M| \\ \text{M.D. (about median)} &= \frac{1}{N} \sum f\,|X - Md| \\ \text{M.D. (about mode)} &= \frac{1}{N} \sum f\,|X - Mo| \end{aligned}\right\}$$ …(6·11a)

2. The sum of the absolute deviations (after ignoring the signs) of a given set of observations is minimum when taken about median. Hence mean deviation is minimum when it is calculated from median. In other words, mean deviation calculated about median will be less than (or at most equal to) mean deviation about mean or mode.

3. As already pointed out in Remark 1, usually, we compute the mean deviation about any one of the three averages mean, median or mode. But since mode is generally ill-defined, in practice M.D. is computed about mean or median.
Further, as a choice between mean and median, theoretically, median should be preferred since M.D. is minimum when calculated about median (c.f. Remark 2). But because of wide applications of mean in Statistics as a measure of central tendency, in practice mean deviation is generally computed from mean.

4. For a symmetrical distribution the range Mean ± M.D. (about mean) or Md ± M.D. (about median) [∵ M = Md for a symmetrical distribution] covers 57·5% of the observations of the distribution. If the distribution is moderately skewed, the range will cover approximately 57·5% of the observations.

5. Effect of Change of Origin and Scale on Mean Deviation About Mean. Let X1, X2, …, Xn be n observations on the variable X. Let the variable X be transformed to the new variable Y by changing the origin and scale in X, by the transformation :
$$Y = AX + B \ \Rightarrow\ \bar{Y} = A\bar{X} + B \ \Rightarrow\ Y - \bar{Y} = A\,(X - \bar{X})$$ …(6.12)
By definition,
$$\text{M.D. }(Y) \text{ about mean} = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \bar{Y}| = \frac{1}{n} \sum_{i=1}^{n} |A\,(X_i - \bar{X})|$$ [From (6.12)]
$$= |A| \cdot \frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}| = |A| \cdot [\text{M.D. }(X) \text{ about mean}]$$
Hence, we have the following result.
$$Y = AX + B \ \Rightarrow\ \text{M.D. }(Y) \text{ about mean} = |A| \times [\text{M.D. }(X) \text{ about mean}]$$ …(6.12a)

6·8·2. Short-cut Method of Computing Mean Deviation. For computing the mean deviation, first of all we have to calculate the average about which we want the mean deviation, by the methods discussed in the previous chapter. In case the average is a whole number, the method of computing mean deviation by formulae (6·10), (6·11) or (6·12) is quite convenient. But if the average comes out in fractions, this method becomes fairly tedious and time-consuming and requires a lot of algebraic calculations. In such a case we compute the mean deviation by taking the deviations from an arbitrary point ‘a’ near the average value and applying some corrections as given below :
$$\text{M.D. (about Mean)} = \frac{1}{N} \left[ \sum f\,|X - a| + (M - a) \left( \sum f_B - \sum f_A \right) \right]$$ …(6·13)
where M is the mean, a is the arbitrary constant near the mean, ∑ fB is the sum of all the class frequencies before and including the mean value, and ∑ fA is the sum of all the class frequencies after the mean value.

Similarly, using the short cut method,
$$\text{M.D. (about median)} = \frac{1}{N} \left[ \sum f\,|X - a| + (Md - a) \left( \sum f_B' - \sum f_A' \right) \right]$$ …(6·13a)
where now, Md is the median, a is some arbitrary constant near the median, ∑ fB′ is the sum of the class frequencies before and including the median value, and ∑ fA′ is the sum of the class frequencies after the median value.

Remarks 1. Obviously,
$$\sum f_A + \sum f_B = N \ \Rightarrow\ \sum f_B = N - \sum f_A \quad \text{or} \quad \sum f_A = N - \sum f_B$$ …(6·14)
Similarly
$$\sum f_B' = N - \sum f_A'$$ …(6·14a)

2. The above formulae (6·13) and (6·13a) are true provided all the values of the variable which are above the average (M or Md) are also above ‘a’ and those which are below the average are also below ‘a’. The arbitrary constant ‘a’ should be taken as some arbitrary integral value near the average value, i.e., it should be a value in the average class. The short cut method will not yield the correct result if ‘a’ is taken outside the average class.

6·8·3. Merits and Demerits of Mean Deviation
Merits : (i) Mean deviation is rigidly defined and is easy to understand and calculate.
(ii) Mean deviation is based on all the observations and is thus definitely a better measure of dispersion than the range and quartile deviation.
(iii) The averaging of the absolute deviations from an average irons out the irregularities in the distribution and thus mean deviation provides an accurate and true measure of dispersion.
(iv) As compared with standard deviation (discussed in next article § 6·9), it is less affected by extreme observations.
(v) Since mean deviation is based on the deviations about an average, it provides a better measure for comparison about the formation of different distributions.
Demerits.
(i) The strongest objection against mean deviation is that while computing its value we take the absolute value of the deviations about an average and ignore the signs of the deviations.
(ii) The step of ignoring the signs of the deviations is mathematically unsound and illogical. It creates artificiality and renders mean deviation useless for further mathematical treatment. This drawback necessitates another measure of variability which, in addition to being based on all the observations, is also amenable to further algebraic manipulation.
(iii) It is not a satisfactory measure when taken about the mode or while dealing with a fairly skewed distribution. As already pointed out, theoretically mean deviation gives the best result when it is calculated about the median. But the median is not a satisfactory measure when the distribution has great variations.
(iv) It is rarely used in sociological studies.
(v) It cannot be computed for distributions with open end classes.
(vi) Mean deviation tends to increase with the size of the sample, though not proportionately and not so rapidly as the range.

6·8·4. Uses. In spite of its mathematical drawbacks, mean deviation has found favour with economists and business statisticians because of its simplicity and accuracy, and because standard deviation (discussed in § 6·9) gives greater weightage to the deviations of extreme observations. Mean deviation is frequently useful in computing the distribution of personal wealth in a community or a nation since, for this, extremely rich as well as extremely poor people should be taken into consideration. Regarding the practical utility of mean deviation as a measure of variability, it may be worthwhile to note that in studies relating to forecasting business cycles, the National Bureau of Economic Research has found that the mean deviation is the most practical measure of dispersion to use for this purpose.

6·8·5. Relative Measures of Mean Deviation.
The measures of mean deviation as defined in (6·10), (6·11) and (6·12) are absolute measures depending on the units of measurement. The relative measure of dispersion, called the coefficient of mean deviation, is given by :

$$\text{Coefficient of M.D.} = \frac{\text{Mean Deviation}}{\text{Average about which it is calculated}}$$ …(6·15)

$$\therefore\quad \text{Coefficient of M.D. about mean} = \frac{\text{M.D.}}{\text{Mean}}$$ …(6·15a)

$$\text{and}\quad \text{Coefficient of M.D. about median} = \frac{\text{M.D.}}{\text{Median}}$$ …(6·15b)

The coefficients of mean deviation defined in (6·15), (6·15a) and (6·15b) are pure numbers independent of the units of measurement and are useful for comparing the variability of different distributions.

Example 6·5. Calculate the mean deviation from mean for the following data.
Class Interval : 2–4 4–6 6–8 8–10
Frequency : 3 4 2 1

Solution.
COMPUTATION OF MEAN AND M.D. FROM MEAN

| Class | Mid-value (X) | Frequency (f) | d = X – 5 | fd | $\lvert X - \bar{X}\rvert = \lvert X - 5·2\rvert$ | $f\,\lvert X - \bar{X}\rvert$ | $f\,\lvert d\rvert$ * |
|---|---|---|---|---|---|---|---|
| 2–4 | 3 | 3 | –2 | –6 | 2·2 | 6·6 | 6 |
| 4–6 | 5 | 4 | 0 | 0 | 0·2 | 0·8 | 0 |
| 6–8 | 7 | 2 | 2 | 4 | 1·8 | 3·6 | 4 |
| 8–10 | 9 | 1 | 4 | 4 | 3·8 | 3·8 | 4 |
| Total | | ∑ f = 10 | | ∑ fd = 2 | | 14·8 | 14 |

* The last column f | d | is not required for this method. It is needed for the short-cut method given below.

$$\bar{X} = A + \frac{\sum fd}{N} = 5 + \frac{2}{10} = 5·2$$

$$\text{M.D. about mean} = \frac{1}{N}\sum f\,|X - \bar{X}| = \frac{14·8}{10} = 1·48$$

Aliter. Short-cut Method. We can use the deviations from the arbitrary point a = 5 directly to compute the M.D. from mean without computing the values $|X - \bar{X}|$. This is particularly useful when $\bar{X}$ is in fractions (decimals), in which case the usual formula is quite laborious. For this we need the last column f | d | [c.f. (6·13), § 6·8·2]. Using the formula (6·13), we get

$$\text{M.D. about mean} = \frac{1}{N}\Big[\sum f\,|d| + (\bar{X} - a)\big(\textstyle\sum f_B - \sum f_A\big)\Big]$$

where ∑ fB = sum of all the class frequencies before and including the mean class, i.e., the class in which the mean lies. Here the mean is 5·2 and it lies in the class 4–6.
∴ ∑ fB = 4 + 3 = 7
∑ fA = sum of all the class frequencies after the mean class = 2 + 1 = 3, or ∑ fA = N – ∑ fB = 10 – 7 = 3.

$$\therefore\quad \text{M.D. about mean} = \frac{1}{10}\big[14 + (5·2 - 5)\times(7 - 3)\big] = \frac{14 + 0·8}{10} = \frac{14·8}{10} = 1·48.$$

Remark. The value obtained by the short-cut method coincides with the value obtained by the direct method, and rightly so, because the arbitrary point a = 5 lies in the mean class 4–6, near the mean value 5·2 (c.f. Remark 2, § 6·8·2).

Example 6·6. (a) Find the Mean Deviation from the Mean for the following data :
Class Interval : 0–10 10–20 20–30 30–40 40–50 50–60 60–70
Frequency : 8 12 10 8 3 2 7
[C.A. (Foundation), Nov. 1996]
(b) Also find the mean deviation about median.
(c) Compare the results obtained in (a) and (b).

Solution.
CALCULATIONS FOR M.D. ABOUT MEAN AND MEDIAN

| Class Interval | Mid-value (X) | Frequency (f) | Less than c.f. | fX | $\lvert X - \bar{X}\rvert = \lvert X - 29\rvert$ | $f\,\lvert X - \bar{X}\rvert$ | $\lvert X - Md\rvert = \lvert X - 25\rvert$ | $f\,\lvert X - Md\rvert$ |
|---|---|---|---|---|---|---|---|---|
| 0–10 | 5 | 8 | 8 | 40 | 24 | 192 | 20 | 160 |
| 10–20 | 15 | 12 | 20 | 180 | 14 | 168 | 10 | 120 |
| 20–30 | 25 | 10 | 30 | 250 | 4 | 40 | 0 | 0 |
| 30–40 | 35 | 8 | 38 | 280 | 6 | 48 | 10 | 80 |
| 40–50 | 45 | 3 | 41 | 135 | 16 | 48 | 20 | 60 |
| 50–60 | 55 | 2 | 43 | 110 | 26 | 52 | 30 | 60 |
| 60–70 | 65 | 7 | 50 | 455 | 36 | 252 | 40 | 280 |
| Total | | N = 50 | | ∑ fX = 1,450 | | 800 | | 760 |

(a) $$\text{Mean }(\bar{X}) = \frac{1}{N}\sum fX = \frac{1{,}450}{50} = 29.$$

$$\text{Mean Deviation about mean} = \frac{1}{N}\sum f\,|X - \bar{X}| = \frac{800}{50} = 16.$$

(b) (N/2) = (50/2) = 25. The c.f. just greater than 25 is 30. Hence, the corresponding class 20–30 is the median class, with frequency f = 10 and preceding cumulative frequency C = 20. Using the median formula, we get

$$Md = l + \frac{h}{f}\Big(\frac{N}{2} - C\Big) = 20 + \frac{10}{10}\,(25 - 20) = 20 + 5 = 25$$

$$\therefore\quad \text{Mean Deviation about median} = \frac{1}{N}\sum f\,|X - Md| = \frac{760}{50} = 15·2$$

(c) From (a) and (b), we observe that : M.D. about median < M.D. about mean. In fact, we have the following general result : “Mean deviation is least when taken about median.”

Example 6·7. Calculate mean deviation from the median for the following data :
Marks less than : 80 70 60 50 40 30 20 10
No. of Students : 100 90 80 60 32 20 13 5

Solution. First of all we shall convert the given cumulative frequency distribution table into an ordinary frequency distribution, as given in the following table.

COMPUTATION OF M.D. FROM MEDIAN

| Marks | c.f. | Frequency (f) | Mid-value of class (X) | $\lvert X - Md\rvert$ | $f\,\lvert X - Md\rvert$ |
|---|---|---|---|---|---|
| 0–10 | 5 | 5 | 5 | 41·43 | 207·15 |
| 10–20 | 13 | 13 – 5 = 8 | 15 | 31·43 | 251·44 |
| 20–30 | 20 | 20 – 13 = 7 | 25 | 21·43 | 150·01 |
| 30–40 | 32 | 32 – 20 = 12 | 35 | 11·43 | 137·16 |
| 40–50 | 60 | 60 – 32 = 28 | 45 | 1·43 | 40·04 |
| 50–60 | 80 | 80 – 60 = 20 | 55 | 8·57 | 171·40 |
| 60–70 | 90 | 90 – 80 = 10 | 65 | 18·57 | 185·70 |
| 70–80 | 100 | 100 – 90 = 10 | 75 | 28·57 | 285·70 |
| Total | | N = ∑ f = 100 | | | 1428·6 |

Here N/2 = 50. Since the c.f. just greater than 50 is 60, the corresponding class 40–50 is the median class.

$$\therefore\quad Md = l + \frac{h}{f}\Big(\frac{N}{2} - C\Big) = 40 + \frac{10}{28}\,(50 - 32) = 40 + 6·43 = 46·43.$$

$$\therefore\quad \text{M.D. about } Md = \frac{1}{N}\sum f\,|X - Md| = \frac{1428·6}{100} = 14·286 \simeq 14·29.$$

Aliter. Since the median comes out in fractions, we can do the above question conveniently by the short-cut method, i.e., by taking the deviations from any arbitrary point, say a = 45, lying in the median class.

MEAN DEVIATION BY SHORT-CUT METHOD

| Marks (X) | f | d = X – 45 | $\lvert d\rvert$ | $f\,\lvert d\rvert$ |
|---|---|---|---|---|
| 5 | 5 | –40 | 40 | 200 |
| 15 | 8 | –30 | 30 | 240 |
| 25 | 7 | –20 | 20 | 140 |
| 35 | 12 | –10 | 10 | 120 |
| 45 | 28 | 0 | 0 | 0 |
| 55 | 20 | 10 | 10 | 200 |
| 65 | 10 | 20 | 20 | 200 |
| 75 | 10 | 30 | 30 | 300 |
| Total | ∑ f = 100 | | | 1400 |

Using formula (6·13a), we get

$$\text{M.D. about } Md = \frac{1}{N}\Big[\sum f\,|d| + (Md - a)\big(\textstyle\sum f_B' - \sum f_A'\big)\Big]$$

where ∑ fB′ = sum of the frequencies before and including the median value, viz., 46·43,
= 5 + 8 + 7 + 12 + 28 = 60
∑ fA′ = N – ∑ fB′ = 100 – 60 = 40

$$\therefore\quad \text{M.D. about } Md = \frac{1400 + (46·43 - 45)(60 - 40)}{100} = \frac{1400 + 1·43 \times 20}{100} = \frac{1400 + 28·6}{100} = \frac{1428·6}{100} = 14·286 \simeq 14·29.$$

Remark. As in the last question, the values of M.D. obtained by the direct method and the short-cut method are the same, since the arbitrary value ‘a’ is taken in the median class.

Example 6·8. If 2xi + 3yi = 5 ; i = 1, 2, …, n and the mean deviation of x1, x2, …, xn about their mean is 12, find the mean deviation of y1, y2, …, yn about their mean. [I.C.W.A. (Foundation), Dec. 2006]

Solution. M.D. (X) about mean = 12 (Given).
…(*)

Also

$$2x_i + 3y_i = 5 \;\Rightarrow\; y_i = -\frac{2}{3}\,x_i + \frac{5}{3}\,;\quad i = 1, 2, \dots, n$$ …(**)

We know that if Y = AX + B, then
M.D. (Y) about mean = | A | × [M.D. (X) about mean], [From (6·12a)]
where | A | is the modulus value of A.

$$\therefore\quad \text{M.D. }(Y)\text{ about mean} = \Big|-\frac{2}{3}\Big| \times [\text{M.D. }(X)\text{ about mean}] = \frac{2}{3} \times 12 = 8 \quad [\text{From (*)}]$$

Example 6·9. Mean deviation may be calculated from the arithmetic mean or the median or the mode. Which of these three measures is the minimum ?
Five towns A, B, C, D and E lie in that order along a road. The distances in kilometres of the towns as measured from A are : A–0, B–5, C–9, D–16, E–20. A new college is to be established at one of these 5 places, and the numbers of students who would join the college are : A–39, B–911, C–46, D–193 and E–716. The criterion for choosing the location is that the total distance travelled by the students, as measured in student-kilometres, should be a minimum. (Thus if the college is located at D, 46 students from C will travel in all 46 × 7 = 322 student-kilometres.) Where should the college be located ? Justify your result.

Solution. Mean deviation calculated from the median is the minimum.
Let X denote the distance (in km) as measured from the town A. Then the frequency distribution of the distances covered by the students in going to the college is as given in the following table.

| Town | Distance from town A (X) | No. of students (f) | Less than c.f. |
|---|---|---|---|
| A | 0 | 39 | 39 |
| B | 5 | 911 | 950 |
| C | 9 | 46 | 996 |
| D | 16 | 193 | 1189 |
| E | 20 | 716 | N = 1905 |

We are given that the criterion for choosing the location of the college is that the total distance travelled by the students, as measured in student-kilometres, should be minimum. Since mean deviation is minimum when calculated from the median, the total distance travelled by the students will be minimum at the point X = Median of the frequency distribution of X.
Here (N/2) = (1905/2) = 952·5. The c.f. just greater than 952·5 is 996. Hence, the corresponding value X = 9 is the median. Since X = 9 corresponds to the town C, the college should be located at the town C.

EXERCISE 6·2

1. What do you mean by ‘mean deviation’ ? Discuss its relative merits over range and quartile deviation as a measure of dispersion. Also point out its limitations.

2. Calculate mean deviation about A.M. from the following :
Value (x) : 10 11 12 13
Frequency (f) : 3 12 18 12
Ans. A.M. = 11·87 ; M.D. = 0·71.

3. Calculate the mean deviation about median of the series :
x : 2·5 3·5 4·5 5·5 6·5 7·5 8·5 9·5 10·5
f : 2 3 5 6 6 4 6 4 14
Ans. M.D. (about median) = 2·22.

4. Compute the quartile deviation and mean deviation from median for the following data.
Height in inches : 58 59 60 61 62 63 64 65 66
No. of students : 15 20 32 35 33 22 22 10 8
Ans. Q.D. = 1·5 ; M.D. (about median) = 1·73.

5. With median as the base, calculate the mean deviation and compare the variability of the two series A and B.
Series A : 3484 4572 4124 3682 5624 4388 3680 4308
Series B : 487 508 620 382 408 266 186 218
Ans. Series A : Md = 4216 ; M.D. = 490·25 ; Coeff. of M.D. = 0·116.
Series B : Md = 395 ; M.D. = 121·38 ; Coeff. of M.D. = 0·307. Series B is more variable.

6. Compare the dispersion of the following series by using the coefficient of mean deviation.
Age (years) : 16 17 18 19 20 21 22 23 24 Total
No. of boys : 4 5 7 12 20 13 5 0 4 70
No. of girls : 2 0 4 8 15 10 6 3 2 50
Ans. Coeff. of M.D. about median (boys) = 0·0685 ; Coeff. of M.D. about median (girls) = 0·0630.

7. Calculate the mean deviation from the mean for the following data :
Marks : 0–10 10–20 20–30 30–40 40–50 50–60 60–70
No. of Students : 6 5 8 15 7 6 3
[C.A. (Foundation), May 1999]
Ans. Mean = 33·4 ; M.D. about mean = 13·184.

8. (a) Mean deviation may be calculated from the arithmetic mean or the median or the mode. Which of these three measures is the minimum ?
(b) Find out mean deviation and its coefficient from median from the following series :
Size of items : 4 6 8 10 12 14 16
Frequency : 2 1 3 6 4 3 1
Ans. 2·4 ; 0·24

9. (a) Calculate the mean deviation about the mean for the following data :
x : 5 15 25 35 45 55 65
f : 8 12 10 8 3 2 7
[C.A. (Foundation), May 2001]
(b) Also find the M.D. about median and comment on the results obtained in (a) and (b).
Ans. Mean = 29 ; M.D. about mean = 16 ; Median = 25 ; M.D. about median = 15·2.

10. Calculate mean deviation from median from the following data :

| Class interval | (f) | Class interval | (f) |
|---|---|---|---|
| 20–25 | 6 | 50–55 | 10 |
| 25–30 | 12 | 55–60 | 8 |
| 30–40 | 17 | 60–70 | 5 |
| 40–45 | 30 | 70–80 | 2 |
| 45–50 | 10 | | |

Also calculate coefficient of mean deviation.
Ans. 8·75 ; 0·206.

11. The following distribution gives the difference in age between husband and wife in a particular community :
Difference in years : 0–5 5–10 10–15 15–20 20–25 25–30 30–35 35–40
Frequency : 449 705 507 281 109 52 16 4
Calculate mean deviation about median from these data. What light does it throw on the social conditions of a community ?
Ans. M.D. about median = 5·24.

12. Find the median and mean deviation of the following data :
Size : 0–10 10–20 20–30 30–40 40–50 50–60 60–70
Frequency : 7 12 18 25 16 14 8
[Mysore Univ. B.Com., 1998]
Ans. Median = 35·2 ; M.D. = 13·148.

13. Calculate the value of coefficient of mean deviation (from median) of the following data :

| Marks | No. of Students | Marks | No. of Students |
|---|---|---|---|
| 10–20 | 2 | 50–60 | 25 |
| 20–30 | 6 | 60–70 | 20 |
| 30–40 | 12 | 70–80 | 10 |
| 40–50 | 18 | 80–90 | 7 |

Ans. Median = 54·8 ; M.D. about median = 12·95 ; Coefficient of M.D. = 0·2363.

14. Compute the mean deviation from the median and from mean for the following distribution of the scores of 50 college students.
Score : 140–150 150–160 160–170 170–180 180–190 190–200
Frequency : 4 6 10 10 9 3
Ans. 10·24 ; 10·56.

15. Calculate Mean Deviation from Median from the following data :
Wages in Rs. (Mid-value) : 125 175 225 275 325
No. of persons : 3 8 21 6 2
Ans. Median = 221·43 ; M.D. (Median) = 31·607.

6·9. STANDARD DEVIATION
Standard deviation, usually denoted by the letter σ (small sigma) of the Greek alphabet, was first suggested by Karl Pearson as a measure of dispersion in 1893. It is defined as the positive square root of the arithmetic mean of the squares of the deviations of the given observations from their arithmetic mean. Thus if X1, X2, …, Xn is a set of n observations, then its standard deviation is given by :

$$\sigma = \sqrt{\frac{1}{n}\sum (X - \bar{X})^2}$$ …(6·16)

where

$$\bar{X} = \frac{1}{n}\sum X$$ …(6·16a)

is the arithmetic mean of the given values.

Steps for Computation of Standard Deviation
1. Compute the arithmetic mean $\bar{X}$ by the formula (6·16a).
2. Compute the deviation $(X - \bar{X})$ of each observation from the arithmetic mean, i.e., obtain $X_1 - \bar{X},\ X_2 - \bar{X},\ \dots,\ X_n - \bar{X}$.
3. Square each of the deviations obtained in step 2, i.e., compute $(X_1 - \bar{X})^2,\ (X_2 - \bar{X})^2,\ \dots,\ (X_n - \bar{X})^2$.
4. Find the sum of the squared deviations in step 3, given by : $\sum (X - \bar{X})^2 = (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2$.
5. Divide the sum in step 4 by n to obtain $\frac{1}{n}\sum (X - \bar{X})^2$.
6. Take the positive square root of the value obtained in step 5.
7. The resulting value gives the standard deviation of the distribution.

In the case of a frequency distribution, the standard deviation is given by :

$$\sigma = \sqrt{\frac{1}{N}\sum f\,(X - \bar{X})^2}$$ …(6·17)

where X is the value of the variable or the mid-value of the class (in the case of grouped or continuous frequency distributions) ; f is the corresponding frequency of the value X ; N = ∑ f is the total frequency and

$$\bar{X} = \frac{1}{N}\sum fX$$ …(6·17a)

is the arithmetic mean of the distribution.

Steps for Computation of Standard Deviation in case of Frequency Distribution
1. Compute $\bar{X}$ by the formula (6·17a) or the usual step deviation formula discussed in Chapter 5.
2. Compute the deviations $(X - \bar{X})$ from the mean for each value of the variable X.
3. Obtain the squares of the deviations obtained in step 2, i.e., compute $(X - \bar{X})^2$.
4. Multiply each of the squared deviations obtained in step 3 by the corresponding frequency to get $f\,(X - \bar{X})^2$.
5. Find the sum of the values obtained in step 4 to get $\sum f\,(X - \bar{X})^2$.
6. Divide the sum obtained in step 5 by N = ∑ f, the total frequency.
7. The positive square root of the value obtained in step 6 gives the standard deviation of the distribution.

Remarks
1. It may be pointed out that although mean deviation could be calculated about any one of the averages (M, Md or Mo), standard deviation is always computed about the arithmetic mean.
2. To be more precise, the standard deviation of the variable X will be denoted by σx. This notation will be useful when we have to deal with the standard deviations of two or more variables.
3. Standard deviation, abbreviated as S.D. or s.d., is always taken as the positive square root in (6·16) or (6·17).
4. The value of s.d. depends on the numerical values of the deviations $(X_1 - \bar{X}),\ (X_2 - \bar{X}),\ \dots,\ (X_n - \bar{X})$. Thus the value of σ will be greater if the values of X are scattered widely away from the mean. A small value of σ implies that the distribution is homogeneous and a large value of σ implies that it is heterogeneous. In particular, s.d. is zero if each of the deviations is zero, i.e., σ = 0 if and only if

$$X_1 - \bar{X} = 0,\ X_2 - \bar{X} = 0,\ \dots,\ X_n - \bar{X} = 0 \;\Rightarrow\; X_1 = X_2 = X_3 = \dots = X_n = \bar{X},$$

which is the case if the variable assumes a constant value. Thus,

σ = 0 if and only if X1 = X2 = X3 = … = Xn = k (constant) …(6·17b)

In other words, σ = 0 if and only if all the observations are equal. As an illustration, let us suppose that the variable X takes the same constant value throughout, say, 6, 6, 6, 6, 6. Then

| X | 6 | 6 | 6 | 6 | 6 |
|---|---|---|---|---|---|
| $X - \bar{X} = X - 6$ | 0 | 0 | 0 | 0 | 0 |

$$\bar{X} = \frac{1}{5}\sum X = \frac{6 + 6 + 6 + 6 + 6}{5} = \frac{30}{5} = 6$$

$$\therefore\quad \text{S.D. }(\sigma) = \sqrt{\frac{1}{5}\sum (X - \bar{X})^2} = \sqrt{\frac{1}{5}\,(0)} = 0$$

6·9·1. Mathematical Properties of Standard Deviation.
Standard deviation possesses a number of interesting and important mathematical properties which are given below.

1. Standard deviation is independent of change of origin but not of scale. If d = X – A, then σx = σd. But if $d = \dfrac{X - A}{h}$, h > 0, then σx = h · σd.
2. Standard deviation is the minimum value of the root mean square deviation (§ 6·9·3).
3. S.D. ≤ Range, i.e., σ ≤ Xmax – Xmin. [For Proof, see Remark to § 6·9·2.]
4. Standard deviation is suitable for further mathematical treatment. If we know the sizes, means and standard deviations of two or more groups, then we can obtain the standard deviation of the group obtained on combining all the groups. [For details see § 6·10.]
5. The standard deviation of the first n natural numbers, viz., 1, 2, 3, …, n, is $\sqrt{(n^2 - 1)/12}$. [For Proof, see Example 6·23(a).]
6. The Empirical Rule. For a symmetrical bell-shaped distribution, we have approximately the following area properties.
(i) 68% of the observations lie in the range : Mean ± 1·σ.
(ii) 95% of the observations lie in the range : Mean ± 2·σ.
(iii) 99·7% of the observations lie in the range : Mean ± 3·σ.
7. The approximate relationship between quartile deviation (Q.D.), mean deviation (M.D.) and standard deviation (σ) is :

$$\text{Q.D.} \simeq \frac{2}{3}\,\sigma \quad\text{and}\quad \text{M.D.} \simeq \frac{4}{5}\,\sigma \;\Rightarrow\; \text{Q.D.} : \text{M.D.} : \text{S.D.} :: 10 : 12 : 15$$

8. For any discrete distribution, standard deviation is not less than mean deviation about mean, i.e., S.D. (σ) ≥ Mean Deviation about mean.

6·9·2. Merits and Demerits of Standard Deviation
Merits. Standard deviation is by far the most important and widely used measure of dispersion. It is rigidly defined and based on all the observations. The squaring of the deviations $(X - \bar{X})$ removes the drawback of ignoring the signs of deviations in computing the mean deviation. This step renders it suitable for further mathematical treatment. The variance of the pooled (combined) series is given by formula (6·30) in § 6·10.
Moreover, of all the measures of dispersion, standard deviation is affected least by fluctuations of sampling. Thus, we see that standard deviation satisfies almost all the properties laid down for an ideal measure of dispersion, except that the operation of extracting the square root is not readily comprehensible to a non-mathematical person. It may also be pointed out that standard deviation gives greater weight to extreme values and as such has not found favour with economists or businessmen, who are more interested in the results of the modal class.
Taking into consideration the pros and cons and also the wide applications of standard deviation in statistical theory, such as in skewness, kurtosis, correlation and regression analysis, sampling theory and tests of significance, we may regard standard deviation as the best and the most powerful measure of dispersion.

Remark. Since $|X - \bar{X}| \le R$ (Range) for all values of X, viz., X1, X2, …, Xn, we get

$$\sigma^2 = \frac{1}{N}\sum f\,(X - \bar{X})^2 = \frac{1}{N}\Big[f_1 (X_1 - \bar{X})^2 + f_2 (X_2 - \bar{X})^2 + \dots + f_n (X_n - \bar{X})^2\Big]$$

$$\le \frac{1}{N}\Big[f_1 R^2 + f_2 R^2 + \dots + f_n R^2\Big] \qquad (\because\ |X_1 - \bar{X}| \le R,\ |X_2 - \bar{X}| \le R,\ \dots,\ |X_n - \bar{X}| \le R)$$

$$= \frac{1}{N}\cdot R^2\,(f_1 + f_2 + \dots + f_n) = \frac{1}{N}\cdot R^2 \cdot N = R^2$$

∴ σ² ≤ R² ⇒ σ ≤ R ⇒ s.d. ≤ Range.

6·9·3. Variance and Mean Square Deviation. Variance is the square of the standard deviation and is denoted by σ². For a frequency distribution, variance is given by :

$$\sigma^2 = \frac{1}{N}\sum f\,(X - \bar{X})^2$$ …(6·18)

where the symbols have already been explained in (6·17). The mean square deviation, usually denoted by s², is defined as

$$s^2 = \frac{1}{N}\sum f\,(X - A)^2$$ …(6·19)

where A is any arbitrary number. The square root of the mean square deviation is called the root mean square deviation and is given by :

$$s = \sqrt{\frac{1}{N}\sum f\,(X - A)^2}$$ …(6·20)

Relation between σ² and s². We have

s² ≥ σ² ⇒ s ≥ σ …(6·21)

In other words, mean square deviation is not less than the variance, or the root mean square deviation is not less than the standard deviation.
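The inequality (6·21) follows in one step if the deviation from A is split as (X – A) = (X – X̄) + (X̄ – A). The derivation below is a sketch added for completeness; it uses only the symbols defined in (6·18) and (6·19) and the fact that the deviations from the arithmetic mean sum to zero.

```latex
\begin{aligned}
s^2 &= \frac{1}{N}\sum f\,(X - A)^2
     = \frac{1}{N}\sum f\,\big[(X - \bar{X}) + (\bar{X} - A)\big]^2 \\
    &= \frac{1}{N}\sum f\,(X - \bar{X})^2
       + 2\,(\bar{X} - A)\cdot\frac{1}{N}\sum f\,(X - \bar{X})
       + (\bar{X} - A)^2\cdot\frac{1}{N}\sum f \\
    &= \sigma^2 + 0 + (\bar{X} - A)^2
     \qquad \Big[\because\ \sum f\,(X - \bar{X}) = 0,\ \ \sum f = N\Big] \\
    &= \sigma^2 + (\bar{X} - A)^2 \;\ge\; \sigma^2 .
\end{aligned}
```

The extra term (X̄ – A)² is a perfect square and hence never negative, which is why s² can never fall below σ².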
The sign of equality will hold in (6·21), i.e., s² = σ², if and only if

$$\bar{X} = A$$ …(6·22)

Thus, s² will be least when $\bar{X} = A$. Hence, mean square deviation, or equivalently root mean square deviation, is least when deviations are taken from the arithmetic mean, and variance (standard deviation) is the minimum value of mean square deviation (root mean square deviation).

Important Remark
The variance σ² is in squared units. For example, for the distribution of heights (in inches) of a group of individuals, σ² is expressed in (inches)², a concept which is difficult to visualise and interpret. To overcome this problem, we measure the variation in the sample data in the same units as those of the original measurements by calculating the standard deviation (S.D.), which is defined as the positive square root of variance. Hence, in practice, we use standard deviation, rather than variance, as the basic unit of variability.

6·9·4. Different Formulae for Calculating Variance. By definition, the variance of the random variable X, denoted by σ² or more precisely by σx², is given by

$$\sigma_x^2 = \frac{1}{N}\sum f\,(X - \bar{X})^2$$ …(6·23)

where ∑ f = N is the total frequency.
If $\bar{X}$ is not a whole number but comes out in fractions, the computation of σx² by the above formula becomes very cumbersome and time-consuming. In order to overcome this difficulty we shall develop different versions of the formula (6·23) which reduce the arithmetical calculations to a great extent and are very useful for numerical computation of standard deviation.

Formula 1.

$$\sigma_x^2 = \frac{1}{N}\sum fX^2 - \bar{X}^2 = \frac{1}{N}\sum fX^2 - \Big(\frac{1}{N}\sum fX\Big)^2$$ …(6·24)

Formula 2. If d = X – A, where A is an arbitrary constant, then

$$\sigma_x^2 = \sigma_d^2 = \frac{1}{N}\sum fd^2 - \Big(\frac{1}{N}\sum fd\Big)^2$$ …(6·25)

(6·24) is a much more convenient form to use than the formula (6·23). But if the values of X and f are large, the computation of fX and fX² is quite time-consuming.
In that case we use the step-deviation method, in which we take the deviations of the given values of X from any arbitrary point A. Generally, ‘A’ is taken to be a value lying in the middle part of the distribution, although the formula (6·25) holds for any value of A.
(6·25) leads to the following important conclusion :
“The variance, and consequently the standard deviation, of a distribution is independent of the change of origin”.
Thus, if we add (subtract) a constant to (from) each observation of the series, its variance remains the same. Mathematically this means that :

Var (X + a) = Var (X – b) = Var X, where a and b are constants. …(6·25a)

Formula 3. If we change the origin and scale in X, i.e., if we take $d = \dfrac{X - A}{h}$, h > 0, then

$$\sigma_x^2 = h^2\,\sigma_d^2 = h^2\Big[\frac{1}{N}\sum fd^2 - \Big(\frac{1}{N}\sum fd\Big)^2\Big]$$ …(6·26)

In the case of a grouped or continuous frequency distribution, it is convenient to change the scale also. Thus, if h is the magnitude of the class interval (or if h is the common factor in the values of the variable X), then we may take $d = \dfrac{X - A}{h}$, h > 0, and use (6·26).

Remarks
1. Formula (6·26) also leads to the following result.

Var (aX) = a² Var (X) ⇒ S.D. (aX) = | a | S.D. (X) …(6·26a)

where Var (X) denotes the variance of X and a is a constant. This shows that variance (or s.d.) is not independent of change of scale. Combining the results obtained in (6·25) and (6·26) we conclude that :
“Variance or standard deviation is independent of the change of origin but not of the scale”.

2. For numerical problems, a somewhat more convenient form of (6·26) may be used. Rewriting (6·26), we get

$$\sigma_x^2 = \frac{h^2}{N^2}\Big[N\sum fd^2 - \big(\textstyle\sum fd\big)^2\Big] \;\Rightarrow\; \sigma_x = \frac{h}{N}\Big[N\sum fd^2 - \big(\textstyle\sum fd\big)^2\Big]^{1/2} \;\Rightarrow\; \sigma_x = h \cdot \sigma_d$$ …(6·27)

3. Different Formulae for Variance for Raw Data
If x1, x2, …, xn are the n observations, then

$$\sigma_x^2 = \frac{1}{n}\sum (X - \bar{X})^2$$ …(6·28)

$$\Rightarrow\quad \sigma_x^2 = \frac{1}{n}\sum X^2 - \bar{X}^2 = \frac{1}{n}\sum X^2 - \Big(\frac{1}{n}\sum X\Big)^2 \qquad [\text{Taking } f = 1 \text{ and } N = n \text{ in (6·24)}]$$ …(6·28a)

4.
If we are given $\bar{X}$ and σx², then we can obtain the values of ∑ X and ∑ X² as discussed below. From (6·28a), we have

$$\bar{X} = \frac{1}{n}\sum X \;\Rightarrow\; \sum X = n\bar{X}$$ …(6·28b)

$$\text{and}\quad \sigma_x^2 = \frac{1}{n}\sum X^2 - \bar{X}^2 \;\Rightarrow\; \sum X^2 = n\,(\sigma_x^2 + \bar{X}^2)$$ …(6·28c)

Formulae (6·28b) and (6·28c) are very useful when we are given the values of the mean and standard deviation (or variance), it is later found that one or more of the observations are wrong, and it is required to compute the mean and variance after replacing the wrong values by correct values or after deleting the wrong values. For illustrations, see Examples 6·24 to 6·26.

Example 6·10. Calculate the standard deviation of the following observations on a certain variable :
240·12, 240·13, 240·15, 240·12, 240·17, 240·15, 240·17, 240·16, 240·22, 240·21

Solution.
COMPUTATION OF STANDARD DEVIATION

| X | $X - \bar{X}$ | $(X - \bar{X})^2$ |
|---|---|---|
| 240·12 | –0·04 | 0·0016 |
| 240·13 | –0·03 | 0·0009 |
| 240·15 | –0·01 | 0·0001 |
| 240·12 | –0·04 | 0·0016 |
| 240·17 | 0·01 | 0·0001 |
| 240·15 | –0·01 | 0·0001 |
| 240·17 | 0·01 | 0·0001 |
| 240·16 | 0·00 | 0 |
| 240·22 | 0·06 | 0·0036 |
| 240·21 | 0·05 | 0·0025 |
| ∑ X = 2401·60 | $\sum (X - \bar{X}) = 0$ | $\sum (X - \bar{X})^2 = 0·0106$ |

$$\bar{X} = \frac{\sum X}{10} = \frac{2401·60}{10} = 240·16$$

$$\sigma^2 = \frac{1}{n}\sum (X - \bar{X})^2 = \frac{0·0106}{10} = 0·00106$$

$$\therefore\quad \text{s.d. }(\sigma) = (0·00106)^{1/2} = \text{Antilog}\Big[\tfrac{1}{2}\log (0·00106)\Big] = \text{Antilog}\Big[\tfrac{1}{2}\,(\bar{3}·0253)\Big] = \text{Antilog}\Big[\tfrac{1}{2}\,(-3 + 0·0253)\Big]$$

$$= \text{Antilog}\Big[\tfrac{1}{2}\,(-2·9747)\Big] = \text{Antilog}\,(-1·4873) = \text{Antilog}\,(\bar{2}·5127) = 0·03256$$

Example 6·11.
Complete a table showing the frequencies with which words of different numbers of letters occur in the extract reproduced below (omitting punctuation marks) treating as the variable the number of letters in each word, and obtain the mean and standard deviation of the distribution : “Her eyes were blue : blue as autumn distance — blue as the blue we see, between the retreating mouldings of hills and woody slopes on a sunny September morning : a misty and shady blue, that had no beginning or surface, and was looked into rather than at.” Solution. Here we take the variable (X) as the number of letters in each word in the extract given above. We find that in the extract given above there are words with number of letters ranging from 1 to 10. Hence, the variable X takes the values from 1 to 10. The frequency distribution is easily obtained by using ‘tally marks’ as given in the following Table. Mean = A + ∑ fd =6+ N ( ) –76 46 = 6 – 1·65 = 4·35 σ2 = = ⇒ 1 ∑ fd 2 – N 354 46 – ( N1 ∑ fd ) 2 (–1·65)2 = 7·6956 – 2·7225 = 4·9731 σ = ⎯√⎯⎯⎯⎯ 4·9731 = 2·23. No. of letters in Frequency (f) a word (X) d=X–A =X–6 fd fd2 ————————————————— ————————————— ———————————— ———————————— ————————————— 1 2 3 4 5 6 7 8 9 10 Total 2 8 9 10 5 4 3 1 3 1 N = ∑ f = 46 –5 –4 –3 –2 –1 0 1 2 3 4 –10 –32 –27 –20 –5 0 3 2 9 4 50 128 81 40 5 0 3 4 27 16 ∑ fd = –76 ∑ fd2 = 354 Example 6·12. Calculate the mean and standard deviation from the following data : Value : 90—99 80—89 70—79 60—69 50—59 40—49 Frequency : 2 12 22 20 14 4 30—39 1 6·22 BUSINESS STATISTICS Solution. CALCULATIONS FOR MEAN AND S.D. x – 64·5 10 fd fd2 3 6 18 12 2 24 48 22 1 22 22 Class Mid-value (x) Frequency (f) 90—99 94·5 2 80—89 84·5 70—79 74·5 60—69 64·5 20 0 0 0 50—59 54·5 14 –1 –14 14 40—49 44·5 4 –2 –8 16 30—39 34·5 1 –3 –3 9 ∑ fd = 27 ∑fd = 127 Total d= N = 75 Mean = A + 2 h∑ fd 10 × 27 = 64·5 + = 64·5 + 3·6 = 68·1 N 75 ∑ fd2 – N S.D. = h . 
( ∑Nfd ) 2 = 10 × 127 75 – ( ) 27 2 75 = 10 × √⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ 1·6933 – 0·1296 = 10 × √⎯⎯⎯⎯⎯ 1·5637 = 10 × 1·2505 = 12·505. Example 6·13. The arithmetic mean and the standard deviation of a set of 9 items are 43 and 5 respectively. If an item of value 63 is added to the set, find the mean and standard deviation of 10 items. [Delhi Univ. B.A. (Econ. Hons.), 2006] Solution. We are given : n = 9, –x = 43 and σ = 5. –x = ∑x ⇒ ∑ x = nx– = 9 × 43 = 387 n ∑x2 Also σ2 = – ( –x )2 ⇒ ∑ x2 = n(σ2 + –x 2 ) n ∴ ∑x 2 = 9(25 + 432 ) = 9(25 + 1849) = 9 × 1874 = 16866 If a new item 63 is added, then number of items becomes 10. New (∑x) = ∑x + 63 = 387 + 63 = 450 ∴ New mean = New (∑x2) 450 10 = =∑x2 New s.d. = …(i) …(ii) [From (i)] 45 + 63 2 = 16866 + 3969 = 20835 New (∑x2) – (New mean)2 = 10 20835 2 10 – (45) = ⎯√⎯⎯⎯⎯⎯⎯⎯⎯ 2083·5 – 2025 = ⎯ √⎯⎯⎯ 58·5 = 7·65. Example 6·14. Twenty passengers were found ticketless on a bus. The sum of squares and the S.D. of the amount found in their pockets were Rs. 2,000·00 and Rs. 6·00 respectively. If the total fine imposed on these passengers is equal to the total amount recovered from them and fine imposed is uniform, what is the amount each one of them has to pay as fine ? What difficulties do you visualize if such a system of penalty were imposed ? [Delhi Univ. B.A. (Econ.)Hons., 1993] Solution. Let xi, i = 1, 2,…,20 be the amount (in Rs.) found in the pocket of the ith passenger. Then we are given : 20 n = 20, ∑ xi2 = Rs. 2,000 i=1 and s.d.(σ) = Rs. 6 …(i) 6·23 MEASURES OF DISPERSION The total fine imposed on the ticketless passengers is given to be equal to the total amount recovered from them. 20 ∴ Total fine imposed on the 20 passengers = ∑ xi. i=1 Further, since the fine imposed is uniform among all the 20 passengers, n 1 Fine to be paid by each passenger = 20 ∴ 1 We have : σ2 = ∑ xi2 – –x 2 n –x 2 = Rs.2 ∴ ( 2000 20 –x 2 ⇒ = ∑ i=1 1 20 xi = –x … (ii) ∑ x i 2 – σ2 ) – 62 = Rs.2 (100 – 36) = Rs.2 64, ⇒ –x = Rs. 
8. [From (i)]
Hence, using (ii), the fine paid by each of the passengers is Rs. 8.
If among these ticketless passengers there are a few rich persons with large sums of money in their pockets, then an obvious shortcoming of this system of imposing penalty is that it will impose an unduly heavy penalty on the poor passengers (with smaller amounts of money in their pockets).

Example 6.15. The variance of a series of numbers 2, 3, 11 and x is $12\frac{1}{4}$. Find the value of x. [I.C.W.A. (Foundation), June 2006]
Solution. We are given n = 4.
X :    2   3   11    x     ∑X = 16 + x
X² :   4   9   121   x²    ∑X² = x² + 134
$$\sigma_X^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{x^2 + 134}{4} - \left(\frac{16 + x}{4}\right)^2$$
$$= \frac{1}{16}\left[4(x^2 + 134) - (16 + x)^2\right] = \frac{1}{16}\left[4x^2 + 536 - (256 + x^2 + 32x)\right] = \frac{1}{16}\left[3x^2 - 32x + 280\right]$$
$$\therefore\; \frac{1}{16}\left[3x^2 - 32x + 280\right] = 12\tfrac{1}{4} = \frac{49}{4} \quad \text{(Given)}$$
$$\Rightarrow\; 3x^2 - 32x + 280 = 49 \times 4 = 196 \;\Rightarrow\; 3x^2 - 32x + 84 = 0$$
$$\Rightarrow\; x = \frac{32 \pm \sqrt{(-32)^2 - 4 \times 3 \times 84}}{2 \times 3} = \frac{32 \pm \sqrt{1024 - 1008}}{6} = \frac{32 \pm 4}{6}$$
$$\Rightarrow\; x = \frac{32 + 4}{6} \;\text{or}\; \frac{32 - 4}{6} = 6 \;\text{or}\; \frac{14}{3}.$$

Example 6·16. The mean of 5 observations is 4·4 and the variance is 8·24. If three of the five observations are 1, 2 and 6, find the values of the other two. [Delhi Univ. B.A. (Econ. Hons.), 2004]
Solution. We are given $n = 5$, $\bar{x} = 4.4$ and $\sigma^2 = 8.24$. We have :
$$\sum x = n\bar{x} = 5 \times 4.4 = 22 \quad \text{and} \quad \sum x^2 = n(\sigma^2 + \bar{x}^2) = 5(8.24 + 19.36) = 5 \times 27.60 = 138$$
Three observations are 1, 2 and 6. Let the two unknown observations be $x_1$ and $x_2$. Then
$\sum x = 1 + 2 + 6 + x_1 + x_2 = 22 \;\Rightarrow\; x_1 + x_2 = 22 - 9 = 13$ …(*)
$\sum x^2 = 1^2 + 2^2 + 6^2 + x_1^2 + x_2^2 = 138 \;\Rightarrow\; x_1^2 + x_2^2 = 138 - 41 = 97$ …(**)
Substituting the value of $x_2$ from (*) in (**), we get
$x_1^2 + (13 - x_1)^2 = 97$
⇒ $x_1^2 + [13^2 + x_1^2 - 2 \times 13 \times x_1] = 97$ [∵
$(a - b)^2 = a^2 + b^2 - 2ab$]
⇒ $x_1^2 + (169 + x_1^2 - 26x_1) = 97$ ⇒ $2x_1^2 - 26x_1 + 72 = 0$
Solving as a quadratic equation in $x_1$, we get
$$x_1 = \frac{26 \pm \sqrt{(26)^2 - 4 \times 2 \times 72}}{2 \times 2} = \frac{26 \pm \sqrt{676 - 576}}{4} = \frac{26 \pm 10}{4} = \frac{26 + 10}{4} \;\text{or}\; \frac{26 - 10}{4} = 9 \;\text{or}\; 4$$
Substituting in (*), i.e., in $x_1 + x_2 = 13$, we get
$x_1 = 9 \Rightarrow x_2 = 13 - 9 = 4$ or $x_1 = 4 \Rightarrow x_2 = 13 - 4 = 9$
Hence, the other two numbers are 4 and 9.

Example 6.17. (a) For a group of 10 items, $\sum_{i=1}^{10} (X_i - 2) = 40$ and $\sum_{i=1}^{10} X_i^2 = 495$. Then find the variance of this group. [I.C.W.A. (Foundation), June 2005]
(b) For 10 values $X_1, X_2, \ldots, X_{10}$ of the variable X, $\sum_{i=1}^{10} X_i = 110$ and $\sum_{i=1}^{10} (X_i - 5)^2 = 1{,}000$. Find the variance of X. [I.C.W.A. (Foundation), June 2006]
Solution. (a) In the usual notations, we are given : $n = 10$ ; $\sum X^2 = 495$ and
$\sum (X - 2) = 40 \;\Rightarrow\; \sum X - 2n = 40 \;\Rightarrow\; \sum X = 40 + 2 \times 10 = 60$
$$\therefore\; \sigma_X^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{495}{10} - \left(\frac{60}{10}\right)^2 = 49.5 - 36 = 13.5$$
(b) We are given : $n = 10$, $\sum X = 110$ and $\sum (X - 5)^2 = 1{,}000$.
Let $d = X - A = X - 5$, (A = 5). Then, we have
$\sum d = \sum (X - 5) = \sum X - 5 \times n = 110 - 5 \times 10 = 60$ and $\sum d^2 = \sum (X - 5)^2 = 1{,}000$.
Since variance is independent of change of origin, we get :
$$\sigma_X^2 = \sigma_d^2 = \frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2 = \frac{1{,}000}{10} - \left(\frac{60}{10}\right)^2 = 100 - 36 = 64.$$

Example 6.18. For the numbers 5, 6, 7, 8, 10, 12, if $S_1$ and $S_2$ are the respective Root Mean Square Deviations about the mean and about an arbitrary number 9, show that $17S_2^2 = 20S_1^2$. [I.C.W.A. (Foundation), June 2003]
Solution. We have $n = 6$ and $\bar{X} = \dfrac{\sum X}{n} = \dfrac{48}{6} = 8$.
CALCULATIONS FOR MEAN SQUARE DEVIATIONS
X          X – X̄ = X – 8   (X – X̄)²           X – 9   (X – 9)²
5          –3               9                   –4      16
6          –2               4                   –3      9
7          –1               1                   –2      4
8          0                0                   –1      1
10         2                4                   1       1
12         4                16                  3       9
∑X = 48                     ∑(X – X̄)² = 34             ∑(X – 9)² = 40

$S_1^2$ = Mean Square Deviation about the mean $= \dfrac{1}{n}\sum (X - \bar{X})^2 = \dfrac{34}{6} = \dfrac{17}{3}$
$S_2^2$ = Mean Square Deviation about the point ‘9’ $= \dfrac{1}{n}\sum (X - 9)^2 = \dfrac{40}{6} = \dfrac{20}{3}$
$$\therefore\; \frac{S_1^2}{S_2^2} = \frac{17}{3} \times \frac{3}{20} = \frac{17}{20} \;\Rightarrow\; 17S_2^2 = 20S_1^2$$

Example 6·19.
A charitable organisation decided to give old age pensions to people over sixty years of age. The scales of pension were fixed as follows :
Age group 60–65 : Rs. 200 per month
Age group 65–70 : Rs. 250 per month
Age group 70–75 : Rs. 300 per month
Age group 75–80 : Rs. 350 per month
Age group 80–85 : Rs. 400 per month
The ages of 25 persons who secured the pension right are as given below :
74, 62, 84, 72, 61, 83, 72, 81, 64, 71, 63, 61, 60, 67, 74, 64, 79, 73, 75, 76, 69, 68, 78, 66, 67
Calculate the monthly average pension payable per person and the standard deviation. [Delhi Univ. B.Com. (Hons.) (External), 2005]
Solution. First of all we shall prepare the frequency distribution of the 25 persons with respect to age in the age-groups 60–65, 65–70, …, 80–85 (as suggested by the above data) by using the method of tally marks. Then we shall compute the arithmetic mean of the pension payable per person and also its standard deviation, as explained in the following table.

COMPUTATION OF MEAN AND STANDARD DEVIATION
Age group   Frequency (f)   Monthly pension in Rs. (X)   d = (X – 300)/50   fd          fd²
60–65       7               200                          –2                 –14         28
65–70       5               250                          –1                 –5          5
70–75       6               300                          0                  0           0
75–80       4               350                          1                  4           4
80–85       3               400                          2                  6           12
Total       N = 25                                                          ∑fd = –9    ∑fd² = 49

Average monthly pension is given by :
$$\bar{X} = A + \frac{h\sum fd}{N} = 300 + \frac{50 \times (-9)}{25} = 300 - 18 = \text{Rs. } 282$$
Standard deviation of monthly pension is :
$$\sigma = h\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} = \frac{h}{N}\sqrt{N\sum fd^2 - \left(\sum fd\right)^2} = \frac{50}{25}\sqrt{25 \times 49 - (-9)^2} = 2\sqrt{1225 - 81} = 2\sqrt{1144}$$
$$= 2\,\text{Antilog}\left(\tfrac{1}{2}\log 1144\right) = 2\,\text{Antilog}\left(\tfrac{1}{2} \times 3.0585\right) = 2\,\text{Antilog}\,(1.5292) = 2 \times 33.83 = \text{Rs. } 67.66.$$

Example 6·20. You are in charge of the rationing department of a state affected by food shortage. The following information is received from your local investigators :
Area   Mean Calories   Standard Deviation of Calories
X      2,500           500
Y      2,200           300
The estimated requirement of an adult is taken at 3,000 calories daily and absolute minimum at 1,250.
Comment on the reported figures and determine which area needs more urgent action. [Delhi Univ. B.Com. (Hons.), 2002]
Solution. We shall compute the 3-sigma limits $\bar{X} \pm 3\sigma$ for each area, which will include approximately 99·73% of the population observations [assuming that the distribution is approximately normal].

Area     3-σ limits = X̄ ± 3σ
Area X   2500 ± 3 × 500 = 2500 ± 1500 = (1000, 4000)
Area Y   2200 ± 3 × 300 = 2200 ± 900 = (1300, 3100)

The absolute daily minimum calorie requirement for a person is 1250. From the above figures we observe that almost all the persons in area Y are getting more than the minimum calorie requirement, as the lower limit in this area is 1300. However, since in area X the lower 3-σ limit is 1000, which is less than 1250, quite a number of people in area X are not getting the minimum requirement of 1250 calories. Hence, as the person in charge of the rationing department, it becomes my duty to take urgent action for the people of area X.

Example 6·21. Find the proportion of items lying within : (i) mean ± σ and (ii) mean ± 2σ, of the following distribution.
Class   Frequency        Class   Frequency
11–12   5                21–22   395
13–14   426              23–24   38
15–16   720              25–26   8
17–18   741              27–28   5
19–20   665              29–30   7

Solution.
COMPUTATION OF MEAN AND S.D.
Class interval   Mid-value (X)   d = (X – 19·5)/2   f           fd             fd²
11–12            11·5            –4                 5           –20            80
13–14            13·5            –3                 426         –1278          3834
15–16            15·5            –2                 720         –1440          2880
17–18            17·5            –1                 741         –741           741
19–20            19·5            0                  665         0              0
21–22            21·5            1                  395         395            395
23–24            23·5            2                  38          76             152
25–26            25·5            3                  8           24             72
27–28            27·5            4                  5           20             80
29–30            29·5            5                  7           35             175
Total                                               ∑f = 3010   ∑fd = –2929    ∑fd² = 8409

$$\text{Mean} = A + \frac{h\sum fd}{N} = 19.5 + \frac{2 \times (-2929)}{3010} = 19.5 - 1.946 = 17.55$$
$$\sigma = h\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} = 2 \times \sqrt{\frac{8409}{3010} - \left(\frac{-2929}{3010}\right)^2} = 2 \times \sqrt{2.7937 - (0.9731)^2}$$
$$= 2 \times \sqrt{2.7937 - 0.9469} = 2 \times \sqrt{1.8468} = 2 \times 1.35897 = 2.7179 \approx 2.72$$
∴ Mean ± σ = 17·55 ± 2·72 = 14·83 and 20·27
and Mean ± 2σ = 17·55 ± 2 × 2·72 = 17·55 ± 5·44 = 12·11 and 22·99.
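The step-deviation arithmetic used in this solution can be checked with a few lines of code. The following is only a minimal Python sketch, not part of the original solution; the function name `grouped_mean_sd` and its parameter names are illustrative assumptions.

```python
from math import sqrt

def grouped_mean_sd(mid_values, freqs, origin, width):
    """Mean and s.d. of a grouped distribution by the step-deviation method.

    origin is the assumed mean A and width is the class width h
    (both names are illustrative, not from the text).
    """
    n = sum(freqs)
    d = [(x - origin) / width for x in mid_values]            # d = (X - A)/h
    sum_fd = sum(f * di for f, di in zip(freqs, d))           # sum of f*d
    sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))     # sum of f*d^2
    mean = origin + width * sum_fd / n
    sd = width * sqrt(sum_fd2 / n - (sum_fd / n) ** 2)
    return mean, sd

# Mid-values 11.5, 13.5, ..., 29.5 and the frequencies of Example 6.21
mids = [11.5 + 2 * i for i in range(10)]
freqs = [5, 426, 720, 741, 665, 395, 38, 8, 5, 7]
mean, sd = grouped_mean_sd(mids, freqs, origin=19.5, width=2)
print(round(mean, 2), round(sd, 2))  # 17.55 2.72
```

Any other assumed mean A gives the same result, since the mean and the s.d. obtained by this method do not depend on the choice of origin.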
The number of items lying within Mean ± σ, i.e., within 14·83 and 20·27, is 720 + 741 + 665 = 2126, and the proportion of items lying within Mean ± σ is :
$$\frac{2{,}126}{3{,}010} = 0.7063, \;\text{i.e., } 70.63\%$$
The number of items lying within Mean ± 2σ, i.e., within 12·11 and 22·99, is : 426 + 720 + 741 + 665 + 395 = 2,947.
Hence, the proportion of items lying within Mean ± 2σ is :
$$\frac{2{,}947}{3{,}010} = 0.9791, \;\text{i.e., } 97.91\%$$

Example 6·22. The mean and standard deviation of the frequency distribution of a continuous random variable X are 40·604 lbs. and 7·92 lbs. respectively. The distribution after change of origin and scale is as follows :
d :   –3   –2   –1   0    1    2    3    4   Total
f :   3    15   45   57   50   36   25   9   240
where d = (X – A)/h and f is the frequency of X. Determine the actual class intervals.
Solution. We are given : $d = (X - A)/h$, $\bar{X} = 40.604$ and $\sigma_x = 7.92$.

COMPUTATION OF MEAN AND S.D.
d       f              fd          fd²
–3      3              –9          27
–2      15             –30         60
–1      45             –45         45
0       57             0           0
1       50             50          50
2       36             72          144
3       25             75          225
4       9              36          144
Total   N = ∑f = 240   ∑fd = 149   ∑fd² = 695

$$\bar{X} = A + \frac{h\sum fd}{N} \;\Rightarrow\; 40.604 = A + h\left(\frac{149}{240}\right) \;\Rightarrow\; 40.604 = A + 0.621h$$ …(*)
$$\sigma_x = h\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \;\Rightarrow\; 7.92 = h\sqrt{\frac{695}{240} - \left(\frac{149}{240}\right)^2} = h\sqrt{2.8958 - (0.6208)^2}$$
$$= h\sqrt{2.8958 - 0.3854} = h\sqrt{2.5104} = 1.5844h \;\Rightarrow\; h = \frac{7.92}{1.5844} = 4.9987 \approx 5$$
Substituting in (*), we get
$A = 40.604 - 5 \times 0.621 = 40.604 - 3.105 = 37.499 \approx 37.5$
The value $d = 0 \;\Rightarrow\; X - A = 0 \;\Rightarrow\; X = A = 37.5$
Since the magnitude of the class interval is h = 5, the boundaries of the corresponding class are (37·5 – 2·5, 37·5 + 2·5), i.e., (35, 40). Thus the actual frequency distribution is as given in the adjoining table.
d     X = Mid-value of class   Class interval   Frequency
–3    22·5                     20–25            3
–2    27·5                     25–30            15
–1    32·5                     30–35            45
0     37·5                     35–40            57
1     42·5                     40–45            50
2     47·5                     45–50            36
3     52·5                     50–55            25
4     57·5                     55–60            9

Example 6·23. (a) Find the mean and standard deviation of the first n natural numbers.
(b) Hence deduce the mean and s.d. of the numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
Solution. (a) If the variable X denotes the natural number, then the first n natural numbers are 1, 2, 3, …, n.
X :    1    2    3    …   n
X² :   1²   2²   3²   …   n²
$$\text{Mean} = \frac{\sum X}{n} = \frac{1 + 2 + 3 + \ldots + n}{n} = \frac{n(n + 1)}{2} \cdot \frac{1}{n} = \frac{n + 1}{2}$$ …(*)
$$\text{Variance} = \frac{\sum X^2}{n} - \bar{X}^2 = \frac{1^2 + 2^2 + 3^2 + \ldots + n^2}{n} - \left(\frac{n + 1}{2}\right)^2 = \frac{n(n + 1)(2n + 1)}{6n} - \frac{(n + 1)^2}{4}$$
$$= \frac{(n + 1)}{12}\left[2(2n + 1) - 3(n + 1)\right] = \frac{(n + 1)}{12}\left[4n + 2 - 3n - 3\right] = \frac{(n + 1)(n - 1)}{12}$$
$$\therefore\; \sigma^2 = \frac{n^2 - 1}{12} \;\Rightarrow\; \text{s.d.}\,(\sigma) = \sqrt{\frac{n^2 - 1}{12}}$$ …(**)
Note. This is a standard result and should be committed to memory.
(b) Deduction. We have to find the mean and s.d. of the first 10 natural numbers. Taking n = 10 in (*) and (**), we get respectively :
$$\text{Mean} = \frac{10 + 1}{2} = \frac{11}{2} = 5.5 \;; \qquad \text{s.d.}\,(\sigma) = \sqrt{\frac{10^2 - 1}{12}} = \sqrt{\frac{100 - 1}{12}} = \sqrt{8.25} = 2.87.$$

Example 6·24. The arithmetic mean and standard deviation of a series of 20 items were calculated by a student as 20 cm. and 5 cm. respectively. But while calculating them an item 13 was misread as 30. Find the correct arithmetic mean and standard deviation.
Solution. We are given $n = 20$, $\bar{X} = 20$ cm., $\sigma = 5$ cm. ; Wrong value used = 30 ; Correct value = 13.
We have : $\sum X = n\bar{X} = 20 \times 20 = 400$ and $\sum X^2 = n(\sigma^2 + \bar{X}^2) = 20(25 + 400) = 8500$
If the wrong observation 30 is replaced by the correct value 13, then the number of observations remains the same, viz., 20, and
Corrected $\sum X = 400 - 30 + 13 = 383$ ; Corrected $\sum X^2 = 8500 - (30)^2 + (13)^2 = 7769$
$$\therefore\; \text{Corrected mean} = \frac{\text{Corrected } \sum X}{n} = \frac{383}{20} = 19.15$$
$$\text{Corrected } \sigma^2 = \frac{\text{Corrected } \sum X^2}{n} - (\text{Corrected mean})^2 = \frac{7769}{20} - (19.15)^2 = 388.45 - 366.72 = 21.73$$
∴ Corrected s.d.
(σ) $= \sqrt{21.73} = 4.6615.$

Example 6·25. The mean and the variance of ten observations are known to be 17 and 33 respectively. Later it is found that one observation (i.e., 26) is inaccurate and is removed. What are the mean and standard deviation of the remaining observations ? [Delhi Univ. B.A. (Econ. Hons.), 2009; C.A. (Foundation), May 2002]
Solution. In the usual notations, we are given : $n = 10$, $\bar{X} = 17$ and $\sigma^2 = 33$.
$\sum X = n\bar{X} = 10 \times 17 = 170$ and $\sum X^2 = n(\sigma^2 + \bar{X}^2) = 10(33 + 289) = 3220$
If the inaccurate observation (26) is removed, then for the remaining n – 1 = 10 – 1 = 9 observations, the values of $\sum X$ and $\sum X^2$ are given by :
Corrected $\sum X = (\sum X) - 26 = 170 - 26 = 144$
Corrected $\sum X^2 = (\sum X^2) - 26^2 = 3220 - 676 = 2544$
The mean $(\bar{X}_1)$ and variance $(\sigma_1^2)$ of the remaining 9 observations are given by :
$$\text{New mean}\,(\bar{X}_1) = \frac{\text{Corrected } \sum X}{9} = \frac{144}{9} = 16$$
$$\text{New variance}\,(\sigma_1^2) = \frac{\text{Corrected } \sum X^2}{9} - (\bar{X}_1)^2 = \frac{2544}{9} - 256 = 282.67 - 256 = 26.67$$
∴ New s.d. $(\sigma_1) = \sqrt{26.67} = 5.16.$

Example 6·26. For a frequency distribution of marks in Sociology of 200 candidates (grouped in intervals 0–5, 5–10, …, etc.), the mean and the standard deviation were found to be 45 and 15. Later it was discovered that the score 53 was misread as 63 in obtaining the frequency distribution. Find the correct mean and standard deviation corresponding to the correct frequency distribution.
Solution. We are given $N = 200$, $\bar{X} = 45$ and $\sigma = 15$. These values have been obtained on using the wrong value 63 while the correct value is 53. In the case of a grouped or continuous frequency distribution, the value of X used for computing the mean and the standard deviation is the mid-value of the class interval. Since the wrong value 63 lies in the interval 60–65 with the mid-value 62·5 and the correct value 53 lies in the interval 50–55 with mid-value 52·5, the question amounts to finding the correct values of $\bar{X}$ and σ if the wrong value 62·5 is replaced by the correct value 52·5.
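The replacement procedure used in this and the preceding examples (recover $\sum X$ and $\sum X^2$ from the reported mean and s.d., adjust them for the wrong and correct values, then recompute) can be sketched in Python as follows. This is a hedged illustration only; the function name `corrected_mean_sd` and its arguments are assumptions introduced here, not notation from the text.

```python
from math import sqrt

def corrected_mean_sd(n, mean, sd, wrong, right):
    """Recompute mean and s.d. after correcting misread observations.

    wrong: values that were used by mistake;
    right: values that should replace them (use [] to drop the wrong ones).
    All names are illustrative.
    """
    sum_x = n * mean                         # reported sum of X
    sum_x2 = n * (sd ** 2 + mean ** 2)       # reported sum of X^2
    sum_x += sum(right) - sum(wrong)
    sum_x2 += sum(v * v for v in right) - sum(v * v for v in wrong)
    new_n = n - len(wrong) + len(right)
    new_mean = sum_x / new_n
    new_sd = sqrt(sum_x2 / new_n - new_mean ** 2)
    return new_mean, new_sd

# Data of Example 6.26: mid-value 62.5 used instead of 52.5
mean, sd = corrected_mean_sd(200, 45, 15, wrong=[62.5], right=[52.5])
print(round(mean, 2), round(sd, 2))  # 44.95 14.96
```

Passing `right=[]` handles the case where the inaccurate observation is simply removed, as in Example 6·25.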
We have $N = 200$, $\bar{X} = 45$, $\sigma = 15$ ; Wrong value used = 62·5 ; Correct value = 52·5.
$\sum fX = N\bar{X} = 200 \times 45 = 9000$
$\sum fX^2 = N(\sigma^2 + \bar{X}^2) = 200(15^2 + 45^2) = 200(225 + 2025) = 200 \times 2250 = 450000$
∴ Corrected $\sum fX = 9000 - 62.5 + 52.5 = 8990$
Corrected $\sum fX^2 = 450000 - (62.5)^2 + (52.5)^2 = 450000 - [(62.5)^2 - (52.5)^2] = 450000 - (62.5 + 52.5)(62.5 - 52.5) = 450000 - 115 \times 10 = 450000 - 1150 = 448850$
$$\therefore\; \text{Corrected mean} = \frac{\text{Corrected } \sum fX}{N} = \frac{8990}{200} = 44.95$$
$$\text{Corrected } \sigma = \sqrt{\frac{\text{Corrected } \sum fX^2}{N} - (\text{Corrected mean})^2} = \sqrt{\frac{448850}{200} - (44.95)^2} = \sqrt{2244.25 - 2020.50} = \sqrt{223.75} = 14.96.$$

EXERCISE 6·3
1. (a) Explain with a suitable example the term variation. What purposes does a measure of variation serve ? Comment on some of the well-known measures of variation along with their respective merits and demerits. [Delhi Univ. MBA, 2000]
(b) What is a measure of dispersion ? Discuss four important measures of spread indicating their uses. [Andhra Pradesh Univ. B.Com., 1998]
2. What is meant by dispersion ? In your opinion which is the best method of finding out dispersion and why ? [Delhi Univ. B.Com. (Pass), 1999]
3. What are the chief requisites of a good measure of dispersion ? In the light of those, comment on some of the well-known measures of dispersion.
4. (a) What do you understand by absolute and relative measures of dispersion ? Explain the advantages of the relative measures over the absolute measures of dispersion.
(b) Define mean deviation and standard deviation. Explain why economists prefer mean deviation to standard deviation in their analysis.
5. (a) Compare mean deviation and standard deviation as measures of variation. Which of the two is a better measure ? Why ? [Delhi Univ. B.Com. (Hons.), 2001]
(b) Explain the mathematical properties of standard deviation. Why is standard deviation used more than mean deviation ? [Delhi Univ. B.Com. (Hons.), 2009]
6. What is standard deviation ? Explain its superiority over other measures of dispersion.
7.
Give the various formulae for computing the standard deviation.
8. State, giving reasons, whether the following statements are true or false :
(i) Standard deviation can never be negative.
(ii) The sum of squared deviations measured from the mean is least.
Ans. (i) True, (ii) True.
9. For the numbers (X) : 1, 3, 4, 5 and 12, find :
(i) the value (v) for which $\sum (X - v)^2$ is minimized.
(ii) the value (v) for which $\sum |X - v|$ is minimized. [Delhi Univ. B.A. (Econ. Hons.), 2009]
Ans. (i) $\sum (X - v)^2$ is minimum when $v = \bar{X} = \frac{1}{5}(1 + 3 + 4 + 5 + 12) = 5$
(ii) $\sum |X - v|$ is minimum when v = Median of (1, 3, 4, 5, 12) = 4
10. Calculate the standard deviation of the following marks obtained by 5 students in a tutorial group :
Marks Obtained : 8, 12, 13, 15, 22 [Delhi Univ. B.Com. (Pass), 1997]
Ans. $\sigma^2 = \dfrac{\sum x^2}{n} - \left(\dfrac{\sum x}{n}\right)^2 = \dfrac{1086}{5} - \left(\dfrac{70}{5}\right)^2 = 21.2 \;\Rightarrow\; \sigma = \sqrt{21.2} = 4.6.$
11. Why is standard deviation considered to be the best measure of dispersion ? Find the variance if $\sum d^2 = 150$ and N = 6. Deviations are taken from the actual mean. [Delhi Univ. B.Com. (Pass), 1998]
Ans. $\sigma^2 = \dfrac{150}{6} = 25.$
12. (a) From the following information, find the standard deviation of the x and y variables :
$\sum x = 235$, $\sum y = 250$, $\sum x^2 = 6750$, $\sum y^2 = 6840$, N = 10. [Delhi Univ. B.Com. (Hons.), 1997]
Ans. $\sigma_x = 11.08$ ; $\sigma_y = 7.68$.
(b) You are given the following raw sums in a statistical survey of two variables X and Y :
$\sum X = 240$, $\sum Y = 250$, $\sum X^2 = 6400$ and $\sum Y^2 = 7060$.
Ten items are included in each survey. Compute the Standard Deviation of the X and Y variables. [Delhi Univ. B.Com. (Pass), 1996]
Ans. $\sigma_x = 8$, $\sigma_y = 9$.
13. (a) State a formula for computing the standard deviation of n natural numbers 1, 2, …, n. [Delhi Univ. B.Com. (Pass), 2000]
Ans. $\sigma = \sqrt{(n^2 - 1)/12}$.
(b) Show that the standard deviation of the natural numbers 1, 2, 3, 4 and 5 is $\sqrt{2}$. [Kerala Univ. B.Com., 1996]
(c) Mean of 10 items is 50 and S.D. is 14. Find the sum of the squares of all the items. [Mahatma Gandhi Univ.
B.Com., April 1998]
Ans. $\sum x^2 = 26960$
14. Calculate the standard deviation of the following series.
Daily Wages of Workers (in Rs.)   No. of Workers
100–105                           200
105–110                           210
110–115                           230
115–120                           320
120–125                           350
125–130                           520
130–135                           410
135–140                           320
140–145                           280
145–150                           210
150–155                           160
155–160                           90
Ans. s.d. = 14.244
15. Find out the mean and standard deviation of the following data.
Age under (years) :      10   20   30   40   50    60    70    80
No. of persons dying :   15   30   53   75   100   110   115   125
Ans. Mean = 35.16 years, S.D. = 19.76 years.
16. In the following data, two class frequencies are missing.
Class Interval   Frequency        Class Interval   Frequency
100–110          4                150–160          –
110–120          7                160–170          16
120–130          15               170–180          10
130–140          –                180–190          6
140–150          40               190–200          3
However, it was possible to ascertain that the total number of frequencies was 150 and that the median has been correctly found out as 146·25. You are required to find with the help of the information given :
(i) The two missing frequencies.
(ii) Having found the missing frequencies, calculate the arithmetic mean and standard deviation.
(iii) Without using the direct formula, find the value of the mode.
Ans. (i) 24, 25; (ii) A.M. = 147·33, s.d. = 19·2; (iii) Mode = 144·09.
17. The following table gives the distribution of income of households based on hypothetical data :
Income (Rs.)   Percentage of households        Income (Rs.)     Percentage of households
Under 100      7·2                             500–599          14·9
100–199        11·7                            600–699          10·4
200–299        12·1                            700–999          9·0
300–399        14·8                            1000 and above   4·0
400–499        15·9
(i) What are the problems involved in computing the standard deviation from the above data ?
(ii) Compute a suitable measure of dispersion.
Ans. (ii) Compute Quartile Deviation. Q.D. = 169·425 ; Coeff. of Q.D. = 0·404.
18. The standard deviation calculated from a set of 32 observations is 5. If the sum of the observations is 80, what is the sum of the squares of these observations ?
Ans. $\sum X^2 = 1000$.
19. The mean of 200 items is 48 and their standard deviation is 3. Find the sum and the sum of squares of all items.
Ans. 9,600 ; 4,62,600.
20. Given : No. of observations (N) = 100 ; Arithmetic average $(\bar{X})$ = 2 ; Standard deviation $(s_x)$ = 4, find $\sum X$ and $\sum X^2$.
Ans. $\sum X = 200$, $\sum X^2 = 2000$.
21. The mean of 5 observations is 3 and the variance is 2. If three of the five observations are 1, 3, 5, find the other two.
Ans. 2, 4.
22. An association doing charity work decided to give old age pension to people of 60 years and above in age. The scales of pension were fixed as follows :
Age group 60–65 ; Rs. 400 per month
Age group 65–70 ; Rs. 500 per month
Age group 70–75 ; Rs. 600 per month
Age group 75–80 ; Rs. 700 per month
Age group 80–85 ; Rs. 800 per month
85 and above ; Rs. 1000 per month.
The ages of 30 persons who secured the pension right are given below :
62 65 68 72 75 77 82 85 90 78 75 61 60 68 72
76 78 79 80 82 68 75 94 98 73 77 68 65 71 89
Calculate the monthly average pension payable and the standard deviation.
Ans. Average monthly pension = Rs. 676·70 ; s.d. = Rs. 183·80.
23. Treating the number of letters in each word in the following passage as the variable X, prepare the frequency distribution table and obtain its mean, median, mode and standard deviation.
“The reliability of data must always be examined before any attempt is made to base conclusions upon them. This is true of all data, but particularly so of numerical data, which do not carry their quality written large on them. It is a waste of time to apply the refined theoretical methods of Statistics to data which are suspect from the beginning.”
Ans. Mean = 4·565, Median = 4, Mode = 3, S.D. = 2·673.
24. A collar manufacturer is considering the production of a new style of collar to attract young men. The following statistics of neck circumferences are available based on measurements of a typical group :
Mid-value (in inches)   No. of students        Mid-value (in inches)   No.
of students
12·0                    4                      14·5                    18
12·5                    8                      15·0                    10
13·0                    13                     15·5                    7
13·5                    20                     16·0                    5
14·0                    25
Use the criterion $\bar{X} \pm 3\sigma$ to obtain the largest and the smallest size of collar he should make in order to meet the needs of practically all his customers, having in mind that collars are worn on an average 3/4 inch larger than neck size.
Hint. Mean = 13·968″ ; S.D. = 0·964″ ; Limits for collar size are given by : [Mean ± 3 s.d.] + 3/4 ;
Largest collar size = 14·718 + 2·892 = 17·61″ ; Smallest collar size = 14·718 – 2·892 = 11·826″
25. The following data represent the percentage impurities in a certain chemical substance.
Percentage of impurities   Frequency        Percentage of impurities   Frequency
Less than 5                0                10–10·9                    45
5–5·9                      1                11–11·9                    30
6–6·9                      6                12–12·9                    5
7–7·9                      29               13–13·9                    3
8–8·9                      75               14–14·9                    1
9–9·9                      85
(i) Calculate the mean and standard deviation.
(ii) Find the number of frequencies lying between (A.M. ± 2 S.D.).
Ans. (i) Mean = 9·3857, S.D. = 1·3924 ; (ii) 267.
26. The following distribution was obtained by a change of origin and scale of the variable X :
d :   –4   –3   –2   –1   0    1    2    3   4
f :   4    8    14   18   20   14   10   6   6
Write down the frequency distribution of X if it is given that the mean and variance are 59·5 and 413 respectively.
Ans.
C.I.        f        C.I.        f        C.I.         f
15·5–25·5   4        45·5–55·5   18       75·5–85·5    10
25·5–35·5   8        55·5–65·5   20       85·5–95·5    6
35·5–45·5   14       65·5–75·5   14       95·5–105·5   6
                                          Total        100
27. Mean and standard deviation of the following continuous series are 31 and 15·9 respectively. The distribution after taking step deviations is as follows :
d :   –3   –2   –1   0    1    2    3
f :   10   15   25   25   10   10   5
Determine the actual class intervals. [G.G.I.P. Univ., B.B.A., May 2004]
Ans. 0–10, 10–20, 20–30, 30–40, 40–50, 50–60, 60–70.
28. (a) The mean and standard deviation of a sample of 100 observations were calculated as 40 and 5·1 respectively by a student who took by mistake 50 instead of 40 for one observation. Calculate the correct mean and standard deviation.
Ans. Corrected mean = 39·9 and s.d. = 5.
(b) For a set of 51 observations, the arithmetic mean and standard deviation are 58·5 and 11 respectively. It was found after the calculations were made that one of the observations recorded as 15 was incorrect. Find the mean and standard deviation of the 50 observations if this incorrect observation is omitted.
Ans. Mean = 59·37 and S.D. (σ) = 9·21.
29. The mean and the standard deviation of a sample of size 10 were found to be 9·5 and 2·5 respectively. Later on, an additional observation became available. This was 15·0 and was included in the original sample. Find the mean and the standard deviation of the 11 observations.
Ans. Mean = 10, s.d. = 2·86.
30. (a) The mean and standard deviation of 20 items are found to be 10 and 2 respectively. At the time of checking it was found that one item 8 was incorrect. Calculate the mean and standard deviation if (i) the wrong item is omitted, and (ii) it is replaced by 12.
(b) A study of the age of 100 film stars grouped in intervals of 10–12, 12–14, …, etc., revealed the mean age and standard deviation to be 32·02 and 13·18 respectively. While checking it was discovered that the age 57 was misread as 27. Calculate the correct mean age and standard deviation.
Ans. (a) (i) Mean = 10·1053, s.d. = 1·997 ; (ii) Mean = 10·2, s.d. = 1·99. (b) Mean = 32·32 ; s.d. = 13·402.
31. (a) Mean and Standard Deviation of 100 items are found to be 40 and 10. At the time of calculation two items are wrongly taken as 30 and 70 instead of 3 and 27. Find the correct mean and correct standard deviation. [Delhi Univ. B.Com. (Pass), 2001]
Ans. Mean = 39·3 ; S.D. = 10·24
(b) Mean and coefficient of standard deviation of 100 items are found by a student as 50 and 0·1. If at the time of calculations two items are wrongly taken as 40 and 50 instead of 60 and 30, find the correct mean and standard deviation. [Delhi Univ. B.Com. (Hons.), 1996]
Hint. $n = 100$, $\bar{x} = 50$, $\dfrac{\sigma}{\bar{x}} = 0.1 \;\Rightarrow\; \sigma = \dfrac{\bar{x}}{10} = 5 \;\Rightarrow\; \sigma^2 = 25$.
Ans.
Mean = 50, $\sigma = \sqrt{29} = 5.39$.
(c) The mean and the standard deviation of a characteristic of 100 items were found to be 60 and 10 respectively. At the time of calculations, two items were wrongly taken as 5 and 45 instead of 30 and 20. Calculate the corrected mean and corrected standard deviation. [Delhi Univ. B.Com. (Hons.), 2009; C.A. (Foundation), June 1993]
Ans. Corrected mean = 60, Corrected S.D. = 9·62
32. Fill in the blanks :
(i) Algebraic sum of deviations is zero from ……
(ii) The sum of absolute deviations is minimum from ……
(iii) Standard deviation is always …… than range.
(iv) Standard deviation is always …… than mean deviation.
(v) The mean and s.d. of 100 observations are 50 and 10 respectively. The new :
(a) Mean = ……, s.d. = ……, if 2 is added to each observation.
(b) Mean = ……, s.d. = ……, if 3 is subtracted from each observation.
(c) Mean = ……, s.d. = ……, if each observation is multiplied by 5.
(d) Mean = ……, s.d. = ……, if 2 is subtracted from each observation and the result is then divided by 5.
(vi) Variance is the …… value of mean square deviation.
(vii) If Q1 = 10, Q3 = 40, the coefficient of quartile deviation is ……
(viii) If 25% of the items in a distribution are less than 10 and 25% are more than 40, the quartile deviation is ……
(ix) The median and s.d. of a distribution are 15 and 5 respectively. If each item is increased by 5, the new median = …… and s.d. = ……
(x) A computer showed that the s.d. of 40 observations ranging from 120 to 150 is 35. The answer is correct/wrong. Tick the right one.
Ans. (i) Arithmetic mean (ii) Median (iii) Less (iv) Greater (v) (a) 52, 10 (b) 47, 10 (c) 250, 50 (d) 9·6, 2 (vi) Minimum (vii) 0·6 (viii) 15 (ix) 20, 5 (x) Wrong, since s.d. can’t exceed range.

6·10.
STANDARD DEVIATION OF THE COMBINED SERIES
As already pointed out, standard deviation is suitable for algebraic manipulations, i.e., if we are given the averages, the sizes and the standard deviations of a number of groups, then we can obtain the standard deviation of the resultant group obtained on combining the different groups. Thus if $\sigma_1, \sigma_2, \ldots, \sigma_k$ are the standard deviations; $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$ are the arithmetic means; and $n_1, n_2, \ldots, n_k$ are the sizes of k groups respectively, then the standard deviation σ of the combined group of size $N = n_1 + n_2 + \ldots + n_k$ is given by the formula
$$N\sigma^2 = n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2) + \ldots + n_k(\sigma_k^2 + d_k^2)$$ … (6·29)
where $d_1 = \bar{X}_1 - \bar{X}$ ; $d_2 = \bar{X}_2 - \bar{X}$, … , $d_k = \bar{X}_k - \bar{X}$ … (6·29a)
and $\bar{X} = \dfrac{n_1\bar{X}_1 + n_2\bar{X}_2 + \ldots + n_k\bar{X}_k}{n_1 + n_2 + \ldots + n_k}$ is the mean of the combined group. … (6·29b)
Thus the standard deviation of the combined group is given by :
$$\sigma = \left[\frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2) + \ldots + n_k(\sigma_k^2 + d_k^2)}{n_1 + n_2 + \ldots + n_k}\right]^{1/2}$$ … (6·30)
In particular, for two groups we get from (6·29) :
$$(n_1 + n_2)\sigma^2 = n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)$$ … (6·31)
where
$$d_1 = \bar{X}_1 - \bar{X} = \bar{X}_1 - \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{n_2(\bar{X}_1 - \bar{X}_2)}{n_1 + n_2}$$
and
$$d_2 = \bar{X}_2 - \bar{X} = \bar{X}_2 - \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{n_1(\bar{X}_2 - \bar{X}_1)}{n_1 + n_2}$$
Rewriting (6·31) and substituting the values of $d_1$ and $d_2$, we get
$$(n_1 + n_2)\sigma^2 = n_1\sigma_1^2 + n_2\sigma_2^2 + n_1d_1^2 + n_2d_2^2 = n_1\sigma_1^2 + n_2\sigma_2^2 + \frac{n_1n_2(\bar{X}_1 - \bar{X}_2)^2}{(n_1 + n_2)^2}(n_1 + n_2) = n_1\sigma_1^2 + n_2\sigma_2^2 + \frac{n_1n_2(\bar{X}_1 - \bar{X}_2)^2}{n_1 + n_2}$$
$$\Rightarrow\; \sigma = \left[\frac{n_1\sigma_1^2 + n_2\sigma_2^2}{n_1 + n_2} + \frac{n_1n_2(\bar{X}_1 - \bar{X}_2)^2}{(n_1 + n_2)^2}\right]^{1/2}$$ … (6·31a)
Thus, for two groups, the formula (6·31a) can be used with convenience, since all the values are already given.

Example 6·27. The means of two samples of sizes 50 and 100 respectively are 54·1 and 50·3 and the standard deviations are 8 and 7. Obtain the standard deviation of the sample of size 150 obtained by combining the two samples.
Solution.
In the usual notations we are given :
$n_1 = 50$, $n_2 = 100$, $\bar{X}_1 = 54.1$, $\bar{X}_2 = 50.3$, $\sigma_1 = 8$, $\sigma_2 = 7$.
The mean $\bar{X}$ of the combined sample of size 150 obtained on pooling the two samples is given by :
$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{50 \times 54.1 + 100 \times 50.3}{50 + 100} = \frac{2705 + 5030}{150} = \frac{7735}{150} = 51.57$$
$d_1 = \bar{X}_1 - \bar{X} = 54.10 - 51.57 = 2.53$ ; $d_2 = \bar{X}_2 - \bar{X} = 50.30 - 51.57 = -1.27$
Hence, the variance $\sigma^2$ of the combined sample of size 150 is given by :
$$(n_1 + n_2)\sigma^2 = n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)$$
$$\Rightarrow\; 150\sigma^2 = 50[8^2 + (2.53)^2] + 100[7^2 + (-1.27)^2] = 50(64 + 6.4009) + 100(49 + 1.6129)$$
$$\therefore\; \sigma^2 = \frac{3520.05 + 5061.29}{150} = \frac{8581.34}{150} = 57.2089 \;\Rightarrow\; \sigma = \sqrt{57.2089} = 7.56.$$

Example 6·28. The mean weight of 150 students is 60 kgs. The mean weights of boys and girls are 70 kgs. and 55 kgs. respectively, and the standard deviations are 10 kgs. and 15 kgs. respectively. Find the number of boys and the combined standard deviation.
Solution. In the usual notations, we are given :
$n = n_1 + n_2 = 150$ ; $\bar{x} = 60$ kg. ; $\bar{x}_1 = 70$ kg. ; $\bar{x}_2 = 55$ kg. ; $\sigma_1 = 10$ kg. ; $\sigma_2 = 15$ kg. … (*)
where the subscripts 1 and 2 refer to boys and girls respectively. We have :
$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2} \;\Rightarrow\; \frac{70n_1 + 55n_2}{150} = 60$$ [From (*)]
∴ $14n_1 + 11(150 - n_1) = 60 \times 30 = 1800$ ⇒ $3n_1 = 1800 - 1650 = 150$ ⇒ $n_1 = \dfrac{150}{3} = 50$
∴ $n_2 = 150 - n_1 = 150 - 50 = 100$.
Hence, the number of boys is $n_1 = 50$ and the number of girls is $n_2 = 100$.
The combined variance (for boys and girls together) is given by :
$$\sigma^2 = \frac{1}{n_1 + n_2}\left[n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)\right]$$
where $d_1 = \bar{x}_1 - \bar{x} = (70 - 60)$ kg. = 10 kg. ; $d_2 = \bar{x}_2 - \bar{x} = (55 - 60)$ kg. = – 5 kg.
$$\sigma^2 = \frac{1}{150}\left[50(10^2 + 10^2) + 100(15^2 + (-5)^2)\right] = \frac{10000 + 25000}{150} = \frac{700}{3} = 233.33$$
∴ s.d. $(\sigma) = \sqrt{233.33} = 15.275.$

Example 6·29. For a group containing 100 observations, the arithmetic mean and the standard deviation are 8 and $\sqrt{10.5}$ respectively.
For 50 observations selected from these 100 observations, the mean and standard deviation are 10 and 2 respectively. Calculate the values of the mean and standard deviation for the other half.
Solution. In the usual notations, we are given :
$n = 100$, $\bar{x} = 8$, $\sigma = \sqrt{10.5}$ ⇒ $\sigma^2 = 10.5$ ; $n_1 = 50$, $\bar{x}_1 = 10$, $\sigma_1 = 2$ ; ⇒ $n_2 = 100 - 50 = 50$
We want $\bar{x}_2$ and $\sigma_2$.
$$\bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2} \;\Rightarrow\; 8 = \frac{50 \times 10 + 50 \times \bar{x}_2}{100} \;\Rightarrow\; 800 = 500 + 50\bar{x}_2$$
$$\Rightarrow\; 50\bar{x}_2 = 800 - 500 = 300 \;\Rightarrow\; \bar{x}_2 = \frac{300}{50} = 6$$
$d_1 = \bar{x}_1 - \bar{x} = 10 - 8 = 2$ ⇒ $d_1^2 = 4$ ; $d_2 = \bar{x}_2 - \bar{x} = 6 - 8 = -2$ ⇒ $d_2^2 = 4$
$$(n_1 + n_2)\sigma^2 = n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2) \;\Rightarrow\; 100 \times 10.5 = 50(4 + 4) + 50(\sigma_2^2 + 4)$$
$$\Rightarrow\; 1050 = 400 + 50\sigma_2^2 + 200 \;\Rightarrow\; \sigma_2^2 = \frac{1050 - 600}{50} = \frac{450}{50} = 9$$
⇒ $\sigma_2 = 3$, since s.d. is always positive.

Example 6·30. Find the missing information from the following :
                     Group I   Group II   Group III   Combined
Number               50        ?          90          200
Standard Deviation   6         7          ?           7·746
Mean                 113       ?          115         116
(Himachal Pradesh Univ. B.Com., 1998)
Solution.
         Group 1     Group 2   Group 3     Combined Group
Number   n₁ = 50     n₂ = ?    n₃ = 90     n₁ + n₂ + n₃ = 200
s.d.     σ₁ = 6      σ₂ = 7    σ₃ = ?      σ = 7·746
Mean     X̄₁ = 113    X̄₂ = ?    X̄₃ = 115    X̄ = 116
We have three unknown values viz., $n_2$, $\sigma_3$ and $\bar{X}_2$.
To determine these three values we need three equations, which are given below:

$$n_1 + n_2 + n_3 = 200 \qquad \ldots (i)$$
$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3} \qquad \ldots (ii)$$
$$(n_1 + n_2 + n_3)\,\sigma^2 = n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2) + n_3(\sigma_3^2 + d_3^2) \qquad \ldots (iii)$$

From (i) we get

$$n_2 = 200 - (n_1 + n_3) = 200 - (50 + 90) = 60 \qquad \ldots (iv)$$

Using (ii) we get:

$$200 \times 116 = 50 \times 113 + 60\,\bar{X}_2 + 90 \times 115 \quad\Rightarrow\quad 23200 = 5650 + 60\,\bar{X}_2 + 10350$$
$$\Rightarrow\quad 60\,\bar{X}_2 = 23200 - (5650 + 10350) = 23200 - 16000 = 7200 \quad\Rightarrow\quad \bar{X}_2 = \frac{7200}{60} = 120 \qquad \ldots (v)$$
$$\therefore\quad d_1 = \bar{X}_1 - \bar{X} = 113 - 116 = -3; \qquad d_2 = \bar{X}_2 - \bar{X} = 120 - 116 = 4; \qquad d_3 = \bar{X}_3 - \bar{X} = 115 - 116 = -1$$

Substituting these values in (iii), we get

$$200 \times (7.746)^2 = 50(36 + 9) + 60(49 + 16) + 90(\sigma_3^2 + 1)$$
$$\Rightarrow\quad 200 \times 60.000516 = 50 \times 45 + 60 \times 65 + 90 + 90\,\sigma_3^2 \quad\Rightarrow\quad 12000 = 2250 + 3900 + 90 + 90\,\sigma_3^2$$
$$\Rightarrow\quad \sigma_3^2 = \frac{12000 - 6240}{90} = \frac{5760}{90} = 64 \quad\Rightarrow\quad \sigma_3 = 8$$

Hence, the unknown constants are: $n_2 = 60$, $\bar{X}_2 = 120$ and $\sigma_3 = 8$.

6·11. COEFFICIENT OF VARIATION

Standard deviation is only an absolute measure of dispersion, depending upon the units of measurement. The relative measure of dispersion based on standard deviation is called the coefficient of standard deviation and is given by:

$$\text{Coefficient of Standard Deviation} = \frac{\sigma}{\bar{X}} \qquad \ldots (6·32)$$

This is a pure number independent of the units of measurement and thus is suitable for comparing the variability, homogeneity or uniformity of two or more distributions.

We have already discussed the relative measures of dispersion based on range, quartile deviation and mean deviation. Since standard deviation is by far the best measure of dispersion, for comparing the homogeneity or heterogeneity of two or more distributions we generally compute the coefficient of standard deviation unless asked otherwise.

100 times the coefficient of dispersion based on standard deviation is called the coefficient of variation, abbreviated as C.V. Thus,

$$\text{C.V.} = 100 \times \frac{\sigma}{\bar{X}} \qquad \ldots (6·33)$$

According to Professor Karl Pearson, who suggested this measure, “coefficient of variation is the percentage variation in mean, standard deviation being considered as the total variation in the mean”.

For comparing the variability of two distributions we compute the coefficient of variation for each distribution. A distribution with smaller C.V. is said to be more homogeneous or uniform or less variable than the other, and the series with greater C.V. is said to be more heterogeneous or more variable than the other.

Remark. Some authors define coefficient of variation as the “coefficient of standard deviation expressed as a percentage”. For example, if mean = 15 and s.d. = 3, then

$$\text{C.V.} = \frac{\text{s.d.}}{\text{Mean}} = \frac{3}{15} = 0.20 \quad\Rightarrow\quad \text{C.V.} = 20\% \qquad \ldots (6·33a)$$

If we are given the coefficient of variation as a percentage (like 25% or 10%), then we will use the formula (6·33a).

Example 6·31. Comment on the following: For a set of 10 observations: mean = 5, s.d. = 2 and C.V. = 60%.

Solution. We are given: $n = 10$, $\bar{x} = 5$ and $\sigma = 2$. Using (6·33a), we get

$$\text{C.V.} = \frac{\sigma}{\bar{x}} = \frac{2}{5} = 0.40 = 40\%$$

But we are given C.V. = 60%. Hence, the given statement is wrong.

Example 6·32. If $N = 10$, $\bar{X} = 12$, $\sum X^2 = 1530$, find the coefficient of variation.

Solution. We have:

$$\sigma^2 = \frac{1}{n}\sum X^2 - (\bar{X})^2 = \frac{1530}{10} - (12)^2 = 153 - 144 = 9 \quad\Rightarrow\quad \sigma = 3$$

[Negative sign is rejected since s.d. is always non-negative]

$$\therefore\quad \text{C.V.} = \frac{100 \times \sigma}{\bar{X}} = \frac{100 \times 3}{12} = 25 \qquad [\text{Using } (6·33)]$$

Example 6·33. The arithmetic means of runs scored by three batsmen, Vijay, Subhash and Kumar, in the same series of 10 innings are 50, 48 and 12 respectively. The standard deviations of their runs are respectively 15, 12 and 2. Who is the most consistent of the three? If one of the three is to be selected, who will be selected?

Solution. Let $\bar{X}_1, \bar{X}_2, \bar{X}_3$ be the means and $\sigma_1, \sigma_2, \sigma_3$ the standard deviations of the runs scored by Vijay, Subhash and Kumar respectively.
Then we are given:

$$\bar{X}_1 = 50,\quad \bar{X}_2 = 48,\quad \bar{X}_3 = 12; \qquad \sigma_1 = 15,\quad \sigma_2 = 12,\quad \sigma_3 = 2$$

$$\text{C.V. of runs scored by Vijay} = \frac{100\,\sigma_1}{\bar{X}_1} = \frac{100 \times 15}{50} = 30$$
$$\text{C.V. of runs scored by Subhash} = \frac{100\,\sigma_2}{\bar{X}_2} = \frac{100 \times 12}{48} = 25$$
$$\text{C.V. of runs scored by Kumar} = \frac{100\,\sigma_3}{\bar{X}_3} = \frac{100 \times 2}{12} = 16.67$$

The decision regarding the selection of a player may be based on two considerations:
(i) If we want a consistent player (which is a statistically sound decision), then Kumar is to be selected, since the C.V. of the runs is smallest for Kumar.
(ii) If we want to select a player whose expected score is the highest, then Vijay will be selected.

Remark. In fact, the best way will be to select a person who is consistent and also has the highest expected score.

Example 6·34. “A batsman Mr. A is more consistent in his last 10 innings as compared to another batsman Mr. B. Therefore, Mr. A is also a higher run getter”. Comment. [Delhi Univ. B.Com. (Pass), 1999]

Solution. The consistency of a batsman is judged on the basis of coefficient of variation. The batsman A is more consistent than batsman B if

$$\text{C.V. (A)} < \text{C.V. (B)} \quad\Rightarrow\quad \frac{100\,\sigma_A}{\bar{x}_A} < \frac{100\,\sigma_B}{\bar{x}_B} \quad\Rightarrow\quad \frac{\sigma_A}{\bar{x}_A} < \frac{\sigma_B}{\bar{x}_B} \quad\Rightarrow\quad \frac{\bar{x}_A}{\bar{x}_B} > \frac{\sigma_A}{\sigma_B} \qquad \ldots (i)$$

The batsman A will be a higher run getter than batsman B if

$$\bar{x}_A > \bar{x}_B \quad\Rightarrow\quad \frac{\bar{x}_A}{\bar{x}_B} > 1 \qquad \ldots (ii)$$

Since, in general, (i) does not necessarily imply (ii) (it does so only when $\sigma_A \geq \sigma_B$), the given statement is not true in general. In order to conclude that A is also a higher run getter, we must be given the values of $\bar{x}_A$ and $\bar{x}_B$, and they should satisfy (ii).

Example 6·35. Coefficients of variation of two series are 75% and 90% and their standard deviations are 15 and 18 respectively. Find their means.

Solution. We are given: $\sigma_1 = 15$ and $\sigma_2 = 18$;

$$\text{C.V. of I series} = 75\% = \frac{75}{100}; \qquad \text{C.V. of II series} = 90\% = \frac{90}{100}$$

Using (6·33a), we have:

$$\text{C.V.} = \frac{\sigma}{\text{Mean}} \quad\Rightarrow\quad \text{Mean} = \frac{\sigma}{\text{C.V.}}$$
$$\therefore\quad \text{Mean of I series} = \frac{\sigma_1}{\text{C.V. (I)}} = \frac{15 \times 100}{75} = 20; \qquad \text{Mean of II series} = \frac{\sigma_2}{\text{C.V. (II)}} = \frac{18 \times 100}{90} = 20$$

Example 6·36.
“After settlement the average weekly wage in a factory had increased from Rs. 8,000 to Rs. 12,000 and the standard deviation had increased from Rs. 100 to Rs. 150. After settlement the wage has become higher and more uniform.” Do you agree?

Solution. It is given that after settlement the average weekly wages of workers have gone up from Rs. 8,000 to Rs. 12,000. This implies that the total wages received per week by all the workers together have increased. However, we cannot conclude that the wage of each individual has increased.

Regarding uniformity of the wages, we have to calculate the coefficient of variation of the wages of workers before the settlement and after the settlement.

$$\text{C.V. of wages before the settlement} = \frac{100 \times 100}{8{,}000} = 1.25$$
$$\text{C.V. of wages after the settlement} = \frac{100 \times 150}{12{,}000} = 1.25$$

Since the coefficient of variation of wages before the settlement and after the settlement is the same, there is no change in the variability of the distribution of wages after the settlement. Hence, it is wrong to say that the wages have become more uniform (less variable) after the settlement.

Example 6·37. A study of B.A. (Hons.) Economics examination results of 1000 students in 1990 gave the mean grade as 78 and the standard deviation as 8.0. A similar study in 1995 revealed the mean grade of the group as 80 and the standard deviation as 7.6. In which year was there the greater (i) absolute dispersion, (ii) relative dispersion? What can we say about the average performance of the students over time? [Delhi Univ. B.A. (Econ. Hons.), 1999]

Solution. In the usual notation, we are given:

$$\text{Year 1990: } \bar{X}_1 = 78,\; \sigma_1 = 8.0; \qquad \text{Year 1995: } \bar{X}_2 = 80,\; \sigma_2 = 7.6$$

(i) We know that the best absolute measure of dispersion is the standard deviation. Since $\sigma_1 > \sigma_2$, in 1990 there was greater absolute dispersion.

(ii) The relative measure of dispersion is given by the coefficient of variation.

$$\text{C.V. (1990)} = \frac{100\,\sigma_1}{\bar{X}_1} = \frac{100 \times 8.0}{78} = 10.26; \qquad \text{C.V. (1995)} = \frac{100\,\sigma_2}{\bar{X}_2} = \frac{100 \times 7.6}{80} = 9.50$$

Since C.V. (1990) > C.V. (1995), the relative dispersion is greater in 1990.

We observe that $\bar{X}_2 > \bar{X}_1$ and C.V. (1995) < C.V. (1990). Hence, it can be said that over the time from 1990 to 1995, the average performance of the students has increased (improved) and they have become more consistent (less variable).

Example 6·38. Explain the difference between absolute and relative dispersion. If 20 is subtracted from every observation in a data set, then the coefficient of variation of the resulting data set is 20%. If 40 is added to every observation of the same data set, then the coefficient of variation of the resulting set of data is 10%. Find the mean and standard deviation of the original set of data. [Delhi Univ. B.Com. (Hons.), 2004]

Solution. Let $\bar{X}$ be the mean and $\sigma_X = \sigma$ the standard deviation of the original observations. Let

$$U = X - 20 \quad\Rightarrow\quad \bar{U} = \bar{X} - 20 \text{ and } \sigma_U = \sigma_X = \sigma$$
$$V = X + 40 \quad\Rightarrow\quad \bar{V} = \bar{X} + 40 \text{ and } \sigma_V = \sigma_X = \sigma$$

[since s.d. is independent of change of origin]

$$\text{C.V. } (U) = \frac{100\,\sigma_U}{\bar{U}} = \frac{100\,\sigma}{\bar{X} - 20} = 20 \quad \text{(Given)} \qquad \ldots (i)$$
$$\text{C.V. } (V) = \frac{100\,\sigma_V}{\bar{V}} = \frac{100\,\sigma}{\bar{X} + 40} = 10 \quad \text{(Given)} \qquad \ldots (ii)$$

Dividing (i) by (ii) we get

$$\frac{\bar{X} + 40}{\bar{X} - 20} = 2 \quad\Rightarrow\quad \bar{X} + 40 = 2\bar{X} - 40 \quad\Rightarrow\quad \bar{X} = 80$$

Substituting in (i), we get

$$\frac{100\,\sigma}{60} = 20 \quad\Rightarrow\quad \sigma = 12$$

Example 6·39. An analysis of the monthly wages paid to workers in two firms A and B, belonging to the same industry, gives the following results:

                                                                  Firm A           Firm B
Number of wage earners                                            550              650
Average monthly wages (in ’00 Rs.)                                50               45
Standard deviation of the distribution of wages (in ’00 Rs.)      $\sqrt{90}$      $\sqrt{120}$

Answer the following questions with proper justifications:
(a) Which firm, A or B, pays the larger amount as monthly wages?
(b) In which firm, A or B, is there greater variability in individual wages?
(c) What are the measures of (i) average monthly wages and (ii) standard deviation in the distribution of individual wages of all workers in the two firms taken together?

Solution.
Let $n_1, n_2$ denote the sizes; $\bar{X}_1, \bar{X}_2$ the means and $\sigma_1, \sigma_2$ the standard deviations of the monthly wages (in ’00 Rs.) of the workers in the firms A and B respectively. Then we are given:

$$n_1 = 550,\quad \bar{X}_1 = 50,\quad \sigma_1 = \sqrt{90} \;\Rightarrow\; \sigma_1^2 = 90$$
$$n_2 = 650,\quad \bar{X}_2 = 45,\quad \sigma_2 = \sqrt{120} \;\Rightarrow\; \sigma_2^2 = 120$$

(a) We know that

$$\text{Average monthly wages} = \frac{\text{Total monthly wages paid by the firm}}{\text{No. of workers in the firm}}$$

⇒ Total monthly wages paid by the firm = (No. of workers in the firm) × (Average monthly wages)

∴ Total monthly wages paid by firm A = $n_1\bar{X}_1$ = Rs. 550 × 50 hundred = Rs. 27,500 hundred
Total monthly wages paid by firm B = $n_2\bar{X}_2$ = Rs. 650 × 45 hundred = Rs. 29,250 hundred

Hence firm B pays out the larger amount as monthly wages, the excess over firm A being Rs. (29,250 - 27,500) hundred = Rs. 1,750 hundred.

(b) In order to find out which firm has more variation in individual wages, we have to compute the coefficient of variation (C.V.) of the distribution of monthly wages for each of the two firms A and B.

$$\text{C.V. for firm A} = \frac{100\,\sigma_1}{\bar{X}_1} = \frac{100\sqrt{90}}{50} = 2 \times 9.487 = 18.974$$
$$\text{C.V. for firm B} = \frac{100\,\sigma_2}{\bar{X}_2} = \frac{100\sqrt{120}}{45} = \frac{20 \times 10.954}{9} = \frac{219.080}{9} = 24.34$$

Since C.V. for firm B is greater than the C.V. for firm A, firm B has greater variability in individual wages.

(c) (i) The average monthly wage, say $\bar{X}$, of all the workers in the two firms A and B taken together is given by:

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \text{Rs. } \frac{550 \times 50 + 650 \times 45}{550 + 650} \text{ hundred} = \text{Rs. } \frac{27{,}500 + 29{,}250}{1{,}200} \text{ hundred} = \text{Rs. } \frac{56{,}750}{1{,}200} \text{ hundred}$$

= Rs. 47.29 hundred = Rs. 4,729.
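As a check on the arithmetic in parts (a), (b) and (c)(i) above, here is a minimal Python sketch. The function name `coefficient_of_variation` and all variable names are illustrative choices, not from the text; the figures are those given for firms A and B, in hundreds of rupees.

```python
from math import sqrt


def coefficient_of_variation(mean, sd):
    """C.V. = 100 * s.d. / mean, following formula (6.33) of the text."""
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return 100.0 * sd / mean


# Data of Example 6.39 (wages in hundreds of rupees).
n_a, mean_a, sd_a = 550, 50.0, sqrt(90)
n_b, mean_b, sd_b = 650, 45.0, sqrt(120)

# (a) Total wage bill = number of workers x average wage.
total_a = n_a * mean_a
total_b = n_b * mean_b

# (b) Relative variability of individual wages in each firm.
cv_a = coefficient_of_variation(mean_a, sd_a)
cv_b = coefficient_of_variation(mean_b, sd_b)

# (c)(i) Mean wage of all workers of the two firms taken together.
combined_mean = (total_a + total_b) / (n_a + n_b)

print(f"Total wages: A = {total_a:.0f}, B = {total_b:.0f}")
print(f"C.V.: A = {cv_a:.2f}, B = {cv_b:.2f}")
print(f"Combined mean = {combined_mean:.2f}")
```

Comparing C.V. values rather than standard deviations is what makes the two firms comparable here, since their mean wages differ.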
(ii) The variance $\sigma^2$ of the distribution of monthly wages of all the workers in the two firms A and B taken together is given by:

$$(n_1 + n_2)\,\sigma^2 = n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2) \qquad \ldots (*)$$
$$d_1 = \bar{X}_1 - \bar{X} = 50 - 47.29 = 2.71 \;\Rightarrow\; d_1^2 = 7.344$$
$$d_2 = \bar{X}_2 - \bar{X} = 45 - 47.29 = -2.29 \;\Rightarrow\; d_2^2 = 5.244$$

Substituting in (*), we get

$$1200\,\sigma^2 = 550(90 + 7.344) + 650(120 + 5.244) = 550 \times 97.344 + 650 \times 125.244 = 53539.2 + 81408.6 = 134947.8$$
$$\Rightarrow\quad \sigma^2 = \frac{134947.8}{1200} = 112.4565 \quad\Rightarrow\quad \sigma = \text{Rs. } \sqrt{112.4565} \text{ hundred} = \text{Rs. } 10.60 \text{ hundred} = \text{Rs. } 1060$$

Example 6·40. From the prices X and Y of shares A and B respectively given below, state which share is more stable in value.

Price of Share A (X) :   55    54    52    53    56    58    52    50    51    49
Price of Share B (Y) :   108   107   105   105   106   107   104   103   104   101

Solution.

COMPUTATION OF MEAN AND S.D. OF PRICES OF SHARES A AND B

          SHARE A                                                   SHARE B
X         X - X̄ = X - 53      (X - X̄)²           Y          Y - Ȳ = Y - 105      (Y - Ȳ)²
55        2                    4                   108        3                     9
54        1                    1                   107        2                     4
52        -1                   1                   105        0                     0
53        0                    0                   105        0                     0
56        3                    9                   106        1                     1
58        5                    25                  107        2                     4
52        -1                   1                   104        -1                    1
50        -3                   9                   103        -2                    4
51        -2                   4                   104        -1                    1
49        -4                   16                  101        -4                    16
∑X = 530  ∑(X - X̄) = 0        ∑(X - X̄)² = 70     ∑Y = 1050  ∑(Y - Ȳ) = 0         ∑(Y - Ȳ)² = 40

$$\bar{X} = \frac{\sum X}{n} = \frac{530}{10} = 53, \qquad \sigma_x^2 = \frac{1}{n}\sum (X - \bar{X})^2 = \frac{70}{10} = 7 \;\Rightarrow\; \sigma_x = \sqrt{7} = 2.646$$
$$\bar{Y} = \frac{\sum Y}{n} = \frac{1050}{10} = 105, \qquad \sigma_y^2 = \frac{1}{n}\sum (Y - \bar{Y})^2 = \frac{40}{10} = 4 \;\Rightarrow\; \sigma_y = \sqrt{4} = 2$$
$$\text{C.V. } (X) = \frac{100 \times \sigma_x}{\bar{X}} = \frac{100 \times 2.646}{53} = 4.99; \qquad \text{C.V. } (Y) = \frac{100 \times \sigma_y}{\bar{Y}} = \frac{100 \times 2}{105} = 1.90$$

Since C.V. (Y) is less than C.V. (X), share B is more stable in value.

Example 6·41. Goals scored by two teams A and B in a football season were as shown in the following table. By calculating the coefficient of variation in each case, find which team may be considered more consistent.

Number of goals scored       Number of matches
in a match                   A team      B team
0                            27          17
1                            9           9
2                            8           6
3                            5           5
4                            4           3

Solution.

CALCULATIONS FOR C.V. FOR TEAMS A AND B

No. of goals scored      TEAM A                                          TEAM B
in a match (X)           No. of matches (f₁)   f₁X        f₁X²           No. of matches (f₂)   f₂X        f₂X²
0                        27                    0          0              17                    0          0
1                        9                     9          9              9                     9          9
2                        8                     16         32             6                     12         24
3                        5                     15         45             5                     15         45
4                        4                     16         64             3                     12         48
Total                    ∑f₁ = 53              ∑f₁X = 56  ∑f₁X² = 150    ∑f₂ = 40              ∑f₂X = 48  ∑f₂X² = 126

Team A:

$$\bar{X}_A = \frac{\sum f_1 X}{\sum f_1} = \frac{56}{53} = 1.06$$
$$\sigma_A^2 = \frac{\sum f_1 X^2}{\sum f_1} - (\bar{X}_A)^2 = \frac{150}{53} - (1.06)^2 = 2.83 - 1.1236 = 1.7064 \;\Rightarrow\; \sigma_A = \sqrt{1.7064} = 1.3063$$
$$\therefore\quad \text{C.V. for team A} = \frac{100 \times \sigma_A}{\bar{X}_A} = \frac{100 \times 1.3063}{1.06} = 123.24$$

Team B:

$$\bar{X}_B = \frac{\sum f_2 X}{\sum f_2} = \frac{48}{40} = 1.2$$
$$\sigma_B^2 = \frac{\sum f_2 X^2}{\sum f_2} - (\bar{X}_B)^2 = \frac{126}{40} - (1.2)^2 = 3.15 - 1.44 = 1.71 \;\Rightarrow\; \sigma_B = \sqrt{1.71} = 1.308$$
$$\therefore\quad \text{C.V. for team B} = \frac{100 \times \sigma_B}{\bar{X}_B} = \frac{100 \times 1.308}{1.2} = 109.0$$

Since C.V. for team B is less than C.V. for team A, team B may be considered to be more consistent.

6·12. RELATIONS BETWEEN VARIOUS MEASURES OF DISPERSION

For a Normal distribution (cf. Chapter 14 on theoretical distributions) we have the following relations between the different measures of dispersion:

(i) Mean ± Q.D. covers 50% of the observations of the distribution.
(ii) Mean ± M.D. covers 57.5% of the observations.
(iii) Mean ± σ includes 68.27% of the observations.
(iv) Mean ± 2σ includes 95.45% of the observations.
(v) Mean ± 3σ includes 99.73% of the observations.
(vi) $\text{Q.D.} = 0.6745\,\sigma \approx \frac{2}{3}\,\sigma$ (approximately). … (6·34)
(vii) $\text{M.D.} = \sqrt{\frac{2}{\pi}}\;\sigma = 0.7979\,\sigma \approx \frac{4}{5}\,\sigma$ (approximately). … (6·35)
(viii) $\text{Q.D.} = 0.8453\ \text{M.D.}$ [From (vi) and (vii), on dividing and transposing] ⇒ $\text{Q.D.} = \frac{5}{6}\ \text{M.D.}$ (approximately) … (6·36)

Combining the results (vi), (vii) and (viii) we get approximately:

3 Q.D. = 2 S.D.;  5 M.D. = 4 S.D.;  6 Q.D. = 5 M.D.  ⇒  4 S.D. = 5 M.D. = 6 Q.D.

Thus we see that standard deviation ensures the highest degree of reliability and Q.D. the lowest.

(ix) We have: $\text{Q.D.} : \text{M.D.} : \text{S.D.} :: \frac{2}{3}\sigma : \frac{4}{5}\sigma : \sigma$ ⇒ Q.D. : M.D. : S.D. :: 10 : 12 : 15 … (6·37)
(x) Range = 6 S.D. = 6σ … (6·38)

Remarks 1. Rigorously speaking, the above results for various measures of dispersion hold for the Normal distribution discussed in Chapter 14 on theoretical distributions.
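The constants quoted in relations (i) to (viii) can be checked numerically. The following is a minimal sketch using `statistics.NormalDist` from Python's standard library (Python 3.8 or later); the helper `coverage` and the variable names are illustrative and not from the text.

```python
from math import pi, sqrt
from statistics import NormalDist

z = NormalDist()  # standard Normal distribution: mean 0, s.d. 1


def coverage(k):
    """Proportion of a Normal distribution lying within mean +/- k * s.d."""
    return z.cdf(k) - z.cdf(-k)


qd_over_sd = z.inv_cdf(0.75)           # Q.D./sigma: the upper quartile of z
md_over_sd = sqrt(2 / pi)              # M.D./sigma for a Normal distribution
qd_over_md = qd_over_sd / md_over_sd   # relation (viii)

print(f"Q.D. = {qd_over_sd:.4f} sigma")
print(f"M.D. = {md_over_sd:.4f} sigma")
print(f"Q.D. = {qd_over_md:.4f} M.D.")
for label, k in [("Q.D.", qd_over_sd), ("M.D.", md_over_sd),
                 ("1 sigma", 1), ("2 sigma", 2), ("3 sigma", 3)]:
    print(f"Mean +/- {label} covers {100 * coverage(k):.2f}% of observations")
```

The fractions 2/3, 4/5 and 5/6 used in the text are convenient roundings of these constants, which is why the relations are stated as approximate.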
However, these results are approximately true even for symmetrical distributions or moderately asymmetrical (skewed) distributions.

2. In the above results we have expressed various measures of dispersion in terms of standard deviation. We give below the relations expressing standard deviation in terms of other measures of dispersion.

$$\text{S.D.} = 1.2533\ \text{M.D.} \approx \tfrac{5}{4}\ \text{M.D.}; \qquad \text{S.D.} = 1.4826\ \text{Q.D.} \approx \tfrac{3}{2}\ \text{Q.D.}; \qquad \text{S.D.} = \tfrac{1}{6}\ \text{Range} \qquad \ldots (6·39)$$

Also we have:

$$\text{M.D.} = 1.1830\ \text{Q.D.} \approx \tfrac{6}{5}\ \text{Q.D.} \qquad \ldots (6·40)$$

EXERCISE 6·4

1. (a) What do you understand by absolute and relative measures of dispersion? Explain the advantages of the relative measures over the absolute measures of dispersion.
(b) What do you understand by coefficient of variation? What purpose does it serve?

2. Prove that the coefficient of variation of the first n natural numbers is: $\sqrt{\dfrac{n - 1}{3(n + 1)}}$. [Delhi Univ. B.A. Econ. (Hons.), 2006, 2003]

3. The arithmetic means of runs secured by the three batsmen X, Y and Z in a series of 10 innings are 50, 48 and 12 respectively. The standard deviations of their runs are 15, 12 and 2 respectively. Who is the most consistent of the three?
Ans. C.V. (X) = 30; C.V. (Y) = 25; C.V. (Z) = 16.67. Batsman Z is the most consistent.

4. Two samples A and B have the same standard deviations, but the mean of A is greater than that of B. The coefficient of variation of A is
(i) greater than that of B. (ii) less than that of B. (iii) equal to that of B. (iv) None of these.
Ans. (ii)

5. (a) The coefficient of variation of a distribution is 60% and its standard deviation is 12. Find out its mean.
(b) Find the coefficient of variation if variance is 16, number of items is 20 and sum of the items is 160. [Bangalore Univ. B.Com., 1998]
Ans. (a) Mean = 20; (b) C.V. = 50%.

6. (a) Coefficients of variation of two series are 60% and 80%. Their standard deviations are 20 and 16 respectively. What are their arithmetic means?
(b) Coefficients of variation of two series are 60% and 80%.
Their standard deviations are 24 and 20 respectively. What are their arithmetic means? [Delhi Univ. B.Com. (Pass), 1997]
Ans. (a) 33.3, 20; (b) 40, 25.

7. Comment on the statement: “After settlement the average weekly wage in a factory had increased from Rs. 800 to Rs. 1200 and the standard deviation had increased from Rs. 200 to Rs. 250. After settlement, the wages have become higher and more uniform.”
Ans. Yes.

8. Weekly average wages of workers in a factory increase from Rs. 800 to Rs. 1200 and standard deviation increases from Rs. 100 to Rs. 500. Have the wages become less uniform now?
Ans. C.V. (Initial wages) = 12.5; C.V. (Revised wages) = 41.67. Yes, the revised wages are more variable.

9. A study of examination results of a batch of students showed the average marks secured as 50 with a standard deviation of 2 in the first year of their studies. The same batch showed an average of 60 marks with an increased standard deviation of 3, after five years of studies. Can you say that the batch as a whole showed improved performance?
Ans. Improved performance (better average), but less consistent, since the C.V. rises from 4 to 5.

10. The means and standard deviations of two brands of light bulbs are given below:

            Mean          S.D.
Brand 1     800 hours     100 hours
Brand 2     770 hours     60 hours

Calculate a measure of relative dispersion for the two brands and interpret the results. [Delhi Univ. B.Com. (Hons.), 2000]
Ans. C.V. (I) = 12.5; C.V. (II) = 7.79; Brand II is more uniform.

11. The following is the record of the number of bricks laid each day for 10 days by two bricklayers A and B. Calculate the coefficient of variation in each case and discuss the relative consistency of the two bricklayers.

A :   700   675   725   625   650   700   650   700   600   650
B :   550   600   575   550   650   600   550   525   625   600

If each of the values in respect of worker A is decreased by 10 and each of the values for worker B is increased by 50, how will it affect the results obtained earlier? [Delhi Univ. B.Com. (Hons.), 2007]
Ans.
$\bar{X}_A = 667.5$, $\sigma_A = 37.165$; $\bar{X}_B = 582.5$, $\sigma_B = 37.165$.

$$\text{C.V. (A)} = \frac{37.165 \times 100}{667.5} = 5.57; \qquad \text{C.V. (B)} = \frac{37.165 \times 100}{582.5} = 6.38$$

C.V. (A) < C.V. (B) ⇒ Bricklayer A is more consistent.
Since S.D. is independent of change of origin, we have:
New means: $\bar{X}_A' = 667.5 - 10 = 657.5$; $\bar{X}_B' = 582.5 + 50 = 632.5$; $\sigma_A' = \sigma_A = 37.165$; $\sigma_B' = \sigma_B = 37.165$.
[New C.V. (A) = 5.65] < [New C.V. (B) = 5.88]. Result is not affected.

12. The number of employees, average wages per employee and variance of the wages per employee for two factories are given below:

                                              Factory A     Factory B
Number of Employees                           100           200
Average Wage per Employee (Rs.)               120           200
Variance of the Wages per Employee (Rs.)      16            25

In which factory is there greater variation in the distribution of wages per employee? [C.A. (Foundation), May 2000]
Ans. C.V. (A) = 3.3; C.V. (B) = 2.5. There is greater variability in Factory A.

13. The number of employees, wages per employee and the variance of the wages per employee for two factories are given below:

                                                             Factory A     Factory B
No. of employees                                             50            100
Average wages per employee per month (in Rs.)                120           85
Variance of the wages per employee per month (in Rs.)        9             16

In which factory is there greater variation in the distribution of wages per employee?
Ans. In factory B; C.V. (A) = 2.5, C.V. (B) = 4.7.

14. Two workers on the same job show the following results over a long period of time:

                                                 Worker A     Worker B
Mean time of completing the job (minutes)        30           25
Standard deviation (minutes)                     6            4

(i) Which worker appears to be more consistent in the time he requires to complete the job?
(ii) Which worker appears to be faster in completing the job? Explain.
Ans. (i) B, (ii) B (since $\bar{X}_B < \bar{X}_A$).

15. The mean and standard deviation of 200 items are found to be 60 and 20 respectively. At the time of calculations, two items were wrongly taken as 3 and 67 instead of 13 and 17. Find the correct mean and standard deviation.
What is the correct coefficient of variation?
Ans. Corrected mean = 59.8, s.d. = 20.09; C.V. = 33.60.

16. The mean and standard deviation of a series of 100 items were found to be 60 and 10 respectively. While calculating, two items were wrongly taken as 5 and 45 instead of 30 and 20. Calculate corrected variance and corrected coefficient of variation. [Delhi Univ. B.Com. (Hons.), 2009]
Ans. Corrected $\sigma^2$ = 92.50; Corrected C.V. = 16.03.

17. For the following distribution of marks obtained, find the arithmetic mean, the standard deviation and the coefficient of variation.

Marks obtained  :   0-5   5-10   10-15   15-20   20-25   25-30   30-35   35-40
No. of students :   2     5      7       13      21      16      8       3

Ans. A.M. = 21.9; σ = 7.9931; C.V. = 36.5.

18. Data on the annual earnings of professors and physicians in a certain town yield the following results:
Professors : $\bar{x}_1$ = Rs. 16,000, $\sigma_1$ = Rs. 2,000
Physicians : $\bar{x}_2$ = Rs. 23,000, $\sigma_2$ = Rs. 4,000
Are the professors’ earnings more or less variable than the physicians’? [Delhi Univ. B.A. (Econ. Hons.), 1990]
Ans. C.V. (Professors) = 12.50, C.V. (Physicians) = 17.39; Professors’ earnings are less variable.

19. (a) Verify the correctness of the following statement:
“A batsman scored at an average of 60 runs an inning against Pakistan. The standard deviation of the runs scored by him was 12. A year later against Australia, his average came down to 50 runs an inning and the standard deviation of the runs scored fell down to 9. Therefore, it is correct to say that his performance was worse against Australia and that there was lesser consistency in his batting against Australia”. [Delhi Univ. B.Com. (Hons.), 1986]
Ans. C.V. (Australia) = 18, C.V. (Pakistan) = 20. Greater consistency against Australia. $\bar{X}$ (Against Pakistan) > $\bar{X}$ (Against Australia); better performance against Pakistan than against Australia.

(b) The following is the record of goals scored by team A in the football season:

No. of goals scored by team A in a match :   0   1   2   3   4
Number of matches                        :   1   9   7   5   3

For team B the average number of goals scored per match was 2.5 with a standard deviation of 1.25 goals. Find which team may be considered more consistent.
Ans. C.V. (A) = 54.77, C.V. (B) = 50; B is more consistent.

20. During the 10 weeks of a session, the marks obtained by two candidates, Ramesh and Suresh, taking the Computer Programme course are given below:

Ramesh :   58   59   60   54   65   66   52   75   69   52
Suresh :   87   89   78   71   73   84   65   66   56   46

(i) Who is the better scorer, Ramesh or Suresh? (ii) Who is more consistent? [Delhi Univ. B.Com. (Pass), 1998]
Ans. Ramesh : $\bar{x}_1$ = 61, $\sigma_1$ = 7.25, C.V. = 11.89; Suresh : $\bar{x}_2$ = 71.5, $\sigma_2$ = 13.08, C.V. = 18.29.
(i) Suresh is the better scorer (since $\bar{x}_2 > \bar{x}_1$); (ii) Ramesh is more consistent.

21. Complete the table showing the frequencies with which words of different numbers of letters occur in the extract reproduced below (omitting punctuation marks), treating as the variable the number of letters in each word, and obtain the coefficient of variation of the distribution:
“Her eyes were blue; blue as autumn distance, blue as the blue we see between the retreating mouldings of hills and woody slopes on a sunny September morning : a misty and shady blue, that had no beginning or surface, and was looked into rather than at”.
Ans. $\bar{X}$ = 4.35; σ = 2.23; C.V. = 51.04.

22. Compile a table showing the frequencies with which words of different numbers of letters occur in the extract reproduced below (omitting punctuation marks), treating as the variable the number of letters in each word, and obtain the mean, median and the coefficient of variation of the distribution:
“Success in the examination confers no absolute right to appointment unless Government is satisfied, after such enquiry as may be considered necessary, that the candidate is suitable in all respects for appointment to the public service”.
Ans.
Mean = 5.5; Median = 5; s.d. = 3.12; C.V. = 56.7.

23.
No. of goals scored in a match :   0    1    2    3    4
No. of matches, Team A         :   27   9    8    5    1
No. of matches, Team B         :   1    5    8    9    27

Ans. C.V. (A) = 127.84; C.V. (B) = 36.06; Team B is more consistent.

24. Lives of two models of refrigerators in a recent survey are shown in the following table. What is the average life of each model of these refrigerators? Which model has greater uniformity?

Life (No. of years)     No. of refrigerators
                        Model A     Model B
0-2                     5           2
2-4                     16          7
4-6                     13          12
6-8                     7           19
8-10                    5           9
10-12                   4           1

Ans. $\bar{X}_a$ = 5.12 years; C.V. (A) = 54.88; $\bar{X}_b$ = 6.16 years; C.V. (B) = 36.2; Model B has greater uniformity.

25. A purchasing agent obtained samples of incandescent lamps from two suppliers. He had the samples tested in his own laboratory for the length of life with the following results:

Length of life in hours       Samples from
                              Company A     Company B
700 and under 900             10            3
900 and under 1,100           16            42
1,100 and under 1,300         26            12
1,300 and under 1,500         8             3

Which company’s lamps are more uniform?
Ans. C.V. (A) = 16.7, C.V. (B) = 11.9. Lamps of company B are more uniform.

26. Two brands of tyres are tested with the following results:

Life (in ’000 miles)     No. of tyres of brand
                         X         Y
20-25                    1         0
25-30                    22        24
30-35                    64        76
35-40                    10        0
40-45                    3         0

(a) Which brand of tyres has the greater average life?
(b) Compare the variability and state which brand of tyres you would use on your fleet of trucks. [Bangalore Univ. B.Com., 1997]
Ans. $\bar{X}$ = 32.1 thousand miles, $\sigma_x$ = 3.441 thousand miles, C.V. (X) = 10.72; $\bar{Y}$ = 31.3 thousand miles, $\sigma_y$ = 2.135 thousand miles, C.V. (Y) = 6.82. (a) Brand X; (b) Brand X tyres are more variable; Brand Y.

27. The mean and standard deviation of the marks obtained by two groups of students, consisting of 50 each, are given below. Calculate the mean and standard deviation of the marks obtained by all the 100 students:

Group     Mean     Standard Deviation
1         60       8
2         55       7

[C.A. (Foundation), Nov. 1999]
Ans.
$\bar{x}_{12}$ = 57.5, $\sigma_{12}$ = 7.92.

28. For a group of 50 male workers, the mean and standard deviation of their weekly wages are Rs. 63 and Rs. 9 respectively. For a group of 40 female workers these are Rs. 54 and Rs. 6 respectively. Find the standard deviation for the combined group of 90 workers.
Ans. σ = 9.

29. The first of the two samples has 100 items with mean 15 and standard deviation 3. If the whole group has 250 items with mean 15.6 and standard deviation $\sqrt{13.44}$, find the mean and the standard deviation of the second group.
Ans. $\bar{X}_2$ = 16; $\sigma_2$ = 4.

30. For two groups of observations the following results were available:
Group I : $\sum(X - 5) = 8$; $\sum(X - 5)^2 = 40$; $N_1 = 20$
Group II : $\sum(X - 8) = -10$; $\sum(X - 8)^2 = 70$; $N_2 = 25$
Find the mean and the standard deviation of the 45 observations obtained by combining the two groups. [Delhi Univ. B.Com. (Hons.), 2006]
Hint. Expand $\sum(X - 5)$, $\sum(X - 5)^2$, $\sum(X - 8)$ and $\sum(X - 8)^2$, and find $\sum X$ and $\sum X^2$ for each group. For the combined group: $\sum X = 108 + 190 = 298$, $\sum X^2 = 620 + 1510 = 2130$, $N = 45$.
Ans. $\bar{X}_{12}$ = 6.622, $\sigma_{12}$ = 1.865.

31. A company has three establishments E₁, E₂, E₃ in three cities. Analysis of the monthly salaries paid to the employees in the three establishments is given below:

                                                   E₁       E₂       E₃
Number of employees                                20       25       40
Average monthly salary (Rs.)                       305      300      340
Standard deviation of monthly salary (Rs.)         50       40       45

Find the average and the standard deviation of the monthly salaries of all the 85 employees in the company.
Ans. Mean = Rs. 320; s.d. = Rs. 48.69.

32. Calculate the missing information from the following data.

                        Variable A     Variable B     Variable C     Combined
Total Number            175            ?              225            500
Mean                    220            240            ?              235
Standard Deviation      ?              6.3            5.9            5.4

Ans. $n_B$ = 100; $\bar{X}_C$ = 244.4, $\sigma_A$ = 18.36.

33. For a group of 30 male workers, the mean and standard deviation of weekly overtime work (No. of hours) are 10 and 4 respectively; for 20 female workers the mean and standard deviation are 5 and 3 respectively.
(i) Calculate the mean for the two groups taken together.
(ii) Is the overtime work more variable for the male group than for the female group? Explain. [Delhi Univ. B.A. (Econ.) Hons., 1991]
Ans. (i) 8 hours. (ii) C.V. males (= 40) < C.V. females (= 60) ⇒ Overtime work for males is less variable.

34. An analysis of the monthly wages paid to workers in two firms A and B, belonging to the same industry, gives the following results:

                                            Firm A        Firm B
Number of wage-earners                      586           648
Average monthly wage                        Rs. 52.5      Rs. 47.5
Variance of the distribution of wage        100           121

(a) Which firm, A or B, pays out the larger amount as monthly wages?
(b) In which firm, A or B, is there greater variability in individual wages?
(c) What are the measures of (i) average monthly wage, and (ii) the variability in individual wages, of all the workers in the two firms, A and B, taken together?
Ans. (a) B; (b) In Firm B; (c) $\bar{X}$ = Rs. 49.87; σ = Rs. 10.83.

35. For two firms A and B, the following details are available:

                                           A          B
Number of employees                  :     100        200
Average salary (Rs.)                 :     1,600      1,800
Standard deviation of salary (Rs.)   :     16         18

(i) Which firm pays the larger package of salary?
(ii) Which firm shows greater variability in the distribution of salary?
(iii) Compute the combined average salary and combined variance of both the firms. [Delhi Univ. B.Com. (Pass), 2000]
Ans. (i) B; (ii) C.V. (A) = 1, C.V. (B) = 1. Both firms show equal variability. (iii) $\bar{x}_{12}$ = Rs. 1,733.33 and $\sigma_{12}^2$ = 9190.22 (Rs.²).

36. If the mean deviation of a moderately skewed distribution is 7.2 units, find the standard deviation as well as the quartile deviation.
Ans. $\text{S.D.} \approx \frac{5}{4}\ \text{M.D.} = 9$; $\text{Q.D.} \approx \frac{5}{6}\ \text{M.D.} = 6.0$.

37. For a series, the value of Mean Deviation is 15. Find the most likely value of its quartile deviation. [Delhi Univ. B.Com. (Pass), 2002]
Ans. $\text{Q.D.} = \frac{5}{6}\ \text{M.D.} = 12.5$.

6·13. LORENZ CURVE

Lorenz curve is a graphic method of studying the dispersion in a distribution. It was first used by Max O. Lorenz, an economic statistician, for the measurement of economic inequalities such as in the distribution of income and wealth between different countries or between different periods of time. But today, Lorenz curve is also used in business to study the disparities of the distribution of wages, profits, turnover, production, population, etc.

A very distinctive feature of the Lorenz curve consists in dealing with the cumulative values of the variable and the cumulative frequencies rather than its absolute values and the given frequencies. The technique of drawing the curve is fairly simple and consists of the following steps:

(i) The size of the item (variable value) and the frequencies are both cumulated. Taking the grand total for each as 100, express these cumulated totals for the variable and the frequencies as percentages of their corresponding grand totals.

(ii) Now take coordinate axes, the X-axis representing the percentages of the cumulated frequencies (x) and the Y-axis representing the percentages of the cumulated values of the variable (y). Both x and y take the values from 0 to 100 as shown in Fig. 6·1.

(iii) Draw the diagonal line y = x joining the origin O (0, 0) with the point P (100, 100) as shown in the diagram. The line OP will make an angle of 45° with the X-axis and is called the line of equal distribution.

(iv) Plot the percentages of the cumulated values of the variable (y) against the percentages of the corresponding cumulated frequencies (x) for the given distribution and join these points with a smooth freehand curve.

Obviously, for any given distribution this curve will never cross the line of equal distribution OP. It will always lie below OP unless the distribution is uniform (equal), in which case it will coincide with OP. Thus when the distribution of items is not proportionately equal, variability (dispersion) is indicated and the curve lies farther from the line of equal distribution OP.
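Steps (i) to (iv) above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the function name `lorenz_points` and the sample figures are hypothetical, and step (i) is read literally, so the variable values themselves are cumulated (not value times frequency).

```python
from itertools import accumulate


def lorenz_points(values, frequencies):
    """Return the (x, y) points of a Lorenz curve as cumulative percentages.

    x = cumulated frequency as a percentage of the total frequency
    y = cumulated variable value as a percentage of the total of the values
    Assumption: step (i) is taken literally, so the variable values
    themselves are cumulated rather than value times frequency.
    """
    if not values or len(values) != len(frequencies):
        raise ValueError("values and frequencies must be non-empty and of equal length")
    cum_values = list(accumulate(values))
    cum_freqs = list(accumulate(frequencies))
    total_value, total_freq = cum_values[-1], cum_freqs[-1]
    points = [(0.0, 0.0)]  # the curve starts at the origin O
    for cv, cf in zip(cum_values, cum_freqs):
        points.append((100.0 * cf / total_freq, 100.0 * cv / total_value))
    return points


# Hypothetical data for illustration only; values are in ascending order.
values = [10, 20, 30, 40]
frequencies = [40, 30, 20, 10]

points = lorenz_points(values, frequencies)
for x, y in points:
    # A point on the line of equal distribution OP would have y equal to x.
    print(f"{x:6.1f}% of frequencies -> {y:6.1f}% of values (gap from OP: {x - y:5.1f})")
```

The printed gap x - y is the vertical distance of each plotted point below the line OP; plotting the points and joining them with a smooth curve completes step (iv).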
The greater the variability, the greater is the distance of the curve from OP.

Let us consider the Lorenz curve diagram (Fig. 6·1) for the distribution of income (say). In the diagram, OP is the line of equal distribution of income. If the plotted cumulative percentages lie on this line, there is no variability in the distribution of income of persons. The points lying on the curve OAP indicate a lesser degree of variability as compared to the points lying on the curve OBP. Variability is still greater when the points lie on the curve OCP. Thus a measure of variability of the distribution is provided by the distance of the curve of the cumulated percentages of the given distribution from the line of equal distribution.

[Fig. 6·1. LORENZ CURVE. X-axis: No. of persons (0 to 100); Y-axis: Income (0 to 100); line of equal distribution OP from O to P(100, 100); curves OAP, OBP and OCP lying progressively farther below OP.]

Remarks 1. An obvious disadvantage of the Lorenz curve is that it gives us only a relative idea of the dispersion as compared with the line of equal distribution. It does not provide any numerical value of the variability for the given distribution. Accordingly, it should be used together with some numerical measure of dispersion. However, this should not undermine the utility of the Lorenz curve in studying the variability of distributions, particularly those relating to income, wealth, wages, profits, land, capital, etc.

2. From the Lorenz curve we can immediately find out what percentage of persons (frequencies) corresponds to a given percentage of the item (variable value).

Example 6·42. From the following table giving data regarding income of workers in a factory, draw a graph (Lorenz curve) to study the inequality of income :

| Income (in Rs.) | No. of workers in the factory |
|---|---|
| Below 500 | 6,000 |
| 500–1,000 | 4,250 |
| 1,000–2,000 | 3,600 |
| 2,000–3,000 | 1,500 |
| 3,000–4,000 | 650 |

Solution.

CALCULATIONS FOR LORENZ CURVE

| Income (in Rs.) (1) | Mid-value (2) | Cumulative income (3) | Percentage of cumulative income (4) | No. of workers (f) (5) | Cumulative frequency (6) | Percentage of cumulative frequency (7) |
|---|---|---|---|---|---|---|
| 0–500 | 250 | 250 | 2·94 | 6,000 | 6,000 | 37·5 |
| 500–1000 | 750 | 1,000 | 11·76 | 4,250 | 10,250 | 64·1 |
| 1000–2000 | 1500 | 2,500 | 29·41 | 3,600 | 13,850 | 86·6 |
| 2000–3000 | 2500 | 5,000 | 58·82 | 1,500 | 15,350 | 95·9 |
| 3000–4000 | 3500 | 8,500 | 100·00 | 650 | 16,000 | 100·0 |
| Total | 8500 | | | 16,000 | | |

The Lorenz curve (Fig. 6·2) prominently exhibits the inequality of the distribution of income among the factory workers.

[Fig. 6·2. LORENZ CURVE. X-axis: Percentage of persons (0 to 100); Y-axis: Percentage of income (0 to 100); line of equal distribution from O to (100, 100).]

Remark. From the Lorenz curve we observe that 70% of the persons get only 15% of the income and 90% of the persons get only 35% of the income.

EXERCISE 6·5

1. What is Lorenz curve? How do you construct it? What is its use?

2. From the following table giving data regarding income of employees in two factories, draw a graph (Lorenz Curve) to show which factory has greater inequalities of income :

| Income (’00 Rs.) | Below 200 | 200–500 | 500–1,000 | 1,000–2,000 | 2,000–3,000 |
|---|---|---|---|---|---|
| Factory A | 7,000 | 1,000 | 1,200 | 800 | 500 |
| Factory B | 800 | 1,200 | 1,500 | 400 | 200 |

3. Industrial relations have been deteriorating in the Wessex factory of JK Limited and personnel management has established that a contributory factor is the inequalities of earnings of operators paid on the basis of an incentive scheme. Operators work an eight-hour day and bonus is paid progressively after measured work equivalent to 360 standard minutes has been produced. Table A, below, shows the position for the month of June 1975 in respect of operator production. Improvements are made in working conditions in the areas where poor performances were recorded and subsequently, in October 1975, the results in Table B were measured.
| Standard minutes produced per operator per day | Table A (June 1975): Number of operators | Table B (October 1975): Number of operators |
|---|---|---|
| 300 | 4 | 10 |
| 320 | 11 | 32 |
| 340 | 11 | 20 |
| 360 | 9 | 18 |
| 380 | 10 | 2 |
| 400 | 12 | 5 |
| 420 | 12 | 5 |
| 440 | 11 | 5 |
| 460 | 10 | 3 |
| 480 | 5 | – |
| 500 | 5 | – |

(a) Present the data in Tables A and B in the form of a Lorenz curve.
(b) Comment on the results.
Ans. Group A represents greater inequality of distribution than group B.

4. The frequency distributions of marks obtained in Mathematics (M) and English (E) are as follows :

| Mid-value of marks | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 | 85 | 95 |
|---|---|---|---|---|---|---|---|---|---|---|
| No. of students (M) | 10 | 12 | 13 | 14 | 22 | 27 | 20 | 12 | 11 | 9 |
| No. of students (E) | 1 | 2 | 26 | 50 | 59 | 40 | 10 | 8 | 3 | 1 |

Analyse the data by drawing the Lorenz curves on the same diagram and describe the main features you observe.

5. Draw Lorenz Curve for the comparison of profits of two groups of companies, A and B, in business. What is your conclusion?

| Total amount of profits earned by companies | Number of companies in Group A | Number of companies in Group B |
|---|---|---|
| 600 | 6 | 1 |
| 2,500 | 11 | 19 |
| 6,000 | 13 | 26 |
| 8,400 | 14 | 14 |
| 10,500 | 15 | 14 |
| 15,000 | 17 | 13 |
| 17,000 | 10 | 6 |
| 40,000 | 14 | 7 |

Ans. The Lorenz curve for group B is farthest from the line of equal distribution. Hence, group B represents greater inequality of profits than group A.

6. (a) Write an explanatory note on Lorenz curve.
(b) The following table gives the population and earnings of residents in towns A and B. Represent the data graphically so as to bring out the inequality of the distribution of the earnings of residents.

Town A

| No. of persons | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
|---|---|---|---|---|---|---|---|---|---|---|
| Earnings (Rs. daily) | 35 | 50 | 75 | 115 | 160 | 180 | 225 | 300 | 425 | 925 |

Town B

| No. of persons | 100 | 140 | 60 | 50 | 200 | 90 | 60 | 40 | 160 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Earnings (Rs. daily) | 160 | 320 | 120 | 280 | 400 | 400 | 280 | 920 | 240 | 960 |

Ans. Inequality of incomes is more prominent in Town A.

EXERCISE 6·6 (Objective Type Questions)

I.
Match the correct parts to make a valid statement :

| | |
|---|---|
| (a) Algebraic sum of deviations from mean | (i) $Q_3 - Q_1$ |
| (b) Coefficient of Mean Deviation | (ii) $100\,\sigma / \text{Mean}$ |
| (c) Variance | (iii) Zero |
| (d) Quartile Deviation | (iv) $\frac{1}{N}\sum f\,\lvert X - \bar{X}\rvert$ |
| (e) Coefficient of Variation | (v) $\frac{1}{N}\sum f\,(X - \bar{X})^2$ |
| (f) Sum of absolute deviations from median | (vi) (M.D. about Mean) / Mean |
| (g) Interquartile range | (vii) $(Q_3 - Q_1)/2$ |
| (h) Mean deviation | (viii) Minimum |

Ans. (a) – (iii); (b) – (vi); (c) – (v); (d) – (vii); (e) – (ii); (f) – (viii); (g) – (i); (h) – (iv).

II. In the following questions, tick the correct answer :
(i) Algebraic sum of deviations from mean is : (a) Positive, (b) Negative, (c) Zero, (d) Different for each case.
(ii) Sum of squares of deviations is minimum when taken from : (a) Mean, (b) Median, (c) Mode, (d) None of these.
(iii) Sum of absolute deviations is minimum when measured from : (a) Mean, (b) Median, (c) Mode, (d) None of these.
(iv) For a discrete frequency distribution : (a) S.D. ≤ M.D., (b) S.D. ≥ M.D., (c) S.D. > M.D., (d) S.D. < M.D., (e) None of these, where M.D. = Mean Deviation from mean.
(v) The range of a given distribution is (a) Greater than s.d., (b) Less than s.d., (c) Equal to s.d., (d) None of these.
(vi) The measure of dispersion independent of frequencies of the given distribution is (a) Range, (b) s.d., (c) M.D., (d) Q.D.
(vii) In case of open end classes, an appropriate measure of dispersion to be used is (a) Range, (b) Q.D., (c) M.D., (d) s.d.
(viii) Measure of dispersion which is affected most by extreme observations is (a) Range, (b) Q.D., (c) M.D., (d) s.d.
(ix) Mean deviation from median (Md) is given by (a) ∑ | X – Md | , n (b) ∑ | X – Md | , (c) ∑ | X – Md |2 , n ⎯n √ (d) ∑ | X – Md |2 . n
(x) Quartile deviation is given by : (a) $Q_3 - Q_1$, (b) $\frac{Q_3 - Q_2}{2}$, (c) $\frac{Q_2 - Q_1}{2}$, (d) $\frac{Q_3 - Q_1}{2}$.
(xi) Step deviation formula for variance is : ∑d2 ∑d2 ∑d 2 ∑d 2 ∑d 2 ∑d , (b) – , (c) – 2 , (a) 2 – n n n n n n ( ) ( ) ( ) (d) ( ) ( ) ∑d2 n 2 – ∑d n 2 .
(xii) If the distribution is approximately normal, then (a) M.D. = $\frac{2}{5}\sigma$, (b) M.D. = $\frac{3}{5}\sigma$, (c) M.D. = $\frac{4}{5}\sigma$, (d) None of these.
(xiii) For a normal distribution, (a) Q.D. = σ, (b) Q.D. = $\frac{2}{3}\sigma$, (c) Q.D. = $\frac{1}{3}\sigma$, (d) None of these.
(xiv) For a normal distribution, (a) Q.D. > M.D., (b) Q.D. < M.D., (c) Q.D. = M.D., (d) None of these.
(xv) For a normal distribution, the range mean ± 1·σ covers (a) 65%, (b) 68·26%, (c) 85%, (d) 95% of the items.

Ans. (i) – (c); (ii) – (a); (iii) – (b); (iv) – (b); (v) – (a); (vi) – (a); (vii) – (b); (viii) – (a); (ix) – (a); (x) – (d); (xi) – (b); (xii) – (c); (xiii) – (b); (xiv) – (b); (xv) – (b).

III. Fill in the blanks :
(i) Algebraic sum of deviations is zero from ……
(ii) The sum of absolute deviations is minimum from ……
(iii) Standard deviation is always …… than range.
(iv) Standard deviation is always …… than mean deviation.
(v) All relative measures of dispersion are …… from units of measurement.
(vi) Variance is the …… value of mean square deviation.
(vii) If Q1 = 10, Q3 = 40, the coefficient of quartile deviation is ……
(viii) If 25% of the items in a distribution are less than 10 and 25% are more than 40, quartile deviation is ……
(ix) The median and s.d. of a distribution are 15 and 5 respectively. If each item is increased by 5, the new median = …… and s.d. = ……
(x) A computer showed that the s.d. of 40 observations ranging from 120 to 150 is 35. The answer is correct/wrong. Tick the right one.

Ans. (i) Arithmetic mean (ii) Median (iii) Less (iv) Greater (v) Free (vi) Minimum (vii) 0·6 (viii) 15 (ix) 20, 5 (x) Wrong, since s.d. can’t exceed range.

IV. Fill in the blanks :
(i) The …… the Lorenz Curve is from the line of equal distribution, the greater is the variability in the series.
(ii) Quartile deviation is …… measure of dispersion.
(iii) If Q1 = 20 and Q3 = 50, the coefficient of quartile deviation is ……
(iv) If in a series, coefficient of variation is 64 and mean 10, the standard deviation shall be ……

Ans.
(i) Farther (ii) Absolute (iii) 0·43 [= (50 – 20)/(50 + 20)] (iv) 6·4.

V. State whether the following statements are true or false. In case of false statements, give the correct statement.
(i) Algebraic sum of deviations from mean is minimum.
(ii) Mean deviation is least when calculated from median.
(iii) Variance is always non-negative.
(iv) Mean, standard deviation and coefficient of variation have same units.
(v) Relative measures of dispersion are independent of units of measurement.
(vi) Mean and standard deviation are independent of change of origin.
(vii) Variance is square of standard deviation.
(viii) Standard deviation is independent of change of origin and scale.
(ix) Variance is the minimum value of mean square deviation.
(x) Q.D. = $\frac{2}{3}$ × (s.d.), always.
(xi) Mean deviation can never be negative.
(xii) M.D. = $\frac{2}{3}$ × σ, for normal distribution.
(xiii) If mean and s.d. of a distribution are 20 and 4 respectively, C.V. = 15%.
(xiv) If each value in a distribution of 5 observations is 10, then its mean is 10 and variance is 1.
(xv) In a discrete distribution s.d. ≥ M.D. (about mean).

Ans. (i) False, (ii) True, (iii) True, (iv) False, (v) True, (vi) False, (vii) True, (viii) False, (ix) True, (x) False, (xi) True, (xii) False, (xiii) False, (xiv) False, (xv) True.

VI. The mean and s.d. of 100 observations are 50 and 10 respectively. Find the new mean and standard deviation,
(i) if 2 is added to each observation.
(ii) if 3 is subtracted from each observation.
(iii) if each observation is multiplied by 5.
(iv) if 2 is subtracted from each observation and then it is divided by 5.
Ans. (i) 52, 10, (ii) 47, 10, (iii) 250, 50, (iv) 9·6, 2.

VII. The sum of squares of deviations of 15 observations from their mean 20 is 240. Find (i) s.d. and (ii) C.V.
Ans. σ = 4, C.V. = 20.

VIII. State, giving reasons, whether the following statements are true or false.
(i) Standard deviation can never be negative.
(ii) Sum of squares of deviations measured from mean is least.
Ans. (i) True, (ii) True.

IX.
Comment briefly on the following statements :
(i) The mean of the combined series lies between the means of the two component series.
(ii) The standard deviation of the combined series lies between the standard deviations of the two component series.
(iii) Mean can never be equal to standard deviation.
(iv) Mean can never be equal to variance.
(v) A consistent cricket player has greater variability in test scores.
Ans. (i) True (ii) False (iii) False (iv) False (v) False.

X. Comment briefly on the following statements :
(i) The median is the point about which the sum of squared deviations is minimum.
(ii) Since $\sum (X_i - \bar{X}) = 0$, ∴ $\sum (X_i - \bar{X})^2 = 0$.
(iii) A computer obtained the standard deviation of 25 observations whose values ranged from 65 to 85 as 25.
(iv) A student obtained the mean and variance of a set of 10 observations as 10, – 5 respectively.
(v) The range is the perfect measure of variability as it includes all the measurements.
(vi) For the distribution of 5 observations : 8, 8, 8, 8, 8, mean = 8 and variance = 8.
(vii) If the mean and s.d. of distribution A are smaller than the mean and s.d. of distribution B respectively, then the distribution A is more uniform (less variable) than the distribution B.
Ans. (i) False, (ii) False, (iii) False, (iv) False, (v) False, (vi) False, (vii) False.

XI. (a) If s.d. of a group is 15, find the most likely value of (i) Mean deviation and (ii) Quartile deviation of that group.
(b) If mean deviation of a distribution is 20, find the most likely value of (i) s.d. and (ii) Q.D.
(c) If quartile deviation of a distribution is 6, find the most likely value of (i) s.d. and (ii) M.D.
(d) If quartile deviation of a distribution is 20 and its mean is 60, obtain the most likely value of (i) Coefficient of variation, (ii) Mean deviation and (iii) Coefficient of mean deviation.
In all the above parts, assume that the distribution is normal.
Ans. (a) M.D. = 12, Q.D. = 10, (b) s.d. = 25, Q.D.
= 16·67, (c) s.d. = 9, M.D. = 7·2, (d) C.V. = 50, M.D. = 24, Coefficient of M.D. = 0·4.

7 Skewness, Moments and Kurtosis

7·1. INTRODUCTION

It was pointed out in the last two chapters that we need statistical measures which will reveal clearly the salient features of a frequency distribution. The measures of central tendency tell us about the concentration of the observations about the middle of the distribution and the measures of dispersion give us an idea about the spread or scatter of the observations about some measure of central tendency. We may come across frequency distributions which differ very widely in their nature and composition and yet may have the same central tendency and dispersion. For example, the following two frequency distributions have the same mean $\bar{X}$ = 15 and the same standard deviation σ ≃ 6, yet they give histograms which differ very widely in shape and size.

| Class | 0–5 | 5–10 | 10–15 | 15–20 | 20–25 | 25–30 |
|---|---|---|---|---|---|---|
| Frequency distribution I : Frequency | 10 | 30 | 60 | 60 | 30 | 10 |
| Frequency distribution II : Frequency | 10 | 40 | 30 | 90 | 20 | 10 |

Thus these two measures viz., central tendency and dispersion, are inadequate to characterise a distribution completely and they must be supported and supplemented by two more measures viz., skewness and kurtosis, which we shall discuss in the following sections. Skewness helps us to study the shape i.e., symmetry or asymmetry of the distribution, while kurtosis refers to the flatness or peakedness of the curve which can be drawn with the help of the given data. These four measures viz., central tendency, dispersion, skewness and kurtosis are sufficient to describe a frequency distribution completely.

7·2. SKEWNESS

Literal meaning of skewness is ‘lack of symmetry’. We study skewness to have an idea about the shape of the curve which we can draw with the help of the given frequency distribution.
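The remark in § 7·1 that frequency distributions I and II agree in mean and standard deviation, and yet differ in shape, can be checked numerically. The following is a minimal sketch, assuming each class is represented by its mid-value; the helper name `grouped_mean_sd` is hypothetical. It also tests the ‘lack of symmetry’ idea in the crudest way, by asking whether the list of frequencies reads the same forwards and backwards.

```python
from math import sqrt


def grouped_mean_sd(mids, freqs):
    """Mean and standard deviation of a grouped frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(mids, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(mids, freqs)) / n
    return mean, sqrt(variance)


# Mid-values of the classes 0-5, 5-10, ..., 25-30
mids = [2.5, 7.5, 12.5, 17.5, 22.5, 27.5]
freq_I = [10, 30, 60, 60, 30, 10]
freq_II = [10, 40, 30, 90, 20, 10]

mean_I, sd_I = grouped_mean_sd(mids, freq_I)
mean_II, sd_II = grouped_mean_sd(mids, freq_II)

# A distribution on equal classes is symmetrical if its frequencies mirror
symmetric_I = freq_I == freq_I[::-1]
symmetric_II = freq_II == freq_II[::-1]

print(f"I : mean = {mean_I:.2f}, s.d. = {sd_I:.2f}, symmetrical = {symmetric_I}")
print(f"II: mean = {mean_II:.2f}, s.d. = {sd_II:.2f}, symmetrical = {symmetric_II}")
```

Both distributions give a mean of 15 and a standard deviation of about 6·02, but only distribution I is symmetrical; the mean and standard deviation alone do not reveal this difference.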
It helps us to determine the nature and extent of the concentration of the observations towards the higher or lower values of the variable. In a symmetrical frequency distribution which is unimodal, if the frequency curve or histogram is folded about the ordinate at the mean, the two halves so obtained will coincide with each other. In other words, in a symmetrical distribution equal distances on either side of the central value will have the same frequencies and consequently both the tails (left and right) of the curve would also be equal in shape and length (Fig. 7·1).

A distribution is said to be skewed if :

(i) The frequency curve of the distribution is not a symmetric bell-shaped curve but is stretched more to one side than to the other. In other words, it has a longer tail to one side (left or right) than to the other. A frequency distribution for which the curve has a longer tail towards the right is said to be positively skewed (Fig. 7·2) and if the longer tail lies towards the left, it is said to be negatively skewed (Fig. 7·3). Figures 7·1 to 7·3 are given below.
(ii) The values of mean (M), median (Md) and mode (Mo) fall at different points i.e., they do not coincide.
(iii) Quartiles Q1 and Q3 are not equidistant from the median i.e., Q3 – Md ≠ Md – Q1.

[Fig. 7·1. SYMMETRICAL DISTRIBUTION: M = Mo = Md.]
[Fig. 7·2. POSITIVELY SKEWED DISTRIBUTION: Mo, Md, M in that order from left to right.]
[Fig. 7·3. NEGATIVELY SKEWED DISTRIBUTION: M, Md, Mo in that order from left to right.]

Remark. Since the extreme values give longer tails, a positively skewed distribution will have greater variation towards the higher values of the variable and a negatively skewed distribution will have greater variation towards the lower values of the variable. For example, the distribution of mortality (death) rates w.r.t. the age, after ignoring the accidental deaths, will give a negatively skewed distribution. However, most of the phenomena relating to business and economic statistics give rise to positively skewed distributions. For instance, the distributions of the quantity demanded w.r.t. the price, or the number of depositors w.r.t. savings in a bank, or the number of persons w.r.t. their incomes or wages in a city will give positively skewed curves.

7·2·1. Measures of Skewness. Various measures of skewness (Sk) are :

(1) Sk = Mean – Median = M – Md …(7·1)
or Sk = Mean – Mode = M – Mo …(7·1a)
(2) Sk = (Q3 – Md) – (Md – Q1) = Q3 + Q1 – 2Md …(7·2)

These are the absolute measures of skewness and are not of much practical utility because of the following reasons :

(i) Since the absolute measures of skewness involve the units of measurement, they cannot be used for comparative study of two distributions measured in different units of measurement.
(ii) Even if the distributions have the same units of measurement, the absolute measures are not recommended because we may come across different distributions which have more or less identical skewness (absolute measures) but which vary widely in the measures of central tendency and dispersion.

Thus for comparing two or more distributions for skewness we compute the relative measures of skewness, also commonly known as coefficients of skewness, which are pure numbers independent of the units of measurement. Moreover, in a relative measure of skewness, the disturbing factor of variation or dispersion is eliminated by dividing the absolute measure of skewness by a suitable measure of dispersion. The following are the coefficients of skewness which are commonly used.

7·2·2. Karl Pearson’s Coefficient of Skewness. This is given by the formula :

$\text{Sk} = \dfrac{\text{Mean} - \text{Mode}}{\text{s.d.}} = \dfrac{M - \text{Mo}}{\sigma}$ …(7·3)

But quite often, mode is ill-defined and is thus quite difficult to locate. In such a situation, we use the following empirical relationship between the mean, median and mode for a moderately asymmetrical (skewed) distribution :

$\text{Mo} = 3\,\text{Md} - 2M$ …(7·4)

Substituting in (7·3), we get

$\text{Sk} = \dfrac{M - (3\,\text{Md} - 2M)}{\sigma} = \dfrac{3(M - \text{Md})}{\sigma}$ …(7·5)

Remarks

1.
Theoretically, Karl Pearson’s Coefficient of Skewness lies between the limits ± 3, but these limits are rarely attained in practice.

2. From (7·3) and (7·5), skewness is zero if M = Mo = Md. In other words, for a symmetrical distribution mean, mode and median coincide i.e., M = Md = Mo.

3. Sk > 0, if M > Md > Mo or if Mo < Md < M …(7·6)

Thus, for a positively skewed distribution, the value of the mean is the greatest of the three measures and the value of mode is the least of the three measures. If the distribution is negatively skewed, the inequality in (7·6) is reversed i.e., the inequalities ‘greater than’ (i.e., >) and ‘less than’ (i.e., <) are interchanged. Thus :

Sk < 0, if M < Md < Mo or if Mo > Md > M …(7·7)

In other words, for a negatively skewed distribution, of the three measures of central tendency viz., mean, median and mode, the mode has the maximum value and the mean has the least value.

4. While ‘dispersion’ studies the degree of variation in the given distribution, skewness attempts at studying the direction of variation. Extreme variations towards higher values of the variable give a positively skewed distribution while in a negatively skewed distribution, the extreme variations are towards the lower values of the variable.

5. In Pearson’s coefficient of skewness, the disturbing factor of variation is eliminated by dividing the absolute measure of skewness M – Mo by the measure of dispersion σ (standard deviation).

6. If the distribution is symmetrical, say, about mean, then

$\text{Mean} - X_{\min} = X_{\max} - \text{Mean}$ …(7·8)
⇒ $X_{\max} + X_{\min} = 2\,\text{Mean}$ …(7·8a)
Also $X_{\max} - X_{\min} = \text{Range}$ …(7·9)

Adding and subtracting (7·8a) and (7·9), we get respectively :

$2X_{\max} = 2\,\text{Mean} + \text{Range}$ ⇒ $X_{\max} = \text{Mean} + \tfrac{1}{2}\,\text{Range}$ …(7·10)
$2X_{\min} = 2\,\text{Mean} - \text{Range}$ ⇒ $X_{\min} = \text{Mean} - \tfrac{1}{2}\,\text{Range}$ …(7·10a)

Example 7·1. A distribution of wages paid to workers would show that, although a few reach very high levels, most workers are at lower level of wages.
If you were an employer, resisting workers’ claim for an increase of wages, which average would suit your case? Do you think your argument will be different if you are a trade union leader? Explain. [Delhi Univ. B.A. (Econ. Hons.), 1997]

Solution. The distribution of wages paid to the workers would show that, although a few reach very high levels, most workers are at lower level of wages. This implies that the distribution of wages of workers in the factory is positively skewed so that

Mean > Median > Mode ⇒ Mode < Median < Mean.

Since mode is the least of the three averages mean, median and mode, the average that will suit the employer most (to resist the workers’ claim for higher wages) will be the mode. By a similar argument, since mean is the largest of the three averages, it will suit the trade union leader most.

Example 7·2. In a symmetrical distribution the mean, standard deviation and range of marks for a group of 20 students are 50, 10 and 30 respectively. Find the mean marks and standard deviation of marks, if the students with the highest and the lowest marks are excluded. [Delhi Univ. B.Com. (Hons.), 2004]

Solution. We are given a symmetrical distribution in which,

No. of observations (n) = 20 ; Mean ($\bar{X}$) = 50 ; s.d.
(σ) = 10 and Range = 30.

Since the distribution is symmetrical (about mean), we have

$\text{Mean} - X_{\min} = X_{\max} - \text{Mean}$ …(*)
⇒ $X_{\max} + X_{\min} = 2\,\text{Mean} = 2 \times 50 = 100$ [From (*)] …(**)
Also $X_{\max} - X_{\min} = \text{Range} = 30$ (Given) …(***)

Adding and subtracting (**) and (***), we get respectively

$2X_{\max} = 130$ ⇒ $X_{\max} = \frac{130}{2} = 65$ and $2X_{\min} = 70$ ⇒ $X_{\min} = \frac{70}{2} = 35$

Thus the given problem reduces to : “Given n = 20, $\bar{X}$ = 50 and $\sigma_X$ = 10, obtain the mean and standard deviation if two observations 65 and 35 are omitted.”

$\bar{X} = \dfrac{\sum X}{n}$ ⇒ $\sum X = n\bar{X} = 20 \times 50 = 1{,}000$

$\sigma^2 = 10^2 = 100$ ⇒ $\sum X^2 = n(\sigma^2 + \bar{X}^2) = 20\,(100 + 2{,}500) = 52{,}000$

On omitting the two observations 35 and 65, for the remaining N = n – 2 = 20 – 2 = 18 observations, we have

$\sum_{i=1}^{18} X_i = \sum X - (35 + 65) = 1{,}000 - 100 = 900$

$\sum_{i=1}^{18} X_i^2 = \sum X^2 - (35^2 + 65^2) = 52{,}000 - (1{,}225 + 4{,}225) = 46{,}550$

∴ New Mean $(\bar{X}') = \dfrac{1}{18}\sum_{i=1}^{18} X_i = \dfrac{900}{18} = 50 = \bar{X}$

and New s.d. $(\sigma') = \sqrt{\dfrac{1}{18}\sum_{i=1}^{18} X_i^2 - (\bar{X}')^2} = \sqrt{\dfrac{46{,}550}{18} - 50^2} = \sqrt{\dfrac{1{,}550}{18}} = \sqrt{86.11} = 9.28$

Aliter : For a symmetrical distribution (about mean) we have, [From (7·10) and (7·10a)],

$X_{\max} = \text{Mean} + \tfrac{1}{2}\,\text{Range}$ and $X_{\min} = \text{Mean} - \tfrac{1}{2}\,\text{Range}$,

which give $X_{\max}$ = 50 + 15 = 65 and $X_{\min}$ = 50 – 15 = 35 directly.

Example 7·3. Calculate Karl Pearson’s coefficient of skewness from the following data :

| Size | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Frequency | 10 | 18 | 30 | 25 | 12 | 3 | 2 |

Solution.

COMPUTATION OF MEAN, MODE AND S.D.

| Size (x) | Frequency (f) | f.x | f.x² |
|---|---|---|---|
| 1 | 10 | 10 | 10 |
| 2 | 18 | 36 | 72 |
| 3 | 30 | 90 | 270 |
| 4 | 25 | 100 | 400 |
| 5 | 12 | 60 | 300 |
| 6 | 3 | 18 | 108 |
| 7 | 2 | 14 | 98 |
| Total | N = 100 | ∑fx = 328 | ∑fx² = 1258 |

Mean $(M) = \dfrac{\sum fx}{N} = \dfrac{328}{100} = 3.28$

S.D. $(\sigma) = \sqrt{\dfrac{\sum fx^2}{N} - \left(\dfrac{\sum fx}{N}\right)^2} = \sqrt{\dfrac{1258}{100} - \left(\dfrac{328}{100}\right)^2} = \sqrt{12.58 - 10.7584} = \sqrt{1.8216} = 1.3497$

Since the maximum frequency is 30, the corresponding value of x viz., 3 is the mode. Thus Mode (Mo) = 3.
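The figures obtained for Example 7·3 can be cross-checked in a few lines of Python. This is a sketch only: the variable names are illustrative, the mode is taken simply as the size with the largest frequency, and the coefficient is formed as (M – Mo)/σ in accordance with formula (7·3).

```python
from math import sqrt

# Data of Example 7.3: sizes and their frequencies
sizes = [1, 2, 3, 4, 5, 6, 7]
freqs = [10, 18, 30, 25, 12, 3, 2]

n = sum(freqs)                                            # N
sum_fx = sum(f * x for x, f in zip(sizes, freqs))         # sum of f.x
sum_fx2 = sum(f * x * x for x, f in zip(sizes, freqs))    # sum of f.x^2

mean = sum_fx / n
sd = sqrt(sum_fx2 / n - mean ** 2)

# Mode: the size carrying the maximum frequency
mode = sizes[freqs.index(max(freqs))]

# Karl Pearson's coefficient of skewness, formula (7.3)
sk = (mean - mode) / sd

print(f"N = {n}, sum fx = {sum_fx}, sum fx^2 = {sum_fx2}")
print(f"Mean = {mean:.2f}, S.D. = {sd:.4f}, Mode = {mode}, Sk = {sk:.4f}")
```

A positive value of `sk` indicates a longer tail towards the higher sizes, in line with Remark 3 on formula (7·6).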
Karl Pearson’s Coefficient of Skewness is given by

$\text{Sk} = \dfrac{M - \text{Mo}}{\sigma} = \dfrac{3.28 - 3.00}{1.3497} = \dfrac{0.28}{1.3497} = 0.2075$

Hence, the distribution is slightly positively skewed.

Example 7·4. Calculate Karl Pearson’s Coefficient of Skewness from the data given below :

| Hourly wages (Rs.) | No. of workers | Hourly wages (Rs.) | No. of workers |
|---|---|---|---|
| 40–50 | 5 | 90–100 | 30 |
| 50–60 | 6 | 100–110 | 36 |
| 60–70 | 8 | 110–120 | 50 |
| 70–80 | 10 | 120–130 | 60 |
| 80–90 | 25 | 130–140 | 70 |

Solution.

COMPUTATION OF MEAN, MEDIAN AND S.D.

| Hourly wages (Rs.) | Mid-value (X) | No. of workers (f) | d = (X – 85)/10 | fd | fd² | ‘Less than’ c.f. |
|---|---|---|---|---|---|---|
| 40–50 | 45 | 5 | –4 | –20 | 80 | 5 |
| 50–60 | 55 | 6 | –3 | –18 | 54 | 11 |
| 60–70 | 65 | 8 | –2 | –16 | 32 | 19 |
| 70–80 | 75 | 10 | –1 | –10 | 10 | 29 |
| 80–90 | 85 | 25 | 0 | 0 | 0 | 54 |
| 90–100 | 95 | 30 | 1 | 30 | 30 | 84 |
| 100–110 | 105 | 36 | 2 | 72 | 144 | 120 |
| 110–120 | 115 | 50 | 3 | 150 | 450 | 170 |
| 120–130 | 125 | 60 | 4 | 240 | 960 | 230 |
| 130–140 | 135 | 70 | 5 | 350 | 1750 | 300 |
| | | N = ∑f = 300 | | ∑fd = 778 | ∑fd² = 3510 | |

Since the maximum frequency viz., 70 occurs towards the end of the frequency distribution, mode is ill-defined in this case. Hence, we obtain Karl Pearson’s coefficient of skewness using median viz., by the formula :

$\text{Sk} = \dfrac{3\,(\text{Mean} - \text{Median})}{\sigma}$ …(*)

$\text{Mean} = A + \dfrac{h\sum fd}{N} = 85 + \dfrac{10 \times 778}{300} = 85 + 25.93 = \text{Rs. } 110.93$

$\text{s.d. } (\sigma) = h\sqrt{\dfrac{\sum fd^2}{N} - \left(\dfrac{\sum fd}{N}\right)^2} = 10 \times \sqrt{\dfrac{3510}{300} - \left(\dfrac{778}{300}\right)^2} = 10 \times \sqrt{11.7 - (2.5932)^2}$
$= 10 \times \sqrt{11.7 - 6.7252} = 10 \times \sqrt{4.9748} = 10 \times 2.23043 = \text{Rs. } 22.3043$

$\dfrac{N}{2} = \dfrac{300}{2} = 150$. The c.f. just greater than 150 is 170. Hence, the corresponding class 110–120 is the median class. Using the median formula, we get

$\text{Median} = l + \dfrac{h}{f}\left(\dfrac{N}{2} - c\right) = 110 + \dfrac{10}{50}\,(150 - 120) = 110 + \dfrac{30}{5} = 116$

Substituting these values in (*), we get

$\text{Sk} = \dfrac{3\,(110.93 - 116)}{22.3043} = -\dfrac{3 \times 5.07}{22.3043} = \dfrac{-15.21}{22.3043} = -0.6819$

Hence, the given distribution is negatively skewed.

Example 7·5.
Consider the following distributions :

| | Distribution A | Distribution B |
|---|---|---|
| Mean | 100 | 90 |
| Median | 90 | 80 |
| Standard Deviation | 10 | 10 |

(i) Distribution A has the same degree of variation as distribution B.
(ii) Both distributions have the same degree of skewness.
True/False? Comment, giving reasons. [Delhi Univ. B.Com. (Hons.), 2008]

Solution. (i)

C.V. for distribution A $= 100 \times \dfrac{\sigma_A}{\bar{X}_A} = 100 \times \dfrac{10}{100} = 10$

C.V. for distribution B $= 100 \times \dfrac{\sigma_B}{\bar{X}_B} = 100 \times \dfrac{10}{90} = 11.11$

Since C.V. (B) > C.V. (A), the distribution B is more variable than the distribution A. Hence, the given statement that the distribution A has the same degree of variation as distribution B is wrong.

(ii) Karl Pearson’s coefficient of skewness for the distributions A and B is given by :

$\text{Sk}(A) = \dfrac{3(M - \text{Md})}{\sigma} = \dfrac{3(100 - 90)}{10} = 3$ ; $\text{Sk}(B) = \dfrac{3(90 - 80)}{10} = 3$

Since Sk(A) = Sk(B) = 3, the statement that both the distributions have the same degree of skewness is true.

Example 7·6. (a) In a moderately asymmetrical distribution, the mode and mean are 32·1 and 35·4 respectively. Calculate the median.
(b) From a moderately skewed distribution of retail prices for men’s shirts, it is found that the mean price is Rs. 200 and the median price is Rs. 170. If the coefficient of variation is 20%, find the Pearsonian coefficient of skewness of the distribution. [Delhi Univ. B.Com. (Hons.), 2009; B.A. (Econ. Hons.), 2008]

Solution. (a) For a moderately asymmetrical distribution, we have

$\text{Mo} = 3\,\text{Md} - 2M$ ⇒ $3\,\text{Md} = \text{Mo} + 2M$ ⇒ $\text{Md} = \tfrac{1}{3}(\text{Mo} + 2M)$

∴ $\text{Md} = \tfrac{1}{3}(32.1 + 2 \times 35.4) = \tfrac{1}{3}(32.1 + 70.8) = \dfrac{102.9}{3} = 34.3$

(b) We are given : Mean (M) = Rs. 200, Median (Md) = Rs. 170

and C.V. = 20% ⇒ $\dfrac{100\,\sigma}{M} = 20$ ⇒ $\sigma = \dfrac{20M}{100} = \text{Rs. } \dfrac{20 \times 200}{100} = \text{Rs. } 40$.

$\text{Sk (Karl Pearson)} = \dfrac{3(M - \text{Md})}{\sigma} = \dfrac{3(200 - 170)}{40} = \dfrac{9}{4} = 2.25$

Example 7·7. Pearson’s coefficient of skewness for a distribution is 0·4 and its coefficient of variation is 30%. Its mode is 88; find the mean and median. [Delhi Univ. B.Com. (Hons.), 1997]

Solution.
We are given Mode = Mo = 88. Karl Pearson’s coefficient of skewness is :

$\text{Sk} = \dfrac{M - \text{Mo}}{\sigma} = \dfrac{M - 88}{\sigma} = 0.4$ (given) …(*)

$\text{C.V.} = \dfrac{100\,\sigma}{M} = 30$ (given) ⇒ $\sigma = \dfrac{30M}{100} = 0.3M$ …(**)

From (*) and (**), we get

$M - 88 = 0.4\sigma = 0.4 \times 0.3M$ ⇒ $M - 0.12M = 88$ ⇒ $(1 - 0.12)M = 88$ ⇒ $M = \dfrac{88}{0.88} = 100$

Substituting in (**), we get σ = 0·3 × 100 = 30.

Using the empirical relation between mean, median and mode for a moderately asymmetrical distribution viz., Mo = 3Md – 2M ⇒ 3Md = Mo + 2M, we get :

$\text{Md} = \tfrac{1}{3}(\text{Mo} + 2M) = \tfrac{1}{3}(88 + 2 \times 100) = \dfrac{288}{3} = 96$

Hence, Mean = 100 and Median = 96.

Example 7·8. Pearson’s measure of skewness of a distribution is 0·5. Its median and mode are respectively 42 and 36. Find the Coefficient of Variation.

Solution. We are given : Median = 42, Mode = 36 …(*)

and Pearson’s coefficient of skewness = 0·5 ⇒ $\text{Sk} = \dfrac{\text{Mean} - \text{Mode}}{\sigma} = 0.5$ …(**)

To find s.d. (σ), we shall first find the value of the mean, by using the empirical relationship between mean, median and mode for a moderately asymmetrical distribution viz.,

Mode = 3 Median – 2 Mean ⇒ $\text{Mean} = \dfrac{3\,\text{Median} - \text{Mode}}{2} = \dfrac{3 \times 42 - 36}{2} = \dfrac{126 - 36}{2} = \dfrac{90}{2} = 45$

Substituting in (**), we get : (s.d.) $\sigma = \dfrac{\text{Mean} - \text{Mode}}{0.5} = \dfrac{45 - 36}{0.5} = \dfrac{9}{0.5} = 18$

∴ Coefficient of Variation (C.V.) $= \dfrac{100 \times \sigma}{\text{Mean}} = \dfrac{100 \times 18}{45} = 40$

Hence, C.V. is 40.

Example 7·9. The sum of 50 observations is 500 and the sum of their squares is 6,000 and median is 12. Compute the coefficient of variation and the coefficient of skewness. [Delhi Univ. B.Com. (Pass), 2000]

Solution. In the usual notations, we are given : n = 50, ∑x = 500 and ∑x² = 6,000 ; Median = 12

Mean $(\bar{x}) = \dfrac{\sum x}{n} = \dfrac{500}{50} = 10$ ; $\sigma^2 = \dfrac{\sum x^2}{n} - \bar{x}^2 = \dfrac{6{,}000}{50} - 100 = 20$ ⇒ $\sigma = \sqrt{20} = 4.47$

Coefficient of Variation (C.V.) $= \dfrac{100\,\sigma}{\bar{x}} = \dfrac{100 \times 4.47}{10} = 44.7$

Karl Pearson’s coefficient of skewness is given by :

$\text{Sk} = \dfrac{3(\text{Mean} - \text{Median})}{\sigma} = \dfrac{3(10 - 12)}{4.47} = \dfrac{-6}{4.47} = -1.34$.

Example 7·10.
The following information was obtained from the records of a factory relating to the wages.

Arithmetic Mean : Rs. 56·80; Median : Rs. 59·50; S.D. : Rs. 12·40

Give as much information as you can about the distribution of wages.

Solution. We can obtain the following information from the above data :

(i) Since Md = Rs. 59·50, we conclude that 50% of the workers in the factory obtain wages above Rs. 59·50.

(ii) $\text{C.V.} = \dfrac{100\,\sigma}{M} = \dfrac{100 \times 12.40}{56.80} = 21.83$

(iii) Karl Pearson’s coefficient of skewness is given by :

$\text{Sk} = \dfrac{3(M - \text{Md})}{\sigma} = \dfrac{3(56.80 - 59.50)}{12.40} = \dfrac{3 \times (-2.70)}{12.40} = \dfrac{-8.1}{12.4} = -0.65$

Hence, the distribution of wages is negatively skewed i.e., it has a longer tail towards the left.

(iv) Using the empirical relation between M, Md and Mo for a moderately asymmetrical distribution, we get

Mo = 3Md – 2M = 3 × 59·50 – 2 × 56·80 = 178·50 – 113·60 = Rs. 64·90.

Example 7·11. You are given the position in a factory before and after the settlement of an industrial dispute. Comment on the gains or losses from the point of view of workers and that of management.

| | Before | After |
|---|---|---|
| No. of workers | 3,000 | 2,900 |
| Mean wage (in Rs.) | 220 | 230 |
| Median wage (in Rs.) | 250 | 240 |
| Standard deviation (in Rs.) | 30 | 26 |

[Delhi Univ. B.Com. (Hons.), 2006]

Solution. On the basis of the above data we are in a position to make the following comments :

(i) The number of workers after the dispute has decreased from 3000 to 2900. Obviously this is a definite loss to the persons thrown out or retrenched. It may also be a loss to the management if their retrenchment affects the efficiency of work adversely.

(ii) We know that :

$\text{Average wage} = \dfrac{\text{Total wages paid}}{\text{Total No. of workers}}$ ⇒ Total wages paid = (Average wage) × (Total No. of workers)

Hence, Total wages paid by the management before the dispute = Rs. 3000 × 220 = Rs. 6,60,000
Total wages paid by the management after the dispute = Rs. 2900 × 230 = Rs.
6,67,000 Thus we see that the total wages paid by the management have gone up after the dispute (the additional wage bill being Rs. 7000), although the number of workers has been reduced from 3000 to 2900. This is due to the fact that the average wage per worker has increased after the dispute - which is a distinct advantage to the workers. It may be pointed out that the increased wages paid by the management (Rs. 7000) should not be viewed as a disadvantage to the management unless we have definite reasons to believe that the efficiency and productivity have not gone up after the dispute. However, the loss to the managements due to higher wage bill, will be more than compensated if after the dispute, there is a definite increase in the efficiency of the workers or/and increase in productivity. (iii) Although the number of workers has decreased from 3000 to 2900 after the dispute, the average wage per worker has gone up from Rs. 220 to 230. This might probably be a consequence of the retrenchment of casual labour or temporary labour working on daily wages or so with relatively lower wages. (iv) The median wage after the dispute has come down from Rs. 250 to Rs. 240. This implies that before the dispute upper 50% of the workers were getting wages above Rs. 250 whereas after the dispute they get wages only above Rs. 240. (v) Using the empirical relation between mean, median and mode (for a moderately asymmetrical distribution) viz., Mode = 3 Median – 2 Mean We get Mode (before dispute) = 3 × 250 – 2 × 220 = Rs. 310 Mode (after dispute) = 3 × 240 – 2 × 230 = Rs. 260 Thus, we find that the modal wage has come down from Rs. 310 (before dispute) to Rs. 260 (after dispute). Thus after the dispute there is concentration of wages around a much smaller value. × (s.d.) C.V. = 100Mean (vi) ∴ C.V. (before dispute) = 100 × 30 220 = 13·64 and × 26 C.V. (after dispute) = 100230 = 11·30 Since C.V. 
has decreased from 13·64 to 11·30, the distribution of wages has become less variable i.e., more consistent or uniform after the settlement of the dispute. Thus, after the settlement there are less disparities in wages and from management point it will result in greater satisfaction to the workers. 7·9 SKEWNESS, MOMENTS AND KURTOSIS (vii) Since we are given mean and median, we can calculate Karl Pearson’s coefficient of skewness for studying the symmetry of the distribution of wages before the dispute and after the dispute viz., Sk = 3(M – Md))/σ Sk (before dispute) = 3(22030– 250) = –3 and Sk (after dispute) = 3(23026– 240) = –1·15 Thus the highly negatively skewed distribution (before the dispute) has become a moderately negatively skewed distribution (after the dispute). This implies that the curve of distribution of wages after the dispute has a less longer tail towards the left. In other words, the number of workers getting lower wages has increased. EXERCISE 7·1 1. (a) Explain the concept of skewness. Draw the sketch of a skewed frequency distribution and show the position of the mean, median and mode when the distribution is asymmetric. (Delhi Univ. B.Com., 1997) (b) Explain the concept of positive and negative skewness. (c) Show graphically the positions of mean, median and mode in a positively and negatively skewed series. [Delhi Univ. B.Com. (Pass), 1998] 2. Comment on the following : (a) In a symmetrical distribution, we have mean = median ≠ mode. (b) “A series representing U-shaped curve is symmetrical.” Comment. [Delhi Univ. B.Com. (Hons.) 2002] [Delhi Univ. B.Com. (Pass), 2001] 3. (a) The values of mean and median are 30 and 40 respectively in a frequency distribution. Is the distribution skewed? If yes, state the direction of skewness. [Delhi Univ. B.Com. (Pass), 2002] Ans. Mean < Median, the distribution is negatively skewed. (b) The mean for a symmetrical distribution is 50·6. Find the values of median and mode. [C.S. (Foundation), June 2001] Ans. 
Median = Mode = 50·6.

4. (a) Define Pearson’s measure of skewness. What is the difference between the relative measure and the absolute measure of skewness ?
(b) From the following data find out Karl Pearson’s coefficient of skewness :
Measurement :   10   11   12   13   14   15
Frequency   :    2    4   10    8    5    1
Ans. 0·3478.

5. Calculate Pearson’s coefficient of skewness from the following :
Wages (Rs.)    :  0–10   10–20   20–30   30–40   40–50
No. of Workers :    15      20      30      25      10
Ans. 0·1845.

6. Calculate Karl Pearson’s coefficient of skewness from the following data and explain its significance :
Wages          :  70–80   80–90   90–100   100–110   110–120   120–130   130–140   140–150
No. of Persons :     12      18       35        42        50        45        20         8
[Delhi Univ. B.Com. (Hons.), 2000]
Ans. M = Rs. 110·43, Mo = Rs. 116·15, σ = Rs. 17·26, Sk = –0·3316.

7. The mean, standard deviation and range of a symmetrical distribution of weights of a group of 20 boys are 40 kgs., 5 kgs. and 6 kgs. respectively. Find the mean and standard deviation of the group if the lightest and heaviest boys are excluded. [Delhi Univ. B.A. (Econ. Hons.), 2004]
Ans. Mean = 40, s.d. (σ) = 5·17.
Hint. Proceed as in Example 7·2.

8. Calculate Pearson’s measure of skewness on the basis of Mean, Mode and Standard Deviation.
Mid-value (X) :  14·5   15·5   16·5   17·5   18·5   19·5   20·5   21·5
f             :    35     40     48    100    125     87     43     22
Hint. The corresponding classes are : 14–15, 15–16, …, 21–22.
Ans. Mean = 18·07, Mode = 18·4, σ = 1·775, Skewness (Pearson) = –0·186.

9. From the following data of age of employees, calculate the coefficient of skewness and comment on the result :
Age below (yrs.) :  25   30   35   40   45   50    55
No. of employees :   8   20   40   65   80   92   100
[Delhi Univ. MBA, 1997]
Ans. x̄ = 37·25 yrs. ; Mo = 36·67 yrs. ; σ = 16·99 yrs. ; Sk (Karl Pearson) = 0·07.

10. Calculate Karl Pearson’s coefficient of skewness from the following series :
Wt. in kgs.    :  Below 40   40–50   50–60   60–70   70–80   80–90   90–100   100 and above
No. of persons :        10      16      18      25      20       4        4               3
Hint. Take the first class as 30–40 and the last class as 100–110.
Ans. Mean = 62·200 kg, Mode = 65·833 kg, σ = 16·857 kg, Sk = –0·2155.

11. Calculate Karl Pearson’s coefficient of skewness from the following data :
Marks (above)   :    0    10    20   30   40   50   60   70   80
No. of students :  150   140   100   80   80   70   30   14    0
Hint. Locate mode by the method of grouping ; two modal classes 10–20 and 50–60 ; mode ill-defined. Find median.
Ans. Sk = 3(M – Md)/σ = –0·6622.

12. The daily expenditure of 100 families is given below :
Daily Expenditure :  0–20   20–40   40–60   60–80   80–100
No. of Families   :    13       ?      27       ?       16
If the mode of the distribution is 44, calculate the Karl Pearson coefficient of skewness.
Ans. Frequency for the class 20–40 is 25 and for the class 60–80 is 19. Sk (Karl Pearson) = (50 – 44)/25·3 = 0·24.

13. The following facts are gathered before and after an industrial dispute :
                            Before dispute   After dispute
No. of workers employed                515             509
Mean wages                       Rs. 49·50       Rs. 52·70
Median wages                     Rs. 52·80       Rs. 50·00
Variance of wages (Rs.)²            121·00          144·00
Compare the position before and after the dispute in respect of (a) total wages, (b) modal wages, (c) standard deviation, and (d) skewness.
Ans.
                       Before dispute    After dispute
(i) Total wages         Rs. 25,492·50    Rs. 26,849·75
(ii) Modal wages            Rs. 59·40        Rs. 44·50
(iii) C.V.                      22·22            22·74
(iv) Skewness                   –0·90             0·69

14. You are given the position in a factory before and after the settlement of an industrial dispute.
                            Before Dispute   After Dispute
No. of workers                       3,000           2,900
Mean wages (Rs.)                       220             230
Median wages (Rs.)                     250             240
Standard deviation (Rs.)                30              26
Comment on the gains and losses from the point of view of workers and that of management. [Delhi Univ. B.Com. (Hons.), 2006]
Hint. Proceed as in Example 7·11.

15. You are given below the following details relating to the wages in respect of two factories, from which it is concluded that the skewness and variability are the same in both the factories.
                    Factory A   Factory B
Arithmetic Mean :          50          45
Mode            :          45          50
Variance        :         100         100
Point out the mistake or the wrong inference in the above statement.
Ans. C.V. (A) = 20, C.V. (B) = 22·2 ; Sk (A) = +0·5, Sk (B) = –0·5.

16. The sum of 20 observations is 300 and its sum of squares is 5,000 and median is 15. Find the coefficient of skewness and coefficient of variation. [Delhi Univ. B.Com. (Hons.), 2007]
Ans. Sk = 0, C.V. = (5/15) × 100 = 33·3.

17. For a group of 10 items ∑X = 452, ∑X² = 24,270 and Mode = 43·7. Find the Pearsonian coefficient of skewness.
Ans. Sk = 0·08.

18. If the mode and mean of a moderately asymmetrical series are respectively 16 inches and 15·6 inches, what would be its most probable median ?
Ans. Median = 15·73 inches.

19. In a slightly skew distribution the arithmetic mean is Rs. 45 and the median is Rs. 48. Find the approximate value of mode.
Ans. Mode = Rs. 54.

20. In a frequency distribution, Karl Pearson’s coefficient of skewness revealed that the distribution was skewed to the left to an extent of 0·6. Its mean value was less than its modal value by 4·8. What was the standard deviation ?
Ans. σ = 8.

21. In a distribution mean = 65, median = 70 and the coefficient of skewness is –0·6. Find mode and coefficient of variation. (Assume that the distribution is moderately asymmetrical.)
Ans. Mode = 80, C.V. = 38·46.

22. In a certain distribution the following results were obtained :
Arithmetic Mean (X̄) = 45 ; Median = 48 ; Coefficient of skewness = –0·4
The person who gave you this data failed to give you the S.D. (Standard Deviation). You are required to estimate it with the help of the above data. [Delhi Univ. B.Com. (Hons.), 1997]
Ans. σ = 22·5.

23. Karl Pearson’s measure of skewness of a distribution is 0·5. The median and mode of the distribution are respectively 42 and 32. Find : (i) Mean, (ii) the S.D., (iii) the coefficient of variation. [Delhi Univ. B.A. (Econ. Hons.), 2000]
Ans. (i) 47, (ii) 30, (iii) 63·83.

24.
Karl Pearson’s coefficient of skewness of a distribution is 0·32. Its standard deviation is 6·5 and mean is 29·6. Find the mode and median of the distribution. If the mode of the above is 24·8, what will be the standard deviation ? [Delhi Univ. B.Com. (Hons.), 1998]
Ans. Mode = 27·52, Median = 28·91, σ = 15.

25. Karl Pearson’s coefficient of skewness of a distribution is +0·40. Its standard deviation is 8 and mean is 30. Find the mode and median of the distribution.
Ans. Mode = 26·8, Median = 28·93.

26. The median, mode and coefficient of skewness for a certain distribution are respectively 17·4, 15·3 and 0·35. Calculate the coefficient of variation.
Ans. Mean = 18·45 ; C.V. = 48·78.

27. Karl Pearson’s coefficient of skewness of a distribution is 0·5. Its mean is 30 and coefficient of variation is 20%. Find the mode and median of the distribution. [Delhi Univ. B.A. (Econ. Hons.), 2007]
Ans. σ = 6, Median = 29, Mode = 27.

28. (a) A frequency distribution is positively skewed. The mean of the distribution is : Greater than the mode, Less than the mode, Equal to the mode, None of these. Tick the correct answer.
(b) In a moderately skewed distribution the values of mean and median are 5 and 6 respectively. The value of mode in such a situation is approximately equal to … (i) 8 (ii) 11 (iii) 16 (iv) None of these.
Ans. (a) M > Mo, (b) : (i) 8.

29. State the empirical relationship among mean, median and mode in a symmetrical and moderately asymmetrical frequency distribution. How does it help in estimating mode and measuring skewness ? Given that the mean of a distribution is 50 and the mode is 58 : (i) Calculate the median ; (ii) What can you say about the shape of the distribution ? Explain why. [Delhi Univ. B.A. (Econ. Hons.), 2009]
Ans. (i) Md = 52·67 ; (ii) M < Md < Mo ⇒ Distribution is negatively skewed.

30. In a moderately asymmetrical distribution : (i) The mode and median are 300 and 240 respectively. Find the value of mean.
(ii) Mean = 200 and Mode = 150. Find the value of median. [C.S. (Foundation), Dec. 2001]
Ans. (i) Mean = 210, (ii) Median = 183·33.

31. Which group is more skewed ? (i) Mean = 22 ; Median = 24 ; s.d. = 10. (ii) Mean = 22 ; Median = 25 ; s.d. = 12.
Ans. Sk (i) = –0·60 ; Sk (ii) = –0·75. Group (ii) is more skewed to the left.

32. What is the relationship between mean, mode and median ? What is the condition under which this relationship holds ? Locate graphically the position of the three measures in the case of both negatively as well as positively skewed distribution.
Ans. Mode = 3 Median – 2 Mean, for a moderately asymmetrical distribution.

7·2·3. Bowley’s Coefficient of Skewness. Prof. A.L. Bowley’s coefficient of skewness is based on the quartiles and is given by :

$$\text{Sk} = \frac{(Q_3 - Md) - (Md - Q_1)}{(Q_3 - Md) + (Md - Q_1)} \quad\Rightarrow\quad \text{Sk} = \frac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1}$$ …(7·11)

Remarks 1. Bowley’s coefficient of skewness is also known as the Quartile coefficient of skewness and is especially useful in situations where quartiles and median are used, viz. :
(i) When the mode is ill-defined and extreme observations are present in the data.
(ii) When the distribution has open end classes or unequal class intervals.
In these situations, Pearson’s coefficient of skewness cannot be used.

2. From (7·11), we observe that :

$$\text{Sk} = 0, \ \text{if}\ Q_3 - Md = Md - Q_1$$ …(7·12)

This implies that for a symmetrical distribution (Sk = 0), the median is equidistant from the upper and lower quartiles. Moreover, skewness is positive if :

$$Q_3 - Md > Md - Q_1 \ \Rightarrow\ Q_3 + Q_1 > 2Md$$ …(7·13)

and skewness is negative if :

$$Q_3 - Md < Md - Q_1 \ \Rightarrow\ Q_3 + Q_1 < 2Md$$ …(7·14)

3. Limits for Bowley’s Coefficient of Skewness.

$$|\,\text{Sk (Bowley)}\,| \le 1 \ \Rightarrow\ -1 \le \text{Sk (Bowley)} \le 1$$ …(7·15)

Thus, Bowley’s coefficient of skewness ranges from –1 to 1. Further, we note from (7·11) that :

$$\text{Sk} = +1, \ \text{if}\ Md - Q_1 = 0, \ \text{i.e., if}\ Md = Q_1$$ …(7·15a)

$$\text{and}\quad \text{Sk} = -1, \ \text{if}\ Q_3 - Md = 0, \ \text{i.e., if}\ Q_3 = Md$$ …(7·15b)

4.
It should be clearly understood that the values of the coefficient of skewness obtained by Bowley’s formula and Pearson’s formula are not comparable, although in each case, Sk = 0 implies the absence of skewness i.e., the distribution is symmetrical. It may even happen that one of them gives positive skewness while the other gives negative skewness. (See Example 7·17)

5. The only, and perhaps quite serious, limitation of this coefficient is that it is based only on the central 50% of the data and ignores the remaining 50% of the data towards the extremes.

7·2·4. Kelly’s Measure of Skewness. The drawback of Bowley’s coefficient of skewness (viz., that it ignores the 50% of the data towards the extremes) can be partially removed by taking two deciles or percentiles equidistant from the median value. The refinement was suggested by Kelly. Kelly’s percentile (or decile) measure of skewness is given by :

$$\text{Sk} = (P_{90} - P_{50}) - (P_{50} - P_{10}) = P_{90} + P_{10} - 2P_{50}$$ …(7·16)

But P₉₀ = D₉ and P₁₀ = D₁. Hence, (7·16) can be re-written as :

$$\text{Sk} = (D_9 - D_5) - (D_5 - D_1) = D_9 + D_1 - 2D_5$$ …(7·16a)

where Pᵢ and Dᵢ are the ith percentile and decile respectively of the given distribution. (7·16) or (7·16a) gives an absolute measure of skewness. However, for practical purposes, we generally compute the coefficient of skewness, which is given by :

$$\text{Sk (Kelly)} = \frac{(P_{90} - P_{50}) - (P_{50} - P_{10})}{(P_{90} - P_{50}) + (P_{50} - P_{10})} = \frac{P_{90} + P_{10} - 2P_{50}}{P_{90} - P_{10}}$$ …(7·17)

$$= \frac{D_9 + D_1 - 2D_5}{D_9 - D_1}$$ …(7·17a)

Remarks 1. We have : D₅ = P₅₀ = Median. Hence, from (7·17) and (7·17a), we get

$$\text{Sk (Kelly)} = \frac{P_{90} + P_{10} - 2Md}{P_{90} - P_{10}} = \frac{D_9 + D_1 - 2Md}{D_9 - D_1}$$ …(7·18)

2. This method is primarily of theoretical importance only and is seldom used in practice.

7·2·5. Coefficient of Skewness based on Moments. This coefficient is based on the 2nd and 3rd moments about mean and is discussed in detail in § 7·5.

Example 7·12.
Comment on the following :
(i) The mode of a distribution cannot be less than arithmetic mean.
(ii) If Q1, Q2, Q3 be respectively the lower quartile, the median and the upper quartile of a distribution, then Q2 – Q1 = Q3 – Q2.

Solution. (i) The given statement is not true in general. For a positively skewed distribution the arithmetic mean is greater than the mode, i.e., the mode is less than the mean. The mode is not less than the mean only in the other two cases : Mean = Mode for a symmetrical distribution, while for a negatively skewed distribution Mean < Mode.

(ii) The statement : Q3 – Q2 = Q2 – Q1, is true only for a symmetrical distribution. However, if the distribution is skewed, then Q3 – Q2 ≠ Q2 – Q1.
If Q3 – Q2 > Q2 – Q1 ⇒ Q1 + Q3 > 2Q2, the distribution is positively skewed,
and if Q3 – Q2 < Q2 – Q1 ⇒ Q1 + Q3 < 2Q2, the distribution is negatively skewed.

Example 7·13. Calculate Bowley’s coefficient of skewness of the following data :

Weight (lbs)    No. of persons
Under 100                    1
100–109                     14
110–119                     66
120–129                    122
130–139                    145
140–149                    121
150–159                     65
160–169                     31
170–179                     12
180–189                      5
190–199                      2
200 and over                 2

Solution. Here we are given the frequency distribution with inclusive type classes. Since the formulae for median and quartiles are based on a continuous frequency distribution with exclusive type classes without any gaps, we obtain the class boundaries, which are given in the last column of the table below.

COMPUTATION OF QUARTILES

Weight (lbs)   No. of persons (f)   ‘Less than’ c.f.   Class boundaries
0–99                            1                  1   0–99·5
100–109                        14                 15   99·5–109·5
110–119                        66                 81   109·5–119·5
120–129                       122                203   119·5–129·5
130–139                       145                348   129·5–139·5
140–149                       121                469   139·5–149·5
150–159                        65                534   149·5–159·5
160–169                        31                565   159·5–169·5
170–179                        12                577   169·5–179·5
180–189                         5                582   179·5–189·5
190–199                         2                584   189·5–199·5
200–209                         2            N = 586   199·5–209·5

Here N = 586, N/4 = 146·5, N/2 = 293, 3N/4 = 439·5.

The c.f. just greater than N/2 i.e., 293 is 348. Hence, the corresponding class 129·5–139·5 is the median class. Using the Median formula, we get

$$Md\ (Q_2) = 129·5 + \frac{10}{145}(293 - 203) = 129·5 + \frac{10 \times 90}{145} = 129·50 + 6·21 = 135·71$$

The c.f. just greater than N/4 i.e., 146·5 is 203. Hence, the corresponding class 119·5–129·5 contains Q1.

$$\therefore\ Q_1 = 119·5 + \frac{10}{122}(146·5 - 81) = 119·5 + \frac{10 \times 65·5}{122} = 119·50 + 5·37 = 124·87$$

The c.f. just greater than 3N/4 i.e., 439·5 is 469. Hence, the corresponding class 139·5–149·5 contains Q3.

$$\therefore\ Q_3 = 139·5 + \frac{10}{121}(439·5 - 348) = 139·5 + \frac{10 \times 91·5}{121} = 139·50 + 7·56 = 147·06$$

Bowley’s coefficient of skewness is given by :

$$\text{Sk (Bowley)} = \frac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1} = \frac{147·06 + 124·87 - 2 \times 135·71}{147·06 - 124·87} = \frac{271·93 - 271·42}{22·19} = \frac{0·51}{22·19} = 0·0230$$

Example 7·14. Calculate the coefficient of skewness from the following data by using quartiles.

Marks       No. of students
Above 0                 180
Above 15                160
Above 30                130
Above 45                100
Above 60                 65
Above 75                 20
Above 90                  5

Solution. We are given a ‘more than’ cumulative frequency distribution. To compute quartiles, we first express it as a continuous frequency distribution without any gaps, as given in the table below.

COMPUTATION OF QUARTILES

Marks       Number of students (f)   ‘Less than’ c.f.
0–15            180 – 160 = 20                     20
15–30           160 – 130 = 30                     50
30–45           130 – 100 = 30                     80
45–60           100 – 65 = 35                     115
60–75           65 – 20 = 45                      160
75–90           20 – 5 = 15                       175
above 90        5                                 180
Total           N = ∑f = 180

$$\frac{N}{2} = \frac{180}{2} = 90, \qquad \frac{N}{4} = \frac{180}{4} = 45 \qquad\text{and}\qquad \frac{3N}{4} = 135$$

The c.f. just greater than N/2 i.e., 90 is 115. Hence the corresponding class 45–60 is the median class.

$$\therefore\ Md = l + \frac{h}{f}\left(\frac{N}{2} - C\right) = 45 + \frac{15}{35}(90 - 80) = 45 + \frac{15 \times 10}{35} = 45 + 4·29 = 49·29$$

The c.f. just greater than N/4 i.e., 45 is 50. Hence, the corresponding class 15–30 contains Q1.

$$\therefore\ Q_1 = l + \frac{h}{f}\left(\frac{N}{4} - C\right) = 15 + \frac{15}{30}(45 - 20) = 15 + \frac{15 \times 25}{30} = 15 + 12·5 = 27·5$$

The c.f. just greater than 3N/4 i.e., 135 is 160. Hence, the corresponding class 60–75 contains Q3.

$$\therefore\ Q_3 = l + \frac{h}{f}\left(\frac{3N}{4} - C\right) = 60 + \frac{15}{45}(135 - 115) = 60 + \frac{20}{3} = 60 + 6·67 = 66·67$$

Hence, Bowley’s coefficient of skewness based on quartiles is given by :

$$\text{Sk (Bowley)} = \frac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1} = \frac{66·67 + 27·50 - 2 \times 49·29}{66·67 - 27·50} = \frac{94·17 - 98·58}{39·17} = -\frac{4·41}{39·17} = -0·1126$$

Example 7·15. If the first quartile is 142 and the semi-interquartile range is 18, find the median (assuming the distribution to be symmetrical). [Delhi Univ. B.Com. (Hons.), 1997]

Solution. We are given :

$$Q_1 = 142 \quad\text{and}\quad \frac{Q_3 - Q_1}{2} = 18 \ \Rightarrow\ Q_3 = Q_1 + 2 \times 18 = 142 + 36 = 178$$

If the distribution is symmetrical, then

$$\text{Sk (Bowley)} = \frac{(Q_3 - Md) - (Md - Q_1)}{(Q_3 - Md) + (Md - Q_1)} = 0 \ \Rightarrow\ Q_3 - Md = Md - Q_1$$

$$\therefore\ Md = \tfrac{1}{2}(Q_1 + Q_3) = \frac{142 + 178}{2} = 160$$

Example 7·16. (a) In a frequency distribution the coefficient of skewness based on quartiles is 0·6. If the sum of the upper and lower quartiles is 100 and median is 38, find the value of upper and lower quartiles.
(b) The mean, mode and Q.D. of a distribution are 42, 36 and 15 respectively. If its Bowley’s coefficient of skewness is 1/3, find the values of the two quartiles. Also state the empirical relationship between mean, mode and median. [Delhi Univ. B.Com. (Hons.), 2008]

Solution. (a) We are given :

$$\text{Sk} = \frac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1} = 0·6$$ …(*)

Also Q3 + Q1 = 100 and Median = 38 …(**)

Substituting from (**) in (*), we get

$$\frac{100 - 2 \times 38}{Q_3 - Q_1} = 0·6 \ \Rightarrow\ Q_3 - Q_1 = \frac{100 - 76}{0·6} = \frac{24}{0·6} = 40$$

Thus, we have : Q3 + Q1 = 100 and Q3 – Q1 = 40. Adding and subtracting, we get respectively

$$2Q_3 = 140 \ \Rightarrow\ Q_3 = \frac{140}{2} = 70 \qquad\text{and}\qquad 2Q_1 = 60 \ \Rightarrow\ Q_1 = \frac{60}{2} = 30$$

Hence, the values of the upper and lower quartiles are 70 and 30 respectively.
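The arithmetic in Example 7·14 and in part (a) above can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names (`grouped_quantile`, `bowley_skewness`, `quartiles_from_bowley`) are hypothetical, all classes are assumed to have the same width, and the open class "above 90" is treated as having that width too (it is never reached for the quartiles computed here).

```python
def grouped_quantile(lower_limits, width, freqs, p):
    """Quantile at fraction p of a grouped distribution, using the
    interpolation formula l + (h/f) * (p*N - C) from the text.
    Assumes equal class widths (an assumption of this sketch)."""
    total = sum(freqs)
    target = p * total
    cum = 0.0  # C: cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cum + f > target:  # first class whose c.f. is just greater than p*N
            return lower + width * (target - cum) / f
        cum += f
    raise ValueError("p must satisfy 0 <= p < 1")


def bowley_skewness(q1, md, q3):
    """Bowley's coefficient, formula (7.11)."""
    return (q3 + q1 - 2 * md) / (q3 - q1)


def quartiles_from_bowley(sk, quartile_sum, md):
    """Recover (Q1, Q3) from Sk, Q3 + Q1 and the median, as in
    Example 7.16(a). Requires sk != 0."""
    diff = (quartile_sum - 2 * md) / sk  # Q3 - Q1
    return (quartile_sum - diff) / 2, (quartile_sum + diff) / 2


if __name__ == "__main__":
    # Example 7.14: classes 0-15, 15-30, ..., 75-90, above 90
    lower = [0, 15, 30, 45, 60, 75, 90]
    freqs = [20, 30, 30, 35, 45, 15, 5]
    q1 = grouped_quantile(lower, 15, freqs, 0.25)
    md = grouped_quantile(lower, 15, freqs, 0.50)
    q3 = grouped_quantile(lower, 15, freqs, 0.75)
    print(q1, md, q3, bowley_skewness(q1, md, q3))

    # Example 7.16(a): Sk = 0.6, Q3 + Q1 = 100, Md = 38
    print(quartiles_from_bowley(0.6, 100, 38))
```

Because the sketch works with unrounded quartiles, the coefficient it prints for Example 7·14 can differ from the hand-computed value in the fourth decimal place.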
(b) We are given : Mean = 42, Mode = 36 …(1)

$$\text{and}\quad \text{Q.D.} = \frac{Q_3 - Q_1}{2} = 15 \ \Rightarrow\ Q_3 - Q_1 = 30$$ …(2)

The empirical relation between mean, mode and median is :

Mean – Mode = 3 (Mean – Median) ⇒ Mode = 3 Median – 2 Mean …(3)

From (1) and (3), we get

$$36 = 3Md - 2 \times 42 \ \Rightarrow\ Md = \tfrac{1}{3}(36 + 84) = 40$$

Bowley’s coefficient of skewness is given by :

$$\text{Sk (Bowley)} = \frac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1} = \frac{1}{3} \ \text{(Given)}$$

$$\Rightarrow\ Q_3 + Q_1 = 2Md + \tfrac{1}{3}(Q_3 - Q_1) = 2 \times 40 + \frac{30}{3} = 90$$ …(4)

∴ Q3 – Q1 = 30 and Q3 + Q1 = 90. [From (2) and (4)]
Adding and subtracting, we get respectively
2Q3 = 120 ⇒ Q3 = 60 and 2Q1 = 60 ⇒ Q1 = 30.

Example 7·17. From the information given below, calculate Karl Pearson’s coefficient of skewness and also quartile coefficient of skewness :

Measure               Place A   Place B
Mean                      150       140
Median                    142       155
Standard deviation         30        55
Third quartile            195       260
First quartile             62        80

Solution. Place A :

$$\text{Sk (Karl Pearson)} = \frac{3(M - Md)}{\sigma} = \frac{3(150 - 142)}{30} = \frac{8}{10} = 0·8$$

$$\text{Sk (Bowley)} = \frac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1} = \frac{195 + 62 - 2 \times 142}{195 - 62} = \frac{257 - 284}{133} = -\frac{27}{133} = -0·203$$

Place B :

$$\text{Sk (Karl Pearson)} = \frac{3(M - Md)}{\sigma} = \frac{3(140 - 155)}{55} = \frac{3 \times (-15)}{55} = -\frac{9}{11} = -0·82$$

$$\text{Sk (Bowley)} = \frac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1} = \frac{260 + 80 - 2 \times 155}{260 - 80} = \frac{340 - 310}{180} = \frac{1}{6} = 0·167$$

Note : See Remark 4 to § 7·2·3.

EXERCISE 7·2

1. What is ‘Skewness’ ? What are the tests of skewness ? Draw different sketches to indicate different types of skewness and locate roughly the relative positions of mean, median and mode in each case.

2. Give any three measures of skewness of a frequency distribution. Explain briefly (not exceeding ten lines) with suitable diagrams the term ‘Skewness’ as mentioned above.

3. Distinguish between Karl Pearson’s and Bowley’s measures of skewness. Which one of these would you prefer and why ? [Delhi Univ. MBA, 2000]

4. Explain the meaning of skewness using sketches of frequency curves. State the different measures of skewness that are commonly used. How does skewness differ from dispersion ?

5.
What are quartiles ? How are they used to measure skewness ?

6. Describe Bowley’s measure of skewness. Show that it lies between ±1. Under what conditions are these limits attained ?

7. The weekly wages earned by one hundred workers of a factory are as follows. Find the absolute measures of dispersion and skewness based on quartiles and interpret the results.

Weekly wages (’00 Rs.)   No. of workers
12·5–17·5                            12
17·5–22·5                            16
22·5–27·5                            25
27·5–32·5                            14
32·5–37·5                            13
37·5–42·5                            10
42·5–47·5                             6
47·5–52·5                             3
52·5–57·5                             1

Ans. Q.D. = 7·01 ; Sk = (Q3 – Md) – (Md – Q1) = 3·34.

8. Find out the coefficient of skewness using Bowley’s formula from the following figures :
Income (in Rs.) :  100–199   200–299   300–399   400–499   500–599   600–699   700–799   800–899
No. of Persons  :       39        25        49        62        38        37        32        18
Ans. Sk = 0·1146.

9. The figures in the table below relate to the size of capital of companies :

Capital in Lakhs of Rs.   No. of Companies
1–5                                     20
6–10                                    27
11–15                                   29
16–20                                   38
21–25                                   48
26–30                                   53
31–35                                   70

Find out (i) the median size of the capital ; (ii) the coefficient of skewness with the help of Bowley’s measure of skewness. What conclusion do you draw from the skewness measured by you ?
Ans. Median = 23·47 ; Sk (Bowley) = –0·119.

10. For the frequency distribution given below, calculate the coefficient of skewness based on the quartiles :
Class Limits :  10–19   20–29   30–39   40–49   50–59   60–69   70–79   80–89
Frequency    :      5       9      14      20      25      15       8       4
Ans. –0·103.

11. Calculate the coefficient of skewness based on quartiles and median from the following data :
Variable  :  0–10   10–20   20–30   30–40   40–50   50–60   60–70   70–80
Frequency :    12      16      26      38      22      15       7       4
[Osmania Univ. B.Com., 1998]
Ans. Q1 = 22·69, Q3 = 45·91, Median (Q2) = 34·21, Sk (Bowley) = 0·008.

12. By using the quartiles, find a measure of skewness for the following distribution.

Annual Sales (Rs. ’000)   No. of firms
Less than 20                        30
Less than 30                       225
Less than 40                       465
Less than 50                       580
Less than 60                       634
Less than 70                       644
Less than 80                       650
Less than 90                       665
Less than 100                      680

[Bangalore Univ. B.Com., 1998]
Ans. Q1 = 27·18 ; Q3 = 43·90 ; Median = 34·79 ; Sk (Bowley) = 0·0903.

13. Calculate Bowley’s coefficient of skewness for the following data :

Income equal to or more than   No. of Persons
100                                      1000
200                                       950
300                                       700
400                                       600
500                                       500
600                                       200
700                                       150
800                                       100
900                                        50
1000                                        0

Assume that income is a continuous variable.
Ans. Sk = –0·4506.

14. (a) Find out the Quartile coefficient of skewness in a series where Q1 = 18, Q3 = 25, Mode = 21 and Mean = 18. [Delhi Univ. B.Com. (Pass) 1999]
Hint. Mode = 3Md – 2 Mean ⇒ Md = 19.
Ans. 0·714.
(b) Find the coefficient of skewness from the following information : Difference of two quartiles = 8 ; Mode = 11 ; Sum of two quartiles = 22 ; Mean = 8. [Delhi Univ. B.Com. (Hons.), 1997]
Hint. Q3 – Q1 = 8, Q3 + Q1 = 22 ⇒ Q1 = 7, Q3 = 15 ; Mo = 3Md – 2M ⇒ Md = (11 + 16)/3 = 9.
Ans. Sk (Bowley) = 0·5.

15. (a) If the quartile coefficient of skewness is 0·5, quartile deviation is 8 and the first quartile is 16, find the median of the distribution. [Delhi Univ. B.A. (Econ. Hons.), 2002]
Ans. Median = 20.
(b) The measure of skewness for a certain distribution is –0·8. If the lower and the upper quartiles are 44·1 and 56·6 respectively, find the median.
Ans. Md = 55·35.

16. (a) The coefficient of skewness for a certain distribution based on the quartiles is –0·8. If the sum of the upper and lower quartiles is 100·7 and median is 55·35, find the values of the upper and lower quartiles.
Ans. Q3 = 56·6 ; Q1 = 44·1.
(b) In a frequency distribution, the coefficient of skewness based on quartiles is 0·6. If the sum of upper and lower quartiles is 100 and median is 38, find the values of lower and upper quartiles. [Delhi Univ. B.Com. (Hons.), 1993]
Ans. Q1 = 30, Q3 = 70.

17.
In a distribution, the difference of the two quartiles is 15 and their sum is 35 and the median is 20. Find the coefficient of skewness.
Ans. Sk (Bowley) = –0·33.

18. If the First Quartile is 48 and Quartile Deviation is 6, find the Median (assuming the distribution to be symmetrical). [Delhi Univ. B.Com. (Pass), 1996]
Ans. Md = ½ (Q3 + Q1) = (48 + 60)/2 = 54.

19. Fill in the blank : “If Q1 = 6, Q3 = 10 and Bowley’s coefficient of skewness is 0·5, then the value of median will be equal to….”
Ans. Md = 7.

20. Particulars relating to the wage distribution of two manufacturing firms are as follows :

                 Firm ‘A’ (Rs.)   Firm ‘B’ (Rs.)
Mean wage                   175              180
Median wage                 172              170
Modal wage                  167              162
Quartiles             162 ; 178        165 ; 185
S.D.                         13               19

Compare the two distributions.
Ans. C.V. (Firm A) = 7·43, C.V. (Firm B) = 10·56 ; Bowley’s coeff. of skewness (Firm A) = –0·25, skewness (Firm B) = 0·5 ; Karl Pearson coeff. of skewness (Firm A) = 0·615, skewness (Firm B) = 0·947.

21. Find out Bowley’s coefficient of skewness from the following data and show which section is more skewed :
Income in ’00 Rs. :  55–58   58–61   61–64   64–67   67–70
Section A         :     12      17      23      18      11
Section B         :     20      22      25      13       4
Ans. Bowley’s coefficient of skewness (A) = –0·0061 ; Bowley’s coefficient of skewness (B) = –0·06. For the comparison of these negative skewnesses, we compare their absolute values. Section B is more skewed.

7·3. MOMENTS

Moment is a term generally used in physics or mechanics and provides us a measure of the turning or the rotating effect of a force about some point. The moment of a force, say, F about some point P is given by the product of the magnitude of the force (F) and the perpendicular distance (p) between the point of reference and the direction of the force [Fig. 7·4] i.e.,

Moment = p × F

[Fig. 7·4]

However, the term moment as used in physics has nothing to do with the moment used in Statistics, the only analogy being that in Statistics we talk of the moment of a random variable about some point, and these moments are used to describe the various characteristics of a frequency distribution viz., central tendency, dispersion, skewness and kurtosis.

Let the random variable X have a frequency distribution

X |  x1   x2   x3   …   xn
f |  f1   f2   f3   …   fn   | ∑f = N

Let

$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{\sum f x}{N}$$

be its arithmetic mean.

7·3·1. Moments about Mean. The rth moment of X about the mean x̄, usually denoted by μr [where μ is the letter Mu of the Greek alphabet], is defined as

$$\mu_r = \frac{1}{N} \sum f (x - \bar{x})^r ; \quad r = 0, 1, 2, 3, …$$ …(7·19)

$$= \frac{1}{N} \left[ f_1 (x_1 - \bar{x})^r + f_2 (x_2 - \bar{x})^r + … + f_n (x_n - \bar{x})^r \right]$$ …(7·19a)

In particular, putting r = 0 in (7·19), we get

$$\mu_0 = \frac{1}{N} \sum f (x - \bar{x})^0 = \frac{1}{N} \sum f = \frac{1}{N} \cdot N \qquad [\because\ (x - \bar{x})^0 = 1 \ \text{and}\ \sum f = N]$$

$$\Rightarrow\ \mu_0 = 1$$ …(7·20)

Putting r = 1 in (7·19), we get

$$\mu_1 = \frac{1}{N} \sum f (x - \bar{x}) = 0,$$ …(7·21)

because the algebraic sum of deviations of a given set of observations from their mean is zero. Thus, the first moment about mean is always zero. Again taking r = 2, we get

$$\mu_2 = \frac{1}{N} \sum f (x - \bar{x})^2 = \sigma_x^2$$ …(7·22)

Hence, the second moment about mean gives the variance of the distribution. Also

$$\mu_3 = \frac{1}{N} \sum f (x - \bar{x})^3 \quad\text{and}\quad \mu_4 = \frac{1}{N} \sum f (x - \bar{x})^4$$ …(7·23)

7·3·2. Moments about Arbitrary Point A. The rth moment of X about any arbitrary point A, usually denoted by μr′, is defined as :

$$\mu_r' = \frac{1}{N} \sum f (x - A)^r ; \quad r = 0, 1, 2, 3, …$$ …(7·24)

$$= \frac{1}{N} \left[ f_1 (x_1 - A)^r + f_2 (x_2 - A)^r + … + f_n (x_n - A)^r \right]$$ …(7·24a)

In particular, taking r = 0 and r = 1 in (7·24), we get respectively

$$\mu_0' = \frac{1}{N} \sum f (x - A)^0 = \frac{1}{N} \sum f = 1$$ …(7·25)

$$\mu_1' = \frac{1}{N} \sum f (x - A) = \frac{1}{N} \left[ \sum f x - A \sum f \right] \qquad (\because\ A \ \text{is constant})$$

$$= \frac{1}{N} \left[ \sum f x - AN \right] = \frac{1}{N} \sum f x - A = \bar{x} - A$$ …(7·26)

$$\Rightarrow\ \bar{x} = A + \mu_1'$$ …(7·27)

where μ1′ is the first moment about the point ‘A’. Taking r = 2, 3, 4 in (7·24), we get respectively

$$\mu_2' = \text{Second moment about } A = \frac{1}{N} \sum f (x - A)^2$$ …(7·28)

$$\mu_3' = \text{Third moment about } A = \frac{1}{N} \sum f (x - A)^3$$ …(7·28a)

$$\mu_4' = \text{Fourth moment about } A = \frac{1}{N} \sum f (x - A)^4$$ …(7·28b)

Remarks 1. μr, the rth moment about the mean ; r = 1, 2, 3, …, are also called the central moments, and μr′, the rth moment about any arbitrary point A, are also known as raw moments.

2. In particular, if we take A = 0 in (7·27), we get

$$\bar{x} = 0 + \mu_1' \ \text{(about origin)}$$ …(7·28c)

Hence, the first moment about origin gives the mean.

7·3·3. Relation between Moments about Mean and Moments about Arbitrary Point ‘A’. We have

$$\mu_r = \mu_r' - {}^{r}C_1\, \mu_{r-1}'\, \mu_1' + {}^{r}C_2\, \mu_{r-2}'\, \mu_1'^{\,2} - {}^{r}C_3\, \mu_{r-3}'\, \mu_1'^{\,3} + … + (-1)^r \mu_1'^{\,r}$$ …(7·29)

Remarks 1. We summarise below the important results on moments :

$$\mu_0 = 1 \ \text{and}\ \mu_1 = 0 ; \qquad \text{Mean } (\bar{x}) = A + \mu_1'$$ …(7·30)

$$\text{Variance} = \sigma_x^2 = \mu_2 = \mu_2' - \mu_1'^{\,2}$$
$$\mu_3 = \mu_3' - 3\mu_2' \mu_1' + 2\mu_1'^{\,3}$$
$$\mu_4 = \mu_4' - 4\mu_3' \mu_1' + 6\mu_2' \mu_1'^{\,2} - 3\mu_1'^{\,4}$$ …(7·31)

The results in (7·30) and (7·31) are of fundamental importance in Statistics and should be committed to memory. Thus, if we know the first four moments about any arbitrary point A, we can obtain the measures of central tendency [x̄ = μ1′ (about origin)], dispersion (μ2 = σ²), skewness (μ3 or β1) and kurtosis (β2). [The last two measures are discussed in § 7·5 and § 7·6.] Since these four measures enable us to have a fairly good idea about the nature and the form of the given frequency distribution, in practice we generally compute only the first four moments and not the higher moments.

2.
In case of a symmetrical distribution, if the deviations of the given observations from their arithmetic mean are raised to any odd power, the sum of positive deviations equals the sum of the negative deviations and accordingly the overall sum is zero, i.e., if the distribution is symmetrical then : ∑ f (x – –x ) = ∑ f (x – –x ) 3 = ∑ f (x – –x ) 5 = ∑ f (x – –x ) 7 = … = 0 Dividing by N = ∑ f, we get for a symmetrical distribution : μ 1 = μ 3 = μ5 = μ7 = … = 0 ⇒ μ2 r + 1 = 0 ; r = 0, 1, 2, 3,… …(7·32) Hence, for a symmetrical distribution, all the odd order moments about mean vanish. Accordingly, odd order moments, specially 3rd moment is used as a measure of skewness. 3. We have μ2 = μ2 ′ – μ1 ′2 Since μ1′2, being the square of a real quantity is always non-negative, we get μ2 = μ2 ′ – (some non-negative quantity) ⇒ μ 2 ≤ μ2 ′ ∴ Variance ≤ Mean square deviation or S.D. ≤ Root mean square deviation …(7·33) 4. For obtaining the moments of a grouped (continuous) frequency distribution, if we change the scale also in X by taking x–A d = ⇒ x – A = hd …(7·33a) h then μr′ = rth moment of X about point X = A 1 1 = N ∑ f (x – A)r = N ∑ f . hr dr ⇒ μr′ = 1 hr . N ∑ [From (7·33a)] f dr …(7·33b) In particular, we have 1 μ1 ′ = h . N ∑ f d; 1 μ 2 ′ = h2 . N ∑ f d 2 ; 1 μ 3 ′ = h3 .N ∑ f d 3 ; 1 μ 4 ′ = h4 .N ∑ f d 4 …(7·33c) Finally, on using the relation (7·31), we obtain the moments about mean. For numerical computations, if the mean of the distribution comes out to be integral (i.e., a whole number), then it is convenient to obtain the moments about mean directly by the formula 1 μr = N ∑ f (x – –x ) r …(7·33d) However, for grouped (continuous) frequency distribution, the calculations are simplified by changing the scale also in X. If we take x – –x z = ⇒ x – –x = hz …(7·34) h 7·21 SKEWNESS, MOMENTS AND KURTOSIS then, we have 1 μr = N ∑ f . (hz)r ⇒ μ r = hr . [ 1 N∑ f zr ] …(7·35) a formula which is more convenient to use. 4. Converse. 
We can obtain μr′ in terms of μ r as given below : μ′r = μr + rC 1 μr – 1 μ1′ + rC 2 μr – 2 μ1′2 + … + μ1 ′r, …(7·36) In particular, taking r = 2, 3 and 4 in (7·36) and simplifying we shall get respectively : μ 2 ′ = μ 2 + μ 1 ′2 μ3 ′ = μ3 + 3μ2 μ1′ + μ1 ′3 …(7·37) μ4 ′ = μ4 + 4μ3 μ1′ + 6μ2 μ1′2 + μ1′4 These formulae enable us to compute moments about any arbitrary point, if we are given mean and moments about the mean. } 7·3·4. Effect of Change of Origin and Scale on Moments about Mean. Let us change the origin and scale in the variable x to obtain a new variable u as defined below : x–a u = ⇒ x = a + hu …(7·38) h Then, μr (x) = hrμr (u) …(7·39) Hence, rth moment of the variable x about its mean is equal to hr times the rth moment of the variable u about its mean. The result does not depend on ‘a’. Hence, we conclude that the moments about mean (i.e., central moments) are invariant under change of origin but not of scale. 7·3·5. Sheppard’s Correction for Moments. In case of grouped or continuous frequency distribution, for the calculation of moments, the value of the variable X is taken as the mid-point of the corresponding class. This is based on the assumption that the frequencies are concentrated at the mid-points of the corresonding classes. This assumption is approximately true for distributions which are symmetrical or moderately skewed and for which the class intervals are not greater than one-twentieth (1/20th) of the range of the distribution. However, in practice, this assumption is not true in general and consequently some error, known as ‘grouping error’ is introduced in the calculation of moments. W.F. 
Sheppard proved that if (i) the frequency distribution is continuous, and (ii) the frequency tapers off to zero in both directions, the effect due to grouping at the mid-point of the intervals can be corrected by the following formulae, known as Sheppard’s corrections : h2 ⎫ ⎪ ⎪ ⎬ ⎪ ⎪⎭ μ2 (corrected) = μ2 – 12 μ3 (corrected) = μ3 μ4 (corrected) = μ4 – 12 h2 μ2 7 4 + 240 h …(7·40) where h is the width of the class interval. Remark. This correction is valid only for symmetrical or slightly asymmetrical continuous distributions and cannot be applied in the case of extremely asymmetrical (skewed) distributions like J-shaped or inverted J-shaped or U-shaped distributions. As a safeguard against sampling errors, this correction should be applied only if the total frequency is fairly large, say, greater than 1000. 7·3·6. Charlier Checks. We have, on using binomial expansion for positive integral index, x+1 =x+1 (x + 1)2 = x 2 + 2x + 1 (x + 1)3 = x 3 + 3x2 + 3x + 1 (x + 1)4 = x 4 + 4x3 + 6x2 + 4x + 1 } …(*) 7·22 BUSINESS STATISTICS Multiplying both sides of (*) by f and adding over different values of the variable X, we get the following identities, ∑ f (x + 1) = ∑ f x + N ∑ f (x + 1)2 = ∑ f x2 + 2∑ f x + N …(7·41) ∑ f (x + 1)3 = ∑ f x3 + 3∑ f x2 + 3∑ f x + N ∑ f (x + 1)4 = ∑ f x4 + 4∑ f x3 + 6∑ f x2 + 4 ∑ f x + N } These identities are known as Charlier checks and are used in checking the calculations in the computation of the first four moments. β ) AND GAMMA (γγ ) COEFFICIENTS BASED ON 7·4. KARL PEARSON’S BETA (β MOMENTS Prof. Karl Pearson defined the following four coefficients based on the 1st four central moments. β1 = μ3 2 μ2 3 …(7·42) β2 = μ4 μ4 = μ 2 2 σ4 …(7·43) μ3 μ = 3 μ2 3/2 σ3 γ1 = ⎯ √⎯β1 = γ2 = β 2 – 3 = (·.· μ2 = σ2 ) …(7·44) μ4 –3 μ2 2 …(7·45) It may be stated here that these coefficients are pure numbers independent of units of measurement and as such can be conveniently used for comparative studies. 
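The definitions (7·19), (7·40) and (7·42) to (7·45) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `central_moments`, `sheppard_corrected` and `pearson_coefficients` are hypothetical helpers introduced here, and the small frequency table is made-up data chosen because its moments can be checked by hand.

```python
def central_moments(xs, fs, max_order=4):
    """Moments about the mean, mu_0 .. mu_max_order, following (7·19)."""
    n = sum(fs)                                     # N = total frequency
    mean = sum(f * x for x, f in zip(xs, fs)) / n   # arithmetic mean
    return [
        sum(f * (x - mean) ** r for x, f in zip(xs, fs)) / n
        for r in range(max_order + 1)
    ]


def sheppard_corrected(mu2, mu3, mu4, h):
    """Sheppard's corrections (7·40) for a class interval of width h."""
    return (mu2 - h**2 / 12, mu3, mu4 - 0.5 * h**2 * mu2 + 7 * h**4 / 240)


def pearson_coefficients(mu2, mu3, mu4):
    """Return (beta1, beta2, gamma1, gamma2) following (7·42) to (7·45)."""
    beta1 = mu3**2 / mu2**3
    beta2 = mu4 / mu2**2
    gamma1 = mu3 / mu2**1.5   # computed from mu3, so it keeps the sign of mu3
    gamma2 = beta2 - 3
    return beta1, beta2, gamma1, gamma2


# Hypothetical symmetrical distribution, used only for illustration.
xs = [1, 2, 3, 4, 5]
fs = [1, 4, 6, 4, 1]

mu = central_moments(xs, fs)
beta1, beta2, gamma1, gamma2 = pearson_coefficients(mu[2], mu[3], mu[4])
print(mu)                              # mu_0 .. mu_4
print(beta1, beta2, gamma1, gamma2)
```

For this made-up table the deviations from the mean 3 are –2, –1, 0, 1, 2, so that μ2 = 1, μ3 = 0 and μ4 = 2.5, giving β1 = 0 and β2 = 2.5. The Sheppard helper is included only to show the arithmetic of (7·40); whether the correction is appropriate depends on the conditions stated in § 7·3·5.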
In practice, the beta and gamma coefficients are used as measures of skewness and kurtosis, as discussed in the following sections.
Remark. Sometimes another set of coefficients based on moments, viz., the Alpha (α) coefficients, is used. The Alpha coefficients are defined as
$$\alpha_1 = \frac{\mu_1}{\sigma} = 0, \quad \alpha_2 = \frac{\mu_2}{\sigma^2} = 1, \quad \alpha_3 = \frac{\mu_3}{\sigma^3} = \sqrt{\beta_1} = \gamma_1, \quad \alpha_4 = \frac{\mu_4}{\sigma^4} = \beta_2 \tag{7·46}$$

7·5. COEFFICIENT OF SKEWNESS BASED ON MOMENTS
Based on the first four moments, Karl Pearson’s coefficient of skewness becomes
$$S_k = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2\,(5\beta_2 - 6\beta_1 - 9)} \tag{7·47}$$
where $\beta_1$ and $\beta_2$ are Pearson’s coefficients defined in (7·42) and (7·43) in terms of the first four central moments. Formula (7·47) will give positive skewness if M > Mo and negative skewness if M < Mo, where M is the mean and Mo the mode.
$S_k = 0$ if $\beta_1 = 0$ or $\beta_2 + 3 = 0 \Rightarrow \beta_2 = -3$. But
$$\mu_4 = \frac{1}{N}\sum f\,(x-\bar{x})^4 > 0 \quad\text{and}\quad \mu_2 = \frac{1}{N}\sum f\,(x-\bar{x})^2 > 0 \qquad \therefore\ \beta_2 = \frac{\mu_4}{\mu_2^2} > 0$$
Since $\beta_2$ cannot be negative, $S_k = 0$ only if $\beta_1 = 0$, i.e., if $\mu_3 = 0$. Hence, for a symmetrical distribution, $\beta_1 = 0$. Accordingly, $\beta_1$ may be taken as a relative measure of skewness based on moments.
Remark. The coefficient $\beta_1$ as a measure of skewness has a serious limitation. $\mu_3$, being the sum of the cubes of the deviations from the mean, may be positive or negative, but $\mu_3^2$ is always non-negative. Also $\mu_2$, being the variance, is always positive. Hence $\beta_1 = \mu_3^2/\mu_2^3$ is always non-negative. Thus $\beta_1$, as a measure of skewness, is not able to tell us about the direction (positive or negative) of skewness. This drawback is removed in Karl Pearson’s coefficient Gamma One ($\gamma_1$), which is defined as the square root of $\beta_1$ taken with the sign of $\mu_3$, i.e.,
$$\gamma_1 = \pm\sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\sigma^3} \tag{7·48}$$
Thus, the sign of skewness depends upon $\mu_3$. If $\mu_3$ is positive we get positive skewness and if $\mu_3$ is negative, we get negative skewness.

7·6. KURTOSIS
So far we have studied three measures viz., central tendency, dispersion and skewness to describe the characteristics of a frequency distribution.
However, even if we know all these three measures we are not in a position to characterise a distribution completely. The following diagram (Fig. 7·5) will clarify the point.

[Fig. 7·5. Three frequency curves about a common mean: A (Lepto-kurtic), B (Meso-kurtic), C (Platy-kurtic).]

All the three curves are symmetrical about the mean and have the same variation (range). In order to identify a distribution completely we need one more measure, which Prof. Karl Pearson called the ‘convexity of the curve’ or its ‘Kurtosis’. While skewness helps us in identifying the right or left tails of the frequency curve, kurtosis gives us an idea about the shape and nature of the hump (middle part) of a frequency distribution. In other words, kurtosis is concerned with the flatness or peakedness of the frequency curve.
The curve of type B, which is neither flat nor peaked, is known as the Normal curve and the shape of its hump is accepted as a standard one. Curves with humps of the form of the normal curve are said to have normal kurtosis and are termed meso-kurtic. Curves of the type A, which are more peaked than the normal curve, are known as lepto-kurtic and are said to possess kurtosis in excess or to have positive kurtosis ($\gamma_2 > 0$). On the other hand, curves of the type C, which are flatter than the normal curve, are called platy-kurtic and are said to lack kurtosis or to have negative kurtosis ($\gamma_2 < 0$).
As a measure of kurtosis, Karl Pearson gave the coefficient Beta two ($\beta_2$) or its derivative Gamma two ($\gamma_2$), defined as follows :
$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{\mu_4}{\sigma^4} \tag{7·49}$$
$$\gamma_2 = \beta_2 - 3 = \frac{\mu_4}{\sigma^4} - 3 = \frac{\mu_4 - 3\sigma^4}{\sigma^4} \tag{7·50}$$
For a normal or meso-kurtic curve (Type B), $\beta_2 = 3$ or $\gamma_2 = 0$. For a lepto-kurtic curve (Type A), $\beta_2 > 3$ or $\gamma_2 > 0$, and for a platy-kurtic curve (Type C), $\beta_2 < 3$ or $\gamma_2 < 0$.
It is interesting to quote here the words of the British statistician W.S. Gosset (who wrote under the pen name of Student), who very humorously explains the use of the terms platy-kurtic and lepto-kurtic in the following sentence :
“Platykurtic curves, like the platypus, are squat with short tails ; lepto-kurtic curves are high with long tails like the kangaroos noted for leaping.”
Gosset’s little but humorous sketch is given below :

[Fig. 7·6 (a). Platypus. Fig. 7·6 (b). Kangaroos.]

Remarks 1. The Pearsonian coefficients $\beta_1$ and $\beta_2$ are independent of the change of origin and scale, i.e., if we take
$$U = \frac{X - A}{h}, \quad h > 0, \quad\text{then}\quad \beta_1(X) = \beta_1(U) \ \text{ and } \ \beta_2(X) = \beta_2(U).$$
2. For a discrete distribution $\beta_2 \ge 1$. In fact, a much stronger result holds. We have : $\beta_2 \ge \beta_1 + 1$.

Example 7·18. “For a symmetrical distribution, all central moments of odd order are zero.” Comment.
Solution. The statement is always true, i.e., for a symmetrical distribution,
$$\mu_1 = \mu_3 = \mu_5 = \ldots = 0, \quad\text{i.e.,}\quad \mu_{2n+1} = 0 ; \ (n = 0, 1, 2, \ldots)$$
[For detailed discussion see Remark 2, § 7·3·3]

Example 7·19. Calculate the first four moments about the mean for the following data and comment on the nature of the distribution :

x :   1    2    3    4    5    6    7    8    9
f :   1    6   13   25   30   22    9    5    2
[Delhi Univ. B.Com. (Hons.), 1999]
Solution.
CALCULATIONS FOR MOMENTS

x        f      d = x – 5     fd       fd²      fd³      fd⁴
1        1        –4          –4       16      –64      256
2        6        –3         –18       54     –162      486
3       13        –2         –26       52     –104      208
4       25        –1         –25       25      –25       25
5       30         0           0        0        0        0
6       22         1          22       22       22       22
7        9         2          18       36       72      144
8        5         3          15       45      135      405
9        2         4           8       32      128      512
Total  ∑f = 113  ∑d = 0  ∑fd = –10  ∑fd² = 282  ∑fd³ = 2  ∑fd⁴ = 2058

Moments About the Point x = 5
d = x – A = x – 5, (A = 5) ; N = ∑f = 113
$$\mu_1' = \frac{\sum fd}{N} = \frac{-10}{113} = -0.0885 \qquad \mu_2' = \frac{\sum fd^2}{N} = \frac{282}{113} = 2.4956$$
$$\mu_3' = \frac{\sum fd^3}{N} = \frac{2}{113} = 0.0177 \qquad \mu_4' = \frac{\sum fd^4}{N} = \frac{2058}{113} = 18.2124$$
∴ Mean = A + $\mu_1'$ = 5 + (–0.0885) = 4.9115

Moments About Mean
$$\mu_1 = 0$$
$$\mu_2 = \mu_2' - \mu_1'^2 = 2.4956 - (-0.0885)^2 = 2.4956 - 0.0078 = 2.4878$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = 0.0177 - 3 \times 2.4956 \times (-0.0885) + 2 \times (-0.0885)^3 = 0.0177 + 0.66258 - 0.001386 = 0.6789$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4 = 18.2124 - 4 \times 0.0177 \times (-0.0885) + 6 \times 2.4956 \times (-0.0885)^2 - 3 \times (-0.0885)^4$$
$$= 18.2124 + 0.00626 + 0.11728 - 0.000184 = 18.3357$$

Moment Coefficients of Skewness
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(0.6789)^2}{(2.4878)^3} = \frac{0.4609}{15.3974} = 0.0299$$
$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\mu_2\sqrt{\mu_2}} = \frac{0.6789}{3.9239} = 0.173 \quad\text{or}\quad \gamma_1 = +\sqrt{\beta_1} = +\sqrt{0.0299} = +0.173 \quad (\because \mu_3 \text{ is positive})$$

Kurtosis
$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{18.3357}{(2.4878)^2} = \frac{18.3357}{6.1891} = 2.9626 \quad\Rightarrow\quad \gamma_2 = \beta_2 - 3 \approx -0.037$$

Interpretation
Since $\gamma_1 = 0.173$, the given distribution is very slightly positively skewed. Since $\beta_2 = 2.9626 \approx 3$, the given distribution is more or less normal or meso-kurtic. However, since $\beta_2 < 3$ ($\gamma_2 \approx -0.04$), the given frequency curve is slightly platy-kurtic, i.e., slightly flatter than the normal curve.

Example 7·20. From the following data, calculate moments about assumed mean 25 and convert them into central moments :

X :   0–10   10–20   20–30   30–40
f :     1       3       4       2
[Delhi Univ. B.Com. (Hons.), 2000]
Solution.
CALCULATIONS FOR MOMENTS ABOUT THE POINT X = 25

Class     Mid-value (X)    f     d = (X – 25)/10     fd      fd²     fd³     fd⁴
0–10           5           1          –2             –2       4      –8      16
10–20         15           3          –1             –3       3      –3       3
20–30         25           4           0              0       0       0       0
30–40         35           2           1              2       2       2       2
Total                 N = ∑f = 10               ∑fd = –3  ∑fd² = 9  ∑fd³ = –9  ∑fd⁴ = 21

Moments about the Point X = 25 (h = 10)
$$\mu_1' = h\cdot\frac{\sum fd}{N} = \frac{10 \times (-3)}{10} = -3 \qquad \mu_2' = h^2\cdot\frac{\sum fd^2}{N} = \frac{10^2 \times 9}{10} = 90$$
$$\mu_3' = h^3\cdot\frac{\sum fd^3}{N} = \frac{10^3 \times (-9)}{10} = -900 \qquad \mu_4' = h^4\cdot\frac{\sum fd^4}{N} = \frac{10^4 \times 21}{10} = 21{,}000$$

Central Moments (Moments About Mean)
$$\mu_1 = 0$$
$$\mu_2 = \mu_2' - \mu_1'^2 = 90 - 9 = 81$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = -900 - 3 \times 90 \times (-3) + 2(-3)^3 = -900 + 810 - 54 = -144$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4 = 21{,}000 - 4 \times (-900) \times (-3) + 6 \times 90 \times (-3)^2 - 3 \times (-3)^4$$
$$= 21{,}000 - 10{,}800 + 4{,}860 - 243 = 14{,}817$$

Example 7·21. The first three moments of a distribution about the value 67 of the variable are 0·45, 8·73 and 8·91. Calculate the second and third central moments, and the moment coefficient of skewness. Indicate the nature of the distribution.
[Delhi Univ. B.A. (Econ. Hons. I), 2000]
Solution. In the usual notations we are given :
A = 67, $\mu_1' = 0.45$, $\mu_2' = 8.73$ and $\mu_3' = 8.91$
The second and third central moments are given by :
$$\mu_2 = \mu_2' - \mu_1'^2 = 8.73 - (0.45)^2 = 8.73 - 0.2025 = 8.5275$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = 8.91 - 3 \times 8.73 \times 0.45 + 2 \times (0.45)^3 = 8.91 - 11.7855 + 0.18225 = -2.6933$$
Hence, the variance of the distribution is
$$\sigma^2 = \mu_2 = 8.5275 \quad\Rightarrow\quad \sigma \ (\text{s.d.}) = \sqrt{8.5275} = 2.9202$$
Since $\mu_3$ is negative, the given distribution is negatively skewed. In other words, the frequency curve has a longer tail towards the left.
Karl Pearson’s moment coefficient of skewness is given by :
$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{\mu_3}{\mu_2\sqrt{\mu_2}} = \frac{-2.6933}{8.5275 \times 2.9202} = \frac{-2.6933}{24.9020} = -0.1082 \quad\Rightarrow\quad \beta_1 = \frac{\mu_3^2}{\mu_2^3} = (-0.1082)^2 = 0.0117$$
Since $\gamma_1$ and $\beta_1$ are approximately zero, the given distribution is approximately symmetrical.

Example 7·22. The first four moments of a distribution about the origin are 1, 4, 10 and 46 respectively.
Obtain the various characteristics of the distribution on the basis of the information given. Comment upon the nature of the distribution.
Solution. We are given the first four moments about origin. In the usual notations we have :
A = 0, $\mu_1' = 1$, $\mu_2' = 4$, $\mu_3' = 10$ and $\mu_4' = 46$
The measure of central tendency is given by :
Mean ($\bar{x}$) = First moment about origin = $\mu_1' = 1$
The measure of dispersion is given by :
$$\text{Variance } (\sigma^2) = \mu_2 = \mu_2' - \mu_1'^2 = 4 - 1 = 3 \quad\Rightarrow\quad \text{s.d. } (\sigma) = \sqrt{3} = 1.732$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = 10 - 3 \times 4 \times 1 + 2 \times 1 = 10 - 12 + 2 = 0$$
Karl Pearson’s moment coefficient of skewness is given by :
$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = 0 \quad\Rightarrow\quad \beta_1 = \frac{\mu_3^2}{\mu_2^3} = 0$$
Since $\gamma_1 = 0$, the given distribution is symmetrical, i.e., Mean = Median = Mode for the given distribution. Moreover, the quartiles are equidistant from the median, i.e.,
$Q_3$ – Median = Median – $Q_1$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4 = 46 - 4 \times 10 \times 1 + 6 \times 4 \times 1 - 3 \times 1 = 46 - 40 + 24 - 3 = 27$$
Hence, Karl Pearson’s measure of kurtosis is given by :
$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{27}{3^2} = 3 \quad\Rightarrow\quad \gamma_2 = \beta_2 - 3 = 0$$
Since $\beta_2 = 3$, the given distribution is Normal (Meso-kurtic) as regards kurtosis.
Since $\beta_1 = 0$ and $\beta_2 = 3$, the first four moments of the given distribution agree with those of a normal distribution with Mean ($\bar{x}$) = 1 and s.d. ($\sigma$) = $\sqrt{3}$ = 1.732.

Example 7·23. Given that the mean of a distribution is 5, variance is 9 and the moment coefficient of skewness is –1, find the first three moments about origin.
[Delhi Univ. B.A. (Econ. Hons.), 2009]
Solution. We are given
$$\text{Mean } (\bar{X}) = 5, \quad \text{Variance } (\sigma^2) = \mu_2 = 9 \ \Rightarrow\ \sigma = 3 ; \quad \text{and} \quad \gamma_1 = \frac{\mu_3}{\sigma^3} = -1 \ \Rightarrow\ \mu_3 = -27.$$
We want the moments about the point A = 0.
$$\text{Mean } (\bar{X}) = A + \mu_1' \quad\Rightarrow\quad 5 = 0 + \mu_1' \quad\Rightarrow\quad \mu_1' = 5$$
$$\mu_2 = \mu_2' - \mu_1'^2 \quad\Rightarrow\quad 9 = \mu_2' - 5^2 \quad\Rightarrow\quad \mu_2' = 34$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 \quad\Rightarrow\quad -27 = \mu_3' - 3 \times 34 \times 5 + 2 \times 5^3 \quad\Rightarrow\quad \mu_3' = -27 + 510 - 250 = 233$$

Example 7·24. If $\sqrt{\beta_1} = 1$, $\beta_2 = 4$ and variance = 9, find the values of third and fourth central moments and comment upon the nature of the distribution.
[Delhi Univ. B.A. (Econ. Hons.), 2007]
Solution. We are given $\sqrt{\beta_1} = +1$, $\beta_2 = 4$ and Variance = $\mu_2 = 9$.
$$\sqrt{\beta_1} = 1 \quad\Rightarrow\quad \frac{\mu_3^2}{\mu_2^3} = 1 \quad\Rightarrow\quad \mu_3 = \mu_2^{3/2} \cdot 1 = 9^{3/2} \cdot 1 = 3^3 = 27 \qquad \left[\text{or } \sqrt{\beta_1} = \frac{\mu_3}{\sigma^3}\right]$$
$$\beta_2 = 4 \quad\Rightarrow\quad \frac{\mu_4}{\mu_2^2} = 4 \quad\Rightarrow\quad \mu_4 = 4 \times 9 \times 9 = 324$$
∴ $\mu_3 = 27$ and $\mu_4 = 324$.
Nature of the Distribution. Since $\gamma_1 = \sqrt{\beta_1} \ne 0$ and $\gamma_1 = 1$, the distribution is moderately positively skewed ($\because \mu_3 > 0$). Also $\beta_2 = 4 > 3$. Hence, the given distribution is lepto-kurtic, i.e., more peaked than the normal curve.

Example 7·25. For a meso-kurtic distribution, the first moment about 7 is 23 and the second moment about origin is 1000. Find the coefficient of variation and the fourth moment about the mean.
[Delhi Univ. B.Com. (Hons.), 2008; Delhi Univ. B.A. (Econ. Hons.), 2002]
Solution. Since the distribution is given to be meso-kurtic, we have :
$$\beta_2 = 3 \quad\Rightarrow\quad \frac{\mu_4}{\mu_2^2} = 3 \quad\Rightarrow\quad \mu_4 = 3\mu_2^2 \tag{i}$$
The first moment about ‘7’ is 23, i.e., $\mu_1'$ (about 7) = 23 (Given)
$$\therefore\quad \text{Mean} = 7 + \mu_1' = 7 + 23 = 30 \tag{ii}$$
But the mean is the first moment about origin. ∴ $\mu_1'$ (about origin) = 30
Moments About Origin
$\mu_1'$ = Mean = 30 ; $\mu_2' = 1{,}000$ (Given)
$$\therefore\quad \mu_2 = \mu_2' - \mu_1'^2 = 1000 - 30^2 = 100 \quad\Rightarrow\quad \text{Variance } (\sigma^2) = 100 \quad\Rightarrow\quad \text{s.d. } (\sigma) = 10$$
$$\text{Coefficient of Variation (C.V.)} = \frac{100 \times \text{s.d.}}{\text{Mean}} = \frac{100 \times 10}{30} = 33.33$$
Substituting the value $\mu_2 = 100$ in (i), the fourth moment about the mean is given by :
$$\mu_4 = 3 \times 100^2 = 30{,}000.$$

Example 7·26. (a) You are given the following information :

Class      f        Class      f        Class      f
40–41      1        45–46     20        50–51     19
41–42      2        46–47     38        51–52     14
42–43      4        47–48     52        52–53      6
43–44      7        48–49     40        53–54      4
44–45     12        49–50     29        54–55      2

Calculate the first four moments about 47·5. Convert these into moments about the mean and calculate $\beta_1$ and $\beta_2$.
(b) Also apply Sheppard’s corrections to moments.
Solution.
CALCULATIONS FOR MOMENTS

Class     Mid-value (X)     f     d = X – 47·5     fd      fd²      fd³       fd⁴
40–41         40.5          1         –7           –7       49     –343      2401
41–42         41.5          2         –6          –12       72     –432      2592
42–43         42.5          4         –5          –20      100     –500      2500
43–44         43.5          7         –4          –28      112     –448      1792
44–45         44.5         12         –3          –36      108     –324       972
45–46         45.5         20         –2          –40       80     –160       320
46–47         46.5         38         –1          –38       38      –38        38
47–48         47.5         52          0            0        0        0         0
48–49         48.5         40          1           40       40       40        40
49–50         49.5         29          2           58      116      232       464
50–51         50.5         19          3           57      171      513      1539
51–52         51.5         14          4           56      224      896      3584
52–53         52.5          6          5           30      150      750      3750
53–54         53.5          4          6           24      144      864      5184
54–55         54.5          2          7           14       98      686      4802
Total                    N = 250              ∑fd = 98  ∑fd² = 1502  ∑fd³ = 1736  ∑fd⁴ = 29978

First four Moments about A = 47·5. Since h = 1, we have :
$$\mu_1' = \frac{1}{N}\sum fd = \frac{98}{250} = 0.392 \qquad \mu_2' = \frac{1}{N}\sum fd^2 = \frac{1502}{250} = 6.008$$
$$\mu_3' = \frac{1}{N}\sum fd^3 = \frac{1736}{250} = 6.944 \qquad \mu_4' = \frac{1}{N}\sum fd^4 = \frac{29978}{250} = 119.912$$
First four Moments about Mean :
Mean = 47.5 + $\mu_1'$ = 47.5 + 0.392 = 47.892
$$\mu_1 = 0$$
$$\mu_2 = \mu_2' - \mu_1'^2 = 6.008 - (0.392)^2 = 6.008 - 0.1537 = 5.854$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = 6.944 - 3 \times 6.008 \times 0.392 + 2 \times (0.392)^3 = 6.944 - 18.024 \times 0.392 + 2 \times 0.0602$$
$$\Rightarrow\quad \mu_3 = 6.944 - 7.0654 + 0.1204 = -0.001 \approx 0$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4 = 119.912 - 4 \times 6.944 \times 0.392 + 6 \times 6.008 \times (0.392)^2 - 3 \times (0.392)^4$$
$$= 119.912 - 10.888 + 5.539 - 0.071 = 114.492$$
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \approx 0 \ \ (\because \mu_3 \approx 0) \quad\text{and}\quad \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{114.492}{(5.854)^2} = \frac{114.492}{34.269} = 3.34$$
(b) Sheppard’s Corrections. Using (7·40), with h = magnitude of the class interval = 1, we have :
$$\mu_2 \ (\text{Corrected}) = \mu_2 - \frac{h^2}{12} = 5.854 - \frac{1}{12} = 5.854 - 0.083 = 5.771$$
$$\mu_3 \ (\text{Corrected}) = \mu_3 \approx 0$$
$$\mu_4 \ (\text{Corrected}) = \mu_4 - \frac{1}{2}h^2\mu_2 + \frac{7}{240}h^4 = 114.492 - \frac{1}{2} \times 5.854 + \frac{7}{240} = 114.492 - 2.927 + 0.029 = 111.594.$$

Example 7·27. For a distribution, it has been found that the first four moments about 27 are 0, 256, –2871 and 188462 respectively. Obtain the first four moments about zero. Also calculate the values of $\beta_1$ and $\beta_2$, and comment.
[Delhi Univ. B.Com. (Hons.) 2007, 2006]
Solution.
We are given : A = 27 ; $\mu_1' = 0$, $\mu_2' = 256$, $\mu_3' = -2871$ ; $\mu_4' = 188462$
Mean = A + $\mu_1'$ = 27 + 0 = 27
Hence, the moments about ‘A = 27’ are the same as the moments about the mean.
$$\Rightarrow\quad \mu_2 = \mu_2' = 256, \quad \mu_3 = \mu_3' = -2871 ; \quad \mu_4 = \mu_4' = 188462$$
Moments about Origin
$$\mu_1' = \text{Mean} = 27$$
$$\mu_2' = \mu_2 + \mu_1'^2 = 256 + 27^2 = 256 + 729 = 985$$
$$\mu_3' = \mu_3 + 3\mu_2\mu_1' + \mu_1'^3 = -2871 + 3 \times 256 \times 27 + 27^3 = -2871 + 20736 + 19683 = 37548$$
$$\mu_4' = \mu_4 + 4\mu_3\mu_1' + 6\mu_2\mu_1'^2 + \mu_1'^4 = 188462 + 4 \times (-2871) \times 27 + 6 \times 256 \times 27^2 + 27^4$$
$$= 188462 - 310068 + 1119744 + 531441 = 1529579$$
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(-2871)^2}{(256)^3} = \frac{8242641}{16777216} = 0.4913 ; \qquad \gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = \frac{-2871}{\sqrt{16777216}} = \frac{-2871}{4096} = -0.701$$
$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{188462}{(256)^2} = \frac{188462}{65536} = 2.876 \quad\Rightarrow\quad \gamma_2 = \beta_2 - 3 = -0.124$$
Since $\beta_1 \ne 0$, the distribution is not symmetrical. Since $\mu_3 < 0$, $\gamma_1 = -0.701 < 0$, the distribution is moderately negatively skewed.
$\beta_2 < 3$ ⇒ Distribution is moderately platy-kurtic.

Example 7·28. The first four moments of a distribution about the value 4 of the variable are –1·5, 17, –30 and 108. Find the moments about mean, $\beta_1$ and $\beta_2$. Find also the moments about (i) the origin, and (ii) the point x = 2.
Solution. In the usual notations, we are given A = 4 and $\mu_1' = -1.5$, $\mu_2' = 17$, $\mu_3' = -30$ and $\mu_4' = 108$.
Moments about Mean :
$$\mu_2 = \mu_2' - \mu_1'^2 = 17 - (-1.5)^2 = 17 - 2.25 = 14.75$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3 = -30 - 3 \times (17) \times (-1.5) + 2(-1.5)^3 = -30 + 76.5 - 6.75 = 39.75$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4 = 108 - 4(-30)(-1.5) + 6(17)(-1.5)^2 - 3(-1.5)^4 = 108 - 180 + 229.5 - 15.1875 = 142.3125$$
$$\text{Hence,}\quad \beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(39.75)^2}{(14.75)^3} = \frac{1580.06}{3209.05} = 0.4924 ; \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{142.3125}{(14.75)^2} = \frac{142.3125}{217.5625} = 0.6541$$
Also Mean ($\bar{x}$) = A + $\mu_1'$ = 4 + (–1.5) = 2.5
Note. Since for a discrete distribution $\beta_2 \ge 1$ [see Remark 2 to § 7·6], there seems to be some error in the problem.
Moments about Origin. We have $\bar{x} = 2.5$, $\mu_2 = 14.75$, $\mu_3 = 39.75$ and $\mu_4 = 142.31$ (approx.)
We know $\bar{x} = A + \mu_1'$, where $\mu_1'$ is the first moment about the point x = A. Taking A = 0, we get the first moment about origin as $\mu_1'$ = mean = 2.5.
Using (7·37), we get
$$\mu_2' = \mu_2 + \mu_1'^2 = 14.75 + (2.5)^2 = 14.75 + 6.25 = 21$$
$$\mu_3' = \mu_3 + 3\mu_2\mu_1' + \mu_1'^3 = 39.75 + 3(14.75)(2.5) + (2.5)^3 = 39.75 + 110.625 + 15.625 = 166$$
$$\mu_4' = \mu_4 + 4\mu_3\mu_1' + 6\mu_2\mu_1'^2 + \mu_1'^4 = 142.3125 + 4(39.75)(2.5) + 6(14.75)(2.5)^2 + (2.5)^4 = 142.3125 + 397.5 + 553.125 + 39.0625 = 1132$$
Moments about the Point x = 2. We have $\bar{x} = A + \mu_1'$. Taking A = 2, the first moment about the point x = 2 is
$$\mu_1' = \bar{x} - 2 = 2.5 - 2 = 0.5$$
Using (7·37), we get
$$\mu_2' = \mu_2 + \mu_1'^2 = 14.75 + 0.25 = 15$$
$$\mu_3' = \mu_3 + 3\mu_2\mu_1' + \mu_1'^3 = 39.75 + 3(14.75)(0.5) + (0.5)^3 = 39.75 + 22.125 + 0.125 = 62$$
$$\mu_4' = \mu_4 + 4\mu_3\mu_1' + 6\mu_2\mu_1'^2 + \mu_1'^4 = 142.3125 + 4(39.75)(0.5) + 6(14.75)(0.5)^2 + (0.5)^4 = 142.3125 + 79.5 + 22.125 + 0.0625 = 244$$

Example 7·29. Examine whether the following results of a piece of computation for obtaining the second central moment are consistent or not : N = 120, ∑fX = –125, ∑fX² = 128.
Solution. We have
$$\mu_2 = \frac{\sum fX^2}{N} - \left(\frac{\sum fX}{N}\right)^2 = \frac{128}{120} - \left(\frac{-125}{120}\right)^2 = 1.0667 - 1.0851 = -0.0184,$$
which is impossible, since variance cannot be negative. Hence, the given data are inconsistent.

EXERCISE 7·3
1. (a) Distinguish between “Skewness” and “Kurtosis” and bring out their importance in describing frequency distributions.
(b) “Averages, dispersion, skewness and kurtosis are complementary to one another in understanding a frequency distribution.” Elucidate.
2. (a) Explain the terms Skewness and Kurtosis used in connection with the frequency distribution of a continuous variable. Give the different measures of skewness (any three of the measures to be given) and kurtosis.
(b) Define skewness and describe briefly the various tests of skewness. [Himachal Pradesh Univ.
B.Com., 1998]
(c) Explain briefly how the measures of skewness and kurtosis can be used in describing a frequency distribution.
(d) What do you mean by ‘skewness’ ? How is skewness different from kurtosis ?
3. (a) Explain the relation between the central moments and raw moments of a frequency distribution and hence express the first four central moments in terms of the raw moments.
(b) What are the raw and central moments of a distribution ? Show that the central moments are invariant under change of origin but not of scale.
4. (a) Define moments. “A frequency distribution can be described almost completely by the first four moments and two measures based on moments.” Examine the statement. [Delhi Univ. B.Com. (Hons.), 1996]
(b) Define moments. How are moments helpful in the study of the different aspects of the formation of a frequency distribution ? [Delhi Univ. B.Com. (Hons.) (External), 2006]
5. Indicate whether two distributions with the same means, standard deviations and coefficients of skewness must have the same peakedness. [Delhi Univ. B.Com. (Hons.), 1999]
6. Prove that for any frequency distribution :
(i) Kurtosis is greater than unity.
(ii) Quartile coefficient of skewness is less than 1 numerically.
7. Prove any two of the following :
(i) The sum of squares of deviations is the least when deviations are taken from the mean.
(ii) The standard deviation is affected by the change of scale but not by the change of origin.
(iii) The moment coefficient of kurtosis is not affected by the change of scale.
[Delhi Univ. B.A. (Econ. Hons.), 1991]
8. Prove that the moment coefficient of kurtosis is not affected by the change of scale. [Delhi Univ. B.A. (Econ. Hons.), 1996]
9. Define Pearsonian coefficients $\beta_1$ and $\beta_2$, and discuss their utility in Statistics. Define moment coefficient of skewness. Discuss its utility and limitations, if any.
10. (a) Find the first, second, third and fourth central moments of the set of numbers 2, 4, 6, 8.
Ans.
$\mu_1 = 0$, $\mu_2 = 5$, $\mu_3 = 0$, $\mu_4 = 41$.
(b) Given the following data, compute the moment coefficient of skewness and comment on the result.
X :  2   3   7   8   10
[Delhi Univ. B.A. (Econ. Hons.), 1991]
Hint. $\bar{X} = 6$ ; $\mu_2 = \frac{1}{5}\sum (x-\bar{x})^2 = \frac{46}{5} = 9.2$ ; $\mu_3 = \frac{1}{5}\sum (x-\bar{x})^3 = \frac{-18}{5} = -3.6$
$\beta_1 = 0.0166 \ \Rightarrow\ \gamma_1 = -\sqrt{0.0166} = -0.13$ (Negative sign is taken because $\mu_3$ is negative). Hence, the distribution is very slightly negatively skewed, or it is approximately symmetrical ($\because \beta_1 \approx 0$).
11. Calculate $\beta_1$ and $\beta_2$ (measures of skewness and kurtosis) for the following frequency distribution and hence comment on the type of the frequency distribution :
x :  2   3   4   5   6
f :  1   3   7   2   1
Ans. $\beta_1 = 0.0204$ ; $\beta_2 = 3.1080$. Distribution is approximately normal.
12. Find the first four moments about the mean for the following distribution.
Height (in inches) :  60–62   63–65   66–68   69–71   72–74
Frequency :              5      18      42      27       8
Ans. $\mu_1 = 0$, $\mu_2 = 8.5275$, $\mu_3 = -2.6933$, $\mu_4 = 199.3759$.
13. Find the variance, skewness and kurtosis of the following distribution by the method of moments :
Class interval :  0–10   10–20   20–30   30–40
Frequency :         1      4       3       2
Ans. $\sigma^2 = 84$, $\gamma_1 = 0.0935$, $\beta_2 = 2.102$.
14. Explain Sheppard’s corrections for Grouping Errors. What are the conditions to be satisfied for the application of Sheppard’s corrections ? [Delhi Univ. B.Com. (Hons.), (External), 2006]
15. For the following distribution, calculate the first four central moments and two beta co-efficients :
Class Interval :  20–30   30–40   40–50   50–60   60–70   70–80   80–90
Frequency :          5      14      20      25      17      11       8
[Delhi Univ. B.Com. (Hons.), 2001]
Ans. $\mu_1 = 0$, $\mu_2 = 254$, $\mu_3 = 540$, $\mu_4$ = 1,49,000 ; $\beta_1 = 0.0178$, $\beta_2 = 2.3095$.
16. The first two moments of a distribution about the value 5 of the variable are 2 and 20. Find the mean and the variance.
Ans. Mean = 7, Variance = 16.
17. The first three moments of a distribution about the value 2 are 1, 22 and 10. Find its mean, standard deviation and the moment measure of skewness.
Ans.
Mean = 3, $\sigma = 4.58$, $\gamma_1 = -0.5614$.
18. If the first three moments about the origin for a distribution are 10, 225 and 0 respectively, calculate the first three moments about the value 5 for the distribution. [Delhi Univ. B.A. (Econ. Hons.), 2005]
Ans. 5, 150, –2750.
19. For a distribution, the mean is 10, variance is 16, $\gamma_1$ is +1 and $\beta_2$ is 4. Obtain the first four moments about the origin, i.e., zero. Comment upon the nature of distribution. [Delhi Univ. B.Com. (Hons.), 2005]
Ans. Moments about origin : $\mu_1' = 10$, $\mu_2' = 116$, $\mu_3' = 1544$, $\mu_4' = 23184$. The distribution is positively skewed and leptokurtic.
20. The arithmetic mean of a certain distribution is 5. The second and the third moments about the mean are 20 and 140 respectively. Find the third moment of the distribution about 10.
Ans. $\mu_3'$ (about 10) = –285.
21. In a certain distribution the first four moments about the point 4 are 1·5, 17, –30 and 108 respectively. Find the kurtosis of the frequency curve and comment on its shape.
Ans. $\beta_2 = 2.3088$. Distribution is platy-kurtic.
22. The first four moments of a distribution about the value 4 are –1·5, 17, –30 and 108 respectively. Calculate : (i) Moments about mean ; (ii) Skewness ; (iii) Show whether the distribution is Leptokurtic or Platykurtic. [Delhi Univ. B.Com. (Hons.), 2009]
Ans. (i) $\mu_1 = 0$, $\mu_2 = \sigma^2 = 14.75$, $\mu_3 = 39.75$, $\mu_4 = 142.3125$. (ii) Skewness ($\gamma_1$) = $\mu_3/\sigma^3$ = 0.7018. (iii) $\beta_2 = \mu_4/\mu_2^2 = 0.654 < 3$ ⇒ Distribution is platykurtic.
Note. See Example 7·28.
23. The first four central moments are 0, 4, 8 and 144. Examine the skewness and kurtosis.
Ans. $\gamma_1 = 1$, $\beta_2 = 9$ ⇒ $\gamma_2 = 6$.
24. The central moments of a distribution are given by : $\mu_2 = 140$, $\mu_3 = 148$, $\mu_4 = 6030$. Calculate the moment measures of skewness and kurtosis and comment on the shape of the distribution.
Ans. $\gamma_1 = 0.0893$ ; $\beta_2 = 0.3076$ ; Distribution is approximately symmetrical and platy-kurtic.
25. The first four moments of a distribution about value 2 are 1, 2·5, 5·5 and 16 respectively.
Calculate the four moments about mean and comment on the nature of the distribution. [Delhi Univ. B.Com. (Hons.), 2002]
26. The first four moments of a distribution about the value 3 are 1, 2·5, 5·5 and 16 respectively. Do you think that the distribution is leptokurtic ? [Delhi Univ. B.A. (Econ. Hons.), 2009]
Ans. $\mu_2 = 1.5$, $\mu_3 = 0$, $\mu_4 = 6$ ; $\beta_2 = 2.67 < 3$. Distribution is platy-kurtic and not leptokurtic.
27. The first four moments of a distribution about the value 5 are equal to 2, 20, 40 and 50. Obtain the mean, variance, $\sqrt{\beta_1}$ and $\beta_2$ for the distribution and comment. [Delhi Univ. B.A. (Econ. Hons.), 1993]
Ans. Mean = 7, Variance = $\mu_2 = 16$ ; $\mu_3 = -64$, $\mu_4 = 162$.
$\gamma_1 = \sqrt{\beta_1} = \dfrac{\mu_3}{\mu_2^{3/2}} = \dfrac{-64}{64} = -1$ (Negatively skewed) ; $\beta_2 = \dfrac{\mu_4}{\mu_2^2} = \dfrac{162}{256} = 0.63$ (Platy-kurtic)
28. The following data are given to an economist for the purpose of economic analysis. The data refer to the length of a certain type of batteries.
N = 100, ∑fd = 50, ∑fd² = 1970, ∑fd³ = 2948 and ∑fd⁴ = 86,752, in which d = (X – 48).
Do you think that the distribution is platykurtic ? [Delhi Univ. B.Com. (Hons.), 1998]
Ans. $\beta_2 = 2.214$ ; $\beta_2 < 3$, distribution is platy-kurtic.
29. The standard deviation of a symmetrical distribution is 5. What must be the value of the fourth moment about the mean in order that the distribution be (a) lepto-kurtic, (b) meso-kurtic, and (c) platy-kurtic ?
Ans. Distribution is (a) lepto-kurtic if $\mu_4 > 1875$ ; (b) meso-kurtic if $\mu_4 = 1875$ ; (c) platy-kurtic if $\mu_4 < 1875$.
30. From the data given below, first calculate the first four moments about an arbitrary value and then, after Sheppard’s corrections, calculate the first four moments about the mean. Also calculate $\beta_2$ and comment on its value.
Average number of hours worked per week by workers in 100 industries in 1998.
Hours worked :       30–33   33–36   36–39   39–42   42–45   45–48
No. of Industries :    2       4      26      47      15       6
Ans.
Moments after applying Sheppard’s correction are : μ1 = 0, μ2 = 8·01, μ3 = –20·69, μ4 = 249·393.
31. (a) The first four moments of a distribution about the value 4 are –1·5, 17, –130, 108. Find whether the data are consistent.
Ans. μ4 = 108 – 780 + 229·5 – 15·1875 = – 457·6875. Since μ4 is negative, data are inconsistent.
(b) The first four moments of a distribution about 3 are 1, 3, 6 and 8. Is the data consistent ? Explain. [Delhi Univ. B.A. (Econ. Hons.), 2009]
Ans. Inconsistent, because μ4 = – 1 (< 0), which is not possible.
(c) For a distribution, the first four moments about zero are 2, 5, 16 and 30. Is the data consistent ? Explain. [Delhi Univ. B.A. (Econ. Hons.), 2007]
Ans. Data are inconsistent because μ4 = – 26 (Negative), which is not possible.
32. (a) The standard deviation of a symmetrical distribution is given to be 5. What must be the value of the fourth moment about the mean in order that the distribution is meso-kurtic ? [Delhi Univ. B.A. (Econ. Hons.), 1999]
(b) The standard deviation of symmetrical distribution is 3. What must be the value of the 4th moment about the mean in order that the distribution be mesokurtic ? [Maharishi Dayanand Univ. M.Com., 1999]
Ans. (a) 1875, (b) 243.
(c) Comment on the following : The standard deviation of a symmetrical distribution is 3 and the value of fourth moment about mean is 243 in order that the distribution is mesokurtic. [Delhi Univ. B.Com. (Hons.), 2009]
Ans. True ; because for Mesokurtic distribution : β2 = 3.
33. Fill in the blanks :
(a) (i) If β2 = 3, the curve is called ……
(ii) If β2 > 3, the curve is called ……
(iii) If β2 < 3, the curve is called ……
(iv) If β1 = 0, the curve is called ……
(b) (i) β1 is always … (≥, =, ≤).
(ii) μ4 is always … (≥, =, ≤).
(iii) μ2 is always … (≥, =, ≤).
(iv) β2 is always … (≥, =, ≤).
Comment on the result when equality sign holds.
Ans. (a) (i) Mesokurtic, (ii) Leptokurtic, (iii) Platykurtic, (iv) Symmetrical ; (b) (i) β1 ≥ 0, (ii) μ4 ≥ 0, (iii) μ2 ≥ 0, (iv) β2 ≥ 1.
34. Fill in the blanks :
(i) Literal meaning of skewness is “.......”
(ii) Kurtosis is a measure of …… of the frequency curve.
(iii) For a symmetrical distribution, mean, median and mode ……
(iv) If Mean < Mode, the distribution is …… skewed.
(v) If Mean > Median, the distribution is …… skewed.
(vi) Bowley’s coefficient of skewness lies between …… and ……
(vii) β1 = 0 implies that distribution is ……
(viii) If γ1 > 0, the distribution is …… skewed.
(ix) If γ1 < 0, the distribution is …… skewed.
(x) For a normal curve β2 ……
(xi) If β2 > 3, the curve is called ……
(xii) If β2 < 3, the curve is called ……
(xiii) An absolute measure of skewness based on moments is ……
(xiv) An absolute measure of skewness based on quartiles is ……
(xv) Relative measure of skewness based on mean, s.d. and mode is ……
(xvi) Relative measure of kurtosis is ……
(xvii) Relative measure of skewness in terms of moments is ……
(xviii) For a moderately asymmetrical distribution : Mean – Median = ? (Mean – Mode)
(xix) If μ3 = –1·48, the curve of the given distribution is stretched more to the …… than to the ……
(xx) For a symmetrical distribution, …… are equidistant from ……
(xxi) Mean = First moment about ……
(xxii) Variance = …… moment about mean.
(xxiii) If μ1′ is the first moment about the point ‘A’, then Mean = ……
(xxiv) β1 = μ3² / …… ; β2 = …… / μ2² ; γ1 = …… ; γ2 = ……
(xxv) In a moderately asymmetrical distribution the distance between … and … is … the distance between … and …
Ans. (i) Lack of symmetry, (ii) Convexity (flatness or peakedness), (iii) Coincide, (iv) Negatively, (v) Positively, (vi) –1 and 1, (vii) Symmetrical, (viii) Positively, (ix) Negatively, (x) 3, (xi) Lepto-kurtic, (xii) Platy-kurtic, (xiii) μ3, (xiv) (Q3 – Md) – (Md – Q1), (xv) (M – Mo)/σ, (xvi) β2 or γ2, (xvii) β1 or γ1, (xviii) 1/3, (xix) Left, Right, (xx) Quartiles, median, (xxi) Origin (i.e., A = 0), (xxii) Second, (xxiii) A + μ1′, (xxiv) μ2³, μ4, γ1 = √β1 = μ3 / μ2^(3/2), γ2 = β2 – 3, (xxv) Mean, Mode, Thrice, Mean, Median.
35. State whether the following statements are true (T) or false (F).
(i) Skewness studies the flatness or peakedness of the distribution.
(ii) Kurtosis means ‘lack of symmetry’.
(iii) For a symmetrical distribution β1 = 0.
(iv) Skewness and kurtosis help us in studying the shape of the frequency curve.
(v) Bowley’s coefficient of skewness lies between –3 and 3.
(vi) Two distributions having the same values of mean, s.d. and skewness must have the same kurtosis.
(vii) A positively skewed distribution curve is stretched more to the right than to the left.
(viii) If β2 > 3, the curve is called platy-kurtic.
(ix) If β2 = 3, the curve is called normal.
(x) β1 is always non-negative.
(xi) β2 can be negative.
(xii) Variance = μ2 (2nd moment about mean).
(xiii) For a symmetrical distribution : μ1 = μ3 = μ5 = …… = 0
(xiv) For a symmetrical distribution : Mean > Median > Mode.
(xv) β1 = μ4 / μ3²
Ans. (i) F, (ii) F, (iii) T, (iv) T, (v) F, (vi) F, (vii) T, (viii) F, (ix) T, (x) T, (xi) F, (xii) T, (xiii) T, (xiv) F, (xv) F.

8 Correlation Analysis

8·1. INTRODUCTION
So far we have confined our discussion to univariate distributions only i.e., the distributions involving only one variable and also saw how the various measures of central tendency, dispersion, skewness and kurtosis can be used for the purposes of comparison and analysis.
We may, however, come across certain series where each item of the series may assume the values of two or more variables. If we measure the heights and weights of n individuals, we obtain a series in which each unit (individual) of the series assumes two values, one relating to heights and the other relating to weights. Such a distribution, in which each unit of the series assumes two values, is called a bivariate distribution. Further, if we measure more than two variables on each unit of a distribution, it is called a multivariate distribution. In a series, the units on which different measurements are taken may be of almost any nature such as different individuals, times, places, etc. For example, we may have :
(i) The series of marks of individuals in two subjects in an examination.
(ii) The series of sales revenue and advertising expenditure of different companies in a particular year.
(iii) The series of exports of raw cotton in crores of rupees and imports of manufactured goods over a number of years, say from 1989 to 1994.
(iv) The series of ages of husbands and wives in a sample of selected married couples and so on.
Thus in a bivariate distribution we are given a set of pairs of observations, each pair consisting of one value of each of the two variables.
In a bivariate distribution, we may be interested to find if there is any relationship between the two variables under study. Correlation is a statistical tool which studies the relationship between two variables, and correlation analysis involves the various methods and techniques used for studying and measuring the extent of the relationship between the two variables.
WHAT THEY SAY ABOUT CORRELATION : SOME DEFINITIONS AND USES
“When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation.” (Croxton and Cowden)
“Correlation is an analysis of the covariation between two or more variables.” (A.M. Tuttle)
“Correlation analysis contributes to the understanding of economic behaviour, aids in locating the critically important variables on which others depend, may reveal to the economist the connections by which disturbances spread and suggest to him the paths through which stabilising forces may become effective.” (W.A. Neiswanger)
“The effect of correlation is to reduce the range of uncertainty of our prediction.” (Tippett)
Two variables are said to be correlated if the change in one variable results in a corresponding change in the other variable.

8·1·1. Types of Correlation
(a) POSITIVE AND NEGATIVE CORRELATION
If the values of the two variables deviate in the same direction i.e., if the increase in the values of one variable results, on an average, in a corresponding increase in the values of the other variable or if a decrease in the values of one variable results, on an average, in a corresponding decrease in the values of the other variable, correlation is said to be positive or direct. Some examples of series of positive correlation are :
(i) Heights and weights.
(ii) The family income and expenditure on luxury items.
(iii) Amount of rainfall and yield of crop (up to a point).
(iv) Price and supply of a commodity and so on.
On the other hand, correlation is said to be negative or inverse if the variables deviate in the opposite direction i.e., if the increase (decrease) in the values of one variable results, on the average, in a corresponding decrease (increase) in the values of the other variable.
Some examples of negative correlation are the series relating to :
(i) Price and demand of a commodity.
(ii) Volume and pressure of a perfect gas.
(iii) Sale of woollen garments and the day temperature, and so on.
(b) LINEAR AND NON-LINEAR CORRELATION
The correlation between two variables is said to be linear if corresponding to a unit change in one variable, there is a constant change in the other variable over the entire range of the values. For example, let us consider the following data :
x :  1   2   3   4   5
y :  5   7   9  11  13
Thus for a unit change in the value of x, there is a constant change viz., 2 in the corresponding values of y. Mathematically, above data can be expressed by the relation
y = 2x + 3
In general, two variables x and y are said to be linearly related, if there exists a relationship of the form
y = a + bx … (*)
between them. But we know that (*) is the equation of a straight line with slope ‘b’ and which makes an intercept ‘a’ on the y-axis [c.f. y = mx + c form of equation of the line]. Hence, if the values of the two variables are plotted as points in the xy-plane, we shall get a straight line. This can be easily checked for the example given above. Such phenomena occur frequently in physical sciences but in economics and social sciences, we very rarely come across the data which give a straight line graph.
The relationship between two variables is said to be non-linear or curvilinear if corresponding to a unit change in one variable, the other variable does not change at a constant rate but at fluctuating rate. In such cases if the data are plotted on the xy-plane, we do not get a straight line curve. Mathematically speaking, the correlation is said to be non-linear if the slope of the plotted curve is not constant. Such phenomena are common in the data relating to economics and social sciences.
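The check suggested above for the data x = 1, …, 5 and y = 5, …, 13 can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `first_differences` and `is_linear` are our own and do not come from the text. A relation is treated as linear when the slope between every pair of consecutive points is the same.

```python
# Data from the text: y changes by a constant amount for each unit change in x.
xs = [1, 2, 3, 4, 5]
ys = [5, 7, 9, 11, 13]


def first_differences(values):
    """Return the change between each pair of consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def is_linear(xs, ys):
    """True when the slope between every consecutive pair of points is the same."""
    slopes = [dy / dx for dx, dy in zip(first_differences(xs), first_differences(ys))]
    return all(s == slopes[0] for s in slopes)


slope = first_differences(ys)[0] / first_differences(xs)[0]  # b in y = a + bx
intercept = ys[0] - slope * xs[0]                            # a in y = a + bx

print(is_linear(xs, ys), slope, intercept)   # True 2.0 3.0
print(is_linear([1, 2, 3, 4], [1, 4, 9, 16]))  # False: y = x^2 is curvilinear
```

The recovered slope 2 and intercept 3 agree with the relation y = 2x + 3 stated above, while the second call shows a case where the rate of change is not constant.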
Since the techniques for the analysis and measurement of non-linear relation are quite complicated and tedious as compared to the methods of studying and measuring linear relationship, we generally assume that the relationship between the two variables under study is linear. In this chapter, we shall confine ourselves to the measurement of linear relationship only. The measurement of non-linear relationship is, however, beyond the scope of this book.
Remark. The study of correlation is easy in physical sciences since on the basis of experimental results, it is easy to establish mathematical relationship between two or more variables under study. But in social and economic sciences, it is very difficult to establish mathematical relationship between the variables under study since in such phenomena, the values of the variables under study are affected simultaneously by multiplicity of factors and it is extremely difficult, sometimes impossible, to study the effects of each factor separately. Hence, in the data relating to social and economic phenomena, the study of correlation cannot be as accurate and precise.

8·1·2. Correlation and Causation.
Correlation analysis enables us to have an idea about the degree and direction of the relationship between the two variables under study. However, it fails to reflect upon the cause and effect relationship between the variables. In a bivariate distribution, if the variables have the cause and effect relationship, they are bound to vary in sympathy with each other and, therefore, there is bound to be a high degree of correlation between them. In other words, causation always implies correlation. However, the converse is not true i.e., even a fairly high degree of correlation between the two variables need not imply a cause and effect relationship between them. The high degree of correlation between the variables may be due to the following reasons :
1. Mutual dependence.
The phenomena under study may inter-influence each other. Such situations are usually observed in data relating to economic and business situations. For instance, it is well-known principle in economics that prices of a commodity are influenced by the forces of supply and demand. For instance, if the price of a commodity increases, its demand generally decreases (other factors remaining constant). Here increased price is the cause and reduction in demand is the effect. However, a decrease in the demand of a commodity due to emigration of the people or due to fashion or some other factors like changes in the tastes and habits of people may result in decrease in its price. Here, the cause is the reduced demand and the effect is the reduced price. Accordingly, the two variables may show a good degree of correlation due to interaction of each on the other, yet it becomes very difficult to isolate the exact cause from the effect. 2. Both the variables being influenced by the same external factors. A high degree of correlation between the two variables may be due to the effect or interaction of a third variable or a number of variables on each of these two variables. For example, a fairly high degree of correlation may be observed between the yield per hectare of two crops, say, rice and potato, due to the effect of a number of factors like favourable weather conditions, fertilizers used, irrigation facilities, etc., on each of them. But none of the two is the cause of the other. 3. Pure chance. It may happen that a small randomly selected sample from a bivariate distribution may show a fairly high degree of correlation though, actually, the variables may not be correlated in the population. Such correlation may be attributed to chance fluctuations. Moreover, the conscious or unconscious bias on the part of the investigator, in the selection of the sample may also result in high degree of correlation in the sample. 
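The point about pure chance can be illustrated with a small simulation. The sketch below is not from the text: it assumes two independent variables drawn uniformly from (0, 1), so that the population correlation is zero, and the number of trials, the threshold 0·8 and the seed are arbitrary choices made for illustration. It counts how often a sample shows |r| above the threshold for a small and for a larger sample size.

```python
import math
import random


def pearson_r(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def share_of_high_r(sample_size, trials=2000, threshold=0.8, seed=1):
    """Fraction of samples of independent X and Y whose |r| exceeds the threshold."""
    rng = random.Random(seed)  # fixed seed so that repeated runs agree
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(sample_size)]
        ys = [rng.random() for _ in range(sample_size)]  # drawn independently of xs
        if abs(pearson_r(xs, ys)) > threshold:
            hits += 1
    return hits / trials


small = share_of_high_r(5)    # samples of 5 pairs
large = share_of_high_r(50)   # samples of 50 pairs
print(small, large)
```

Under these assumptions a noticeable share of the samples of size 5 shows a high correlation purely by chance, whereas such values become very rare for samples of size 50.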
In this connection, it may be worthwhile to make a mention of the two phenomena where a fairly high degree of correlation may be observed, though it is not possible to conceive them as being causally related. For example, we may observe a high degree of correlation between the size of shoe and the intelligence of a group of persons. Such correlation is called spurious or non-sense correlation. [For details see § 8·4·2 (iii).] 8·2. METHODS OF STUDYING CORRELATION We shall confine our discussion to the methods of ascertaining only linear relationship between two variables (series). The commonly used methods for studying the correlation between two variables are : (i) Scatter diagram method. (ii) Karl Pearson’s coefficient of correlation (Covariance method). (iii) Two-way frequency table (Bivariate correlation method). (iv) Rank method. (v) Concurrent deviations method. 8·3. SCATTER DIAGRAM METHOD Scatter diagram is one of the simplest ways of diagrammatic representation of a bivariate distribution and provides us one of simplest tools of ascertaining the correlation between two variables. Suppose we are given n pairs of values (x1, y 1 ), (x2, y 2 ), …, (xn, y n ) of two variables X and Y. For example, if the variables X and Y denote the height and weight respectively, then the pairs (x1 , y 1 ), (x 2 , y 2 ), … , (xn, y n ) may represent the heights and weights (in pairs) of n individuals. These n points may be plotted as dots (.) on the x-axis and y-axis in the xy-plane. (It is customary to take the dependent variable along the y-axis and independent variable along the x-axis.) The diagram of dots so obtained is known as scatter diagram. From scatter diagram we can form a fairly good, though rough, idea about the relationship between the two variables. 
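The plotting step described above can be imitated without any graphics library. The following is a rough sketch only: the function `ascii_scatter` and the seven (height, weight) pairs are hypothetical and were invented for illustration. Each pair is marked as a `*` on a character grid, with X increasing to the right and Y increasing upward.

```python
def ascii_scatter(points, width=31, height=11):
    """Plot (x, y) pairs as '*' marks on a character grid, y increasing upward.

    Assumes the x values are not all equal and the y values are not all equal.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top line
    return "\n".join("".join(line).rstrip() for line in grid)


# Hypothetical (height in inches, weight in kg) pairs, invented for illustration.
pairs = [(60, 52), (62, 55), (64, 54), (66, 60), (68, 63), (70, 62), (72, 66)]
chart = ascii_scatter(pairs)
print(chart)
```

With these invented pairs the marks run from the lower left to the upper right of the grid, which is the pattern of an upward trend.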
The following points may be borne in mind in interpreting the scatter diagram regarding the correlation between the two variables :
(i) If the points are very dense i.e., very close to each other, a fairly good amount of correlation may be expected between the two variables. On the other hand, if the points are widely scattered, a poor correlation may be expected between them.
(ii) If the points on the scatter diagram reveal any trend (either upward or downward), the variables are said to be correlated and if no trend is revealed, the variables are uncorrelated.
(iii) If there is an upward trend rising from lower left hand corner and going upward to the upper right hand corner, the correlation is positive since this reveals that the values of the two variables move in the same direction. If, on the other hand, the points depict a downward trend from the upper left hand corner to the lower right hand corner, the correlation is negative since in this case the values of the two variables move in the opposite directions.
(iv) In particular, if all the points lie on a straight line starting from the left bottom and going up towards the right top, the correlation is perfect and positive, and if all the points lie on a straight line starting from left top and coming down to right bottom, the correlation is perfect and negative.
The following diagrams of the scattered data depict different forms of correlation.
[Figs. 8.1 to 8.8 : scatter diagrams, with X on the horizontal axis and Y on the vertical axis, illustrating perfect positive correlation, perfect negative correlation, low degree of positive correlation, low degree of negative correlation, high degree of positive correlation, high degree of negative correlation, and no correlation (two panels).]
Remarks 1. The method of scatter diagram is readily comprehensible and enables us to form a rough idea of the nature of the relationship between the two variables merely by inspection of the graph. Moreover, this method is not affected by extreme observations whereas all mathematical formulae of ascertaining correlation between two variables are affected by extreme observations. However, this method is not suitable if the number of observations is fairly large.
2. The method of scatter diagram only tells us about the nature of the relationship whether it is positive or negative and whether it is high or low. It does not provide us an exact measure of the extent of the relationship between the two variables.
3. The scatter diagram enables us to obtain an approximate estimating line or line of best fit by free hand method. The method generally consists in stretching a piece of thread through the plotted points to locate the best possible line. However, a rigorous method of obtaining the line of best fit is discussed in next chapter (Regression Analysis).
Example 8·1. Following are the heights and weights of 10 students of a B.Com. class.
Height (in inches) X : 62  72  68  58  65  70  66  63  60  72
Weight (in kgs.) Y   : 50  65  63  50  54  60  61  55  54  65
Draw a scatter diagram and indicate whether the correlation is positive or negative.
Solution. The scatter diagram of the above data is shown below.
[Fig. 8.9 : Scatter diagram of weight (Y) against height (X).]
Since the points are dense i.e., close to each other, we may expect a high degree of correlation between the series of heights and weights. Further, since the points reveal an upward trend starting from left bottom and going up towards the right top, the correlation is positive.
Hence, we may expect a fairly high degree of positive correlation between the series of heights and weights in the class of B.Com. students.

EXERCISE 8·1
1. Explain the concept of correlation. Clearly explain with suitable illustrations its role in dealing with business problems.
2. (a) Define correlation. Explain various types of correlation with suitable examples. [Delhi Univ. B.Com. (Pass), 2000]
(b) State the nature of the following correlations (positive, negative or no correlation) :
(i) Sale of woollen garments and the day temperature ;
(ii) The colour of the saree and the intelligence of the lady who wears it ; and
(iii) Amount of rainfall and yield of crop.
3. Define correlation. Discuss its significance. Does correlation always signify causal relationship between two variables ? Explain with illustration.
4. (a) Does the high degree of correlation between the two variables signify the existence of cause and effect relationship between the two variables ?
(b) Does correlation imply causation between two variables ? [Delhi Univ. B.Com. (Hons.), 2008]
5. (a) What is ‘spurious correlation’ and ‘non-sense or chance correlation’ ? Explain with the help of an example. [Delhi Univ. B.Com. (Pass), 1997]
(b) Comment on the following statement : “A high degree of positive correlation between the ‘size of the shoe’ and the ‘intelligence’ of a group of individuals implies that people with bigger shoe size are more intelligent than the people with lower shoe size”.
6. How far do you agree with the conclusion drawn in the following case ? Why ?
Two series (quantity of money in circulation and general price index) are found to possess positive correlation of a fairly high order. From this, it is concluded that one is the cause and the other the effect in a direct causal relationship.
7. (a) Distinguish clearly between :
(i) Positive and Negative correlation ;
(ii) Linear and non-linear correlation.
(b) “If the two or more quantities vary in sympathy so that movements in one tend to be accompanied by corresponding movements in other(s), then they are said to be correlated.” Discuss. 8. What is correlation ? What is a scatter diagram ? How does it help in studying correlation between two variables, in respect of both its nature and extent ? 9. (a) What is correlation ? Explain the implications of positive and negative correlation. Show by means of scatter diagram, the presence of perfect positive and perfect negative correlation. [C.A. (Foundation), May 1996] (b) Illustrate a perfect negative correlation on a scatter diagram. [Delhi Univ. B.Com. (Pass), 1998] 10. (a) What is a scatter diagram ? How is it useful in the study of correlation between two variables ? Explain with suitable examples. [Delhi Univ. B.Com. (Hons.), (External), 2006] (b) Write a note on scatter diagram. Draw sketches of scatter diagram to show the following correlation between two variables x and y : (i) linear ; (ii) linear and perfect; (iii) non-linear; (iv) x and y uncorrelated. (c) While drawing a scatter diagram, if all points appear to form a straight line going downward from left to right, then it is inferred that there is : (i) Perfect positive correlation ; (ii) Simple positive correlation; (iii) Perfect negative correlation ; (iv) No correlation. Ans. (iii) 11. Given the following pairs of values : Capital employed (Crores of Rs.) : 2 3 5 6 8 9 Profits (Lakhs of Rs.) : 6 5 7 8 12 11 (a) Make a scatter diagram. (b) Do you think that there is any correlation between profits and capital employed ? Is it positive or negative ? Is it high or low ? (c) By graphic inspection, draw an estimating line. 12. “Even a high degree of correlation does not mean that a relationship of cause and effect exists between the two correlated variables”. Discuss. 13. Draw a scatter diagram from the following data : Height (inches) : 62 72 70 60 67 70 64 65 60 70 Weight (lbs.) 
: 50  65  63  52  56  60  59  58  54  65
Also indicate whether correlation is positive or negative.
Ans. Positive Correlation.
14. Draw a scatter diagram for the data given below and interpret it.
X : 10  20  30  40  50  60  70  80
Y : 32  20  24  36  40  28  38  44

8·4. KARL PEARSON’S COEFFICIENT OF CORRELATION (COVARIANCE METHOD)
A mathematical method for measuring the intensity or the magnitude of linear relationship between two variable series was suggested by Karl Pearson (1857-1936), a great British Bio-metrician and Statistician and is by far the most widely used method in practice.
Karl Pearson’s measure, known as Pearsonian correlation coefficient between two variables (series) X and Y, usually denoted by r(X, Y) or $r_{xy}$ or simply r, is a numerical measure of linear relationship between them and is defined as the ratio of the covariance between X and Y, written as Cov (x, y), to the product of the standard deviations of X and Y. Symbolically,
$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y}$$ … (8·1)
If $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ are n pairs of observations of the variables X and Y in a bivariate distribution, then
$$\operatorname{Cov}(x, y) = \frac{1}{n}\sum (x - \bar{x})(y - \bar{y}) ; \quad \sigma_x = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2} ; \quad \sigma_y = \sqrt{\frac{1}{n}\sum (y - \bar{y})^2}$$ … (8·2)
summation being taken over n pairs of observations. Substituting in (8·1), we get
$$r = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{1}{n}\sum (x - \bar{x})^2 \cdot \frac{1}{n}\sum (y - \bar{y})^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \cdot \sum (y - \bar{y})^2}}$$ … (8·3)
The formula (8·3) can also be written as :
$$r = \frac{\sum dx\, dy}{\sqrt{\sum dx^2 \cdot \sum dy^2}}$$ … (8·3a)
where dx and dy denote the deviations of x and y values from their arithmetic means $\bar{x}$ and $\bar{y}$ respectively i.e.,
$$dx = x - \bar{x}, \quad dy = y - \bar{y}$$ … (8·3b)
Simplifying (8·2), we get
$$\operatorname{Cov}(x, y) = \frac{1}{n}\sum xy - \bar{x}\,\bar{y}$$ … (8·4)
$$\Rightarrow \quad \operatorname{Cov}(x, y) = \frac{1}{n}\sum xy - \left(\frac{\sum x}{n}\right)\left(\frac{\sum y}{n}\right) = \frac{1}{n^2}\left[n\sum xy - \left(\sum x\right)\left(\sum y\right)\right]$$ … (8·4a)
$$\sigma_x^2 = \frac{1}{n}\sum (x - \bar{x})^2 = \frac{1}{n}\sum x^2 - \bar{x}^2 \quad \text{[c.f. Chapter 6]}$$
$$= \frac{1}{n}\sum x^2 - \left(\frac{\sum x}{n}\right)^2 = \frac{1}{n^2}\left[n\sum x^2 - \left(\sum x\right)^2\right]$$ … (8·4b)
Similarly, we have,
$$\sigma_y^2 = \frac{1}{n^2}\left[n\sum y^2 - \left(\sum y\right)^2\right]$$ … (8·4c)
Substituting from (8·4a), (8·4b) and (8·4c) in (8·1), we get
$$r = \frac{\frac{1}{n^2}\left[n\sum xy - \left(\sum x\right)\left(\sum y\right)\right]}{\sqrt{\frac{1}{n^2}\left[n\sum x^2 - \left(\sum x\right)^2\right] \cdot \frac{1}{n^2}\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$ … (8·5)
Remarks 1. Formula (8·3) or (8·3a) is quite convenient to apply if the means $\bar{x}$ and $\bar{y}$ come out to be integers (i.e., whole numbers). If $\bar{x}$ or/and $\bar{y}$ is (are) fractional then the formula (8·3) or (8·3a) is quite cumbersome to apply, since the computations of $\sum (x - \bar{x})^2$, $\sum (y - \bar{y})^2$ and $\sum (x - \bar{x})(y - \bar{y})$ are quite time consuming and tedious. In such a case formula (8·5) may be used provided the values of x or/and y are small. But if x and y assume large values, the calculation of $\sum x^2$, $\sum y^2$ and $\sum xy$ is again quite time consuming. Thus if (i) $\bar{x}$ and $\bar{y}$ are fractional and (ii) x and y assume large values, the formulae (8·3) and (8·5) are not generally used for numerical problems. In such cases, the step deviation method, where we take the deviations of the variables X and Y from any arbitrary points, is used. We shall discuss this method after the properties of correlation coefficient (c.f. § 8·4·1).
2. Karl Pearson’s correlation coefficient is also known as the product moment correlation coefficient.
Example 8·2. Calculate Karl Pearson’s coefficient of correlation between expenditure on advertising and sales from the data given below.
Advertising expenses (’000 Rs.) : 39  65  62  90  82  75  25  98  36  78
Sales (lakh Rs.)                : 47  53  58  86  62  68  60  91  51  84
Solution. Let the advertising expenses (in ’000 Rs.) be denoted by the variable x and the sales (in lakh Rs.) be denoted by the variable y.
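As a cross-check on the hand computation that follows, the same data can be run through formulae (8·3) and (8·5) in a short Python sketch. The function names are our own and are introduced only for illustration; both functions should return the same value because (8·5) is an algebraic rearrangement of (8·3).

```python
import math

# Data of Example 8.2: advertising expenses ('000 Rs.) and sales (lakh Rs.).
x = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]


def r_from_deviations(x, y):
    """Formula (8.3): deviations are taken from the arithmetic means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_dxdy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_dx2 = sum((a - mean_x) ** 2 for a in x)
    sum_dy2 = sum((b - mean_y) ** 2 for b in y)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)


def r_from_raw_sums(x, y):
    """Formula (8.5): only the raw sums of x, y, x^2, y^2 and xy are needed."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


print(round(r_from_deviations(x, y), 4))  # 0.7804
print(round(r_from_raw_sums(x, y), 4))    # 0.7804
```

Here the means are whole numbers (65 and 66), so formula (8·3) is the convenient one by hand, as the Remarks above note; on a computer either form may be used.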
CALCULATIONS FOR CORRELATION COEFFICIENT
  x      y     dx = x – 65   dy = y – 66    dx²     dy²    dxdy
  39     47       – 26          – 19         676     361     494
  65     53          0          – 13           0     169       0
  62     58        – 3           – 8           9      64      24
  90     86         25            20         625     400     500
  82     62         17           – 4         289      16    – 68
  75     68         10             2         100       4      20
  25     60       – 40           – 6        1600      36     240
  98     91         33            25        1089     625     825
  36     51       – 29          – 15         841     225     435
  78     84         13            18         169     324     234
∑x = 650  ∑y = 660  ∑dx = 0  ∑dy = 0  ∑dx² = 5398  ∑dy² = 2224  ∑dxdy = 2704
$$\bar{x} = \frac{\sum x}{n} = \frac{650}{10} = 65 ; \quad \bar{y} = \frac{\sum y}{n} = \frac{660}{10} = 66 ; \quad \therefore \ dx = x - \bar{x} = x - 65 ; \ dy = y - \bar{y} = y - 66$$
$$r = \frac{\sum dx\,dy}{\sqrt{\sum dx^2 \cdot \sum dy^2}} = \frac{2704}{\sqrt{5398 \times 2224}} = \frac{2704}{\sqrt{12005152}} = \frac{2704}{3464·8451} = 0·7804$$
Aliter.
$$\log r = \log 2704 - \tfrac{1}{2}\left[\log 5398 + \log 2224\right] = 3·4320 - \tfrac{1}{2}(3·7325 + 3·3472) = 3·4320 - 3·53985 = -0·10785 = \bar{1}·89215$$
$$\Rightarrow \quad r = \text{Antilog}\,(\bar{1}·89215) = 0·7802$$
Hence, there is a fairly high degree of positive correlation between expenditure on advertising and sales. We may, therefore, conclude that in general, sales have increased with an increase in the advertising expenses.
Example 8·3. From the following table calculate the coefficient of correlation by Karl Pearson’s method.
X :  6   2  10   4   8
Y :  9  11   ?   8   7
Arithmetic means of X and Y series are 6 and 8 respectively.
Solution. First of all we shall find the missing value of Y. Let the missing value in Y-series be a.
Then the mean $\bar{Y}$ is given by :
$$\bar{Y} = \frac{\sum Y}{n} = \frac{9 + 11 + a + 8 + 7}{5} = \frac{35 + a}{5} = 8 \ \text{(given)}$$
$$\Rightarrow \quad 35 + a = 5 \times 8 = 40 \quad \Rightarrow \quad a = 40 - 35 = 5$$
CALCULATION OF CORRELATION COEFFICIENT
  X      Y     X – X̄ = X – 6   Y – Ȳ = Y – 8   (X – X̄)²   (Y – Ȳ)²   (X – X̄)(Y – Ȳ)
  6      9          0               1              0          1            0
  2     11        – 4               3             16          9         – 12
 10      5          4             – 3             16          9         – 12
  4      8        – 2               0              4          0            0
  8      7          2             – 1              4          1          – 2
∑X = 30  ∑Y = 40    0               0      ∑(X – X̄)² = 40  ∑(Y – Ȳ)² = 20  ∑(X – X̄)(Y – Ȳ) = – 26
We have
$$\bar{X} = \frac{\sum X}{5} = \frac{30}{5} = 6, \qquad \bar{Y} = \frac{\sum Y}{5} = \frac{40}{5} = 8$$
Karl Pearson’s correlation coefficient is given by :
$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = \frac{-26}{\sqrt{40 \times 20}} = \frac{-26}{\sqrt{800}} = \frac{-26}{28·2843} = -0·9192 \simeq -0·92$$
Example 8·4. Calculate the coefficient of correlation between X and Y series from the following data :
Series                                  :    X       Y
No. of pairs of observations            :   15      15
Arithmetic mean                         :   25      18
Standard deviation                      :  3·01    3·03
Sum of squares of deviations from mean  :  136     138
Summation of product deviations of X and Y series from their respective arithmetic means = 122.
Solution. In the usual notations, we are given :
$$n = 15, \quad \bar{x} = 25, \quad \bar{y} = 18, \quad \sigma_x = 3·01, \quad \sigma_y = 3·03$$
$$\sum (x - \bar{x})^2 = 136, \quad \sum (y - \bar{y})^2 = 138, \quad \text{and} \quad \sum (x - \bar{x})(y - \bar{y}) = 122.$$
Karl Pearson’s correlation coefficient between X-series and Y-series is given by
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n\,\sigma_x \sigma_y} = \frac{122}{15 \times 3·01 \times 3·03} = \frac{122}{136·8045} = 0·8918$$
Remark. Here some of the given data are superfluous viz., $\bar{x}$, $\bar{y}$, $\sum (x - \bar{x})^2$, $\sum (y - \bar{y})^2$.
Aliter. We may also compute the correlation coefficient using the formula
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{122}{\sqrt{136 \times 138}} = \frac{122}{\sqrt{18768}} = \frac{122}{136·9964} = 0·8905$$
If we use this formula, then the data relating to n, $\bar{x}$, $\bar{y}$, $\sigma_x$ and $\sigma_y$ are superfluous.
Example 8·5. The coefficient of correlation between two variables X and Y is 0·48. The covariance is 36. The variance of X is 16. Find the standard deviation of Y.
Solution.
We are given :
$$r_{xy} = 0·48, \quad \operatorname{Cov}(X, Y) = 36, \quad \sigma_x^2 = 16 \ \Rightarrow \ \sigma_x = 4$$
We have :
$$r_{xy} = \frac{\operatorname{Cov}(X, Y)}{\sigma_x \sigma_y} \quad \Rightarrow \quad \sigma_y = \frac{\operatorname{Cov}(X, Y)}{\sigma_x\, r_{xy}} = \frac{36}{4 \times 0·48} = \frac{9}{0·48} = 18·75.$$
Example 8·6. Given the following information : $r_{xy}$ = 0·8, ∑xy = 60, $\sigma_y$ = 2·5 and ∑x² = 90, where x and y are the deviations from the respective means, find the number of items (n). [Delhi Univ. B.Com. (Hons.), 2002]
Solution. We have :
$$\sigma_x^2 = \frac{1}{n}\sum (X - \bar{X})^2 = \frac{1}{n}\sum x^2 = \frac{90}{n} \qquad (\because \ X - \bar{X} = x)$$
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n\,\sigma_X \sigma_Y} = \frac{\sum xy}{n\,\sigma_x \sigma_y} \qquad [\because \ X - \bar{X} = x ; \ Y - \bar{Y} = y ; \ \sigma_X = \sigma_x, \ \sigma_Y = \sigma_y]$$
$$\Rightarrow \quad 0·8 = \frac{60}{n \sqrt{\dfrac{90}{n}} \cdot (2·5)} = \frac{60 \times 2}{\sqrt{n}\,\sqrt{90} \times 5} = \frac{24}{\sqrt{n}\,\sqrt{90}}$$
$$\Rightarrow \quad n = \frac{24 \times 24}{90 \times 0·8 \times 0·8} = 10.$$
Example 8.7. The covariance of two perfectly correlated variables X and Y is 96. Determine $\sigma_X$ and $\sigma_Y$ if it is known that variance of X and that of Y is in the ratio of 4 : 9. [Delhi Univ. B.A. (Econ. Hons.), 2008]
Solution. We are given :
$$\operatorname{Cov}(X, Y) = 96 \quad …(1) ; \qquad \frac{\sigma_X^2}{\sigma_Y^2} = \frac{4}{9} \ \Rightarrow \ \frac{\sigma_X}{\sigma_Y} = \frac{2}{3} \ \Rightarrow \ \sigma_X = \frac{2}{3}\,\sigma_Y \quad …(2)$$
The correlation coefficient r = r(X, Y) is given by
$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{96}{\frac{2}{3}\,\sigma_Y \cdot \sigma_Y} = \frac{144}{\sigma_Y^2} \qquad \text{[From (1) and (2)]} \quad …(3)$$
For perfectly correlated variables, we have r = r(X, Y) = ± 1. But since Cov (X, Y) = 96 > 0, r is also positive. Hence r = + 1.
Substituting in (3), we get :
$$1 = \frac{144}{\sigma_Y^2} \ \Rightarrow \ \sigma_Y^2 = 144 \ \Rightarrow \ \sigma_Y = 12$$
Substituting in (2), we get :
$$\sigma_X = \frac{2}{3}\,\sigma_Y = \frac{2}{3} \times 12 = 8$$
∴ $\sigma_X$ = 8, $\sigma_Y$ = 12.
Example 8.8. Given below is the information relating to marks in Statistics (X) and marks in Accountancy (Y) obtained by the students of a class.
Covariance between X and Y = 144 ; Second moment of X about 20 = 244 ;
First moment of X about 20 = 10 ; Arithmetic mean of Y = 45 ;
Correlation coefficient between X and Y = 0·75.
Calculate coefficient of variation of marks of Statistics and that of marks of Accountancy. In which subject the performance of students is more consistent ? [Delhi Univ. B.Com. (Hons.), (External), 2005]
Solution.
We are given : Cov (X, Y) = 144 ; μ1′ = First moment of X about ‘20’ = 10 ; μ2′ = Second moment of X about ‘20’ = 244 ;
$$r_{XY} = 0·75 ; \quad \bar{Y} = 45 \quad …(i)$$
$$\Rightarrow \quad \bar{X} = A + \mu_1' = 20 + 10 = 30 \quad …(ii)$$
$$\Rightarrow \quad \mu_2 = \mu_2' - \mu_1'^2 = 244 - 100 = 144 \quad …(iii)$$
$$\therefore \quad \sigma_X^2 = \mu_2 = 144 \ \Rightarrow \ \sigma_X = \sqrt{144} = 12 \quad …(iv)$$
$$r_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \ \Rightarrow \ 0·75 = \frac{144}{12\,\sigma_Y} \ \Rightarrow \ \sigma_Y = \frac{12}{(3/4)} = 16 \quad …(v)$$
The coefficient of variation (C.V.) of marks of Statistics and Accountancy are given by :
$$\text{C.V. (Statistics)} = \text{C.V.}(X) = \frac{\sigma_X}{\bar{X}} \times 100 = \frac{12}{30} \times 100 = 40 \qquad \text{[From (ii) and (iv)]}$$
$$\text{C.V. (Accountancy)} = \text{C.V.}(Y) = \frac{\sigma_Y}{\bar{Y}} \times 100 = \frac{16}{45} \times 100 = 35·56 \qquad \text{[From (i) and (v)]}$$
Since C.V. (Y) < C.V. (X), the performance of the students is more consistent in accountancy.
Example 8·9. A computer while calculating correlation coefficient between two variables X and Y from 25 pairs of observations obtained the following results :
n = 25, ∑X = 125, ∑X² = 650, ∑Y = 100, ∑Y² = 460, ∑XY = 508
It was, however, discovered at the time of checking that two pairs of observations were not correctly copied. They were taken as (6, 14) and (8, 6) while the correct values were (8, 12) and (6, 8). Prove that the correct value of the correlation coefficient should be 2/3. [Delhi Univ. B.Com. (Hons.) 2008]
Solution.
Corrected ∑X = 125 – 6 – 8 + 8 + 6 = 125 ; Corrected ∑Y = 100 – 14 – 6 + 12 + 8 = 100
Corrected ∑X² = 650 – 6² – 8² + 8² + 6² = 650 ; Corrected ∑Y² = 460 – 14² – 6² + 12² + 8² = 436
Corrected ∑XY = 508 – 6 × 14 – 8 × 6 + 8 × 12 + 6 × 8 = 520
Corrected value of r is given by :
$$r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right] \times \left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} = \frac{25 \times 520 - 125 \times 100}{\sqrt{\left[25 \times 650 - (125)^2\right] \times \left[25 \times 436 - (100)^2\right]}}$$
$$= \frac{13000 - 12500}{\sqrt{(16250 - 15625) \times (10900 - 10000)}} = \frac{500}{\sqrt{625 \times 900}} = \frac{500}{25 \times 30} = \frac{2}{3}$$

8·4·1. Properties of Correlation Coefficient
Property I. Limits for Correlation Coefficient
Pearsonian correlation coefficient cannot exceed 1 numerically. In other words it lies between –1 and +1.
Symbolically,

\[ -1 \le r \le 1 \tag{8·6} \]

Remarks 1. This result provides a check on our calculations. If, in any problem, the obtained value of r lies outside the limits ±1, there is some mistake in the calculations.

2. r = +1 implies perfect positive correlation between the variables and r = −1 implies perfect negative correlation between the variables.

Property II. Correlation coefficient is independent of the change of origin and scale. Mathematically, if X and Y are the given variables and they are transformed to the new variables U and V by the change of origin and scale, viz.,

\[ u = \frac{x - A}{h} \quad \text{and} \quad v = \frac{y - B}{k}; \qquad h > 0,\; k > 0 \tag{8·7} \]

where A, B, h and k are constants, then the correlation coefficient between x and y is the same as the correlation coefficient between u and v, i.e.,

\[ r(x, y) = r(u, v) \;\Rightarrow\; r_{xy} = r_{uv} \tag{8·7a} \]

Remark. This is one of the very important properties of the correlation coefficient and is extremely helpful in the numerical computation of r. We had already stated [cf. Remark to § 8·4] that formulae (8·3) and (8·5) become quite tedious to use in numerical problems if x̄ and/or ȳ are fractions or if x and y are large. In such cases we can conveniently change the origin and scale (if possible) in X and/or Y to get new variables U and V, and compute the correlation between U and V by the formula

\[ r_{uv} = \frac{\sum (u - \bar{u})(v - \bar{v})}{\sqrt{\sum (u - \bar{u})^2 \cdot \sum (v - \bar{v})^2}} = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} \tag{8·7b} \]

Using Property II, we finally get r_xy = r_uv.

Property III. Two independent variables are uncorrelated but the converse is not true.

Proof. We have proved in (8·4) that

\[ \operatorname{Cov}(x, y) = \frac{1}{n}\sum xy - \bar{x}\,\bar{y} = E(xy) - E(x)E(y) \tag{*} \]

because the expected value of a variable is nothing but its arithmetic mean [see Chapter 13 on Expectation].

If x and y are independent variables then [cf. Chapter 13 on Expectation],

\[ E(xy) = E(x)E(y) \]

Substituting in (*), we get

\[ \operatorname{Cov}(x, y) = E(x)E(y) - E(x)E(y) = 0 \]

Hence, if x and y are independent, then

\[ r_{xy} = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{0}{\sigma_x \sigma_y} = 0 \]

⇒ Independent variables are uncorrelated.

Converse. However, the converse of the theorem is not true, i.e., uncorrelated variables need not necessarily be independent. As an illustration, consider the following bivariate distribution.

| | | | | | | | | | Total |
|---|---|---|---|---|---|---|---|---|---|
| x | −4 | −3 | −2 | −1 | 1 | 2 | 3 | 4 | ∑x = 0 |
| y | 16 | 9 | 4 | 1 | 1 | 4 | 9 | 16 | ∑y = 60 |
| xy | −64 | −27 | −8 | −1 | 1 | 8 | 27 | 64 | ∑xy = 0 |

\[ r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{8 \times 0 - 0 \times 60}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = 0, \]

because zero divided by any finite non-zero quantity is zero. Hence, in the above example the variables x and y are uncorrelated. But if we examine the data carefully, we find that x and y are not independent but are connected by the relation y = x². The above example illustrates that uncorrelated variables need not be independent.

Remarks 1. The terms 'uncorrelated' and 'independent' should not be confused. r_xy = 0, i.e., the absence of correlation between the variables x and y, simply implies the absence of any linear (straight line) relationship between them. They may, however, be related in some other form (other than a straight line), e.g., quadratic (as we have seen in the above example), logarithmic or trigonometric.

2. Some more properties of the correlation coefficient will be discussed in the next chapter on Regression Analysis.

Property IV.

\[ r(aX + b,\; cY + d) = \frac{a \times c}{|a \times c|} \cdot r(X, Y) \tag{8·8} \]

where |a × c| is the modulus (absolute) value of a × c.

Remarks 1. Since the correlation coefficient is independent of change of origin, we get:

\[ r(aX + b,\; cY + d) = r(aX,\; cY) = \frac{a \times c}{|a \times c|} \cdot r(X, Y) \]

2. Some Results on Covariance. If U = aX + b and V = cY + d, then

\[ \operatorname{Cov}(U, V) = ac \cdot \operatorname{Cov}(X, Y) \;\Rightarrow\; \operatorname{Cov}(aX + b,\; cY + d) = ac \cdot \operatorname{Cov}(X, Y) \tag{i} \]

In particular, taking b = 0 = d in (i), we get:

\[ \operatorname{Cov}(aX,\; cY) = ac \cdot \operatorname{Cov}(X, Y) \tag{ii} \]

Taking a = c = 1 in (i), we get:

\[ \operatorname{Cov}(X + b,\; Y + d) = \operatorname{Cov}(X, Y) \tag{iii} \]

The above results can be restated as follows:

\[ \left. \begin{aligned} \operatorname{Cov}(X + A,\; Y + B) &= \operatorname{Cov}(X, Y) \\ \operatorname{Cov}(AX,\; BY) &= AB \cdot \operatorname{Cov}(X, Y) \\ \operatorname{Cov}(AX + C,\; BY + D) &= AB \cdot \operatorname{Cov}(X, Y) \end{aligned} \right\} \tag{8·9} \]

where A, B, C and D are constants.

Property V. If the variables x and y are connected by the linear equation ax + by + c = 0, then the correlation coefficient between x and y is (+1) if the signs of a and b are different and (−1) if the signs of a and b are alike. Symbolically, if ax + by + c = 0, then

\[ r = r(x, y) = \begin{cases} +1, & \text{if } a \text{ and } b \text{ are of opposite signs} \\ -1, & \text{if } a \text{ and } b \text{ are of the same sign} \end{cases} \]

Example 8·10. The sum of the products of deviations of X and Y is 3044. The number of pairs of observations is 10. Total of deviations of X = −170. Total of deviations of Y = −20. Total of the squares of deviations of X = 8288. Total of the squares of deviations of Y = 2264. Find the coefficient of correlation when the arbitrary means of X and Y are 82 and 68 respectively. [Delhi Univ. B.Com. (Pass), 2001]

Solution. Let U = X − 82 and V = Y − 68. Then we are given:

n = 10, ∑UV = 3044, ∑U = −170, ∑V = −20, ∑U² = 8288, ∑V² = 2264.

Since the correlation coefficient (r) is independent of change of origin, we have

\[ r(X, Y) = r(U, V) = \frac{n\sum UV - (\sum U)(\sum V)}{\sqrt{[n\sum U^2 - (\sum U)^2]\,[n\sum V^2 - (\sum V)^2]}} = \frac{10 \times 3044 - (-170)(-20)}{\sqrt{[10 \times 8288 - (-170)^2]\,[10 \times 2264 - (-20)^2]}} \]

\[ = \frac{30440 - 3400}{\sqrt{(82880 - 28900)(22640 - 400)}} = \frac{27040}{\sqrt{53980 \times 22240}} = \frac{27040}{232.34 \times 149.13} = \frac{27040}{34648.86} = +0.78 \]

Example 8·11. Calculate the correlation coefficient r(x, y) from the following data:

n = 10, ∑x = 140, ∑y = 150, ∑(x − 10)² = 180, ∑(y − 15)² = 215, ∑(x − 10)(y − 15) = 60

[Delhi Univ. B.Com. (Hons.), 2007]

Solution. Let us take u = x − 10 and v = y − 15.
Then, we have

∑u = ∑(x − 10) = ∑x − 10 × n = 140 − 100 = 40
∑v = ∑(y − 15) = ∑y − 15 × n = 150 − 150 = 0
∑u² = ∑(x − 10)² = 180 ; ∑v² = ∑(y − 15)² = 215 ; ∑uv = ∑(x − 10)(y − 15) = 60

\[ \therefore\; r_{xy} = r_{uv} = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} = \frac{10 \times 60 - 40 \times 0}{\sqrt{[10 \times 180 - (40)^2]\,[10 \times 215 - 0]}} \]

\[ = \frac{600}{\sqrt{200 \times 2150}} = \frac{6}{\sqrt{43}} = \frac{6}{6.557} = 0.915 \]

Example 8·12. Calculate the coefficient of correlation for the ages of husbands and wives:

| Age of Husband (years) | 23 | 27 | 28 | 29 | 30 | 31 | 33 | 35 | 36 | 39 |
|---|---|---|---|---|---|---|---|---|---|---|
| Age of Wife (years) | 18 | 22 | 23 | 24 | 25 | 26 | 28 | 29 | 30 | 32 |

Solution.

CALCULATIONS FOR CORRELATION COEFFICIENT

| x | y | u = x − 31 | v = y − 25 | u² | v² | uv |
|---|---|---|---|---|---|---|
| 23 | 18 | −8 | −7 | 64 | 49 | 56 |
| 27 | 22 | −4 | −3 | 16 | 9 | 12 |
| 28 | 23 | −3 | −2 | 9 | 4 | 6 |
| 29 | 24 | −2 | −1 | 4 | 1 | 2 |
| 30 | 25 | −1 | 0 | 1 | 0 | 0 |
| 31 | 26 | 0 | 1 | 0 | 1 | 0 |
| 33 | 28 | 2 | 3 | 4 | 9 | 6 |
| 35 | 29 | 4 | 4 | 16 | 16 | 16 |
| 36 | 30 | 5 | 5 | 25 | 25 | 25 |
| 39 | 32 | 8 | 7 | 64 | 49 | 56 |
| ∑x = 311 | ∑y = 257 | ∑u = 1 | ∑v = 7 | ∑u² = 203 | ∑v² = 163 | ∑uv = 179 |

Karl Pearson's correlation coefficient between U and V is given by

\[ r_{uv} = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} = \frac{10 \times 179 - 1 \times 7}{\sqrt{[10 \times 203 - (1)^2]\,[10 \times 163 - (7)^2]}} \]

\[ = \frac{1790 - 7}{\sqrt{(2030 - 1)(1630 - 49)}} = \frac{1783}{\sqrt{2029 \times 1581}} = \frac{1783}{45.04 \times 39.76} = \frac{1783}{1790.79} = 0.9956 \]

Since Karl Pearson's correlation coefficient (r) is independent of change of origin, we get r_xy = r_uv = 0.9956.

Note. It may be noted that the values of x and y, except for the last three pairs, are connected by the linear relation y = x − 5. Further, as the value of x decreases (increases), the value of y also decreases (increases). Hence, we may expect a very high degree of positive correlation (almost perfect), with the value of r approaching +1.

Example 8·13. Find Karl Pearson's coefficient of correlation between sales and expenses of the following ten firms:

| Firm | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sales ('000 units) | 50 | 50 | 55 | 60 | 65 | 65 | 65 | 60 | 60 | 50 |
| Expenses ('000 rupees) | 11 | 13 | 14 | 16 | 16 | 15 | 15 | 14 | 13 | 13 |

Solution.
Let sales (in thousand units) of a firm be denoted by X and expenses (in '000 rupees) be denoted by Y. It may be noted that we can take out the factor 5 common in the X series. Hence, it will be convenient to change the scale also in X. Taking 65 and 13 as working means for X and Y respectively, let us take:

u = (x − 65)/5 ; v = y − 13.

CALCULATIONS FOR CORRELATION COEFFICIENT

| Firm | x | y | u = (x − 65)/5 | v = y − 13 | u² | v² | uv |
|---|---|---|---|---|---|---|---|
| 1 | 50 | 11 | −3 | −2 | 9 | 4 | 6 |
| 2 | 50 | 13 | −3 | 0 | 9 | 0 | 0 |
| 3 | 55 | 14 | −2 | 1 | 4 | 1 | −2 |
| 4 | 60 | 16 | −1 | 3 | 1 | 9 | −3 |
| 5 | 65 | 16 | 0 | 3 | 0 | 9 | 0 |
| 6 | 65 | 15 | 0 | 2 | 0 | 4 | 0 |
| 7 | 65 | 15 | 0 | 2 | 0 | 4 | 0 |
| 8 | 60 | 14 | −1 | 1 | 1 | 1 | −1 |
| 9 | 60 | 13 | −1 | 0 | 1 | 0 | 0 |
| 10 | 50 | 13 | −3 | 0 | 9 | 0 | 0 |
| Total | ∑x = 580 | ∑y = 140 | ∑u = −14 | ∑v = 10 | ∑u² = 34 | ∑v² = 32 | ∑uv = 0 |

Karl Pearson's correlation coefficient between u and v is given by

\[ r_{uv} = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} = \frac{10 \times 0 - (-14) \times 10}{\sqrt{[10 \times 34 - (-14)^2]\,[10 \times 32 - (10)^2]}} \]

\[ = \frac{140}{\sqrt{(340 - 196)(320 - 100)}} = \frac{140}{\sqrt{144 \times 220}} = \frac{140}{\sqrt{31680}} = \frac{140}{177.99} = 0.7866 \]

Since the correlation coefficient is independent of change of origin and scale, we finally have r_xy = r_uv = 0.7866.

Aliter. We have:

\[ \bar{x} = \frac{\sum x}{n} = \frac{580}{10} = 58; \qquad \bar{y} = \frac{\sum y}{n} = \frac{140}{10} = 14. \]

Since x̄ and ȳ are integers, it will be convenient to compute r by taking the deviations from the means directly, i.e., by taking dx = x − x̄ = x − 58 and dy = y − ȳ = y − 14, and using formula (8·3a). [Try it.]

Example 8·14. Find Karl Pearson's coefficient of correlation between the age and the playing habit of the people from the following information. Also mention what your calculated 'r' indicates.

| Age group (years) | No. of people | No. of players |
|---|---|---|
| 15 and less than 20 | 200 | 150 |
| 20 and less than 25 | 270 | 162 |
| 25 and less than 30 | 340 | 170 |
| 30 and less than 35 | 360 | 180 |
| 35 and less than 40 | 400 | 180 |
| 40 and less than 45 | 300 | 120 |

[Delhi Univ. B.Com. (Hons.), (External), 2006; Delhi Univ. B.Com. (Pass), 2002]

Solution. We want to find Karl Pearson's correlation coefficient between the age and the playing habit of the people. To do this, we first express the number of players in each age group on a common base, i.e., we find the number of players out of a fixed number of persons (a common base), which may be taken as 100 or 1000 or some other convenient figure. Here we express the number of players as a percentage of the total people in each age group.

| Age group (yrs.) (1) | No. of people (2) | No. of players (3) | Percentage of players (y) (4) = (3)/(2) × 100 |
|---|---|---|---|
| 15–20 | 200 | 150 | 150/200 × 100 = 75 |
| 20–25 | 270 | 162 | 162/270 × 100 = 60 |
| 25–30 | 340 | 170 | 170/340 × 100 = 50 |
| 30–35 | 360 | 180 | 180/360 × 100 = 50 |
| 35–40 | 400 | 180 | 180/400 × 100 = 45 |
| 40–45 | 300 | 120 | 120/300 × 100 = 40 |

Now we compute Karl Pearson's correlation coefficient between age (x) and the percentage of players in each age group (y).

CALCULATIONS FOR CORRELATION COEFFICIENT

| Age group | Mid-value (x) | y | u = (x − 27.5)/5 | v = (y − 50)/5 | u² | v² | uv |
|---|---|---|---|---|---|---|---|
| 15–20 | 17.5 | 75 | −2 | 5 | 4 | 25 | −10 |
| 20–25 | 22.5 | 60 | −1 | 2 | 1 | 4 | −2 |
| 25–30 | 27.5 | 50 | 0 | 0 | 0 | 0 | 0 |
| 30–35 | 32.5 | 50 | 1 | 0 | 1 | 0 | 0 |
| 35–40 | 37.5 | 45 | 2 | −1 | 4 | 1 | −2 |
| 40–45 | 42.5 | 40 | 3 | −2 | 9 | 4 | −6 |
| Total | | | ∑u = 3 | ∑v = 4 | ∑u² = 19 | ∑v² = 34 | ∑uv = −20 |

Since the correlation coefficient is independent of change of origin and scale, we have:

\[ r_{xy} = r_{uv} = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} = \frac{6 \times (-20) - 3 \times 4}{\sqrt{[6 \times 19 - (3)^2]\,[6 \times 34 - (4)^2]}} \]

\[ = \frac{-120 - 12}{\sqrt{(114 - 9)(204 - 16)}} = \frac{-132}{\sqrt{105 \times 188}} = \frac{-132}{\sqrt{19740}} = \frac{-132}{140.4991} = -0.9395 \]

Thus, we conclude that there is a very high degree of negative correlation (almost perfect negative correlation) between age (x) and playing habit (y). This implies that with advancement in age, people's interest in playing goes on decreasing, and the scatter diagram of the (x, y) values gives points clustering almost around a straight line starting from the top left and going to the bottom right.

Example 8·15.
(i) Compute the correlation coefficient between the corresponding values of X and Y in the following table:

| X | 2 | 4 | 5 | 6 | 8 | 11 |
|---|---|---|---|---|---|---|
| Y | 18 | 12 | 10 | 8 | 7 | 5 |

(ii) Multiply each X value in the table by 2 and add 6. Multiply each value of Y in the table by 3 and subtract 15. Find the correlation coefficient between the two new sets of values. Explain why you do or do not obtain the same result as in (i).

Solution. (i)

COMPUTATION OF CORRELATION COEFFICIENT

| X | Y | X − X̄ = X − 6 | Y − Ȳ = Y − 10 | (X − X̄)² | (Y − Ȳ)² | (X − X̄)(Y − Ȳ) |
|---|---|---|---|---|---|---|
| 2 | 18 | −4 | 8 | 16 | 64 | −32 |
| 4 | 12 | −2 | 2 | 4 | 4 | −4 |
| 5 | 10 | −1 | 0 | 1 | 0 | 0 |
| 6 | 8 | 0 | −2 | 0 | 4 | 0 |
| 8 | 7 | 2 | −3 | 4 | 9 | −6 |
| 11 | 5 | 5 | −5 | 25 | 25 | −25 |
| ∑X = 36 | ∑Y = 60 | ∑(X − X̄) = 0 | ∑(Y − Ȳ) = 0 | ∑(X − X̄)² = 50 | ∑(Y − Ȳ)² = 106 | ∑(X − X̄)(Y − Ȳ) = −67 |

We have:

\[ \bar{X} = \frac{1}{6}\sum X = \frac{36}{6} = 6 \quad \text{and} \quad \bar{Y} = \frac{1}{6}\sum Y = \frac{60}{6} = 10 \]

Hence the correlation coefficient between X and Y is given by:

\[ r_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{[\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2]^{1/2}} = \frac{-67}{\sqrt{50 \times 106}} = \frac{-67}{\sqrt{5300}} = \frac{-67}{72.80} = -0.92 \]

Hence, the variables X and Y are highly negatively correlated.

(ii) Let us define new variables U and V as:

U = 2X + 6 and V = 3Y − 15

We are now required to find the correlation coefficient between the new sets of values of the variables U and V, as given in the following table.

COMPUTATION OF CORRELATION COEFFICIENT BETWEEN U AND V

| X | Y | U = 2X + 6 | V = 3Y − 15 | U² | V² | UV |
|---|---|---|---|---|---|---|
| 2 | 18 | 10 | 39 | 100 | 1521 | 390 |
| 4 | 12 | 14 | 21 | 196 | 441 | 294 |
| 5 | 10 | 16 | 15 | 256 | 225 | 240 |
| 6 | 8 | 18 | 9 | 324 | 81 | 162 |
| 8 | 7 | 22 | 6 | 484 | 36 | 132 |
| 11 | 5 | 28 | 0 | 784 | 0 | 0 |
| Totals | | ∑U = 108 | ∑V = 90 | ∑U² = 2144 | ∑V² = 2304 | ∑UV = 1218 |

\[ \therefore\; r_{uv} = \frac{n\sum UV - (\sum U)(\sum V)}{\sqrt{n\sum U^2 - (\sum U)^2}\,\sqrt{n\sum V^2 - (\sum V)^2}} = \frac{6 \times 1218 - 108 \times 90}{\sqrt{6 \times 2144 - (108)^2}\,\sqrt{6 \times 2304 - (90)^2}} \]

\[ = \frac{7308 - 9720}{\sqrt{(12864 - 11664)(13824 - 8100)}} = \frac{-2412}{\sqrt{1200 \times 5724}} = \frac{-2412}{\sqrt{6868800}} = \frac{-2412}{2620.8395} = -0.92 \]

This is the same value as in (i), as it must be: U and V are obtained from X and Y by a change of origin and scale with positive multipliers (2 and 3), and the correlation coefficient is independent of such changes (Property II).

Example 8·16.
If the relation between two random variables x and y is 2x + 3y = 4, then the correlation coefficient between them is: (i) −2/3, (ii) 1, (iii) −1, (iv) none of these. [I.C.W.A. (Intermediate), June 2000]

Solution. Since x and y are connected by the linear relation

\[ 2x + 3y = 4 \;\Rightarrow\; y = -\tfrac{2}{3}x + \tfrac{4}{3}, \tag{*} \]

there is perfect correlation between x and y, i.e., r = ±1. Further, from (*) we observe that as x increases, y decreases. Hence, there is perfect negative correlation between x and y.

∴ r = −1 ⇒ (iii) is the correct answer.

Aliter. If x and y are connected by the linear relation ax + by + c = 0, then

\[ r = r(x, y) = \begin{cases} +1, & \text{if } a \text{ and } b \text{ have opposite signs} \\ -1, & \text{if } a \text{ and } b \text{ have the same sign} \end{cases} \]

We are given 2x + 3y = 4. Since a = 2 and b = 3 have the same sign, r = r(x, y) = −1.

Example 8·17. For the bivariate data [(x, y)] = [(20, 5), (21, 4), (22, 3)], the correlation coefficient between x and y is: (i) 0, (ii) 1, (iii) −1, (iv) 0.5. [I.C.W.A. (Intermediate), June 2002]

Solution. From the given data, we observe that 20 + 5 = 25, 21 + 4 = 25 and 22 + 3 = 25. Thus, x and y are connected by the linear relation:

\[ x + y = 25 \tag{*} \]

⇒ There is perfect correlation between x and y ⇒ r = ±1 … (**)

From (*), we get y = 25 − x.

∴ As x increases, y decreases (by the same amount) ⇒ x and y are negatively correlated … (***)

From (**) and (***), we conclude that r = r(x, y) = −1.

Aliter. We can compute the value of r(x, y) from the given data by using the formula. This is left as an exercise to the reader.

Example 8·18. (a) The correlation coefficient between two variables x and y is found to be 0.4. What is the correlation between 2x and (−y)? [Delhi Univ. B.A. (Econ., Hons.), 1997]

(b) "If the correlation coefficient between two variables x and y is positive, then the coefficient of correlation between −x and −y is also positive." Comment. [Delhi Univ. B.A. (Econ. Hons.), 1996]

Solution.
(a) We are given:

\[ r(x, y) = 0.4 \tag{i} \]

We know that:

\[ r(aX,\; cY) = \frac{a \times c}{|a| \times |c|} \cdot r(X, Y) \tag{*} \]

Using (*) and (i), we get:

\[ r(2x,\; -y) = r(2x,\; -1 \cdot y) = \frac{2 \times (-1)}{|2| \times |-1|}\, r(x, y) = \frac{-2 \times 0.4}{2 \times 1} = -0.4 \]

(b) We are given:

\[ r(x, y) > 0 \tag{ii} \]

Using (*), we get

\[ r(-x,\; -y) = r(-1 \cdot x,\; -1 \cdot y) = \frac{(-1) \times (-1)}{|-1| \times |-1|} \cdot r(x, y) = r(x, y) \]

Hence, if r(x, y) is positive, then r(−x, −y) is also positive.

8·4·2. Assumptions Underlying Karl Pearson's Correlation Coefficient. The Pearsonian correlation coefficient r is based on the following assumptions:

(i) The variables X and Y under study are linearly related. In other words, the scatter diagram of the data will give a straight line curve.

(ii) Each of the variables (series) is being affected by a large number of independent contributory causes of such a nature as to produce a normal distribution. For example, the variables (series) relating to ages, heights, weights, supply, price, etc., conform to this assumption. In the words of Karl Pearson: "The sizes of the complex organs (something measurable) are determined by a great variety of independent contributing causes, for example, climate, nourishment, physical training and innumerable other causes which cannot be individually observed or their effects measured." Karl Pearson further observes, "The variations in intensity of the contributory causes are small as compared with their absolute intensity and these variations follow the normal law of distribution."

(iii) The forces so operating on each of the variable series are not independent of each other but are related in a causal fashion. In other words, a cause and effect relationship exists between different forces operating on the items of the two variable series. These forces must be common to both the series. If the operating forces are entirely independent of each other and not related in any fashion, then there cannot be any correlation between the variables under study.
For example, the correlation coefficient between:

(a) the series of heights and income of individuals over a period of time,
(b) the series of marriage rate and the rate of agricultural production in a country over a period of time,
(c) the series relating to the size of the shoe and intelligence of a group of individuals,

should be zero, since the forces affecting the two variable series in each of the above cases are entirely independent of each other. However, if in any of the above cases the value of r for a given set of data is not zero, then such correlation is termed chance correlation, or spurious or nonsense correlation. [Also see § 8·1·2.]

8·4·3. Interpretation of r. The following general points may be borne in mind while interpreting an observed value of the correlation coefficient r:

(i) r = +1 implies that there is perfect positive correlation between the variables. In other words, the scatter diagram will be a straight line starting from the bottom left and rising upwards to the top right, as shown in Fig. 8·1, § 8·3.

(ii) If r = −1, there is perfect negative correlation between the variables. In this case the scatter diagram will again be a straight line, as shown in Fig. 8·1, § 8·3.

(iii) If r = 0, the variables are uncorrelated. In other words, there is no linear (straight line) relationship between the variables. However, r = 0 does not imply that the variables are independent [cf. Property III, page 8·11].

(iv) For other values of r lying between +1 and −1, there are no set guidelines for its interpretation. The most we can conclude is that the nearer the numerical value of r is to 1, the closer is the relationship between the variables, and the nearer the value of r is to 0, the less close is the relationship between them. One should be very careful in interpreting the value of r, as it is often misinterpreted.

(v) The reliability or the significance of the value of the correlation coefficient depends on a number of factors.
One of the ways of testing the significance of r is finding its probable error [cf. § 8·5], which, in addition to the value of r, takes into account the size of the sample also.

(vi) Another more useful measure for interpreting the value of r is the coefficient of determination [cf. § 8·9]. It is observed there that the closeness of the relationship between two variables is not proportional to r.

8·5. PROBABLE ERROR

After computing the value of the correlation coefficient, the next step is to find the extent to which it is dependable. The probable error of the correlation coefficient, usually denoted by P.E.(r), is an old measure of testing the reliability of an observed value of the correlation coefficient in so far as it depends upon the conditions of random sampling.

If r is the observed correlation coefficient in a sample of n pairs of observations, then its standard error, usually denoted by S.E.(r), is given by:

\[ \text{S.E.}(r) = \frac{1 - r^2}{\sqrt{n}} \tag{8·10} \]

The probable error of the correlation coefficient is given by:

\[ \text{P.E.}(r) = 0.6745 \times \text{S.E.}(r) = 0.6745\,\frac{1 - r^2}{\sqrt{n}} \tag{8·11} \]

The reason for taking the factor 0.6745 is that in a normal distribution 50% of the observations lie in the range μ ± 0.6745σ, where μ is the mean and σ is the s.d.

According to Secrist, "The probable error of the correlation coefficient is an amount which if added to and subtracted from the mean correlation coefficient, produces amounts within which the chances are even that a coefficient of correlation from a series selected at random will fall."

Uses of Probable Error

1. The probable error of the correlation coefficient may be used to determine the limits within which the population correlation coefficient may be expected to lie. Limits for the population correlation coefficient are

\[ r \pm \text{P.E.}(r) \tag{8·12} \]

This implies that if we take another random sample of the same size n from the same population from which the first sample was taken, then the observed value of the correlation coefficient, say r₁, in the second sample can be expected to lie within the limits given in (8·12).

2. P.E.(r) may be used to test if an observed value of the sample correlation coefficient is significant of any correlation in the population. The following guidelines may be used, where for a negative coefficient r is to be read as its numerical value |r|:

(i) If r < P.E.(r), i.e., if the observed value of r is less than its P.E., then the correlation is not at all significant.
(ii) If r > 6 P.E.(r), i.e., if the observed value of r is greater than 6 times its P.E., then r is definitely significant.
(iii) In other situations, nothing can be concluded with certainty.

Important Remarks 1. Sometimes, P.E. may lead to fallacious conclusions, particularly when n, the number of pairs of observations, is small. In order to use P.E. effectively, n should be fairly large. However, a rigorous test for testing the significance of an observed sample correlation coefficient is provided by Student's t test.

2. P.E. can be used only under the following conditions:

(i) The data must have been drawn from a normal population.
(ii) The conditions of random sampling should prevail in selecting the sampled observations.

Example 8·19. (a) Find Karl Pearson's coefficient of correlation from the following series of marks secured by 10 students in a class test in Mathematics and Statistics:

| Marks in Mathematics | 45 | 70 | 65 | 30 | 90 | 40 | 50 | 75 | 85 | 60 |
|---|---|---|---|---|---|---|---|---|---|---|
| Marks in Statistics | 35 | 90 | 70 | 40 | 95 | 40 | 60 | 80 | 80 | 50 |

Also calculate its probable error. Assume 60 and 65 as working means.

(b) Hence discuss if the value of r is significant or not. Also compute the limits within which the population correlation coefficient may be expected to lie.

Solution. (a) Let the marks in Mathematics be denoted by the variable X and the marks in Statistics by the variable Y.
It may be noted that we can take out the factor 5 common in each of the X and Y series. Hence, it will be convenient to change the scale also. Taking 60 and 65 as working means for the X and Y series respectively, let us take:

\[ u = \frac{x - 60}{5} \quad \text{and} \quad v = \frac{y - 65}{5} \]

CALCULATIONS FOR CORRELATION COEFFICIENT

| x | y | u | v | u² | v² | uv |
|---|---|---|---|---|---|---|
| 45 | 35 | −3 | −6 | 9 | 36 | 18 |
| 70 | 90 | 2 | 5 | 4 | 25 | 10 |
| 65 | 70 | 1 | 1 | 1 | 1 | 1 |
| 30 | 40 | −6 | −5 | 36 | 25 | 30 |
| 90 | 95 | 6 | 6 | 36 | 36 | 36 |
| 40 | 40 | −4 | −5 | 16 | 25 | 20 |
| 50 | 60 | −2 | −1 | 4 | 1 | 2 |
| 75 | 80 | 3 | 3 | 9 | 9 | 9 |
| 85 | 80 | 5 | 3 | 25 | 9 | 15 |
| 60 | 50 | 0 | −3 | 0 | 9 | 0 |
| Total | | 2 | −2 | 140 | 176 | 141 |

We have:

\[ r = \frac{n\sum uv - (\sum u)(\sum v)}{\sqrt{[n\sum u^2 - (\sum u)^2]\,[n\sum v^2 - (\sum v)^2]}} = \frac{10 \times 141 - 2 \times (-2)}{\sqrt{(10 \times 140 - 4)(10 \times 176 - 4)}} = \frac{1414}{\sqrt{1396 \times 1756}} \]

\[ \Rightarrow\; \log r = \log 1414 - \tfrac{1}{2}[\log 1396 + \log 1756] = 3.1504 - \tfrac{1}{2}[3.1449 + 3.2445] = 3.1504 - \tfrac{1}{2} \times 6.3894 \]

\[ = 3.1504 - 3.1947 = -0.0443 = \bar{1}.9557 \]

\[ \Rightarrow\; r = \text{Antilog}(\bar{1}.9557) = 0.9031 \simeq 0.9 \qquad \therefore\; r_{xy} = r_{uv} \simeq 0.9. \]

The probable error of the correlation coefficient is given by:

\[ \text{P.E.}(r) = 0.6745\,\frac{1 - r^2}{\sqrt{n}} = \frac{0.6745 \times 0.19}{\sqrt{10}} = \frac{0.128155}{3.1623} = 0.0405. \]

(b) Significance of r. We have r = 0.9 and 6 P.E.(r) = 6 × 0.0405 = 0.2430. Since r is much greater than 6 P.E.(r), the value of r is highly significant.

Remark. Since the value of r is significant, it implies that ordinarily, the higher the marks of a candidate in Mathematics, the higher is his score in Statistics also, and the lower the marks of a candidate in Mathematics, the lower is his score in Statistics also. However, it does not mean that all the students who are good in Mathematics are also good in Statistics and all those students who are poor in Mathematics are also poor in Statistics. It should be clearly borne in mind that "the coefficient of correlation expresses the relationship between two series and not between the individual items of the series".

Limits for the Population Correlation Coefficient are:

r ± P.E.(r) = 0.9031 ± 0.0405, i.e., 0.8626 and 0.9436.
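The arithmetic of Example 8·19 can be cross-checked with a short Python sketch. It works on u = (x − 60)/5 and v = (y − 65)/5 exactly as in the solution, and then evaluates the probable error with r rounded to 0.9, as the solution does. The helper names `pearson_r` and `probable_error` are assumptions introduced for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r via (n*Sxy - Sx*Sy) / sqrt([n*Sxx - Sx^2][n*Syy - Sy^2])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def probable_error(r, n):
    """P.E.(r) = 0.6745 * (1 - r^2) / sqrt(n), as in formula (8.11)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

maths = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]
stats = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]

# Change of origin and scale, as in the worked solution
u = [(x - 60) / 5 for x in maths]
v = [(y - 65) / 5 for y in stats]

r_uv = pearson_r(u, v)
r_xy = pearson_r(maths, stats)         # equal to r_uv by Property II
pe = probable_error(0.9, len(maths))   # the solution rounds r to 0.9 here

print("r_uv =", round(r_uv, 4), " r_xy =", round(r_xy, 4))
print("P.E. =", round(pe, 4), " 6 P.E. =", round(6 * pe, 4))
print("significant by the 6 P.E. rule:", r_uv > 6 * pe)
print("limits:", round(0.9031 - pe, 4), "to", round(0.9031 + pe, 4))
```

Computing r on both the raw marks and the transformed values is a deliberate choice: agreement of the two illustrates that the change of origin and scale leaves r unchanged.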
This implies that if we take another sample of size 10 from the same population, then its correlation coefficient can be expected to lie between 0.8626 and 0.9436.

Example 8·20. Test the significance of correlation for the following values based on the number of observations (i) 10, and (ii) 100: r = +0.4 and +0.9.

Solution. We know that an observed value of r is definitely significant if

\[ r > 6\,\text{P.E.}(r) \;\Rightarrow\; \frac{r}{\text{P.E.}(r)} > 6 \]

In this case, we have:

| No. of observations | r | P.E.(r) | r / P.E.(r) | Significance of r |
|---|---|---|---|---|
| 10 | 0.4 | 0.6745 × [1 − (0.4)²]/√10 = 0.18 | 0.4/0.18 = 2.22 < 6 | Not significant |
| 100 | 0.4 | 0.6745 × [1 − (0.4)²]/√100 = 0.06 | 0.4/0.06 = 6.67 > 6 | Significant |
| 10 | 0.9 | 0.6745 × [1 − (0.9)²]/√10 = 0.04 | 0.9/0.04 = 22.5 > 6 | Highly significant |
| 100 | 0.9 | 0.6745 × [1 − (0.9)²]/√100 = 0.0128 | 0.9/0.0128 = 70.3 > 6 | Highly significant |

EXERCISE 8·2

1. Explain the meaning and significance of the concept of correlation. How will you calculate it from a statistical point of view?

2. (a) Define Karl Pearson's coefficient of correlation. What is it intended to measure?
(b) What are the special characteristics of Karl Pearson's coefficient of correlation? What are the underlying assumptions on which this formula is based?
(c) How do you interpret a calculated value of Karl Pearson's coefficient of correlation? Discuss in particular the values r = 0, r = −1 and r = +1.

3. (a) Explain what is meant by the coefficient of correlation between two variables. What are the different methods of finding correlation? Distinguish between positive and negative correlation. [Calicut Univ. B.Com., 1997]
(b) Write down an expression for Karl Pearson's coefficient of linear correlation. Why is it termed the coefficient of linear correlation? Explain. [Delhi Univ. B.A. (Econ., Hons.), 1997]

4. Define the product moment correlation coefficient between two variables x and y. State its limits. Draw the scatter diagram for the extreme cases.

5.
(a) If x and y are independent variates, then prove that they are uncorrelated. Is the converse true? Explain your answer with the help of an example.
(b) Prove that two independent variables are uncorrelated. By giving an example, show that the converse is not true. Explain the reason. [Guru Nanak Dev Univ. MBA, 1994]
(c) Comment on the following statement: If the coefficient of correlation between two variables is zero, it does not mean that the variables are unrelated. [Delhi Univ. B.Com. (Hons.), 2002]

6. Discuss the statistical validity of the following statements:
(a) "High positive coefficient of correlation between increase in the sale of a newspaper and increase in the number of crimes leads to the conclusion that newspaper reading may be responsible for the increase in the number of crimes."
(b) "A high positive value of r between the increase in cigarette smoking and increase in lung cancer establishes that cigarette smoking is responsible for lung cancer."
(c) If the coefficient of correlation between the annual value of exports during the last ten years and the annual number of children born during the same period is +0.9, what inference, if any, would you draw? [Delhi Univ. B.A. (Econ. Hons.), 1996]

7. Comment on the following:
(a) "Positive correlation r = 0.9 is found between the number of children born and exports over the last decade." [Delhi Univ. B.Com. (Hons.), 2001]
(b) The correlation coefficient between the railway accidents in a particular year and the babies born in that year was found to be 0.8.

8. (a) Define a scatter diagram. Draw the scatter diagram when (i) r = +1, (ii) r = −1 and (iii) r = 0, where r is the correlation coefficient. [I.C.W.A. (Intermediate), Dec. 2001]
(b) What is a scatter diagram? Give the procedure of drawing a scatter diagram. Draw scatter diagrams when the coefficient of correlation r = +1 and r = −1. [C.A. (Foundation), May 2000]

9.
The production manager of a company maintains that the flow time in days (y) depends on the number of operations (x) to be performed. The following data give the necessary information:

| x | 2 | 2 | 3 | 4 | 4 | 5 | 6 | 6 | 7 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 8 | 13 | 14 | 11 | 20 | 10 | 22 | 26 | 22 | 25 |

Plot a scatter diagram. Calculate the value of Karl Pearson's Product Moment Correlation Coefficient. [I.C.W.A. (Intermediate), Dec. 1995]
Ans. r(x, y) = 0.78.

10. Making use of the data given below, calculate the coefficient of correlation r₁₂.

| Case | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| X₁ | 10 | 6 | 9 | 10 | 12 | 13 | 11 | 9 |
| X₂ | 9 | 4 | 6 | 9 | 11 | 13 | 8 | 4 |

Ans. r₁₂ = 0.8958.

11. Calculate Karl Pearson's coefficient of correlation from the following data, using 20 as the working mean for price and 70 as the working mean for demand:

| Price | 14 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
|---|---|---|---|---|---|---|---|---|---|
| Demand | 84 | 78 | 70 | 75 | 66 | 67 | 62 | 58 | 60 |

[Delhi Univ. B.Com. (Pass), 1999]
Ans. r = −0.954.

12. Calculate Karl Pearson's coefficient of correlation from the following data:

| No. | Subject | Percentage of Marks, First Term | Percentage of Marks, Second Term |
|---|---|---|---|
| 1. | Hindi | 75 | 62 |
| 2. | English | 81 | 68 |
| 3. | Economics | 70 | 65 |
| 4. | Accounts | 76 | 60 |
| 5. | Commerce | 77 | 69 |
| 6. | Mathematics | 81 | 72 |
| 7. | Statistics | 84 | 76 |
| 8. | Costing | 75 | 72 |

[Delhi Univ. B.Com. (Pass), 2000]
Ans. r = 0.623.

13. Calculate Karl Pearson's coefficient of correlation for the following ages of husbands and wives at the time of their marriage:

| Age of husband (in years) | 23 | 27 | 28 | 28 | 28 | 30 | 30 | 33 | 35 | 38 |
|---|---|---|---|---|---|---|---|---|---|---|
| Age of wife (in years) | 18 | 20 | 22 | 27 | 21 | 29 | 27 | 29 | 28 | 29 |

Ans. r = 0.8013.

14. Calculate Pearson's coefficient of correlation from the following data, using 44 and 26 respectively as the origin of X and Y:

| X | 43 | 44 | 46 | 40 | 44 | 42 | 45 | 42 | 38 | 40 | 42 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Y | 29 | 31 | 19 | 18 | 19 | 27 | 27 | 29 | 41 | 30 | 26 | 10 |

[Osmania Univ. B.Com., 1998]
Ans. r_xy = −0.7326.

15. The following table gives the distribution of the total population and those who are totally or partially blind among them.
Find out if there is any relation between age and blindness.

| Age (years) | 0–10 | 10–20 | 20–30 | 30–40 | 40–50 | 50–60 | 60–70 | 70–80 |
|---|---|---|---|---|---|---|---|---|
| No. of persons ('000) | 100 | 60 | 40 | 36 | 24 | 11 | 6 | 3 |
| Blind | 55 | 40 | 40 | 40 | 36 | 22 | 18 | 15 |

Hint. Here we shall find the correlation coefficient between age (X) and the number of blind persons per lakh (Y), as given in the following table.

| X | 5 | 15 | 25 | 35 | 45 | 55 | 65 | 75 |
|---|---|---|---|---|---|---|---|---|
| Y | 55 | 67 | 100 | 111 | 150 | 200 | 300 | 500 |

Ans. r = 0.8982.

16. With the following data in 6 cities, calculate the coefficient of correlation by Pearson's method between the density of population and the death rate.

| Cities | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Area in square miles | 150 | 180 | 100 | 60 | 120 | 80 |
| Population (in '000) | 30 | 90 | 40 | 42 | 72 | 24 |
| No. of deaths | 300 | 1440 | 560 | 840 | 1224 | 312 |

[C.A. (Intermediate), May 1981]
Hint. Find r between Density = Population / Area and Death Rate = (No. of deaths / Population) × 1000.
Ans. r = 0.9876.

17. Calculate the correlation coefficient from the following data:

| X | 12 | 9 | 8 | 10 | 11 | 13 | 7 |
|---|---|---|---|---|---|---|---|
| Y | 14 | 8 | 6 | 9 | 11 | 12 | 3 |

Let now each value of X be multiplied by 2 and then 6 be added to it. Similarly, multiply each value of Y by 3 and subtract 2 from it. What will be the correlation coefficient between the new series of X and Y? [C.A. (Foundation), May 1997]
Ans. Let U = 2X + 6, V = 3Y − 2. Since the correlation coefficient is independent of change of origin and scale, r(U, V) = r(X, Y) = 0.9485.

18. (a) Given: ∑X = 125, ∑Y = 100, ∑X² = 650, ∑Y² = 436, ∑XY = 520 and n = 25, obtain the value of Karl Pearson's correlation coefficient r(X, Y).
Ans. 0.67.

19. You are given the following information relating to a frequency distribution comprising 10 observations:

X̄ = 5.5, Ȳ = 4.0, ∑X² = 385, ∑Y² = 192, ∑(X + Y)² = 947.

Find r_xy. [Punjab Univ. B.Com., 1994]
Hint. Use ∑(X + Y)² = ∑X² + ∑Y² + 2∑XY and find ∑XY = 185.
Ans. r(X, Y) = −0.681.

20.
A computer while calculating the correlation coefficient between the variables X and Y obtained the following results :
N = 30, ∑X = 120, ∑X² = 600, ∑Y = 90, ∑Y² = 250, ∑XY = 356.
It was, however, later discovered at the time of checking that it had copied down two pairs of observations as (X, Y) = (8, 10) and (12, 7), while the correct values were (8, 12) and (10, 8). Obtain the correct value of the correlation coefficient between X and Y. [I.C.W.A. Dec., 2003]
Ans. r = 0·0504.

21. The coefficient of correlation between X and Y for 20 items is 0·3; the mean of X is 15 and that of Y is 20; the standard deviations are 4 and 5 respectively. At the time of calculations one pair (x = 27, y = 30) was wrongly taken as (x = 17, y = 35). Find the correct coefficient of correlation. [Delhi Univ. B.Com. (Hons.), (External), 2007]
Ans. Correct value of correlation coefficient = 0·5153.

22. In order to find the correlation coefficient between two variables X and Y from 12 pairs of observations, the following calculations were made :
∑X = 30, ∑Y = 5, ∑X² = 670, ∑Y² = 285, ∑XY = 334.
On subsequent verification it was found that the pair (X = 11, Y = 4) was copied wrongly, the correct value being (X = 10, Y = 14). Find the correct value of the correlation coefficient.
Ans. 0·78.

23. What do you understand by the probable error of the correlation coefficient ? Explain how it can be used to :
(i) Interpret the significance of an observed value of the sample correlation coefficient.
(ii) Determine the limits for the population correlation coefficient.

24. Calculate the coefficient of correlation and find its probable error from the following data :
X :   7   6   5   4   3   2   1
Y :  18  16  14  12  10   6   8
Ans. rxy = 0·9643; P.E.(r) = 0·0179.

25. Find Karl Pearson’s correlation coefficient between age and playing habit of the following students :
Age (years)     :   15   16   17   18   19   20
No. of students :  250  200  150  120  100   80
Regular players :  200  150   90   48   30   12
Also calculate the probable error and point out if the coefficient of correlation is significant.
[Himachal Pradesh Univ. M.B.A. 1998; Delhi Univ. B.Com. (Hons.), 1996]
Hint. Find r between age (X) and percentage of regular players (Y).
Ans. rxy = – 0·9912 ; P.E.(r) = 0·0048; r is highly significant.

26. Calculate Karl Pearson’s coefficient of correlation for the following series.
Price (in Rs.)   :  110–111  111–112  112–113  113–114  114–115  115–116  116–117  117–118  118–119
Demand (in kg.)  :    600      640      640      680      700      780      830      900     1,000
Also calculate the probable error of the correlation coefficient. From your result can you assert that the demand is correlated with price ?
Hint. Find the correlation coefficient between x : mid-value of price (in Rs.) and y : demand (in kg.).
Ans. r = 0·9651 ; P.E.(r) = 0·0154.

27. (a) A student calculates the value of r as 0·7 when the number of items (n) in the sample is 25. Find the limits within which r lies for another sample from the same universe.
Ans. Required limits for r are 0·767 and 0·633.
(b) A student calculates the value of r as 0·7 when the value of N is 5 and concludes that r is highly significant. Is he correct ? [Delhi Univ. B.Com. (Hons.), 1997]
Ans. r / P.E.(r) = 0·7 √5 / (0·6745 × 0·51) = 4·55 < 6. Not significant.

28. The correlation coefficient between Physics and Mathematics final marks for a group of 21 students was computed to be 0·80. Find 95% confidence limits for the coefficient.
Ans. Required limits = r ± 1·96 P.E.(r) = 0·8 ± 1·96 × 0·05299 = 0·6961 and 0·9039.

29. The deviations from the respective means of X and Y series are given below :
x :  –4  –3  –2  –1   0   1   2   3   4
y :   3  –3  –4   0   4   1   2  –2  –1
Calculate Karl Pearson’s coefficient of correlation from the above data. [Delhi Univ. B.Com. (Pass), 1995]
Hint. Cov (X, Y) = (1/n) ∑xy = 0.
Ans. r(X, Y) = 0.

30. Calculate the coefficient of correlation between X and Y series from the following data :
                       X series   Y series
No. of observations       15         15
Arithmetic mean           25         18
Standard deviation         5          5
∑[(X – 25) (Y – 18)] = 125.
[Delhi Univ. B.Com. (Pass), 1995]
Ans. rXY = ∑(X – X̄)(Y – Ȳ) / (n σX σY) = ∑(X – 25)(Y – 18) / (15 × 5 × 5) = 125 / (15 × 25) = 0·33.

31. Given n = 10; ∑x = 100; ∑y = 150; ∑(x – 10)² = 180; ∑(y – 15)² = 25; ∑(x – 10)(y – 15) = 60; obtain Karl Pearson’s correlation coefficient.
Ans. 2/√5 = 0·8944.

32. In a correlation study the results were
∑xy = 40, N = 100, ∑x² = 80, ∑y² = 20, where x = X – X̄ and y = Y – Ȳ.
The correlation coefficient is : (a) + 1·0 (b) – 1·0 (c) zero (d) None of these.
Ans. (a).

33. Given r = 0·8, ∑xy = 60, σy = 2·5 and ∑x² = 90, find the number of items. Here x and y are deviations from the respective means.
Ans. n = 10.

34. (a) The following results are obtained between two series. Compute the coefficient of correlation.
                                                      X Series   Y Series
Number of items                                           7          7
Arithmetic mean                                           4          8
Sum of squares of deviations from arithmetic mean        28         76
Summation of products of deviations of X and Y series from their respective means = 46.
Ans. r = 0·997.
(b) From the following data calculate the coefficient of correlation between two variables X and Y :
(i) Number of items in X-series or Y-series = 12.
(ii) Sum of the squares of deviations from mean : 360 for X-series and 250 for Y-series.
(iii) Sum of the products of deviations of the two series from their respective means = 225.
Ans. rxy = 0·75.

35. (a) The coefficient of correlation between two variables X and Y is 0·38. Their covariance is 10·2. The variance of X is 16. Find the standard deviation of Y-series.
Ans. σy = 6·71.
(b) The covariance between the length and weight of five items is 6 and their standard deviations are 2·45 and 2·61 respectively. Find the coefficient of correlation between length and weight. [Delhi Univ. B.Com. (Hons.), 2000]
Ans. r = 0·9383.

36. (a) The coefficient of correlation between two variates X and Y is 0·8 and their covariance is 20. If the variance of X series is 16, find the standard deviation of Y series. [C.A. (Foundation), June 1993]
Ans. σy = 6·25.
(b) The coefficient of correlation between two variables X and Y is 0·4 and their covariance is 10. If the variance of X series is 9, find the second moment about mean of Y series. [Delhi Univ. B.Com. (Hons.), 1996]
Hint. The second moment about mean of Y series is σy².
Ans. σy² = (8·33)² = 69·39.

37. (a) For the bivariate data : {(x, y) = (10, 4), (11, 3), (12, 2), (14, 0), (8, 6)}, the coefficient of correlation between x and y is :
(i) – 1 (ii) 0·5 (iii) 1 (iv) 0
(b) For the bivariate data : {(x, y) : (15, 3), (20, 8), (25, 13), (30, 18)}, the coefficient of correlation between x and y is :
(i) 1 (ii) – 1 (iii) 0 (iv) 0·5
Ans. (a) x + y = 14 ; r = – 1. (b) x – y = 12 ; r = + 1.

38. (a) Given that the correlation between x and y is 0·5, what is the correlation between 2x – 4 and 3 – 2y ?
Ans. r = – 0·5.
(b) Point out the inconsistency in the statement : “The correlation coefficient of 3x and – 2y is the same as the correlation coefficient of x and y.” [I.C.W.A. (Intermediate), Dec. 1998]
Ans. r(3x, – 2y) = – r(x, y). The statement is inconsistent.

39. (a) The corresponding values of the variables are given below :
X :  2  3   5   8   9
Y :  4  6  10  16  18
The correlation coefficient between the variables is : – 1, 0, 1 or none of these. Justify your answer.
Ans. rxy = + 1.
(b) Let r be the correlation between x and y. What is the correlation coefficient between : (i) (3x + 1) and (2y – 3), and (ii) x and – y ? Explain your answers.
Ans. (i) r, (ii) – r.
(c) If U = 2x + 11 and V = 3y + 7, what will be the correlation coefficient between U and V ? Justify your statement.
Ans. ruv = rxy.
(d) Are the following statements valid ? Justify your answer :
(i) A positive value of the correlation coefficient between x and y implies that if x decreases, y tends to increase.
(ii) The correlation coefficient is independent of the origin of reference but is dependent on the units of measurement.
(iii) The correlation coefficient between x and y turned out to be 1·25.
Ans.
(i) False, (ii) False, (iii) Impossible.

40. Comment on the following, giving reasons for your conclusions :
(a) If the correlation coefficient between two variables X and Y is positive, then
(i) the correlation coefficient between – X and – Y is positive.
(ii) the correlation coefficient between X and – Y or – X and Y is positive.
(b) The correlation coefficient between two variables is 1·4.
(c) If the variables are independent then they are uncorrelated.
(d) The correlation coefficient can be calculated only if the two variables are measured in the same units.
(e) If the correlation coefficient between two variables is zero, then the variables are independent.
(f) The value of r cannot be negative.
(g) r measures every type of relationship between the two variables.
(h) “The closeness of relationship between two variables is proportional to r.”
(i) A student while studying correlation between smoking and drinking found a value of r as high as 1·62.

8·6. CORRELATION IN BIVARIATE FREQUENCY TABLE
If in a bivariate distribution the data are fairly large, they may be summarised in the form of a two-way table. Here for each variable, the values are grouped into various classes (not necessarily the same for both the variables), keeping in view the same considerations as in the case of univariate distribution. For example, if there are m classes for the X-variable series and n classes for the Y-variable series, then there will be m × n cells in the two-way table. By going through the different pairs of the values (x, y) and using tally marks we can find the frequency for each cell and thus obtain the so-called bivariate frequency table as shown below.

BIVARIATE FREQUENCY TABLE
                                   X Series →  Classes
Y Series ↓                         Mid Points                     Total of frequencies
Classes    Mid Points      x1    x2    ...    x    ...    xm       of Y (fy)
              y1
              y2
              ...
              y                            f(x, y)                    fy
              ...
              yn
Total of frequencies
of X (fx)                                     fx                  Total ∑fx = ∑fy = N

Here f(x, y) is the frequency of the pair (x, y).
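The tallying procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `class_index` and `bivariate_table`, the equal-width classes and the six sample pairs are assumptions made for the illustration and do not come from the text.

```python
from collections import Counter

def class_index(value, lower, width):
    """0-based index of the class [lower + i*width, lower + (i+1)*width) containing value."""
    return int((value - lower) // width)

def bivariate_table(pairs, x_lower, x_width, m, y_lower, y_width, n):
    """Tally (x, y) pairs into an n-row (y-classes) by m-column (x-classes) table."""
    counts = Counter()
    for x, y in pairs:
        i = class_index(x, x_lower, x_width)
        j = class_index(y, y_lower, y_width)
        if not (0 <= i < m and 0 <= j < n):
            raise ValueError(f"pair {(x, y)} falls outside the class limits")
        counts[(i, j)] += 1          # one tally mark in cell (i, j)
    table = [[counts[(i, j)] for i in range(m)] for j in range(n)]
    fx = [sum(table[j][i] for j in range(n)) for i in range(m)]  # column totals
    fy = [sum(row) for row in table]                             # row totals
    return table, fx, fy

# Hypothetical data: x-classes 0-4, 4-8 (m = 2); y-classes 0-4, 4-8, 8-12 (n = 3).
pairs = [(1, 2), (3, 9), (5, 6), (6, 1), (2, 3), (7, 7)]
table, fx, fy = bivariate_table(pairs, 0, 4, 2, 0, 4, 3)
print(table)               # [[2, 1], [0, 2], [1, 0]]
print(fx, fy, sum(fx))     # [3, 3] [3, 2, 1] 6
```

The two sets of marginal totals add up to the same N, which gives a quick check on the tally.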
The formula for computing the correlation coefficient between X and Y for the bivariate frequency table is
$$ r = \frac{N \sum x y \, f(x, y) - (\sum x f_x)(\sum y f_y)}{\sqrt{[N \sum x^2 f_x - (\sum x f_x)^2] \times [N \sum y^2 f_y - (\sum y f_y)^2]}} \qquad \ldots (8·13) $$
where N is the total frequency. If there is no confusion we may use the formula :
$$ r = r_{xy} = \frac{N \sum f x y - (\sum f x)(\sum f y)}{\sqrt{[N \sum f x^2 - (\sum f x)^2] \times [N \sum f y^2 - (\sum f y)^2]}} \qquad \ldots (8·13a) $$
where the frequency f used for the product xy is nothing but f(x, y), and the frequencies f used in the sums ∑fx and ∑fy are respectively the frequencies of x and y, viz., fx and fy, as explained in the above table.
If we change the origin and scale in X and Y by transforming them to the new variables U and V by
$$ u = \frac{x - A}{h} \quad \text{and} \quad v = \frac{y - B}{k} ; \qquad h > 0, \; k > 0 $$
where h and k are the widths of the x-classes and y-classes respectively and A and B are constants, then by Property II of r, we have :
$$ r_{xy} = r_{uv} = \frac{N \sum f u v - (\sum f u)(\sum f v)}{\sqrt{[N \sum f u^2 - (\sum f u)^2] \times [N \sum f v^2 - (\sum f v)^2]}} \qquad \ldots (8·14) $$
We shall explain the method by means of examples.

Example 8·21. Family income and its percentage spent on food in the case of hundred families gave the following bivariate frequency distribution. Calculate the coefficient of correlation and interpret its value.
Food Expenditure                    Family income (Rs.)
(in %)              200–300   300–400   400–500   500–600   600–700
10–15                  –         –         –         3         7
15–20                  –         4         9         4         3
20–25                  7         6        12         5         –
25–30                  3        10        19         8         –
[Delhi Univ. M.B.A., 2000]
Solution. Let us denote the income (in Rupees) by the variable X and the food expenditure (%) by the variable Y.
Steps
1. Find the mid-points of various classes for X and Y series.
2.
Change the origin and scale in X-series and Y-series by transforming them to new variables u and v as defined below :
$$ u = \frac{x - A}{h} = \frac{x - 450}{100} \quad \text{and} \quad v = \frac{y - B}{k} = \frac{y - 17·5}{5} , $$
where x denotes the mid-points of the X-series and y denotes the mid-points of the Y-series, and h and k are the magnitudes of the classes of X and Y series respectively.
3. For each class of X, find the total of cell frequencies of all the classes of Y and similarly for each class of Y, find the total of cell frequencies of all the classes of X.
4. Multiply the frequencies of x by the corresponding values of the variable u and find the sum ∑fu.
5. Multiply the frequencies of y by the corresponding values of the variable v and find the sum ∑fv.
6. Multiply the frequency of each cell by the corresponding values of u and v and write the product f × u × v within a square in the right hand top corner of each cell. For example, for u = – 1 and v = 2, the cell frequency f is 10. Therefore, the product of f, u and v is 10 × (– 1) × 2 = – 20, which is written within a square on the right hand top of the cell. Similarly, for u = 2 and v = 1, the product fuv = 0 × 2 × 1 = 0, and so on for all the remaining cell frequencies.
7. Add together all the figures in the top corner squares as obtained in step 6 to get the last column fuv for each of the X and Y series. Finally, find the total of the last column to get ∑fuv.
8. Multiply the values of fu and fv by the corresponding values of u and v respectively to get the columns for fu² and fv². Add these values to obtain ∑fu² and ∑fv².
The above calculations are shown in the first of the tables headed CALCULATIONS FOR CORRELATION COEFFICIENT that follow. Substituting the totals from that table,
$$ r_{uv} = \frac{N \sum fuv - (\sum fu)(\sum fv)}{\sqrt{[N \sum fu^2 - (\sum fu)^2] \times [N \sum fv^2 - (\sum fv)^2]}} = \frac{100 \times (-48) - 0 \times 100}{\sqrt{(100 \times 120 - 0) \times [100 \times 200 - (100)^2]}} $$
$$ = \frac{-4800}{\sqrt{12000 \times (20000 - 10000)}} = \frac{-4800}{\sqrt{12000 \times 10000}} = \frac{-48}{\sqrt{12000}} = \frac{-48}{109·54} = -0·4382 $$
Hence, rxy = ruv = – 0·4382 [By Property II of r]
The negative value indicates that the percentage of income spent on food tends to fall as family income rises, the degree of relationship being moderate.

Example 8·22.
Calculate Karl Pearson’s coefficient of correlation from the data given below :
Marks                  Age in Years
            18     19     20     21     22
20–25        3      2      –      –      –
15–20        –      5      4      –      –
10–15        –      –      7     10      –
5–10         –      –      –      3      2
0–5          –      –      –      3      1
Solution. If we denote the age in years by the variable X and the mid-point of the class intervals of marks by the variable Y and take
u = X – 20 and v = (Y – 12·5)/5,
then the bivariate correlation table is as given in the second table below.

CALCULATIONS FOR CORRELATION COEFFICIENT (Example 8·21)
Each cell shows the frequency f followed by the product fuv in brackets.
Y            Mid pt. y    v    X : 200–300   300–400   400–500   500–600   600–700      f      fv     fv²    fuv
                               Mid pt. x : 250   350       450       550       650
                               u :  –2        –1         0         1         2
10–15          12·5      –1        –          –          –       3 (–3)    7 (–14)     10    –10     10    –17
15–20          17·5       0        –        4 (0)      9 (0)     4 (0)     3 (0)       20      0      0      0
20–25          22·5       1     7 (–14)     6 (–6)    12 (0)     5 (5)       –         30     30     30    –15
25–30          27·5       2     3 (–12)    10 (–20)   19 (0)     8 (16)      –         40     80    160    –16
f                                 10         20         40        20        10      N = 100  ∑fv = 100  ∑fv² = 200  ∑fuv = – 48
fu                               –20        –20          0        20        20      ∑fu = 0
fu²                               40         20          0        20        40      ∑fu² = 120
fuv                              –26        –26          0        18       –14      ∑fuv = – 48

CALCULATIONS FOR CORRELATION COEFFICIENT (Example 8·22)
Each cell shows the frequency f followed by the product fuv in brackets.
Marks        Mid pt. Y    v    Age X : 18      19        20        21        22        f      fv     fv²    fuv
                               u :  –2        –1         0         1         2
20–25          22·5       2     3 (–12)     2 (–4)       –         –         –         5     10     20    –16
15–20          17·5       1        –        5 (–5)     4 (0)       –         –         9      9      9     –5
10–15          12·5       0        –          –        7 (0)    10 (0)       –        17      0      0      0
5–10            7·5      –1        –          –          –       3 (–3)    2 (–4)      5     –5      5     –7
0–5             2·5      –2        –          –          –       3 (–6)    1 (–4)      4     –8     16    –10
f                                  3          7         11        16         3      N = 40  ∑fv = 6  ∑fv² = 50  ∑fuv = – 38
fu                                –6         –7          0        16         6      ∑fu = 9
fu²                               12          7          0        16        12      ∑fu² = 47
fuv                              –12         –9          0        –9        –8      ∑fuv = – 38

From the second table,
$$ r_{uv} = \frac{N \sum fuv - (\sum fu)(\sum fv)}{\sqrt{[N \sum fu^2 - (\sum fu)^2] \times [N \sum fv^2 - (\sum fv)^2]}} = \frac{40 \times (-38) - 9 \times 6}{\sqrt{[40 \times 47 - (9)^2] \times [40 \times 50 - (6)^2]}} $$
$$ = \frac{-1520 - 54}{\sqrt{(1880 - 81)(2000 - 36)}} = \frac{-1574}{\sqrt{1799 \times 1964}} = \frac{-1574}{\sqrt{3533236}} = \frac{-1574}{1879·69} = -0·8374 $$
But since the correlation coefficient is independent of change of origin and scale [c.f. Property II of r], we get
rxy = ruv = – 0·8374.

EXERCISE 8·3
1. Write a brief note on the correlation table. The following are the marks obtained by 24 students in a class test of Statistics and Mathematics :
Roll No.
of Students          :   1   2   3   4   5   6   7   8   9  10  11  12
Marks in Statistics  :  15   0   1   3  16   2  18   5   4  17   6  19
Marks in Mathematics :  13   1   2   7   8   9  12   9  17  16   6  18
Roll No. of Students :  13  14  15  16  17  18  19  20  21  22  23  24
Marks in Statistics  :  14   9   8  13  10  13  11  11  12  18   9   7
Marks in Mathematics :  11   3   5   4  10  11  14   7  18  15  15   3
Prepare a correlation table taking the magnitude of each class interval as four marks and the first class interval as “equal to 0 and less than 4”. Calculate Karl Pearson’s coefficient of correlation between the marks in Statistics and marks in Mathematics from the correlation table and comment on it.
Ans. r = 0·5717.

2. What is a bivariate table ? Write the formula you use for calculating the coefficient of correlation from such a table, explaining the symbols used. What does a negative value of the coefficient of correlation indicate ?

3. Calculate the coefficient of correlation between the ages of husbands and wives and its probable error from the following table :
Ages of wives               Ages of husbands (years)
(years)          20–30   30–40   40–50   50–60   60–70   Total
15–25               5       9       3       –       –      17
25–35               –      10      25       2       –      37
35–45               –       1      12       2       –      15
45–55               –       –       4      16       5      25
55–65               –       –       –       4       2       6
Total               5      20      44      24       7     100
Ans. r = 0·7953; P.E.(r) = 0·0248.

4. (a) Given the data in the following table, calculate the correlation coefficient, r, between X and Y.
                    X
Y          0–5    5–10   10–15
30–50       10      3      4
50–70        6      5      7
70–90        2      4      9
[Delhi Univ. B.A. (Econ. Hons.), 1991]
Ans. r = 0·375.
(b) Given the following table, obtain rxy.
                    x
y          0–5    5–10   10–15   Total
30–50       10      3      4      17
50–70        6      –      7       –
70–90        2      4      –      15
Total       18     12     20      50
[Delhi Univ. B.A. (Econ. Hons.), 2004]
Hint. First, obtain the missing values (shown as –) from the given totals.
Ans. r = 0·375.

5. Calculate the product moment coefficient of correlation for the following bivariate distribution.
               x
y        5    10    15    20
11       2     4     5     4
17       5     3     6     2
23       3     1     2     3
Ans. r = – 0·0856.

6. The following table gives the distribution of income and expenditure of 100 families.
Find the coefficient of correlation and its probable error. State whether the correlation coefficient is significant or not.
Income (in ’00 Rs.)        Percentage Expenditure on Food (X)
(Y)               10–20   20–30   30–40   40–50   50–60
350–450             –       –       –       –       5
450–550             –       –       1      10       9
550–650             –       4      12      25       3
650–750             4      16       2       2       –
750–850             2       5       –       –       –
Ans. r = – 0·8248.

7. A sample of 100 firms was taken and these were classified according to the sales executed by them and profits earned consequently. The results are shown in the table given below. Determine the correlation coefficient between sales and profits and also its probable error.
Profits                         Sales (in lakhs of Rs.)
(in ’000 Rs.)      7–8    8–9    9–10   10–11   11–12   12–13   Total
50–70               5      3      –       –       –       –       8
70–90               3      8      5       4       –       –      20
90–110              1      –      7      11       2       2      23
110–130             –      4      5      15       6       –      30
130–150             –      –      2       7       4       6      19
Total               9     15     19      37      12       8     100
Ans. r = 0·6627; P.E.(r) = 0·0378.

8·7. RANK CORRELATION METHOD
Sometimes we come across statistical series in which the variables under consideration are not capable of quantitative measurement but can be arranged in serial order. This happens when we are dealing with qualitative characteristics (attributes) such as honesty, beauty, character, morality, etc., which cannot be measured quantitatively but can be arranged serially. In such situations Karl Pearson’s coefficient of correlation cannot be used as such. Charles Edward Spearman, a British psychologist, developed a formula in 1904 which consists in obtaining the correlation coefficient between the ranks of n individuals in the two attributes under study.
Suppose we want to find if two characteristics A, say, intelligence and B, say, beauty are related or not. Both the characteristics are incapable of quantitative measurement, but we can arrange a group of n individuals in order of merit (ranks) w.r.t. proficiency in the two characteristics. Let the random variables X and Y denote the ranks of the individuals in the characteristics A and B respectively.
If we assume that there is no tie, i.e., if no two individuals get the same rank in a characteristic then, obviously, X and Y assume numerical values ranging from 1 to n.
The Pearsonian correlation coefficient between the ranks X and Y is called the rank correlation coefficient between the characteristics A and B for that group of individuals.
Spearman’s rank correlation coefficient, usually denoted by ρ (Rho), is given by the formula
$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} \qquad \ldots (8·15) $$
where d is the difference between the pair of ranks of the same individual in the two characteristics and n is the number of pairs.

8·7·1. Limits for ρ. Spearman’s rank correlation coefficient lies between – 1 and + 1, i.e.,
– 1 ≤ ρ ≤ 1 …(8·16)
Remark. Since the square of a real quantity is always non-negative, i.e., ≥ 0, ∑d², being the sum of non-negative quantities, is also non-negative. Further, since n is also positive, we get from (8·15)
ρ = 1 – [some non-negative quantity] ⇒ ρ ≤ 1, …(i)
the sign of equality holds, i.e., ρ = 1, if and only if ∑d² = 0. Now, ∑d² = 0 if and only if each d = 0, i.e., the ranks of an individual are the same in both the characteristics. The following table gives one such possibility.
x :  1   2   3   …   n
y :  1   2   3   …   n
On the other hand, ρ will be minimum, i.e., ρ = – 1, if ∑d² is maximum, i.e., if the deviations d are maximum, which is so if the ranks of the individuals in the two characteristics are in the reverse (opposite) order as given in the following table.
Individual :  1    2      3      …   n – 1   n
x          :  1    2      3      …   n – 1   n
y          :  n   n – 1  n – 2   …     2     1
Note. From the above table, we observe that : x + y = n + 1. Also x – y = d.

8·7·2. Computation of Rank Correlation Coefficient. We shall discuss below the method of computing Spearman’s rank correlation coefficient ρ under the following situations :
(i) When actual ranks are given.
(ii) When ranks are not given.

CASE (I) : WHEN ACTUAL RANKS ARE GIVEN
In this situation the following steps are involved :
(i) Compute d, the difference of ranks.
(ii) Compute d².
(iii) Obtain the sum ∑d².
(iv) Use formula (8·15) to get the value of ρ.

Example 8·23. The ranks of the same 15 students in two subjects A and B are given below ; the two numbers within the brackets denoting the ranks of the same student in A and B respectively.
(1, 10), (2, 7), (3, 2), (4, 6), (5, 4), (6, 8), (7, 3), (8, 1), (9, 11), (10, 15), (11, 9), (12, 5), (13, 14), (14, 12), (15, 13).
Use Spearman’s formula to find the rank correlation coefficient. [Sukhadia Univ. MBA, 1998]
Solution.
CALCULATION OF SPEARMAN’S CORRELATION COEFFICIENT
Rank in A (x)   Rank in B (y)   d = x – y      d²
      1              10            –9          81
      2               7            –5          25
      3               2             1           1
      4               6            –2           4
      5               4             1           1
      6               8            –2           4
      7               3             4          16
      8               1             7          49
      9              11            –2           4
     10              15            –5          25
     11               9             2           4
     12               5             7          49
     13              14            –1           1
     14              12             2           4
     15              13             2           4
                                 ∑d = 0     ∑d² = 272
Spearman’s rank correlation coefficient ρ is given by :
$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 272}{15(225 - 1)} = 1 - \frac{6 \times 272}{15 \times 224} = 1 - \frac{17}{35} = \frac{18}{35} = 0·51 $$

Example 8·24. Ten competitors in a beauty contest are ranked by three judges in the following order :
1st Judge :   1   6   5  10   3   2   4   9   7   8
2nd Judge :   3   5   8   4   7  10   2   1   6   9
3rd Judge :   6   4   9   8   1   2   3  10   5   7
Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common tastes in beauty.
Solution. Let R1, R2 and R3 denote the ranks given by the first, second and third judges respectively and let ρij be the rank correlation coefficient between the ranks given by the ith and jth judges, i ≠ j = 1, 2, 3. Let dij = Ri – Rj be the difference of ranks of an individual given by the ith and jth judge.
CALCULATION OF RANK CORRELATION COEFFICIENT
R1    R2    R3    d12 = R1 – R2   d13 = R1 – R3   d23 = R2 – R3    d12²    d13²    d23²
 1     3     6         –2              –5              –3            4      25       9
 6     5     4          1               2               1            1       4       1
 5     8     9         –3              –4              –1            9      16       1
10     4     8          6               2              –4           36       4      16
 3     7     1         –4               2               6           16       4      36
 2    10     2         –8               0               8           64       0      64
 4     2     3          2               1              –1            4       1       1
 9     1    10          8              –1              –9           64       1      81
 7     6     5          1               2               1            1       4       1
 8     9     7         –1               1               2            1       1       4
Total               ∑d12 = 0        ∑d13 = 0        ∑d23 = 0    ∑d12² = 200  ∑d13² = 60  ∑d23² = 214
We have n = 10. Spearman’s rank correlation coefficients are given by :
$$ \rho_{12} = 1 - \frac{6 \sum d_{12}^2}{n(n^2 - 1)} = 1 - \frac{6 \times 200}{10 \times 99} = -\frac{7}{33} = -0·2121 $$
$$ \rho_{13} = 1 - \frac{6 \sum d_{13}^2}{n(n^2 - 1)} = 1 - \frac{6 \times 60}{10 \times 99} = \frac{7}{11} = 0·6364 $$
$$ \rho_{23} = 1 - \frac{6 \sum d_{23}^2}{n(n^2 - 1)} = 1 - \frac{6 \times 214}{10 \times 99} = -\frac{49}{165} = -0·2970 $$
Since ρ13 is maximum, the pair of first and third judges has the nearest approach to common tastes in beauty.
Remark. Since ρ12 and ρ23 are negative, the pairs of judges (1, 2) and (2, 3) have opposite (divergent) tastes for beauty.

CASE (II) : WHEN RANKS ARE NOT GIVEN
Spearman’s rank correlation formula (8·15) can also be used even if we are dealing with variables which are measured quantitatively, i.e., when the actual data but not the ranks relating to two variables are given. In such a case we shall have to convert the data into ranks. The highest (or the smallest) observation is given the rank 1, the next highest (or next smallest) observation is given rank 2, and so on. It is immaterial in which way (descending or ascending) the ranks are assigned. However, the same approach should be followed for all the variables under consideration.

Example 8·25. Calculate Spearman’s rank correlation coefficient between advertisement cost and sales from the following data :
Advertisement cost (’000 Rs.) :  39  65  62  90  82  75  25  98  36  78
Sales (lakhs Rs.)             :  47  53  58  86  62  68  60  91  51  84
Solution. Let X denote the advertisement cost (’000 Rs.) and Y denote the sales (lakhs Rs.).
CALCULATION OF RANK CORRELATION COEFFICIENT
  X      Y     Rank of X (x)   Rank of Y (y)   d = x – y     d²
 39     47          8              10             –2          4
 65     53          6               8             –2          4
 62     58          7               7              0          0
 90     86          2               2              0          0
 82     62          3               5             –2          4
 75     68          5               4              1          1
 25     60         10               6              4         16
 98     91          1               1              0          0
 36     51          9               9              0          0
 78     84          4               3              1          1
                                                ∑d = 0    ∑d² = 30
Here n = 10.
$$ \therefore \quad \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 30}{10 \times 99} = 1 - \frac{2}{11} = \frac{9}{11} = 0·82. $$

Example 8·26. A test in Statistics was taken by 7 students. The teacher ranked his pupils according to their academic achievement. The order of achievement from high to low, together with family income for each pupil, is given as follows :
Rai (Rs. 8,700), Bhatnagar (Rs. 4,200), Tuli (Rs. 5,700), Desai (Rs. 8,200), Gupta (Rs. 20,000), Chaudhri (Rs. 18,000) and Singh (Rs. 17,500).
Compute Spearman’s rank correlation between academic achievement and family income. [Delhi Univ. B.Com. (Pass), 1999]
Solution. Let us define the following variables.
X : Academic achievement ; Y : Family income (Rupees)
CALCULATIONS FOR RANK CORRELATION COEFFICIENT
Student       Rank X (x)   Income (Rs.) Y   Rank Y (y)   d = x – y     d²
Rai               1            8,700            4           –3          9
Bhatnagar         2            4,200            7           –5         25
Tuli              3            5,700            6           –3          9
Desai             4            8,200            5           –1          1
Gupta             5           20,000            1            4         16
Chaudhri          6           18,000            2            4         16
Singh             7           17,500            3            4         16
                                                          ∑d = 0    ∑d² = 92
Spearman’s rank correlation coefficient between academic achievement (X) and family income (Y) is given by :
$$ \rho_{xy} = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 92}{7 \times 48} = 1 - \frac{23}{14} = -\frac{9}{14} = -0·6429. $$

REPEATED RANKS
In case of attributes, if there is a tie, i.e., if any two or more individuals are placed together in any classification w.r.t.
an attribute, or if in case of variable data there is more than one item with the same value in either or both the series, then Spearman’s formula (8·15) for calculating the rank correlation coefficient breaks down, since in this case the variables X [the ranks of individuals in characteristic A (1st series)] and Y [the ranks of individuals in characteristic B (2nd series)] do not take the values from 1 to n and consequently x̄ ≠ ȳ, while in proving (8·15) we had assumed that x̄ = ȳ.
In this case, common ranks are assigned to the repeated items. These common ranks are the arithmetic mean of the ranks which these items would have got if they were different from each other, and the next item will get the rank next to the ranks used in computing the common rank. For example, suppose an item occurs twice at rank 4. Then the common rank to be assigned to each item is (4 + 5)/2, i.e., 4·5, which is the average of 4 and 5, the ranks which these observations would have assumed if they were different. The next item will be assigned the rank 6. If an item is repeated thrice at rank 7, then the common rank to be assigned to each value will be (7 + 8 + 9)/3, i.e., 8, which is the arithmetic mean of 7, 8 and 9, viz., the ranks these observations would have got if they were different from each other. The next rank to be assigned will be 10.
If only a small proportion of the ranks are tied, this technique may be applied together with formula (8·15). If a large proportion of ranks are tied, it is advisable to apply an adjustment or a correction factor (C.F.) to (8·15) as explained below.
“In the formula (8·15) add the factor m(m² – 1)/12 to ∑d², where m is the number of times an item is repeated. This correction factor is to be added for each repeated value in both the series.”

Example 8·27. A psychologist wanted to compare two methods A and B of teaching. He selected a random sample of 22 students.
He grouped them into 11 pairs so that the students in a pair have approximately equal scores on an intelligence test. In each pair one student was taught by method A and the other by method B and examined after the course. The marks obtained by them are tabulated below :
Pair :   1   2   3   4   5   6   7   8   9  10  11
A    :  24  29  19  14  30  19  27  30  20  28  11
B    :  37  35  16  26  23  27  19  20  16  11  21
Find the rank correlation coefficient.
Solution. Let variable X denote the scores of students taught by method A and Y denote the scores of students taught by method B.
CALCULATIONS FOR RANK CORRELATION COEFFICIENT
  X      Y     Rank of X (x)   Rank of Y (y)   d = x – y      d²
 24     37          6               1              5         25
 29     35          3               2              1          1
 19     16          8·5             9·5           –1          1
 14     26         10               4              6         36
 30     23          1·5             5             –3·5       12·25
 19     27          8·5             3              5·5       30·25
 27     19          5               8             –3          9·00
 30     20          1·5             7             –5·5       30·25
 20     16          7               9·5           –2·5        6·25
 28     11          4              11             –7         49·00
 11     21         11               6              5         25·00
                                                ∑d = 0    ∑d² = 225·00
In the X-series, we see that the value 30 occurs twice. The common rank assigned to each of these values is 1·5, the arithmetic mean of 1 and 2, the ranks which these observations would have taken if they were different. The next value 29 gets the next rank, viz., 3. Again, the value 19 occurs twice. The common rank assigned to it is 8·5, the arithmetic mean of 8 and 9, and the next value, viz., 14 gets the rank 10. Similarly, in the Y-series the value 16 occurs twice and the common rank assigned to each is 9·5, the arithmetic mean of 9 and 10. The next value, viz., 11 gets the rank 11.
Hence, we see that in the X-series the items 19 and 30 are repeated, each occurring twice, and in the Y-series the item 16 is repeated. Thus in each of the three cases m = 2.
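As an aside, the rule for common ranks used in the table above can be sketched in Python. This is a minimal sketch assuming, as in the example, that the largest value gets rank 1; the function name `average_ranks` is chosen here for illustration and is not from the text.

```python
def average_ranks(values):
    """Rank values in descending order (largest = 1); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        common = ((pos + 1) + (end + 1)) / 2   # mean of ranks pos+1, ..., end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = common
        pos = end + 1                          # next item takes the next unused rank
    return ranks

a = [24, 29, 19, 14, 30, 19, 27, 30, 20, 28, 11]   # method A marks
b = [37, 35, 16, 26, 23, 27, 19, 20, 16, 11, 21]   # method B marks
rank_a = average_ranks(a)
rank_b = average_ranks(b)
sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
print(rank_a)   # [6.0, 3.0, 8.5, 10.0, 1.5, 8.5, 5.0, 1.5, 7.0, 4.0, 11.0]
print(sum_d2)   # 225.0
```

The ranks and the total ∑d² agree with the columns of the table.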
Hence on applying the correction factor m(m² – 1)/12 for each repeated item, we get
$$ \rho = 1 - \frac{6 \left[ \sum d^2 + \frac{2(4 - 1)}{12} + \frac{2(4 - 1)}{12} + \frac{2(4 - 1)}{12} \right]}{11(121 - 1)} = 1 - \frac{6 \times 226·5}{11 \times 120} = 1 - 1·0295 = -0·0295 \qquad (\because \; n = 11) $$

Example 8·28. From the following data calculate the rank correlation coefficient after making adjustment for tied ranks.
x :  48  33  40   9  16  16  65  24  16  57
y :  13  13  24   6  15   4  20   9   6  19
[I.C.W.A. (Intermediate), June 2002]
Solution. Rx = Rank of x-value ; Ry = Rank of y-value
  x      y      Rx      Ry     d = Rx – Ry      d²
 48     13       3      5·5       –2·5         6·25
 33     13       5      5·5       –0·5         0·25
 40     24       4      1          3           9
  9      6      10      8·5        1·5         2·25
 16     15       8      4          4          16
 16      4       8     10         –2           4
 65     20       1      2         –1           1
 24      9       6      7         –1           1
 16      6       8      8·5       –0·5         0·25
 57     19       2      3         –1           1
Total                            ∑d = 0     ∑d² = 41
Explanation of Ranks. In the x-series, we see that the value 16 is repeated three times. The common rank assigned to each of these values is 8, the arithmetic mean of 7, 8 and 9, the ranks which these observations would have taken if they were different. The next value 9 gets the next rank, viz., 10.
Similarly, in the y-series the observation 13 occurs twice. The common rank assigned to each is 5·5, the arithmetic mean of 5 and 6, and the next value 9 gets the next rank 7. Again the value 6 is repeated twice and each is given the common rank (8 + 9)/2, i.e., 8·5, and the next value 4 gets the rank 10.
Correction Factor (C.F.)
Series      Repeated value   No. of times it occurs (m)   C.F. = m(m² – 1)/12
x-series         16                    3                  3(9 – 1)/12 = 2
y-series         13                    2                  2(4 – 1)/12 = 0·5
y-series          6                    2                  2(4 – 1)/12 = 0·5
Total                                                             3
Spearman’s rank correlation coefficient for repeated ranks gives :
$$ \rho = 1 - \frac{6}{n(n^2 - 1)} \left[ \sum d^2 + \sum \frac{m(m^2 - 1)}{12} \right] = 1 - \frac{6}{10(100 - 1)} \, [41 + 3] = 1 - \frac{6 \times 44}{10 \times 99} = 1 - \frac{4}{15} = \frac{11}{15} = 0·7333. $$
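The adjustment for tied ranks can be checked with a short Python sketch. It assumes the ranks Rx and Ry of Example 8·28 are already available, with tied items sharing a common rank, so that the number of repetitions m of each tied value can be read off as the number of items holding the same rank. The function name `spearman_tie_corrected` is illustrative only.

```python
from collections import Counter

def spearman_tie_corrected(rx, ry):
    """Spearman's rho with m(m^2 - 1)/12 added to sum(d^2) for every tied group."""
    n = len(rx)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = 0.0
    for ranks in (rx, ry):
        for m in Counter(ranks).values():         # m = size of each group of equal ranks
            correction += m * (m * m - 1) / 12    # contributes 0 when m == 1
    rho = 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))
    return rho, sum_d2, correction

# Ranks taken from the table of Example 8.28.
rx = [3, 5, 4, 10, 8, 8, 1, 6, 8, 2]
ry = [5.5, 5.5, 1, 8.5, 4, 10, 2, 7, 8.5, 3]
rho, sum_d2, correction = spearman_tie_corrected(rx, ry)
print(sum_d2, correction)   # 41.0 3.0
print(round(rho, 4))        # 0.7333
```

Counting equal ranks rather than equal raw values keeps the sketch short; it relies on tied items having been given a common rank beforehand.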
]=1– 6 × 44 10 × 99 y–series : 13 6 2 2 2(4 – 1)/12 = 0·5 2(4 – 1)/12 = 0·5 ————————————————————————— ————————————————————————————————————————————— —— 3 Total Example 8.29. Find the Rank correlation Coefficient from the following marks awarded by the examiners in Statistics. Roll No. By Examiner A By Examiner B By Examiner C 1 2 3 4 5 6 7 8 9 10 11 24 29 19 14 30 19 27 30 20 28 11 37 35 16 26 23 27 19 20 16 11 21 30 28 20 25 25 30 20 24 22 29 15 [Delhi Univ. B.Com. (Hons.), 2005] 8·37 CORRELATION ANALYSIS Solution· First of all we shall convert the markets awarded by different examiners to the candidates into ranks, as given in the following table. Rok Nos· 1 2 3 4 5 6 7 8 9 10 11 Marks Rank Marks awarded (R A) awarded by by Examiner Examiner B A 24 6 37 29 3 35 19 8·5 16 14 10 26 30 1·5 23 19 8·5 27 27 5 19 30 1·5 20 20 7 16 28 4 11 11 11 21 We are given n = 11. Rank (R B) Marks awarded by Examiner C Rank (R C) 1 2 9·5 4 5 3 8 7 9·5 11 6 30 28 20 25 25 30 20 24 22 29 15 1·5 4 9·5 5·5 5·5 1·5 9·5 7 8 3 11 [ n [ 6 225 + =1– =1– 5 1 –1 6 – 3·5 5·5 –3 – 5·5 – 2·5 –7 5 4·5 –1 –1 4·5 –4 7 – 4·5 – 5·5 –1 1 0 – 0·5 –2 0 – 1·5 – 0·5 1·5 – 1·5 0 1·5 8 –5 d 2AB d 2AC d 2BC 25 1 1 36 12·25 30·25 9 30·25 6·25 49 25 20·25 1 1 20·25 16 49 20·25 30·25 1 1 0 0·25 4 0 2·25 0·25 2·25 2·25 0 2·25 64 25 ∑d 2 AB = 225 ∑d 2 AC = 160 ∑d 2 BC = 102·5 m (m2 – 1) for each repeated rank in A and B series 12 6 ∑dAB 2 + ρAB = 1 – d AB = d AC = d BC = RA – RB RA – RC RB – RC (n2 ] – 1) 2 (22 – 1) 2 (22 – 1) 2 (22 – 1) + + 12 12 12 ] 11 (121 – 1) 6 × 226·5 226·5 26·5 =1– =– = – 0·0295 11 × 120 220 220 Similarly ρAC = 1 – 6 [∑ dAC2 + correction for Repeated Ranks] n (n2 – 1) [ 6 160 + 2 =1– =1– } { 2 (22 – 1) +3 12 2 (22 – 1) 12 11 × (121 – 1) }] 6 (160 + 2·5) 162·5 57·5 =1– = = 0·2614 1 × 220 220 220 [ 6 102·5 + ρBC = 1 – { 2 (22 – 1) +3 12 { 2 (22 – 1) 12 }] 11 (121 – 1) 6 × 104·5 104·5 115·5 =1– = = 0·5250 11 × 120 220 220 Example 8·30. 
The value of Spearman’s rank correlation coefficient for a certain number of pairs of observations was found to be 2/3. The sum of squares of the differences between corresponding ranks was 55. Find the number of pairs.

Solution. We are given ρ = 2/3 and ∑d² = 55. If n is the number of pairs of observations, we have :

$$\rho = 1 - \frac{6\sum d^2}{n(n^2-1)} \;\Rightarrow\; \frac{2}{3} = 1 - \frac{6 \times 55}{n(n^2-1)} \;\Rightarrow\; \frac{330}{n(n^2-1)} = 1 - \frac{2}{3} = \frac{1}{3}$$

⇒ n(n² – 1) = 3 × 330 = 990 ⇒ n³ – n – 990 = 0 …(*)

By hit and trial we find that on putting n = 10 in (*),

L.H.S. = 10³ – 10 – 990 = 1000 – 1000 = 0 = R.H.S.

Hence, by the Remainder Theorem, (n – 10) is a factor of n³ – n – 990. On dividing n³ – n – 990 by (n – 10), we obtain the other factor of n³ – n – 990 as n² + 10n + 99.

∴ (*) ⇒ (n – 10)(n² + 10n + 99) = 0 ⇒ n – 10 = 0 or n² + 10n + 99 = 0

$$\Rightarrow\; n = 10 \quad \text{or} \quad n = \frac{-10 \pm \sqrt{100 - 396}}{2} \quad \text{(imaginary values)}$$

Hence, n = 10 is the only permissible value, and the number of pairs is 10.

Example 8·31. The coefficient of rank correlation between Micro-Economics and Statistics marks of 10 students was found to be 0·5. It was later discovered that the difference in ranks in two subjects obtained by one of the students was wrongly taken as 3 instead of 7. Find the correct value of coefficient of rank correlation.

[Delhi Univ. B.A. (Econ. Hons.), 2006]

Solution. We are given n = 10, ρ = 0·5. Using (8·15), we get

$$0.5 = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6\sum d^2}{10 \times 99} \;\Rightarrow\; \frac{6\sum d^2}{990} = 1 - 0.5 = 0.5 \;\Rightarrow\; \sum d^2 = \frac{0.5 \times 990}{6} = 82.5$$

Since one difference was wrongly taken as 3 instead of 7, the correct value of ∑d² is given by :

Corrected ∑d² = 82·5 – 3² + 7² = 82·5 – 9 + 49 = 122·5

$$\therefore\; \text{Corrected } \rho = 1 - \frac{6 \times 122.5}{10 \times 99} = 1 - \frac{49}{66} = 1 - 0.7424 = 0.2576$$

8·7·3. Remarks on Spearman’s Rank Correlation Coefficient

1. We always have ∑d = 0, which provides a check for numerical calculations.
2.
Since Spearman’s rank correlation coefficient ρ is nothing but the Pearsonian correlation coefficient between the ranks, it can be interpreted in the same way as Karl Pearson’s correlation coefficient.
3. Karl Pearson’s correlation coefficient assumes that the parent population from which sample observations are drawn is normal. If this assumption is violated then we need a measure which is distribution-free (or non-parametric). A distribution-free measure is one which does not make any assumptions about the parameters of the population. Spearman’s ρ is such a measure (i.e., distribution-free), since no strict assumptions are made about the form of the population from which sample observations are drawn.
4. Spearman’s formula is easy to understand and apply as compared with Karl Pearson’s formula. The values obtained by the two formulae, viz., Pearsonian r and Spearman’s ρ, are generally different. The difference arises due to the fact that when ranking is used instead of the full set of observations, there is always some loss of information. Unless many ties exist, the coefficient of rank correlation should be only slightly lower than the Pearsonian coefficient.
5. Spearman’s formula is the only formula to be used for finding the correlation coefficient if we are dealing with qualitative characteristics which cannot be measured quantitatively but can be arranged serially. It can also be used where actual data are given. In case of extreme observations, Spearman’s formula is preferred to Pearson’s formula.
6. Spearman’s formula has its limitations also. It is not practicable in the case of a bivariate frequency distribution (Correlation Table). For n > 30, this formula should not be used unless the ranks are given, since in the contrary case the calculations are quite time consuming.

EXERCISE 8·4

1. (a) What is Spearman’s rank correlation coefficient ? Discuss its usefulness.
(b) Explain the difference between Karl Pearson’s (product moment) correlation coefficient and rank correlation coefficient.
2. (a) What are the advantages of Spearman’s rank correlation coefficient over Karl Pearson’s correlation coefficient ? Explain the method of calculating Spearman’s correlation coefficient.
(b) Define rank correlation coefficient. When is it preferred to Karl Pearson’s coefficient of correlation ?
(c) Distinguish between Karl Pearson’s coefficient of correlation and Spearman’s rank correlation coefficient. Explain with the help of an example when Spearman’s rank correlation coefficient results in + 1, – 1 and a value between – 1 and + 1.
[Delhi Univ. B.Com. (Hons.), 2009]
3. Define Rank Correlation. Write down Spearman’s formula for rank correlation coefficient ρ. What are the limits of ρ ? Interpret the case when ρ assumes the minimum value.
4. Rankings of 10 trainees at the beginning (x) and at the end (y) of a certain course are given below :

Trainees :  A   B   C   D   E   F   G   H   I   J
x        :  1   6   3   9   5   2   7  10   8   4
y        :  6   8   3   7   2   1   5   9   4  10

Calculate Spearman’s rank correlation coefficient.
[I.C.W.A. (Intermediate), June 1995]
Ans. ρ = 0·394.
5. The ranks of the same 16 students in Mathematics and Physics are as follows. The two numbers within brackets denote the ranks of the students in Mathematics and Physics.
(1, 1) (2, 10) (3, 3) (4, 4) (5, 5) (6, 7) (7, 2) (8, 6) (9, 8) (10, 11) (11, 15) (12, 9) (13, 14) (14, 12) (15, 16) (16, 13).
Calculate the rank correlation coefficient for proficiencies of this group in Mathematics and Physics.
Ans. ρ = 0·8.
6. Two judges in a beauty competition rank the 12 entries as follows :

X :   1   2   3   4   5   6   7   8   9  10  11  12
Y :  12   9   6  10   3   5   4   7   8   2  11   1

What degree of agreement is there between the two judges ?
Ans. ρ = – 0·454
7.
Ten competitors in a beauty contest are ranked by three judges in the following order :

1st Judge :  1   5   4   8   9   6  10   7   3   2
2nd Judge :  4   8   7   6   5   9  10   3   2   1
3rd Judge :  6   7   8   1   5  10   9   2   3   4

Use the rank correlation coefficient to discuss which pair of judges has the nearest approach to common tastes in beauty.
Ans. ρ12 = 0·5515, ρ13 = 0·0545, ρ23 = 0·7333. The pair of 2nd and 3rd judges has the nearest approach to common tastes in beauty.
8. For the following data, calculate the Coefficient of Rank Correlation.

x :   80   91   99   71   61   81   70   59
y :  123  135  154  110  105  134  121  106

[C.A. (Foundation), May 2001]
Ans. ρ = 0·9524.
9. The following are the marks obtained by a group of students in two papers. Calculate the rank coefficient of correlation.

Economics  :  78   36   98   25   75   82   92   62   65   39
Statistics :  84   51   91   69   68   62   86   58   35   49

Ans. ρ = 0·6121.
10. Calculate Spearman’s coefficient of rank correlation for the following data of scores in psychological tests (x) and arithmetical ability (y) of 10 children.

Child :   A    B    C    D    E    F    G    H    I    J
x     :  105  104  102  101  100   99   98   96   93   92
y     :  101  103  100   98   95   96  104   92   97   94

Ans. ρ = 0·6.
11. How do you modify Spearman’s rank correlation formula for tied ranks ? Compute the Coefficient of Rank Correlation between X and Y from the data given below :

X :  8  10   7  15   3  20  21   5  10  14   8  16  22  19   6
Y :  3  12   8  13  20   9  14  11   4  16  15  10  18  23  25

[Delhi Univ. B.Com. (Pass), 1998]

$$\text{Ans. } \rho = 1 - \frac{6 \times (539 + 0.5 + 0.5)}{15 \times 224} = 0.0357$$

12. Given the following aptitude and I.Q. scores for a group of students, find the coefficient of rank correlation.

Aptitude Score :  57   58   59   59   60   61   60   64
I.Q. Score     :  97  108   95  106  120  126  113  110

[Delhi Univ. B.A. (Econ. Hons.), 2009]

$$\text{Ans. } \rho = 1 - \frac{6 \times (24 + 0.5 + 0.5)}{8\,(64 - 1)} = 1 - \frac{150}{504} = 0.7024$$

13.
The following data relate to the marks obtained by 10 students of a class in Statistics and Costing :

Marks in Statistics :  30   38   28   27   28   23   30   33   28   35
Marks in Costing    :  29   27   22   29   20   29   18   21   27   22

Obtain the rank correlation coefficient.
[Delhi Univ. B.Com. (Hons.), 2001]
Ans. ρ = – 0·3515.
14. Find the coefficient of rank correlation between the marks obtained in Mathematics (x) and those in Statistics (y) by 10 students of a certain class out of a total of 50 marks in each subject.

Student No. :   1    2    3    4    5    6    7    8    9   10
x           :  12   18   32   18   25   24   25   40   38   22
y           :  16   15   28   16   24   22   28   36   34   19

Ans. ρ = 0·95.
15. From the following data, calculate the coefficient of rank correlation between x and y.
x : 32 35 49 60 43 37 43 49 y : 40 30 70 20 30 50 72 60 10 45 20 25
Ans. ρ = – 0·0758.
16. When is rank correlation coefficient preferred to Karl Pearson’s method ? In a bivariate sample, the sum of squares of differences between the ranks of observed values of two variables is 231 and the correlation coefficient between them is – 0·4. Find the number of pairs.
[Delhi Univ. B.Com. (Hons.) (External), 2006]
Ans. n = 10.
Hint.

$$\rho = 1 - \frac{6\sum d^2}{n(n^2-1)} \;\Rightarrow\; n(n^2-1) = \frac{6\sum d^2}{1-\rho} = \frac{6 \times 231}{1.4} = 990 \;\Rightarrow\; n^3 - n - 990 = 0 \;\Rightarrow\; n = 10$$

[See Example 8·30]
17. If the rank correlation coefficient is 0·6 and the sum of the squares of differences of ranks is 66, then the number of pairs is (i) 8, (ii) 9, (iii) 10, (iv) 11.
[I.C.W.A. (Intermediate), June 2001]
Ans. (iii).
18. Coefficient of correlation between debenture prices and share prices is found to be 0·143. If the sum of the squares of differences in ranks is given to be 48, find the value of n.
Ans. n = 7.
19. The coefficient of rank correlation of the marks obtained by 10 students in biology and chemistry was found to be 0·8. It was later discovered that the difference in ranks in the two subjects obtained by one of the students was wrongly taken as 7 instead of 9. Find the correct coefficient of rank correlation.
Ans. Correct value of ρ = 0·6061.
20. The coefficient of rank correlation of the marks obtained by 10 students in Statistics and Accountancy was found to be 0·2. It was later discovered that the difference in ranks in the two subjects obtained by one of the students was wrongly taken as 9 instead of 7. Find the correct value of coefficient of rank correlation.
[Delhi Univ. B.Com (Hons.), 1992]
Ans. Correct (ρ) = 0·3939.
21. The rank correlation of a physical fitness contest involving 12 participants was calculated as 0·6. However, it was later discovered that the difference in ranks of a participant was read as 8 instead of 3. Find the correct value of rank correlation.
[Delhi Univ. B.A. (Econ. Hons.), 2009]
Ans. 0·7924.
22. Mention the correct answer. The ranks according to two attributes in a sample are given below :

R1 :  1   2   3   4   5
R2 :  5   4   3   2   1

The rank correlation between them is : 0, +1, –1, None of these.
Ans. ρ = –1.

8·8. METHOD OF CONCURRENT DEVIATIONS

This is a very casual method of determining the correlation between two series, used when we are not very serious about its precision. It is based on the signs of the deviations (i.e., direction of the change) of the values of the variable from its preceding value and does not take into account the exact magnitude of the values of the variables. Thus we put a plus (+) sign, minus (–) sign or equality (=) sign for the deviation if the value of the variable is greater than, less than or equal to the preceding value respectively. The deviations in the values of two variables are said to be concurrent if they have the same sign, i.e., either both deviations are positive or both are negative or both are equal. The formula used for computing correlation coefficient r by this method is given by

$$r = \pm\sqrt{\pm\left(\frac{2c - n}{n}\right)}$$ …(8·17)

where c is the number of pairs of concurrent deviations and n is the number of pairs of deviations.
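As an illustration of formula (8·17), the following is a minimal Python sketch rather than a definitive implementation. The function names `deviation_signs` and `concurrent_deviation_coefficient` are hypothetical, and the sketch assumes the usual sign convention for (8·17): the sign of (2c – n) is used both inside and outside the square root, so that the quantity under the root is never negative. The data are the supply and price figures of Example 8·32.

```python
import math

def deviation_signs(values):
    """Sign of each value's change from its preceding value: +1, -1 or 0 (=)."""
    return [(b > a) - (b < a) for a, b in zip(values, values[1:])]

def concurrent_deviation_coefficient(x, y):
    """Coefficient of concurrent deviations, r = ± sqrt(± (2c - n) / n)."""
    sx, sy = deviation_signs(x), deviation_signs(y)
    n = len(sx)                                   # number of pairs of deviations
    c = sum(1 for a, b in zip(sx, sy) if a == b)  # pairs with the same sign
    q = (2 * c - n) / n
    # Assumed convention: take the sign of (2c - n) inside and outside the root.
    return math.copysign(math.sqrt(abs(q)), q)

supply = [160, 164, 172, 182, 166, 170, 178, 192, 186]
price = [292, 280, 260, 234, 266, 254, 230, 190, 200]
print(concurrent_deviation_coefficient(supply, price))  # -1.0
```

Nine observations give n = 8 pairs of deviations; none of them is concurrent, so c = 0 and the sketch returns –1.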
In the formula (8·17) the plus/minus sign to be taken inside and outside the square root is of fundamental importance. Since –1 ≤ r ≤ 1, the quantity inside the square root, viz., ± (2c – n)/n, must be positive, otherwise r will be imaginary, which is not possible. Thus if (2c – n) is positive, we take the positive sign in and outside the square root in (8·17) and if (2c – n) is negative, we take the negative sign in and outside the square root in (8·17).

Remarks 1. It should be clearly noted that here n is not the number of pairs of observations but the number of pairs of deviations, and as such it is one less than the number of pairs of observations.
2. r computed by formula (8·17) is also known as the coefficient of concurrent deviations.
3. The coefficient of concurrent deviations is primarily based on the following principle :
“If the short time fluctuations of the time series are positively correlated or in other words, if their deviations are concurrent, their curves would move in the same direction and would indicate positive correlation between them.”
Thus r computed from (8·17) ordinarily indicates the relationship between short time fluctuations only.

Example 8·32. Calculate the coefficient of concurrent deviations from the data given below :

Year   :  1993  1994  1995  1996  1997  1998  1999  2000  2001
Supply :   160   164   172   182   166   170   178   192   186
Price  :   292   280   260   234   266   254   230   190   200

Solution.

CALCULATION OF COEFFICIENT OF CONCURRENT DEVIATIONS

 Year    Supply    Sign of deviation from    Price    Sign of deviation from    Product of
                   preceding value (x)                preceding value (y)       deviations (xy)
 1993     160                                 292
 1994     164             +                   280             –                       –
 1995     172             +                   260             –                       –
 1996     182             +                   234             –                       –
 1997     166             –                   266             +                       –
 1998     170             +                   254             –                       –
 1999     178             +                   230             –                       –
 2000     192             +                   190             –                       –
 2001     186             –                   200             +                       –

Here we have : n = Number of pairs of deviations = 9 – 1 = 8
c = 0, since there is no pair of deviations having like signs, i.e., since no product of deviations xy is positive.
Coefficient of concurrent deviations is given by

$$r = \pm\sqrt{\pm\left(\frac{2c - n}{n}\right)} = \pm\sqrt{\pm\left(\frac{0 - 8}{8}\right)} = \pm\sqrt{\pm(-1)}$$

Since 2c – n = – 8 (negative), we take the negative sign inside and outside the square root to get

$$r = -\sqrt{-(-1)} = -1$$

Hence, there is perfect negative correlation between the supply and the price.

Example 8·33. Calculate the coefficient of concurrent deviations for the following data :

Supply :  65   40   35   75   63   80   35   20   80   60   50
Demand :  60   55   50   56   30   70   40   35   80   75   80

[C.A. (Foundation), Nov. 1997]

Solution.

CALCULATIONS FOR COEFFICIENT OF CONCURRENT DEVIATIONS

 Supply    Sign of the deviation from    Demand    Sign of the deviation from    Product of
  (X)      preceding value (x)            (Y)      preceding value (y)           deviations (xy)
   65                                      60
   40              –                       55              –                          +
   35              –                       50              –                          +
   75              +                       56              +                          +
   63              –                       30              –                          +
   80              +                       70              +                          +
   35              –                       40              –                          +
   20              –                       35              –                          +
   80              +                       80              +                          +
   60              –                       75              –                          +
   50              –                       80              +                          –

Here we have : n = Number of pairs of deviations = 11 – 1 = 10
c = Number of pairs of deviations having like signs = 9

The coefficient of concurrent deviations is given by :

$$r = \pm\sqrt{\pm\left(\frac{2c - n}{n}\right)} = \pm\sqrt{\pm\left(\frac{2 \times 9 - 10}{10}\right)} = \pm\sqrt{\pm\,0.8}$$

Since 2c – n = 8 is positive, we take the positive sign inside and outside the square root so that :

$$r = +\sqrt{0.8} = 0.89$$

EXERCISE 8·5

1. (a) Explain the method of concurrent deviations for computing the correlation between two variable series.
(b) Give the points of strength and weakness of finding out the relationship between two variables by the method of concurrent deviations.
2. Obtain the coefficient of correlation between price of rice and rainfall from the data given below by means of concurrent deviations.

 Year    Price of rice (in Rs.)    Annual rainfall
         per quintal               in centimetres
 1959          315                      175
 1960          340                      160
 1961          350                      158
 1962          350                      200
 1963          330                      198
 1964          335                      195
 1965          353                      196
 1966          333                      190
 1967          390                      191
 1968          340                      195
 1969          380                      196
 1970          340                      204

Ans. r = – 0·3015.
3.
Calculate the coefficient of correlation by the method of concurrent deviations from the following data :

Year   :  1991  1992  1993  1994  1995  1996  1997  1998  1999
Supply :    80    82    86    91    83    85    89    96    93
Price  :   146   140   130   117   133   127   115    95   100

Ans. r = –1.
4. Calculate the coefficient of correlation, using the method of concurrent deviations, between supply and demand of an item for a ten year period as given below :

Year   :  1981  1982  1983  1984  1985  1986  1987  1988  1989  1990
Supply :   125   160   164   174   155   170   165   162   172   175
Demand :   115   125   192   190   165   174   124   127   152   169

[C.A. (Foundation), Nov. 1995]
Ans. r = 0·75.
5. Calculate the coefficient of correlation by the concurrent deviations method.

Supply :  112  125  126  118  118  121  125  125  131  135
Price  :  106  102  102  104   98   96   97   97   95   90

Ans. r = –0·7.
6. Calculate correlation coefficient by concurrent deviations method :

X :  150  135   90  140  100
Y :   60   50  100   80   90

Ans. r = – 0·7071.
7. Calculate coefficient of concurrent deviations from the following data :

x :  100  120  135  135  115  110  110
y :   50   40   60   60   80   55   65

Ans. r = 0.
8. Calculate the coefficient of concurrent deviations from the following data :
No. of pairs of observations = 96
No. of pairs of concurrent deviations = 36
Ans. r = – 0·492.

8·9. COEFFICIENT OF DETERMINATION

Coefficient of correlation between two variable series is a measure of linear relationship between them and indicates the amount of variation of one variable which is associated with or is accounted for by another variable. A more useful and readily comprehensible measure for this purpose is the coefficient of determination, which gives the percentage variation in the dependent variable that is accounted for by the independent variable. In other words, the coefficient of determination gives the ratio of the explained variance to the total variance. The coefficient of determination is given by the square of the correlation coefficient, i.e., r².
Thus,

$$\text{Coefficient of determination} = r^2 = \frac{\text{Explained Variance}}{\text{Total Variance}}$$ …(8·18)

The coefficient of determination is a much more useful and better measure for interpreting the value of r. According to Tuttle :
“The coefficient of correlation has been grossly over-rated and is used entirely too much. Its square, the coefficient of determination is a much more useful measure of the linear covariation of two variables. The reader should develop the habit of squaring every correlation coefficient he finds, cited or stated before coming to any conclusion about the extent of the linear relationship between the two correlated variables.”

For example, if the value of r = 0·8, we cannot conclude that 80% of the variation in the relative series (dependent variable) is due to the variation in the subject series (independent variable). But the coefficient of determination in this case is r² = 0·64, which implies that only 64% of the variation in the relative series has been explained by the subject series and the remaining 36% of the variation is due to other factors.

By the same argument, while comparing two correlation coefficients, one of which is 0·4 and the other 0·8, it is misleading to conclude that the correlation in the second case is twice as high as the correlation in the first case. The coefficient of determination clearly explains this viewpoint, since in the case r = 0·4 the coefficient of determination is 0·16 and in the case r = 0·8 the coefficient of determination is 0·64, from which we conclude that the correlation in the second case is four times as high as the correlation in the first case.

Remarks 1. The above discussion implies that : “The closeness of the relationship between two variables as determined by correlation coefficient r is not proportional.”
2. The following table gives the values of the coefficient of determination (r²) for different values of r.
r  :  0·1   0·2   0·3   0·4   0·5   0·6   0·7   0·8   0·9   1·0
r² :  0·01  0·04  0·09  0·16  0·25  0·36  0·49  0·64  0·81  1·00

It may be seen from the above table that as the value of r decreases, r² decreases very rapidly except in two particular cases r = 0 and r = 1 when we get r² = r.
3. Coefficient of determination is always non-negative and as such it does not tell us about the direction of the relationship (whether it is positive or negative) between the two series.
4. Coefficient of Non-determination. The ratio of the unexplained variation to the total variation is called the coefficient of non-determination. It is usually denoted by K² and is given by the formula :

$$K^2 = \frac{\text{Unexplained Variance}}{\text{Total Variance}} = 1 - \frac{\text{Explained Variance}}{\text{Total Variance}} = 1 - r^2$$ …(8·19)

5. Coefficient of Alienation. The coefficient of alienation is given by the square root of the coefficient of non-determination, i.e., by K as given below :

$$K = \pm\sqrt{1 - r^2}$$ …(8·19a)

EXERCISE 8·6

1. (a) If the correlation coefficient is 0·7, then what is the coefficient of determination ? Also interpret its value.
[C.A. (Foundation), Nov. 1997]
(b) What is the coefficient of determination ? How is it useful in interpreting the value of an observed correlation coefficient r ? Explain with the help of an example.
(c) “A high value of the coefficient of determination is neither necessary nor sufficient to ensure a causal relationship between X and Y.” Explain.
[Delhi Univ. B.A. (Econ. Hons.), 2005]
2. Explain the terms : (i) Coefficient of non-determination, (ii) Coefficient of alienation, and give their physical interpretation.
3. (a) A correlation coefficient of 0·5 does not mean that 50% of the data are explained. Comment.
[Delhi Univ. B.Com. (Hons.), 1998]
Ans. Statement is true. Only 25% of the variation is explained.
(b) Do you agree with the statement : “r = 0·8 implies that 80% of the data are explained.”
Ans. No. Only 64% of the data are explained.
4.
The coefficient of correlation between consumption expenditure (c) and disposable income (y) in a study was found to be + 0·6. What percentage of variation in c is explained by variation in y ?
Ans. 36% of the variation in c is explained by variation in y.
5. (a) A correlation between two variables has a value r = 0·6 and a correlation between other two variables is 0·3. Does it mean that the first correlation is twice as strong as the second ?
(b) Correlation between two variables has a value of 0·9 and a correlation between other two variables is 0·3. Can we infer that the first correlation is thrice as strong as the second ? Give reasons.
[Osmania Univ. B.Com., 1997]
Ans. (a) No. (b) No.
6. Comment on the following : “The closeness of the relationship between two variables as determined by r, the correlation coefficient between them, is proportional.”
Ans. Statement is wrong.
7. If correlation coefficient between random variables x and y is +ve, comment on the following statements :
(i) Correlation coefficient between –x and –y is +ve.
(ii) We cannot infer about the sign of correlation between x and –y.
(iii) Interpret the result r² = 0·64, where r² is the coefficient of determination.
[Delhi Univ. B.A. (Econ. Hons.), 2000]
Ans. (i) True, (ii) False. r(x, –y) = – r(x, y) < 0, (iii) 64% of variation in Y is explained by the linear regression of Y on X. [c.f. Chapter 9.]
8. If the correlation coefficient between X and Y is 0·4 and between U and V is 0·8, does this imply that the extent of association between U and V is twice as that between X and Y ?
[Delhi Univ. B.A. (Econ. Hons.), 1999]
9. Calculate correlation coefficient and coefficient of determination from the following data.
n = 10, ∑X = 140, ∑Y = 150, ∑(X – 10)² = 180, ∑(Y – 15)² = 215, ∑(X – 10)(Y – 15) = 60.
[Delhi Univ. B.Com. (Hons.), (External), 2007]
Ans. r = 0·915; Coefficient of determination = r² = 0·8372
Hint.
U = X – 10, V = Y – 15, n = 10
∑U = ∑X – 10 × 10 = 140 – 100 = 40 ; ∑V = ∑Y – 15 × 10 = 150 – 150 = 0
∑U² = 180 ; ∑V² = 215 ; ∑UV = 60.

$$r_{XY} = r_{UV} = \frac{n\sum UV - (\sum U)(\sum V)}{\sqrt{\left[n\sum U^2 - (\sum U)^2\right]\left[n\sum V^2 - (\sum V)^2\right]}} = 0.915$$

8·10. LAG AND LEAD CORRELATION

When there exists a cause and effect relationship between two time series, it is usually observed that there is a time lag between the changes in the values of the independent variable (also called the subject series) and the dependent variable (also called the relative series). Such a phenomenon is usually observed in most economic and business time series. For example, the monthly advertisement expenditure of a firm on its product and the sales of the product have a fairly good degree of positive correlation. However, the effect of the advertisement expenditure will be felt on the increased sales of the product only after a certain period of time, which may be 3 or 4 months or even more. This tendency on the part of the effect (change in the dependent variable or relative) to occur some time after the occurrence of the cause (change in independent variable or subject) is known as ‘lag’.

If it is known that ‘lag’ exists between two time series, it is imperative to make adjustment for it before computing the correlation coefficient between the two series, otherwise fallacious conclusions will be drawn. In order to make allowance for the ‘lag’, it is necessary to determine the ‘time-lag’, i.e., to estimate the time period which lapses before the dependent variable changes in response to a change in the independent variable. The ‘period of lag’ can be estimated by plotting the two series on a graph paper and noting the time distance between the peaks/troughs of two curves.
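The adjustment for lag can also be explored numerically. The sketch below is only an illustration under stated assumptions: the helper names `pearson_r` and `lagged_r` and the monthly figures are hypothetical, and scanning several candidate lags for the highest correlation is a numerical substitute for the graphical inspection of peaks and troughs described above, not the text’s own procedure. The sales figures are constructed so that sales respond to advertisement expenditure exactly three months later.

```python
import math

def pearson_r(u, v):
    """Karl Pearson's product-moment correlation of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def lagged_r(x, y, k):
    """Correlation between x[t] and y[t + k], i.e. y lagging x by k periods."""
    if k == 0:
        return pearson_r(x, y)
    return pearson_r(x[:-k], y[k:])

# Hypothetical monthly figures (Jan. to Dec.), in '000 Rs.
advertising = [10, 12, 15, 11, 14, 18, 16, 13, 17, 20, 19, 15]
# Constructed so that sales in month t + 3 equal 2 * advertising in month t + 10.
sales = [40, 38, 41] + [2 * a + 10 for a in advertising[:9]]

for k in range(6):
    print(k, round(lagged_r(advertising, sales, k), 4))

best_lag = max(range(6), key=lambda k: lagged_r(advertising, sales, k))
print("lag with highest correlation:", best_lag)
```

With twelve monthly values and a lag of 3, only nine pairs remain for the computation, which is why long lags on short series should be treated with caution.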
If the peak (trough) in the dependent variable (sales) comes k months after the peak (trough) in the independent variable (advertisement expenditure), then there is a k-month time-lag between the two variables. Here we say that the ‘advertisement expenditure curve’ will lead by k months and the ‘sales curve’ would lag by k months.

Illustration. Let us consider the following hypothetical time series values of monthly advertisement expenditure (in ’000 Rs.) and sales (in ’000 Rs.) of the product of a firm.

 Month     Advertisement expenditure    Sales
           (’000 Rs.) (x)               (’000 Rs.) (y)
 Jan.            x₁                        y₁
 Feb.            x₂                        y₂
 March           x₃                        y₃
 April           x₄                        y₄
 May             x₅                        y₅
 June            x₆                        y₆
 July            x₇                        y₇
 Aug.            x₈                        y₈
 Sept.           x₉                        y₉
 Oct.            x₁₀                       y₁₀
 Nov.            x₁₁                       y₁₁
 Dec.            x₁₂                       y₁₂

Suppose that ‘advertisement expenditure’ is known to have its effect on the ‘sales’ after 3 months. After making allowance for this ‘time-lag’ of 3 months, the correct value of the correlation coefficient is obtained on computing Karl Pearson’s correlation coefficient between the following two series of values.

x :  x₁   x₂   x₃   x₄   x₅   x₆   x₇    x₈    x₉
y :  y₄   y₅   y₆   y₇   y₈   y₉   y₁₀   y₁₁   y₁₂

Note : The study of ‘Lag-correlation’ is specially useful in the study of economic and business time series.

9 Linear Regression Analysis

9·1. INTRODUCTION

The literal or dictionary meaning of the word ‘Regression’ is ‘stepping back or returning to the average value’. The term was first used by the British biometrician Sir Francis Galton in the later part of the 19th century in connection with some studies he made on estimating the extent to which the stature of the sons of tall parents reverts or regresses back to the mean stature of the population. He studied the relationship between the heights of about one thousand fathers and sons and published the results in a paper ‘Regression towards Mediocrity in Hereditary Stature’. The interesting features of his study were :
(i) The tall fathers have tall sons and short fathers have short sons.
(ii) The average height of the sons of a group of tall fathers is less than that of the fathers and the average height of the sons of a group of short fathers is more than that of the fathers.

In other words, Galton’s studies revealed that the offspring of abnormally tall or short parents tend to revert or step back to the average height of the population, a phenomenon which Galton described as Regression to Mediocrity. He concluded that if the average height of a certain group of fathers is ‘a’ cms. above (below) the general average height then the average height of their sons will be (a × r) cms. above (below) the general average height, where r is the correlation coefficient between the heights of the given group of fathers and their sons. In this case correlation is positive and since | r | ≤ 1 we have a × r ≤ a. This supports the result in (ii) above.

But today the word regression as used in Statistics has a much wider perspective without any reference to biometry. Regression analysis, in the general sense, means the estimation or prediction of the unknown value of one variable from the known value of the other variable. It is one of the most important statistical tools and is extensively used in almost all sciences : natural, social and physical. It is specially used in business and economics to study the relationship between two or more variables that are related causally and for estimation of demand and supply curves, cost functions, production and consumption functions, etc.

Prediction or estimation is one of the major problems in almost all spheres of human activity. The estimation or prediction of future production, consumption, prices, investments, sales, profits, income, etc., is of paramount importance to a businessman or economist. Population estimates and population projections are indispensable for efficient planning of an economy. The pharmaceutical concerns are interested in studying or estimating the effect of new drugs on patients.
Regression analysis is one of the most scientific techniques for making such predictions. In the words of M.M. Blair, “Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data”.

We come across a number of inter-related events in our day-to-day life. For instance, the yield of a crop depends on the rainfall, the cost or price of a product depends on the production and advertising expenditure, the demand for a particular product depends on its price, the expenditure of a person depends on his income, and so on. The regression analysis confined to the study of only two variables at a time is termed simple regression. But quite often the values of a particular phenomenon may be affected by a multiplicity of factors. The regression analysis for studying more than two variables at a time is known as multiple regression. However, in this chapter we shall confine ourselves to simple regression only.

In regression analysis there are two types of variables. The variable whose value is influenced or is to be predicted is called the dependent variable, and the variable which influences the values or is used for prediction is called the independent variable. In regression analysis the independent variable is also known as regressor or predictor or explanator while the dependent variable is also known as regressed or explained variable.

9·2. LINEAR AND NON-LINEAR REGRESSION

If the given bivariate data are plotted on a graph, the points so obtained on the scatter diagram will more or less concentrate round a curve, called the ‘curve of regression’. Often such a curve is not distinct and is quite confusing and sometimes complicated too. The mathematical equation of the regression curve, usually called the regression equation, enables us to study the average change in the value of the dependent variable for any given value of the independent variable.
If the regression curve is a straight line, we say that there is linear regression between the variables under study. The equation of such a curve is the equation of a straight line, i.e., a first degree equation in the variables x and y. In case of linear regression the values of the dependent variable increase by a constant absolute amount for a unit change in the value of the independent variable. However, if the curve of regression is not a straight line, the regression is termed as curved or non-linear regression. The regression equation will then be a functional relation between x and y involving terms in x and y of degree higher than one, i.e., involving terms of the type $x^2$, $y^2$, $xy$, etc.
However, in this chapter we shall confine our discussion to linear regression between two variables only.
9·3. LINES OF REGRESSION
Line of regression is the line which gives the best estimate of one variable for any given value of the other variable. In case of two variables x and y, we shall have two lines of regression; one of y on x and the other of x on y.
Definition. Line of regression of y on x is the line which gives the best estimate for the value of y for any specified value of x. Similarly, line of regression of x on y is the line which gives the best estimate for the value of x for any specified value of y.
The term best fit is interpreted in accordance with the Principle of Least Squares, which consists in minimising the sum of the squares of the residuals or the errors of estimates, i.e., the deviations between the given observed values of the variable and their corresponding estimated values as given by the line of best fit.
We may minimise the sum of the squares of the errors parallel to the y-axis or parallel to the x-axis. The former (i.e., minimising the sum of squares of errors parallel to the y-axis) gives the equation of the line of regression of y on x, and the latter, viz., minimising the sum of squares of the errors parallel to the x-axis, gives the equation of the line of regression of x on y. We shall explain below the technique of deriving the equation of the line of regression of y on x.
9·3·1. Derivation of Line of Regression of y on x. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be n pairs of observations on the two variables x and y under study. Let
$$y = a + bx \tag{9·1}$$
be the line of regression (best fit) of y on x.
For any given point $P_i(x_i, y_i)$ in the scatter diagram, the error of estimate or residual as given by the line of best fit (9·1) is $P_iH_i$. Now, the x-coordinate of $H_i$ is the same as that of $P_i$, viz., $x_i$, and since $H_i$ lies on the line (9·1), the y-coordinate of $H_i$, i.e., $H_iM$, is given by $(a + bx_i)$. Hence, the error of estimate for $P_i$ is given by
$$P_iH_i = P_iM - H_iM = y_i - (a + bx_i) \tag{9·2}$$
[Fig. 9·1. Scatter diagram showing the line y = a + bx, the point $P_i(x_i, y_i)$, the point $H_i(x_i, a + bx_i)$ on the line and the point M on the X-axis.]
This is the error (parallel to the y-axis) for the ith point. We will have such errors for all the points on the scatter diagram. For the points which lie above the line, the error would be positive and for the points which lie below the line, the error would be negative.
According to the principle of least squares, we have to determine the constants a and b in (9·1) such that the sum of the squares of the errors of estimates is minimum. In other words, we have to minimise
$$E = \sum_{i=1}^{n} P_iH_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2 \tag{9·3}$$
subject to variations in a and b. We may also write E as:
$$E = \sum (y - y_e)^2 = \sum (y - a - bx)^2 \tag{9·3a}$$
where $y_e$ is the estimated value of y as given by (9·1) for a given value of x and the summation (∑) is taken over the n pairs of observations.
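The quantity E of (9·3) can be evaluated directly for any trial values of a and b. The following is a minimal sketch in Python; the function name `sum_squared_errors`, the five data points and the two trial lines are assumptions chosen only for illustration and do not come from the text.

```python
def sum_squared_errors(xs, ys, a, b):
    """Return E = sum of (y_i - a - b*x_i)^2, the quantity defined in (9.3)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two trial lines y = a + b*x; the one with the smaller E fits better
# in the least-squares sense.
e_first = sum_squared_errors(xs, ys, a=2.0, b=0.5)
e_second = sum_squared_errors(xs, ys, a=2.2, b=0.6)
print(e_first, e_second)
```

For these particular numbers the second trial line gives the smaller sum of squared errors, so it is the better of the two in the sense of the principle of least squares.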
Using the principle of maxima and minima in differential calculus, E will have an extremum (maximum or minimum) for variations in a and b if its partial derivatives w.r.t. a and b vanish separately. Hence from (9·3a), we get
$$\frac{\partial E}{\partial a} = -2\sum (y - a - bx) = 0 \quad\text{and}\quad \frac{\partial E}{\partial b} = -2\sum x\,(y - a - bx) = 0 \tag{9·4}$$
$$\Rightarrow\quad \sum y = na + b\sum x \tag{9·5}$$
$$\text{and}\quad \sum xy = a\sum x + b\sum x^2 \tag{9·6}$$
These equations are known as the normal equations for estimating a and b. The quantities ∑x, ∑x², ∑y, ∑xy can be obtained from the given set of n points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and we can solve the equations (9·5) and (9·6) simultaneously for a and b, to get:
$$a = \frac{(\sum x^2)(\sum y) - (\sum x)(\sum xy)}{n\sum x^2 - (\sum x)^2} \tag{9·7}$$
$$\text{and}\quad b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \tag{9·8}$$
Substituting these values of a and b from (9·7) and (9·8) in (9·1), we get the required equation of the line of regression of y on x.
The equation of the line of regression of y on x can be obtained in a much more systematic and simplified form in terms of $\bar{x}$, $\bar{y}$, $\sigma_x$, $\sigma_y$ and $r = r_{xy}$, as explained below.
Dividing both sides of (9·5) by n, the total number of pairs, we get
$$\frac{1}{n}\sum y = a + b\cdot\frac{1}{n}\sum x \quad\Rightarrow\quad \bar{y} = a + b\,\bar{x} \tag{9·9}$$
This implies that the line of best fit, i.e., the line of regression of y on x, passes through the point $(\bar{x}, \bar{y})$. In other words, the point $(\bar{x}, \bar{y})$ lies on the line of regression of y on x.
Dividing the numerator and the denominator of (9·8) by n², we get:
$$b = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} \tag{9·10}$$
We find that the equation (9·1) is in the slope-intercept form, viz., y = mx + c. Hence b represents the slope of the line of regression of y on x. Further, we have proved in (9·9) that this line passes through the point $(\bar{x}, \bar{y})$. Hence, using the point-slope form of the equation of a line, the required equation of the line of regression of y on x becomes:
$$y - \bar{y} = b\,(x - \bar{x}) \tag{9·11}$$
$$\text{or}\quad y - \bar{y} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2}\,(x - \bar{x}) \tag{9·12}$$
But
$$r = r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x\sigma_y} \quad\Rightarrow\quad \mathrm{Cov}(x, y) = r\,\sigma_x\sigma_y$$
Substituting in (9·12), we may also write the equation of the line of regression of y on x as:
$$y - \bar{y} = \frac{r\,\sigma_x\sigma_y}{\sigma_x^2}\,(x - \bar{x}) \quad\Rightarrow\quad y - \bar{y} = \frac{r\,\sigma_y}{\sigma_x}\,(x - \bar{x}) \tag{9·13}$$
Remarks 1. The condition $\frac{\partial E}{\partial a} = 0$ and $\frac{\partial E}{\partial b} = 0$ is only a requirement for an extremum (maximum or minimum) of E. Sufficient conditions for a minimum of E for variations in a and b are:
$$\text{(i)}\quad \frac{\partial E}{\partial a} = 0, \quad \frac{\partial E}{\partial b} = 0 \tag{*}$$
$$\text{and (ii)}\quad \frac{\partial^2 E}{\partial a^2} > 0 \quad\text{and}\quad \Delta = \begin{vmatrix} \dfrac{\partial^2 E}{\partial a^2} & \dfrac{\partial^2 E}{\partial a\,\partial b} \\[2ex] \dfrac{\partial^2 E}{\partial b\,\partial a} & \dfrac{\partial^2 E}{\partial b^2} \end{vmatrix} > 0 \tag{**}$$
Theorem. The solution of the least square equations (9·5) and (9·6) provides a minimum of E defined in (9·3).
The proof of this theorem is beyond the scope of the book.
2. From (9·4) we have:
$$\sum (y - a - bx) = 0 \quad\Rightarrow\quad \sum (y - y_e) = 0 \tag{9·14}$$
where $y_e$ is the estimated value of y for a given value of x as given by the line of regression of y on x (9·1).
3. The line of regression of y on x passes through the point $(\bar{x}, \bar{y})$.
4. Fitting of linear and non-linear regression (trends) is discussed in detail in Chapter 11 on ‘Time Series Analysis’ for determining the trend values.
9·3·2. Line of Regression of x on y. The line of regression of x on y is the line which gives the best estimate of x for any given value of y. It is also obtained by the principle of least squares on minimising the sum of squares of the errors parallel to the x-axis (See Fig. 9·2).
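Before the x-on-y case is taken up, the closed-form solution (9·7) and (9·8) of the normal equations for the line of y on x, together with Remarks 2 and 3 of § 9·3·1, can be checked numerically. The sketch below is only an illustration: the function name `fit_y_on_x` and the data are hypothetical and are not taken from the text.

```python
def fit_y_on_x(xs, ys):
    """Solve the normal equations (9.5)-(9.6) using formulae (9.7) and (9.8).

    Returns (a, b) for the line y = a + b*x.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = (sum_xx * sum_y - sum_x * sum_xy) / denom   # formula (9.7)
    b = (n * sum_xy - sum_x * sum_y) / denom        # formula (9.8)
    return a, b


# Hypothetical illustrative data (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_y_on_x(xs, ys)

# Remark 2: the residuals y - y_e sum to zero (up to rounding).
residual_sum = sum(y - (a + b * x) for x, y in zip(xs, ys))

# Remark 3: the fitted line passes through the point of means.
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
print(a, b, residual_sum, a + b * x_bar, y_bar)
```

The same function gives the coefficients A and B of the line of x on y if it is called with the two lists interchanged, since the errors are then measured in x instead of y.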
By starting with the equation of the form:
$$x = A + By \tag{9·15}$$
and minimising the sum of the squares of errors of estimates of x, i.e., deviations between the given values of x and their estimates given by the line of regression of x on y, viz., (9·15), i.e., minimising
$$E = \sum (x - A - By)^2 \tag{9·16}$$
we shall get the normal equations for estimating A and B as:
$$\sum x = nA + B\sum y \quad\text{and}\quad \sum xy = A\sum y + B\sum y^2 \tag{9·17}$$
Solving (9·17) simultaneously for A and B, we shall get
$$A = \frac{(\sum y^2)(\sum x) - (\sum y)(\sum xy)}{n\sum y^2 - (\sum y)^2} \tag{9·18}$$
$$\text{and}\quad B = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2} \tag{9·19}$$
Substituting these values of A and B in (9·15), we shall get the required equation of the line of regression of x on y.
[Fig. 9·2. Scatter diagram showing the line x = A + By.]
Remark. The values of A and B obtained in (9·18) and (9·19) are the same as in equations (9·7) and (9·8) with x changed to y and y to x.
Proceeding exactly as in the case of the line of regression of y on x, we shall get from (9·17) the following results:
$$\text{(i)}\quad \bar{x} = A + B\,\bar{y} \tag{9·20}$$
This implies that the line of regression of x on y passes through the point $(\bar{x}, \bar{y})$.
$$\text{(ii)}\quad B = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} = \frac{r\,\sigma_x}{\sigma_y} \tag{9·21}$$
(iii) The equation of the line of regression of x on y is
$$x - \bar{x} = B\,(y - \bar{y}) \tag{9·22}$$
$$\Rightarrow\quad x - \bar{x} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2}\,(y - \bar{y}) \tag{9·23}$$
$$\Rightarrow\quad x - \bar{x} = \frac{r\,\sigma_x}{\sigma_y}\,(y - \bar{y}) \tag{9·24}$$
The derivation of these results is left as an exercise to the reader.
Remarks 1. The regression equation (9·13) implies that the line of regression of y on x passes through the point $(\bar{x}, \bar{y})$. Similarly (9·24) implies that the line of regression of x on y also passes through the point $(\bar{x}, \bar{y})$. Hence both the lines of regression pass through the point $(\bar{x}, \bar{y})$. In other words, the mean values $(\bar{x}, \bar{y})$ can be obtained as the point of intersection of the two regression lines.
2. Why two lines of regression? There are always two lines of regression, one of y on x and the other of x on y.
The line of regression of y on x, (9·12) or (9·13), is used to estimate or predict the value of y for any given value of x, i.e., when y is a dependent variable and x is an independent variable. The estimate so obtained will be best in the sense that it will have the minimum possible error as defined by the principle of least squares. We can also obtain an estimate of x for any given value of y by using equation (9·13), but the estimate so obtained will not be best since (9·13) is obtained on minimising the sum of the squares of errors of estimates in y and not in x. Hence to estimate or predict x for any given value of y, we use the regression equation of x on y (9·24), which is derived on minimising the sum of the squares of errors of estimates in x. Here x is a dependent variable and y is an independent variable. The two regression equations are not reversible or interchangeable, for the simple reason that the basis and assumptions for deriving these equations are quite different. The regression equation of y on x is obtained on minimising the sum of the squares of the errors parallel to the y-axis, while the regression equation of x on y is obtained on minimising the sum of squares of the errors parallel to the x-axis.
In the particular case of perfect correlation, positive or negative, i.e., r = ± 1, the equation of the line of regression of y on x becomes:
$$y - \bar{y} = \pm\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}) \quad\Rightarrow\quad \frac{y - \bar{y}}{\sigma_y} = \pm\frac{x - \bar{x}}{\sigma_x} \tag{*}$$
Similarly, the equation of the line of regression of x on y becomes:
$$x - \bar{x} = \pm\frac{\sigma_x}{\sigma_y}\,(y - \bar{y}) \quad\Rightarrow\quad \frac{y - \bar{y}}{\sigma_y} = \pm\frac{x - \bar{x}}{\sigma_x} \tag{**}$$
which is the same as (*). Hence in case of perfect correlation (r = ± 1), both the lines of regression coincide. Therefore, in general, we always have two lines of regression except in the particular case of perfect correlation (r = ± 1), when both the lines coincide and we get only one line.
9·3·3. Angle Between the Regression Lines.
If θ is the acute angle between the two lines of regression, then
$$\theta = \tan^{-1}\left\{ \frac{1 - r^2}{|r|} \left( \frac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2} \right) \right\} \tag{9·25}$$
In particular, if r = ± 1, then θ = tan⁻¹(0) ⇒ θ = 0 or π, i.e., the two lines are either coincident (θ = 0) or they are parallel (θ = π). But since both the lines of regression intersect at the point $(\bar{x}, \bar{y})$, they cannot be parallel. Hence in case of perfect correlation, positive or negative, the two lines of regression coincide.
If r = 0, then from (9·25), θ = tan⁻¹(∞) = π/2, i.e., if the variables are uncorrelated, the two lines of regression become perpendicular to each other.
Remarks 1. When r = 0, i.e., when x and y are uncorrelated, the lines of regression of y on x and x on y are given respectively by [From (9·13) and (9·24)]:
$$y - \bar{y} = 0 \;\Rightarrow\; y = \bar{y} \quad\text{and}\quad x - \bar{x} = 0 \;\Rightarrow\; x = \bar{x}$$
$y = \bar{y}$ represents a line parallel to the X-axis at a distance of $\bar{y}$ units from the origin, and $x = \bar{x}$ represents a line parallel to the Y-axis at a distance of $\bar{x}$ units from the origin. Hence, if r = 0, the two lines of regression are perpendicular to each other and are parallel to the x-axis and y-axis respectively, as shown in Fig. 9·2(a).
[Fig. 9·2(a). The lines $y = \bar{y}$ and $x = \bar{x}$, meeting at the point $(\bar{x}, \bar{y})$.]
2. We have seen above that if r = 0 (variables uncorrelated), the two lines of regression are perpendicular to each other and if r = ± 1, θ = 0, i.e., the two lines coincide. This leads us to the conclusion that for a higher degree of correlation between the variables, the angle between the lines is smaller, i.e., the two lines of regression are nearer to each other. On the other hand, the angle between the lines increases, i.e., the lines of regression move apart, as the value of the correlation coefficient decreases. In other words, if the lines of regression make a larger angle, they indicate a poor degree of correlation between the variables and ultimately, for θ = π/2, the lines become perpendicular, which indicates that no correlation exists between the variables.
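The behaviour of the angle described above can be explored numerically with formula (9·25), i.e., tan θ = {(1 – r²)/| r |} · σxσy/(σx² + σy²). The following sketch is an illustration only; the function name `angle_between_regression_lines` and the trial values of r, σx and σy are assumptions, not part of the text.

```python
import math


def angle_between_regression_lines(r, sigma_x, sigma_y):
    """Acute angle (in radians) between the two regression lines, per (9.25)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if sigma_x <= 0 or sigma_y <= 0:
        raise ValueError("standard deviations must be positive")
    if r == 0:
        # tan(theta) is infinite: the lines y = y_bar and x = x_bar are perpendicular.
        return math.pi / 2
    tan_theta = (1 - r * r) / abs(r) * (sigma_x * sigma_y) / (sigma_x ** 2 + sigma_y ** 2)
    return math.atan(tan_theta)


# Hypothetical trial values: the angle shrinks as |r| moves towards 1.
for r in (0.0, 0.5, 0.9, 1.0):
    theta = angle_between_regression_lines(r, sigma_x=1.0, sigma_y=1.0)
    print(r, round(math.degrees(theta), 2))
```

With equal standard deviations the angle falls from 90° at r = 0 to 0° at r = 1, in line with the remark that a smaller angle indicates a higher degree of correlation.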
Thus by plotting the lines of regression on a graph paper, we can have an approximate idea about the degree of correlation between the two variables under study. Some illustrations are given below in Fig. 9·3(a) to Fig. 9·3(e).
[Fig. 9·3(a) to Fig. 9·3(e). Panels: two lines coincide (r = –1); two lines coincide (r = 1); two lines perpendicular (r = 0); two lines apart (low degree of correlation); two lines closer (high degree of correlation).]
9·4. COEFFICIENTS OF REGRESSION
Let us consider the line of regression of y on x, viz.,
$$y = a + bx$$
The coefficient ‘b’, which is the slope of the line of regression of y on x, is called the coefficient of regression of y on x. It represents the increment in the value of the dependent variable y for a unit change in the value of the independent variable x. In other words, it represents the rate of change of y w.r.t. x. For notational convenience, the slope b, i.e., the coefficient of regression of y on x, is written as $b_{yx}$.
Similarly in the regression equation of x on y, viz.,
$$x = A + By$$
the coefficient B represents the change in the value of the dependent variable x for a unit change in the value of the independent variable y and is called the coefficient of regression of x on y. For notational convenience, it is written as $b_{xy}$.
Notations
$b_{yx}$ = Coefficient of regression of y on x.
$b_{xy}$ = Coefficient of regression of x on y.
From (9·10), the coefficient of regression of y on x is given by
$$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} = \frac{r\,\sigma_y}{\sigma_x} \qquad [\because\ \mathrm{Cov}(x, y) = r\,\sigma_x\sigma_y] \tag{9·26}$$
Similarly from (9·21), the coefficient of regression of x on y is given by:
$$b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} = \frac{r\,\sigma_x}{\sigma_y} \tag{9·27}$$
Accordingly, the equation of the line of regression of y on x becomes
$$y - \bar{y} = b_{yx}\,(x - \bar{x}) \tag{9·28}$$
and the equation of the line of regression of x on y becomes:
$$x - \bar{x} = b_{xy}\,(y - \bar{y}) \tag{9·29}$$
Remarks 1.
For numerical computations of the equations of the lines of regression of y on x and x on y, the following formulae for the regression coefficients $b_{yx}$ and $b_{xy}$ are very convenient to use:
$$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \tag{9·30}$$
$$\text{and}\quad b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2} \tag{9·31}$$
Formulae (9·30) and (9·31) are very useful for computing the values of the regression coefficients from a given set of n points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Other convenient formulae for finding the regression coefficients in numerical problems are:
$$b_{yx} = \frac{r\,\sigma_y}{\sigma_x} \quad\text{and}\quad b_{xy} = \frac{r\,\sigma_x}{\sigma_y} \tag{9·32}$$
2. The correlation coefficient between two variables x and y is a symmetrical function of x and y, i.e., $r_{xy} = r_{yx}$. However, the regression coefficients are not symmetric functions of x and y, i.e., in general $b_{yx} \neq b_{xy}$.
3. We have:
$$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} \;\ldots(*), \qquad b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} \;\ldots(**), \qquad r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x\sigma_y} \;\ldots(***)$$
From (*) and (**), we observe that the sign of each regression coefficient $b_{yx}$ and $b_{xy}$ depends on the covariance term, since $\sigma_x > 0$ and $\sigma_y > 0$. If Cov (x, y) is positive, both the regression coefficients are positive and if Cov (x, y) is negative, both the regression coefficients are negative.
4. Further, since $\sigma_x > 0$ and $\sigma_y > 0$, the sign of each of r, $b_{yx}$ and $b_{xy}$ depends on the covariance term. If Cov (x, y) is positive, all the three are positive and if Cov (x, y) is negative, all the three are negative. This result can be stated slightly differently as follows:
The sign of the correlation coefficient is the same as that of the regression coefficients. If the regression coefficients are positive, r is positive and if the regression coefficients are negative, r is negative.
9·4·1. Theorems on Regression Coefficients
Theorem 9·1. The correlation coefficient is the geometric mean between the regression coefficients, i.e.,
$$r^2 = b_{yx}\cdot b_{xy} \tag{9·33}$$
Proof. We have,
$$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} = r\cdot\frac{\sigma_y}{\sigma_x} \tag{9·34}$$
$$\text{and}\quad b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2} = r\cdot\frac{\sigma_x}{\sigma_y} \tag{9·35}$$
Multiplying (9·34) and (9·35), we get $r^2 = b_{yx}\cdot b_{xy}$, which establishes the result.
$$\Rightarrow\quad r = \pm\sqrt{b_{yx}\cdot b_{xy}} \tag{9·36}$$
Remark. The sign to be taken before the square root is the same as that of the regression coefficients. If the regression coefficients are positive, we take the positive sign in (9·36) and if the regression coefficients are negative, we take the negative sign in (9·36).
Theorem 9·2. If one of the regression coefficients is greater than unity (one), the other must be less than unity.
Proof. If one of the regression coefficients is greater than 1, then the other must be less than one because otherwise, on using (9·33), we shall get:
$$r^2 = b_{yx}\cdot b_{xy} > 1,$$
which is impossible, since 0 ≤ r² ≤ 1.
Theorem 9·3. The arithmetic mean of the modulus values of the regression coefficients is not less than the modulus value of the correlation coefficient, i.e.,
$$\tfrac{1}{2}\left[\,|b_{yx}| + |b_{xy}|\,\right] \geq |r| \tag{9·37}$$
This follows because the arithmetic mean of two non-negative numbers is never less than their geometric mean, and by (9·36) the geometric mean of $|b_{yx}|$ and $|b_{xy}|$ is | r |.
Theorem 9·4. Regression coefficients are independent of change of origin but not of scale.
Symbolically, if we transform from x and y to new variables u and v by change of origin and scale, viz.,
$$u = \frac{x - a}{h}, \quad v = \frac{y - b}{k}, \quad\text{where a, b, h (> 0) and k (> 0) are constants,} \tag{9·38}$$
then
$$b_{yx} = \frac{k}{h}\,b_{vu} \quad\text{and}\quad b_{xy} = \frac{h}{k}\,b_{uv} \tag{9·39}$$
In particular, if we take h = k = 1, i.e., we transform the variables x and y to u and v by the relation:
$$u = x - a \quad\text{and}\quad v = y - b \tag{9·40}$$
i.e., by change of origin only, then from (9·39), we get
$$b_{xy} = b_{uv} = \frac{n\sum uv - (\sum u)(\sum v)}{n\sum v^2 - (\sum v)^2} \tag{9·40a}$$
$$\text{and}\quad b_{yx} = b_{vu} = \frac{n\sum uv - (\sum u)(\sum v)}{n\sum u^2 - (\sum u)^2} \tag{9·40b}$$
These formulae are very useful for obtaining the equations of the lines of regression if the mean values $\bar{x}$ and/or $\bar{y}$ come out to be in fractions or if the values of x and y are large.
Example 9·1. From the following data, obtain the two regression equations:
Sales     : 91 97 108 121 67 124 51 73 111 57
Purchases : 71 75 69 97 70 91 39 61 80 47
Solution.
Let us denote the sales by the variable X and the purchases by the variable Y.
CALCULATIONS FOR REGRESSION EQUATIONS
| x | y | dx = x – x̄ | dy = y – ȳ | dx² | dy² | dx dy |
|---|---|---|---|---|---|---|
| 91 | 71 | 1 | 1 | 1 | 1 | 1 |
| 97 | 75 | 7 | 5 | 49 | 25 | 35 |
| 108 | 69 | 18 | –1 | 324 | 1 | –18 |
| 121 | 97 | 31 | 27 | 961 | 729 | 837 |
| 67 | 70 | –23 | 0 | 529 | 0 | 0 |
| 124 | 91 | 34 | 21 | 1156 | 441 | 714 |
| 51 | 39 | –39 | –31 | 1521 | 961 | 1209 |
| 73 | 61 | –17 | –9 | 289 | 81 | 153 |
| 111 | 80 | 21 | 10 | 441 | 100 | 210 |
| 57 | 47 | –33 | –23 | 1089 | 529 | 759 |
| ∑x = 900 | ∑y = 700 | ∑dx = 0 | ∑dy = 0 | ∑dx² = 6360 | ∑dy² = 2868 | ∑dx dy = 3900 |
We have
$$\bar{x} = \frac{\sum x}{n} = \frac{900}{10} = 90; \qquad \bar{y} = \frac{\sum y}{n} = \frac{700}{10} = 70$$
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum dx\,dy}{\sum dx^2} = \frac{3900}{6360} = 0·6132$$
$$\text{and}\quad b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{\sum dx\,dy}{\sum dy^2} = \frac{3900}{2868} = 1·3598$$
Regression Equations
Equation of the line of regression of y on x is
$$y - \bar{y} = b_{yx}\,(x - \bar{x})$$
⇒ y – 70 = 0·6132 (x – 90) = 0·6132x – 55·188
⇒ y = 0·6132x – 55·188 + 70·000
⇒ y = 0·6132x + 14·812
Equation of the line of regression of x on y is
$$x - \bar{x} = b_{xy}\,(y - \bar{y})$$
⇒ x – 90 = 1·3598 (y – 70) = 1·3598y – 95·186
⇒ x = 1·3598y – 95·186 + 90·000
⇒ x = 1·3598y – 5·186
Remark. We have
$$r^2 = b_{yx}\cdot b_{xy} = 0·6132 \times 1·3598 = 0·8338 \quad\Rightarrow\quad r = \pm\sqrt{0·8338} = \pm\,0·9131$$
But since both the regression coefficients are positive, r must be positive. Hence, r = 0·9131.
Example 9·2. From the data given below find:
(a) The two regression coefficients.
(b) The two regression equations.
(c) The coefficient of correlation between the marks in Economics and Statistics.
(d) The most likely marks in Statistics when marks in Economics are 30.
Marks in Economics  : 25 28 35 32 31 36 29 38 34 32
Marks in Statistics : 43 46 49 41 36 32 31 30 33 39
[Himachal Pradesh Univ. M.A. (Econ.), 2003]
Solution. Let us denote the marks in Economics by the variable X and the marks in Statistics by the variable Y.
CALCULATIONS FOR REGRESSION EQUATIONS
| x | y | dx = x – x̄ = x – 32 | dy = y – ȳ = y – 38 | dx² | dy² | dx dy |
|---|---|---|---|---|---|---|
| 25 | 43 | –7 | 5 | 49 | 25 | –35 |
| 28 | 46 | –4 | 8 | 16 | 64 | –32 |
| 35 | 49 | 3 | 11 | 9 | 121 | 33 |
| 32 | 41 | 0 | 3 | 0 | 9 | 0 |
| 31 | 36 | –1 | –2 | 1 | 4 | 2 |
| 36 | 32 | 4 | –6 | 16 | 36 | –24 |
| 29 | 31 | –3 | –7 | 9 | 49 | 21 |
| 38 | 30 | 6 | –8 | 36 | 64 | –48 |
| 34 | 33 | 2 | –5 | 4 | 25 | –10 |
| 32 | 39 | 0 | 1 | 0 | 1 | 0 |
| ∑x = 320 | ∑y = 380 | ∑dx = 0 | ∑dy = 0 | ∑dx² = 140 | ∑dy² = 398 | ∑dx dy = –93 |
Here,
$$\bar{x} = \frac{\sum x}{n} = \frac{320}{10} = 32; \qquad \bar{y} = \frac{\sum y}{n} = \frac{380}{10} = 38$$
(a) Regression Coefficients
Coefficient of regression of y on x:
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum dx\,dy}{\sum dx^2} = \frac{-93}{140} = -0·6643$$
Coefficient of regression of x on y:
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{\sum dx\,dy}{\sum dy^2} = \frac{-93}{398} = -0·2337$$
(b) Regression Equations
Equation of the line of regression of x on y is:
$$x - \bar{x} = b_{xy}\,(y - \bar{y})$$
⇒ x – 32 = – 0·2337 (y – 38) = – 0·2337y + 0·2337 × 38 = – 0·2337y + 8·8806
⇒ x = – 0·2337y + 32 + 8·8806
⇒ x = – 0·2337y + 40·8806
Equation of the line of regression of y on x is:
$$y - \bar{y} = b_{yx}\,(x - \bar{x})$$
⇒ y – 38 = – 0·6643 (x – 32)
⇒ y = – 0·6643x + 38 + 0·6643 × 32 = – 0·6643x + 38 + 21·2576
⇒ y = – 0·6643x + 59·2576 …(*)
(c) Correlation Coefficient. We have
$$r^2 = b_{yx}\cdot b_{xy} = (-0·6643) \times (-0·2337) = 0·1552 \quad\Rightarrow\quad r = \pm\sqrt{0·1552} = \pm\,0·394$$
Since both the regression coefficients are negative, r must be negative. Hence, we get r = – 0·394.
(d) In order to estimate the most likely marks in Statistics (y) when marks in Economics (x) are 30, we shall use the line of regression of y on x, viz., the equation (*). Taking x = 30 in (*), the required estimate is given by
y = – 0·6643 × 30 + 59·2576 = –19·929 + 59·2576 = 39·3286
Hence, the most likely marks in Statistics when marks in Economics are 30 are 39·3286 ≈ 39.
Example 9·3.
A panel of judges A and B graded seven debaters and independently awarded the following marks:
Debater    : 1 2 3 4 5 6 7
Marks by A : 40 34 28 30 44 38 31
Marks by B : 32 39 26 30 38 34 28
An eighth debater was awarded 36 marks by Judge A while Judge B was not present. If Judge B had also been present, how many marks would you expect him to award to the eighth debater, assuming the same degree of relationship exists in judgement?
[Delhi Univ. B.Com (Hons.), 1993; Himachal Pradesh Univ. M.A. (Econ.), June 1999; Allahabad Univ. M.Com. 2002]
Solution. Let the marks awarded by Judge ‘A’ be denoted by the variable X and the marks awarded by Judge ‘B’ by the variable Y.
CALCULATIONS FOR REGRESSION EQUATIONS
| Debater | x | y | u = x – A = x – 35 | v = y – B = y – 30 | u² | v² | uv |
|---|---|---|---|---|---|---|---|
| 1 | 40 | 32 | 5 | 2 | 25 | 4 | 10 |
| 2 | 34 | 39 | –1 | 9 | 1 | 81 | –9 |
| 3 | 28 | 26 | –7 | –4 | 49 | 16 | 28 |
| 4 | 30 | 30 | –5 | 0 | 25 | 0 | 0 |
| 5 | 44 | 38 | 9 | 8 | 81 | 64 | 72 |
| 6 | 38 | 34 | 3 | 4 | 9 | 16 | 12 |
| 7 | 31 | 28 | –4 | –2 | 16 | 4 | 8 |
| Total | | | ∑u = 0 | ∑v = 17 | ∑u² = 206 | ∑v² = 185 | ∑uv = 121 |
The marks awarded by Judge A to the eighth debater are given to be 36, i.e., we are given x = 36. We want to find the marks which would have been given to the eighth debater by Judge B, if he were present. In other words, we want to find y when x = 36. To do this we need the equation of the line of regression of y on x.
In the usual notations we have:
$$\bar{x} = A + \frac{\sum u}{n} = 35 + \frac{0}{7} = 35, \qquad \bar{y} = B + \frac{\sum v}{n} = 30 + \frac{17}{7} = 32·4286$$
$$b_{yx} = b_{vu} = \frac{n\sum uv - (\sum u)(\sum v)}{n\sum u^2 - (\sum u)^2} = \frac{7 \times 121 - 0 \times 17}{7 \times 206 - 0} = \frac{121}{206} = 0·5874$$
The equation of the line of regression of y on x is given by
$$y - \bar{y} = b_{yx}\,(x - \bar{x})$$
⇒ y – 32·4286 = 0·5874 (x – 35) = 0·5874x – 0·5874 × 35
⇒ y = 0·5874x – 20·5590 + 32·4286
⇒ y = 0·5874x + 11·8696
When x = 36, y = 0·5874 × 36 + 11·8696 = 21·1464 + 11·8696 = 33·016
Hence, if Judge B had also been present, he would have given about 33 marks to the eighth debater.
Example 9·4. A departmental store gives in-service training to its salesmen which is followed by a test.
It is considering whether it should terminate the service of any salesman who does not do well in the test. The following data give the test scores and sales made by nine salesmen during a certain period:
Test scores      : 14 19 24 21 26 22 15 20 19
Sales (’000 Rs.) : 31 36 48 37 50 45 33 41 39
Calculate the coefficient of correlation between the test scores and the sales. Does it indicate that the termination of services of salesmen with low test scores is justified? If the firm wants a minimum sales volume of Rs. 30,000, what is the minimum test score that will ensure continuation of service? Also estimate the most probable sales volume of a salesman making a score of 28.
[Delhi Univ. B.Com. (Hons.), 2003]
Solution. Let x denote the test scores of the salesmen and y denote their corresponding sales (in ’000 Rs.).
CALCULATIONS FOR REGRESSION LINES
| x | y | dx = x – x̄ = x – 20 | dy = y – ȳ = y – 40 | dx² | dy² | dx dy |
|---|---|---|---|---|---|---|
| 14 | 31 | –6 | –9 | 36 | 81 | 54 |
| 19 | 36 | –1 | –4 | 1 | 16 | 4 |
| 24 | 48 | 4 | 8 | 16 | 64 | 32 |
| 21 | 37 | 1 | –3 | 1 | 9 | –3 |
| 26 | 50 | 6 | 10 | 36 | 100 | 60 |
| 22 | 45 | 2 | 5 | 4 | 25 | 10 |
| 15 | 33 | –5 | –7 | 25 | 49 | 35 |
| 20 | 41 | 0 | 1 | 0 | 1 | 0 |
| 19 | 39 | –1 | –1 | 1 | 1 | 1 |
| ∑x = 180 | ∑y = 360 | ∑dx = 0 | ∑dy = 0 | ∑dx² = 120 | ∑dy² = 346 | ∑dx dy = 193 |
$$\bar{x} = \frac{\sum x}{n} = \frac{180}{9} = 20; \qquad \bar{y} = \frac{\sum y}{n} = \frac{360}{9} = 40$$
Coefficient of regression of y on x:
$$b_{yx} = \frac{\sum dx\,dy}{\sum dx^2} = \frac{193}{120} = 1·6083$$
Coefficient of regression of x on y:
$$b_{xy} = \frac{\sum dx\,dy}{\sum dy^2} = \frac{193}{346} = 0·5578$$
Karl Pearson’s correlation coefficient r between x and y is then given by:
$$r^2 = b_{yx}\cdot b_{xy} = 1·6083 \times 0·5578 = 0·8971 \quad\Rightarrow\quad r = \pm\sqrt{0·8971} = \pm\,0·9471$$
Since the regression coefficients are positive, r is also positive.
∴ r = + 0·9471.
Aliter.
$$r_{xy} = \frac{\sum dx\,dy}{\sqrt{\sum dx^2 \cdot \sum dy^2}} = \frac{193}{\sqrt{120 \times 346}} = \frac{193}{\sqrt{41520}} = \frac{193}{203·7646} = 0·9472$$
Thus, we see that there is a very high degree of positive correlation between the test scores (x) and the sales (’000 Rs.) (y). This justifies the proposal for the termination of service of those with low test scores.
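The arithmetic of this solution can be re-checked with a short script. The sketch below is only a verification aid: the data are those of Example 9·4, while the variable names are assumptions introduced here.

```python
import math

# Data of Example 9.4: test scores (x) and sales in '000 Rs. (y).
scores = [14, 19, 24, 21, 26, 22, 15, 20, 19]
sales = [31, 36, 48, 37, 50, 45, 33, 41, 39]

n = len(scores)
x_bar = sum(scores) / n
y_bar = sum(sales) / n

# Sums of squares and products of deviations from the means.
sum_dx2 = sum((x - x_bar) ** 2 for x in scores)
sum_dy2 = sum((y - y_bar) ** 2 for y in sales)
sum_dxdy = sum((x - x_bar) * (y - y_bar) for x, y in zip(scores, sales))

b_yx = sum_dxdy / sum_dx2                    # regression coefficient of y on x
b_xy = sum_dxdy / sum_dy2                    # regression coefficient of x on y
r = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)  # Karl Pearson's coefficient

print(round(b_yx, 4), round(b_xy, 4), round(r, 4))  # 1.6083 0.5578 0.9472
```

The product b_yx × b_xy computed here agrees with r² up to rounding, as Theorem 9·1 requires; the small difference between 0·9471 and 0·9472 in the solution arises only from rounding the regression coefficients before multiplying.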
Regression Equations
To obtain the test score (x) for given sales (y), we use the equation of the line of regression of x on y. The equation of the line of regression of x on y is:
$$x - \bar{x} = b_{xy}\,(y - \bar{y})$$
⇒ x – 20 = 0·5578 (y – 40) = 0·5578y – 22·312
⇒ x = 0·5578y – 22·312 + 20
⇒ x = 0·5578y – 2·312 …(*)
To estimate the sales volume (y) of a salesman with given test score (x), we use the line of regression of y on x, which is given by:
$$y - \bar{y} = b_{yx}\,(x - \bar{x})$$
⇒ y – 40 = 1·6083 (x – 20) = 1·6083x – 32·1660
⇒ y = 1·6083x – 32·1660 + 40
⇒ y = 1·6083x + 7·8340
Hence to ensure the continuation of service, Hence the estimated sales volume of a the minimum test score (x) corresponding to a salesman with test score of 28 is (in ’000