Module Title: Business Decision Making
Module Code: BA 5001
Session: 2012/13
Teaching Period: Year long
Janet Geary 2012

Module Booklet Contents

Welcome to the module

Details of the staff teaching team
Module Leaders: Janet Geary and Maurice Pratt
Office Location: Janet Geary: Stapleton House SH317; Maurice Pratt: Calcutta House CM147
Email: j.geary@londonmet.ac.uk; m.pratt@londonmet.ac.uk
Telephone: Janet 0207 133 3839; Maurice 0207 320 3270
Office Hours: Maurice: Monday 11am to 12 noon, Thursday 12 to 1pm; Janet:

MODULE SPECIFICATION
1 Module title: Business Decision Making
2 Module code: BA 5001
3 Module Level: Level 5
4 Module Leader: Janet Geary & Maurice Pratt
5 Faculty: Business School
6 Teaching site(s) for course: cross-campus
7 Teaching period: year long (30 weeks)
8 Teaching mode: Day
9 Module Type: Year long
10 Credit rating: 30
11 Prerequisites and corequisites: BA 4002 Managing Information and Accounting

12 Module description
The Business Decision Making module is directed at students who are following a number of different business degree programmes and requires that students have previously completed the Level 4 module Managing Information and Accounting or its equivalent. The year-long module is designed to enable students to understand the role of quantitative and statistical techniques in managerial decision making, and to familiarise students with management accounting concepts and techniques with an emphasis on decision making.
Assessment: data analysis and reporting on findings 30%, in-class tests 30%, and integrated assignment 40%.

13 Module aims
The aims of the module are:
• To provide an introduction to business decision making through the application of selected quantitative/statistical techniques and associated specialist software.
• To develop an understanding of how such analyses fit into the wider business and management context.
• To enable students to assess the reliability and usefulness of any information generated by the analysis and hence justify decisions made.
• To develop students’ understanding of the major uses of accounting information by management in problem-solving, decision-making and planning and control.
• To familiarise students with the decision-making framework for internal users of accounting information in short-term decision-making.
• To examine alternative techniques in decision-making situations, including capital investment appraisal.
• To enable students to design spreadsheet models and interpret the managerial accounting information outputs of such models.
• To prepare students, where appropriate, for Level 6 project work in this area.

14 Module learning outcomes
On successful completion of this module, students will be able to:
• Understand how uncertainty can be built into business decision making.
• Apply linear programming techniques to optimise constrained decision choices.
• Use multiple regression analysis to model business problems.
• Apply a range of approaches to collect and analyse survey data.
• Apply selected techniques to manage projects.
• Use specialist software (e.g. SPSS, MS Project) to support business decision making.
• Account for short-term decisions, allocating costs utilising traditional methods and activity-based costing systems.
• Calculate and interpret accounting information for relevant costs and benefits in short-term decision-making situations.
• Demonstrate knowledge and understanding of the budgeting process and how to construct budgets, including the importance of behavioural implications.
• Calculate and communicate financial and non-financial measures of attractiveness in investment appraisal decisions and understand the role of cost of capital.

15 Syllabus
Survey methods: sampling, collection and analysis, using SPSS; statistical hypothesis tests.
Managing projects: scheduling for efficient resourcing; using MS Project.
Using SPSS to model relationships between variables – multiple regression, hypothesis tests.
Linear programming models in situations of constrained optimisation; using QSB (or equivalent).
The value of accounting information and theories of decision-making.
Accounting for short-term decisions: the nature and classification of costs; relevant costing.
Application of cost-volume-profit (CVP) analysis in decision-making for organisations, and risk and uncertainty in decision-making.
The budgeting process, responsibility accounting and the effects of budgeting on motivation.
Accounting for long-term decisions: objectives of capital budgeting; the notion of the time value of money; compounding and discounting.
Determination of a company’s cost of capital and capital rationing; methods of evaluating investment projects: payback, ARR, NPV and IRR.

16 Assessment Strategy
The assessment will be in three parts. Formal assessment will comprise:
Assessment 1 (30%): Students will be presented with case study material that enables them to model management data in order to facilitate decision making. Students will be required to report on the results of their analysis.
Assessment 2 (30%): In-class tests. Students will be set a number of in-class tests focussing on Management Accounting.
Assessment 3 (40%): Integrated assignment, combining decision-making and project management.

17 Summary description of assessment items
Type | Description of item | Weighting | Week due
CWK | Report on the findings of data analysis | 30% | 13
CWK | In-class tests | 30% | 19, 23, 27
CWK | Report on a case study involving decision-making and project management | 40% | 30

18 Learning and teaching
19 Learning and Teaching strategy for the module, including approach to blended learning, students’ study responsibilities and opportunities for reflective learning/pdp
The teaching will consist of 2½ hour blocks per week, some of which will be in I.T.
labs and some in classrooms. This will enable students to develop skills in the software appropriate to the area of the syllabus being covered.

Bibliography
Indicative bibliography and key online resources
Atrill, P. and McLaney, E. (2010) Management Accounting for Decision Makers, 6th edition, Financial Times Press.
Bryman, A. and Cramer, D. (2011) Quantitative Data Analysis with IBM SPSS 17, 18 & 19 – A Guide for Social Scientists, Routledge.
Dewhurst, F. (2006) Quantitative Methods for Business and Management, 2nd edition, McGraw-Hill.
Drury, C. (2009) Management Accounting for Business, 4th edition, Cengage Learning EMEA.
Horngren, C., Bhimani, A., Datar, S. and Foster, G. (2012) Management and Cost Accounting, 5th edition, Financial Times Press.
Oakshott, L. (2012) Essential Quantitative Methods, 5th edition, Palgrave.
Proctor, R. (2009) Managerial Accounting for Business Decisions, 3rd edition, Financial Times Press.
Render, B., Stair, R. and Hanna, M. (2012) Quantitative Analysis for Management, 11th edition, Pearson.
Rowntree, D. (2004) Statistics Without Tears: An Introduction for Non-mathematicians, Penguin Books.
Swift, L. and Piff, S. (2010) Quantitative Methods for Business, Management and Finance, 3rd edition, Palgrave.
Wisniewski, M.
(2009) Quantitative Methods for Decision Makers, 5th edition, Financial Times/Prentice Hall.

20 Approved to run from: September 2012
21 Module multivalency
22 Subject Standards Board

Weekly Programme (week: lecture topic / seminar, workshop or practical details / assessments)
Week 1: Introduction to the Module / setting up data files in SPSS
Week 2: Normal Distribution / producing graphs and descriptive statistics in SPSS
Week 3: Hypothesis testing: chi-squared / chi-squared using SPSS
Week 4: Hypothesis testing: significance tests for means / t-test for a population mean and two-sample t-test using SPSS
Week 5: Confidence intervals / using SPSS for confidence intervals
Week 6: Correlation and regression / using SPSS for bivariate correlation and regression
Week 7: Student Activity week
Week 8: Multiple regression / using SPSS for multiple regression
Week 9: Interpreting multiple regression output
Week 10: Graphical linear programming / using WinQSB for LP
Week 11: Linear programming: shadow prices and sensitivity analysis / using WinQSB for LP
Week 12: Linear programming: extension to more than 2 variables / using WinQSB for LP
Week 13: Project management / using MS Project / coursework 1 (30%)
Week 14: Project management / using MS Project
Week 15: Project management / using MS Project
Week 16: Intro. to accounting for management decisions
Week 17: Cost behaviour and cost-volume-profit (CVP) analysis
Week 18: Product costing – traditional overhead costing and activity-based costing (ABC) systems
Week 19: Management Accounting Test One / Management Accounting Test 1 (10%)
Week 20: Student Activity week
Week 21: Budgeting I – budgeting process and the preparation of sales and operational budgets
Week 22: Budgeting II – preparation of cash budgets and the master budget
Week 23: Management Accounting Test Two / Management Accounting Test 2 (10%)
Week 24: Relevant costs and benefits for decisions
Week 25: Relevant information for operating decisions: short-term decisions with scarce resources
Week 26: Risk and uncertainty in decision-making
Week 27: Management Accounting Test Three / Management Accounting Test 3 (10%)
Week 28: Long-term decisions II – methods of investment appraisal and capital rationing
Week 29: Revision Week
Week 30: Final coursework (40%)

Essential books/online resources including Weblearn/Blackboard
For the first 15 weeks, the module booklet provides most of the required material. Students should supplement their reading of the module booklet with the recommended texts.

Required/weekly reading/practice/online resources including any Weblearn/Blackboard
Students will need to look at Weblearn on a regular basis (twice a week, say) for:
1. PowerPoint slides
2. data files
3. answers to selected exercises
4.
Additional readings

Additional/weekly reading/practice/online resources including any Weblearn/Blackboard
This module is supported by Weblearn – students are advised to access the site on a regular basis, at least once a week.

Module Assessment Details, including assessment criteria for all elements of the assessment, including any examination
• Well-structured, well-written reports for courseworks
• Appropriate use of software packages
• Accurate interpretations of the output from software packages
• Accurate calculations
Assessment criteria for individual assessments will be provided along with the assessment guidelines.

Assessment completion dates/deadlines
Coursework 1: Friday of week 13
Test 1: in class during week 19
Test 2: in class during week 23
Test 3: in class during week 27
Coursework 2: Friday of week 30
Please ensure that coursework is handed in at the Assessments Unit not later than 5pm on the due date.

Practical session 1
In order to learn how to use SPSS, we will base our exercises around the results of a survey using the following questionnaire. We will not be analysing all the results.

Survey of Seabridge Fitness & Sports Centre
The Centre’s management wants to ensure that its members get full value for money, so is undertaking this survey. Included are a number of potential developments to the Centre, and we would appreciate your help in determining the nature and priority of these developments, as well as your opinion on other aspects of the Centre. Please complete this questionnaire (which is both confidential & anonymous) and post it in the box at the reception desk.

1. Which sport/activity have you taken part in during this visit to the Centre? Tick one box only (main sport/activity)
   Swimming 1, Keep Fit/Aerobics 2, Judo 3, Badminton 4, Basketball 5, Gym Training 6
   Specialist classes: Pilates 7, Alexander Technique 8

2. What is your main reason for taking part in this sport/activity? Tick one box only
   To get fit 1, Social 2, Competition 3, Skill development 4

3. What is your opinion of the range of sports/activities offered at the Centre? Tick one box only
   Very good 1, Good 2, Average 3, Poor 4, Very poor 5

4. We are thinking of introducing a number of changes at the Centre and need to identify priorities. Please show your preferences by ticking the two most important developments
   Offering hot meals in the café area 1
   Building a sauna adjacent to the swimming pool 2
   Introducing longer opening hours (7am – 11pm) 3
   Introducing a family membership scheme 4
   Introducing weekday-only membership 5
   Expanding the range of bookings for team sports 6
   Expanding the range of specialist classes 7
   Other, please state …………………………………………………. 8

5. Are you: Male 1, Female 2

6. How old are you? ________________ years

7. When do you normally attend the Centre? Tick one box only for each of a) & b) below
   a) Time of day: Morning 1, Afternoon 2, Evening 3
   b) Day: Weekday 1, Weekend 2

8. How long have you been a member of the Centre? Please enter the length of your membership to the nearest number of years ______________

9. Please rate your fitness levels when you joined the Centre and now. Tick one box only for each of a) & b) below
   (Very Good 1, Good 2, Fairly Good 3, Average/Moderate 4, Fairly Poor 5, Poor 6, Very Poor 7)
   a) Fitness when I joined the Centre
   b) Fitness now

10. What was the main reason for joining the Centre? Tick one box only
   Location of the Centre 1
   Range of facilities available 2
   Recommended by a friend or relative 3
   Membership rates 4
   Other, please state …………………………………………………. 5

11. Please rate each aspect of service quality. Tick one box only for each of a) & b) below
   (Very Good 1, Good 2, Average 3, Poor 4, Very Poor 5)
   a) Helpfulness of the reception staff
   b) Cleanliness of the changing rooms

12. Please use the space below to add any comments about the Centre.

Thank you for completing this questionnaire.
Please post it in the box at the reception desk.
[Thanks to Richard Charlesworth for letting us use this questionnaire and associated data file]

Questionnaire responses (one row per respondent; 999 = no answer, 888 = more than one box ticked):

Resp Sport Reason Variety Changes1 Changes2 Gender Age Time Day Howlong Fitness1 Fitness2 Join Helpfulness Cleanliness
(columns correspond to Q1, Q2, Q3, Q4(1), Q4(2), Q5, Q6, Q7a, Q7b, Q8, Q9a, Q9b, Q10, Q11a, Q11b)
 1  6 3 2 3 4 2  29 2 1 2 2 2   3 2 3
 2  1 1 2 1 6 1  34 1 1 1 5 5   5 3 2
 3  7 1 4 2 7 2  44 1 1 2 7 4   1 1 1
 4  3 4 2 3 4 2  42 1 1 2 3 3   4 5 4
 5  5 2 4 3 5 1  28 1 2 4 2 3   5 4 4
 6  1 1 2 5 6 2  36 1 1 2 4 3   5 4 5
 7  5 3 1 1 5 1  47 3 1 3 3 2   5 5 5
 8  2 1 3 2 8 2  19 1 1 1 6 5   1 4 3
 9  5 4 2 1 4 2  27 1 2 2 5 4   4 3 3
10  4 1 3 2 8 2 999 2 1 4 4 2   1 2 3
11  6 2 1 1 3 1  37 2 1 3 4 2   5 4 4
12  3 4 1 6 7 1  24 1 1 1 4 2   5 3 5
13  4 4 4 4 7 1  52 2 2 3 7 3   3 4 3
14  2 2 2 3 4 2  35 3 1 2 3 4 888 1 2
15  1 2 4 5 7 1  31 3 1 2 6 5   2 3 4
16  4 3 2 3 5 1  23 3 2 1 1 1   3 1 2
17  5 1 4 3 8 2  32 2 1 2 4 2   3 4 4
18  3 3 3 2 3 2  21 3 1 1 1 2   1 2 2
19  3 2 2 1 6 2  33 2 1 2 3 2   3 3 4
20  6 3 3 2 8 1  23 1 2 2 4 3   5 2 2
21  4 4 3 5 7 2  38 1 1 3 6 4   4 4 4
22  7 1 4 4 6 2  51 3 2 3 5 5   1 3 4
23  4 3 2 5 6 1  30 1 1 3 6 4   2 1 1
24  1 1 3 6 8 2  46 1 2 3 3 2   2 4 3
25  5 3 3 4 5 2  31 1 1 2 3 3   2 1 2
26  1 3 2 1 3 2  42 2 2 2 2 4   4 1 2
27  1 3 3 4 6 2  46 1 2 4 4 2   2 2 2
28  5 4 3 4 7 2  25 3 1 2 2 2   1 3 2
29  7 2 1 6 8 2  61 2 1 1 2 2   3 5 5
30  5 888 5 5 8 1 27 2 2 2 4 3  4 2 3
31  2 1 2 2 2 1  38 3 2 4 5 4   3 1 1
32  5 2 4 3 5 1  41 1 1 4 4 4   2 2 1
33  2 4 3 6 8 2  37 1 2 2 4 2   5 2 2
34  6 4 2 1 7 1  37 1 2 1 2 2   2 3 1
35  1 2 3 1 5 1  57 1 2 3 5 3   3 2 3
36  7 1 3 2 3 2  39 1 1 1 7 5   2 1 2
37  3 4 3 3 6 1  22 3 2 2 1 1   2 3 4
38  6 4 3 3 5 1  19 2 1 3 5 3   4 1 1
39  2 1 5 2 7 1  43 2 1 1 3 3   5 2 3
40  2 2 3 1 6 2  68 2 1 2 7 4   4 2 3
41  2 1 4 6 7 2  43 2 2 3 5 2   4 3 2
42  4 1 4 2 7 1  45 3 2 2 4 2   2 2 2
43  6 3 3 1 2 1  38 1 2 1 3 3   2 2 1
44  5 2 4 1 6 1  40 1 2 4 3 2   5 3 3
45  5 2 3 4 4 1  17 1 2 2 7 5   5 2 3
46  4 2 1 4 5 2  34 2 1 1 2 1   1 3 3
47  6 3 3 1 8 1  30 2 1 3 1 3   1 3 4
48  7 3 2 2 8 1  65 3 2 4 1 1   4 1 1
49  6 3 2 4 8 2  33 3 1 2 3 3   4 2 2
50  1 4 3 1 2 2  24 3 2 1 3 2   3 5 4
51  2 3 2 5 6 2  48 2 1 2 2 1   5 2 3
52  8 3 3 1 3 1 999 1 2 3 5 4   1 2 2
53  7 2 2 6 8 1  27 3 2 1 5 3 999 2 3
54  6 2 4 2 5 1  29 1 1 2 5 4   2 2 1
55  8 2 2 4 8 2  37 3 1 4 3 1   3 4 3
56  7 1 1 1 7 1  32 2 1 4 6 3   2 5 4
57  1 4 2 4 8 2  55 2 2 3 6 5   4 2 1
58  6 1 5 2 3 2  30 2 2 1 3 3   3 3 2
59  7 4 1 1 5 2  39 3 1 2 5 3   2 4 3
60  6 3 3 2 8 2  18 3 1 4 4 3   3 1 2

Tasks required in the first practical session
1. Load the Excel data file seabridge.xls into SPSS.
2. Set up the Variable View page.
3. Save the file as seabridge.sav.
[Thanks to Richard Charlesworth for the notes for this session]

Using SPSS v18/19 to analyse survey data
These notes give an introduction to using SPSS (Statistical Package for the Social Sciences) for survey analysis. They are based around the questionnaire ‘Survey of Seabridge Fitness & Sports Centre’ and the survey data.

1. Entering data into SPSS

Data entry
Data can be entered in three ways:
(i) directly within SPSS (via the data editor);
(ii) copied from a spreadsheet (e.g. Excel) or Word table and pasted into the SPSS data editor;
(iii) imported into SPSS from a previously created spreadsheet file (e.g. Excel).
Methods (i) & (ii) are the easiest and are described below.

Data entry using the SPSS data editor
If data is to be entered directly within SPSS, click on ‘Type in data’ (see Figure 1), then the ‘Data View’ tab at the bottom of the screen. This takes you into the SPSS data editor, which has the same format as a spreadsheet, with each column representing a different variable (a question) and each row the record of a different respondent (or ‘case’). Note, however, that from time to time it is necessary to use more than one column for a single question. For example, if multiple responses are permitted/required, a separate column is needed to record each response (e.g. see Q4 in the ‘Seabridge Fitness & Sports Centre’ questionnaire, which requires two answers).
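The same row-per-respondent, column-per-variable layout can be sketched outside SPSS. The short Python sketch below (not part of the module materials) reads the first three respondents from the data table above, using a few of the variables and the standard library's csv module; the column names are chosen here for illustration only.

```python
import csv
import io

# A minimal sketch of the SPSS data-editor layout: one row per respondent
# ("case"), one column per variable. Q4 permits two responses, so it needs
# two columns (changes1 and changes2), exactly as the notes describe.
raw = """respondent,sport,reason,changes1,changes2,age
1,6,3,3,4,29
2,1,1,1,6,34
3,7,1,2,7,44
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Each row behaves like one respondent's record in the Data View.
first = rows[0]
print(first["sport"], first["age"])  # respondent 1: sport code 6, aged 29
```

In practice the equivalent of method (iii) would be reading the real seabridge.xls file; the inline string here just keeps the sketch self-contained.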
Figure 1 – entering data directly into SPSS

Data entry by copying data from a spreadsheet or Word table
If we choose simply to copy data from a previously prepared spreadsheet such as Excel, or from a Word table, to the SPSS data editor, remember to paste the data into the ‘Data View’ page, and not the ‘Variable View’ page. It is important to paste only the numerical data and not any variable names which might also have been entered on the spreadsheet or table. SPSS will then automatically assign default variable names VAR00001, VAR00002, etc.

2. Defining the variables
Before undertaking any analysis it is important to define each variable fully. This is carried out on the ‘Variable View’ page. In Figure 2 the variable names have already been added – respondent, sport, reason, etc. (note that ‘respondent’ is simply the questionnaire (or respondent/user) number, whereas the remaining variables relate to specific questions). We are in the process of defining the variable ‘sport’: the ‘(Variable) Label’ enables the user to give a brief description, so the variable ‘sport’ represents ‘Sport or activity undertaken’; similarly we can define the meaning of each of the possible numerical responses under ‘Value Labels’, so a response ‘1’ represents ‘Swimming’, ‘2’ represents ‘Keep Fit/Aerobics’, etc.; under ‘Measure’ the user has already confirmed that ‘sport’ is a ‘nominal’ (or categorical) variable.

Figure 2 – defining the variable ‘Sport’

Note that with large questionnaires it is often helpful to insert the question number at the beginning of the ‘(Variable) Label’. So here we have entered ‘Q1 Sport or activity undertaken’, etc. This makes it easier to locate specific questions when conducting analyses.

The ‘Missing’ category enables us to record the code used for any missing values. For example, respondent No. 10 has failed to divulge his/her age, recorded here with a code of 999.
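The two ideas just described (value labels, and codes such as 999 that must be excluded from analysis) can be mimicked in plain Python. This is an illustrative sketch, not SPSS itself; the 888 code for multiple responses is introduced later in these notes, and the ages are those of respondents 1 to 10 in the data table above.

```python
# Value labels, as on the 'Variable View' page: a dict maps each numeric
# code to its meaning (codes taken from Q1 of the questionnaire).
sport_labels = {1: "Swimming", 2: "Keep Fit/Aerobics", 3: "Judo",
                4: "Badminton", 5: "Basketball", 6: "Gym Training",
                7: "Pilates", 8: "Alexander Technique"}

# Declared missing-value codes: 999 = no answer, 888 = multiple boxes ticked.
MISSING = {999, 888}

ages = [29, 34, 44, 42, 28, 36, 47, 19, 27, 999]  # respondent 10 gave no age
valid = [a for a in ages if a not in MISSING]      # exclude missing codes

print(sport_labels[6])                    # Gym Training
print(round(sum(valid) / len(valid), 1))  # mean age over the 9 valid replies
```

Excluding the 999 code matters: averaging all ten values as if 999 were a real age would badly inflate the mean, which is exactly why SPSS asks for missing-value codes to be declared.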
Figure 3 shows ‘999’ being defined as the code for a missing response to Question 6 (‘Age’).

Figure 3 – defining 999 as the code for a missing value for Question 6 (‘age’)

Note that SPSS enables us to define up to three distinct missing value codes. The reason for this is that, in addition to genuinely missing responses, we may choose to treat some ‘legitimate’ responses as though they were missing. For instance, respondent number 14 has ticked more than one box for Question 10 (main reason for joining the Centre). Two or more reasons may have persuaded this respondent to join the Centre, so although this response may technically be a correct & honest reply, the ticking of only one box was requested (‘the main reason..’), and it would be inappropriate for us to choose one of these replies to code at the expense of the other, or to count them both. Instead we can introduce a second missing value code (here 888) to record multiple responses.

When you have finished defining all the variables, save the file. We will be using this file next week.

Topic 1: Normal Distribution

Probability Distributions
Consider the case where we have a number of customers and we have recorded their ages. If we take age groups of 10 years and use probabilities rather than frequencies, we get a histogram [the probability is the frequency of each group divided by the total frequency; the total of all probabilities adds to 1]:

[Histogram: age of customers – probability (0 to 0.4) against 10-year age groups from 20 to 80]

If we take age groups of 5 years we get:

[Histogram: ages of customers – probability (0 to 0.25) against 5-year age groups from 20 to 80]

With age groups of 2 years we get:

[Histogram: ages of customers – probability (0 to 0.12) against 2-year age groups from 20 to 80]

If we continue this process, we can draw a curve rather than a histogram. The curve will join the midpoints of the tops of the columns.

A continuous random variable, such as age measured to the nearest day, hour, minute, etc., has an infinite number of possible values, and it is not possible to list every one with its associated probability. Furthermore, the probability of each individual value will be so small that it must be considered approximately equal to zero. As a result we can only consider the probability that a value lies within a particular interval. In the histograms above we have used classes such as 22 up to 24 to represent an interval. For example: when measuring heights, the probability that an individual is exactly 1.63 metres tall is very small, but the probability that someone has a height between 1.625 and 1.635 metres is much easier, and more sensible, to determine.

The Normal Distribution is one of the most important in statistics, not only because many data distributions conform to this pattern, but also because the Normal Distribution underpins the subject of statistical inference. The Normal Distribution curve is shown below.
The total area under the curve represents the total probability, thus the total area equals 1.

[Figure: the standard Normal distribution curve, plotted against standard deviations from the mean, from -4 to +4]

Properties of the Normal Distribution
The Normal curve has the following properties:
• It is symmetrical about the mean.
• The curve is bell-shaped.
• The total area under the curve equals 1.
• 68.26% of the area under the curve lies within 1 standard deviation of the mean.
• 95.44% of the area under the curve lies within 2 standard deviations of the mean.
• 99.73% of the area under the curve lies within 3 standard deviations of the mean.

The exact shape and position of the Normal curve depend upon the values of the mean and standard deviation. As Normal distributions have an infinite combination of means and standard deviations, there is a problem in compiling tables for the probabilities. This problem is solved by standardising the variables. The formula below takes x values (actual measurements) and converts them into z values (for use in tables):

    z = (x − μ) / σ

where μ is the mean and σ is the standard deviation. z represents the number of standard deviations between x and the mean. If actual measurements, x, are converted into z scores by use of the formula, then a Normal Distribution with mean μ and standard deviation σ is transformed into a Normal Distribution with mean 0 and standard deviation 1.

Example: The examination marks for a large group of students followed a Normal Distribution. The mean mark was 50 and the standard deviation was 10 marks. What percentage of students gained marks in the range:
a) between 40 and 60
b) between 30 and 70
c) above 70?

a) mean = 50, s.d. = 10
x = 40: z = (x − μ)/σ = (40 − 50)/10 = −10/10 = −1, i.e. 1 standard deviation below the mean.
x = 60: z = (x − μ)/σ = (60 − 50)/10 = 10/10 = 1, i.e. 1 standard deviation above the mean.
Thus a value between 40 and 60 is equivalent to a value between 1 standard deviation above and 1 standard deviation below the mean.
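The standardisation formula can be checked with a short Python sketch, using the standard library's statistics.NormalDist. This is an illustration added here for comparison; it is not part of the module's SPSS toolkit.

```python
from statistics import NormalDist

# The exam-marks example: mean 50, standard deviation 10.
mu, sigma = 50, 10

def z_score(x):
    """z = (x - mu) / sigma: standard deviations between x and the mean."""
    return (x - mu) / sigma

print(z_score(40), z_score(60))  # -1.0 1.0

marks = NormalDist(mu=50, sigma=10)
# P(40 < X < 60): the area within one standard deviation of the mean.
p = marks.cdf(60) - marks.cdf(40)
print(round(100 * p, 2))         # 68.27 (the printed tables round to 68.26%)

# P(X > 70): the right-hand tail beyond two standard deviations.
print(round(1 - marks.cdf(70), 4))  # 0.0228
```

The small discrepancy (68.27% against the 68.26% quoted in the properties list) comes only from the four-decimal-place rounding used in printed tables.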
Using the ‘Properties of the Normal Distribution’ above, the area is 68.26%. Thus we would expect 68.26% of students to gain marks in the range 40 to 60.

b) between 30 and 70: mean = 50, s.d. = 10
x = 30: z = (x − μ)/σ = (30 − 50)/10 = −20/10 = −2. Thus x = 30 is 2 standard deviations below the mean.
x = 70: z = (x − μ)/σ = (70 − 50)/10 = 20/10 = 2. Thus x = 70 is 2 standard deviations above the mean.
Using the ‘Properties of the Normal Distribution’, the area between 2 standard deviations above and below the mean is 95.44%. Thus we would expect 95.44% of students to gain marks in the 30 to 70 range.

c) above 70
From part b) we have found that 95.44% of students will gain marks in the range 30 to 70. This leaves 4.56% that are either above 70 or below 30. As the Normal curve is symmetrical, 2.28% will be above 70 and 2.28% will be below 30. Thus we would expect 2.28% of students to gain marks above 70.

In this example we could use the information from the ‘Properties of the Normal Distribution’, as the x values gave z values of 1 and 2. In most examples the z values will not be integers and we must use tables to find the areas and thus determine the probabilities.

Use of tables for the Normal Distribution
There are different ways of expressing tables for the Normal Distribution. These notes refer to the tables that follow on the next page. The tables give the area in the right-hand tail. This area represents the probability that a random variable is greater than z. For example:
The probability that z > 1.35 is 0.0885.

Tables for the Normal Distribution (area in the right-hand tail beyond z)

 z    0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0  0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.1  0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.2  0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
0.3  0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
0.4  0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
0.5  0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
0.6  0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
0.7  0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
0.8  0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
0.9  0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
1.0  0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
1.1  0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
1.2  0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
1.3  0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
1.4  0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
1.5  0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6  0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7  0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8  0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9  0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
2.0  0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
2.1  0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
2.2  0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
2.3  0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
2.4  0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
2.5  0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
2.6  0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
2.7  0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
2.8  0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
2.9  0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
3.0  0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
3.1  0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
3.2  0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
3.3  0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
3.4  0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
3.5  0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002
3.6  0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.7  0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001

When dealing with problems involving the use of Normal tables, it is advisable to draw sketches to identify the required areas.

1: Find the probability that z is greater than 1.
From tables, when we look up z = 1.00 we get 0.1587.
Thus P(z > 1) = 0.1587.

2: Find the probability that z is less than 2.01.
When we look up z = 2.01 we get 0.0222 as the area above z = 2.01.
The required area is 1 − 0.0222 = 0.9778.
Thus P(z < 2.01) = 0.9778.

3: Find the probability that z is between 0.6 and 2.01.
From tables, z = 0.6 gives 0.2743 and z = 2.01 gives 0.0222.
The required area is 0.2743 − 0.0222 = 0.2521, i.e. the area greater than 0.6 less the area greater than 2.01.
Thus P(0.6 < z < 2.01) = 0.2521.

4: Find the probability that z is less than −1.5.
From tables we cannot look up z = −1.5 directly, as only positive values are listed. However we can look up z = +1.5 to find the area.
As the Normal curve is symmetrical, the area below z = −1.5 is the same as the area above z = +1.5. From tables, z = 1.5 gives 0.0668, thus the area below z = −1.5 is 0.0668.
Thus P(z < −1.5) = 0.0668.

5: Find the probability that z lies between −1.5 and 0.6.
From tables, z = 0.6 gives 0.2743 and z = 1.5 gives 0.0668. The required area lies between z = −1.5 and z = 0.6, so it is 1 − 0.2743 − 0.0668 = 0.6589.
Thus P(−1.5 < z < 0.6) = 0.6589.

Worked example: The mean weight of 500 schoolboys is 55 kg and the standard deviation is 4 kg. Assuming that the weights are Normally distributed, find the percentages of boys that are expected to weigh:
a) more than 58 kg
b) between 58 and 60 kg
c) between 50 and 65 kg

a) μ = 55, σ = 4, more than 58 kg:
x = 58: z = (x − μ)/σ = (58 − 55)/4 = 3/4 = 0.75. From tables, z = 0.75 gives 0.2266, so the required area is 0.2266.
Thus we would expect 22.66% of the boys to weigh more than 58 kg.

b) between 58 and 60 kg:
x = 58: z = 0.75, and tables give 0.2266; thus Area 1 (above 0.75) = 0.2266.
x = 60: z = (60 − 55)/4 = 5/4 = 1.25. From tables, z = 1.25 gives 0.1056; thus Area 2 (above 1.25) = 0.1056.
The area between z = 0.75 and z = 1.25 is Area 1 − Area 2 = 0.2266 − 0.1056 = 0.1210.
Thus we would expect 12.1% of the boys to weigh between 58 and 60 kg.

c) between 50 and 65 kg:
x = 50: z = (50 − 55)/4 = −5/4 = −1.25. From tables, z = 1.25 gives 0.1056; thus Area 1 (below −1.25) = 0.1056.
x = 65: z = (65 − 55)/4 = 10/4 = 2.5. From tables, z = 2.5 gives 0.0062; thus Area 2 (above 2.5) = 0.0062.
The required area is the total area − Area 1 − Area 2 = 1 − 0.1056 − 0.0062 = 0.8882.
Thus we would expect 88.82% of the boys to weigh between 50 and 65 kg.

Example: The weekly output of a production line varies according to the Normal distribution with a mean of 1,500 units and a standard deviation of 120 units.
The manager of the factory wants to know what the production output is for 95% of the time.

To answer this question, we wish to find the limits of production output that enclose 95% of the area. If we take this 95% symmetrically, we have two tails of 2.5% each left. Thus the area in each tail = 2.5%, or 0.025. Scanning the tables for an area of 0.025, we find the z value to be 1.96. Thus 95% of the area lies between z = -1.96 and z = +1.96. We can now use this statement to solve equations, with mean = 1500 and s.d. = 120.

Upper limit: z = +1.96
z = (x - mean)/s.d.
1.96 = (x - 1500)/120
1.96 × 120 = x - 1500
235.2 = x - 1500
x = 235.2 + 1500 = 1735.2

Lower limit: z = -1.96
-1.96 = (x - 1500)/120
-1.96 × 120 = x - 1500
-235.2 = x - 1500
x = -235.2 + 1500 = 1264.8

Thus 95% of the time, the production output will be between 1265 and 1735 units a week.

Seminar Questions:
1. A thousand candidates sat an examination, the results of which were Normally distributed with a mean mark of 50 and a standard deviation of 10 marks. How many candidates would be expected to score:
a) less than 75 marks
b) less than 25 marks
c) more than 60 marks
d) a grade C (i.e. 50 to 59 inclusive)

2. A machine in a factory produces components whose lengths are Normally distributed with mean = 102 mm and standard deviation = 1.5 mm.
a) Find the probability that, if a component is selected at random and measured, its length will be
i) less than 100 mm
ii) greater than 104 mm
b) If a component is only acceptable when its length lies in the range 100 mm to 104 mm, find the percentage of acceptable components.

3. As a result of tests on light bulbs, it was found that the life-time of a particular make of bulb was distributed Normally with an average life of 2040 hours and a standard deviation of 60 hours. What percentage of bulbs are expected to last:
a) for more than 2150 hours
b) for more than 1960 hours

4.
As a result of the introduction of Income Tax Self-Assessment, at the end of the financial year taxpayers may either have overpaid or underpaid their tax. Those taxpayers who have overpaid are entitled to a refund, whilst those who have underpaid owe the Tax Office money. Past experience leads the Tax Office to believe that the amounts to be refunded or owed are Normally distributed. This year, for one category of taxpayer, these amounts had a mean refund of £607 and a standard deviation of £320.
a) What proportion of taxpayers is entitled to a tax refund greater than £1,100?
b) What proportion of taxpayers owe money to the Tax Office?
c) What proportion of taxpayers is entitled to a refund of between £50 and £250?
d) For a separate class of taxpayers, the standard deviation is unknown. The mean refund for this class is £550 and 71.9% of these taxpayers have a refund greater than £300. What is the standard deviation for this class of taxpayer?

5. A box of breakfast cereal states that it weighs 500 grams. This is the nominal weight. When the person responsible for quality control in the company that produces the cereal measured a sample of 100 boxes of cereal, the mean weight was 504 gms and the standard deviation was 2 gms. These values were exactly correct according to the company's policy. Why would the company plan to have a mean weight greater than the nominal weight? What is the probability that a customer, buying one packet of cereal, would buy a packet:
a) weighing more than 503 gms
b) weighing below the nominal weight

Answers
1. a) 993.79  b) 6.21  c) 158.7  d) 315.9
2. a) i) 0.0918  ii) 0.0918  b) 81.64%
3. a) 3.36%  b) 90.82%
4. a) 6.18%  b) 2.87%  c) 8.68%  d) £431
5. a) 0.6915  b) 0.02275

Practical session 2 [Using the file seabridge.sav with all variables defined]
1. Producing descriptive statistics in SPSS
2.
Drawing graphs in SPSS

Data summary & analysis
[Thanks to Richard Charlesworth for some of these notes on SPSS]

Figure 4 shows the basic command sequence for an elementary statistical analysis. Click first on 'Analyze', then 'Descriptive Statistics', then 'Frequencies'.

Figure 4 (above) - using 'Analyze' to provide some elementary statistical analysis
Figure 5 (below) - requesting an analysis of the variable 'Sport'

In Figure 5 the researcher has requested an analysis of the variable 'Sport', and samples of output are in Figures 6 & 7. Had we wished to do so, we could have requested the same analysis of several variables, rather than just 'Sport'. Note that the left-hand window records the output requested so far - useful for quickly locating earlier analyses.

Figure 6 (above) - a frequency table of 'Sport'
Figure 7 (below) - a bar chart of 'Sport'

To produce descriptive statistics for age, the output produced is:

Descriptive Statistics
                        N    Minimum   Maximum   Mean    Std. Deviation
Q6 age of respondent    58   17        68        36.19   11.607
Valid N (listwise)      58

4 Incorporating output in a document
The focus of these notes is interacting with SPSS, and hence the figures illustrate the screen display. However, SPSS often produces more output than is needed (so you need to be selective); moreover, this is not always in an appropriate style for inclusion in a document such as a dissertation or report. Tables and graphs can easily be copied into a document using copy and paste commands. SPSS tables are copied to a Word document as a Word table. You may wish to reformat aspects of the table (e.g. column widths may need readjusting, changing lines/borders etc.) to get closer to the preferred style, and there is the added benefit that you can edit out any redundant or irrelevant information. SPSS graphs are copied as a picture, which can be adjusted for size and position to fit in the document.
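For readers working outside SPSS, the same 'Descriptives' figures can be computed directly. The sketch below uses only Python's standard library; the ages are made up for illustration, since the seabridge survey data is not reproduced in these notes.

```python
import statistics

# Hypothetical ages standing in for 'Q6 age of respondent' in seabridge.sav
# (illustrative values only - the real survey file is not reproduced here)
ages = [17, 22, 25, 31, 36, 36, 41, 44, 52, 68]

desc = {
    "N": len(ages),
    "Minimum": min(ages),
    "Maximum": max(ages),
    "Mean": statistics.mean(ages),
    # SPSS reports the sample standard deviation (n - 1 divisor),
    # which is what statistics.stdev computes
    "Std. Deviation": statistics.stdev(ages),
}
print(desc)
```

Note the choice of `statistics.stdev` rather than `statistics.pstdev`: SPSS's Descriptives table uses the n - 1 divisor, matching the "best estimate" formula used later in these notes.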
5 Saving SPSS files
Remember to save both the data and output files before exiting SPSS (you will be reminded of this by SPSS if you fail to do so). In Figure 8 the user has given the name 'Seabridge Fitness & Sports Centre survey' to the output file; the full filename is 'Seabridge Fitness & Sports Centre survey.spv'. Similarly, if the data file is also called 'Seabridge Fitness & Sports Centre survey', the full filename is 'Seabridge Fitness & Sports Centre survey.sav'.

Figure 8 - the output file is saved as 'Seabridge Fitness & Sports Centre survey.spv'

Topic 2: Hypothesis Testing: the Chi-Squared Test

Chi-square test of association
This test is one of the most important tests used in survey analysis. It is used to test whether two types of classification (i.e. answers to two questions) are statistically independent. The test is extremely useful in management research and can be used to explore whether, for instance, there is any relationship between gender and management grade, gender and motivation, stress and salary level, etc.

In a survey, airline passengers were asked to answer the question 'How did you book your flight?' and were also asked for their gender. The chi-squared test allows us to test whether there is any significant difference in the responses of the male and female passengers. The resulting contingency table was:

method of booking:   internet (1)   telephone (2)   travel agent (3)   Total
gender: male (1)     14             16              22                 52
gender: female (2)   15             24               9                 48
Total                29             40              31                 100

Looking at this table, it would seem that women were more likely to use the telephone than men, and that men were more likely to use travel agents. However, we need to perform the chi-squared test to see if these differences are statistically significant. The chi-square test compares the 'observed' results with those that would be 'expected' if the two classifications were independent.
The expected results are calculated by the formula:

Expected = (row total × column total) / Grand total

As an example, the number of males who use the internet was observed to be 14. The expected value for this cell is:

Expected = (row total × column total) / Grand total = (52 × 29)/100 = 15.08

The table of observed values, with expected values in brackets, is shown below:

method of booking:   internet       telephone      travel agent   Total
gender: male         14 [15.08]     16 [20.80]     22 [16.12]     52
gender: female       15 [13.92]     24 [19.20]      9 [14.88]     48
Total                29             40             31             100

Formally, we test whether our null hypothesis of independence is supported by the evidence of our survey. If not, we will reject the null hypothesis in favour of the alternative hypothesis, which claims that method of booking and gender are not independent. Formally:

Null hypothesis         H0: method of booking is independent of gender
Alternative hypothesis  H1: method of booking is not independent of gender

The overall comparative measure is provided by the test statistic:

χ² = Σ (o - e)²/e

This is calculated by:

observed, o   expected, e   o - e    (o - e)²   (o - e)²/e
14            15.08         -1.08    1.1664     0.0773
15            13.92          1.08    1.1664     0.0838
16            20.80         -4.80    23.04      1.1077
24            19.20          4.80    23.04      1.2000
22            16.12          5.88    34.5744    2.1448
 9            14.88         -5.88    34.5744    2.3235
                                     total      6.9371

In this case χ² = 6.9371.

The degrees of freedom, df, can be found from the formula:

df = (number of rows - 1) × (number of columns - 1)

Note: do not count the "totals" in the number of rows and columns. In this case df = (3 - 1) × (2 - 1) = 2.

Testing at a level of significance of 5%, the critical value of the chi-square distribution (with 2 df) is 5.991 (see tables at the end of this section). We compare our calculated test statistic with the tabulated value. The 'critical value' is exceeded by the test statistic, as can be clearly seen in the graph below.
The test statistic falls into the 'rejection (or critical) region' (5.991 and above), and consequently we reject the null hypothesis that the method of booking and gender are independent. We therefore take this test result as evidence that there is an association between the method of booking and gender. As the largest relative difference is in the use of travel agents, we can say that men are more likely to book through travel agents and women less likely. The relevant SPSS printout is below:

Case Processing Summary
           Valid           Missing        Total
           N     Percent   N    Percent   N     Percent
B6 * C3    100   100.0%    0    .0%       100   100.0%

B6 * C3 Crosstabulation (Count with Expected Count in brackets)
          C3 = 1        C3 = 2        Total
B6 = 1    14 [15.1]     15 [13.9]     29 [29.0]
B6 = 2    16 [20.8]     24 [19.2]     40 [40.0]
B6 = 3    22 [16.1]      9 [14.9]     31 [31.0]
Total     52 [52.0]     48 [48.0]     100 [100.0]

Chi-Square Tests
                              Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square            6.937a   2    .031
Likelihood Ratio              7.109    2    .029
Linear-by-Linear Association  3.204    1    .073
N of Valid Cases              100
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 13.92.

The value for 'Pearson Chi-Square' is given as 6.937.

Interpretation of the output: this gives a calculated value of chi-squared of 6.937, which can be compared with the critical value of 5.9915 (from statistics tables, using a 5% level of significance). As the calculated value is greater than the critical value, it lies in the 'reject' region of the curve (i.e. the area to the right of our calculated value is less than 5%). SPSS gives us the area to the right of the calculated value as 0.031. We could have used this directly to see that the calculated value of chi-squared lies in the tail, and thus we do not have to use statistical tables. If the value in the last column for Pearson Chi-Square had been greater than 5% (0.05), we would have concluded that the method of booking and gender were independent at the 5% level of significance.
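The same test can be run outside SPSS. As a sketch (assuming scipy is available), `scipy.stats.chi2_contingency` reproduces both the Pearson chi-square value and the tail area that SPSS labels 'Asymp. Sig. (2-sided)' for the booking-method table:

```python
from scipy.stats import chi2_contingency

# Observed counts from the airline survey
# rows: internet, telephone, travel agent; columns: male, female
observed = [[14, 15],
            [16, 24],
            [22, 9]]

chi2, p, df, expected = chi2_contingency(observed)

print(f"chi-squared = {chi2:.3f}, df = {df}, p-value = {p:.3f}")
# p is below 0.05, so we reject independence of gender and booking method,
# matching the hand calculation (6.9371) and the SPSS output (6.937, .031)
```

The function also returns the table of expected counts, so the hand-calculated values (15.08, 20.80, 16.12, ...) can be checked at the same time.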
In this example, we would conclude that there is evidence of an association between gender and the method used to book the flight.

Some limitations of the chi-square test
- The test identifies only the presence of a relationship, not its focus or direction.
- The statistical properties of the test require that the expected values be ≥ 5. If this is not the case, appropriate rows or columns must be pooled together; however, in some cases we simply do not have enough data (or density of data throughout the table), and combining rows and columns results in collating data into meaningless overarching classes, rendering the test of little or no value.

Note: the often-stated requirement above, that all expected values must be ≥ 5, is a safety-first measure. In practice, we can accommodate up to 20% of the cells having expected values of between 1 and 5 (notice that the SPSS output provides a reminder of this). In the event of too many cells containing small expected frequencies, we would need to combine rows or columns as detailed above.

- The χ² formula needs to be modified slightly for 2 by 2 tables. We use the formula:

χ² = Σ (|o - e| - 0.5)²/e

This involves subtracting 0.5 from the absolute difference between o and e before squaring the result, and then proceeding as before. The problem arises because we are using the continuous chi-square distribution to approximate the discrete cell frequencies; with a small number of cells (that is, 2 by 2 tables) this approximation needs a 'continuity correction'. In fact, if the o and e values are quite large, the original uncorrected result will be close to the adjusted value; however, the problem becomes more marked when the frequencies are small, so it is wise to always use the modified formula with 2 by 2 tables. Notice that SPSS provides the continuity correction when a 2 by 2 table is analysed.
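scipy applies this continuity correction automatically whenever the table has one degree of freedom. As a sketch (assuming scipy is available), using the 2 by 2 voting table analysed in the referendum example that follows:

```python
from scipy.stats import chi2_contingency

# 2x2 table: rows = female/male, columns = would vote for / would vote against
table = [[12, 7],
         [8, 15]]

# correction=True (the default) applies Yates' continuity correction
# whenever the table gives 1 degree of freedom, i.e. for 2x2 tables
corrected, p_corr, df, _ = chi2_contingency(table, correction=True)
uncorrected, p_unc, _, _ = chi2_contingency(table, correction=False)

print(f"df = {df}")
print(f"with Yates correction:    chi2 = {corrected:.4f}, p = {p_corr:.3f}")
print(f"without Yates correction: chi2 = {uncorrected:.4f}, p = {p_unc:.3f}")
```

The corrected statistic (about 2.32, against roughly 2.31 by hand with rounded expected values) is below the 5% critical value of 3.8415 for 1 df, so the correction here leaves the null hypothesis of independence unrejected.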
The test assumes a null hypothesis that 'gender' and 'method of booking' are statistically independent, and expected frequencies for each of the categories are calculated. These are compared with what was actually recorded in the survey, and a test statistic is computed. The value of the relevant test statistic above is 6.937, which would occur by chance with a probability of 0.031 (or 3.1%) when 'gender' and 'method of booking' are statistically independent. In such testing we typically take a probability value of less than 0.05 (or 5%) as evidence that the null hypothesis is incorrect. In this case we therefore have sufficient evidence to reject the null hypothesis and conclude that 'gender' and 'method of booking' are not independent. So males and females have used different methods of booking.

The chi-square test is used here because both variables are nominal. If both were ordinal, the chi-square test could still be used.

Yates Correction for 2x2 Contingency Tables
This is used for 2 by 2 contingency tables when the total number of items in the table is relatively small (say, < 50). With a 2x2 table the degrees of freedom = (r - 1) × (c - 1) = 1.

Example: 42 people were asked about their voting intentions in a forthcoming referendum on the single currency. The results are shown below, with the 'expected values' in brackets, calculated as usual, i.e. expected = (row total × column total)/grand total:

          would vote for   would vote against   total
female    12 (9.05)         7 (9.95)            19
male       8 (10.95)       15 (12.05)           23
total     20               22                   42

The test statistic, with Yates Correction, is given by:

χ² = Σ (|o - e| - 0.5)²/e

where |o - e| denotes the absolute value of a number (regardless of sign), so |3| = |-3| = 3, etc.
H0: rows and columns are independent [voting does not depend on gender]
H1: some form of dependence exists [the genders vote differently]
Level of the test: 5%

o     e       o - e    |o - e| - 0.5   (|o - e| - 0.5)²   (|o - e| - 0.5)²/e
12     9.05    2.95    2.45            6.0025             0.6633
 7     9.95   -2.95    2.45            6.0025             0.6033
 8    10.95   -2.95    2.45            6.0025             0.5482
15    12.05    2.95    2.45            6.0025             0.4981
                                       total              2.3129

Since the test statistic, 2.3129, is below the critical value of χ² with 1 degree of freedom at the 5% level, 3.8415 (from tables), we fail to reject H0 and conclude that, on the basis of the evidence available to us, rows and columns are independent. Thus there is no evidence to support the claim that males and females vote differently.

Seminar Exercise
1. The expected percentages of defects found in a product are claimed to be:

Number of defects:     0     1     2     3     4 or more
Expected percentage:   5%    15%   30%   30%   20%

We take a random sample of 35 items and observe the following numbers of units with the stated numbers of defects:

Number of defects:     0     1     2     3     4 or more
Numbers observed:      3     8     4     11    9

Test to see if this data is consistent with the claimed proportions of defectives. (Remember to ensure that all expected values are greater than or equal to 5.)

2. A random sample of people reporting sick over a five-day period was:

Monday   Tuesday   Wednesday   Thursday   Friday
8        20        14          18         25

Is there any evidence to suggest that the reporting is not spread evenly through the week?

3. A firm uses three similar machines to produce a large number of components. On a particular day, a random sample of 99 of the defective components produced on the early shift was traced back to the machine that produced them. The same was done with a random sample of 65 defectives from the late shift. The table below shows the number of defectives found to be from each machine on each shift:

              machine A   machine B   machine C
Early shift   37          29          33
Late shift    13          16          36

Test the hypothesis that the probability of a defective coming from a particular machine is independent of the shift in which it was produced.

4.
Stapleton Electrics manufactures televisions at four different factories, and quality control is of great interest to the company. The table below shows the reliability of the machines in a particular month:

                   Factory A   Factory B   Factory C   Factory D
Needed repair      4           15          9           12
No repair needed   8           10          6           6

a) Test to see if there is any association between factory and reliability of machines.
b) Subsequently, it was discovered that the figures above had been divided by 10 to make the arithmetic easier. Correct for this and repeat the test. Does it make any difference to the result?

5. Debtovia is a small country in the middle of a recession. As part of a nation-wide survey, a particular town was selected at random and the question "Would you support an incomes policy?" was asked of a number of workers selected at random. Their answers (Yes, No or Don't know) and their employment status (skilled/unskilled and union/non-union) were recorded, and the data is presented below:

                             Yes   No   Don't know
Skilled and in union         7     24   29
Skilled and not in union     7     21   27
Unskilled and in union       9      9   17
Unskilled and not in union   12    11   27

a) Test the hypothesis that there is no association between the responses to the above question and employment status.
b) Form a new 2x2 contingency table from the above data by omitting all the "Don't know" responses and then pooling the remaining responses to obtain one column for "Skilled" and one column for "Unskilled" workers. Test to see if there is any association evident in your new table.
c) Comment on your findings in both cases and compare the two situations.

Answers
1. χ² test = 6.904, critical = 7.815; do not reject H0. The data is consistent with the claimed proportions.
2. χ² test = 9.647, critical = 9.488; reject H0. Reporting sick is not evenly spread: Monday has fewer than expected, Friday has more.
3. χ² test = 8.7346, critical = 5.991; reject H0. The number of defectives varies according to shift.
4. a) χ² test = 3.5773, critical = 7.815; do not reject H0. No relationship between factory and reliability.
   b) New χ² test = 35.77 (i.e. 10 times as big); reject H0.
5. a) χ² test = 8.43, critical = 12.592; do not reject H0. Conclude no association between response and employment status.
   b) χ² test = 6.8727, critical = 3.841; reject H0. Conclude there is an association between response and employment status.

Chi-Squared Tables (rows give degrees of freedom; columns give the area in one tail)

df    0.100     0.050     0.025     0.010     0.005     0.001
1     2.7055    3.8415    5.0239    6.6349    7.8794    10.8276
2     4.6052    5.9915    7.3778    9.2103    10.5966   13.8155
3     6.2514    7.8147    9.3484    11.3449   12.8382   16.2662
4     7.7794    9.4877    11.1433   13.2767   14.8603   18.4668
5     9.2364    11.0705   12.8325   15.0863   16.7496   20.5150
6     10.6446   12.5916   14.4494   16.8119   18.5476   22.4577
7     12.0170   14.0671   16.0128   18.4753   20.2777   24.3219
8     13.3616   15.5073   17.5345   20.0902   21.9550   26.1245
9     14.6837   16.9190   19.0228   21.6660   23.5894   27.8772
10    15.9872   18.3070   20.4832   23.2093   25.1882   29.5883
11    17.2750   19.6751   21.9200   24.7250   26.7568   31.2641
12    18.5493   21.0261   23.3367   26.2170   28.2995   32.9095
13    19.8119   22.3620   24.7356   27.6882   29.8195   34.5282
14    21.0641   23.6848   26.1189   29.1412   31.3193   36.1233
15    22.3071   24.9958   27.4884   30.5779   32.8013   37.6973
16    23.5418   26.2962   28.8454   31.9999   34.2672   39.2524
17    24.7690   27.5871   30.1910   33.4087   35.7185   40.7902
18    25.9894   28.8693   31.5264   34.8053   37.1565   42.3124
19    27.2036   30.1435   32.8523   36.1909   38.5823   43.8202
20    28.4120   31.4104   34.1696   37.5662   39.9968   45.3147
21    29.6151   32.6706   35.4789   38.9322   41.4011   46.7970
22    30.8133   33.9244   36.7807   40.2894   42.7957   48.2679
23    32.0069   35.1725   38.0756   41.6384   44.1813   49.7282
24    33.1962   36.4150   39.3641   42.9798   45.5585   51.1786
25    34.3816   37.6525   40.6465   44.3141   46.9279   52.6197
26    35.5632   38.8851   41.9232   45.6417   48.2899   54.0520
27    36.7412   40.1133   43.1945   46.9629   49.6449   55.4760
28    37.9159   41.3371   44.4608   48.2782   50.9934   56.8923
29    39.0875   42.5570   45.7223   49.5879   52.3356   58.3012
30    40.2560   43.7730   46.9792   50.8922   53.6720   59.7031
31    41.4217   44.9853   48.2319   52.1914   55.0027   61.0983
32    42.5847   46.1943   49.4804   53.4858   56.3281   62.4872
33    43.7452   47.3999   50.7251   54.7755   57.6484   63.8701
34    44.9032   48.6024   51.9660   56.0609   58.9639   65.2472
35    46.0588   49.8018   53.2033   57.3421   60.2748   66.6188
36    47.2122   50.9985   54.4373   58.6192   61.5812   67.9852
37    48.3634   52.1923   55.6680   59.8925   62.8833   69.3465
38    49.5126   53.3835   56.8955   61.1621   64.1814   70.7029
39    50.6598   54.5722   58.1201   62.4281   65.4756   72.0547
40    51.8051   55.7585   59.3417   63.6907   66.7660   73.4020
41    52.9485   56.9424   60.5606   64.9501   68.0527   74.7449
42    54.0902   58.1240   61.7768   66.2062   69.3360   76.0838
43    55.2302   59.3035   62.9904   67.4593   70.6159   77.4186
44    56.3685   60.4809   64.2015   68.7095   71.8926   78.7495
45    57.5053   61.6562   65.4102   69.9568   73.1661   80.0767
46    58.6405   62.8296   66.6165   71.2014   74.4365   81.4003

Practical session 3: Using SPSS for the chi-squared test and the Seabridge file

As an example, suppose we are interested to see if there is any association between gender (Q5) and the main reason for joining the Centre (Q10).

Using Analyze, Descriptive Statistics, Crosstabs:
- Move "Q5 gender" into the rows and "Q10 main reason" into the columns.
- Click on the Statistics button and choose Chi-squared.
- From the Cells button, choose Observed and Expected.
- Now choose Continue and OK.

The resulting output is:

Case Processing Summary
                               Valid         Missing      Total
                               N    Percent  N   Percent  N    Percent
Q5 gender * Q10 main reason    58   96.7%    2   3.3%     60   100.0%
for joining

Q5 gender * Q10 main reason for joining Crosstabulation
(Count with Expected Count in brackets)

Q10 main reason     location    range of     recommended by    membership   other      Total
for joining:        of centre   facilities   friend/relative   rates
Q5 gender: male     2 [4.2]     9 [6.5]      4 [5.6]           3 [5.1]      9 [5.6]    27 [27.0]
Q5 gender: female   7 [4.8]     5 [7.5]      8 [6.4]           8 [5.9]      3 [6.4]    31 [31.0]
Total               9 [9.0]     14 [14.0]    12 [12.0]         11 [11.0]    12 [12.0]  58 [58.0]
Chi-Square Tests
                     Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square   10.300a   4    .036
Likelihood Ratio     10.682    4    .030
N of Valid Cases     58
a. 2 cells (20.0%) have expected count less than 5. The minimum expected count is 4.19.

As the area in the tail of the chi-squared distribution is 0.036 (see the value in the last column) and thus less than 5%, we conclude that there is an association between gender and the reason for joining the centre. 2 cells (20%) have expected values of less than 5; this is acceptable.

Topic 3: Significance Testing for Means

Notation
µ is the mean of the population
𝑥̅ is the mean of the sample
σ is the standard deviation of the population
s is the best estimate of the population standard deviation from a sample

[Reminder: we use the formula σ = √(Σ(x - µ)²/n) for the standard deviation of a population, and the formula s = √(Σ(x - 𝑥̅)²/(n - 1)) for the best estimate of the population standard deviation from a sample.]

The Central Limit Theorem states that if random samples of the same size are repeatedly drawn from a population of any distribution, the means of those samples will be Normally distributed. Additionally, the mean of the sample means will be the same as the population mean:

mean of sample means = µ

If the population size is large relative to the sample size, n, then the standard deviation of the sample means is equal to the population standard deviation divided by the square root of the sample size:

s.d. of sample means = σ/√n

Z-test for a Population Mean
We use the z-test for a population mean when the standard deviation of the population is known.
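The Central Limit Theorem can be checked with a short simulation. The sketch below (the uniform population, sample size and seed are illustrative choices, not from the notes) draws repeated samples from a distinctly non-Normal population and examines the mean and standard deviation of the sample means:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n = 30            # size of each sample
repeats = 20_000  # number of samples drawn

# Each row is one sample from a uniform(0, 1) population; take its mean
sample_means = rng.uniform(0, 1, size=(repeats, n)).mean(axis=1)

pop_mean = 0.5            # mean of uniform(0, 1)
pop_sd = np.sqrt(1 / 12)  # standard deviation of uniform(0, 1)

print("mean of sample means:", sample_means.mean())        # close to 0.5
print("s.d. of sample means:", sample_means.std(ddof=1))   # close to sigma / sqrt(n)
print("predicted s.d.      :", pop_sd / np.sqrt(n))
```

A histogram of `sample_means` would also show the bell shape, even though the underlying population is flat.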
Procedure (two-tailed test):
1) Set up hypotheses:
   H0: µ = some value [null hypothesis]
   H1: µ ≠ some value [alternative hypothesis]
2) Choose significance level: 5% is commonly used.
3) Calculate the test statistic:
   z = (𝑥̅ - µ)/(σ/√n)
4) Find critical value(s) from tables. For a two-tailed test, the area in each tail is half the significance level.
5) Accept or reject the null hypothesis.
6) Summarise findings.

Example: The lengths of metal bars produced by a particular machine are Normally distributed with a mean length of 420 cm and a standard deviation of 12 cm. After a recent service to the machine, a sample of 100 bars was taken and the mean length of this sample was found to be 423 cm. Is there any evidence, at the 5% level, of a change in the mean length of the bars produced by the machine? Assume that the standard deviation remains the same after the service.

In this example:
population mean µ = 420
population standard deviation σ = 12
sample mean 𝑥̅ = 423
sample size n = 100

1) H0: µ = 420 [mean length remains the same]
   H1: µ ≠ 420 [mean length changes; 2-tailed test]
2) Significance level α = 0.05 [5%]
3) z = (𝑥̅ - µ)/(σ/√n) = (423 - 420)/(12/√100) = 3/1.2 = 2.5
4) Critical values: in this example we are using a two-tailed test, as the alternative hypothesis has two parts (less than 420 and greater than 420). We have to split the 5% significance level into two halves of 0.025 each. From the Normal tables (below), an area in a tail of 0.025 gives critical values of z = ±1.96.
5) The diagram (to be completed in the seminar) shows the "accept" and "reject" areas for H0. As the test statistic z = 2.5 lies in the reject region, we reject H0.
6) We conclude that there is evidence to suggest, at the 5% level, that the mean length of the metal bars has changed since the machine was serviced.

N.B. In this example we assumed that the standard deviation had not changed, although we were unsure about whether the mean length had changed.
This is not very realistic. In general, if we know the population standard deviation, we are likely to know the population mean as well, and thus have no need to test it. It is much more usual that we know neither the population mean nor the population standard deviation. This scenario is covered by the t-test, which can also be used when the sample size is large.

Critical values of the Normal Distribution (alpha is the area in one tail)

alpha    z value        alpha   z value
0.005    2.5758         0.05    1.6449
0.010    2.3263         0.10    1.2816
0.015    2.1701         0.15    1.0364
0.020    2.0537         0.20    0.8416
0.025    1.9600         0.25    0.6745
0.030    1.8808         0.30    0.5244
0.035    1.8119         0.35    0.3853
0.040    1.7507         0.40    0.2533
0.045    1.6954         0.45    0.1257
0.050    1.6449         0.50    0.0000

The t-test for a population mean
The t-test is used to test the value of a population mean when the population standard deviation is not known.

Procedure:
1) Calculate the sample mean, 𝑥̅, and the estimated standard deviation, s:
   s = √(Σ(x - 𝑥̅)²/(n - 1))
2) Set up hypotheses: H0: µ = some value
3) Choose significance level: 5% is commonly used.
4) Calculate the test statistic:
   t = (𝑥̅ - µ)/(s/√n)
5) Find critical value(s) from tables, with degrees of freedom = n - 1.
6) Accept or reject the null hypothesis.
7) Summarise findings.

Example: A random sample of 8 women yielded the following cholesterol levels:
3.1 2.8 1.5 1.7 2.4 1.9 3.3 1.6
Test whether the sample could be drawn from a population whose mean cholesterol level is 3.1. Test at the 5% level of significance.

Population mean (to be tested): µ = 3.1
Sample size: n = 8
Sample mean: 𝑥̅ = 2.2875 [from calculator]
Estimated standard deviation: s = 0.7120 [from calculator]

H0: µ = 3.1
H1: µ ≠ 3.1 [2-tailed test]

Significance level = 5%, thus the area in each tail is α = 0.05/2 = 0.025 [2-tailed test]

t = (𝑥̅ - µ)/(s/√n) = (2.2875 - 3.1)/(0.7120/√8) = -0.8125/0.2517 = -3.2276

degrees of freedom = n - 1 = 7
From the t-tables (below), for α = 0.025 the critical values are ±2.3646.

As the value of the test statistic, t = -3.2276, is outside the critical values, we reject H0. There is evidence to suggest that this sample is not drawn from a population with a mean cholesterol level of 3.1.

t-tables (rows give degrees of freedom; columns give the area in one tail)

df     0.1      0.05     0.025     0.01      0.005     0.0025
1      3.0777   6.3138   12.7062   31.8205   63.6567   127.3213
2      1.8856   2.9200   4.3027    6.9646    9.9248    14.0890
3      1.6377   2.3534   3.1824    4.5407    5.8409    7.4533
4      1.5332   2.1318   2.7764    3.7469    4.6041    5.5976
5      1.4759   2.0150   2.5706    3.3649    4.0321    4.7733
6      1.4398   1.9432   2.4469    3.1427    3.7074    4.3168
7      1.4149   1.8946   2.3646    2.9980    3.4995    4.0293
8      1.3968   1.8595   2.3060    2.8965    3.3554    3.8325
9      1.3830   1.8331   2.2622    2.8214    3.2498    3.6897
10     1.3722   1.8125   2.2281    2.7638    3.1693    3.5814
11     1.3634   1.7959   2.2010    2.7181    3.1058    3.4966
12     1.3562   1.7823   2.1788    2.6810    3.0545    3.4284
13     1.3502   1.7709   2.1604    2.6503    3.0123    3.3725
14     1.3450   1.7613   2.1448    2.6245    2.9768    3.3257
15     1.3406   1.7531   2.1314    2.6025    2.9467    3.2860
16     1.3368   1.7459   2.1199    2.5835    2.9208    3.2520
17     1.3334   1.7396   2.1098    2.5669    2.8982    3.2224
18     1.3304   1.7341   2.1009    2.5524    2.8784    3.1966
19     1.3277   1.7291   2.0930    2.5395    2.8609    3.1737
20     1.3253   1.7247   2.0860    2.5280    2.8453    3.1534
21     1.3232   1.7207   2.0796    2.5176    2.8314    3.1352
22     1.3212   1.7171   2.0739    2.5083    2.8188    3.1188
23     1.3195   1.7139   2.0687    2.4999    2.8073    3.1040
24     1.3178   1.7109   2.0639    2.4922    2.7969    3.0905
25     1.3163   1.7081   2.0595    2.4851    2.7874    3.0782
26     1.3150   1.7056   2.0555    2.4786    2.7787    3.0669
27     1.3137   1.7033   2.0518    2.4727    2.7707    3.0565
28     1.3125   1.7011   2.0484    2.4671    2.7633    3.0469
29     1.3114   1.6991   2.0452    2.4620    2.7564    3.0380
30     1.3104   1.6973   2.0423    2.4573    2.7500    3.0298
31     1.3095   1.6955   2.0395    2.4528    2.7440    3.0221
32     1.3086   1.6939   2.0369    2.4487    2.7385    3.0149
33     1.3077   1.6924   2.0345    2.4448    2.7333    3.0082
34     1.3070   1.6909   2.0322    2.4411    2.7284    3.0020
35     1.3062   1.6896   2.0301    2.4377    2.7238    2.9960
36     1.3055   1.6883   2.0281    2.4345    2.7195    2.9905
37     1.3049   1.6871   2.0262    2.4314    2.7154    2.9852
38     1.3042   1.6860   2.0244    2.4286    2.7116    2.9803
39     1.3036   1.6849   2.0227    2.4258    2.7079    2.9756
40     1.3031   1.6839   2.0211    2.4233    2.7045    2.9712
41     1.3025   1.6829   2.0195    2.4208    2.7012    2.9670
42     1.3020   1.6820   2.0181    2.4185    2.6981    2.9630
43     1.3016   1.6811   2.0167    2.4163    2.6951    2.9592
44     1.3011   1.6802   2.0154    2.4141    2.6923    2.9555
45     1.3006   1.6794   2.0141    2.4121    2.6896    2.9521
46     1.3002   1.6787   2.0129    2.4102    2.6870    2.9488
50     1.2987   1.6759   2.0086    2.4033    2.6778    2.9370
55     1.2971   1.6730   2.0040    2.3961    2.6682    2.9247
60     1.2958   1.6706   2.0003    2.3901    2.6603    2.9146
65     1.2947   1.6686   1.9971    2.3851    2.6536    2.9060
70     1.2938   1.6669   1.9944    2.3808    2.6479    2.8987
75     1.2929   1.6654   1.9921    2.3771    2.6430    2.8924
80     1.2922   1.6641   1.9901    2.3739    2.6387    2.8870
85     1.2916   1.6630   1.9883    2.3710    2.6349    2.8822
90     1.2910   1.6620   1.9867    2.3685    2.6316    2.8779
95     1.2905   1.6611   1.9853    2.3662    2.6286    2.8741
100    1.2901   1.6602   1.9840    2.3642    2.6259    2.8707
200    1.2858   1.6525   1.9719    2.3451    2.6006    2.8385
300    1.2844   1.6499   1.9679    2.3388    2.5923    2.8279
400    1.2837   1.6487   1.9659    2.3357    2.5882    2.8227
500    1.2832   1.6479   1.9647    2.3338    2.5857    2.8195
600    1.2830   1.6474   1.9639    2.3326    2.5840    2.8175
700    1.2828   1.6470   1.9634    2.3317    2.5829    2.8160
800    1.2826   1.6468   1.9629    2.3310    2.5820    2.8148
900    1.2825   1.6465   1.9626    2.3305    2.5813    2.8140
1000   1.2824   1.6464   1.9623    2.3301    2.5808    2.8133

Seminar Exercise
1. The commissions earned by an estate agent last year were Normally distributed with a mean of £2,560 and a standard deviation of £310. A random sample of 24 sales from this year was examined and the mean commission was found to be £2,690. Assuming no change in the standard deviation, does this provide significant evidence of a change in the mean commission? Test at the 5% level of significance.

2. A random sample of 40 sacks of animal feed is taken from a population whose mean is unknown but whose variance is 15.7 kg².
The mean weight of the sample is 86.3 kg. Test the hypothesis that the mean weight of the sacks is 86 kg. Test at the 5% significance level.

3. The mean time taken by a standard piece of software to run a particular task is 64 seconds. In order to compare a different piece of software it was given a sample of 15 tasks of the same type, the times being:

61.8 65.0 73.0 67.5 72.0 69.0 68.1 59.8 63.1 61.5 71.5 63.1 69.4 68.9 67.0

Does this data indicate a difference in the mean times between the two pieces of software? Test at the 5% level of significance.

4. You have been told by "experts" that the average weekly salary for a part-time secretary is £275. You decide to check this assertion by taking a random sample of 30 secretaries and obtain a sample mean of £270 and a standard deviation of £40. Test at the 5% level to determine whether the experts are correct.

5. During 2011 the number of beds required at a hospital was Normally distributed with a mean of 1,800 a day and a standard deviation of 190 per day. During the first 50 days of 2012, the average daily requirement for beds was 1,830. This data is considered to be a valid sample for 2012. A senior hospital manager has claimed that this gives evidence that the requirement for beds has changed since 2011. Would you agree? Is the sampling method valid?

Answers:
1. Test statistic z = 2.05; reject H0: the mean has changed.
2. Test statistic z = 0.4788; do not reject H0.
3. Test statistic t = 2.54; reject H0: the times are not the same.
4. Test statistic t = −0.6847; do not reject H0: the experts could be right.
5. Test statistic z = 1.116; do not reject H0: the requirement for beds has not changed. The sampling method is questionable, as it only covers winter (January and February) and thus will not represent the demand for beds for the whole year. Illnesses are often seasonal.
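The test statistics in the answers above can be checked without SPSS. The following Python sketch (standard library only; the function names are illustrative, not from any particular package) reproduces the z-statistic of seminar exercise 1 and the t-statistic of seminar exercise 3:

```python
from math import sqrt
from statistics import mean, stdev

def z_statistic(x_bar, mu, sigma, n):
    # z = (x̄ - μ) / (σ/√n): used when the population standard deviation is known
    return (x_bar - mu) / (sigma / sqrt(n))

def t_statistic(sample, mu):
    # t = (x̄ - μ) / (s/√n): the standard deviation is estimated from the sample
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / sqrt(n))

# Exercise 1: commissions with mu = 2560, known sigma = 310, sample mean 2690, n = 24
z = z_statistic(2690, 2560, 310, 24)

# Exercise 3: software run times, mu = 64, sigma unknown (n = 15, so 14 degrees of freedom)
times = [61.8, 65.0, 73.0, 67.5, 72.0, 69.0, 68.1, 59.8,
         63.1, 61.5, 71.5, 63.1, 69.4, 68.9, 67.0]
t = t_statistic(times, 64)

print(round(z, 2))   # about 2.05
print(round(t, 2))   # about 2.55 (the printed answer rounds to 2.54)
```

Comparing each statistic with the appropriate critical value (±1.96 for the z-test; ±2.1448 from the t-tables with 14 degrees of freedom) gives the same reject/do-not-reject decisions as the answers above; small differences from the printed figures are due to rounding.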
One-tailed t-test for a Population Mean

In a one-tailed test the alternative hypothesis is that the population mean is either greater than a particular value or less than a particular value. (With a two-tailed test it was simply "not equal".)

Example: A jeweller was sold some silver wire that he was suspicious about. If the wire was pure silver it would give a reading of 1.5 ohms when tested for electrical resistance. If the wire was not pure silver the resistance would be increased. The jeweller tested 5 pieces of the dubious wire and the following readings for electrical resistance were obtained:

1.51 1.49 1.54 1.52 1.54

Test at the 5% level the hypothesis that the wire is pure silver.

µ = 1.5 [to be tested], x̄ = 1.52 [from calculator], s = 0.0212 [from calculator], n = 5

H0: µ = 1.5 [it is pure silver]
H1: µ > 1.5 [it is not pure silver; resistance is increased]

Significance level = 5% [1-tailed test], thus α = 0.05

t = (x̄ − µ)/(s/√n) = (1.52 − 1.5)/(0.0212/√5) = 0.02/0.0095 = 2.105

α = 0.05, degrees of freedom = n − 1 = 4, critical value = +2.1318
[we need the positive critical value when H1 is "greater than some value"]
[if H1 had been "less than some value", we would have used the negative critical value]

Do not reject H0. There is no significant evidence to suggest that the silver wire was impure. The jeweller should allay his suspicions.

Exercises: One-tailed t-test

1. A firm which manufactures panels tests a sample of 13 panels, loading them until they crack. The results of this test are the following loads (in Newtons):

2.7 7.2 4.6 8.6 3.1 8.9 4.2 7.3 3.3 8.9 6.7 8.9 8.6

An important customer of this firm asserts that the quality of the panels is getting worse and that the mean cracking load is now less than the previous value of 7 Newtons. Does the evidence bear out the customer's claim? Test at the 5% level.

2. A health clinic claims that people following its diet programme will lose, on average, at least 8 kilograms during the programme.
A random sample of 41 people on the programme showed a mean weight loss of 7 kilograms. The sample standard deviation, s, was found to be 3.1 kilograms. Test at the 5% level of significance whether the company is exaggerating, i.e. whether the mean weight loss, in general, is less than 8 kilograms.

3. A manager in a university department claimed that the average number of hours members of staff work each week is 35. A random sample of 12 lecturers was taken and the number of hours each worked in a particular week recorded. The results are given below:

37, 41, 32, 45, 39, 35, 43, 38, 40, 42, 38, 34

The staff union claims that the average number of hours worked is greater than 35. Does the evidence bear out the union's claim? Test at the 5% level of significance.

Answers:
1. mean = 6.3846, sd = 2.4535; test statistic t = −0.9044, critical = −1.782; do not reject H0: the evidence does not support the claim.
2. Test statistic t = −2.065, critical = −1.684; reject H0: the mean weight loss is less than 8 kg; the company is exaggerating.
3. mean = 38.66, sd = 3.821; test statistic t = +3.3237, critical = +1.796; reject H0: the mean hours worked is greater than 35; the evidence does bear out the union's claim.

Topic 4 Two-sample tests and tests for proportions

Two-Sample t-Test for Population Means

Suppose we have two independent random samples:

Sample A: x1, x2, x3, ..., xnx from a Normal distribution with mean µx and standard deviation σ; it has sample mean x̄, standard deviation sx and sample size nx.
Sample B: y1, y2, y3, ..., yny from a Normal distribution with mean µy and standard deviation σ; it has sample mean ȳ, standard deviation sy and sample size ny.

Notice that we are assuming that the population standard deviations are the same. Since this standard deviation is unknown we have to find an estimate for it.
Notation:
                            sample A   sample B
sample size                 nx         ny
sample mean                 x̄          ȳ
sample standard deviation   sx         sy

The estimate for the common standard deviation is given by s, where

s² = ((nx − 1)sx² + (ny − 1)sy²) / (nx + ny − 2)

The test statistic is

t = ((x̄ − ȳ) − (µx − µy)) / (s√(1/nx + 1/ny))

which fits a t-distribution with nx + ny − 2 degrees of freedom.

We usually use a two-sample t-test to determine if the population means are equal, i.e. µx − µy = 0. In this case the test statistic is

t = (x̄ − ȳ) / (s√(1/nx + 1/ny))

Example: As part of an investigation undertaken by a telephone company, a comparison of the weekly phone bills in two areas was undertaken. The figures are given below:

Area A: 5.7 12.0 10.1 13.7 11.9 11.7 10.4 7.3 5.3 6.8 11.8
Area B: 8.9 3.0 8.2 5.2 2.2 5.7 3.2 9.6 3.1 3.9

Test at the 5% level to see if there is any difference between the mean phone bills in the two areas.

Calculations:
               Area A                      Area B
sample mean    x̄ = 9.7                     ȳ = 5.3
sample s.d.    sx = 2.9107, sx² = 8.4720   sy = 2.7109, sy² = 7.3489
sample size    nx = 11                     ny = 10

s² = ((nx − 1)sx² + (ny − 1)sy²)/(nx + ny − 2) = (10 × 8.4720 + 9 × 7.3489)/(11 + 10 − 2) = (84.720 + 66.1401)/19 = 7.940

common variance s² = 7.940, common standard deviation s = 2.8178

H0: µx = µy, i.e. µx − µy = 0 [population means are the same]
H1: µx ≠ µy [population means are not the same]

Choose α = 5%.

Test statistic: t = (x̄ − ȳ)/(s√(1/nx + 1/ny)) = (9.7 − 5.3)/(2.8178√(1/11 + 1/10)) = 4.4/(2.8178√0.1909) = 3.574

For α = 5% with a 2-tailed test, each tail has 0.025; degrees of freedom = nx + ny − 2 = 19; critical values = ±2.0930.

As the test statistic t = 3.574 is outside the "accept region" (±2.0930), reject H0: the mean amount spent on telephone bills is not the same in the two areas.

Seminar Exercises: Two-sample t-test

1.
Independent random samples of current account balances at two branches of a particular bank yielded the following results:

Branch      number of accounts sampled   sample mean balance   sample standard deviation
Holloway    12                           £1,000                £150
Islington   10                           £920                  £120

Test, at the 5% level of significance, whether there is any difference between the mean balances at the two branches of the bank.

2. A firm is studying the delivery times of two raw material suppliers. The firm is basically satisfied with supplier A and is prepared to stay with that supplier if the mean delivery time is roughly the same as that of supplier B. Independent samples gave the following results:

             sample size   sample mean delivery time   sample standard deviation
supplier A   50            14 days                     3 days
supplier B   30            12.5 days                   2 days

Test, at the 5% level of significance, whether there is any difference in the mean times for delivery.

3. In a wage discrimination case involving male and female employees, independent samples of male and female employees with five years' experience or more provided the hourly wage results shown in the table.

                   sample size   sample mean (£ per hour)   sample standard deviation
male employees     44            £9.25                      £1.00
female employees   32            £8.70                      £0.80

Does wage discrimination appear to be present in this case? Test at the 5% level of significance. Test at the 1% level of significance. Do your conclusions depend upon the level of significance?

4. A University careers officer decided to collect data on the starting salaries of graduates in different subjects ten years ago. She then wanted to compare the averages from ten years ago with those of last year's graduates. Among the areas investigated were accounting graduates and general business graduates.
The salaries (in £1,000 per annum) found from the two samples are given below:

Accounting: 14.4 12.6 13.1 14.0 13.5 13.1 14.1 12.4 12.6 14.9 14.6 14.6
Business:   13.2 11.8 12.5 11.5 14.9 12.3 14.5 13.7 12.2 13.4 13.1

Use a 5% level of significance to test the hypothesis that there is no difference between the mean annual starting salary of Accounting graduates and the mean starting salary of Business graduates ten years ago. What is your conclusion?

Answers:
1. Common s = 137.31, t = 1.36, critical = ±2.0860; do not reject H0: no significant difference in the means.
2. Common s = 2.6723, t = 2.4306, critical = ±2.0 (approx); reject H0: there is a difference in the means.
3. Common s = 0.9215, t = 2.569. a) critical = ±1.9921 (approx); reject H0: there is a difference in the means. b) critical = ±2.6430 (approx); do not reject H0: there is no difference in the means. The difference in wages is significant at the 5% level but not at the 1% level.
4. Accounting: mean = 13.6583, sample s.d. = 0.8867. Business: mean = 13.0091, sample s.d. = 1.0784. Common s = 0.9827, t = 1.5826, critical = ±2.0796; do not reject H0: no difference in the mean starting salaries.

Test for a difference in proportions

Consider two random samples of sizes nx and ny with proportions of a 'success' equal to px and py respectively. We wish to determine if there is any significant difference between the two proportions. The first step is to calculate the common proportion P:

P = (nx·px + ny·py) / (nx + ny), then Q = 1 − P

Note: P can be remembered as P = (total number of successes)/(total sampled).

The test statistic is

Z = (px − py) / √(PQ(1/nx + 1/ny))

which fits a Normal distribution with mean 0 and standard deviation 1, i.e. the standardised Normal distribution as found in tables. The null hypothesis, H0, will be that there is no difference between the two proportions.

Example: Two newspapers conducted opinion polls asking voters whether they would vote for Mr Whittington as the next Mayor of London.
In the Evening News poll, 325 voters out of 500 said that they would vote for Mr Whittington. In the Morning Metro poll, 201 voters out of 300 polled said they would vote for Mr Whittington. Is there any difference in the results of the two polls?

Answer:
H0: there is no difference between the proportions saying they will vote for Mr Whittington.
H1: there is a significant difference between the proportions saying they will vote for Mr Whittington.

                      Evening News          Morning Metro         total
number of successes   325                   201                   526
sample proportion     px = 325/500 = 0.65   py = 201/300 = 0.67
sample size           nx = 500              ny = 300              800

P = (nx·px + ny·py)/(nx + ny) = (500 × 0.65 + 300 × 0.67)/(500 + 300) = (325 + 201)/800 = 526/800 = 0.6575

Note: we could have calculated P directly from the totals column.

P = 0.6575, thus Q = 1 − P = 0.3425

The test statistic is

Z = (px − py)/√(PQ(1/nx + 1/ny)) = (0.65 − 0.67)/√(0.6575 × 0.3425 × (1/500 + 1/300))
  = −0.02/√(0.225194 × (0.002 + 0.00333)) = −0.02/√0.001201 = −0.5771

Testing at the 5% level of significance, with a 2-tailed test, the critical values are ±1.96.

Do not reject H0. Conclusion: there is no significant difference between the results of the two opinion polls.

Seminar Exercises: Difference in proportions

1. A medical research unit decided to test two drugs, A and B, for reducing blood pressure. The drugs were given to two sets of volunteers. One group of 90 volunteers was treated with drug A and 60 of these volunteers reported lower blood pressure. The second group of 80 volunteers was treated with drug B; of these, 50 reported lower blood pressure. Test at the 5% level of significance whether there is any difference between the two drugs' ability to lower blood pressure.

2. A survey firm conducted door-to-door interviews on a new consumer product. Some individuals co-operated with the interviewers and completed the questionnaire whilst other individuals did not co-operate. The sample data is shown in the table below.
Test, at the 5% level of significance, the hypothesis that the rate of co-operation is the same for both men and women.

        sample size   number co-operating
men     200           110
women   300           210

3. Two universities were planning to merge. At one of the two universities, NLU, 200 staff were interviewed; of these, 44 said that they thought the new merged university would be "good for the local area". At the second university, GLU, 48 of the 300 staff interviewed thought the new university would be "good for the local area". Test, at the 5% level of significance, whether there is any difference in the two proportions who think that the new university would be "good for the local area". Test at the 10% level of significance.

4. A check was conducted on a form completed in two offices. The first office yielded a random sample of 250 forms, of which 35 contained errors. The second office had a sample size of 300 and 27 forms that contained errors. Test at the 10% level of significance whether there is any difference in the proportion of forms containing errors between the two offices.

Answers:
1. P = 0.647, z = 0.5679: no significant difference in the proportions.
2. P = 0.64, z = −3.423: there is a significant difference.
3. P = 0.184, z = 1.69: no significant difference at the 5% level; a significant difference at the 10% level.
4. P = 0.1127, z = 1.85: there is a significant difference.

Practical session: Using SPSS for a one-sample t-test and a two-sample t-test

One-sample t-test

Example: We wish to test if the mean age of the users of the Seabridge sports centre could come from a population with mean age = 40.

Choose Analyse, Compare Means, One-Sample T-Test. Choose age as the "test variable" and enter 40 as the "test value". Using the Options button, choose 95% as the Confidence Interval Percentage. Now choose Continue and OK.

The SPSS output is:

One-Sample Statistics
                       N    Mean    Std. Deviation   Std. Error Mean
Q6 age of respondent   58   36.19   11.607           1.524

One-Sample Test (Test Value = 40)
                                                                  95% Confidence Interval
                                                                  of the Difference
                       t        df   Sig. (2-tailed)   Mean       Lower    Upper
                                                       Difference
Q6 age of respondent   -2.500   57   .015              -3.810     -6.86    -.76

In this case we would reject the hypothesis that the mean age of the population is 40 in favour of the alternative hypothesis that it is not 40 (the value in the tail is 0.015, which is less than 0.05). This is not surprising, as the sample mean is 36.19.

If we repeat this test with a test value of 36 we get:

One-Sample Test (Test Value = 36)
                                                                  95% Confidence Interval
                                                                  of the Difference
                       t      df   Sig. (2-tailed)   Mean         Lower    Upper
                                                     Difference
Q6 age of respondent   .124   57   .901              .190         -2.86    3.24

In this case we would not reject the null hypothesis that the mean age of the population is 36, as the value in the tail is 0.901 (i.e. above 0.05).

Using SPSS for a two-sample t-test

As an example we can assume that the male and female members were surveyed independently and thus can be considered as two independent samples. We can now test to see if there is a difference in the mean age of the male and female members.

Choose age as the Test Variable and gender as the Grouping Variable. Now define the groups using the codes on the data file (1 and 2). Choose Continue and then OK.

The output takes the form of:

Group Statistics
Q6 age of respondent   Q5 gender   N    Mean    Std. Deviation   Std. Error Mean
                       male        27   34.67   11.622           2.237
                       female      31   37.52   11.619           2.087

Independent Samples Test (Q6 age of respondent)

Levene's Test for Equality of Variances: F = .037, Sig. = .849

t-test for Equality of Means:
                                                                                95% Confidence Interval
                                                                                of the Difference
                              t      df       Sig.         Mean         Std. Error   Lower     Upper
                                              (2-tailed)   Difference   Difference
Equal variances assumed       -.932  56       .356         -2.849       3.059        -8.977    3.278
Equal variances not assumed   -.932  54.907   .356         -2.849       3.059        -8.980    3.281

The values in the tails are 0.356 if we assume that the populations from which the samples were drawn have equal variances, and 0.356 if we do not make this assumption. In both cases the values are greater than 0.05 and thus we do not reject the hypothesis that the populations have equal means. This is effectively saying that a female mean age of 37.52 and a male mean age of 34.67 are not significantly different.

Topic 5 Confidence Intervals

Point and Interval Estimates

The mean rent paid by students in a provincial town is £102 per week with a standard deviation of £10. If we were ignorant of this fact (which we would be unless a census had been taken) we would have to conduct a survey to estimate the population mean. If the results of the survey gave a sample mean of £100 we could use this to estimate the value of the population mean. This value of the sample mean, £100, is known as a point estimate because it consists of just one value. Point estimates can be very misleading as they give a false impression of accuracy. By themselves they do not recognise the fact that they are only estimates and thus are subject to a degree of uncertainty. This question of uncertainty is addressed by using interval estimates such as confidence intervals.

A point estimate is 'The mean weekly rent is £100'. An interval estimate would be 'The mean weekly rent is in the range £96 to £104'.

Confidence Interval for a Population Mean: Population Standard Deviation Known

If the population standard deviation is known then we can use the Normal distribution in our calculation of the confidence interval. The formula for a confidence interval is

x̄ ± zα σ/√n

where zα is found from the Normal tables, n is the size of the sample and σ is the standard deviation of the population from which the sample is drawn.
For a 95% confidence interval: α = (100% − 95%)/2 = 2.5%, i.e. α = 0.025, zα = 1.96
For a 90% confidence interval: α = (100% − 90%)/2 = 5%, i.e. α = 0.05, zα = 1.6449
For an 80% confidence interval: α = (100% − 80%)/2 = 10%, i.e. α = 0.10, zα = 1.2816

Interpretation of a Confidence Interval

A 95% confidence interval gives a range within which there is a 95% probability that the population mean lies. If we took 100 samples and for each calculated the sample mean, standard deviation and hence the 95% confidence interval, 95 of these intervals would contain the population mean. Thus 5% of the time the population mean will NOT lie in the 95% confidence interval.

Example: An accountant knows from past experience that the standard deviation of the value of all invoices is £11.60. However he has forgotten the value of the mean. In order to estimate the value of the mean, he takes a sample of 20 invoices and finds a sample mean of £51.41. Determine a 95% confidence interval for the mean value of all invoices.

95% confidence interval: x̄ ± zα σ/√n = 51.41 ± 1.96 × 11.60/√20 = 51.41 ± 5.084

The 95% confidence interval is [£46.33, £56.49]. Thus there is a 95% confidence that the mean value of all invoices lies in the range £46.33 to £56.49.

Confidence Interval for a Population Mean: Population Standard Deviation Unknown

If the population standard deviation is not known then we use the t-distribution in our calculation of the confidence interval. We use the value s as the best estimate of the population standard deviation. The formula for a confidence interval is

x̄ ± tα s/√n

where tα is found from the t-tables.
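Both interval formulas can be sketched in a few lines of Python (standard library only; the function names are illustrative). The rent figures below are an assumption for illustration: a sample of n = 25 with the £100 sample mean and £10 population standard deviation from the point-estimate discussion earlier, which reproduces roughly the '£96 to £104' interval quoted there.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    # x̄ ± zα·σ/√n: population standard deviation known
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_alpha * sigma / sqrt(n)
    return (x_bar - margin, x_bar + margin)

def t_interval(x_bar, s, n, t_alpha):
    # x̄ ± tα·s/√n: sigma unknown; tα must be read from the t-tables with
    # n - 1 degrees of freedom (the standard library has no inverse t)
    margin = t_alpha * s / sqrt(n)
    return (x_bar - margin, x_bar + margin)

# Hypothetical rent survey: sample mean £100, known sigma £10, assumed n = 25
low, high = z_interval(100, 10, 25)
print(round(low, 2), round(high, 2))   # 96.08 103.92
```

The `t_interval` helper takes tα as an argument precisely because, as the notes say, it must come from the t-tables rather than the Normal tables.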
The degrees of freedom is n − 1.

Example: The heights, in cm, of 6 policemen were:

180 176 179 181 183 179

Calculate a) a 90% confidence interval for the mean height of all policemen; b) a 95% confidence interval for the mean height of all policemen. c) If we had obtained the same sample statistics (mean and standard deviation) from a sample of 60 policemen, what effect would this have on the 95% confidence interval?

Mean x̄ = 179.667, sample standard deviation s = 2.3381 [results from calculator]

a) 90% confidence interval: α = 0.05, degrees of freedom = n − 1 = 5, tα = 2.0150

x̄ ± tα s/√n = 179.667 ± 2.0150 × 2.3381/√6 = 179.667 ± 1.9234 = [177.74, 181.59]

There is a 90% confidence that the mean height of all policemen lies in the range 177.74 cm to 181.59 cm.

b) 95% confidence interval: α = 0.025, degrees of freedom = n − 1 = 5, tα = 2.5706

x̄ ± tα s/√n = 179.667 ± 2.5706 × 2.3381/√6 = 179.667 ± 2.4537 = [177.21, 182.12]

N.B. A higher percentage confidence interval gives a wider interval.

c) 95% confidence interval with sample size = 60: α = 0.025, degrees of freedom = n − 1 = 59, tα = 2.0003 (using df = 60)

x̄ ± tα s/√n = 179.667 ± 2.0003 × 2.3381/√60 = 179.667 ± 0.60368 = [179.06, 180.27]

N.B. Larger sample sizes give narrower confidence intervals. Thus for a "more accurate" estimate of the mean (i.e. a narrower confidence interval), take a larger sample. I hope this was obvious before, but we have now shown it to be the case.

Confidence Interval for a Population Proportion

A proportion can represent any set amount and is mainly used when the data is not numerical. Let p = sample proportion (expressed as a decimal); then q = 1 − p. If np > 5 and nq > 5 then we can calculate a confidence interval for a proportion by:

p ± zα √(pq/n)

Example: A random sample of 1,000 electors was polled and 400 of the electors said that they will vote Labour. How accurate an estimate is this sample proportion with respect to how all electors will vote?
p = 400/1000 = 0.4, thus q = 0.6

For a 95% confidence interval, zα = 1.96.

95% confidence interval: p ± zα √(pq/n)
= 0.4 ± 1.96 × √((0.4 × 0.6)/1000)
= 0.4 ± 1.96 × √0.00024
= 0.4 ± 1.96 × 0.0155
= 0.4 ± 0.03036

95% confidence interval: (0.3696, 0.4304)

There is a 95% probability that the proportion of all electors who will vote Labour lies in the range 36.96% to 43.04%. More sensibly: 37% to 43% will vote Labour, with a confidence of 95%. This is usually reported in the press as "40% will vote Labour, with an error due to sampling of plus or minus 3%".

Sample Size Required for a Given Level of Accuracy

We can rearrange the formula above to help us calculate the sample size required for a given level of accuracy.

Notation: let M be the size of the margin of error. Then from above, M = zα √(pq/n). For a 95% confidence interval, zα = 1.96.

The size of the product pq will vary according to the value of p. But as q = 1 − p, we can say that the quantity pq = p(1 − p). From the table of values of y = p(1 − p) below, we can see that the largest possible value of pq is 0.25. This occurs when both p and q are 0.5.

p          0     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
p(1 − p)   0     0.09   0.16   0.21   0.24   0.25   0.24   0.21   0.16   0.09   0

For a 95% confidence interval we now have the largest error:

M = zα √(pq/n) = 1.96 √(0.25/n)

Rearranging this formula to make n the subject gives:

M/1.96 = √(0.25/n), so M²/1.96² = 0.25/n, and hence n = 0.25 × 1.96²/M² = 0.9604/M²

Thus if a 95% confidence interval required a maximum margin of error of 2%, then M = 0.02 and n = 0.9604/(0.02)² = 2401.

Some results (N.B. the sample size must be an integer):

margin M   0.9604/M²   minimum sample size
1%         9604        9604
2%         2401        2401
3%         1067.1111   1068
4%         600.25      601
5%         384.16      385

Seminar Exercise

1.
An ambulance station is looking at its response times and takes a sample of 16 call-outs, finding that the mean of the sample is 17 minutes whilst the standard deviation of the sample is 5 minutes. What are the 99% confidence limits for the mean response time of all call-outs?

2. In order to evaluate the success of a television advertising campaign for a new product, a company interviewed 400 residents in the television area; 120 of them knew about the product. How accurately does this estimate the percentage of residents in the area who know about the product? Calculate a 95% confidence interval.

3. A random sample of 400 rail passengers is taken and 55% are in favour of proposed new timetables. Calculate a 95% confidence interval for the proportion of all passengers that are in favour of the timetables.

4. A manufacturer needs to estimate the mean life of the batteries they are producing. To do this they take a sample of 100 batteries and find that the mean life of this sample is 50 hours. The standard deviation of all battery lifetimes is known to be 6 hours. Calculate a) a 95% confidence interval for the mean of all battery lifetimes; b) a 99% confidence interval for the mean of all battery lifetimes.

5. The Human Resources department of a company is leading a campaign to reduce absenteeism of staff. Last year 15% of the 59,202 hours which should have been worked over a 46-week year were lost due to absenteeism. The campaign that was set up has now been running for 20 weeks. Absenteeism reports for the last 10 weeks of the campaign were examined. The weekly hours lost were:

week no.     11    12    13    14    15    16    17    18    19    20
hours lost   195   190   162   170   177   190   198   177   184   191

a) Considering these 10 weeks as a random sample of all the weeks to be worked, has there been a significant decrease in absenteeism over last year? You should carry out a test for the mean number of hours lost per week. Test at the 10% level of significance, and explain your choice of test statistic.

b)
Calculate a 90% confidence interval for the mean number of hours lost per week after the campaign, and explain its meaning.

6. Last Christmas, the average value of purchases per customer at a toyshop was £36.00 with a standard deviation of £10.25. The toyshop is anxious to know early on whether this year's average Christmas spend is different, so it takes a random sample of 15 customers. Their spend on toys is (in £):

22 48 36 45 35 11 22 69 86 45 57 43 22 24 17

The shop wishes to be 90% confident that the error due to sampling is no larger than £3.00. Assuming that the population standard deviation remains at £10.25:

a) Is the sample taken sufficiently large to achieve the accuracy objective?
b) Test whether the average spend this Christmas is significantly different from last year's. Test at the 10% level of significance.
c) Establish an 80% confidence interval for the average spend this year.
d) Are your answers to (b) and (c) consistent? Explain your reasoning.

Answers:
1. (13.32, 20.68)
2. (25.5%, 34.5%)
3. (50.1%, 59.9%)
4. a) (48.824, 51.176) b) (48.455, 51.545)
5. a) sample mean = 183.4, sample sd = 11.6065, last year's mean = 193.05; t = −2.629, critical = −1.383; reject H0: there has been a decrease in hours lost. b) (176.67, 190.13)
6. a) The sample size is too small; it should be at least 32. b) z = 1.058; do not reject H0: the mean is not significantly different from last year's. c) (35.41, 42.19) d) Both b) and c) imply that the value 36 would lie in a 90% confidence interval, i.e. they are consistent.

Practical session: Using SPSS to find Confidence Intervals

To find a 95% confidence interval for the length of membership: choose Analyze, Descriptive Statistics and then Explore. Select "length of membership" as the dependent list.
Now choose the Statistics button and use 95% for the Confidence Interval for Mean. Now choose Continue and then OK.

The output is:

Case Processing Summary
                          Valid          Missing        Total
                          N    Percent   N    Percent   N    Percent
Q8 length of membership   60   100.0%    0    .0%       60   100.0%

Descriptives (Q8 length of membership)
                                          Statistic   Std. Error
Mean                                      2.32        .131
95% Confidence Interval   Lower Bound     2.05
for Mean                  Upper Bound     2.58
5% Trimmed Mean                           2.30
Median                                    2.00
Variance                                  1.034
Std. Deviation                            1.017
Minimum                                   1
Maximum                                   4
Range                                     3
Interquartile Range                       1
Skewness                                  .320        .309
Kurtosis                                  -.956       .608

This gives a confidence interval of [2.05, 2.58]. We could report this as: there is a 95% confidence that the mean length of membership is between 2 and 2½ years.

Topic 6: Correlation and regression

1. Pearson product moment coefficient of correlation

If we wish to measure the strength of linear relationship between two variables (x and y), and the data is cardinal, we use the Pearson product moment coefficient of correlation (r), which is defined as follows:

r = (nΣxy − ΣxΣy) / √((nΣx² − (Σx)²) × (nΣy² − (Σy)²))

The value of r lies in the range −1 ≤ r ≤ +1. An r value of −1 signifies a perfect negative linear correlation, +1 a perfect positive linear correlation, and 0 no linear correlation. These scenarios are most easily demonstrated on a scatter graph. Note that the coefficient only measures the strength of the relationship; it is not evidence of cause and effect. In effect, r measures how adequately a scatter of observations can be represented by a straight line.

However it is more meaningful to interpret the strength of the relationship by looking at r² (known as the coefficient of determination, where 0 ≤ r² ≤ +1), which measures the proportion (or percentage) of the variation in y which is explained by the variation in x. To focus on r tends to overstate the strength of the relationship (e.g. r = 0.7 seems to suggest a fairly strong relationship, but it explains less than 50% of the variation).

We can test the significance of the coefficient using the test statistic:

t = r√(N − 2) / √(1 − r²)

which has a t-distribution with N − 2 degrees of freedom.

Example

In a survey of 30 company employees, the correlation between length of service and age is 0.87207.

[Scatter diagram: employee data on age (x-axis, 0 to 60) against length of service (y-axis, 0 to 26), with fitted line y = 0.4573x − 8.2194 and R² = 0.7605.]

The details of the calculation of the value of the correlation coefficient are shown below:

service y   age x   y²      x²       xy
2           24      4       576      48
24          51      576     2601     1224
2           25      4       625      50
9           34      81      1156     306
12          40      144     1600     480
6           32      36      1024     192
1           23      1       529      23
5           28      25      784      140
3           21      9       441      63
4           33      16      1089     132
3           26      9       676      78
1           21      1       441      21
14          57      196     3249     798
5           27      25      729      135
4           23      16      529      92
13          45      169     2025     585
1           22      1       484      22
11          43      121     1849     473
8           35      64      1225     280
1           19      1       361      19
8           28      64      784      224
12          44      144     1936     528
13          40      169     1600     520
3           35      9       1225     105
9           48      81      2304     432
2           25      4       625      50
11          43      121     1849     473
4           24      16      576      96
5           30      25      900      150
6           35      36      1225     210

totals: Σy = 202, Σx = 981, Σy² = 2168, Σx² = 35017, Σxy = 7949

r = (nΣxy − ΣxΣy) / √((nΣx² − (Σx)²) × (nΣy² − (Σy)²))
  = (30 × 7949 − 981 × 202) / √((30 × 35017 − 981²) × (30 × 2168 − 202²))
  = (238470 − 198162) / √((1050510 − 962361) × (65040 − 40804))
  = 40308 / √(88149 × 24236)
  = 40308 / 46221 = 0.87207

As the value of r = 0.87207, the value of r² is 0.7605. Thus 76% of the variation in the length of service can be explained by the variation in age.

The SPSS printout looks like:

Correlations
                                  AGE      SERVICE
AGE       Pearson Correlation     1        .872**
          Sig. (1-tailed)         .        0
          N                       30       30
SERVICE   Pearson Correlation     .872**   1
          Sig. (1-tailed)         0        .
          N                       30       30
** Correlation is significant at the 0.01 level (1-tailed).
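As a check on the arithmetic, the five column totals from the employee table are enough to reproduce r without SPSS. A minimal Python sketch (standard library only):

```python
from math import sqrt

# Summary totals from the table of 30 employees (x = age, y = length of service)
n, sum_x, sum_y = 30, 981, 202
sum_x2, sum_y2, sum_xy = 35017, 2168, 7949

# Pearson product moment coefficient of correlation
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 5))        # 0.87207, matching the hand calculation
print(round(r ** 2, 4))   # 0.7605, the coefficient of determination
```

The same totals reappear in the regression formulas later in this topic, so this sketch doubles as a check on the gradient and intercept calculations there.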
Thus r = 0.872071 and r² = 0.760508; therefore we have a strong positive correlation in which approximately 76% of the variation in length of service is explained by the variation in age. The remaining 24% will be due to other factors (e.g. not all employees will have 'worked their way up' through the company; some will have joined from other companies, etc.).

We can test the significance of this result by computing the test statistic and comparing it with critical values from the t-distribution. Our null hypothesis is that the true population correlation ρ = 0; our alternative hypothesis is that ρ > 0. The test statistic is

t = 0.872071 × √(30 − 2) / √(1 − 0.760508) = 9.4294

Testing at a level of significance of 5%, the critical t-value is 1.701. Since the test statistic exceeds the critical value we can reject the null hypothesis, and conclude that we have evidence of a genuine linear relationship between length of service and age. Moreover, the fact that 9.4294 is much larger than 1.701 shows that we could adopt a much smaller level of significance and still reject the hypothesis of no relationship. Note that the output does not specifically provide the test statistic but does indicate that the correlation is statistically significant at a 1% level of significance.

Note that we are using a one-tailed test of significance with alternative hypothesis ρ > 0; if r had been negative, our alternative hypothesis would be ρ < 0, the test statistic would also be negative, and the critical t-value would be on the negative side of the t-distribution. Thus the null hypothesis would be rejected if the test statistic were less than the critical value (i.e. closer to the tail). The test is therefore a complete mirror image of the example above.

Regression

Whilst correlation measures the extent to which there is a linear relationship between two variables, regression enables us to find the equation that describes that linear relationship.
The equation of a regression line has the form:

Y = a + bX

where Y is the dependent variable (the one we wish to predict / explain) and X is the independent variable. The value 'a' is known as the intercept of the line and 'b' measures the gradient of the line. The relevant formulae are:

gradient:   b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
intercept:  a = (Σy − bΣx) / n

Looking at the calculation above, we can see that the numerator of b is the same as the numerator of r, and that the denominator of b is the same as the left-hand bracket under the square root in r. Thus:

b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = 40308 / 88149 = 0.45727

a = (Σy − bΣx) / n = (202 − 0.45727 × 981) / 30 = −8.2194

Thus the equation of the line linking length of service (y) and age (x) is:

y = −8.2194 + 0.45727x

This equation can then be used to make predictions. The SPSS regression output, with interpretation in italics, looks like:

Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method
1       age(a)              .                   Enter

Variables Entered/Removed: This simply tells us that 'age' was the independent variable and 'service' the dependent variable.

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .872a   .761       .752                2.629

Model Summary: The value of the correlation coefficient r was 0.872 and the value of r² was 0.761.

Coefficients(a)
                  Unstandardized          Standardized                      95.0% Confidence
                  Coefficients            Coefficients                      Interval for B
Model             B         Std. Error    Beta           t        Sig.     Lower Bound   Upper Bound
1   (Constant)    -8.219    1.657                        -4.961   .000     -11.613       -4.826
    age           .457      .048          .872           9.429    .000     .358          .557

Coefficients: The unstandardized coefficients give us the values of a and b in the regression equation. Thus the equation here is y = −8.219 + 0.457x. The final column "Sig." gives values less than 0.01, thus we can say that the coefficients of the regression equation are significantly different from zero at the 1% level (and thus at the 5% level).

Casewise Diagnostics(a)
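The gradient and intercept formulae can be checked numerically. A brief illustrative Python sketch using the summary totals from the worked example:

```python
# Summary statistics from the age / length-of-service example
n = 30
sum_x, sum_y = 981, 202          # ages, lengths of service
sum_xx, sum_xy = 35017, 7949

b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)   # gradient
a = (sum_y - b * sum_x) / n                                    # intercept
print(round(b, 5), round(a, 4))   # 0.45727 -8.2194
```

These agree with the unstandardized coefficients in the SPSS output above.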
Case Number   Std. Residual   service   Predicted Value   Residual
2             3.385           24        15.10             8.899
a. Dependent Variable: service

Casewise diagnostics: During the input dialogue, we asked for any standardised residuals outside the range −3 to +3. The output shows that one reading, case number 2, had a large standardised residual. This indicates that this point does not fit the general trend of the straight line and can be regarded as an outlier (i.e. an unusual reading). Case number 2 is an employee aged 51 with 24 years of service. On the scatter diagram, we can see that this point is a long way from the regression line.

Residuals Statistics(a)
                        Minimum   Maximum   Mean   Std. Deviation   N
Predicted Value         .47       17.85     6.73   4.603            30
Residual                -4.785    8.899     .000   2.583            30
Std. Predicted Value    -1.361    2.414     .000   1.000            30
Std. Residual           -1.820    3.385     .000   .983             30
a. Dependent Variable: service

Residual Statistics: This table can be ignored for simple cases.

Rank Correlation
If we wish to measure the strength of relationship between two variables, and at least one of them is ordinal, we need to use a nonparametric measure of correlation in which the strength of relationship is based on the ranks of the data. There are two main measures of rank correlation, Kendall's τ statistic and Spearman's rank correlation coefficient; we shall focus on the latter. Observed values for x and y are replaced by their ranks, and the difference (d) in ranking between each set of paired observations is calculated. Spearman's coefficient (rs) is found from the following formula:

rs = 1 − 6Σd² / (N(N² − 1))

As with Pearson's coefficient, we can also test whether the result is statistically significant.
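Spearman's formula can be sketched in code. The following illustrative Python applies it to the MBA survey percentages from Example 1 of this topic, averaging tied ranks in the same way as the hand calculation:

```python
def average_ranks(values):
    """Rank values from highest (rank 1) to lowest, averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2          # average of the tied rank positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """rs = 1 - 6*sum(d^2) / (N*(N^2 - 1))"""
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n * n - 1))

# MBA survey percentages (full-time, part-time) from Example 1
ft = [34, 9, 22, 21, 0, 4, 10]
pt = [25, 10, 18, 28, 9, 10, 0]
print(round(spearman_rs(ft, pt), 4))   # 0.6696
```

The two part-time scores of 10% are tied and both receive the averaged rank 4.5, reproducing the hand calculation's rs = 0.6696.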
Providing N ≥ 10, we can use the t statistic as before:

t = rs√(N − 2) / √(1 − rs²)

which has a t-distribution with N − 2 degrees of freedom. For larger samples (say N > 30) we can use a normal approximation:

z = rs√(N − 1)

Example 1 [thanks to Richard Charlesworth for this example]
A survey of MBA students included a question which asked the respondent to identify the most important factor in their choice of UNL for their MBA.

Factor                                                    Full-time (%)   Part-time (%)
MBA course programme/design                               34              25
Discussion with tutor at recruitment fair/open evening    9               10
Recommended by a colleague/friend                         22              18
Living in London                                          21              28
Credit transfer/flexible study                            0               9
Prior study at UNL                                        4               10
Photos of the staff in the brochure                       10              0

Factor                        FT (%)   PT (%)   FT rank   PT rank   d      d²
MBA programme/design          34       25       1         2         -1     1
Discussion at fair/open eve   9        10       5         4.5       0.5    0.25
Recommendation                22       18       2         3         -1     1
Living in London              21       28       3         1         2      4
CATS/flexible study           0        9        7         6         1      1
Prior study at UNL            4        10       6         4.5       1.5    2.25
Photos of the staff           10       0        4         7         -3     9
                                                                    Σd² = 18.50

Therefore rs = 1 − 6Σd² / (N(N² − 1)) = 1 − (6 × 18.5) / (7 × (49 − 1)) = 0.6696

Note that tied ranks are averaged (i.e. the scores of 10% for the Part-time MBA jointly cover the rankings 4 and 5, so both take on the rank 4.5).

SPSS Output: Spearman's rank correlation coefficient

Nonparametric Correlations
Spearman's rho                               FT MBA   PT MBA
FT MBA    Correlation Coefficient            1        0.667
          Sig. (1-tailed)                    .        0.051
          N                                  7        7
PT MBA    Correlation Coefficient            0.667    1
          Sig. (1-tailed)                    0.051    .
          N                                  7        7

How to use SPSS for Correlation and Regression
Example: load up the SPSS file age&service.sav

To produce a correlation matrix:
Choose Analyze, Correlate, Bivariate
Select age and service as the input variables.
Check that Pearson is ticked.
Choose OK

The resulting output looks like:
            service   age
service   Pearson Correlation   1         .872**
          Sig. (2-tailed)                 .000
          N                     30        30
age       Pearson Correlation   .872**    1
          Sig. (2-tailed)       .000
          N                     30        30
**. Correlation is significant at the 0.01 level (2-tailed).

To produce Regression output:
Select Analyze, Regression, Linear
Select age as the independent variable
Select service as the dependent variable
Select Enter as the method
Then select the Statistics button
Select Casewise diagnostics, Outliers outside 3 standard deviations
Tick Confidence Intervals, choose 95% as the Level, then Continue, then OK

The output was shown previously. SPSS has the capacity to produce further analyses for regression; for example, an analysis of the residuals is possible (and extremely useful).

Topic 7: Multiple Regression
In many situations we want to be able to use more than one independent variable to predict the value of the dependent variable. We use multiple regression in these cases. For example, we might suspect that the price of a house depends not only on the number of bedrooms it has, but also on the number of living rooms, the number of bath/shower rooms, the size of the garden and so on. To predict the price of a house we would need a model that took into account all the significant factors.

We denote the independent variables as x1, x2, ..., xk and the associated coefficients as b1, b2, ..., bk. The constant of the regression equation is denoted by a. Thus the regression equation is:

y = a + b1x1 + b2x2 + b3x3 + ... + bkxk

The formulae for calculating the values of a, b1, b2, b3, ..., bk are too complicated to include here. We will focus on analysing the SPSS printouts.
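Although the formulae are too complicated to work through by hand, the coefficients come from solving the normal equations (XᵀX)b = Xᵀy, where X has a leading column of 1s for the constant. The following Python sketch is purely illustrative, using a small made-up data set in which y is constructed exactly as 5 + 2x1 + 3x2, so the fitted coefficients should recover those values:

```python
def solve(A, v):
    """Solve the linear system A b = v by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        b[r] = (M[r][n] - sum(M[r][c] * b[c] for c in range(r + 1, n))) / M[r][r]
    return b

def multiple_regression(rows, y):
    """Fit y = a + b1*x1 + ... + bk*xk via the normal equations (X'X)b = X'y."""
    X = [[1.0] + list(r) for r in rows]   # prepend a column of 1s for the constant
    k = len(X[0])
    XtX = [[sum(X[i][p] * X[i][q] for i in range(len(X))) for q in range(k)]
           for p in range(k)]
    Xty = [sum(X[i][p] * y[i] for i in range(len(X))) for p in range(k)]
    return solve(XtX, Xty)

# Made-up illustration: y constructed exactly as 5 + 2*x1 + 3*x2
rows = [(1, 2), (2, 1), (3, 5), (4, 2), (5, 8), (6, 3)]
y = [5 + 2 * x1 + 3 * x2 for x1, x2 in rows]
a, b1, b2 = multiple_regression(rows, y)
print(round(a, 6), round(b1, 6), round(b2, 6))   # 5.0 2.0 3.0
```

In practice a statistics package such as SPSS does this (and much more, such as the significance tests) for us; the sketch only shows where the coefficients come from.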
t-test on Regression Coefficients
We can use the t-test to decide if the coefficients of the regression line are significant:

H0: βi = 0   [coefficient = 0: no significant contribution to the linear relationship]
H1: βi ≠ 0

The test statistic is given by:

t = bi / (Std Err of bi)

We reject H0 if |t| > t(α/2, n−2) [found from statistical tables]. As before, we can reject H0 if the value of p < 0.025 in the SPSS printout.

Multicollinearity
If a regression equation contains two or more "independent" variables that have a strong linear relationship between them, then the model has "multicollinearity". This means that the independent variables are not really independent, in the sense that they are related to each other. This strong linear relationship between two or more of the independent variables may make the estimates of the coefficients unreliable and may, in fact, make some coefficients negative when they should be positive (or vice versa). We can check for multicollinearity by looking at the correlation matrix.

Multiple Regression Example: London Theatres
Data was obtained from the Society of London Theatres about various statistics recorded for London theatres from 1986 to 2010. The data set was obtained from http://www.solt.co.uk/downloads/pdfs/theatreland/2010-Graph2.pdf. The full data file is shown overleaf.
year   attendances    gross box office       average no of    no of           no of new
       (thousands)    revenue (£thousands)   theatres open    performances    productions
1986   10236          112068                 42               16543           213
1987   10881          129589                 42               16603           212
1988   10897          139338                 43               16970           28
1989   10945          153251                 42               16436           237
1990   11321          177904                 40               15887           187
1991   10905          186790                 39               15508           192
1992   10900          194772                 41               15916           193
1993   11503          215619                 41               15922           198
1994   11163          217763                 41               16063           208
1995   11938          238741                 43               17163           208
1996   11179          229017                 41               16084           186
1997   11466          246082                 39               15568           195
1998   11925          257920                 41               16018           207
1999   11931          266565                 44               17089           265
2000   11555          286556                 43               16633           252
2001   11735          298989                 44               17035           264
2002   12064          327972                 44               17090           221
2003   11585          321485                 42               16664           225
2004   12025          343674                 43               17235           225
2005   12319          383942                 45               17406           221
2006   12351          400853                 43               16912           268
2007   13636          469939                 44               17455           243
2008   13892          483349                 45               18275           241
2009   14257          504984                 45               17923           260
2010   14152          512332                 46               18615           264

We are going to use multiple regression to produce a model that will enable the value of gross box office revenue to be explained by the other variables. There are two basic approaches to multiple regression: top-down and bottom-up.

With top-down regression, we start by using all the independent variables in our model and then successively eliminate those that have values of bi that are not significant (Sig. > 0.025) or are highly correlated with other variables (multicollinearity).

Bottom-up approaches successively add variables to the model until no improvement can be made. The order in which the variables are added is often the order of the values of the correlation coefficients with the dependent variable (the variable with the highest r is used first, then the variable with the second highest r is added, and so on), excluding multicollinearity at each stage.

For both of these approaches we need to know the values of the various correlation coefficients. These can be obtained by using the Correlate command in SPSS.
Note: SPSS highlights any correlations that are significant by the use of * and **. We get:

Correlations (Pearson correlation coefficients, with two-tailed significance beneath each, N = 25):

                  attendance   revenue   theatresopen   performances   newproductions
attendance        1            .954**    .720**         .803**         .489*
                               .000      .000           .000           .013
revenue           .954**       1         .715**         .771**         .559**
                  .000                   .000           .000           .004
theatresopen      .720**       .715**    1              .955**         .397*
                  .000         .000                     .000           .050
performances      .803**       .771**    .955**         1              .367
                  .000         .000      .000                          .071
newproductions    .489*        .559**    .397*          .367           1
                  .013         .004      .050           .071
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

From this correlation matrix we can see that:
All of the other variables are correlated with revenue.
Attendance is highly correlated with the number of theatres open and the number of performances.

As we want to use "revenue" as our dependent variable, we can guess that "attendance" will be in the model and at most one of "theatres open" and "performances". We would not expect both "theatres open" and "performances" to be included, as they are highly correlated and thus not really independent.

"Top-Down" Approach
The "top-down" approach initially involves using all four independent variables in the model. The relevant printout is shown below with some comments in italics.

Variables Entered/Removed(b)
Model   Variables Entered                                          Variables Removed   Method
1       newproductions, performances, attendance, theatresopen     .                   Enter
a. All requested variables entered.
b. Dependent Variable: revenue

Variables Entered/Removed: All four possible independent variables have been added using the "Enter" method. Revenue is the dependent variable.
This indicates that all four independent variables were used in the model.

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .961a   .924       .909                36143.148
a. Predictors: (Constant), newproductions, performances, attendance, theatresopen
b. Dependent Variable: revenue

Model Summary: The value of r = 0.961 and r² = 0.924. However, for multiple regression models it is usual to use the adjusted R Square value of 0.909 for r², as this value takes into account the number of variables being used as well as the strength of the correlation.

ANOVA(b)
Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   3.166E11         4    7.915E10      60.586   .000a
Residual     2.613E10         20   1.306E9
Total        3.427E11         24

ANOVA: The F statistic of 60.586 indicates that the model as a whole is significant (Sig. < 0.025).

Coefficients(a)
                     Unstandardized Coefficients     Standardized Coefficients
Model                B               Std. Error      Beta          t        Sig.
1   (Constant)       -1043627.767    177852.629                    -5.868   .000
    attendance       101.514         13.178          .913          7.703    .000
    theatresopen     12735.681       14486.506       .200          .879     .390
    performances     -28.361         39.178          -.191         -.724    .478
    newproductions   260.547         186.511         .104          1.397    .178
a. Dependent Variable: revenue

Coefficients: The regression equation is given by:

revenue = −1043627.767 + 101.514×attendance + 12735.681×theatres open − 28.361×performances + 260.547×new productions

Looking at the p values (Sig.
column):

The coefficient of the constant is significantly different from zero (p = 0.000 < 0.025).
The coefficient of attendance is significantly different from zero (p = 0.000 < 0.025).
The coefficient of theatres open is NOT significantly different from zero (p = 0.390).
The coefficient of performances is NOT significantly different from zero (p = 0.478).
The coefficient of new productions is NOT significantly different from zero (p = 0.178).

This implies that we should get a better model if we went through a process of deleting variables one by one. The first one to delete would be "performances", as it has the highest value of p and is correlated with other variables.

Residuals Statistics(a)
                        Minimum      Maximum     Mean        Std. Deviation   N
Predicted Value         116690.98    536192.75   283979.76   114851.420       25
Residual                -50616.152   68157.070   .000        32994.029        25
Std. Predicted Value    -1.457       2.196       .000        1.000            25
Std. Residual           -1.400      1.886        .000        .913             25
a. Dependent Variable: revenue

Residual Statistics: The maximum standardised residual is 1.886 and the minimum is −1.400. Thus all points are fairly close to the regression line and there are no outliers.

Conclusion: Try another model that leaves out the variable performances. However, we will not go through the whole procedure ourselves, as we can utilise a special facility in SPSS that uses a "bottom-up" procedure.

"Bottom-Up" Approach (SPSS STEPWISE facility)
From the correlation matrix, we can see that the order of the correlation coefficients with revenue is:
1. attendance (0.954)
2. performances (0.771)
3. theatres open (0.715)
4. new productions (0.559)

Attendance is highly correlated with the number of theatres open and the number of performances. This means that we have a case of multicollinearity here. We are unlikely to use all three of these in the "best" model.

The rest of this printout is the result of the stepwise multiple regression command. NB.
The user did not have to specify the order in which the variables were introduced. The "stepwise" command works by introducing the variables, one at a time, based on the value of the correlation coefficients. "Stepwise" stops when it cannot find a "better" model. Thus the last model produced by stepwise is considered the best model to use for predictions. We can also choose to have some useful plots here.

The step-wise process is summarised in the final output.

Variables Entered/Removed(a)
Model   Variables Entered   Variables Removed   Method
1       attendance          .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
a. Dependent Variable: revenue

The final (and best) model only uses "attendance" to predict the gross box office revenue.

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .954a   .909       .905                36753.551
a. Predictors: (Constant), attendance
b. Dependent Variable: revenue

The value of adjusted r² is high at 0.905.

ANOVA(b)
Model        Sum of Squares   df   Mean Square   F         Sig.
Regression   3.116E11         1    3.116E11      230.702   .000a
Residual     3.107E10         23   1.351E9
Total        3.427E11         24
a. Predictors: (Constant), attendance
b. Dependent Variable: revenue

The regression model as a whole is significant, as F has p = 0.000.

Coefficients(a)
                  Unstandardized Coefficients     Standardized Coefficients
Model             B              Std. Error       Beta         t         Sig.
1   (Constant)    -974724.442    83195.461                     -11.716   .000
    attendance    106.037        6.981            .954         15.189    .000
a. Dependent Variable: revenue

The regression equation is revenue = −974724 + 106.037×attendance. Both the coefficient of the constant and the coefficient of attendance are significantly different from zero.

Excluded Variables(b)
                      Beta In   t       Sig.   Partial       Collinearity Statistics:
Model                                          Correlation   Tolerance
1   theatresopen      .060a     .650    .523   .137          .482
    performances      .013a     .124    .902   .026          .355
    newproductions    .122a     1.770   .091   .353          .761
a. Predictors in the Model: (Constant), attendance
b.
Dependent Variable: revenue

Residuals Statistics(a)
                        Minimum      Maximum     Mean        Std. Deviation   N
Predicted Value         110668.88    537043.06   283979.76   113951.373       25
Residual                -52402.609   67772.398   .000        35979.706        25
Std. Predicted Value    -1.521       2.221       .000        1.000            25
Std. Residual           -1.426      1.844        .000        .979             25
a. Dependent Variable: revenue

The minimum standardised residual is −1.426 and the maximum is 1.844, so there are no outliers. The histogram of the residuals closely approximates a normal distribution; thus the model is a good one. The points on the p-p plot are quite close to the diagonal line and do not really exhibit any particular pattern. A scatter graph of the standardised residuals against the standardised predicted values is evenly scattered above and below the zero line, and there is no particular pattern here.

Topic 7: Interpreting Regression Output
Using the boats data for multiple regression in SPSS

Data was collected on the prices charged for weekly boat hire at Easter and in the Summer from a number of boatyards on the Norfolk Broads. For each boat, the fields shown below were recorded. We want to see how the maximum price charged during Easter for a week's hire is related to the attributes of the boat (length, width, number of fixed berths, maximum number of berths).

The correlation matrix gives (Pearson correlation coefficients, N = 112; all significance values are .000, 2-tailed):

                           length in   width in   maximum no   no of fixed   maximum
                           metres      metres     of berths    berths        Easter price
length in metres           1           .688**     .769**       .828**        .857**
width in metres            .688**      1          .498**       .492**        .537**
maximum number of berths   .769**      .498**     1            .891**        .748**
number of fixed berths     .828**      .492**     .891**       1             .863**
maximum Easter price       .857**      .537**     .748**       .863**        1
**. Correlation is significant at the 0.01 level (2-tailed).
There appears to be some multicollinearity here, as there are strong correlations between some of the "independent" variables. In particular, "number of fixed berths" is strongly correlated with "maximum number of berths". Thus we would not expect both of these variables to be present in a "good" model.

Using the step-wise approach:

Variables Entered/Removed(a)
Model   Variables Entered        Variables Removed   Method
1       number of fixed berths   .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
2       length in metres         .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
a. Dependent Variable: maximum Easter price

The first model just used "number of fixed berths", as this has the highest correlation with maximum Easter price. To this model, the variable "length in metres" was added. No further additions were made, as the other variables correlated strongly with these two.

Model Summary(c)
                                               Std. Error of    Change Statistics
Model   R       R Square   Adjusted R Square   the Estimate     R² Change   F Change   df1   df2   Sig. F Change
1       .863a   .745       .743                67.792           .745        322.149    1     110   .000
2       .900b   .810       .806                58.845           .064        36.988     1     109   .000
a. Predictors: (Constant), number of fixed berths
b. Predictors: (Constant), number of fixed berths, length in metres
c. Dependent Variable: maximum Easter price

The value of adjusted r² was 0.743 for model 1 (number of fixed berths). This rose to 0.806 when "length in metres" was added to model 1 to form model 2.

Coefficients(a)
                              Unstandardized           Standardized             Collinearity
                              Coefficients             Coefficients             Statistics
Model                         B         Std. Error     Beta     t        Sig.   Tolerance   VIF
1   (Constant)                263.311   15.076                  17.466   .000
    number of fixed berths    62.209    3.466          .863     17.949   .000   1.000       1.000
2   (Constant)                12.597    43.251                  .291     .771
    number of fixed berths    35.206    5.363          .489     6.564    .000   .315        3.178
    length in metres          33.623    5.528          .453     6.082    .000   .315        3.178
a. Dependent Variable: maximum Easter price

In model 1, the coefficient of number of fixed berths is significantly different from zero (p = 0.000 < 0.05). In model 2, both coefficients are significantly different from zero (p = 0.000 in both cases).

Casewise Diagnostics(a)
Case Number   Std. Residual   maximum Easter price   Predicted Value   Residual
15            3.397           945                    745.12            199.877
a. Dependent Variable: maximum Easter price

The only real outlier is case number 15, as it is the only point with a standardised residual outside the range −3 to +3.

Residuals Statistics(a)
                        Minimum     Maximum    Mean     Std. Deviation   N
Predicted Value         328.79      806.65     508.26   120.382          112
Residual                -158.095    199.877    .000     58.313           112
Std. Predicted Value    -1.491      2.479      .000     1.000            112
Std. Residual           -2.687      3.397      .000     .991             112
a. Dependent Variable: maximum Easter price

The graph of the residuals does roughly resemble a normal distribution curve; thus the assumption that the residuals are normally distributed seems reasonable. The points on the p-p plot are close to the straight line and there is no real pattern in them, so the assumption of normality seems reasonable. The residuals are fairly evenly spread about the horizontal line through zero, and the spread does not change much as the standardised predictions increase.

Conclusion: This is a good model.

Compare this output with one that includes all possible variables.

Variables Entered/Removed(b)
Model   Variables Entered                                                                     Variables Removed   Method
1       maximum number of berths, width in metres, length in metres, number of fixed berths   .                   Enter
a.
All requested variables entered.
b. Dependent Variable: maximum Easter price

Model Summary(b)
                                               Std. Error of    Change Statistics
Model   R       R Square   Adjusted R Square   the Estimate     R² Change   F Change   df1   df2   Sig. F Change
1       .903a   .816       .809                58.396           .816        118.850    4     107   .000
a. Predictors: (Constant), maximum number of berths, width in metres, length in metres, number of fixed berths
b. Dependent Variable: maximum Easter price

ANOVA(b)
Model        Sum of Squares   df    Mean Square   F         Sig.
Regression   1621146.826      4     405286.707    118.850   .000a
Residual     364878.665       107   3410.081
Total        1986025.491      111
a. Predictors: (Constant), maximum number of berths, width in metres, length in metres, number of fixed berths
b. Dependent Variable: maximum Easter price

The F statistic is significant and thus indicates that there is a linear relationship between maximum Easter price and at least one of the other variables.

Coefficients(a)
                               Unstandardized          Standardized             Collinearity
                               Coefficients            Coefficients             Statistics
Model                          B         Std. Error    Beta     t        Sig.   Tolerance   VIF
1   (Constant)                 34.804    73.494                 .474     .637
    length in metres           36.055    6.704         .485     5.378    .000   .211        4.746
    width in metres            -8.502    27.532        -.018    -.309    .758   .503        1.987
    number of fixed berths     44.915    7.726         .623     5.814    .000   .149        6.696
    maximum number of berths   -10.983   5.921         -.172    -1.855   .066   .201        4.985
a. Dependent Variable: maximum Easter price

(If we were building a model ourselves, we would eliminate "width in metres" and "maximum number of berths", as their coefficients are not significantly different from zero and their signs are not what we would expect. We would expect wider boats with more berths to command a higher price.)

Casewise Diagnostics(a)
Case Number   Std. Residual   maximum Easter price   Predicted Value   Residual
15            3.191           945                    758.64            186.357
a. Dependent Variable: maximum Easter price

Residuals Statistics(a)
                        Minimum     Maximum    Mean     Std. Deviation   N
Predicted Value         336.28      780.69     508.26   120.851          112
Residual                -147.015    186.357    .000     57.334           112
Std. Predicted Value    -1.423      2.254      .000     1.000            112
Std. Residual           -2.518      3.191      .000     .982             112
a. Dependent Variable: maximum Easter price

Features of a Good Linear Regression Model [SPSS printout]

1. Model Summary: A high value of adjusted r². [A low value indicates a poor linear fit.]
2. Table of Correlations: The model chosen should not contain independent variables that are highly correlated with each other. [The model should not exhibit any multicollinearity.]
3. Coefficients table: All coefficients significantly different from zero, i.e. the Sig. column has values below 0.025. [If a coefficient is close to zero, it adds nothing useful to the model.]
4. ANOVA: The F statistic is significantly different from zero, indicating that there is a relationship between the dependent variable and at least one of the independent variables.
5. Casewise diagnostics: Only 5% of readings are outside the range −2 to +2 in the Std. Residual column; any outliers are outside the range −3 to +3. [If there are a number of outliers, the data points could just be 'odd' cases or could indicate the lack of a linear relationship.]
6. Histogram of residuals: The histogram fits the superimposed Normal curve (or is close). [If the histogram does not approximate the Normal curve, it implies the residuals are not normally distributed and thus the model is not a good fit. Consider the possibility of a non-linear relationship.]
7. Normal p-p plot: The points are scattered about the diagonal line, close to the diagonal line, and do not exhibit any pattern. [If the points are not as stated, the model is not a good fit. Consider the possibility of a non-linear relationship.]

Seminar Exercise
Using the SPSS data file boats.sav, produce the best multiple regression model for predicting the maximum summer prices. Write a report on your printouts.
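The adjusted R² values quoted in the model summaries can be reproduced with the standard formula adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of cases and k the number of independent variables. A quick illustrative Python check against the theatres models:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Theatres model with all four independent variables: R^2 = .924, n = 25
print(round(adjusted_r_squared(0.924, 25, 4), 3))   # 0.909
# Stepwise model with attendance only: R^2 = .909, n = 25
print(round(adjusted_r_squared(0.909, 25, 1), 3))   # 0.905
```

Both values agree with the SPSS model summaries, which is why the adjusted figure falls further below R² as more variables are added.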
Topic 8: Graphical Linear Programming

Linear programming methods can be used to solve problems where we wish to maximise (or minimise) a linear function subject to a number of linear constraints. Graphical linear programming can be used when there are only two variables. The stages of graphical linear programming are:

1. Formulation of the problem. This involves translating a description of a problem into a mathematical format. In particular, the linear constraints and objective function have to be generated.
2. Determination of the set of feasible solutions. This involves drawing a graph to determine the region where all the constraints are met.
3. Finding the optimal solution. This involves finding the "best" feasible solution.

Example: The Soft Toy Company
A manufacturer of expensive soft toys makes giant teddy bears and fluffy rabbits. Each teddy bear has a contribution to profit of £30, whilst each rabbit has a contribution of £40. (They are very expensive.) The manufacturer wants to determine which combination of teddy bears and rabbits should be made in order to maximise the contribution.

Each teddy bear requires 2 hours of machining and 2 hours of hand labour, whilst each rabbit requires 1 hour of machining and 3 hours of hand labour. Each teddy bear and each rabbit requires 1 kilogram of stuffing. The manufacturer has certain limitations on the possible production. In particular, there are only 50 hours of machining and 90 hours of hand labour available each week. Stuffing is in short supply, so the manufacturer can only rely on 40 kilogrammes each week.

Determine which combination of teddy bears and rabbits should be produced in order to maximise profit.
Problem Formulation
Summarising the details given above:

                    teddy bear   rabbit   total (per week)
machining (hours)   2            1        50
labour (hours)      2            3        90
stuffing (kg)       1            1        40

Defining the variables:
Let x be the number of teddy bears produced and sold each week.
Let y be the number of rabbits produced and sold each week.

Formulating the objective function:
Contribution is C = 30x + 40y, so our objective function is: max 30x + 40y

Formulating the constraints:
machine hours: 2x + y ≤ 50
hand labour: 2x + 3y ≤ 90
stuffing: x + y ≤ 40
As we cannot have a negative number of toys, we should also include: x ≥ 0, y ≥ 0

Formulation summary:
max 30x + 40y
such that
2x + y ≤ 50
2x + 3y ≤ 90
x + y ≤ 40
x ≥ 0, y ≥ 0

Determining the Feasible Region
To draw a graph of the feasible region we have to consider each constraint in turn.

Machine hours: 2x + y ≤ 50
In order to draw the line 2x + y = 50, we need two points. Choosing x = 0 and y = 0 for our two points we get:
x = 0: 2x + y = 50, so y = 50
y = 0: 2x + y = 50, so 2x = 50 and x = 25

[Graph: the machine line drawn on axes of teddy bears (x) against rabbits (y).]

We now have to decide which side of the line is required, i.e. on which side of the line 2x + y is actually less than 50. In general, if the constraint in x and y includes ≤, the required area will be to the "bottom left" of the line. If the constraint includes ≥, the required area will be to the "top right".

Hand labour: 2x + 3y ≤ 90
x = 0: 2x + 3y = 90, so 3y = 90 and y = 30
y = 0: 2x + 3y = 90, so 2x = 90 and x = 45
Again we will want the area under the line.

[Graph: the machine and labour lines drawn on the same axes.]

Stuffing: x + y ≤ 40
x = 0: x + y = 40, so y = 40
y = 0: x + y = 40, so x = 40
We want the area under this line as well.
The two further constraints x ≥ 0 and y ≥ 0 can be included to give a graph with all three lines (machine, labour and stuffing) drawn on axes of teddy bears (x) against rabbits (y). In this graph the feasible region can be shaded: it is the region where all the constraints are satisfied.

Optimising
In this example we wish to maximise the objective function: 30x + 40y

Method 1: We can use the gradient of the line in order to draw an objective function line. In general, any line with equation ax + by = c has gradient −a/b. The line 30x + 40y = c has gradient −30/40 = −3/4. Thus we can draw any line with such a gradient (3 down and 4 along) as our initial objective function line. This line is then moved parallel to itself to find the optimal solution.

Method 2: Linear programming theory tells us that the optimal solution will always lie at a vertex (corner) of the feasible region. Trying each vertex in turn:

vertex           value of objective function
x = 0, y = 0     30x + 40y = 0
x = 0, y = 30    30x + 40y = 1200
x = 15, y = 20   30x + 40y = 1250
x = 25, y = 0    30x + 40y = 750

The highest value of the objective function is found at x = 15, y = 20.

Production Plan: In order to maximise the value of the weekly contribution, the manufacturer should produce and sell 15 teddy bears and 20 rabbits each week. This will generate a weekly contribution of £1250.

Example: Camping Trip
A youth club is planning a camping trip. Two sizes of tent are available: 4-person and 8-person tents. There are 64 people who want to go on the trip, but there is only room on the site for 13 tents. Only eight 4-person tents are available. If each 4-person tent costs £15 a night and each 8-person tent costs £45 per night, how many of each type of tent minimises the nightly cost?

Formulation: Let x be the number of 4-person tents used and let y be the number of 8-person tents used.
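The vertex method (Method 2) used for the soft toy problem can be automated: enumerate the intersections of the constraint boundary lines, keep the feasible ones, and evaluate the objective at each. An illustrative Python sketch (the module itself uses graphical methods and WinQSB; this merely mirrors the vertex calculation):

```python
from itertools import combinations

# Soft toy problem: maximise 30x + 40y subject to the constraints below.
# Each constraint is (a, b, c), meaning a*x + b*y <= c.
constraints = [
    (2, 1, 50),    # machine hours
    (2, 3, 90),    # hand labour
    (1, 1, 40),    # stuffing
    (-1, 0, 0),    # x >= 0
    (0, -1, 0),    # y >= 0
]

def intersect(c1, c2):
    """Intersection of the two boundary lines a*x + b*y = c, or None if parallel."""
    (a1, b1, d1), (a2, b2, d2) = c1, c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((d1 * b2 - d2 * b1) / det, (a1 * d2 - a2 * d1) / det)

def feasible(p):
    x, y = p
    return all(a * x + b * y <= c + 1e-9 for a, b, c in constraints)

# The optimum of a linear programme lies at a vertex of the feasible region,
# so evaluate the objective at every feasible intersection of constraint lines.
vertices = [p for c1, c2 in combinations(constraints, 2)
            if (p := intersect(c1, c2)) is not None and feasible(p)]
best = max(vertices, key=lambda p: 30 * p[0] + 40 * p[1])
print(best, 30 * best[0] + 40 * best[1])   # (15.0, 20.0) 1250.0
```

This reproduces the hand result: 15 teddy bears and 20 rabbits, for a contribution of £1250. The same sketch solves the camping trip problem if the constraints and objective are swapped (with max replaced by min).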
Minimise cost: Cost = 15x + 45y

Total number of tents:       x + y ≤ 13
Number of 4-person tents:    x ≤ 8
People accommodated:         4x + 8y ≥ 64
and x ≥ 0, y ≥ 0

Drawing:
x + y = 13:    x = 0, y = 13;   y = 0, x = 13
4x + 8y = 64:  x = 0, y = 8;    y = 0, x = 16

[Graph: "camping trip", 8-person tents against 4-person tents with the feasible region marked; also produced as a WinQSB graph]

The objective function is 15x + 45y, which has gradient -15/45 = -1/3. Thus we can draw a line anywhere with this gradient (1 down and 3 along). As we wish to minimise the objective function, we must move the objective function line towards the origin.

The optimal solution is x = 8, y = 4.
The cost will be 15×8 + 45×4 = 300.

Thus the trip organisers should book 8 4-person tents and 4 8-person tents. This will cost £300 per night and will allow 64 people to be accommodated. A total of 12 tents are used.

Seminar Sheet
1. A company manufactures two types of sweatshirts: hooded and round neck. Each hooded sweatshirt makes a contribution to profit of £4 and each round neck sweatshirt makes a contribution of £3. Each hooded sweatshirt requires 1 hour of labour and each round neck requires 2 hours. There are 110 hours of labour available each day. There are limitations in the production capacity so that only 70 hooded sweatshirts can be made in a day. The company wishes to maximise the contribution from these sweatshirts.
a) Formulate as a linear programming problem.
b) Determine the number of hooded and round neck sweatshirts that should be made each day in order to maximise contribution.
c) What is the value of the maximum daily contribution?

2. A health enthusiast would like to organise his food consumption of two diet supplements, Vita and Glow, so that his minimum daily requirement of three basic nutrients A, B and C is satisfied. The minimum daily requirements are 14 units of A, 12 units of B and 18 units of C.
Product Vita has 2 units of A and one unit each of B and C in each packet. Product Glow has one unit each of A and B and 3 units of C in each packet. The price of Vita is 20p and the price of Glow is 40p per packet. The health enthusiast wants to determine the level of consumption of Vita and Glow that will minimise expenditure whilst satisfying the minimum daily requirements.
a) Formulate the description given above as a Linear Programming problem.
b) Advise the user on the best combination of Vita and Glow to use.
c) What is the minimum cost?

3. A furniture manufacturer makes two types of tables: Traditional and Modern. Each traditional table requires 6 hours of cutting time, 5 hours of sanding time and 2 hours for staining. Each traditional table sold gives a contribution to profit of £50. Each modern table requires 2 hours of cutting time, 5 hours of sanding time and 4 hours of staining time. Each modern table sold makes a contribution of £30. Each day the manufacturer has available 36 hours of cutting time, 40 hours of sanding time and 28 hours of staining time. All other inputs are available as required. The company can sell all the tables it makes.
a) Find the number of each type of table that should be made in order to maximise the contribution. Find the maximum contribution possible.

4. A clothes manufacturer offers two versions of a particular t-shirt; one printed and the other plain. The manufacturing requirements are:
Cutting and printing time: the printed t-shirt takes 9 minutes each whilst each plain t-shirt takes only 3 minutes to cut and print. There are 360 minutes available for cutting and printing each day.
Sewing and packing time: each printed t-shirt takes 5 minutes for sewing and packing whilst each plain t-shirt takes only 3 minutes. There are 240 minutes available for sewing and packing of these t-shirts each day.
A contract with a local shop requires a minimum of 12 printed t-shirts to be produced each day.
The manufacturer makes a contribution to profit of £6 from the manufacture and sale of each printed t-shirt and £5 from each plain t-shirt. The manufacturer wishes to maximise this contribution to profit.
Formulate the scenario described above as a linear programming problem. You should clearly indicate the meaning of each constraint and the meaning of any variables.
Using a graphical method, or otherwise, determine which combination of printed and plain t-shirts the manufacturer should produce and sell in order to maximise contribution to profit.

5. A small engineering company makes two types of engine parts, coded as part A and part B. Part A has a contribution of £30 per unit and part B £40. The company wishes to establish the weekly production plan which maximises contribution. Production data are as follows:

                    part A   part B   total available per week
machining (hours)     4        2             100
labour (hours)        4        6             180
materials (kg)        1        1              40

Because of a trade agreement, sales of part A are limited to a weekly maximum of 20 units, and to honour an agreement with an old established customer at least 10 units of part B must be made each week.
Formulate the scenario described above as a linear programming problem. You should clearly indicate the meaning of each constraint and the meaning of any variables.
Using a graphical method, or otherwise, determine which combination of part A and part B the company should produce and sell in order to maximise contribution to profit. What is the expected contribution to profit?

Answers:
1. a) max 4x + 3y such that x + 2y ≤ 110 and x ≤ 70
   b) 70 hooded and 20 round neck
   c) £340
2. a) min 20x + 40y such that 2x + y ≥ 14, x + y ≥ 12 and x + 3y ≥ 18
   b) 9 packets of Vita and 3 packets of Glow
   c) £3
3. max 50x + 30y such that 6x + 2y ≤ 36, 5x + 5y ≤ 40, 2x + 4y ≤ 28
   a) 5 traditional and 3 modern each day.
   b) Maximum contribution: £340
4. max 6x + 5y such that 9x + 3y ≤ 360, 5x + 3y ≤ 240 and x ≥ 12
   Produce 12 printed and 60 plain t-shirts each day.
5. max 30A + 40B such that 4A + 2B ≤ 100, 4A + 6B ≤ 180, A + B ≤ 40, A ≤ 20, B ≥ 10
   Produce 15 units of part A and 20 units of part B each week. The expected contribution is £1250 per week from the sale of these parts.

Topic 9: Linear Programming - Shadow Prices and Sensitivity Analysis

Consider the following problem: A company makes two kinds of armchair; Model A with loose covers and Model B with fitted covers only. The company estimates that it can sell as many of the armchairs as it can make. Management must now determine the production targets for the next few months in order to maximise profit. The company knows that it will make a profit of £50 on each type A model and £40 on each type B model.
As Model A chairs require extra packing and loose covers, the company can only manage to make a maximum of 8 of these chairs a day. In the machining department, where the wood for the chairs is shaped, each Model A armchair requires 1 hour whilst each Model B requires 1.5 hours. There are 15 hours of shaping time available each day. In the upholstery department each Model A requires 3 hours and each Model B requires 2 hours. There is a total of 30 hours available for upholstering each day.
Determine the optimal production plan. Furthermore, answer the following questions:
1. If an extra hour each day could be made available for shaping the wood, what would this be worth?
2. If an extra hour of upholstery time could be made available, by how much would profit be increased?
3. By how much could the profit on type A chairs alter before the original optimal plan changes?
4. If the original solution is to remain optimal, by how much could the profit from type B change?

Model:
Let x be the number of model A armchairs produced each day.
Let y be the number of model B armchairs produced each day.
Maximise profit: max 50x + 40y
subject to:
max A:        x ≤ 8
shaping:      x + 1.5y ≤ 15
upholstery:   3x + 2y ≤ 30
and x ≥ 0, y ≥ 0

[Graph: outline of the problem, showing the three constraint lines with x on the horizontal axis and y on the vertical axis]

The objective function 50x + 40y has gradient -50/40 = -5/4. [Any line ax + by = c has gradient -a/b.] We can thus draw a profit line with this gradient and move it "top right" to maximise.
The optimal solution is found at the point x = 6 and y = 6. The optimal production plan is to make 6 model A armchairs and 6 model B armchairs each day. This will produce a maximum profit of:
Model A: 6 @ £50 = £300
Model B: 6 @ £40 = £240
Total = £540
Thus the maximum profit available is £540 a day.

Slack and Binding Constraints
Since the optimal solution is bounded by two constraints, shaping and upholstery, these constraints are known as binding. Considering each constraint in turn:

Shaping: x + 1.5y ≤ 15
When x = 6 and y = 6, x + 1.5y = 6 + 1.5×6 = 6 + 9 = 15.
All of the available time for shaping has been used. Thus the slack on this constraint is 0.

Upholstery: 3x + 2y ≤ 30
When x = 6 and y = 6, 3x + 2y = 3×6 + 2×6 = 18 + 12 = 30.
All of the time available for upholstering has been used. The slack on this constraint is 0.

Maximum model A: x ≤ 8
When x = 6 and y = 6:
Left hand side (LHS): x = 6
Right hand side (RHS): 8
Slack = RHS - LHS = 2
Not all of the model A armchairs possible have been made. Thus there is a slack on this constraint of 2.
N.B. Binding constraints have zero slack.

Shadow Prices
The shadow price (or dual price) for a particular constraint shows the amount of improvement in the optimal objective value as the right hand side of that constraint is increased by one unit, with all other data held fixed (Eppen, Gould and Schmidt, Introductory Management Science). Thus if we are trying to maximise the objective function, the shadow price gives the increase in the value of the objective function.
If we are seeking to minimise, the shadow price gives the decrease in the objective function value.

Shaping: x + 1.5y ≤ 15
If the time available for shaping is increased by 1 hour the new constraint will be x + 1.5y ≤ 16. The new optimal solution will be where this line meets the upholstery line (see graph).
New (shaping):    x + 1.5y = 16, so (×3) 3x + 4.5y = 48
Old (upholstery): 3x + 2y = 30
Subtracting: 2.5y = 18, so y = 7.2
When y = 7.2: 3x + 2y = 30 gives 3x + 14.4 = 30, so 3x = 15.6 and x = 5.2
Thus the new optimal solution would be 5.2 of model A and 7.2 of model B per day. In practice this would mean producing 26 model A and 36 model B in a five-day week.
New profit = 50x + 40y = 50×5.2 + 40×7.2 = 548
Old profit = 540
Extra profit = 8
One extra hour of shaping is worth £8. The shadow price for shaping is £8.

Upholstery: 3x + 2y ≤ 30
If upholstery time is increased by 1 hour, the constraint becomes 3x + 2y ≤ 31 and the new optimal solution will be where this constraint line meets the line for shaping.
New (upholstery): 3x + 2y = 31
Old (shaping):    x + 1.5y = 15, so (×3) 3x + 4.5y = 45
Subtracting: -2.5y = -14, so y = 5.6
When y = 5.6: 3x + 2y = 31 gives 3x + 11.2 = 31, so 3x = 19.8 and x = 6.6
New optimal solution: x = 6.6, y = 5.6
New profit = 50×6.6 + 40×5.6 = 554
Old profit = 540
Extra profit = 14
One extra hour of time for upholstery is worth £14. The shadow price of upholstery is £14.

Sensitivity Analysis on the Objective Coefficient Ranges
The objective coefficient ranges tell us the changes that can be made in the objective function coefficients without changing the optimal solution. This is particularly useful as profits, costs etc. are likely to change over time.
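The optimal plan and both shadow prices derived above can be cross-checked numerically. The sketch below uses Python's scipy rather than WinQSB (the package actually used on this module), so treat it as an illustrative alternative; linprog minimises, so the profit coefficients are negated.

```python
# Cross-check of the armchair solution and shadow prices
# (scipy.optimize.linprog is an illustrative stand-in for WinQSB).
from scipy.optimize import linprog

c = [-50, -40]       # negate: profit = 50x + 40y is to be maximised
A_ub = [[1, 0],      # max model A:  x        <= 8
        [1, 1.5],   # shaping:      x + 1.5y <= 15
        [3, 2]]     # upholstery:   3x + 2y  <= 30
b_ub = [8, 15, 30]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")

print(res.x)       # optimal plan: x = 6, y = 6
print(-res.fun)    # maximum daily profit: 540.0

# HiGHS reports dual values for the <= constraints; negating them gives
# the shadow prices of a maximisation problem.
shadow = [-m for m in res.ineqlin.marginals]
print(shadow)      # [0.0, 8.0, 14.0] for max A, shaping, upholstery
```

The zero dual on the max A constraint matches its slack of 2, while the £8 and £14 duals reproduce the shadow prices computed by hand above.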
Our objective function line, profit = 50x + 40y, has gradient -50/40 = -5/4 = -1.25.
The shaping constraint line x + 1.5y = 15 has gradient -1/1.5 = -0.667.
The upholstery constraint line 3x + 2y = 30 has gradient -3/2 = -1.5.

The profit line will reach an optimal solution at the intersection of the shaping and upholstery lines whilst the gradient of the profit line lies between the gradients of these two binding constraints. Thus the current solution will remain optimal whilst the gradient of the objective function lies between the gradients of the two binding constraints.

Model A
Let the profit from model A chairs change to a new amount called 'new'. The objective function will now look like (new)x + 40y, which has gradient -new/40. The current optimal solution will remain optimal whilst:
gradient of upholstery < gradient of profit line < gradient of shaping
-1.5 < -new/40 < -0.667
From -1.5 < -new/40: -60 < -new, so new < 60.
From -new/40 < -0.667: -new < -26.667, so new > 26.667.
Hence 26.667 < new < 60.
The current solution remains optimal whilst the profit from model A lies in the range £26.67 to £60. This means that the profit could rise by an amount less than £60 - £50 = £10, or fall by an amount less than £50 - £26.67 = £23.33, without affecting the optimal solution. If the profit from each model A armchair remains in the range £26.67 to £60 the current solution will remain optimal.

Model B
Let the profit from each model B armchair change to a new amount 'new'. The new objective function will be 50x + (new)y, which has gradient -50/new. The current solution will remain optimal whilst the gradient of the profit line is between the gradients of the two binding constraints.
-1.5 < -50/new < -0.667
From -1.5 < -50/new: -1.5×new < -50, so 1.5×new > 50 and new > 50/1.5 = 33.333.
From -50/new < -0.667: -50 < -0.667×new, so 0.667×new < 50 and new < 50/0.667 = 75.
Hence 33.33 < new < 75.
The current solution will remain optimal whilst the profit from each model B armchair lies in the range £33.33 to £75. This is the same as saying: the current solution will remain optimal whilst the value of the profit from each model B does not rise by more than £35 [£75 - £40] or fall by more than £6.67 [£40 - £33.33]. Thus continue to produce 6 of each model each day whilst the profit from each model B is between £33.33 and £75.

The WinQSB input and output:
Choose "Edit" and then "Variable Names".
Choose "Edit" and then "Constraints". The expressions can now be entered into the spreadsheet format.
To get the output, choose "Solve and Analyze" and then "Solve the Problem".
[WinQSB input and solution screens not reproduced here]

Notes for a report
The optimal production plan is to make 6 model A armchairs and 6 model B armchairs each day. This will produce a maximum daily profit of £540.
If the profit from each model A armchair remains in the range £26.67 to £60 the current solution will remain optimal.
The current solution will remain optimal whilst the profit from each model B armchair lies in the range £33.33 to £75.
All of the available time for shaping has been used, so the slack on this constraint is 0. One extra hour of shaping is worth £8. The shadow price for shaping is £8.
All of the time available for upholstering has been used, so the slack on this constraint is 0. One extra hour of time for upholstery is worth £14. The shadow price of upholstery is £14.
Not all of the 8 model A armchairs possible have been made: only 6 have been made, so there is a slack on this constraint of 2.
N.B. Only one change can be made at a time.

Seminar Exercise (the first part of each question was done last week)
1.
A company manufactures two types of sweatshirts: hooded and round neck. Each hooded sweatshirt makes a contribution to profit of £4 and each round neck sweatshirt makes a contribution of £3. Each hooded sweatshirt requires 1 hour of labour and each round neck requires 2 hours. There are 110 hours of labour available each day. There are limitations in the production capacity so that only 70 hooded sweatshirts can be made in a day. The company wishes to maximise the contribution from these sweatshirts.
a) Formulate as a linear programming problem.
b) Determine the number of hooded and round neck sweatshirts that should be made each day in order to maximise contribution.
c) What is the value of the maximum daily contribution?
d) If another hour of labour was available each day, what would be the maximum daily contribution?
e) If the limitations in the production process were changed so that 71 hooded sweatshirts could be made each day, what would be the effect on the maximum daily contribution?
f) Within what limits could the contribution from hooded sweatshirts lie, without the optimal production plan changing?

2. A health enthusiast would like to organise his food consumption of two diet supplements, Vita and Glow, so that his minimum daily requirement of three basic nutrients A, B and C is satisfied. The minimum daily requirements are 14 units of A, 12 units of B and 18 units of C. Product Vita has 2 units of A and one unit each of B and C in each packet. Product Glow has one unit each of A and B and 3 units of C in each packet. The price of Vita is 20p and the price of Glow is 40p per packet. The health enthusiast wants to determine the level of consumption of Vita and Glow that will minimise expenditure whilst satisfying the minimum daily requirements.
a) Formulate the description given above as a Linear Programming problem.
b) Advise the user on the best combination of Vita and Glow to use.
c) What is the minimum cost?
d) If the minimum daily requirement for nutrient A was increased by 1 unit, what would the minimum cost be?
e) If the minimum daily requirement for nutrient A was reduced by 1 unit, what would the minimum cost be?
f) Within what range could the price of Vita lie without changing the optimal combination of Vita and Glow?

3. A furniture manufacturer makes two types of tables: Traditional and Modern. Each traditional table requires 6 hours of cutting time, 5 hours of sanding time and 2 hours for staining. Each traditional table sold gives a contribution to profit of £50. Each modern table requires 2 hours of cutting time, 5 hours of sanding time and 4 hours of staining time. Each modern table sold makes a contribution of £30. Each day the manufacturer has available 36 hours of cutting time, 40 hours of sanding time and 28 hours of staining time. All other inputs are available as required. The company can sell all the tables it makes.
a) Find the number of each type of table that should be made in order to maximise the contribution. Find the maximum contribution possible.
b) How will the company's optimal mix of tables change if there were only 30 hours of cutting time available each day?
c) If an extra hour of cutting time became available each day, by how much would the maximum contribution change?
d) By how much could the contribution to profit of a Modern table increase before the optimal production plan changes?

4. A clothes manufacturer offers two versions of a particular t-shirt; one printed and the other plain. The manufacturing requirements are:
Cutting and printing time: the printed t-shirt takes 9 minutes each whilst each plain t-shirt takes only 3 minutes to cut and print. There are 360 minutes available for cutting and printing each day.
Sewing and packing time: each printed t-shirt takes 5 minutes for sewing and packing whilst each plain t-shirt takes only 3 minutes. There are 240 minutes available for sewing and packing of these t-shirts each day.
A contract with a local shop requires a minimum of 12 printed t-shirts to be produced each day. The manufacturer makes a contribution to profit of £6 from the manufacture and sale of each printed t-shirt and £5 from each plain t-shirt. The manufacturer wishes to maximise this contribution to profit.
Formulate the scenario described above as a linear programming problem. You should clearly indicate the meaning of each constraint and the meaning of any variables.
Using a graphical method, or otherwise, determine which combination of printed and plain t-shirts the manufacturer should produce and sell in order to maximise contribution to profit.
Identify any binding constraints.
If the time available for sewing and packing the t-shirts was to increase by 1 minute each day, what effect would this have on the maximum contribution that could be earned from the manufacture and sale of the t-shirts?
At present the contribution from a printed t-shirt is £6. By how much could this contribution increase before the production plan would need to be changed?

6. A small engineering company makes two types of engine parts, coded as part A and part B. Part A has a contribution of £30 per unit and part B £40. The company wishes to establish the weekly production plan which maximises contribution. Production data are as follows:

                    part A   part B   total available per week
machining (hours)     4        2             100
labour (hours)        4        6             180
materials (kg)        1        1              40

Because of a trade agreement, sales of part A are limited to a weekly maximum of 20 units, and to honour an agreement with an old established customer at least 10 units of part B must be made each week.
a) Formulate the scenario described above as a linear programming problem. You should clearly indicate the meaning of each constraint and the meaning of any variables. Using a graphical method, or otherwise, determine which combination of part A and part B the company should produce and sell in order to maximise contribution to profit.
What is the expected contribution to profit?
b) If the number of hours available for machining each week increased by 5, by how much would the maximum possible contribution change?
c) If the amount of material available each week increased by 1 kg, what effect would this have on the maximum contribution?
d) Within what range could the contribution from a part B lie, without altering the optimal solution found previously?

Answers:
1. d) maximum contribution would increase by £1.50 to £341.50
   e) maximum contribution would increase by £2.50
   f) contribution from hooded must be above £1.50
2. d) minimum cost will not change, i.e. £3
   e) minimum cost will not change, i.e. £3
   f) price of Vita must lie in the range 13.3p to 40p
3. a) Maximum contribution: £340
   b) 3.5 Traditional and 4.5 Modern each day (7 Traditional and 9 Modern every 2 days)
   c) maximum contribution would increase by £5
   d) could increase by up to £20
4. Produce 12 printed and 60 plain t-shirts each day.
   Binding constraints: sewing and order (printed ≥ 12).
   If sewing and packing time increases by 1 minute, maximum contribution increases by £1.67.
   Contribution from a printed t-shirt could increase by up to £2.33 before the optimal solution changes.
6. Produce 15 units of part A and 20 units of part B each week. The expected contribution is £1250 per week from the sale of these parts.
   b) 5 @ £1.25 = £6.25
   c) no effect, as this is a slack constraint
   d) £15 to £45

Topic 10: Linear Programming - Extension to More than Two Variables

There are many types of problems that can be solved using linear programming techniques. We shall limit our consideration to just two examples: financial planning and farm production.

Example: Lottery Winner
A lottery winner, Fred, has instructed his financial adviser to invest £100,000 of his winnings in the best combination of three stocks: Alpha, Beta and Gamma.
The details on these stocks are given in the table below:

Stock    price per share   estimated annual return per share   maximum possible investment
Alpha         £60                       £7                            £60,000
Beta          £25                       £3                            £25,000
Gamma         £20                       £3                            £30,000

Formulate, and solve, a linear programme to show how many shares of each stock Fred should purchase in order to maximise the estimated total annual return. Advise Fred on the results of the sensitivity analysis.

Formulation
Let A be the number of Alpha shares purchased.
Let B be the number of Beta shares purchased.
Let C be the number of Gamma shares purchased.
The total estimated annual return is 7A + 3B + 3C.
The total amount of money to be invested is £100,000, thus 60A + 25B + 20C ≤ 100,000 [assuming we do not have to invest all of it].
Only £60,000 can be invested in Alpha @ £60 each, therefore A ≤ 1,000.
Only £25,000 can be invested in Beta @ £25 each, therefore B ≤ 1,000.
Only £30,000 can be invested in Gamma @ £20 each, therefore C ≤ 1,500.

Problem:
max 7A + 3B + 3C
subject to:
total investment:  60A + 25B + 20C ≤ 100,000
max A:             A ≤ 1,000
max B:             B ≤ 1,000
max C:             C ≤ 1,500
non-negativity:    A ≥ 0, B ≥ 0, C ≥ 0

Computer Output

Decision variable   Solution value
Alpha                    750
Beta                   1,000
Gamma                  1,500
Objective function (max) = 12,750

Constraint          Slack or Surplus   Shadow Price
total investment           0              0.1167
max in A                 250              0
max in B                   0              0.0833
max in C                   0              0.6667

Optimal solution: Buy 750 shares in Alpha, 1,000 shares in Beta and 1,500 shares in Gamma. This will give a total estimated return of £12,750.

Slacks:
Constraint total investment has slack S1 = 0. All of the £100,000 has been invested.
Constraint max in A has slack S2 = 250. The number of Alpha shares bought was 250 less than the number available. [1,000 were available but only 750 were bought.]
Constraint max in B has slack S3 = 0. The maximum number of Beta shares has been purchased.
Constraint max in C has slack S4 = 0. The maximum number of Gamma shares has been purchased.
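The output table above can be reproduced with a solver. The sketch below uses scipy's linprog as an illustrative stand-in for WinQSB, with the share limits expressed as variable bounds:

```python
# Reproducing the lottery-winner portfolio with scipy's linprog
# (illustrative only; the module output above comes from WinQSB).
from scipy.optimize import linprog

c = [-7, -3, -3]                 # negate: maximise 7A + 3B + 3C
A_ub = [[60, 25, 20]]            # total investment <= 100,000
b_ub = [100_000]
bounds = [(0, 1_000),            # max in A: at most 1,000 Alpha shares
          (0, 1_000),            # max in B: at most 1,000 Beta shares
          (0, 1_500)]            # max in C: at most 1,500 Gamma shares

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print([round(v) for v in res.x])   # [750, 1000, 1500]
print(-res.fun)                    # 12750.0
```

This matches the solution values and the £12,750 estimated return in the computer output.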
Binding Constraints
The binding constraints are total investment, max in B and max in C (they all have zero slack). Thus the maximum return is limited by the total amount of money that can be spent, the number of Beta shares available and the number of Gamma shares available.

Shadow Prices
Total investment: Total investment has a shadow price of 0.1167. If the total investment possible was increased by £1, a further £0.1167 could be earned. Since any increased investment would be spent on Alpha shares (the Beta and Gamma shares are all used), every extra £1 spent on Alpha shares would generate £7/60 = £0.1167.
Max in A: The maximum Alpha constraint has a shadow price of zero, as we can earn nothing by increasing the availability of Alpha shares when there are already unused shares.
Max in B: If the total number of Beta shares available was increased by 1, a further £0.0833 could be earned. We could not predict this result easily, as more Beta shares will reduce the number of Alpha and Gamma shares in an unknown manner.
Max in C: If the total number of Gamma shares available was increased by 1, a further £0.67 could be earned.

Sensitivity Analysis

Decision Variable   Unit Cost or Profit   Allowable Min   Allowable Max
A                          7                   0              7.2
B                          3                 2.9167           M (infinity)
C                          3                 2.3333           M (infinity)

A shares: The current suggested portfolio of shares will remain optimal whilst the return on an Alpha share remains in the range £0 to £7.20, assuming the other returns remain constant.
B shares: The current suggested portfolio will remain optimal whilst the return on a Beta share remains above £2.92.
C shares: The current portfolio will remain optimal whilst the return on a Gamma share remains above £2.33.
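The allowable range for the Alpha return (£0 to £7.20) can also be checked by re-solving the problem with a perturbed coefficient. The trial values 7.1 and 7.5 below are my own choices for illustration, one inside and one outside the range:

```python
# Checking the objective coefficient range for Alpha numerically
# (a hand-rolled substitute for the WinQSB sensitivity report).
from scipy.optimize import linprog

def solve(alpha_return):
    res = linprog([-alpha_return, -3, -3],
                  A_ub=[[60, 25, 20]], b_ub=[100_000],
                  bounds=[(0, 1_000), (0, 1_000), (0, 1_500)],
                  method="highs")
    return [round(v) for v in res.x]

print(solve(7.1))  # [750, 1000, 1500]: inside the range, plan unchanged
print(solve(7.5))  # outside the range: the portfolio changes
```

At a return of 7.5 the Alpha shares become better value per pound than Beta, so the solver fills the Alpha limit instead, confirming that 7.2 is the upper edge of the range.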
Ranges for Shadow Prices

Constraint         Right Hand Side   Shadow Price   Allowable Min RHS   Allowable Max RHS
total investment       100,000          0.1167           55,000             115,000
max in A                 1,000          0                   750             M
max in B                 1,000          0.0833              400             2,800
max in C                 1,500          0.6667              750             3,750

These values give the ranges within which the shadow prices apply.

Total investment: If the RHS of the total investment constraint increases to 115,000 or less, we can say that for each extra £1 available for investment the return will increase by £0.1167. However, if the increase takes the total investment available to above £115,000 we do not know what the extra return will be. Similar arguments apply for reductions in the maximum total investment, provided the total available does not fall below £55,000.
Max in A: If the RHS of constraint max A, the maximum number of Alpha shares, takes any value above 750, then for each additional share the extra return will be zero. (This confirms that we only needed 750 Alpha shares for the original solution.)
Max in B: If the RHS of constraint max B, the maximum number of Beta shares, is in the range 400 to 2,800 shares, the shadow price of £0.0833 will apply.
Max in C: As long as the maximum number of Gamma shares is in the range 750 to 3,750, the shadow price of £0.6667 will apply.

Farm Production
A farmer has two farms, Home Farm and Meadow Farm, on which he grows corn and wheat. Both farms are 40 acres in size. To satisfy a contract with a local mill, the farmer must produce 7,000 barrels of corn and 11,000 barrels of wheat each year. The farmer wishes to minimise the cost of meeting the contract. The data for each farm are given below:

           Home Farm                      Meadow Farm
        yield per acre   cost per acre   yield per acre   cost per acre
corn     400 barrels        £100          650 barrels        £120
wheat    300 barrels         £90          350 barrels         £80

Write a report to the farmer analysing your findings.

Formulation:
Variables: We need to distinguish between production at Home Farm and Meadow Farm.
Let CH be the number of acres of corn planted at Home Farm.
Let CM be the number of acres of corn planted at Meadow Farm.
Let WH be the number of acres of wheat planted at Home Farm.
Let WM be the number of acres of wheat planted at Meadow Farm.

Minimise cost: minimise 100CH + 90WH + 120CM + 80WM
subject to:
corn yield (1):        400CH + 650CM ≥ 7,000
wheat yield (2):       300WH + 350WM ≥ 11,000
Home Farm area (3):    CH + WH ≤ 40
Meadow Farm area (4):  CM + WM ≤ 40
CH ≥ 0, CM ≥ 0, WH ≥ 0, WM ≥ 0

[WinQSB input for the farm problem not reproduced here]

Report
1. Planting Plan
In order to minimise the planting costs, the following should be implemented:
Plant Home Farm with 2.56 acres of wheat.
Plant Meadow Farm with 10.77 acres of corn and 29.23 acres of wheat.
This will incur costs of £3,861.54, the minimum that can be achieved.

2. Implications of the Suggested Planting Programme
The planting plan given above will result in exactly 7,000 barrels of corn and exactly 11,000 barrels of wheat being produced [S1 = 0, S2 = 0]. 37.44 acres of Home Farm will not be planted, whilst all 40 acres of Meadow Farm will be used.

3. Scope of the Recommendations
3.1 Change in Costs
The planting plan given above will result in a minimum cost whilst the following conditions hold:
* The cost of planting corn at Home Farm stays above £89.23 per acre.
* The cost of planting corn at Meadow Farm stays below £137.50 per acre.
* The cost of planting wheat at Home Farm stays in the range £68.57 to £105 per acre.
* The cost of planting wheat at Meadow Farm stays in the range £62.50 to £105 per acre.
If any one of the conditions above fails to hold then a new planting plan will be required. Furthermore, if two or more of the present costs change, a new plan will be required. As these ranges are relatively large, the proposed plan should hold good for some time.

3.2 Change in Contract
The proposed plan exactly meets the contracts for 7,000 barrels of corn and 11,000 barrels of wheat.
If the requirement for corn was to increase by one barrel, the extra cost incurred in meeting this target would be 22.3 pence. This marginal cost of 22.3 pence per barrel applies for levels of production between 5,571.43 and 26,000 barrels.
If the minimum amount of wheat required was increased, the extra cost would be 30 pence per barrel. This extra cost per barrel will apply whilst the contract for wheat lies in the range 10,230.8 to 22,230.8 barrels.
If the minimum requirement for corn or wheat fell outside these ranges, a new planting plan would be required.

3.3 Changes in Farm Size
In total only 2.56 acres of Home Farm are being used. Clearly it would not make sense to consider increasing the size of Home Farm.
All 40 acres of Meadow Farm are being used. If the size of Meadow Farm could be increased by 1 acre, the minimum cost could be reduced by £25. This marginal reduction in cost applies when the size of Meadow Farm lies in the range 10.77 acres to 42.2 acres. Similarly, if the size of Meadow Farm is reduced by 1 acre, the minimum cost will rise by £25.
[If we could increase the size of Meadow Farm by 1 acre, we could produce an extra 350 barrels of wheat at Meadow Farm at a cost of £80. At the same time we would reduce the wheat grown at Home Farm by 350 barrels, i.e. 350/300 = 1.1667 acres. This reduction would save 1.1667 × £90 = £105. Thus the net reduction in cost would be £105 - £80 = £25.]

Seminar Exercise
Question 1.
Formulate the problem below as a linear programming problem. Solve the problem using QSB for Windows. Write some notes on the interpretation of the output. During the seminar discuss the formulation and the output.
A company has two factories, A and B. Each factory makes two products, standard and deluxe. A unit of standard gives a profit contribution of £10, while a unit of deluxe gives a profit contribution of £15. The company wishes to maximise this profit contribution. Each factory uses two processes, grinding and polishing, for producing its products.
Factory A has a grinding capacity of 80 hours per week and a polishing capacity of 60 hours per week. For factory B these capacities are 60 and 75 hours per week respectively.
The grinding and polishing times in hours for a unit of each type of product in each factory are given in the table below:

                 Factory A             Factory B
             standard   deluxe     standard   deluxe
grinding         4         2           5         3
polishing        2         5           5         6

It is possible, for example, that factory B has older machines than factory A, resulting in higher unit processing times.
In addition, each unit of each product uses 4 kilograms of a raw material which we shall refer to as "raw". The company has 120 kilograms of "raw" available per week.
(Example taken from H. P. Williams, Model Building in Mathematical Programming, 3rd edition, Wiley, 1990, page 45)
[Answer: 9.166 units of standard and 8.33 units of deluxe in factory A, and 12.5 units of deluxe in factory B each week. Maximum contribution = £404.17 per week]

Question 2. Hitech Training
A small private college, Hitech Training, provides training in web-page design and development for companies. The companies send their employees to the college premises for short courses. Each course lasts 8 hours. The distribution of the 8 hours is flexible; a course may take place during one day or be spread over a number of days. It is also possible for one course to be spread over a number of weeks, e.g. 2 hours on Monday evening for 4 weeks.
The companies send employees in groups of 10 or fewer, as the college's computer rooms can only accommodate 10 trainees at a time. If a company chooses to send fewer than 10 employees on a particular course it is not charged a reduced fee, as the teaching costs are fixed. If a company wishes to send more than 10 employees, it must book more than one course.
The prices charged to the companies for a course are as follows:

Introductory Web Design    £500 for an 8-hour course for up to 10 employees
Intermediate Web Design    £600 for an 8-hour course for up to 10 employees
Advanced Web Design        £700 for an 8-hour course for up to 10 employees

The college employs trainers to deliver these courses. The trainers are only paid for the hours they work. The rates of pay are as follows:

Introductory Web Design    £18 an hour
Intermediate Web Design    £21 an hour
Advanced Web Design        £35 an hour

Demand for these training courses is such that the college is confident of having enough companies wanting courses to fill all the courses it provides. However, the number of courses that can be provided each week is limited by the fact that the college only has 200 trainer-hours available each week. [A trainer-hour is defined as one trainer teaching for one hour.] The college has enough rooms and trainers to meet any timetabling problems that could arise.
Of these 200 trainer-hours, only 80 are available for the Advanced Web Design course. A total of 160 trainer-hours are available for teaching the Intermediate or Advanced courses. The trainers that can teach the Advanced courses can also teach the Intermediate courses. All trainers can teach the Introductory course.
The college's analysis of the demand for its courses indicates that, each week, at least 5 of the courses offered should be at the Advanced level and at least 5 at the Intermediate level.
The college wishes to maximise its Net Revenue, defined as:
Net Revenue = Revenue from courses − Labour Costs

Required:
Formulate the scenario described above as a Linear Programme. Keep a copy of any notes made during the formulation, as you may need these in the examination.
Enter the problem into QSB for Windows (or a similar package). Produce printouts of:
• the input data
• the solution of the problem
• the sensitivity analysis
You will need to ensure that your printouts include: shadow prices, right-hand side ranges and ranges of objective coefficients.
Bring these printouts and any notes to the seminar.

Answer: Question 1.
Let AS be the number of standard products made in factory A each week
Let BS be the number of standard products made in factory B each week
Let AD be the number of deluxe products made in factory A each week
Let BD be the number of deluxe products made in factory B each week

max 10AS + 10BS + 15AD + 15BD
subject to:
grinding in A:  4AS + 2AD ≤ 80
polishing in A: 2AS + 5AD ≤ 60
grinding in B:  5BS + 3BD ≤ 60
polishing in B: 5BS + 6BD ≤ 75
raw:            4AS + 4BS + 4AD + 4BD ≤ 120

Question 2:
[QSB output not reproduced]

Topic 11: Project Management: Critical Path

Introduction
In every industry there are concerns about how to manage large-scale, complicated projects effectively. Millions of pounds have been wasted in cost overruns caused by the poor planning and control of projects.

Project planning
Planning the project requires that the objectives are clearly defined so that the project team knows exactly what is required of them. All the activities involved in the project must be clearly identified. An activity is the performance of an individual job that requires labour, resources and time. Once the activities have been defined, their sequential relationships must be determined: which activity comes first, and which activities must precede others. Once the activities and their relationships to each other are known, a network of activities can be drawn up. The time estimates for each activity then enable a total project completion time to be calculated. If this total completion time is longer than that specified in the objectives, means must be found to reduce the total project time (known as crashing).
This reduction in time is usually achieved by assigning more resources to certain activities, which will reduce the overall project time.

Project Control
Once the planning process is complete and the work has begun, the focus moves on to controlling the actual work involved. Controlling a project involves ensuring that the timetable set in the planning stage is adhered to and that the activities are completed in the appropriate sequence. Should any unforeseen problems arise at this stage, the project may have to be re-scheduled and additional resources allocated to ensure that the original completion time is adhered to.

We are going to cover how to:
• plan, monitor and control projects with the use of CPM;
• determine the earliest start, earliest finish, latest start, latest finish and float times for each activity, along with the total project completion time;
• produce Gantt charts to facilitate monitoring of the project;
• reduce the total project time, at the least total cost, by crashing the network.

Critical Path Method
The Critical Path Method (CPM) is a popular technique that helps managers plan, schedule, monitor and control large and complex projects. The steps in the procedure are:
1. Define the project and all of its significant activities.
2. Decide which activities must precede others.
3. Draw the network connecting all of the activities.
4. Assign time and/or cost estimates to each activity.
5. Compute the critical path through the network.
6. Use the network to plan, schedule, monitor and control the project.
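Steps 4 and 5 can be sketched in a few lines of code. The helper below is an illustrative sketch (QSB and MS Project remain the module's prescribed software): it runs a forward and a backward pass over an activity network and reports the critical path, demonstrated on the text-book production example used in these notes.

```python
# A minimal forward/backward pass over an activity network (CPM steps 4 and 5).
def cpm(activities):
    """activities: {code: (duration, [predecessor codes])}, listed so that
    every predecessor appears before the activities that depend on it."""
    est, eft = {}, {}
    for a, (dur, preds) in activities.items():            # forward pass
        est[a] = max((eft[p] for p in preds), default=0)
        eft[a] = est[a] + dur
    project = max(eft.values())                           # total completion time
    lst, lft = {}, {}
    for a in reversed(list(activities)):                  # backward pass
        dur, _ = activities[a]
        succs = [s for s in activities if a in activities[s][1]]
        lft[a] = min((lst[s] for s in succs), default=project)
        lst[a] = lft[a] - dur
    critical = [a for a in activities if est[a] == lst[a]]  # zero-float activities
    return critical, project

# The text-book production example: A..F with durations 3, 2, 1, 18, 4, 6 months.
book = {"A": (3, []), "B": (2, ["A"]), "C": (1, ["B"]),
        "D": (18, ["B"]), "E": (4, ["C", "D"]), "F": (6, ["E"])}
critical, total = cpm(book)
print(total, "".join(critical))   # 33 ABDEF
```

The same helper works for any of the networks in this topic, as long as the activities are entered in an order where predecessors come first.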
Example: producing a text book

Activity code   Activity description                  Duration (months)   Preceding activities
A               Initial consultation with publisher          3                   –
B               Prepare proposal                             2                   A
C               Sign contract                                1                   B
D               Write material                              18                   B
E               First proof-read                             4                   C, D
F               Final proof-read and publish                 6                   E

The CPM diagram consists of a network of boxes, one per activity, of the form:

    EST | Activity, duration | EFT
    LST |                    | LFT

where EST = earliest start time, EFT = earliest finish time, LST = latest start time, LFT = latest finish time.

[Network diagram: A → B; B → C and B → D; C and D both feed E; E → F. Each box shows the activity, its duration and its EST/EFT/LST/LFT.]

Forward Pass
The forward pass involves calculating the earliest time at which each activity can start and finish.
Let activity A start at time 0 (its earliest start time). As activity A takes 3 months, its earliest finish time is 0 + 3 = 3.
Thus earliest finish time (EFT) = earliest start time (EST) + duration.
B cannot start until A has completed, so its earliest start time is the same as A's earliest finish time: B has EST = 3. Its duration is 2, so its earliest finish time is 5.
C follows on from B, so its earliest start time = B's earliest finish time = 5. Its earliest finish time is 5 + 1 = 6.
D also follows on from B, so its earliest start time is 5. Its earliest finish time is 5 + 18 = 23.
E follows from both C and D. As D has an earliest finish time of 23, E cannot start until 23. Thus EST = 23; the duration is 4, so the earliest finish time = 23 + 4 = 27.
F cannot start until E has finished, so its earliest start time = 27 and its earliest finish time = 33.
Thus the project will take 33 months. After the forward pass the network shows:

Activity   EST   EFT
A           0     3
B           3     5
C           5     6
D           5    23
E          23    27
F          27    33

Backward Pass
The backward pass is used to determine which activities are "critical" and works by considering the latest start time:
latest start time (LST) = latest finish time (LFT) − duration
The latest finish time of F = 33 (as we do not want to extend the total time taken).
The latest start time of F = 33 − 6 = 27.
As F can start at 27 months, this will be E's latest finish time. E has a latest start time of 27 − 4 = 23.
C has a latest finish time of 23 and a latest start time of 23 − 1 = 22.
D has a latest finish time of 23 and a latest start time of 23 − 18 = 5.
B must occur before both C and D. Taking the smaller of their two latest start times, B has a latest finish time of 5. Its latest start time is 5 − 2 = 3.
A has a latest finish time of 3 and a latest start time of 3 − 3 = 0.
The completed network is:

Activity   EST   EFT   LST   LFT
A           0     3     0     3
B           3     5     3     5
C           5     6    22    23
D           5    23     5    23
E          23    27    23    27
F          27    33    27    33

The critical path is where the earliest finishing time equals the latest finishing time. In this example, the critical path is ABDEF. The minimum time the project can take is 33 months.

Example: An Estate Agency is planning to open a new office in North London. The main activities are as follows:

Activity   Description                   Preceding activity   Duration (weeks)
A          Find new office location      None                        9
B          Recruit new staff             None                        7
C          Make alterations to office    A                           5
D          Order equipment               A                           3
E          Install new equipment         D                           3
F          Train staff                   B                           4
G          Test operations               C, E, F                     1

Forward Pass
A: EST = 0, duration of A = 9, EFT = 0 + 9 = 9
B: EST = 0, duration of B = 7, EFT = 0 + 7 = 7
C: EST = 9, duration of C = 5, EFT = 9 + 5 = 14
D: EST = 9, duration of D = 3, EFT = 9 + 3 = 12
E: EST = 12, duration of E = 3, EFT = 12 + 3 = 15
F: EST = 7, duration of F = 4, EFT = 7 + 4 = 11
G: G cannot start until C, E and F have all finished. These have EFT values of 14, 15 and 11, so EST = 15, duration = 1, EFT = 15 + 1 = 16.
Thus the earliest finishing time of the entire project is 16 weeks.
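The forward pass above can be reproduced in a few lines of Python (an illustrative sketch; the activity codes are listed in an order where every predecessor appears first):

```python
# Forward pass for the Estate Agency network (durations in weeks).
acts = {"A": (9, []), "B": (7, []), "C": (5, ["A"]), "D": (3, ["A"]),
        "E": (3, ["D"]), "F": (4, ["B"]), "G": (1, ["C", "E", "F"])}

est, eft = {}, {}
for a, (dur, preds) in acts.items():   # predecessors always appear before successors
    est[a] = max((eft[p] for p in preds), default=0)
    eft[a] = est[a] + dur

print(eft["G"])   # 16 -- the earliest finishing time of the whole project
```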
Backward Pass
G: LFT = 16 [the same as the final EFT]; LST = LFT − duration = 16 − 1 = 15
E: LFT = 15; LST = 15 − 3 = 12
C: LFT = 15; LST = 15 − 5 = 10
D: LFT = 12; LST = 12 − 3 = 9
F: LFT = 15; LST = 15 − 4 = 11
A: LFT = 9 (the smaller of the latest start times of C and D); LST = 9 − 9 = 0
B: LFT = 11; LST = 11 − 7 = 4

The critical path is the route joining the nodes where EFT = LFT. In this example, the critical path is ADEG. The shortest finishing time for the project is 16 weeks.

Floats
The float measures the amount of time by which an activity can over-run without affecting the total completion time:
Float = Latest finish time (LFT) − Earliest finish time (EFT)
Considering the Estate Agency example used earlier:

Activity   EFT   LFT   Float   Critical
A           9     9      0      yes
B           7    11      4
C          14    15      1
D          12    12      0      yes
E          15    15      0      yes
F          11    15      4
G          16    16      0      yes

The critical path is ADEG. The shortest finishing time for the project is 16 weeks.

Gantt Charts
Gantt charts can be used by managers to chart the progress of a project. In these charts, time is measured along the horizontal axis, each activity is listed on the vertical axis, and a bar is drawn to show the duration of each activity. The variety of Gantt chart we will use shows each bar starting at its earliest start time (EST); the length of the bar represents the duration of the activity, with any float shown at the end of the duration.

[Gantt chart for the Estate Agency: bars for activities A to G plotted against a time axis from week 0 to week 16, with durations shown solid and floats shaded.]

Monitoring progress on a Gantt chart
A crucial step in meeting a completion target is to monitor a project's progress.
The Gantt chart provides a visual means of tracking the progress of the project's activities against the scheduled completion dates. The chart also lets the manager see where there is some 'slack', in the sense of the float times. For example, at the end of week 9, activity A should be completed and F should already have started; activities C and D should begin the next week. Ideally activity B will have finished, but as long as it completes by week 11, the project can still complete on time.

Gantt Charts: Advantages
1. The chart is easy to draw, particularly if appropriate software is used.
2. An earliest possible completion date is clearly visible.
3. For each activity the earliest possible start and finishing times are shown.

Gantt Charts: Disadvantages
1. The chart only gives one possible sequence for the activities (the one giving the earliest possible completion time).
2. The precedence rules are not clear from the chart, so it is not obvious how a delay in one activity will affect the project.

Seminar Activity: For the following cases, draw the network, find the critical path and the shortest duration. Identify the float for each activity.

a)
Activity   Preceding   Duration
A          None            3
B          None            4
C          A               4
D          A               7
E          B, C            2

b)
Activity   Preceding   Duration
A          None            4
B          None            6
C          A               4
D          A               3
E          C               5
F          B               2
G          D, F            8
H          B               5

c)
Activity   Preceding   Duration
A          None            4
B          A              12
C          A              11
D          None           20
E          D               6
F          B, C, E         7
G          F              10
H          E               5
I          G, H            4

Topic 11: Project Management: Crashing

Some definitions:
Normal cost: the cost associated with the normal time estimate for the activity.
Crash cost: the cost associated with the minimum possible time for an activity. Crash costs, because of extra wages, overtime payments etc., are higher than the normal costs.
Crash time: the minimum possible time that an activity can take.
Slope: the average cost of shortening an activity by one time unit.
Slope = increase in cost / decrease in time = (crash cost − normal cost) / (normal time − crash time)

Crashing: this process involves finding the least-cost method of reducing the overall project duration. This is done by reducing the time of the activity with the minimum slope, provided that this activity is on the critical path. The process is repeated until no further savings can be made.

Example:

Activity   Preceding   Normal duration   Crash duration   Normal cost (£)   Crash cost (£)
A          None              4                 3                360               420
B          None              8                 5                300               510
C          A                 5                 3                170               270
D          A                 9                 7                220               300
E          B, C              5                 3                200               360
                                               Total           1250

Firstly: draw the network, and find the critical path and the shortest duration.

Activity   EST   EFT   LST   LFT
A           0     4     0     4
B           0     8     1     9
C           4     9     4     9
D           4    13     5    14
E           9    14     9    14

Critical path = A, C, E; normal duration = 14 days; total cost = £1,250.

Crashing
First Crash:
Step 1: Calculate the slopes, using slope = increase in cost / decrease in time = (crash cost − normal cost) / (normal time − crash time).

Activity   Critical   Normal dur.   Crash dur.   Normal cost (£)   Crash cost (£)   Slope
A          yes             4             3             360               420        (420 − 360)/(4 − 3) = 60
B                          8             5             300               510
C          yes             5             3             170               270        (270 − 170)/(5 − 3) = 50
D                          9             7             220               300
E          yes             5             3             200               360        (360 − 200)/(5 − 3) = 80

Thus activity C has the minimum slope and so is the cheapest way of reducing the total duration by 1 day. Reducing C by 1 day costs £50.
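The slope table above is straightforward to compute automatically. A small sketch (an illustrative helper, with durations in days and costs in £; the `critical` list is taken from the network already drawn):

```python
# Slope = (crash cost - normal cost) / (normal time - crash time),
# computed for each activity in the crashing example.
data = {  # activity: (normal duration, crash duration, normal cost, crash cost)
    "A": (4, 3, 360, 420),
    "B": (8, 5, 300, 510),
    "C": (5, 3, 170, 270),
    "D": (9, 7, 220, 300),
    "E": (5, 3, 200, 360),
}
slopes = {a: (cc - nc) / (nd - cd) for a, (nd, cd, nc, cc) in data.items()}

critical = ["A", "C", "E"]          # critical path found from the network above
cheapest = min(critical, key=slopes.get)
print(slopes)                       # slopes of 60, 70, 50, 40, 80 for A..E
print(cheapest)                     # C -- the first activity to crash
```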
Step 2: New network, with C reduced to 4 days:

Activity   EST   EFT   LST   LFT
A           0     4     0     4
B           0     8     0     8
C           4     8     4     8
D           4    13     4    13
E           8    13     8    13

Step 3: New critical paths, total duration and cost.
Critical paths = ACE, BE and AD. Total cost = £1,250 + £50 = £1,300; duration = 13 days.

Second Crash: the duration of C is now 4 and all activities are critical.
Step 1: Calculate the slopes:

Activity   Critical   Normal dur.   Crash dur.   Normal cost (£)   Crash cost (£)   Slope
A          yes             4             3             360               420        60 [from before]
B          yes             8             5             300               510        (510 − 300)/(8 − 5) = 70
C          yes             4             3             170 (220)         270        50 [from before]
D          yes             9             7             220               300        (300 − 220)/(9 − 7) = 40
E          yes             5             3             200               360        80 [from before]
                                         Total        1250 (1300)

Activity D has the minimum slope and is therefore the cheapest way of reducing the total duration by 1 day; reducing D by 1 day costs £40. However, we cannot reduce D on its own, as all paths are critical. We have to consider combinations of reductions:
A and B: extra cost = 60 + 70 = 130
D and E: extra cost = 40 + 80 = 120
B, C and D: extra cost = 70 + 50 + 40 = 160
A and E: extra cost = 60 + 80 = 140
All of these combinations will reduce the total duration by 1 day. The cheapest option is D and E.

Step 2: New network, with D reduced to 8 days and E reduced to 4 days:

Activity   EST   EFT   LST   LFT
A           0     4     0     4
B           0     8     0     8
C           4     8     4     8
D           4    12     4    12
E           8    12     8    12

Step 3: New critical paths, total duration and cost.
Critical paths = ACE, BE and AD. Total cost = £1,300 + £120 = £1,420; duration = 12 days.

Third Crash: the duration of C is now 4, D is 8 and E is 4, and all activities are critical.
Step 1:

Activity   Critical   Current dur.   Crash dur.   Slope
A          yes             4              3          60
B          yes             8              5          70
C          yes             4              3          50
D          yes             8              7          40
E          yes             4              3          80

We cannot reduce one activity on its own, as all paths are critical. We have to consider combinations of reductions:
A and B: extra cost = 130
D and E: extra cost = 120
B, C and D: extra cost = 160
A and E: extra cost = 140
All of these combinations will reduce the total duration by 1 day.
The cheapest option is D and E.
Step 2: New network, with D reduced to 7 days and E reduced to 3 days:

Activity   EST   EFT   LST   LFT
A           0     4     0     4
B           0     8     0     8
C           4     8     4     8
D           4    11     4    11
E           8    11     8    11

Step 3: New critical paths, total duration and cost.
Critical paths = ACE, BE and AD. Total cost = £1,420 + £120 = £1,540; duration = 11 days.

Fourth Crash: the duration of C is now 4, D is 7 and E is 3, and all activities are critical.
Step 1:

Activity   Critical   Current dur.   Crash dur.   Slope
A          yes             4              3          60
B          yes             8              5          70
C          yes             4              3          50
D          yes             7              7          40
E          yes             3              3          80

We cannot reduce one activity on its own, as all paths are critical, and we cannot reduce activities D and E as they are already at their crash durations. Considering the combinations of reductions:
A and B: extra cost = 130
D and E: cannot be reduced
B, C and D: D cannot be reduced
A and E: E cannot be reduced
The only possible option left is to reduce A and B. This will reduce the total duration by 1 day.

Step 2: New network, with A reduced to 3 days and B reduced to 7 days:

Activity   EST   EFT   LST   LFT
A           0     3     0     3
B           0     7     0     7
C           3     7     3     7
D           3    10     3    10
E           7    10     7    10

Step 3: New critical paths, total duration and cost.
Critical paths = ACE, BE and AD. Total cost = £1,540 + £130 = £1,670; duration = 10 days.

Activity   Critical   Current dur.   Crash dur.   Slope
A          yes             3              3          60
B          yes             7              5          70
C          yes             4              3          50
D          yes             7              7          40
E          yes             3              3          80

The only activities that are not at their minimum duration are B and C. As reducing B and C alone will not reduce the total duration (the path AD cannot be shortened any further), we cannot crash any further.

Example: Klone Computers
Klone Computers is a small manufacturer of personal computers that is about to design, manufacture and market the Klone 2000 palmbook computer. The company faces three major tasks in introducing the new computer:
1. Manufacturing the new computer
2. Training staff and sales teams to operate the new computer
3. Advertising the new computer
When the proposed specifications for the new computer have been reviewed, the manufacturing phase begins with the design of a prototype computer.
Once the design is determined, the required materials are purchased and the prototypes are manufactured. Prototype models are then tested and analysed by staff who have completed the staff training course. Based on their input, refinements are made to the prototype and an initial production run of computers is scheduled.
Staff training of company personnel begins once the computer is designed, allowing staff to test the prototypes once they have been manufactured. After the computer design has been revised based on staff input, the sales force undergoes full-scale training.
Advertising is a two-phase process. First, a small group works closely with the design team so that, once a product design has been chosen, the marketing team can begin an initial pre-production advertising campaign. Following this initial campaign and completion of the final design revisions, a larger advertising campaign team is introduced to the special features of the computer, and a full-scale advertising programme is launched.
The entire project is concluded when the initial production run is completed, the sales staff are trained, and the advertising campaign is under way. Klone has come up with the following information to assist the planning of the project.
Phase           Activity   Description                        Immediate predecessors   Estimated completion time (days)
Manufacturing   A          Prototype model design             None                             90
                B          Purchase of materials              A                                15
                C          Manufacture of prototype models    B                                 5
                D          Revision of design                 G                                20
                E          Initial production run             D                                21
Training        F          Staff training                     A                                25
                G          Staff input on prototype models    C, F                             14
                H          Sales training                     D                                28
Advertising     I          Pre-production advertising         A                                30
                J          Post-production advertising        D, I                             45

[Based on an example in Lawrence, John & Pasternack, Barry, Applied Management Science, 2nd edition, Wiley, 2002]

Forward pass:
A: EST = 0, EFT = EST + duration of A = 0 + 90 = 90
B: EST = 90, EFT = 90 + duration of B = 90 + 15 = 105
C: EST = 105, EFT = 105 + duration of C = 105 + 5 = 110
F: EST = 90, EFT = 90 + duration of F = 90 + 25 = 115
I: EST = 90, EFT = 90 + duration of I = 90 + 30 = 120
G: G cannot start until C and F have both finished, so EST = 115, EFT = 115 + duration of G = 115 + 14 = 129
D: EST = 129, EFT = 129 + duration of D = 129 + 20 = 149
E: EST = 149, EFT = 149 + duration of E = 149 + 21 = 170
H: EST = 149, EFT = 149 + duration of H = 149 + 28 = 177
J: J cannot start until both D and I have finished, so EST = 149, EFT = 149 + duration of J = 149 + 45 = 194

Backward pass:
J: LFT = 194, LST = 194 − duration of J = 194 − 45 = 149
H: LFT = 194, LST = 194 − duration of H = 194 − 28 = 166
E: LFT = 194, LST = 194 − duration of E = 194 − 21 = 173
D: via E, LFT = LST of E = 173; via H, LFT = LST of H = 166; via J, LFT = LST of J = 149. Thus LFT of D = 149, and LST of D = 149 − duration of D = 149 − 20 = 129
G: LFT = LST of D = 129, LST of G = 129 − duration of G = 129 − 14 = 115
C: LFT = LST of G = 115, LST of C = 115 − duration of C = 115 − 5 = 110
B: LFT = LST of C = 110, LST of B = 110 − duration of B = 110 − 15 = 95
F: LFT = LST of G = 115, LST of F = 115 − duration of F = 115 − 25 = 90
I: LFT = LST of J = 149, LST of I = 149 − duration of I = 149 − 30 = 119
A: via B, LFT = LST of B = 95; via F, LFT = LST of F = 90; via I, LFT = LST of I = 119. Thus LFT of A = 90, and LST of A = 90 − duration of A = 90 − 90 = 0

The critical path is AFGDJ and the minimum completion time is 194 days.

Floats
The float measures the amount of time by which an activity can over-run without affecting the total completion time:
Float = Latest finish time (LFT) − Earliest finish time (EFT)

Activity   EFT   LFT   Float   Critical
A           90    90     0      yes
B          105   110     5
C          110   115     5
D          149   149     0      yes
E          170   194    24
F          115   115     0      yes
G          129   129     0      yes
H          177   194    17
I          120   149    29
J          194   194     0      yes

The floats can also be seen from the Gantt chart:
[Gantt chart for Klone Computers: each activity A to J is plotted at its EST against a time axis from day 0 to day 200, with the duration shown in black and the float in grey.]

Seminar Question
A bank is planning to install a new computerised accounting system. The bank management has determined the activities required to complete the project, the precedence relationships, and the activity time estimates (in weeks), as shown in the table below:

Activity   Description                          Predecessor   Normal time   Crash time   Normal cost   Crash cost
A          Recruit staff                        –                  9             7           4800          6300
B          Systems development                  –                 11             9           9100         15500
C          Systems training                     A                  7             5           3000          4000
D          Equipment training                   A                 10             8           3600          5000
E          Manual system test                   B, C               1             1              0             0
F          Preliminary system changeover        B, C               5             3           1500          2000
G          Computer-personnel interface         D, E               6             5           1800          2000
H          Equipment modification               D, E               3             3              0             0
I          Equipment testing                    H                  1             1              0             0
J          System debugging and installation    F, G               2             2              0             0
K          Equipment changeover                 G, I               8             6           5000          7000

a. Determine the project completion time and the critical path.
b. Represent the project as a Gantt chart.
c. Crash the network to 26 weeks. Indicate how much this will cost the bank. Identify the new critical path.
[Example based on an example from Taylor, Bernard, Introduction to Management Science, 7th edition, Prentice Hall, 2002]

How to use MS Project 2010
Using the Estate Agency example we had before, we can use Microsoft Project to schedule the tasks. The main difference we will notice at first is that MS Project uses a calendar. To make the comparison with our manual example easy, let us start the project on 1st January 2013. We calculated that the project would take at least 16 weeks; 16 weeks from Tuesday 1st January takes us to Monday 22nd April.
The data entry looks like:
[Screenshot: MS Project data entry. Note that only the first date, 1st January, was entered; the rest were calculated by the software.]
As a network diagram this looks like:
[Screenshot: MS Project network diagram]
As a Gantt chart we get:
[Screenshot: MS Project Gantt chart]

Using the example above:
1. Load up Microsoft Office Project.
2. Choose File, New, Blank Project.
3. To use the system with auto-scheduled tasks we need to turn off manually scheduled tasks. We do this by selecting File, Options, Schedule; under "New tasks created" choose "Auto Scheduled", then OK.
4. Start to enter the project details. The screen is split in two; use the table on the left-hand side to add the predecessors. The first entry is Z: Start project, with 0 duration. It is here that we set the start date of 1st January.
5. The rest of the details can now be added. Notes: (a) only the first starting date is entered by us; (b) the predecessors have to be added using the activity number given in the first column.
To get the different displays: using the View tab, choose the Gantt Chart icon. You will have to increase the size of the right-hand side of the screen to see the whole chart. Using the View tab, choose the Network Diagram icon.
This looks like:
[Screenshot: MS Project network diagram view]

Seminar Activity
Use MS Project to determine the critical path and the project duration for:
a. Klone Computers
b. the Bank Accounting System

Answers
a. Klone Computers: the critical path is A, F, G, D, J; start date = Tuesday 1st January, end date = Friday 27th September (38 weeks and 4 days = 38 × 5 + 4 = 194 working days).
b. Bank Accounting System: the critical path is ADGK; duration = 33 weeks.
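As a cross-check on the MS Project answer for Klone Computers, the schedule can be recomputed directly from the activity table. The sketch below (an illustrative forward/backward pass, independent of MS Project) confirms the 194-day duration and the critical path AFGDJ:

```python
# Cross-check of the Klone Computers schedule from the activity table.
acts = {"A": (90, []), "B": (15, ["A"]), "C": (5, ["B"]), "F": (25, ["A"]),
        "I": (30, ["A"]), "G": (14, ["C", "F"]), "D": (20, ["G"]),
        "E": (21, ["D"]), "H": (28, ["D"]), "J": (45, ["D", "I"])}

est, eft = {}, {}
for a, (dur, preds) in acts.items():           # forward pass; predecessors listed first
    est[a] = max((eft[p] for p in preds), default=0)
    eft[a] = est[a] + dur
end = max(eft.values())                        # total completion time in working days

lst, lft = {}, {}
for a in reversed(list(acts)):                 # backward pass
    succs = [s for s in acts if a in acts[s][1]]
    lft[a] = min((lst[s] for s in succs), default=end)
    lst[a] = lft[a] - acts[a][0]

critical = [a for a in acts if est[a] == lst[a]]
print(end, "".join(critical))                  # 194 AFGDJ
```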