Trend of Saudi Arabia Students Taking Higher Education Abroad

A THESIS SUBMITTED TO THE GRADUATE EDUCATIONAL COUNCIL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE MASTER OF SCIENCE

By Majed Saeed Alghamdi
Advisor: Dr. Rahmatullah Imon

Ball State University
Muncie, Indiana
May 2016

ACKNOWLEDGEMENTS

I would like to express my special appreciation and thanks to my advisor, Professor Dr. Rahmatullah Imon, who has been a tremendous mentor to me, for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout my analysis and the writing of this report. I could not have imagined having a better advisor and mentor for my thesis. I would also like to thank my committee members, Professor Dr. Munni Begum and Dr. Yayuan Xiao, for their encouragement, insightful comments and patience. I am thankful to all my classmates for their kind support. Last but not least, I would like to thank my family, my parents and my brothers and sisters, for supporting me throughout my life.

Majed Alghamdi
May 7, 2016

ABSTRACT

In this study our prime objective was to investigate the trend of Saudi Arabian students who are studying abroad for higher education. We find that student enrolment has been growing almost exponentially over the years.
The most popular programs are Engineering and Medical Science, and the least popular programs are Agriculture and Fine Arts. We also find evidence of gender discrimination against women among the Saudi Arabian students studying abroad. To identify which factors influence the number of students studying abroad, we use regression analysis and find that the budget in higher education and the oil price are the most important variables for explaining student enrolment. Both the regression and the cross validation studies reveal that the robust reweighted least squares (RLS) method fits the data better than the other models and yields better forecasts.

Table of Contents

CHAPTER 1: INTRODUCTION
1.1 Objective of the Study
1.2 Sources of Data
1.3 Methodology
CHAPTER 2: Trend of Saudi Arabia Students Studying Abroad
2.1 Trend Analysis
2.2 Trend Analysis of Nine Major Programs
2.3 Trend Analysis of Some Other Relevant Variables
2.4 Summary Results of Trend Analysis
CHAPTER 3: Comparison between Genders and Different Programs
3.1 Comparison between Genders
3.2 Tests for the Equality of Means between Male and Female Students
3.3 Comparison of the Individual Treatment Means
3.4 Result Summary
CHAPTER 4: Modeling and Fitting of Data Using Regression Diagnostics and Robust Regression
4.1 Classical Regression Analysis
4.2 Regression Diagnostics
4.3 Robust Regression
4.4 Regression Results
4.5 Results Comparisons
CHAPTER 5: Cross Validation of Forecasts
5.1 Evaluation of Forecasts by Cross Validation
5.2 Cross Validation Results
CHAPTER 6: Conclusions and Areas of Further Research
6.1 Conclusions
6.2 Areas of Further Research
References
APPENDIX A
APPENDIX B

List of Tables

Chapter 2
Table 2.1: Trend Summary of the Total Number of Students
Table 2.2: Trend Summary of the Total Number of Social Science Students
Table 2.3: Trend Summary of the Total Number of Natural Science Students
Table 2.4: Trend Summary of the Total Number of Medical Science Students
Table 2.5: Trend Summary of the Total Number of Law Students
Table 2.6: Trend Summary of the Total Number of Humanities Students
Table 2.7: Trend Summary of the Total Number of Fine Arts Students
Table 2.8: Trend Summary of the Total Number of Engineering Students
Table 2.9: Trend Summary of the Total Number of Education Students
Table 2.10: Trend Summary of the Total Number of Agriculture Students
Table 2.11: Trend Summary of Oil Revenue
Table 2.12: Trend Summary of Budget in Higher Education
Table 2.13: Trend Summary of Oil Price
Table 2.14: Trend Summary

Chapter 3
Table 3.1: Summary Test Results for the Equality of Means between Male and Female Students
Table 3.2: Average Number of Students in Different Programs
Table 3.3: ANOVA Table for the Equality of Mean Test of Nine Programs

Chapter 4
Table 4.1: Regression Results Summary

Chapter 5
Table 5.1: Original and Forecasted Values for 2011-2014
Table 5.2: Cross Validation Result Summary

List of Figures

Chapter 2
Figure 2.1: Time Series Plot of the Total Number of Students
Figure 2.2: Trend Analysis of the Total Number of Students
Figure 2.3: Time Series Plot of Total Number of Students in Different Programs
Figure 2.4: Time Series Plot of Total Number of Students (in ln) in Different Programs
Figure 2.5: Trend Analysis Plot of the Total Number of Social Science Students
Figure 2.6: Trend Analysis Plot of the Total Number of Students for Natural Science
Figure 2.7: Trend Analysis Plot of the Total Number of Students for Medical Science
Figure 2.8: Trend Analysis Plot of the Total Number of Students for Law
Figure 2.9: Trend Analysis Plot of the Total Number of Students for Humanities
Figure 2.10: Trend Analysis Plot of the Total Number of Students for Fine Arts
Figure 2.11: Trend Analysis Plot of the Total Number of Students for Engineering
Figure 2.12: Trend Analysis Plot of the Total Number of Students for Education
Figure 2.13: Trend Analysis Plot of the Total Number of Students for Agriculture
Figure 2.14: Time Series Plot of the Budget in Higher Education
Figure 2.15: Time Series Plot of Oil Price
Figure 2.16: Time Series Plot of Oil Revenue
Figure 2.17: Trend Analysis of Oil Revenue
Figure 2.18: Trend Analysis of Budget in Higher Education
Figure 2.19: Trend Analysis of Oil Price

Chapter 3
Figure 3.1: Time Series Plot of Male and Female Students in Social Science
Figure 3.2: Time Series Plot of Male and Female Students in Natural Science
Figure 3.3: Time Series Plot of Male and Female Students in Medical Science
Figure 3.4: Time Series Plot of Male and Female Students in Law
Figure 3.5: Time Series Plot of Male and Female Students in Humanities
Figure 3.6: Time Series Plot of Male and Female Students in Engineering
Figure 3.7: Time Series Plot of Male and Female Students in Education
Figure 3.8: Time Series Plot of Male and Female Students in Fine Arts
Figure 3.9: Time Series Plot of Male and Female Students in Agriculture
Figure 3.10: Box Plot of Number of Students in Different Programs

Chapter 4
Figure 4.1: Scatter Plot of the Total Number of Students vs Budget in Higher Education
Figure 4.2: Scatter Plot of the Total Number of Students vs Oil Price
Figure 4.3: RLS and OLS Fit of the Total Number of Students vs Oil Price
Figure 4.4: Scatter Plot of the Total Number of Students vs Oil Revenue
Figure 4.5: Normal Probability Plot of the Residuals for Model A
Figure 4.6: Normal Probability Plot of the Residuals for Model B
Figure 4.7: Normal Probability Plot of the Residuals for Model C

Chapter 5
Figure 5.1: Scatterplot of RLS, OLS, Exponential Forecasts vs Original Values

CHAPTER 1
INTRODUCTION

As early as the reign of King Abdulaziz, the founding king of Saudi Arabia, students were being sponsored to study abroad. Early programs were limited to Arab countries such as Egypt and Lebanon, where students pursued Arabic and Islamic studies. The number of Saudi Arabian students studying abroad has increased dramatically during the past decade. This explosive growth can be attributed to an educational agreement brokered between former U.S. President George W. Bush and Saudi King Abdullah bin Abdulaziz Al Saud in 2005. The agreement opened the doors for Saudi students to pursue higher education degrees in the U.S., with their government paying all of their educational expenses. As a result, over 100,000 Saudi students were enrolled in American colleges and universities in 2013-14, making Saudi Arabia the fourth largest sponsor of international students to the U.S.

Saudi enrollments overseas have been growing exponentially since the 2005 introduction of the King Abdullah bin Abdulaziz Scholarship Program (KASP). In 2012, the KASP was extended with the aim of helping a further 50,000 Saudis graduate from the world’s top 500 universities by 2020. According to data from the Institute of International Education, in the 2012/13 academic year there were a total of 44,586 tertiary-level Saudi students in the United States, an almost 100 percent increase from 2010/11 and a 12-fold increase from 2005.
The most recent data from the Student and Exchange Visitor Program’s SEVIS database show that there were a total of 70,366 active nonimmigrant Saudi students (including dependents) in the United States in July 2014 on F, J or M visas, compared with 61,944 at the same time in 2013. Saudi government data put the 2013/14 number of Saudi students and dependents in the United States at a significantly larger 106,858, of whom 89,423 were reported to be on government scholarships. The same data show that there were 20,252 students in the United Kingdom, 18,926 in Canada, and 13,002 in Australia, with just under 200,000 Saudi students (75% male) at institutions abroad across the world. By level of study, 120,000 students are at the undergraduate level, 47,500 at the master’s level and 10,400 at the doctoral level.

The KASP will continue to prioritize fields designated as important to developing the Saudi “knowledge economy,” such as medicine, engineering and science. Approximately 70 percent of scholarship students currently study subjects related to Business Administration, Engineering, Information Technology and Medicine. The top fields of study for Saudi students in the United States last year were Intensive English (27.2%), Engineering (21.1%), Business/Management (17.1%), Math and Computer Science (7.4%), and Health Professions (5.6%).

The Saudi government is projected to invest over 10% of its annual budget in higher education for the foreseeable future. Currently it invests nearly $2.4 billion in the KASP initiative annually, which covers academic funding as well as living expenses for over 100,000 students enrolled in graduate and undergraduate programs in the U.S.
If the Saudi government continues to support KASP at the current level, Saudi Arabia will soon surpass South Korea in the number of students it sends abroad to study.

1.1 Objective of the Study

In this study our prime objective was to investigate the trend of Saudi Arabian students who are studying abroad for higher education. We would like to investigate both the overall trend and the trends of individual programs, and to see whether there is any special preference for a particular program. Another point of interest is whether there is any gender discrimination among the students. We would also like to find the factors that most influence the number of students studying abroad. We employ regression analysis for this purpose and check the validity of the model with recent diagnostics. If the conventionally used least squares method fails, we either use robust regression or choose some other model. To confirm which method fits the data best, we apply cross validation.

1.2 Sources of Data

The most important data for this study are the numbers of Saudi Arabian students studying abroad for higher education. This data set is taken from the official website of the Ministry of Higher Education of Saudi Arabia:

https://www.mohe.gov.sa/ar/Ministry/Deputy-Ministry-for-Planning-and-Informationaffairs/HESC/Ehsaat/Pages/default.aspx

We have data for both male and female students in nine programs from 1981 to 2014. The nine programs are Social Science, Natural Science, Medical Science, Law, Humanities, Fine Arts, Engineering, Education, and Agriculture.

We believe that the Budget in Higher Education is a key factor for understanding the number of Saudi Arabian students studying abroad. The Budget in Higher Education data set from 1981 to 2014 is taken from the official website of the Ministry of Finance of Saudi Arabia.
Here is the link to the data:

https://www.mof.gov.sa/english/DownloadsCenter/Pages/Budget.aspx

Saudi Arabia relies heavily on oil, so we expect Oil Revenue and Oil Price to be very important variables for our study. We collected these data for 1981-2014 from the official website of the Saudi Arabian Monetary Agency (SAMA). Here is the link to the data:

http://www.sama.gov.sa/en-US/EconomicReports/Pages/YearlyStatistics.aspx

All these data are presented in Appendix A of this thesis.

1.3 Methodology

In this study we have employed a number of modern and sophisticated statistical techniques. We have used linear, quadratic and exponential trend models to investigate both the overall trend and the trends of individual programs. We have used experimental design techniques to see whether there is any special preference for a particular program and to investigate whether there is any gender discrimination among the students; we employ Fisher’s LSD and Tukey’s test for the comparison of individual means. To find the factors that most influence the number of students studying abroad, we employ recent diagnostics such as the Jarque-Bera and Rescaled Moments tests for normality and the robust reweighted least squares (RLS) technique for regression analysis. Finally, we employ a cross validation study based on the mean squared percentage error (MSPE) to confirm which method fits the data best.

CHAPTER 2
Trend of Saudi Arabia Students Studying Abroad

In this chapter we introduce the different time series models that we use in our study, together with their estimation procedures and properties. Excellent reviews of different aspects of time series models and their estimation are available in Pindyck and Rubinfeld (1998), Bowerman et al. (2005), and Montgomery et al. (2008). A time series is a chronological sequence of observations on a particular variable.
A time series model accounts for patterns in the past movement of a variable and uses that information to predict its future movements; in that sense it is a sophisticated method of extrapolating data. There are two different approaches to modeling time series data: deterministic and stochastic.

2.1 Trend Analysis

We begin with simple models that can be used to forecast a time series on the basis of its past behavior. Most of the series we encounter are not continuous in time; instead, they consist of discrete observations made at regular intervals of time. We denote the values of a time series by $\{y_t\}$, $t = 1, 2, \ldots, T$. Our objective is to model the series $y_t$ and use that model to forecast $y_t$ beyond the last observation $y_T$. We denote the forecast $l$ periods ahead by $\hat{y}_{T+l}$. We can sometimes describe a time series $y_t$ by a trend model defined as

$$y_t = TR_t + \varepsilon_t \tag{2.1}$$

where $TR_t$ is the trend in time period $t$ and $\varepsilon_t$ is the error term.

2.1.1 Linear Trend Model

$$TR_t = \beta_0 + \beta_1 t \tag{2.2}$$

We can predict $y_t$ by

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 t \tag{2.3}$$

Then the forecast $l$ periods ahead is given by

$$\hat{y}_{T+l} = \hat{\beta}_0 + \hat{\beta}_1 (T+l) \tag{2.4}$$

For this particular model the distance value is

$$DV = \frac{1}{T} + \frac{(T + l - \bar{t})^2}{\sum_{t=1}^{T} (t - \bar{t})^2}$$

Hence the $100(1-\alpha)\%$ prediction interval for an individual value of the dependent variable is

$$\hat{y}_{T+l} \pm t_{[T-2,\ \alpha/2]}\, s \sqrt{1 + DV}$$

2.1.2 Polynomial Trend Model of Order p

$$TR_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p \tag{2.5}$$

If the number of observations is not too large, we can predict $y_t$ by

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2 + \cdots + \hat{\beta}_p t^p \tag{2.6}$$

Then the forecast $l$ periods ahead is given by

$$\hat{y}_{T+l} = \hat{\beta}_0 + \hat{\beta}_1 (T+l) + \hat{\beta}_2 (T+l)^2 + \cdots + \hat{\beta}_p (T+l)^p \tag{2.7}$$

The $100(1-\alpha)\%$ prediction interval for an individual value of the dependent variable is

$$\hat{y}_{T+l} \pm t_{[T-p-1,\ \alpha/2]}\, s \sqrt{1 + DV} \tag{2.8}$$

Quadratic Trend Model: This is a special case of the polynomial trend model with order p = 2.
Hence from the above results we have

$$TR_t = \beta_0 + \beta_1 t + \beta_2 t^2 \tag{2.9}$$

If the number of observations is not too large, we can predict $y_t$ by

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2 \tag{2.10}$$

Then the forecast $l$ periods ahead is given by

$$\hat{y}_{T+l} = \hat{\beta}_0 + \hat{\beta}_1 (T+l) + \hat{\beta}_2 (T+l)^2 \tag{2.11}$$

The $100(1-\alpha)\%$ prediction interval for an individual value of the dependent variable is

$$\hat{y}_{T+l} \pm t_{[T-3,\ \alpha/2]}\, s \sqrt{1 + DV} \tag{2.12}$$

2.1.3 Comparisons of Different Methods

Minitab computes three measures of accuracy of the fitted model, MAPE, MAD, and MSD, for each of the simple forecasting and smoothing methods. For all three measures, the smaller the value, the better the fit of the model. We use these statistics to compare the fits of the different methods.

MAPE, or Mean Absolute Percentage Error, measures the accuracy of fitted time series values. It expresses accuracy as a percentage.

$$\text{MAPE} = \frac{\sum_{t=1}^{T} \left| (y_t - \hat{y}_t)/y_t \right|}{T} \times 100 \tag{2.13}$$

where $y_t$ equals the actual value, $\hat{y}_t$ equals the fitted value, and $T$ equals the number of observations.

MAD, which stands for Mean Absolute Deviation, measures the accuracy of fitted time series values. It expresses accuracy in the same units as the data, which helps conceptualize the amount of error.

$$\text{MAD} = \frac{\sum_{t=1}^{T} \left| y_t - \hat{y}_t \right|}{T} \tag{2.14}$$

with $y_t$, $\hat{y}_t$ and $T$ as defined for (2.13).

MSD stands for Mean Squared Deviation. MSD is always computed using the same denominator, $T$, regardless of the model, so MSD values can be compared across models. MSD is a more sensitive measure of an unusually large forecast error than MAD.

$$\text{MSD} = \frac{\sum_{t=1}^{T} (y_t - \hat{y}_t)^2}{T} \tag{2.15}$$

with $y_t$, $\hat{y}_t$ and $T$ as defined for (2.13).

2.1.4 Exponential Smoothing

Exponential smoothing provides a forecasting method that is most effective when the components of the time series may be changing over time. It is often more reasonable to have more recent values of $y_t$ play a greater role than earlier values.
In such a case recent values should be weighted more heavily in the moving average.

Suppose that the time series $y_t$ has a level (or mean) that may slowly change over time but has no trend or seasonal pattern. This series can be described as

$$y_t = \beta_0 + \varepsilon_t \tag{2.16}$$

Then the estimate $\ell_T$ for the level of the series in time period $T$ is given by the smoothing equation

$$\ell_T = \alpha y_T + (1-\alpha)\,\ell_{T-1} \tag{2.17}$$

where $\alpha$ is a smoothing constant between 0 and 1, and $\ell_{T-1}$ is the estimate of the level in time period $T-1$. A point forecast for one period ahead is given by

$$\hat{y}_{T+1} = \ell_T \tag{2.18}$$

which implies

$$\hat{y}_{T+1} = \alpha y_T + \alpha(1-\alpha)\, y_{T-1} + \alpha(1-\alpha)^2\, y_{T-2} + \cdots + (1-\alpha)^T \ell_0 \tag{2.19}$$

It is easy to show that the $l$-period forecast $\hat{y}_{T+l}$ is given by the same quantity,

$$\hat{y}_{T+l} = \ell_T, \quad l = 1, 2, \ldots \tag{2.20}$$

There are several methods for choosing an appropriate value of $\alpha$. The most popular is to choose the $\alpha$ that minimizes the mean squared deviation (MSD) between the actual and forecasted values. Other measures of accuracy are the mean absolute percentage error (MAPE) and the mean absolute deviation (MAD).

2.2 Trend Analysis of Nine Major Programs

In this section we investigate the trend of the total number of students studying abroad in nine major programs. For each program we consider three different trend models: linear, quadratic, and exponential. We also compute MAPE, MAD and MSD to evaluate which model fits the data better.

2.2.1 All Programs

First we consider the total number of students studying abroad in all programs. Figure 2.1 gives the time series plot of the total number of students from 1980 to 2014. From this figure it is clear that the number of students studying abroad has an increasing trend. The increase appears to be exponential rather than linear.

Figure 2.1: Time Series Plot of the Total Number of Students (x-axis: Year; y-axis: Total)

We now fit these data with three trend models, linear, quadratic and exponential; the graphs are presented in Figure 2.2.

Figure 2.2: Trend Analysis of the Total Number of Students (three panels, each plotting Actual and Fits for Total against Index)
Linear Trend Model: Yt = -17044 + 2139*t (MAPE 208, MAD 17431, MSD 439097288)
Quadratic Trend Model: Yt = 27234 - 5241*t + 210.8*t**2 (MAPE 119, MAD 8259, MSD 110484665)
Growth Curve Model: Yt = 2054.90 * (1.0895**t) (MAPE 77, MAD 12364, MSD 505301336)

From Figure 2.2 it is clear that the number of Saudi Arabian students studying abroad has an increasing trend, and it seems that an exponential model may fit the data better. But graphical summaries are very subjective in nature, so for more convincing conclusions we need to look at numerical quantities. The following table gives a summary result to compare the three trend models.

Table 2.1: Trend Summary of the Total Number of Students

Model         MAPE    MAD      MSD
Linear        208     17431    439097288
Quadratic     119     8259     110484665
Exponential   77      12364    505301336

Results presented in Table 2.1 show that the quadratic trend model has the smallest MAD and MSD, whereas the exponential trend model has by far the smallest MAPE. In terms of MAPE the exponential trend model is therefore better than the other two models.
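The comparison summarized in Table 2.1 can be sketched in code. The following is a minimal illustration, not the Minitab computation used in this thesis: it fits the three trend models by least squares (the growth curve through a regression of ln(y_t) on t, which is an assumption about how such a curve is estimated) and evaluates MAPE, MAD and MSD as defined in (2.13)-(2.15). The series `y` is hypothetical and is not the enrolment data, and the function names are illustrative only.

```python
import numpy as np


def accuracy_measures(y, fitted):
    """Return (MAPE, MAD, MSD) following equations (2.13)-(2.15)."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    err = y - fitted
    mape = np.mean(np.abs(err / y)) * 100.0  # percentage error, needs y != 0
    mad = np.mean(np.abs(err))               # same units as the data
    msd = np.mean(err ** 2)                  # squared units, denominator T
    return mape, mad, msd


def fit_trends(y):
    """Fit linear, quadratic and growth curve trends to y_1, ..., y_T."""
    y = np.asarray(y, dtype=float)
    t = np.arange(1, len(y) + 1)
    linear = np.polyval(np.polyfit(t, y, 1), t)
    quadratic = np.polyval(np.polyfit(t, y, 2), t)
    # Growth curve y_t = a * b**t, estimated here by least squares on ln(y_t).
    # This estimation route is an assumption of the sketch.
    slope, intercept = np.polyfit(t, np.log(y), 1)
    exponential = np.exp(intercept + slope * t)
    return {"Linear": linear, "Quadratic": quadratic, "Exponential": exponential}


# Hypothetical series with roughly 9% growth per period (NOT the thesis data).
t = np.arange(1, 35)
y = 2000.0 * 1.09 ** t * (1.0 + 0.1 * np.sin(t))

summary = {name: accuracy_measures(y, fit) for name, fit in fit_trends(y).items()}
for name, (mape, mad, msd) in summary.items():
    print(f"{name:<12} MAPE={mape:8.1f}  MAD={mad:10.1f}  MSD={msd:14.1f}")
```

MAPE weights each error relative to the size of the observation, while MAD and MSD are dominated by errors in the largest observations, so the three measures need not rank the models in the same order.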
Figure 2.3: Time Series Plot of Total Number of Students in Different Programs (x-axis: Year; y-axis: number of students; one series each for Agriculture, Education, Engineering, Fine Arts, Humanities, Law, Medical Science, Natural Science and Social Science)

Figure 2.3 shows that the number of Saudi Arabian students studying abroad in each program has an overall increasing trend. But there are huge differences in the numbers of students, so when the programs are plotted together some of them cannot be distinguished at all. As a remedy we plot the same graph on a natural log scale; the result is presented in Figure 2.4.

Figure 2.4: Time Series Plot of Total Number of Students (in ln) in Different Programs (same series as Figure 2.3, plotted on a natural log scale)

It is clear from this figure that the number of students differs substantially from one program to another. The programs with the highest enrolment are Engineering, Natural Science, Medical Science and Social Science, although the number of students in Social Science dropped in the last few years. The programs with relatively few students are Agriculture and Fine Arts.

Now we will investigate trend models for nine separate programs.
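Before turning to the individual programs, it may help to note why a log scale is informative for these series: a growth curve Y_t = a * b^t becomes the straight line ln Y_t = ln a + t ln b, so b - 1 is the growth per period and ln 2 / ln b is the doubling time. The short sketch below applies this to the growth curve reported for the total number of students in Figure 2.2 (a = 2054.90, b = 1.0895). The function name and the dictionary keys are illustrative choices, not part of the original analysis.

```python
import math


def growth_summary(a, b):
    """Interpret a growth curve Y_t = a * b**t (t measured in years)."""
    return {
        "log_intercept": math.log(a),            # intercept of ln(Y_t) against t
        "log_slope": math.log(b),                # slope of ln(Y_t) against t
        "annual_growth_pct": (b - 1.0) * 100.0,  # percentage growth per year
        "doubling_time_years": math.log(2.0) / math.log(b),
    }


# Coefficients of the growth curve reported for the total number of students.
total = growth_summary(2054.90, 1.0895)
for key, value in total.items():
    print(f"{key}: {value:.4f}")
```

Under that fitted curve the total grows by about 8.95% per year and doubles roughly every eight years. These figures are properties of the fitted model rather than of the raw data.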
2.2.2 Social Sciences

Among the nine programs we first consider the total number of students studying abroad in the Social Science program. Figure 2.5 gives linear, quadratic and exponential trend fits for the Social Science program. From the figure it is clear that the number of students studying abroad in the Social Science program shows an increasing trend, and it seems that an exponential model may fit the data. Table 2.2 gives a summary result to compare the three trend models.

Figure 2.5: Trend Analysis Plot of the Total Number of Social Science Students (three panels, each plotting Actual and Fits for Total against Index)
Linear Trend Model: Yt = -1599 + 289*t (MAPE 234, MAD 3537, MSD 34029670)
Quadratic Trend Model: Yt = 3155 - 503*t + 22.6*t**2 (MAPE 112, MAD 2595, MSD 30241525)
Growth Curve Model: Yt = 658.094 * (1.0530**t) (MAPE 93, MAD 2552, MSD 39771799)

Table 2.2: Trend Summary of the Total Number of Social Science Students

Model         MAPE    MAD     MSD
Linear        234     3537    34029670
Quadratic     112     2595    30241525
Exponential   93      2552    39771799

Results presented in Table 2.2 show that the exponential trend model has the smallest MAPE and MAD, although the quadratic model has the smallest MSD. On balance the exponential trend model fits the data better than the other two models.

2.2.3 Natural Sciences

Our next example is the total number of students studying abroad in the Natural Science program. Figure 2.6 gives linear, quadratic and exponential trend fits for the Natural Science program.
From the figure it is clear that the number of students studying abroad in the Natural Science program has an increasing trend and that an exponential model may fit the data better.

Figure 2.6: Trend Analysis Plot of the Total Number of Students for Natural Science (three panels, each plotting Actual and Fits for Total against Index)
Linear Trend Model: Yt = -4613 + 508*t (MAPE 278, MAD 4086, MSD 27110563)
Quadratic Trend Model: Yt = 5952 - 1252*t + 50.31*t**2 (MAPE 193, MAD 2392, MSD 8401020)
Growth Curve Model: Yt = 279.595 * (1.1053**t) (MAPE 72, MAD 2666, MSD 30860217)

Table 2.3: Trend Summary of the Total Number of Natural Science Students

Model         MAPE    MAD     MSD
Linear        278     4086    27110563
Quadratic     193     2392    8401020
Exponential   72      2666    30860217

Results presented in Table 2.3 show that the exponential trend model has by far the smallest MAPE, while the quadratic model has the smallest MAD and MSD. In terms of MAPE the exponential trend model fits the data better than the other two models.

2.2.4 Medical Science

Our next example is the total number of students studying abroad in the Medical Science program. Figure 2.7 gives linear, quadratic and exponential trend fits of these data. From the figure it is clear that the number of students studying abroad in the Medical Science program has an increasing trend and that an exponential model may fit the data better.
[Figure 2.7: Trend Analysis Plot of the Total Number of Students for Medical Science (Total against Index). Fitted models: linear, Yt = -4742 + 528*t; quadratic, Yt = 5652 - 1205*t + 49.50*t**2; growth curve, Yt = 259.904 * (1.1148**t).]

Table 2.4: Trend Summary of the Total Number of Medical Science Students

Model         MAPE    MAD     MSD
Linear        249     4015    25461692
Quadratic     165     2250     7351186
Exponential    61     2408    25015184

The results in Table 2.4 show that the exponential trend model has the smallest MAPE, while the quadratic model has the smallest MAD and MSD. Judged by MAPE, the exponential model fits the data better than the other two.

2.2.5 Law

Here we consider the total number of students studying abroad in the Law program. Figure 2.8 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Law program has an increasing trend, and an exponential model may fit the data better.

Figure 2.8: Trend Analysis Plot of the Total Number of Students for Law

Table 2.5: Trend Summary of the Total Number of Law Students

Model         MAPE    MAD    MSD
Linear        563     657    644213
Quadratic     357     338    174624
Exponential    96     419    755189

The results in Table 2.5 show that the exponential trend model has the smallest MAPE, while the quadratic model has the smallest MAD and MSD. Judged by MAPE, the exponential model fits the data better than the other two.

2.2.6 Humanities

Now we consider the total number of students studying abroad in the Humanities program.
Figure 2.9 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Humanities program has an increasing trend. We also observe from this plot that both the quadratic and the exponential model fit the data adequately.

Figure 2.9: Trend Analysis Plot of the Total Number of Students for Humanities

Table 2.6: Trend Summary of the Total Number of Humanities Students

Model         MAPE    MAD     MSD
Linear        167     1179    2573862
Quadratic      58      752    1348197
Exponential    87      880    2475024

The results in Table 2.6 show that the quadratic trend model has the smallest value of all three accuracy measures and so fits the data better than the other two models.

2.2.7 Fine Arts

Now we consider the total number of students studying abroad in the Fine Arts program. Figure 2.10 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Fine Arts program has an increasing trend, and an exponential model may fit the data better.

Figure 2.10: Trend Analysis Plot of the Total Number of Students for Fine Arts

Table 2.7: Trend Summary of the Total Number of Fine Arts Students

Model         MAPE     MAD      MSD
Linear        224.2    194.6    71151.6
Quadratic     180.2    132.1    29439.3
Exponential    69.5    126.9    84233.2

The results in Table 2.7 show that the exponential trend model has the smallest MAPE and MAD, although its MSD is the largest of the three. On balance we take the exponential model to fit the data better than the other two.

2.2.8 Engineering

Now we consider the total number of students studying abroad in the Engineering program. Figure 2.11 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Engineering program has an increasing trend. We also observe from this plot that an exponential model may fit the data better.
Figure 2.11: Trend Analysis Plot of the Total Number of Students for Engineering

Table 2.8: Trend Summary of the Total Number of Engineering Students

Model         MAPE    MAD     MSD
Linear        397     4738    36869030
Quadratic     258     2724    11068847
Exponential   119     3466    50802116

The results in Table 2.8 show that the exponential trend model has the smallest MAPE, while the quadratic model has the smallest MAD and MSD. Judged by MAPE, the exponential model fits the data better than the other two.

2.2.9 Education

Now we consider the total number of students studying abroad in the Education program. Figure 2.12 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Education program has an increasing trend. We also observe from this plot that both the quadratic and the exponential model fit the data adequately.

Figure 2.12: Trend Analysis Plot of the Total Number of Students for Education

Table 2.9: Trend Summary of the Total Number of Education Students

Model         MAPE    MAD    MSD
Linear        134     577    464455
Quadratic      48     301    214264
Exponential    82     506    523959

The results in Table 2.9 show that the quadratic trend model has the smallest value of all three accuracy measures and so fits the data better than the other two models.

2.2.10 Agriculture

Finally we consider the total number of students studying abroad in Agriculture. Figure 2.13 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the number of students studying abroad in the Agriculture program has an increasing trend. We also observe from this plot that both the quadratic and the exponential model fit the data adequately.

Figure 2.13: Trend Analysis Plot of the Total Number of Students for Agriculture

Table 2.10: Trend Summary of the Total Number of Agriculture Students

Model         MAPE      MAD       MSD
Linear        36.53     26.68     1190.99
Quadratic     28.773    20.265     610.926
Exponential   33.25     25.90     1214.57

The results in Table 2.10 show that the quadratic trend model has the smallest value of all three accuracy measures and so fits the data better than the other two models.
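The comparisons in Section 2.2 all rest on the same recipe: fit a linear, a quadratic and an exponential (growth curve) trend to a yearly series and compare MAPE, MAD and MSD. The following is a minimal sketch of that recipe in Python with NumPy, not the MINITAB procedure used in this thesis. It assumes the exponential trend is fitted by regressing log(Y) on t, and that the accuracy measures have their usual definitions (mean absolute percentage error, mean absolute deviation, mean squared deviation). The function names and the demonstration series are hypothetical and are not the thesis data.

```python
import numpy as np

def accuracy(y, fit):
    """Return (MAPE, MAD, MSD) for observed values y and fitted values fit."""
    err = y - fit
    mape = 100.0 * np.mean(np.abs(err / y))  # assumes no observed value is zero
    mad = np.mean(np.abs(err))
    msd = np.mean(err ** 2)
    return mape, mad, msd

def trend_fits(y):
    """Fit three trend models on t = 1..n and return their accuracy measures."""
    y = np.asarray(y, dtype=float)
    t = np.arange(1, len(y) + 1, dtype=float)
    linear = np.polyval(np.polyfit(t, y, 1), t)
    quadratic = np.polyval(np.polyfit(t, y, 2), t)
    # Assumption: growth curve Yt = b0 * b1**t fitted on the log scale (needs y > 0).
    slope, intercept = np.polyfit(t, np.log(y), 1)
    exponential = np.exp(intercept + slope * t)
    return {
        "Linear": accuracy(y, linear),
        "Quadratic": accuracy(y, quadratic),
        "Exponential": accuracy(y, exponential),
    }

if __name__ == "__main__":
    # Hypothetical series for illustration only.
    demo = [120, 150, 160, 210, 260, 300, 390, 480, 610, 790]
    for model, (mape, mad, msd) in trend_fits(demo).items():
        print(f"{model:12s} MAPE={mape:8.2f} MAD={mad:8.2f} MSD={msd:10.2f}")
```

Because MAPE weights each error by the size of the observation while MSD is dominated by the largest errors, the three measures can rank the models differently, which is the situation seen in several of the tables above.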
2.3 Trend Analysis of Some Other Relevant Variables

Here we consider some other variables which we believe may have a significant impact on the number of students studying abroad: the budget in higher education, oil price and oil revenue. Oil is the key factor in the Saudi Arabian economy, so oil price and oil revenue should affect almost all major policies of the government. We first look at the trend of these variables. Time series plots of the three variables are presented in Figures 2.14 to 2.16.

[Figure 2.14: Time Series Plot of the Budget in Higher Education (budget against Year, 1981 to 2011)]

We observe from Figure 2.14 that the budget in higher education has grown steadily over the years and clearly shows an increasing trend. Oil price (Figure 2.15) dropped once but recovered later and thus shows an upward trend overall. Oil revenue (Figure 2.16) also shows an increasing pattern.

[Figure 2.15: Time Series Plot of Oil Price (oil price against Year, 1981 to 2011)]

[Figure 2.16: Time Series Plot of Oil Revenue (oil revenue against Year, 1981 to 2011)]

Now we fit each of these three variables by three different trend models.

2.3.1 Oil Revenue

We first consider oil revenue over the years. Figure 2.17 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that oil revenue has an increasing trend. We also observe from this plot that both the quadratic and the exponential model fit the data adequately.
[Figure 2.17: Trend Analysis of Oil Revenue (Oil Revenue against Index). Fitted models: linear, Yt = -127953 + 26267*t; quadratic, Yt = 294817 - 44194*t + 2013*t**2; growth curve, Yt = 55445.0 * (1.0792**t).]

Table 2.11: Trend Summary of Oil Revenue

Model         MAPE           MAD            MSD
Linear        8.35439E+01    1.64688E+05    4.23297E+10
Quadratic     3.55741E+01    7.94309E+04    1.23704E+10
Exponential   4.79737E+01    1.26068E+05    3.40103E+10

The results in Table 2.11 show that the quadratic trend model has the smallest value of all three accuracy measures and so fits the data better than the other two models.

2.3.2 Budget in Higher Education

Next we consider the budget in higher education. Figure 2.18 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that the budget in higher education shows an increasing trend. We also observe from this plot that both the quadratic and the exponential model fit the data adequately.
[Figure 2.18: Trend Analysis of Budget in Higher Education (budget against Index). Fitted models: linear, Yt = -38718871627 + 5497524487*t; quadratic, Yt = 12499933066 - 3038942962*t + 243899070*t**2; growth curve, Yt = 102994932 * (1.3105**t).]

Table 2.12: Trend Summary of Budget in Higher Education

Model         MAPE           MAD            MSD
Linear        5.86496E+04    1.89828E+10    5.58537E+20
Quadratic     1.64690E+04    8.29748E+09    1.18811E+20
Exponential   5.35190E+02    8.71668E+10    3.87341E+22

The results in Table 2.12 show that the exponential trend model has the smallest MAPE by a wide margin, whereas the quadratic model has the smallest MAD and MSD. Judged by MAPE, the exponential model fits the data better than the other two.

2.3.3 Oil Price

Next we consider oil price. Figure 2.19 gives linear, quadratic and exponential trend fits of this data. From the figure it is clear that oil price shows an increasing trend. We also observe from this plot that both the quadratic and the exponential model fit the data adequately.
[Figure 2.19: Trend Analysis of Oil Price (Oil Price against Index). Fitted models: linear, Yt = 30.67 + 0.877*t; quadratic, Yt = 82.96 - 7.838*t + 0.2490*t**2; growth curve, Yt = 28.291 * (1.01911**t).]

Table 2.13: Trend Summary of Oil Price

Model         MAPE       MAD       MSD
Linear        59.160     20.980    554.086
Quadratic     18.8959     7.6090    95.7177
Exponential   48.344     19.889    565.457

The results in Table 2.13 show that the quadratic trend model has the smallest value of all three accuracy measures and so fits the data better than the other two models.

2.4 Summary Results of Trend Analysis

In this section we summarize the above trend results. Altogether we have considered 13 variables. Table 2.14 gives a quick view of which model is appropriate for which variable; the entries for Engineering and Education follow the conclusions drawn from Tables 2.8 and 2.9.

Table 2.14: Trend Summary

Variable                        Model         Direction
Total Number of Students        Exponential   Increasing
Students in Social Science      Exponential   Increasing
Students in Natural Science     Exponential   Increasing
Students in Medical Science     Exponential   Increasing
Students in Law                 Exponential   Increasing
Students in Humanities          Quadratic     Increasing
Students in Fine Arts           Exponential   Increasing
Students in Engineering         Exponential   Increasing
Students in Education           Quadratic     Increasing
Students in Agriculture         Quadratic     Increasing
Oil Revenue                     Quadratic     Increasing
Budget in Higher Education      Exponential   Increasing
Oil Price                       Quadratic     Increasing

The above results show that, out of 13 variables, not a single one is best fitted by a linear trend model.
For most of the variables the quadratic and exponential models perform similarly, but in 8 cases the exponential model fits the data better and in the remaining 5 cases the quadratic model performs better. All of the variables show an increasing trend.

CHAPTER 3

Comparison between Genders and Different Programs

We have separate information on male and female Saudi Arabian students who are studying abroad. In this chapter we examine whether there is any gender discrimination. We also examine whether the numbers of students in the different programs differ significantly.

3.1 Comparison between Genders

We first investigate whether there is any gender discrimination by looking at the number of male and female students in the different programs.

3.1.1 Social Science

Figure 3.1 gives a time series plot of the number of male and female students in the Social Science program.

Figure 3.1: Time Series Plot of Male and Female Students in Social Science

It is clear from this figure that the number of male students is consistently higher, and the gap has become very wide in recent years.

3.1.2 Natural Science

Figure 3.2 gives a time series plot of the number of male and female students in the Natural Science program.

Figure 3.2: Time Series Plot of Male and Female Students in Natural Science

It is clear from this figure that the number of male students is consistently higher, and the gap has become very wide in recent years.

3.1.3 Medical Science

Figure 3.3 gives a time series plot of the number of male and female students in the Medical Science program.

Figure 3.3: Time Series Plot of Male and Female Students in Medical Science

It is clear from this figure that the number of male students is consistently higher, and the gap has become very wide in recent years.

3.1.4 Law

Figure 3.4 gives a time series plot of the number of male and female students in the Law program.
Figure 3.4: Time Series Plot of Male and Female Students in Law

It is clear from this figure that the number of male students is consistently higher, and the gap has become very wide in recent years.

3.1.5 Humanities

Figure 3.5 gives a time series plot of the number of male and female students in the Humanities program.

Figure 3.5: Time Series Plot of Male and Female Students in Humanities

It is clear from this figure that the number of female students was higher initially. The gap between male and female students then narrowed. In recent years, however, the number of male students has increased and it now exceeds the number of female students.

3.1.6 Engineering

Figure 3.6 gives a time series plot of the number of male and female students in the Engineering program.

Figure 3.6: Time Series Plot of Male and Female Students in Engineering

It is clear from this figure that the number of male students is consistently higher, and the gap has widened sharply in recent years.

3.1.7 Education

Figure 3.7 gives a time series plot of the number of male and female students in the Education program.

Figure 3.7: Time Series Plot of Male and Female Students in Education

It is clear from this figure that the number of male students was higher earlier, but the gap narrowed and the number of female students has now overtaken the number of male students.

3.1.8 Fine Arts

Figure 3.8 gives a time series plot of the number of male and female students in the Fine Arts program.

Figure 3.8: Time Series Plot of Male and Female Students in Fine Arts

This is probably the only program where the number of female students is consistently higher than the number of male students, and the gap has widened in recent years.

3.1.9 Agriculture

Figure 3.9 gives a time series plot of the number of male and female students in the Agriculture program.

Figure 3.9: Time Series Plot of Male and Female Students in Agriculture

Figure 3.9 shows that the number of male students was much higher in the earlier years.
The gap narrowed gradually, but the number of male students remains consistently higher than the number of female students.

3.2 Tests for the Equality of Means between Male and Female Students

In the previous section we saw that in almost every program the number of male students is higher than the number of female students. Since graphs are subjective, here we test the difference between the mean numbers of male and female students. Let us denote the number of male students by X and the number of female students by Y, with n and m observations respectively. We are interested in testing the hypothesis

$$H_0 : \mu_X = \mu_Y \quad \text{against} \quad H_1 : \mu_X \neq \mu_Y .$$

Under $H_0$, the test statistic becomes

$$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2 / n + \sigma_Y^2 / m}} .$$

Assuming further normality and large sample sizes, the critical region for the test becomes

$$|\bar{x} - \bar{y}| \geq z_{\alpha/2} \sqrt{S_X^2 / n + S_Y^2 / m} .$$

We test the equality of the means of male and female students for all nine programs and the results are presented below. We present the average number of male and female students, the z-value and its corresponding p-value, whether the difference is significant or not and, if so, towards which gender it is biased. Here * stands for significant at the 10% level, ** for significant at the 5% level and *** for significant at the 1% level.

Table 3.1: Summary Test Results for the Equality of Means between Male and Female Students

Program           Male (Ave)   Female (Ave)   z-value   p-value   Difference       Biased to
Social Science    2737         722             2.20     0.032     **Significant    Male
Natural Science   3146         1137            2.09     0.040     **Significant    Male
Medical Science   3102         1388            1.86     0.068     *Significant     Male
Law               546          109             2.65     0.010     **Significant    Male
Humanities        890          957            -0.25     0.807     Insignificant
Fine Arts         57           127            -1.57     0.121     Insignificant
Engineering       4374         150             3.17     0.002     ***Significant   Male
Education         421          438            -0.14     0.887     Insignificant
Agriculture       79.6         5.74           11.40     0.000     ***Significant   Male

It is clear from this table that the number of male students is significantly higher than the number of female students in 6 out of 9 programs.
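The test of Section 3.2 can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the software used for Table 3.1: the sample variances replace the unknown population variances, the two-sided p-value is taken from the standard normal distribution, and the function names `two_sample_z` and `stars` are hypothetical. Statistical packages may base the p-value on a t distribution instead, so reported values can differ slightly from this sketch.

```python
import math

def two_sample_z(x, y):
    """Large-sample z statistic and two-sided p-value for H0: mean(x) == mean(y)."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    s2x = sum((v - xbar) ** 2 for v in x) / (n - 1)  # sample variance of x
    s2y = sum((v - ybar) ** 2 for v in y) / (m - 1)  # sample variance of y
    z = (xbar - ybar) / math.sqrt(s2x / n + s2y / m)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

def stars(p):
    """Significance marker in the convention of Table 3.1."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""

if __name__ == "__main__":
    # Hypothetical yearly counts, for illustration only.
    male = [310, 420, 515, 640, 820, 990]
    female = [120, 150, 170, 230, 260, 300]
    z, p = two_sample_z(male, female)
    print(f"z = {z:.2f}, p = {p:.3f} {stars(p)}")
```

A positive z indicates a larger male average and a negative z a larger female average, which is how the "Biased to" column is read.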
Female students outnumber male students in only three programs (Humanities, Fine Arts and Education), and those differences are not statistically significant. So we can say that male students are in a more advantageous position than female students.

3.2.1 Comparison among All Programs

Now we would like to see whether there is any difference among the numbers of students studying in the different programs.

Table 3.2: Average Number of Students in Different Programs

Program           Average Number of Students
Social Science    3459
Natural Science   4284
Medical Science   4490
Law               655
Humanities        1847
Fine Arts         184.6
Engineering       4524
Education         859
Agriculture       85.32

[Figure 3.10: Box Plot of Number of Students in Different Programs (Agriculture, Education, Engineering, Fine Arts, Humanities, Law, Medical Science, Natural Science, Social Science)]

The above table and figure clearly show differences in the average number of students, but we also need to know whether these differences are statistically significant.

3.2.2 Tests for the Equality of Means among All Programs

Frequently, experimenters want to compare more than two groups. We will compare the means of m normal distributions under the assumption that the variances are all the same. Let us consider m normal distributions with unknown means $\mu_1, \mu_2, \ldots, \mu_m$ and an unknown but common variance $\sigma^2$. We wish to test the null hypothesis

$$H_0 : \mu_1 = \mu_2 = \cdots = \mu_m .$$

The observations and their means can be laid out as

$$
\begin{array}{cccccc|c}
X_{11} & X_{12} & \cdots & X_{1j} & \cdots & X_{1n_1} & \bar{X}_{1.} \\
X_{21} & X_{22} & \cdots & X_{2j} & \cdots & X_{2n_2} & \bar{X}_{2.} \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
X_{i1} & X_{i2} & \cdots & X_{ij} & \cdots & X_{in_i} & \bar{X}_{i.} \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
X_{m1} & X_{m2} & \cdots & X_{mj} & \cdots & X_{mn_m} & \bar{X}_{m.} \\
\hline
 & & & & & & \bar{X}_{..}
\end{array}
$$

The i-th group mean is

$$\bar{X}_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}, \quad i = 1, 2, \ldots, m,$$

and the grand mean is

$$\bar{X}_{..} = \frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n_i} X_{ij} = \frac{1}{n} \sum_{i=1}^{m} n_i \bar{X}_{i.}, \quad \text{where } n = n_1 + n_2 + \cdots + n_m .$$

To determine a critical region for a test of $H_0$, we partition the total sum of squares as

$$SS(TO) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{..})^2 = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left[ (X_{ij} - \bar{X}_{i.}) + (\bar{X}_{i.} - \bar{X}_{..}) \right]^2 = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2 + \sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2 ,$$

since the cross-product term vanishes. Let

$$\sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2 = SS(\text{Programs}),$$

the sum of squares among the different programs, and

$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2 = SS(\text{Error}),$$

the sum of squares within programs (often called the error sum of squares). It is easy to show that, under $H_0$,

$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{..})^2 / \sigma^2 \sim \chi^2_{n-1} \quad \text{and} \quad \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2 / \sigma^2 \sim \chi^2_{n_i - 1} .$$

Hence,

$$\sum_{i=1}^{m} n_i (\bar{X}_{i.} - \bar{X}_{..})^2 / \sigma^2 \sim \chi^2_{m-1} \quad \text{and} \quad \sum_{i=1}^{m} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})^2 / \sigma^2 \sim \chi^2_{n-m} .$$

Thus

$$F = \frac{SS(\text{Programs}) / (m - 1)}{SS(\text{Error}) / (n - m)} \sim F_{m-1,\, n-m} .$$

The information used for the tests of the equality of several means is often summarized in an analysis of variance (ANOVA) table.

Source     Sum of Squares (SS)   Degrees of Freedom   Mean Squares (MS)          F Ratio
Programs   SS(P)                 m - 1                MS(P) = SS(P)/(m - 1)      MS(P)/MS(E)
Error      SS(E)                 n - m                MS(E) = SS(E)/(n - m)
Total      SS(T)                 n - 1

We would reject $H_0$ if the observed value of F is too large. Thus the critical region is of the form

$$F \geq F_{\alpha;\, m-1,\, n-m} .$$

3.3 Comparison of the Individual Treatment Means

There are several methods by which we can compare treatment means. In this section p denotes the number of treatments (m above), N the total number of observations (n above) and EMS the error mean square MS(E).

3.3.1 The Least Significant Difference (Fisher's LSD) Method

Suppose that, following an analysis of variance F test where the null hypothesis is rejected, we wish to test $H_0 : \mu_i = \mu_j$ for all $i \neq j$. This could be done by using the t statistic

$$t = \frac{\bar{y}_{i.} - \bar{y}_{j.}}{\sqrt{EMS \, (1/n_i + 1/n_j)}} .$$

The pair of means $\mu_i$ and $\mu_j$ would be declared significantly different if

$$|\bar{y}_{i.} - \bar{y}_{j.}| > t_{(1 - \alpha/2),\, N-p} \sqrt{EMS \, (1/n_i + 1/n_j)} .$$

The quantity

$$LSD = t_{(1 - \alpha/2),\, N-p} \sqrt{EMS \, (1/n_i + 1/n_j)}$$

is called the least significant difference. A design is called balanced when $n_1 = n_2 = \cdots = n_p = n$, and then

$$LSD = t_{(1 - \alpha/2),\, N-p} \sqrt{2 \, EMS / n} .$$

3.3.2 Duncan's Multiple Range Test

A widely used procedure for comparing all pairs of means is the multiple range test proposed by Duncan. We first arrange the p treatment means in ascending order and compute the standard error of each average as

$$s_{\bar{y}_{i.}} = \sqrt{EMS / n_h}, \quad \text{where } n_h = p \Big/ \sum_{i=1}^{p} (1/n_i) .$$

If $n_1 = n_2 = \cdots = n_p = n$, we have $n_h = n$, and hence

$$s_{\bar{y}_{i.}} = \sqrt{EMS / n} .$$

The significant ranges are calculated as

$$R_k = r_\alpha(k, N - p) \, s_{\bar{y}_{i.}}, \quad k = 2, 3, \ldots, p,$$

where the values of $r_\alpha(k, N - p)$ are obtained from a table given by Duncan. The observed differences between means are then tested, beginning with the largest versus the smallest, which is compared with the least significant range $R_p$. Next, the difference between the largest and the second smallest is computed and compared with the least significant range $R_{p-1}$. Then the difference between the second largest and the smallest is computed and compared with the least significant range $R_{p-1}$. This process is continued until the differences of all possible p(p - 1)/2 pairs of means have been considered. If an observed difference is greater than the corresponding least significant range, we conclude that the pair of means in question is significantly different.

3.3.3 The Newman-Keuls Test

This test is similar to Duncan's multiple range test, except that the critical differences between means are calculated differently. Here we compute a set of critical values

$$K_k = q_\alpha(k, N - p) \, s_{\bar{y}_{i.}}, \quad k = 2, 3, \ldots, p,$$

where $q_\alpha(k, N - p)$ is the upper $\alpha$ percentage point of the Studentized range for groups of means of size k and N - p error degrees of freedom. The Studentized range is defined as

$$q = \frac{\bar{y}_{\max} - \bar{y}_{\min}}{\sqrt{EMS / n}} .$$

3.3.4 Tukey's Test

Tukey proposed a multiple comparison procedure based on the Studentized range statistic. His procedure requires the use of $q_\alpha(p, N - p)$ to determine the critical value for all pairwise comparisons, regardless of how many means are in the group. Thus, Tukey's test declares two means significantly different if the absolute value of their sample difference exceeds

$$T_\alpha = q_\alpha(p, N - p) \, s_{\bar{y}_{i.}} .$$

3.4 Result Summary

We first test the equality of the mean number of students in the nine programs. The summary results are presented in Table 3.3.
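Before turning to the results, the sum-of-squares decomposition of Section 3.2.2 can be sketched as follows. This is a minimal NumPy illustration of the textbook one-way ANOVA, not the MINITAB routine used for the results in this chapter. The function name `one_way_anova` and the demonstration groups are hypothetical. The sketch stops at the F ratio, because the p-value requires the F distribution, which NumPy does not provide.

```python
import numpy as np

def one_way_anova(groups):
    """One-way ANOVA table quantities for a list of samples (one per program)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    m = len(groups)                       # number of groups (programs)
    n = sum(len(g) for g in groups)       # total number of observations
    grand_mean = np.concatenate(groups).mean()
    # SS(Programs): weighted squared distances of group means from the grand mean.
    ss_programs = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # SS(Error): squared distances of observations from their own group mean.
    ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_programs = ss_programs / (m - 1)
    ms_error = ss_error / (n - m)
    return {
        "SS(P)": ss_programs, "SS(E)": ss_error, "SS(T)": ss_programs + ss_error,
        "df(P)": m - 1, "df(E)": n - m,
        "MS(P)": ms_programs, "MS(E)": ms_error,
        "F": ms_programs / ms_error,
    }

if __name__ == "__main__":
    # Hypothetical groups, for illustration only.
    table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
    for key, value in table.items():
        print(f"{key:6s} {value:10.3f}")
```

The observed F would then be compared with the critical value $F_{\alpha;\, m-1,\, n-m}$ from tables or from statistical software.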
Table 3.3: ANOVA Table for the Equality of Mean Test of Nine Programs

Source     SS           DF    MS          F Ratio   p-value
Programs   998821022    8     124852628   5.06      0.000
Error      7322357160   297   24654401
Total      8321178183   305

Table 3.3 clearly shows that the program effect is highly significant, so we must reject the hypothesis of equal means for the nine programs. To find which programs differ significantly from the others, we report Tukey's test and Fisher's LSD, as both are effective and readily available in MINITAB. Here we present only the summary results; the detailed results are presented in the Appendix.

Grouping Information Using Tukey Method

Program           N    Mean   Grouping
Engineering       34   4524   A
Medical Science   34   4490   A
Natural Science   34   4284   A B
Social Science    34   3459   A B C
Humanities        34   1847   A B C
Education         34   859    A B C
Law               34   655    B C
Fine Arts         34   185    C
Agriculture       34   85     C

Tukey's test shows that most of the Saudi Arabian students go abroad to study Engineering and Medical Science, and the fewest study Agriculture and Fine Arts.

Grouping Information Using Fisher Method

Program           N    Mean   Grouping
Engineering       34   4524   A
Medical Science   34   4490   A
Natural Science   34   4284   A
Social Science    34   3459   A B
Humanities        34   1847   B C
Education         34   859    C
Law               34   655    C
Fine Arts         34   185    C
Agriculture       34   85     C

However, Fisher's LSD shows that most of the Saudi Arabian students go abroad to study Engineering, Medical Science and Natural Science, and that the least popular programs are Agriculture, Fine Arts, Law and Education.

CHAPTER 4

Modeling and Fitting of Data Using Regression Diagnostics and Robust Regression

In this chapter we first discuss the classical regression method with diagnostics and then discuss some robust methods that are commonly used in regression. We will employ all of these to investigate which variables have a significant impact on the number of Saudi Arabian students studying abroad.
4.1 Classical Regression Analysis

Regression is probably the most popular and commonly used statistical method in all branches of knowledge. It is a conceptually simple method for investigating functional relationships among variables. The user of regression analysis attempts to discern the relationship between a dependent (response) variable and one or more independent (explanatory/predictor/regressor) variables. Regression can be used to predict the value of a response variable from knowledge of the values of one or more explanatory variables.

We write the multiple regression model as

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \quad i = 1, 2, \ldots, n \qquad (4.1)$$

where Y is the dependent variable, the X's are the independent variables, and $\varepsilon$ is the error term. Here we have a dependent variable and k explanatory variables excluding the intercept term. This model is also called a k + 1 variable regression model.

The assumptions of the multiple regression model are quite similar to those of the two-variable linear regression model:

(1) The relationship between Y and X is linear, but no exact linear relationship exists between two or more X's.
(2) The X's are nonstochastic variables whose values are fixed.
(3) The error has zero expected value: $E(\varepsilon_i) = 0$.
(4) The error term has constant variance for all observations, i.e., $E(\varepsilon_i^2) = \sigma^2$, i = 1, 2, …, n.
(5) The random variables $\varepsilon_i$ are statistically independent. Thus, $E(\varepsilon_i \varepsilon_j) = 0$ for all $i \neq j$.
(6) The error term is normally distributed.

4.1.1 Estimation Technique

We can express the multiple regression model in matrix notation as

$$Y = X\beta + \varepsilon \qquad (4.2)$$

where

$$
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{k1} \\ 1 & x_{12} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} .
$$

We obtain the OLS estimates of the k + 1 unknown parameters $\beta_0, \beta_1, \ldots, \beta_k$ in such a way that the sum of squares (SS)

$$S(\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = (Y - X\beta)^T (Y - X\beta)$$

is minimized. The value of $\beta$ that minimizes $S(\beta)$ is given by the solution to $\partial S / \partial \beta = 0$. We get

$$\frac{\partial S}{\partial \beta} = -2 X^T Y + 2 X^T X \beta = 0 ,$$

so that

$$\hat{\beta} = (X^T X)^{-1} X^T Y . \qquad (4.3)$$

We also have

$$V(\hat{\beta}) = \sigma^2 (X^T X)^{-1} . \qquad (4.4)$$

For this model, the residuals are

$$\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i} - \cdots - \hat{\beta}_k X_{ki}, \quad i = 1, 2, \ldots, n . \qquad (4.5)$$

An unbiased and consistent estimate of $\sigma^2$ is

$$s^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 / (n - k - 1) .$$

The estimated standard error of $\hat{\beta}_j$ is $s_{\hat{\beta}_j} = \sqrt{s^2 V_j}$, where $V_j$ is the j-th diagonal element of $(X^T X)^{-1}$. When the errors are normally distributed,

$$\frac{\hat{\beta}_j - \beta_j}{s_{\hat{\beta}_j}} \sim t_{n-k-1} .$$

4.1.2 Checking for Goodness of Fit

We can use the $R^2$ statistic as a measure of goodness of fit for the multiple regression model. With TSS the total, ESS the explained and RSS the residual sum of squares, we know that

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} . \qquad (4.6)$$

$R^2$ is the proportion of the total variation in Y explained by the regression of Y on X. It is easy to show that $R^2$ ranges in value between 0 and 1, but it is only a descriptive statistic. Roughly speaking, we associate a high value of $R^2$ (close to 1) with a good fit of the model by the regression line and a low value of $R^2$ (close to 0) with a poor fit. How large must $R^2$ be for the regression equation to be useful? That depends upon the area of application. If we could develop a regression equation to predict the stock market, we would be ecstatic if $R^2 = 0.50$. On the other hand, if we were predicting deaths in road accidents, we would want the prediction equation to have strong predictive ability, since the consequences of poor prediction could be quite serious. The difficulty with $R^2$ as a measure of goodness of fit is that it does not account for the number of degrees of freedom.
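The estimation formulas (4.3) to (4.6) can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions rather than the software used in this thesis: the function name `ols` and the demonstration data are hypothetical, the intercept column is added inside the function, and the normal equations are solved directly, which is adequate for small well-conditioned problems but is not how production regression software usually proceeds.

```python
import numpy as np

def ols(X, y):
    """OLS fit with intercept: returns (beta_hat, standard errors, R squared)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                       # a single regressor
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])    # design matrix with intercept column
    XtX = Xd.T @ Xd
    beta = np.linalg.solve(XtX, Xd.T @ y)    # equation (4.3)
    resid = y - Xd @ beta                    # equation (4.5)
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    s2 = rss / (n - k - 1)                   # unbiased estimate of sigma^2
    se = np.sqrt(s2 * np.diag(np.linalg.inv(XtX)))  # from equation (4.4)
    r2 = 1.0 - rss / tss                     # equation (4.6)
    return beta, se, r2

if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
    beta, se, r2 = ols(x, y)
    print("beta:", beta, "se:", se, "R2:", r2)
```

Dividing each coefficient by its standard error gives the t ratio that is compared with the $t_{n-k-1}$ distribution.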
A natural solution is to use variances, not variations, which leads to a corrected (adjusted) $R^2$, defined as

$$\bar{R}^2 = 1 - \frac{\text{Estimated } V(\varepsilon)}{\text{Estimated } V(Y)} .$$

Now

$$\text{Estimated } V(\varepsilon) = s^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 / (n - k - 1)$$

and

$$\text{Estimated } V(Y) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 / (n - 1) .$$

Thus the corrected $R^2$ becomes

$$\bar{R}^2 = 1 - \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \cdot \frac{n - 1}{n - k - 1} = 1 - (1 - R^2) \frac{n - 1}{n - k - 1} . \qquad (4.7)$$

4.1.3 Tests of Regression Coefficients

We often like to establish that the explanatory variable X has a significant effect on Y, that is, that the coefficient of X (which is $\beta$) is significant. In this situation the null hypothesis is constructed in a way that makes its rejection possible. We begin with a null hypothesis, which usually states that a certain effect is not present, i.e., $\beta = 0$. We estimate $\hat{\beta}$ and its standard error from the data and compute the statistic

$$t = \frac{\hat{\beta}}{s_{\hat{\beta}}} \sim t_{n-k-1} . \qquad (4.8)$$

4.2 Regression Diagnostics

Diagnostics are designed to find problems with the assumptions of any statistical procedure. In the diagnostic approach we estimate the parameters (in regression, fit the model) by the classical method (OLS) and then check whether there is any violation of the six standard assumptions listed in Section 4.1 and/or any irregularity in the results. Among these, the assumption of normality is the most important.

4.2.1 Test for Normality

The normality assumption means that the errors are normally distributed. The simplest graphical display for checking normality in regression analysis is the normal probability plot. This method is based on the fact that if the ordered residuals are plotted against their cumulative probabilities on normal probability paper, the resulting points should lie approximately on a straight line. An excellent review of different analytical tests for normality is available in Imon (2003). A test based on the correlation between the ordered observations and the expected values of normal order statistics is known as the Shapiro-Wilk test.
A test based on the empirical distribution function is known as the Anderson-Darling test. It is often very useful to test whether a given data set approximates a normal distribution. This can be evaluated informally by checking whether the mean and the median are nearly equal, whether the skewness is approximately zero, and whether the kurtosis is close to 3. A more formal test for normality is given by the Jarque-Bera statistic

$$JB = \frac{n}{6} \left[ S^2 + \frac{(K - 3)^2}{4} \right] \qquad (4.9)$$

where S and K are the sample skewness and kurtosis. Imon (2003) suggests a slight adjustment to the JB statistic to make it more suitable for regression problems. His proposed statistic, based on rescaled moments (RM) of ordinary least squares residuals, is defined as

$$RM = \frac{n c^3}{6} \left[ S^2 + \frac{c \, (K - 3)^2}{4} \right] \qquad (4.10)$$

where c = n/(n - k) and k is the number of independent variables in the regression model. Both the JB and the RM statistic follow a chi-square distribution with 2 degrees of freedom. If the value of the statistic is greater than the critical value of the chi-square distribution, we reject the null hypothesis of normality.

4.2.2 Outliers

In statistics we often observe that the values of descriptive measures are heavily influenced by a few extreme observations, commonly known as outliers. According to Barnett and Lewis (1993), 'Observations which stand apart from the bulk of the data are called outliers.' Different aspects of outliers and their consequences are discussed by Hadi, Imon and Werner (2009). Hampel et al. (1986) claim that a routine data set typically contains about 1-10% outliers, and even the highest quality data set cannot be guaranteed free of outliers. Barnett and Lewis (1993) commented, 'Any outliers, however, are always extreme values in the sample.' But this statement is not always true, especially in regression analysis.
In a regression problem, observations are judged as outliers on the basis of how unsuccessful the fitted regression equation is in accommodating them, and that is why observations corresponding to excessively large residuals are treated as outliers.

Types of Outliers

X-Outlier: This is a point that is outlying in regard to the x-coordinate. In the literature an X-outlier is more popularly known as a high leverage point.

Y-Outlier: This is a point that is outlying only because its y-coordinate is extreme.

X- and Y-Outlier: A point that is outlying in both x and y coordinates is known as an X- and Y-outlier.

Residual Outlier: This is a point that has a large standardized (deletion) residual. Most of the commonly used outlier detection methods are based on this approach, where an observation is judged as an outlier on the basis of how unsuccessful the fitted regression equation is in accommodating it.

Detection of Outliers

We often use the following three types of residuals for the identification of outliers.

Standardized residuals

$$d_i = \frac{y_i - x_i^T \hat{\beta}}{\hat{\sigma}}, \quad i = 1, 2, \ldots, n \tag{4.11}$$

Studentized residuals

$$r_i = \frac{y_i - x_i^T \hat{\beta}}{\hat{\sigma}\sqrt{1 - w_{ii}}}, \quad i = 1, 2, \ldots, n \tag{4.12}$$

Deletion Studentized (Externally Studentized or R-Student) residuals

$$t_i = \frac{y_i - x_i^T \hat{\beta}}{\hat{\sigma}_{(-i)}\sqrt{1 - w_{ii}}}, \quad i = 1, 2, \ldots, n \tag{4.13}$$

where $w_{ii}$ is the i-th diagonal element of the weight (leverage) matrix W and $\hat{\sigma}_{(-i)}^2$ is the OLS estimate of the mean squared error (MSE) based on the data set with the i-th observation deleted. As a rule of thumb we call an observation an outlier when its corresponding residual exceeds 3 in absolute value. A good review of recent outlier detection techniques in linear regression is available in Imon (2008) and Hadi, Imon and Werner (2009).

4.2.3 Multicollinearity

One basic assumption of the multiple regression model is that there is no exact linear relationship between any of the independent variables in the model. If such an exact linear relationship does exist, we say that the independent variables are perfectly collinear or that perfect collinearity exists.
Multicollinearity arises when two or more variables (or combinations of variables) are highly correlated with each other.

Effects of Multicollinearity

Wrong interpretation of the regression coefficients
Large variances and covariances for the OLS estimators of the regression parameters
Unduly large (in absolute value) estimates of the regression parameters

Indications of Multicollinearity

High Correlation Values

Calculate the correlation coefficients between all pairs of explanatory variables and test the maximum (in absolute value) correlation coefficient $r_{ij}$ by the statistic

$$t = \frac{r_{ij}\sqrt{n-2}}{\sqrt{1 - r_{ij}^2}} \sim t_{n-2}$$

There is evidence of multicollinearity at the 5% level of significance if $|t| > t_{n-2,\,0.975}$.

Large Variance Inflation Factor

We know that the variance of $\hat{\beta}_j$ is $\sigma^2 V_j$, where $V_j$ is the j-th diagonal element of $(X^T X)^{-1}$. Consequently $V(\hat{\beta}_j)$ is large if $V_j$ is large. Hence $V_j$ is called the variance inflation factor (VIF) of the explanatory variable $X_j$. One or more large VIFs indicate multicollinearity.

Thumb rule:
VIF < 5          No multicollinearity
5 ≤ VIF ≤ 10     Moderate multicollinearity
VIF > 10         Severe multicollinearity

Large Condition Number

A condition number is associated with the characteristic roots (eigenvalues) of the matrix $X^T X$. The condition number of $X^T X$ is defined as

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$

A large condition number indicates the existence of multicollinearity.

Thumb rule:
κ < 10           No multicollinearity
10 ≤ κ ≤ 30      Moderate multicollinearity
κ > 30           Severe multicollinearity

Low Tolerance Value

Tolerance values are defined as the inverse of VIF values. In other words,

Tolerance value = 1/VIF

Since tolerance values are the inverse of VIFs, low tolerance values indicate a multicollinearity problem.

Thumb rule:
Tolerance > 0.2          No multicollinearity
0.1 ≤ Tolerance ≤ 0.2    Moderate multicollinearity
Tolerance < 0.1          Severe multicollinearity

4.2.4 Variable Selection

In some applications theoretical considerations or prior experience can be helpful in selecting the regressors to be used in the model.
Building a regression model that includes only a subset of the available regressors involves two conflicting objectives.

1. We would like the model to include as many regressors as possible so that the information content in these factors can influence the fitted value of the response.
2. We want the model to include as few regressors as possible because the variance of the fitted response increases as the number of regressors increases. Also, the more regressors there are in a model, the greater the cost of data collection and model maintenance.

Finding an appropriate subset of regressors for the model is called the variable selection problem.

Graphical Methods

A number of graphical displays are used for variable selection. Here is a list of a few of them:

Added Variable Plot
Partial Residual (PR) plot (Ezekiel, 1924)
Component and Component-plus-residual (CCPR) plot (Wood, 1973)
Augmented Partial Residual (APR) plot (Mallows, 1986)
Conditional Expectation and Residual (CERES) plot (Cook, 1993)
Robust Added Variable Plot (Imon, 2003)

Model Selection Criteria

Minimum Residual Mean Square (RMS)

$$\hat{\sigma}^2 = \frac{SSE}{n-k-1}$$

where $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares, n is the number of observations, and k is the number of explanatory variables.

Maximum R-Square

$$R^2 = 1 - \frac{SSE}{SST}$$

where $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares.

Maximum Adjusted R-Square

$$R_a^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$$

Akaike Information Criterion

For a model with p = k + 1 parameters including the intercept, the Akaike information criterion suggests choosing the p for which the statistic

$$AIC(p) = \ln\left(\frac{1}{n}\sum_{i=1}^{n} \hat{\varepsilon}_i^2\right) + \frac{2p}{n}$$

is minimized. This statistic imposes a penalty for including insignificant variables.

Mallows Cp

For a model with p parameters,

$$C_p = \frac{Y^T (I - W) Y}{\hat{\sigma}^2} + (2p - n),$$

where $\hat{\sigma}^2$ is a good estimate of $\sigma^2$ (usually obtained from the full model). The above expression can be re-expressed as

$$C_p = \frac{(n-p)\,\hat{\sigma}_p^2}{\hat{\sigma}^2} + (2p - n),$$

where $\hat{\sigma}_p^2$ is the MSE from the sub model.
It is straightforward to show that for the full model $C_p = p$. Here, however, we search for a sub model where $C_p \approx p$ for a value of p that is smaller than the value of p for the full model.

Other Model Selection Criteria

Schwarz Criterion (SC)
Bayesian Information Criterion (BIC)
Final Prediction Error (FPE) or Prediction Criterion (PC)
Hannan-Quinn Criterion (HQC)

Variable Selection Methods

Forward Selection

Start with the empty model, then add the most significant variable (the one with the largest absolute t-value or smallest p-value). Repeat until all candidate variables to enter the model have insignificant regression coefficients.

Backward Elimination

Start with the full model, then delete the least significant variable (the one with the smallest absolute t-value or largest p-value). Repeat until all regression coefficients in the model are significant.

Stepwise Method

This is a combination of the forward selection and backward elimination methods.

4.3 Robust Regression

Robustness is now playing a key role in time series. According to Kadane (1984), ‘Robustness is a fundamental issue for all statistical analyses; in fact it might be argued that robustness is the subject of statistics.' The term robustness signifies insensitivity to small deviations from the assumptions. That means a robust procedure is nearly as efficient as the classical procedure when the classical assumptions hold strictly, but is considerably more efficient overall when there is a small departure from them. The main application of robust techniques in a time series problem is to devise estimators that are not strongly affected by outliers or departures from the assumed model. In time series, robust techniques grew up in parallel to diagnostics [see Hampel et al. (1986)], and initially they were used to estimate parameters and to construct confidence intervals in such a way that outliers or departures from the assumptions do not affect them.
A large body of literature is now available [Rousseeuw and Leroy (1987), Maronna, Martin, and Yohai (2006), Hadi, Imon and Werner (2009)] for robust techniques that are readily applicable in linear regression or in time series.

4.3.1. L – estimator

A first step toward a more robust time series estimator was the consideration of the least absolute values estimator (often referred to as the L – estimator). In the OLS method, outliers may have a very large influence since the parameters are estimated by minimizing the sum of squared residuals

$$\sum_{t=1}^{n} u_t^2$$

L estimates are considered to be less sensitive since they are determined by minimizing the sum of absolute residuals

$$\sum_{t=1}^{n} |u_t|$$

The L estimator was first introduced by Edgeworth in 1887, who argued that the OLS method is overly influenced by outliers, but because of computational difficulties it was not popular and not much used until quite recently. Sometimes the L – estimator is considered in the literature as a special case of the $L_p$-norm estimator, where the estimators are obtained by minimizing

$$\sum_{t=1}^{n} |u_t|^p$$

The $L_1$-norm estimator is the L – estimator, while the $L_2$-norm estimator is the OLS. Unfortunately, a single erroneous observation (high leverage point) can still totally upset the L – estimator.

4.3.2. Least Median of Squares

Rousseeuw (1984) proposed the Least Median of Squares (LMS) method, a fitting technique that is less sensitive to outliers than the OLS. In OLS, we estimate the parameters by minimizing the sum of squared residuals

$$\sum_{t=1}^{n} u_t^2$$

which is obviously the same as minimizing the mean of squared residuals

$$\frac{1}{n}\sum_{t=1}^{n} u_t^2.$$

Sample means are sensitive to outliers, but medians are not. Hence, to make the criterion less sensitive, we can replace the mean by the median to obtain the median of squared residuals

$$MSR(\hat{\beta}) = \text{Median}\,\{\hat{u}_t^2\} \tag{4.14}$$

Then the LMS estimate of $\beta$ is the value that minimizes $MSR(\hat{\beta})$.
Rousseeuw and Leroy (1987) have shown that LMS estimates are very robust with respect to outliers and have the highest possible breakdown point of 50%.

4.3.3. Least Trimmed Squares

The least trimmed (sum of) squares (LTS) estimator was proposed by Rousseeuw (1984). In this method we estimate $\beta$ in such a way that

$$LTS(\hat{\beta}) = \min_{\hat{\beta}} \sum_{t=1}^{h} \hat{u}_{(t)}^2 \tag{4.15}$$

Here $\hat{u}_{(t)}^2$ is the t-th ordered squared residual. For a trimming percentage of $\alpha$, Rousseeuw and Leroy (1987) suggested choosing the number of observations h on which the model is fitted as h = [n (1 – $\alpha$)] + 1. The advantage of using LTS over LMS is that in the LMS we always fit the regression line based on roughly 50% of the data, whereas in the LTS we can control the level of trimming. When we suspect that the data contain nearly 10% outliers, the LTS with 10% trimming will produce a better result than the LMS. We can increase the level of trimming if we suspect there are more outliers in the data.

4.3.4 Reweighted Least Squares

Another way to obtain a set of results based on a robust fit is the method of Reweighted Least Squares (RLS) proposed by Rousseeuw and Leroy (1987). In this method, the parameters are first estimated by the LMS method and the outliers are identified. The final model is then fitted by least squares without the potential outliers. Since this fit does not involve any outliers, the method is claimed to be more appropriate for the majority of the observations. The residuals of the deleted points are, however, re-estimated from the robust fit to produce a full set of residuals.

4.4 Regression Results

Here we employ the regression method to understand which variables have a significant impact on the number of Saudi Arabia students studying abroad. Budget in higher education is an immediate choice. The economy of Saudi Arabia relies heavily on oil, so the two other variables one can consider are oil price and oil revenue.
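Before turning to the fits, the two-stage RLS procedure of Section 4.3.4 can be sketched in Python for the case of a single explanatory variable. This is only an illustrative sketch under stated assumptions, not the computation used in this study: the LMS fit is approximated by searching over all lines through pairs of observations, the preliminary scale estimate 1.4826 (1 + 5/(n – p)) times the square root of the minimum median squared residual and the cutoff 2.5 are the choices commonly attributed to Rousseeuw and Leroy (1987), and all function names and the data are hypothetical.

```python
from itertools import combinations
from statistics import median

def ols_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def lms_line(x, y):
    """Approximate LMS fit: best line through any two observations,
    judged by the median of squared residuals (equation 4.14)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, skip
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        crit = median([(yt - a - b * xt) ** 2 for xt, yt in zip(x, y)])
        if best is None or crit < best[0]:
            best = (crit, a, b)
    return best[1], best[2], best[0]

def rls_line(x, y, cutoff=2.5):
    """Reweighted least squares: flag outliers from the LMS residuals,
    then refit by OLS without them. Returns (a, b, outlier_indices)."""
    n, p = len(x), 2
    a, b, crit = lms_line(x, y)
    resid = [yt - a - b * xt for xt, yt in zip(x, y)]
    s0 = 1.4826 * (1 + 5 / (n - p)) * crit ** 0.5  # assumed scale estimate
    if s0 == 0:
        keep = [r == 0 for r in resid]  # degenerate case: exact fit
    else:
        keep = [abs(r) / s0 <= cutoff for r in resid]
    xk = [xt for xt, k in zip(x, keep) if k]
    yk = [yt for yt, k in zip(y, keep) if k]
    a_rls, b_rls = ols_line(xk, yk)
    return a_rls, b_rls, [t for t, k in enumerate(keep) if not k]

# Hypothetical data: y = 2 + 3x with small noise and two gross outliers.
x = list(range(10))
y = [2 + 3 * t + (0.1 if t % 2 == 0 else -0.1) for t in range(8)] + [100.0, -50.0]
print("OLS:", ols_line(x, y))
print("RLS:", rls_line(x, y))
```

In this sketch the OLS slope is pulled well away from 3 by the two contaminated points, while the RLS slope stays close to 3 because those points are excluded before the final least squares fit.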
We begin with simple linear regression models of the number of Saudi Arabia students studying abroad on the three explanatory variables, one at a time. Figure 4.1 gives a scatter plot of the total number of students versus budget in higher education. We observe an upward and strong linear relationship between these two variables. The attached MINITAB output shows that the value of $R^2$ is 0.83 and that the variable budget in higher education is highly significant (p-value 0.000).

Figure 4.1: Scatter Plot of the Total Number of Students vs Budget in Higher Education

Regression Analysis: Total No. of Students versus Budget in HE

The regression equation is
Total No. of Students = - 5982 + 0.000000 Budget in HE

Predictor       Coef         SE Coef      T       P       VIF
Constant        -5982        3025         -1.98   0.057
Budget in HE    0.00000046   0.00000004   12.48   0.000   1.000

S = 12621.3   R-Sq = 83.0%   R-Sq(adj) = 82.4%

Figure 4.2 gives a scatter plot of the total number of students versus oil price. We observe an upward and linear relationship between these two variables. The attached MINITAB output shows that the value of $R^2$ is 0.529, which is not great. The graph also suggests that there are probably a few outliers in the data, so we think it is a good idea to employ a robust regression here. We fit the data by the reweighted least squares (RLS) method and the fitted plot is presented in Figure 4.3.

Figure 4.2: Scatter Plot of the Total Number of Students vs Oil Price

Regression Analysis: Total No. of Students versus Oil Price

The regression equation is
Total No.
of Students = - 19210 + 860 Oil Price

Predictor    Coef     SE Coef   T       P       VIF
Constant     -19210   7525      -2.55   0.016
Oil Price    860.3    143.6     5.99    0.000   1.000

S = 20985.0   R-Sq = 52.9%   R-Sq(adj) = 51.4%

Figure 4.3: RLS and OLS Fit of the Total Number of Students vs Oil Price

Regression Analysis: Total No. of Students_1 versus Oil Price_1

The regression equation is
Total No. of Students_1 = - 29017 + 1363 Oil Price_1

Predictor      Coef      SE Coef   T        P
Constant       -29017    2561      -11.33   0.000
Oil Price_1    1362.94   56.54     24.10    0.000

S = 6886.77   R-Sq = 96.2%   R-Sq(adj) = 96.0%

We observe from Figure 4.3 that the robust RLS fits the data much better than the traditionally used OLS. We now observe an upward and very nearly linear relationship between these two variables. The attached MINITAB output shows that the value of $R^2$ increases from 0.529 to 0.962, which is a huge improvement. So we can say that robust regression performs much better than the classical regression method here.

Figure 4.4: Scatter Plot of the Total Number of Students vs Oil Revenue

Figure 4.4 gives a scatter plot of the total number of students versus oil revenue. We observe an upward and linear relationship between these two variables. The attached MINITAB output shows that the value of $R^2$ is 0.786, which is good.

Regression Analysis: Total No. of Students versus Oil Revenue

The regression equation is
Total No.
of Students = - 6054 + 0.0797 Oil Revenue

Predictor      Coef       SE Coef    T       P
Constant       -6054      3443       -1.76   0.088
Oil Revenue    0.079707   0.007362   10.83   0.000

S = 14154.7   R-Sq = 78.6%   R-Sq(adj) = 77.9%

Since each of the three explanatory variables shows a linear relationship with the total number of students studying abroad, we now fit a multiple linear regression model.

Response variable: The total number of students studying abroad
Explanatory variables: Budget in higher education, Oil price, and Oil revenue.

Regression Analysis: Total No. of versus Budget in HE, Oil Revenue, Oil Price

The regression equation is
Total No. of Students = - 18688 + 0.000000 Budget in HE - 0.0127 Oil Revenue + 417 Oil Price

Predictor       Coef         SE Coef      T       P       VIF
Constant        -18688       4476         -4.18   0.000
Budget in HE    0.00000042   0.00000008   5.23    0.000   6.812
Oil Revenue     -0.01267     0.01897      -0.67   0.509   12.003
Oil Price       417.3        134.2        3.11    0.004   3.471

S = 10526.9   R-Sq = 88.9%   R-Sq(adj) = 87.8%

The attached MINITAB output for the multiple regression is puzzling. Here the value of $R^2$ is 0.889, which is good, but the estimated effect of oil revenue is negative (and insignificant), which completely conflicts with our findings in Figure 4.4. This appears to be a clear case of the wrong sign problem, which is caused by multicollinearity. We checked the VIF values and found the largest one to be 12.003, which shows that this model is severely affected by multicollinearity.

The above results suggest that we cannot keep all three explanatory variables in the model. To decide which of the explanatory variables should remain in the model we apply the forward selection, backward elimination and stepwise regression methods, and the MINITAB results are reported below.

Stepwise Regression: Total No. of versus Oil Revenue, Budget in HE, ...

Forward selection. Alpha-to-Enter: 0.05

Response is Total No.
of Students on 3 predictors, with N = 34

Step                  1         2
Constant          -5982    -17088

Budget in HE    0.00000   0.00000
T-Value           12.48      9.92
P-Value           0.000     0.000

Oil Price                     350
T-Value                      3.98
P-Value                     0.000

S                 12621     10432
R-Sq              82.95     88.72
R-Sq(adj)         82.42     87.99
Mallows Cp         16.0       2.4

Stepwise Regression: Total No. of versus Oil Revenue, Budget in HE, ...

Backward elimination. Alpha-to-Remove: 0.05

Response is Total No. of Students on 3 predictors, with N = 34

Step                  1         2
Constant         -18688    -17088

Oil Revenue      -0.013
T-Value           -0.67
P-Value           0.509

Budget in HE    0.00000   0.00000
T-Value            5.23      9.92
P-Value           0.000     0.000

Oil Price           417       350
T-Value            3.11      3.98
P-Value           0.004     0.000

S                 10527     10432
R-Sq              88.88     88.72
R-Sq(adj)         87.77     87.99
Mallows Cp          4.0       2.4

Stepwise Regression: Total No. of versus Oil Revenue, Budget in HE, ...

Alpha-to-Enter: 0.05   Alpha-to-Remove: 0.05

Response is Total No. of Students on 3 predictors, with N = 34

Step                  1         2
Constant          -5982    -17088

Budget in HE    0.00000   0.00000
T-Value           12.48      9.92
P-Value           0.000     0.000

Oil Price                     350
T-Value                      3.98
P-Value                     0.000

S                 12621     10432
R-Sq              82.95     88.72
R-Sq(adj)         82.42     87.99
Mallows Cp         16.0       2.4

All three methods come to exactly the same conclusion, i.e., the explanatory variables that we should keep in our study are budget in higher education and oil price. Let us denote this as Model A.

Regression Analysis: Model A: Total No. of Stu versus Budget in HE, Oil Price

The regression equation is
Total No. of Students = - 17088 + 0.000000 Budget in HE + 350 Oil Price

Predictor       Coef         SE Coef      T       P       VIF
Constant        -17088       3747         -4.56   0.000
Budget in HE    0.00000037   0.00000004   9.92    0.000   1.519
Oil Price       350.09       87.97        3.98    0.000   1.519

S = 10432.5   R-Sq = 88.7%   R-Sq(adj) = 88.0%

The attached MINITAB output for Model A looks better now. Here the value of $R^2$ is 0.887, which is good, but more importantly we see that the effects of both explanatory variables are positive and statistically significant.
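As a cross-check, the R-Sq(adj) and Mallows Cp values reported by MINITAB above can be reproduced from the formulas of Sections 4.1 and 4.2.4. The short Python sketch below is only illustrative; the function names are hypothetical, and the inputs are the S and R-Sq values read off the MINITAB outputs in this section (the full three-variable model supplies the estimate of $\sigma^2$ for Cp).

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square, equation (4.7): 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def mallows_cp(n, p, s_sub, s_full):
    """Mallows Cp = (n - p) * s_sub^2 / s_full^2 + (2p - n), with p = k + 1."""
    return (n - p) * (s_sub / s_full) ** 2 + (2 * p - n)

n = 34
s_full = 10526.9  # S of the model with all three explanatory variables

# (label, k, R-Sq, S) as reported in the MINITAB outputs above
models = [
    ("Budget in HE only", 1, 0.8295, 12621.3),
    ("Model A (Budget in HE, Oil Price)", 2, 0.8872, 10432.5),
    ("All three variables", 3, 0.8888, 10526.9),
]

for label, k, r2, s in models:
    adj = 100 * adjusted_r2(r2, n, k)
    cp = mallows_cp(n, k + 1, s, s_full)
    print(f"{label}: R-Sq(adj) = {adj:.2f}%, Cp = {cp:.1f}")
# Budget in HE only: R-Sq(adj) = 82.42%, Cp = 16.0
# Model A (Budget in HE, Oil Price): R-Sq(adj) = 87.99%, Cp = 2.4
# All three variables: R-Sq(adj) = 87.77%, Cp = 4.0
```

Model A has the highest adjusted R-square and a Cp close to (in fact below) its number of parameters p = 3, which agrees with the choice made by all three selection methods.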
Figure 4.5: Normal Probability Plot of the Residuals for Model A (Mean 0, StDev 10111, N 34, AD 0.906, P-Value 0.019)

But when we look at the normal probability plot of the residuals shown in Figure 4.5, we do not feel very good about Model A. For this particular case the value of the Jarque-Bera test is 6.72 (p-value 0.0347) and that of the RM test is 8.37 (p-value 0.0152). So both tests reject the assumption of normality of errors and thus the model looks questionable. As an alternative choice we fit the same model by the robust reweighted least squares (RLS) method and we call it Model B.

Regression Analysis: Model B: Total No. of Stu versus Budget in HE_1, Oil Price_1

The regression equation is
Total No. of Students_1 = - 24848 + 0.000000 Budget in HE_1 + 992 Oil Price_1

Predictor         Coef         SE Coef      T       P
Constant          -24848       2647         -9.39   0.000
Budget in HE_1    0.00000016   0.00000005   2.91    0.008
Oil Price_1       991.7        136.8        7.25    0.000

S = 5984.88   R-Sq = 97.2%   R-Sq(adj) = 97.0%

The attached MINITAB output shows that Model B produces an even better fit in terms of $R^2$, as its value goes up to 0.972 from 0.887 for the OLS fit. Here the effects of both explanatory variables are positive and statistically significant.

Figure 4.6: Normal Probability Plot of the Residuals for Model B (Mean -7.567E-12, StDev 5730, N 25, AD 0.532, P-Value 0.157)

For Model B, the normal probability plot of the residuals shown in Figure 4.6 looks much better than what we saw for Model A. For confirmation we compute the Jarque-Bera and the RM values for Model B. We see that the value of the Jarque-Bera test is 1.56 (p-value 0.4584) and that of the RM test is 1.69 (p-value 0.4296).
So neither test rejects the assumption of normality of errors, and thus the model can be considered a valid one.

In the previous chapter we have seen that most of the variables we consider in our regression model show exponential growth. So it may be a good idea to fit the model using a log transformation on the response, as suggested by Montgomery et al. (2013). This third model will be denoted as Model C.

Regression Analysis: Model C:

The regression equation is
Total No. of Students_2 = 7.44 + 0.000000 Budget in HE_2 + 0.0217 Oil Price_2

Predictor         Coef         SE Coef      T       P
Constant          7.4370       0.1189       62.53   0.000
Budget in HE_2    0.00000000   0.00000000   10.12   0.000
Oil Price_2       0.021717     0.002792     7.78    0.000

S = 0.331114   R-Sq = 92.6%   R-Sq(adj) = 92.1%

Figure 4.7: Normal Probability Plot of the Residuals for Model C (Mean -4.44089E-15, StDev 0.3209, N 34, AD 0.459, P-Value 0.247)

The attached MINITAB output shows that Model C falls between Model A and Model B in terms of $R^2$. For this model the value of $R^2$ is 0.926, whereas it was 0.972 for Model B and 0.887 for Model A. Here the effects of both explanatory variables are positive and statistically significant. The normal probability plot of the residuals for Model C looks good, as shown in Figure 4.7. We now compute the Jarque-Bera and the RM values for Model C. The value of the Jarque-Bera test is 1.86 (p-value 0.3946) and that of the RM test is 1.97 (p-value 0.3734). So neither test rejects the assumption of normality of errors, and thus the model can be considered a valid one.

4.5 Results Comparisons

In this section we summarize our findings above. To explain the number of students studying abroad we began with three explanatory variables, but this model failed the multicollinearity check.
After that we employed the variable selection procedures to select the best set of regressors. With this selection made, we fit the data with three different models, and the results are summarized in Table 4.1.

Table 4.1: Regression Results Summary

Model             $R^2$    JB p-value   RM p-value   Normality
A: OLS            0.887    0.0347       0.0152       Rejected
B: RLS            0.972    0.4584       0.4296       Accepted
C: Exponential    0.926    0.3946       0.3734       Accepted

The above results suggest that the traditional least squares method performs worst among the three models considered here. It not only has the lowest $R^2$, it fails the normality tests as well. Both the robust fit and the exponential model pass the normality tests, but we put the robust RLS ahead of the exponential model because it has both the higher $R^2$ and the higher p-values in the tests of normality.

CHAPTER 5

Cross Validation of Forecasts

In this chapter our main objective is to evaluate the forecasts made by the different regression methods and models. We employ the cross validation method for this purpose.

5.1 Evaluation of Forecasts by Cross Validation

Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). An excellent review of different types of cross validation techniques is available in Izenman (2008). Picard and Cook (1984) developed the basic fundamentals of applying the cross validation technique in regression and time series. According to Montgomery et al. (2013), three types of procedures are useful for validating a regression or time series model:
(i) Analysis of the model coefficients and predicted values, including comparisons with prior experience, physical theory, and other analytical models or simulation results,
(ii) Collection of new data with which to investigate the model’s predictive performance,
(iii) Data splitting, that is, setting aside some of the original data and using these observations to investigate the model’s predictive performance.

Since we have a large data set, we prefer the data splitting technique for cross-validation of the fitted model. In order to find the best prediction model we usually set aside, say, l observations as a holdback period. The size of l is usually 10% to 20% of the original data. Suppose that we tentatively select two models, namely A and B. We fit both models using the remaining (T – l) observations, where T is the total number of observations. Then we compute the mean squared prediction error (MSPE)

$$MSPE_A = \frac{1}{l}\sum_{t=1}^{l} e_{At}^2 \tag{5.1}$$

for model A and

$$MSPE_B = \frac{1}{l}\sum_{t=1}^{l} e_{Bt}^2 \tag{5.2}$$

for model B, where $e_{At}$ and $e_{Bt}$ are the forecast errors of the two models in the holdback period. Several methods have been devised to determine whether one MSPE is statistically different from the other. One popular method is the F-test approach, where the F-statistic is constructed as the ratio of the two MSPEs, keeping the larger MSPE in the numerator. If the MSPE for model A is larger, this statistic takes the form

$$F = \frac{MSPE_A}{MSPE_B} \tag{5.3}$$

This statistic follows an F distribution with (l, l) degrees of freedom under the null hypothesis of equal forecasting performance. If the F-test is significant we choose model B for the data; otherwise, we conclude that there is little difference between the two models.

5.2 Cross Validation Results

In this section we employ linear regression with the OLS and RLS methods and an exponential model for cross validation. Since we have 34 years of data, we use the first 90% of our data (30 years) for fitting the models, and the last 10% of the observations (4 years) are forecasted by these three different methods.
Table 5.1: Original and Forecasted Values for 2011-2014

Year    Original    RLS        OLS        Exponential
2011    95991       89716.3    69734.2    70962
2012    86030       90866.4    78140.5    97382
2013    102302      95339.7    89855.8    136358
2014    90925       87741.9    89071.0    121570

Figure 5.1: Scatterplot of RLS, OLS, Exponential Forecasts vs Original Values

Table 5.1 gives the total number of students studying abroad. The three sets of forecasted values for the years 2011-2014 are presented together with the original values. Figure 5.1 gives a graphical display to show which forecasted values are closer to their corresponding original ones. The original values are plotted as black dots, and the RLS forecasts, plotted as red dots, are quite close to the black ones. This graph clearly shows that the RLS forecasts are much better than the OLS forecasts. Although the exponential model fitted the data better than the OLS, in terms of forecasts it seems to perform even worse than the OLS.

Table 5.2: Cross Validation Result Summary

Model          MSPE         F          p-value
OLS            227502579
RLS            30342093     7.49791    0.0383
Exponential    713559588    0.525061   0.7260

Since graphical summaries are subjective, we carry out an analytical test to evaluate the forecasts as described in (5.1) to (5.3), and the results are presented in Table 5.2. We observe from this table that the MSPE value for the RLS is much smaller than those of the OLS and the exponential model. The F test of the RLS against the OLS is significant at the 5% level (p-value 0.0383). The corresponding test for the exponential model, however, yields a clearly insignificant p-value (0.7260). Thus we can conclude that the RLS produces the best set of forecasts, followed by the OLS forecasts. The exponential forecasts are the worst in this study.
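The MSPE values in Table 5.2 and the F test of the RLS against the OLS can be recomputed directly from the rounded forecasts in Table 5.1, as in the minimal Python sketch below. The function names are hypothetical. The p-value function is not a general routine: it uses the closed form of the F distribution function that holds only for (4, 4) degrees of freedom, which is the case here because the holdback period is l = 4.

```python
def mspe(actual, predicted):
    """Mean squared prediction error, equations (5.1) and (5.2)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def f_pvalue_4_4(f):
    """Upper tail probability of an F(4, 4) variable.
    Closed form valid only for (4, 4) degrees of freedom:
    P(F <= f) = 3x^2 - 2x^3 with x = f / (1 + f)."""
    x = f / (1.0 + f)
    return 1.0 - (3 * x ** 2 - 2 * x ** 3)

# Values from Table 5.1 (years 2011-2014)
original = [95991, 86030, 102302, 90925]
forecasts = {
    "RLS": [89716.3, 90866.4, 95339.7, 87741.9],
    "OLS": [69734.2, 78140.5, 89855.8, 89071.0],
    "Exponential": [70962, 97382, 136358, 121570],
}

mspe_values = {name: mspe(original, f) for name, f in forecasts.items()}
for name, value in mspe_values.items():
    print(f"{name}: MSPE = {value:.0f}")

# F test with the larger MSPE (OLS) in the numerator, equation (5.3)
f_stat = mspe_values["OLS"] / mspe_values["RLS"]
print(f"F = {f_stat:.4f}, p-value = {f_pvalue_4_4(f_stat):.4f}")
```

Because the forecasts in Table 5.1 are rounded to one decimal place, the recomputed MSPE values for the OLS and the exponential model differ slightly in the last digits from those in Table 5.2, while the F statistic and p-value for the RLS against the OLS agree with the table to the reported precision.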
CHAPTER 6

Conclusions and Areas of Further Research

In this chapter we summarize the findings of our research to draw some conclusions and outline ideas for our future research.

6.1 Conclusions

In this study our prime objective was to investigate the trend of Saudi Arabia students who are studying abroad for higher education. We investigated both the overall trend and the trends of nine individual programs. We observe that not a single variable fits the linear trend model; all of them fit either quadratic or exponential models. We then investigated the trends of some other variables, such as budget in higher education, oil price, and oil revenue, which should influence the number of students studying abroad. We observe a similar trend for these variables as well. We also observe that most of the Saudi Arabia students go abroad to study Engineering and Medical Science, and the smallest numbers of students study Agriculture and Fine Arts. We also found that the number of male students is significantly higher than the number of female students in 6 out of 9 programs. Female students outnumber male students in only three programs, but the differences are not statistically significant. So we find evidence of gender discrimination among the Saudi Arabia students studying abroad.

To find out which factors influence the number of students studying abroad we considered regression analysis, and the two variables that we found to matter most are budget in higher education and oil price. We also observe that the commonly used least squares method has several limitations in this case, so we finally used the robust reweighted least squares to fit the data. To verify how good the fit is, we used cross validation to generate forecasts for the last four years of data, and we found that the RLS fit produces much better forecasts than the other methods.

Our findings cause some concern about the future of the programs in which Saudi students go abroad for higher studies.
Since we see that oil price has a significant positive impact on the number of students, we suspect that the recent fall in oil price might affect the programs adversely.

6.2 Areas of Further Research

Although our data sets are time series, we were not able to consider a variety of time series methods due to time constraints. We only considered deterministic models in fitting the data. In future we would like to extend our research by considering stochastic ARIMA models. Volatility could be an essential feature of these data, so we would also like to consider ARCH/GARCH or ARFIMA/GARFIMA models on these data in future.

References

1. Bowerman, B. L., O’Connell, R. T., and Koehler, A. B. (2005). Forecasting, Time Series, and Regression: An Applied Approach, 4th Ed., Duxbury Publishing, Thomson Brooks/Cole, New Jersey.
2. Hadi, A.S., Imon, A.H.M.R. and Werner, M. (2009). Detection of outliers, Wiley Interdisciplinary Reviews: Computational Statistics, 1, pp. 57-70.
3. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W. (1986). Robust Statistics: The Approach Based on Influence Functions, Wiley, New York.
4. Imon, A. H. M. R. (2003). Residuals from Deletion in Added Variable Plots, Journal of Applied Statistics, 30, pp. 841-855.
5. Imon, A. H. M. R. (2003). Regression Residuals, Moments, and Their Use in Tests for Normality, Communications in Statistics - Theory and Methods, 32, pp. 1021-1034.
6. Imon, A. H. M. R. (2008). Diagnostic Robust Approach of Outlier Detection in Regression, Journal of Statistical Research, 42, pp. 105-120.
7. Izenman, A.J. (2008). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning, Springer, New York.
8. Kadane, J.B. (1984). Robustness of Bayesian Analysis, Elsevier North-Holland, Amsterdam.
9. Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). Robust Statistics: Theory and Methods, Wiley, New York.
10. Montgomery, D., Jennings, C., and Kulahci, M.
(2008), Introduction to Time Series Analysis and Forecasting, Wiley, New York. 11. Montgomery, D., Peck, E., and Vining, G. (2013), An Introduction to Regression Analysis, 5th Ed., Wiley, New York. 12. Pindyck, R. S. and Rubenfeld, D. L. (1998), Econometric Models and Economic Forecasts, 4th Ed. Irwin/McGraw-Hill Boston. 13 Rousseeuw, P.J. (1984). Least Median of Squares Regression, Journal of the American Statistical Association, 79, pp. 871 – 880. 14. Rousseeuw, P.J. and Leroy, A.M. (1987). Robust Regression and Outlier Detection, Wiley, New York. 15. Rousseeuw, P.J. and Leroy, A.M. (1987). A Fast Algorithm for S-Regression Estimates, Journal of Computational and Graphical Statistics, 15, pp. 414–427. 16. Saudi Arabian Moneytary Agency (SAMA). http://www.sama.gov.sa/en-US/EconomicReports/Pages/YearlyStatistics.aspx 17. Saudi Arabia Cultural Mission to the U.S. http://www.sacm.org/ArabicSACM/pdf/Posters_Sacm_schlorship.pdf 19. The Ministry of Education https://www.mohe.gov.sa/ar/Ministry/Deputy-Ministry-for-Planning-andInformation-affairs/HESC/Ehsaat/Pages/default.aspx 20. The Ministry of Education https://www.mof.gov.sa/english/DownloadsCenter/Pages/Budget.aspx 83 APPENDIX A Table: A1. 
Number of Saudi Students Studying Abroad for Higher Education

Year         Social Science      Natural Science      Medical Science
       Male Female  Total   Male Female  Total   Male Female  Total
1981   2015     84   2099   1124     48   1172   1312    235   1547
1982   2061    213   2274   1117     78   1195    758    130    888
1983   1735    156   1891    974     72   1046    666    110    776
1984   1356    141   1497    673     47    720    508     81    589
1985   1540    164   1704    611     53    664    621     86    707
1986   1199    161   1360    647     65    712    637     82    719
1987   1062    138   1200    645     61    706    654     86    740
1988    939     92   1031    597     71    668    578     64    642
1989    685    112    797    555    125    680    542     59    601
1990    570     79    649    462    100    562    448     58    506
1991    598     82    680    423     88    511    361     46    407
1992    628     81    709    430     79    509    431     60    491
1993    605     89    694    424     76    500    508     59    567
1994    647     88    735    428     73    501    552     60    612
1995    475     51    526    425     89    514    559     50    609
1996    531     58    589    481    133    614    550     62    612
1997    598    151    749    536    372    908    673    110    783
1998    107     75    182    535    424    959    860    206   1066
1999    676    254    930    595    388    983    966    248   1214
2000   1759    534   2293    974    537   1511   1361    313   1674
2001   1917    568   2485   1072    570   1642   1626    398   2024
2002    687    244    931    730    436   1166   1171    307   1478
2003    764    296   1060    788    392   1180   1214    362   1576
2004    754    333   1087    776    407   1183   1376    398   1774
2005    591    241    832    597    282    879   1709    467   2176
2006   2267    510   2777   2823    607   3430   3895    986   4881
2007   4663    968   5631   3136    720   3856   4983   1380   6363
2008   5424   1273   6697   5130   1262   6392   3652   1674   5326
2009   9462   2045  11507   7118   1715   8833   6173   2340   8513
2010  16318   4132  20450   8584   2567  11151   7524   3736  11260
2011  26043   7702  33745  11945   4481  16426  11589   6287  17876
2012   1547   1093   2640  16331   6306  22637  14717   7913  22630
2013   1542   1269   2811  19047   8230  27277  17208   9881  27089
2014   1287   1068   2355  16245   7711  23956  15097   8847  23944

Table A1 (continued).

Year                    Law           Humanities            Fine Arts
       Male Female  Total   Male Female  Total   Male Female  Total
1981    123      2    125    408    117    525     98      2    100
1982    313     19    332    327   2363   2690     45     33     78
1983     42      4     46    236   2203   2439     47     32     79
1984     32      9     41    190    274    464     27     11     38
1985     39      6     45    287    252    539     29     23     52
1986     43      6     49    321    261    582     24     26     50
1987     41      2     43    228    260    488     17     35     52
1988     39      1     40    191    168    359     13     18     31
1989     44      2     46    116    110    226     12     26     38
1990     39      1     40     97     49    146     10     11     21
1991     36      3     39    107     34    141      9     16     25
1992     35      1     36    107     44    151     10     22     32
1993     55      1     56    129     57    186     10     21     31
1994     54      1     55    108     61    169     12     21     33
1995     29      0     29    111     90    201      5     26     31
1996     29      0     29    335    501    836      3     22     25
1997     31      8     39    441    735   1176     13     35     48
1998     39      8     47    533    549   1082      9     37     46
1999     78      8     86    481    816   1297      6     31     37
2000    183     17    200    711   1048   1759     14     38     52
2001    292     25    317    754   1119   1873     18     53     71
2002     24     56     80    653   1018   1671     24     56     80
2003    105      5    110    568   1010   1578     14     58     72
2004    127     10    137    567   1030   1597     20     50     70
2005    240     25    265    268    744   1012     21     62     83
2006    506     37    543    677    977   1654     28     64     92
2007    625     58    683    949   1495   2444     27    119    146
2008    756     82    838    522    408    930     17     52     69
2009   1744    208   1952   4336   2820   7156     68    178    246
2010   1729    260   1989   1920   1786   3706     77    266    343
2011   2289    475   2764   1998   1455   3453    143    406    549
2012   2989    629   3618   5370   3800   9170    269    621    890
2013   3096    902   3998   3161   2646   5807    331    868   1199
2014   2715    827   3542   3050   2231   5281    474    994   1468

Table A1 (continued).

Year            Engineering            Education          Agriculture
       Male Female  Total   Male Female  Total   Male Female  Total
1981   1490     20   1510    382     25    407    219      1    220
1982   1137     14   1151    516    212    728    176      4    180
1983   1026     68   1094    514    265    779    138      3    141
1984    849     12    861    339    202    541    107      2    109
1985    737      9    746    473    309    782     99      3    102
1986    537     17    554    296    344    640     81      4     85
1987    499      6    505    174    351    525     95      3     98
1988    449     10    459    157    192    349     82      0     82
1989    451      9    460    148    106    254     82      1     83
1990    428     10    438    120     68    188     52      1     53
1991    467     18    485    123     87    210     49      1     50
1992    362      3    365    120     93    213     50      0     50
1993    407      2    409    104     93    197     55      1     56
1994    411      6    417    109     88    197     62      1     63
1995    419     37    456     62     60    122     61      2     63
1996    544     15    559     74     55    129     54      2     56
1997   1123     34   1157    118     88    206     66      2     68
1998   1435    100   1535    107     75    182     58      6     64
1999    542     46    588    228    353    581     82      3     85
2000    498     43    541    459    631   1090     82      4     86
2001    516     44    560    458    560   1018     83      8     91
2002    681     88    769    193    311    504     79      4     83
2003   2711    162   2873    176    276    452     74     10     84
2004   5481    292   5773    177    167    344     54     13     67
2005   5080    130   5210    171    224    395     34      5     39
2006   6665    317   6982    300    323    623     81     15     96
2007  10647    360  11007   2144   1019   3163     80     31    111
2008  18104    692  18796    319    216    535     29      0     29
2009  21461    672  22133   1254    710   1964     44      0     44
2010  30164    968  31132    610    716   1326     74      2     76
2011  26255    860  27115    955   1341   2296     74     12     86
2012   1490     20   1510    863   1342   2205     88     19    107
2013   1137     14   1151   1016   1867   2883     88     18    106
2014   1026     68   1094   1059   2117   3176     74     14     88

Table A2. Saudi Arabia Oil Revenue, Oil Price and Budget in Higher Education

Year   Oil Revenue   Budget in HE   Oil Price
1981        328594    2.76845E+06       77.80
1982        186006    9.35426E+06       74.58
1983        145123    1.03608E+07       68.43
1984        121348    9.30524E+06       69.36
1985         88425    1.10786E+07       67.16
1986         42464    7.13496E+09       26.21
1987         67405    6.00293E+09       28.38
1988         48400    6.15068E+09       20.45
1989         75900    5.73860E+09       25.20
1990         96800    5.75337E+09       28.40
1991        149497    6.09730E+09       23.50
1992        128790    3.18550E+10       22.64
1993        105976    3.41000E+10       20.52
1994         95505    3.51000E+10       19.31
1995        105728    2.69120E+10       19.24
1996        135982    2.76267E+10       23.07
1997        159985    4.17000E+10       23.04
1998         79998    4.31000E+10       15.08
1999        104447    4.41000E+10       21.60
2000        214424    4.92840E+10       35.64
2001        183915    5.43000E+10       31.14
2002        166100    4.70370E+10       31.27
2003        231000    6.75000E+10       30.92
2004        330000    6.36500E+10       35.14
2005        504540    7.01000E+10       50.21
2006        604470    8.73000E+10       59.94
2007        562186    9.67000E+10       62.59
2008        983369    1.05000E+11       80.38
2009        434420    1.22100E+11       53.89
2010        670265    1.37600E+11       68.60
2011       1034360    1.50000E+11       88.79
2012       1144818    1.68600E+11       93.06
2013       1035046    2.04000E+11       88.95
2014        913346    2.10000E+11       80.34

APPENDIX B

One-way ANOVA: Agriculture, Education, Engineering, Fine Arts, Humanities, ...
Source    DF          SS         MS     F      P
Factor     8   998821022  124852628  5.06  0.000
Error    297  7322357160   24654401
Total    305  8321178183

S = 4965   R-Sq = 12.00%   R-Sq(adj) = 9.63%

Level              N  Mean  StDev
Agriculture       34    85     38
Education         34   859    895
Engineering       34  4524   8019
Fine Arts         34   185    340
Humanities        34  1847   2142
Law               34   655   1159
Medical Science   34  4490   7337
Natural Science   34  4284   7319
Social Science    34  3459   6584

[Character plot: Individual 95% CIs For Mean Based on Pooled StDev; axis from 0 to 6000]

Pooled StDev = 4965

Grouping Information Using Tukey Method

                   N  Mean  Grouping
Engineering       34  4524  A
Medical Science   34  4490  A
Natural Science   34  4284  A B
Social Science    34  3459  A B C
Humanities        34  1847  A B C
Education         34   859  A B C
Law               34   655    B C
Fine Arts         34   185      C
Agriculture       34    85      C

Means that do not share a letter are significantly different.
Tukey 95% Simultaneous Confidence Intervals
All Pairwise Comparisons

Individual confidence level = 99.79%

Agriculture subtracted from:
                  Lower  Center  Upper
Education         -2965     774   4512
Engineering         700    4439   8177
Fine Arts         -3639      99   3838
Humanities        -1977    1761   5500
Law               -3169     569   4308
Medical Science     666    4405   8143
Natural Science     460    4198   7937
Social Science     -365    3373   7112

Education subtracted from:
                  Lower  Center  Upper
Engineering         -73    3665   7403
Fine Arts         -4413    -674   3064
Humanities        -2751     988   4726
Law               -3943    -204   3534
Medical Science    -107    3631   7369
Natural Science    -314    3425   7163
Social Science    -1138    2600   6338

Engineering subtracted from:
                  Lower  Center  Upper
Fine Arts         -8078   -4339   -601
Humanities        -6415   -2677   1061
Law               -7607   -3869   -131
Medical Science   -3772     -34   3704
Natural Science   -3979    -240   3498
Social Science    -4803   -1065   2673

Fine Arts subtracted from:
                  Lower  Center  Upper
Humanities        -2076    1662   5400
Law               -3268     470   4208
Medical Science     567    4305   8044
Natural Science     361    4099   7837
Social Science     -464    3274   7012

Humanities subtracted from:
                  Lower  Center  Upper
Law               -4930   -1192   2546
Medical Science   -1095    2643   6382
Natural Science   -1301    2437   6175
Social Science    -2126    1612   5350

Law subtracted from:
                  Lower  Center  Upper
Medical Science      97    3835   7574
Natural Science    -109    3629   7367
Social Science     -934    2804   6542

Medical Science subtracted from:
                  Lower  Center  Upper
Natural Science   -3945    -206   3532
Social Science    -4770   -1031   2707

Natural Science subtracted from:
                  Lower  Center  Upper
Social Science    -4563    -825   2913

[Character plots of the intervals omitted; axis from -5000 to 10000]

Grouping Information Using Fisher Method

                   N  Mean  Grouping
Engineering       34  4524  A
Medical Science   34  4490  A
Natural Science   34  4284  A
Social Science    34  3459  A B
Humanities        34  1847    B C
Education         34   859      C
Law               34   655      C
Fine Arts         34   185      C
Agriculture       34    85      C

Means that do not share a letter are significantly different.
Fisher 95% Individual Confidence Intervals
All Pairwise Comparisons

Simultaneous confidence level = 43.41%

Agriculture subtracted from:
                  Lower  Center  Upper
Education         -1596     774   3144
Engineering        2069    4439   6809
Fine Arts         -2271      99   2469
Humanities         -609    1761   4131
Law               -1801     569   2939
Medical Science    2035    4405   6775
Natural Science    1828    4198   6568
Social Science     1003    3373   5743

Education subtracted from:
                  Lower  Center  Upper
Engineering        1295    3665   6035
Fine Arts         -3044    -674   1696
Humanities        -1382     988   3358
Law               -2574    -204   2166
Medical Science    1261    3631   6001
Natural Science    1055    3425   5795
Social Science      230    2600   4970

Engineering subtracted from:
                  Lower  Center  Upper
Fine Arts         -6709   -4339  -1969
Humanities        -5047   -2677   -307
Law               -6239   -3869  -1499
Medical Science   -2404     -34   2336
Natural Science   -2610    -240   2130
Social Science    -3435   -1065   1305

Fine Arts subtracted from:
                  Lower  Center  Upper
Humanities         -708    1662   4032
Law               -1900     470   2840
Medical Science    1935    4305   6675
Natural Science    1729    4099   6469
Social Science      904    3274   5644

Humanities subtracted from:
                  Lower  Center  Upper
Law               -3562   -1192   1178
Medical Science     273    2643   5013
Natural Science      67    2437   4807
Social Science     -758    1612   3982

Law subtracted from:
                  Lower  Center  Upper
Medical Science    1465    3835   6205
Natural Science    1259    3629   5999
Social Science      434    2804   5174

Medical Science subtracted from:
                  Lower  Center  Upper
Natural Science   -2576    -206   2164
Social Science    -3401   -1031   1339

Natural Science subtracted from:
                  Lower  Center  Upper
Social Science    -3195    -825   1545

[Character plots of the intervals omitted; axis from -3500 to 7000]
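The one-way ANOVA summary reported in this appendix can be cross-checked from the per-program N, Mean, and StDev values alone. The following Python sketch is an illustration, not the software used in the thesis; the function name `anova_from_summary` and the dictionary layout are assumptions introduced here. Because the means and standard deviations are rounded as printed, the rebuilt sums of squares agree with the reported output only approximately.

```python
import math

# (N, Mean, StDev) per program, copied from the Level table of the ANOVA output.
# Values are rounded as printed, so the rebuilt table is approximate.
groups = {
    "Agriculture": (34, 85, 38),
    "Education": (34, 859, 895),
    "Engineering": (34, 4524, 8019),
    "Fine Arts": (34, 185, 340),
    "Humanities": (34, 1847, 2142),
    "Law": (34, 655, 1159),
    "Medical Science": (34, 4490, 7337),
    "Natural Science": (34, 4284, 7319),
    "Social Science": (34, 3459, 6584),
}


def anova_from_summary(summary):
    """Rebuild a one-way ANOVA table from per-group (n, mean, sd) values."""
    k = len(summary)
    n_total = sum(n for n, _, _ in summary.values())
    grand_mean = sum(n * m for n, m, _ in summary.values()) / n_total

    # Between-group (factor) and within-group (error) sums of squares.
    ss_factor = sum(n * (m - grand_mean) ** 2 for n, m, _ in summary.values())
    ss_error = sum((n - 1) * s ** 2 for n, _, s in summary.values())

    df_factor = k - 1
    df_error = n_total - k
    ms_factor = ss_factor / df_factor
    ms_error = ss_error / df_error

    return {
        "df_factor": df_factor,
        "df_error": df_error,
        "ss_factor": ss_factor,
        "ss_error": ss_error,
        "f": ms_factor / ms_error,
        "pooled_sd": math.sqrt(ms_error),
        "r_sq": ss_factor / (ss_factor + ss_error),
    }


result = anova_from_summary(groups)
print(f"DF = ({result['df_factor']}, {result['df_error']})")
print(f"F = {result['f']:.2f}")                    # about 5.06
print(f"Pooled StDev = {result['pooled_sd']:.0f}")  # about 4965
print(f"R-Sq = {100 * result['r_sq']:.2f}%")        # about 12.00%
```

The agreement of F, the pooled standard deviation, and R-Sq with the printed output indicates that the Level table and the ANOVA table are mutually consistent. A p-value or the Tukey and Fisher intervals would additionally need F, t, and studentized range quantiles, which are not computed in this sketch.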