The Correlation Between Countries' GDP per Capita and the Number of Vehicles Owned by the Citizens

1. Introduction
2. Data Analysis and Preparation
2.1. Exploratory data analysis
2.2. Summary Statistics
2.3. Outlier Analysis and Box and Whisker Diagrams
3. Correlation Investigation with several methods
3.1. Scatter diagram and line of best fit
3.2. Pearson's product-moment coefficient, r
3.3. Hypothesis testing
3.3.1.1. Proper test selection and defining the hypothesis
3.3.1.2. T-test
4. Conclusion
5. Evaluation
6. Further Research

1. Introduction

As I approach the age at which I can buy my own car, motor cars have started to attract my attention. In the last 10 years I have lived in three different countries, Türkiye, France and Spain, and I witnessed significant differences in car ownership between them. My hypothesis is that in wealthier countries the number of cars purchased by citizens is higher, so I wanted to work on this topic and test my prediction.

In this report, the correlation between GDP per capita and the number of vehicles owned by citizens is investigated. The research is conducted with 20 countries, 10 of them developed and 10 of them developing. My main goal is to find out whether GDP and the number of vehicles owned by citizens are correlated. The dependent variable (y) will be GDP and the independent variable (x) will be the number of vehicles owned by citizens. First, I will try to understand the data and prepare it for analysis. Then, I will apply appropriate methods to find the correlation. Finally, I will define a hypothesis and test it.

I gathered the data via simple random sampling from the Wikipedia list of IMF estimates (List of countries by GDP (nominal) per capita - Wikipedia), which contained all 192 countries in the world.
I chose simple random sampling on purpose, so that the result would not depend on my own choice of countries. To designate 20 countries out of 192, I used the "random()" function in R (a desktop software tool).

To investigate the research question, I will first prepare the data for analysis by performing exploratory data analysis. This step helps before making any decision about the data, so the report begins with the results of the exploratory data analysis. The aims of this part are mainly to understand the data better and to prepare it.

Firstly, exploratory data analysis will be conducted by visualising the data with histograms, which show the distribution and the frequency of the data. If there is an extreme case, we can easily observe it at the beginning. This exercise will be performed separately on the two variables of our data set, GDP and Number of Motor Vehicles.

Secondly, I will produce summary statistics, which are a part of descriptive statistics and are important for summarising the data and obtaining information from it. In this step, I will find the measures of location and the measures of spread. To determine the measures of location, I will use the mean, the median and the quartiles. To determine the measures of spread, I will use the standard deviation and the variance. In addition, I will look at the shape of the distribution to see its skewness.

Lastly, I will identify outliers, if they exist. This is an important data preparation step because outliers may affect the research significantly, so it is better to detect them at the beginning. If there is an outlier, I will remove it from the data set before proceeding to the following step. Using the summary statistics, I will also generate Box and Whisker diagrams to visualise the distribution and the skewness of the data.

After analysing and preparing the data set, the next step will be to find the correlation.
Firstly, I will find the line of best fit to see the relationship between the variables, and I will generate a scatter diagram of the two variables together with the line of best fit. Secondly, I will find the Pearson Product-Moment Correlation Coefficient. Then, I will perform hypothesis testing to see whether its result allows me to conclude that there is a correlation. In this research, my main goal is to find the correlation between GDP and the number of vehicles owned by the citizens.

2. Data Analysis and Preparation

In this section, we try to understand our data set better by applying several analysis approaches. The main goal of this section is to prepare the data for the next steps and to understand the structure of each variable.

Our data set has 20 observations (n = 20) and 4 variables: Country name, GDP, Number of Vehicle and Country condition. Country name labels the data, and each observation has a different value. Country condition is a categorical variable that takes two values, developed and developing. Lastly, GDP and Number of Vehicle are the numeric variables of our data set, and we will mainly focus on these two.

2.1. Exploratory Data Analysis

The most important variables in our data set are GDP and number of vehicles. Therefore, let's first look at their histograms. Histograms help us to identify the centre, the distribution and the symmetry of the data.

Looking at the histogram of GDP, we may say that the distribution of the data is not symmetric. Furthermore, the data has a small right tail, so it is slightly right-skewed. Many of the values are near the lower end of the range, and higher values are infrequent.

Looking at the histogram of the number of vehicles owned by citizens, we may conclude, as for the GDP data, that the distribution is not symmetric. This time, however, the data is left-skewed.
Therefore, the median will be greater than the mean, because the mean is more sensitive than the median to the extreme low values in the left tail. We can check this in the next section by comparing the mean and the median in the summary statistics. Lastly, the blank part on the left side of the histogram may indicate outliers. However, the left rectangle is only one bin apart from the others, so we cannot say this directly, and I will perform an outlier analysis in the following sections.

R Software (skewness coefficient): To support the histogram results, the skewness coefficient for the GDP variable is 0.26, which is positive and therefore indicates right skewness. However, since 0.26 < 0.5, this skewness is small. In other words, the distribution is not symmetric, but its asymmetry is not large. The skewness coefficient for the vehicle variable is -0.53, which is negative, so that distribution is left-skewed.

2.2. Summary Statistics

In this section, we will consider the results of the summary statistics. This step will give us more information about our data. Now, let's look at the table:

                      Min    1st Quartile   Median    Mean     3rd Quartile   Max
GDP                   249    13785          53127     52979    80998          132372
Number of Vehicle     10     335.5          556       494.1    652.5          831

Now we can state our findings from the summary statistics. First of all, for both GDP and number of vehicle, the mean and the median are not equal, which suggests that the distributions are not perfectly symmetric. For the Number of Vehicle variable, the mean (494.1) is less than the median (556), so the distribution of the vehicle variable is left-skewed. The difference between the maximum and the minimum for GDP (132,372 - 249 = 132,123) is bigger than the difference for number of vehicle (831 - 10 = 821). Therefore, we may say that the range of the GDP data is bigger than the range of number of vehicle, and the GDP distribution is more spread out.
The upper quartile (3rd quartile) of the GDP data is 80,998, which means that 75% of all data points are below this value. For number of vehicle, the upper quartile is 652.5. In other words, if we order all the data points, roughly the 15th point would be 80,998 for GDP and 652.5 for number of vehicle, since with 20 observations (n = 20), 75% corresponds to the 15th point in the ranking. In the given data no observation occurs more than once, so there is no mode to add to our summary statistics table.

                      Standard Deviation   Variance
GDP                   40771.59             1662322822
Number of Vehicle     239.8881             57546.31

The standard deviation indicates the typical distance between the data points and the mean. A smaller standard deviation means that the data points lie closer to their mean, whereas a higher standard deviation means that the points are spread out from their mean. In our example, the GDP data is more spread out than number of vehicle. The variance also indicates this distance, but as a squared distance. The more spread out the values in a data set are, the higher the variance.

2.3. Outlier Analysis and Box and Whisker diagrams

To identify the outliers and to construct a box and whisker diagram, we need the minimum value, Quartile 1, the median, Quartile 3 and the maximum value of the data set. These figures are given in the summary statistics table in Section 2.2.

Outlier Analysis for GDP Data

Range for GDP = Max - Min = 132,372 - 249 = 132,123

Mean for GDP: $\bar{X} = \frac{\sum X_i}{n} = \frac{1059583}{20} \approx 52979$

IQR (interquartile range) = Q3 - Q1

IQR for GDP = 80,998 - 13,785 = 67,213

There is an outlier if X < Q1 - 1.5 (IQR) or X > Q3 + 1.5 (IQR).

For the upper boundary: check if X > Q3 + 1.5 (IQR)
Upper boundary for GDP = 80,998 + (1.5 (67,213)) = 181,817.5
There are no outliers in the GDP data set since no values are higher than this upper boundary.

For the lower boundary: check if X < Q1 - 1.5 (IQR)
Lower boundary for GDP = 13,785 - (1.5 (67,213)) = -87,034.5
There are no outliers in the data set since no values are lower than this lower boundary.
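The boundary check above can also be written as a short calculation. The following is only a minimal sketch in Python, not the method used for this report: the function names `tukey_fences` and `has_outliers` are hypothetical, and the inputs are the minimum, maximum, Q1 = 13,785 and Q3 = 80,998 taken from the summary statistics table.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def has_outliers(minimum, maximum, q1, q3):
    """True if the smallest or largest value lies outside the boundaries.

    Checking only the minimum and maximum is enough: if neither extreme
    value is outside the boundaries, no other value can be.
    """
    lower, upper = tukey_fences(q1, q3)
    return minimum < lower or maximum > upper


# GDP values from the summary statistics table (assumed inputs)
gdp_lower, gdp_upper = tukey_fences(13785, 80998)
print(gdp_lower, gdp_upper)                      # -87034.5 181817.5
print(has_outliers(249, 132372, 13785, 80998))   # False
```

The same two functions can be applied to any other variable by passing its own quartiles, minimum and maximum.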
The box plot for GDP gives us results similar to those of the histogram in the previous section. The box is located slightly to the left side of the graph; in other words, the left whisker is shorter than the right one, so we can say that the distribution is not symmetric. The median, 53,126.5, can be easily seen in the graph. That is why box plots are useful for understanding the data better. The box plot for the GDP variable also supports our findings about outliers above. There is no data point located beyond the whiskers, which means there is no indication of outliers.

Outlier Analysis for Number of Vehicle Data

Range for Number of Vehicles = Max - Min = 831 - 10 = 821

Mean for Number of Vehicles: $\bar{X} = \frac{\sum X_i}{n} = \frac{9882}{20} = 494.1$

IQR (interquartile range) = Q3 - Q1

IQR for Number of Vehicles = 652.5 - 335.5 = 317

There is an outlier if X < Q1 - 1.5 (IQR) or X > Q3 + 1.5 (IQR).

For the upper boundary: check if X > Q3 + 1.5 (IQR)
Upper boundary for Number of Vehicles = 652.5 + (1.5 (317)) = 1,128
There are no outliers in the Number of Vehicles data set since no values are higher than this upper boundary.

For the lower boundary: check if X < Q1 - 1.5 (IQR)
Lower boundary for Number of Vehicles = 335.5 - (1.5 (317)) = -140
There are no outliers in the data set since no values are lower than this lower boundary.

The right whisker is shorter than the left whisker, which indicates left skewness. The minimum value (10), the maximum value (831) and the median (556) are also shown in the graph. Lastly, in both box plots there is no data point beyond the whiskers; all points lie between the lower and upper boundaries, which indicates that there are no outliers in either variable. Therefore, the box plots support our findings above.

3. Correlation Investigation with several methods

In this section, we will apply three different methods to understand the relationship between GDP and the number of vehicles owned by citizens.
First we will find the equation of the line of best fit, then we will find the linear correlation coefficient (Pearson), and lastly we will conduct hypothesis testing.

3.1. Line of Best Fit Coefficient

Correlation measures the relationship between two variables. In this part we will see the relationship between GDP and the number of vehicles owned by citizens. To find this coefficient I used a website (Correlation coefficient calculator Pearson and Spearman's rank, with solution (statskingdom.com)) for my calculation.

Line of Best Fit: y = 128.71x - 10618

In the above graph you may see the scatter diagram of the GDP and Number of Vehicle variables. The purple dots represent the points (xi, yi). Here our independent variable x is number of vehicle and our dependent variable y is GDP. We may see how the points are located in the diagram. In addition, the blue line in the graph illustrates the line of best fit. This line helps us to find an equation for regression. If we state the basic line equation as y = αx + β, the coefficient of x is α, which is the slope of the line, and β is the constant term of the equation. In statistics, this equation is commonly used for regression.

For example, our line of best fit equation is y = 128.71x - 10618. Therefore, if someone wants to predict the GDP value for a country with 367 vehicles per 1,000 citizens, the only thing to do is to put x = 367 into our equation. The predicted GDP is then y = 128.71 (367) - 10,618 = 36,618.57.

Besides regression, the line of best fit coefficient also gives us information about correlation. Since the slope is positive, it indicates a positive correlation between GDP and the number of vehicles owned by citizens. Now we have an indication of positive correlation. In order to understand how strong this relationship is, we will find the linear correlation coefficient in the next section.

3.2. Linear Correlation Coefficient (Pearson)

The correlation coefficient takes a value between -1 and 1.
A correlation of -1 indicates a perfect negative correlation and a correlation of +1 indicates a perfect positive correlation. In our case, the calculator gives r = 0.7573, which is positive and close to 1. Therefore, we may conclude that there is a strong positive relationship between GDP and number of vehicle. Furthermore, the coefficient of determination R^2 tells us the proportion of the variance in the dependent variable that is explained by the independent variable. In our case, 57% of the variation in GDP per capita is explained by variation in the number of vehicles owned per 1,000 citizens. We will also use this value to find the p-value under hypothesis testing.

$R^2 = 0.7573^2 = 0.5735$

3.3. Hypothesis testing

Lastly, we will conduct hypothesis testing. The aim of this test is to support our findings in the previous chapter and to investigate our research question in a different way. To do so, we will first decide which type of hypothesis test is suitable for our data. To make this decision, we will use the results of the previous sections.

3.3.1.1. Proper test selection and defining the hypothesis

There are several types of hypothesis test in statistics. Some of them are used to compare the means of samples and some can be used to compare variances. The chi-square test, the t-test and the F-test are the most common types of hypothesis test. In our research we are investigating the existence of correlation. Therefore, we should perform a t-test, since it is a statistical test that can be applied to the correlation coefficient. The correlation coefficient r is tested by this method. Before performing the test, let's write our hypotheses:

H0: r = 0
HA: r is not equal to 0

From the previous section, we know that if r is zero there is no correlation, and if it is close to -1 or 1 there is a strong correlation. Therefore, the null hypothesis (H0) of the t-test states that there is no correlation between the dependent and independent variables (GDP and number of vehicle).
The alternative hypothesis (HA), in contrast, states that there is a significant correlation. Now we are ready to perform the hypothesis test.

3.3.1.2. T-test

To find t we need to know the degrees of freedom and the coefficient of correlation, because:

$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$

We already know these values from the previous sections:

r = 0.7573
Degrees of freedom = n - 2 = 20 - 2 = 18 (the number of observations minus the number of variables)
R^2 = 0.5735

Therefore t = 4.91976 (obtained with a calculator).

Now we need to find the p-value. To find it, we need the probability of observing a value more extreme than t = 4.91976, multiplied by two since we are performing a two-sided test. For this we may use a t-table or a t-test calculator. Using a calculator, our p-value is 0.00011.

If we choose a 0.05 significance level (α = 0.05), our test result can be stated as follows:

- p < α, so we have the right to reject the null hypothesis.
- The null hypothesis is r = 0, so rejecting it indicates a non-zero r.
- A non-zero r means that there is a correlation between GDP per capita and the number of vehicles owned per 1,000 citizens.

4. Conclusion

In this research I investigated the correlation between GDP and number of vehicle. In section 2, I applied some exploratory data analysis methods and obtained summary statistics to understand the data better. I also carried out an outlier analysis in order to obtain a better result. The goal of section 2 was to prepare the data for correlation analysis, and the results obtained there helped me in the rest of the analysis. In section 3, I investigated the relationship. Firstly, I found the line of best fit between the GDP and vehicle variables, and its positive slope indicated to me that there is a positive relationship. Secondly, to see this relationship better I calculated the linear correlation coefficient r, which is 0.7573.
A correlation coefficient bigger than 0.5 helped me to conclude that there is a strong positive correlation between GDP and number of vehicle. Lastly, I conducted a hypothesis test in order to reject the null hypothesis and to support the indication of correlation. The hypothesis test gave a similar result by producing a small p-value. Therefore, after my investigation I can conclude that there is a strong positive correlation between GDP and the number of vehicles owned by citizens.

5. Evaluation

Limitations

Strengths

6. Further Research

In this paper, I worked with 20 randomly selected observations, and I only used two variables from our data set, GDP and number of vehicle. For further research, we may divide our data set into two using the categorical variable in our data set, the status of the countries as "developed" or "developing". By conducting similar work separately on developed countries and developing countries, we may obtain two different results.
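
Repeating the analysis on each group would mean running the correlation t-test again with a different r and a smaller n. As a minimal sketch of how that test could be reproduced without a website, the Python code below computes t from r and n and then estimates the two-sided p-value by numerically integrating the t density. The function names are hypothetical, Simpson's rule is an assumption of this sketch and not the method used by the calculator in this report, and the inputs are the full-sample values r = 0.7573 and n = 20.

```python
import math


def t_statistic(r, n):
    """t statistic for testing H0: r = 0 with n paired observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """Two-sided p-value from Simpson's rule on the interval [0, |t|].

    steps must be even for Simpson's rule.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        # Simpson weights: 4 for odd interior points, 2 for even ones
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3      # P(0 < T < |t|)
    return 1 - 2 * central_area       # area in both tails together


r, n = 0.7573, 20                     # full-sample values from Section 3
t = t_statistic(r, n)                 # about 4.92
p = two_sided_p_value(t, n - 2)
print(f"t = {t:.4f}, p = {p:.5f}")
```

For a subgroup of 10 countries, the same functions would be called with that group's r and n = 10, which gives 8 degrees of freedom.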