Uploaded by Duru

GDP Correlation Report v01 12092023-SON-SON[15]

The Correlation Between Countries’ GDP Per
Capita and the Number of Vehicles Owned by
the Citizens
1. Introduction
2. Data Analysis and preparation
2.1.Exploratory data analysis
2.2.Summary Statistics
2.3.Outlier Analysis and Box and Whisker Diagrams
3. Correlation Investigation with several methods
3.1. Scatter diagram and line of best fit
3.2.Pearson’s product-moment coefficient, r
3.3.Hypothesis testing
3.3.1.1. Proper test selection and defining the hypothesis
3.3.1.2. T-test
4. Conclusion
5. Further Research
1. Introduction
As I approach the age at which I can buy my own car, motor cars have started
to attract my attention. In the last 10 years I have lived in 3 different
countries, Turkiye, France and Spain, and I have witnessed significant
differences in car ownership between these countries. My hypothesis is that
in wealthier countries the number of cars owned by the citizens is higher, and
I wanted to work on this topic and test my prediction.
In this report, the correlation between GDP per capita and the number of
vehicles owned by the citizens is investigated. This research is conducted
with 20 different countries, 10 of which are developed and 10 of which are
developing countries. My main goal is to find the correlation between GDP and
the number of vehicles owned by the citizens. The dependent variable (y) will
be GDP and the independent variable (x) will be the number of vehicles owned
by citizens.
In this research, I will first try to understand the data and prepare it for
analysis. Then, I will apply suitable methods to find the correlation.
Finally, I will define a hypothesis and test it.
I gathered the data via simple random sampling from the Wikipedia list of IMF estimates
(List of countries by GDP (nominal) per capita - Wikipedia), which contained 192
countries. I used simple random sampling on purpose, so that the result would not
depend on my own choice of countries. To select 20 countries from the 192, I used
the “random()” function in R (a desktop software).
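The selection step can be sketched in code. The Python sketch below is only an illustration and is not the R command used for this report: the function name `sample_country_ranks` and the `seed` parameter are my own assumptions, and each country is represented by its row number (1 to 192) in the list.

```python
import random


def sample_country_ranks(population_size=192, sample_size=20, seed=None):
    """Draw a simple random sample of row numbers without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    # Every row has the same chance of selection and no row is drawn twice.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


if __name__ == "__main__":
    print(sample_country_ranks(seed=1))
```

Fixing the seed is a design choice that would let a reader reproduce the same 20 rows; without it, each run gives a different sample.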
In order to investigate the research question, I will first prepare the data
for analysis by performing exploratory data analysis. This step will help me
before making any decision about the data. Therefore, in this research you
will first find the results of the exploratory data analysis. The aims of this
part are mainly to understand the data better and to prepare it. Firstly,
exploratory data analysis will be conducted by visualising the data with
histograms, where we can see the distribution and the frequencies of the data.
If there is an extreme case, we can easily observe it at the beginning. This
exercise will be performed separately on the two variables of our data set,
GDP and Number of Motor Vehicles.
Secondly, I will compute summary statistics, which are a part of descriptive
statistics and are important for summarizing the data and obtaining
information from it. In this step, I will find the measures of location and
the measures of spread. As measures of location, I will use the mean, the
median and the quartiles. As measures of spread, I will use the standard
deviation and the variance. In addition, I will also look at the shape of the
distribution to see the skewness.
Lastly, I will identify outliers, if they exist. This is an important data
preparation step because outliers may affect the research significantly.
Therefore, it is better to detect them at the beginning. If there is an
outlier, I will remove it from our data set before proceeding to the following
step. By using the summary statistics, I will also generate Box and Whisker
diagrams to visualize the distribution and the skewness of the data.
After analysing and preparing the data set, the next step will be to find the
correlation. Firstly, I will find the line of best fit to see the relationship
between the variables, and I will generate a Scatter Diagram of the two
variables together with the line of best fit.
Secondly, I will find the Pearson Product-Moment Correlation Coefficient.
Then, I will perform hypothesis testing to see whether its result supports the
existence of a correlation or not.
In this research, my main goal is to find correlation between GDP and the
number of vehicles owned by the citizens.
2. Data Analysis and Preparation
In this section, we try to understand our data set better by applying several
analysis approaches. The main goal of this section is to prepare the data for
the next steps and to understand the structure of each variable. Our data set
has 20 observations (n=20) and 4 variables. These variables are Country name,
GDP, number of vehicle and Country condition. Of these 4 variables, Country
name labels the data, and each observation has a different value. Country
condition is a categorical variable that takes two values, developed and
developing. Lastly, GDP and number of vehicle are the numeric variables of our
data set, and we will mainly focus on these two variables.
2.1.Exploratory Data Analysis
From our GDP data set, the most important variables for us are GDP and
number of vehicles. Therefore, first let's look at their histograms. Histograms
help us to identify the center, the distribution and the symmetry of the data.
When we first look at the histogram of GDP, we may say that the distribution
of the data is not symmetric. Furthermore, the data has a small right tail,
therefore it is slightly right-skewed. Many of the values are near the lower
end of the range, and higher values are infrequent.
Looking at the histogram of the number of vehicles owned by citizens, we may
again conclude that the distribution is not symmetric. This time the data is
left-skewed. For left-skewed distributions the median is typically greater
than the mean, because the mean is more sensitive to extreme values and is
pulled down by the few low values in the left tail. We can check this in the
next section by comparing the mean and the median in the summary statistics.
Lastly, the gap on the left side of the histogram may indicate outliers.
However, the leftmost bar is only one bin away from the others, so we cannot
conclude this directly, and I will perform an outlier analysis in the
following sections.
R Software (skewness coefficient):
To support the histogram results, the skewness coefficient for the GDP variable
is 0.26, which is positive and therefore indicates right skewness. However,
since 0.26 < 0.5, this skewness is small. In other words, the distribution is
not symmetric, but its asymmetry is slight. The skewness coefficient for the
vehicle variable is -0.53, which is negative; therefore, the distribution is
left-skewed.
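To show how the sign of a skewness coefficient relates to the tails, here is a minimal Python sketch. It uses the plain moment-based formula (third central moment divided by the second central moment to the power 1.5), which is an assumption on my part: the R function used for the report may apply a different, bias-adjusted formula. The data below are hypothetical toy values, not the report's data set.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical toy data, not the report's data set.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
left_tailed = [-v for v in right_tailed]

print(skewness(right_tailed) > 0)  # a long right tail gives a positive value
print(skewness(left_tailed) < 0)   # the mirrored data gives a negative value
```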
2.2.Summary Statistics
In this section, we will consider the results of summary statistics. This step will
give us more information about our data. Now, let's look at the table;
                  GDP       Number of Vehicle
Min               249       10
1st Quartile      13785     335.5
Median            53127     556
Mean              52979     494.1
3rd Quartile      80998     652.5
Max               132372    831
Now we can state our findings from the summary statistics:
- For both variables, GDP and number of vehicle, the mean and the median are
  not equal, which suggests that the distributions are not symmetric.
- For the Number of Vehicle variable, the mean (494.1) is less than the
  median (556); therefore, the distribution of the vehicle variable is
  left-skewed.
- The difference between the maximum and the minimum for GDP (132,372 – 249)
  is bigger than the difference for number of vehicle (831 – 10). Therefore,
  we may say that the range of the GDP data is bigger than the range for
  number of vehicle. In absolute terms, the GDP distribution is more spread
  out.
- The upper quartile (3rd quartile) of the GDP data is 80,998, which means
  that 75% of all data points are below this value. For number of vehicle it
  is 652.5. In other words, if we order all the data points, the value at
  roughly the 15th position is 80,998 for GDP and 652.5 for number of
  vehicle, since with 20 observations (n = 20) 75% of the data corresponds to
  about the 15th point in the ranking.
- In the given data no observation occurs more than once, so there is no
  mode. Hence, there is no need to add it to our summary statistics table.
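The quartile findings above can be illustrated with a short Python sketch. The values are hypothetical toy data (20 evenly spaced numbers), not the report's data set, and the choice of `method="inclusive"` is an assumption about how the quartiles are interpolated; other software or settings may give slightly different quartiles.

```python
import statistics

# Hypothetical toy data: 20 ordered values, not the report's data set.
values = list(range(10, 210, 10))  # 10, 20, ..., 200

# method="inclusive" interpolates linearly between neighbouring ordered points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(min(values), q1, median, q3, max(values))
```

With n = 20 and this interpolation rule, the third quartile falls between the 15th and 16th ordered values, which is why a quartile can be a number such as 652.5 that need not appear in the data.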
                     GDP           Number of Vehicle
Standard Deviation   40771.59      239.8881
Variance             1662322822    57546.31
- Standard deviation measures the typical distance between the data points
  and the mean. A smaller standard deviation means that the data points lie
  closer to their mean, whereas a higher standard deviation means that the
  points are spread out from their mean. In our example, the GDP data is
  more spread out than number of vehicle.
- Variance also measures this distance, but as a squared distance: it is the
  square of the standard deviation. The more spread out the values in a data
  set are, the higher the variance.
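As a quick consistency check on the two measures of spread, the sketch below confirms that each reported standard deviation is the square root of the reported variance. The toy list is hypothetical, and the use of the sample formulas (dividing by n - 1) is an assumption, since the report does not state which formula its software applied.

```python
import math
import statistics

# Figures reported in the table above: (standard deviation, variance).
reported = {
    "GDP": (40771.59, 1662322822),
    "Number of Vehicle": (239.8881, 57546.31),
}

for name, (sd, variance) in reported.items():
    # The standard deviation should equal the square root of the variance.
    print(name, round(math.sqrt(variance), 2), round(sd, 2))

# Hypothetical toy data showing the same relationship on raw values.
toy = [2, 4, 4, 4, 5, 5, 7, 9]
toy_variance = statistics.variance(toy)  # sample variance, divides by n - 1
toy_sd = statistics.stdev(toy)           # sample standard deviation
print(toy_variance, toy_sd)
```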
2.3.Outlier Analysis and Box and Whisker diagrams
To identify the outliers and to construct a box and whisker diagram, we need
the minimum value, Quartile 1, the median, Quartile 3 and the maximum value
of the data set. These figures are given in the summary statistics table in
Section 2.2 above.
Outlier Analysis for GDP Data
Range for GDP = Max-Min Values = 132,372 – 249 = 132,123
Mean for GDP: $\bar{X} = \frac{\sum X_i}{n} = \frac{1059583}{20} \approx 52979$
IQR (Interquartile range)= 𝑸𝟑 -𝑸𝟏
IQR for GDP = 80998 – 13785 = 67213
There is an outlier if X < Q1 – 1.5 (IQR) or X > Q3 + 1.5 (IQR).
For the upper boundary: check if X > Q3 + 1.5 (IQR)
Upper boundary for GDP = 80,998 + (1.5 (67,213)) = 181,817.5
There are no outliers in the GDP data set since no values are higher than this
upper boundary (the maximum is 132,372).
For the lower boundary: check if X < Q1 – 1.5 (IQR)
Lower boundary for GDP = 13,785 – (1.5 (67,213)) = –87,034.5
There are no outliers in the data set since no values are lower than this lower
boundary.
The box plot above gives results similar to those of the histogram in the previous section.
The box is located slightly to the left of the graph; in other words, the left whisker is
shorter than the right one, so we can say that the distribution is not symmetric. The median,
53,126.5, can be easily seen in the graph. That is why box plots are useful for understanding
the data better. The box plot for the GDP variable also supports our findings about outliers
above: there is no data point located beyond the whiskers, so there is no indication of outliers.
Outlier Analysis for Number of Vehicle Data
Range for Number of Vehicles = Max-Min Values = 831 – 10 = 821
Mean for Number of Vehicles: $\bar{X} = \frac{\sum X_i}{n} = \frac{9882}{20} = 494.1$
IQR (Interquartile range)= 𝑸𝟑 -𝑸𝟏
IQR for Number of Vehicles = 652.5 – 335.5 = 317
There is an outlier if X < Q1 – 1.5 (IQR) or X > Q3 + 1.5 (IQR).
For the upper boundary: check if X > Q3 + 1.5 (IQR)
Upper boundary for Number of Vehicles = 652.5 + (1.5 (317)) = 1128
There are no outliers in the Number of Vehicles data set since no values are
higher than this upper boundary.
For the lower boundary: check if X < Q1 – 1.5 (IQR)
Lower boundary for Number of Vehicles = 335.5 – (1.5 (317)) = –140
There are no outliers in the data set since no values are lower than this lower
boundary.
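The boundary calculations for both variables can be reproduced with a small Python sketch. The function names `tukey_fences` and `has_outliers` are my own; the inputs are the minimum, maximum and quartiles exactly as listed in the summary statistics table of Section 2.2.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def has_outliers(minimum, maximum, q1, q3):
    """If the extremes are inside the boundaries, every other point is too."""
    lower, upper = tukey_fences(q1, q3)
    return minimum < lower or maximum > upper


# Minimum, maximum and quartiles taken from the summary statistics table.
print(tukey_fences(13785, 80998), has_outliers(249, 132372, 13785, 80998))
print(tukey_fences(335.5, 652.5), has_outliers(10, 831, 335.5, 652.5))
```

Only the minimum and maximum need to be compared with the boundaries, because all other observations lie between them.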
The right whisker is shorter than the left whisker, which indicates left
skewness. Also, the minimum value (10), the maximum value (831) and the
median (556) are shown in the graph.
Lastly, in both box plots there is no data point beyond the whiskers; all
points lie between the lower and upper boundaries, which indicates that there
are no outliers in either variable. Therefore, the box plots support our
findings above.
3. Correlation Investigation with several methods
In this section, we will perform three different methods to understand the
relationship between GDP and the number of vehicles owned by citizens. First
we will find the line of best fit equation, later we will find the linear correlation
coefficient (Pearson) and lastly we will conduct hypothesis testing.
3.1. Line of Best Fit Coefficient
Correlation measures the relationship between two variables. In this part
we will see the relationship between GDP and the number of vehicles owned by
citizens. To find the line of best fit I used a website for my calculation
(Correlation coefficient calculator, Pearson and Spearman's rank, with solution (statskingdom.com)).
Line of Best Fit y = 128.71x – 10618
In the above graph you may see the scatter diagram of the GDP and number
of vehicle variables. The purple dots represent the points (xi, yi). Here
our independent variable x is number of vehicle and the dependent
variable y is GDP. We may see how they are located in the diagram. In
addition, the blue line in the graph shows the line of best fit. This line
helps us to find an equation for regression.
If we state the basic line equation as y = αx + β, the coefficient of x,
α, is the slope of this line, and β is the constant term of the
equation.
In statistics, this equation is commonly used for regression. For
example, our line of best fit equation is y = 128.71x – 10618.
Therefore, if someone wants to predict the GDP value for a country with
367 vehicles per 1,000 citizens, the only thing to do is to put x = 367
into our equation. One will then find the predicted GDP as
y = 128.71*(367) – 10,618 = 36,618.57.
Besides regression, the slope of the line of best fit also gives us
information about the correlation. Since the slope is positive, it
indicates a positive correlation between GDP and the number of vehicles
owned by citizens.
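The prediction step can be written as a short Python sketch using the reported equation y = 128.71x – 10618. The function names are my own. As an additional check, the sketch recomputes the slope from the identity slope = r · s_y / s_x, using the correlation coefficient reported in Section 3.2 and the standard deviations from Section 2.2; small differences from the reported coefficients are expected because those inputs are rounded.

```python
SLOPE = 128.71       # slope of the reported line of best fit
INTERCEPT = -10618   # constant term of the reported line of best fit


def predict_gdp(vehicles_per_1000):
    """Predicted GDP from y = 128.71x - 10618."""
    return SLOPE * vehicles_per_1000 + INTERCEPT


def slope_from_summary(r, sd_x, sd_y):
    """Least-squares slope expressed through r and the standard deviations."""
    return r * sd_y / sd_x


print(round(predict_gdp(367), 2))
print(round(slope_from_summary(0.7573, 239.8881, 40771.59), 2))
```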
Now we have an indication of a positive correlation. In order to
understand how strong this relationship is, we will find the linear
correlation coefficient in the next section.
3.2.Linear Correlation Coefficient (Pearson)
The correlation coefficient takes values between -1 and 1. A correlation
of -1 indicates a perfect negative correlation and a correlation of +1
indicates a perfect positive correlation. You may see the results as:
The coefficient is 0.7573.
In our case, r = 0.7573, which is positive and close to 1. Therefore, we
may conclude that there is a strong positive relationship between
GDP and number of vehicle.
Furthermore, the coefficient of determination $R^2$ tells us the proportion of the
variance in the dependent variable that is explained by the independent variable. In
our case, 57% of the variation in GDP per capita is explained by variation in the
number of vehicles owned per 1,000 citizens. We will also use this value to find the
p-value in the hypothesis testing.
$R^2 = 0.7573^2 \approx 0.5735$
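For readers who want to see how the coefficient is calculated, the following Python sketch implements the product-moment formula r = S_xy / sqrt(S_xx · S_yy). The two lists are hypothetical toy values, not the report's data set, and the function name `pearson_r` is my own. The last line squares the reported r to recover the coefficient of determination.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical toy data, not the report's data set.
vehicles = [100, 250, 400, 550, 700]
gdp = [5000, 20000, 30000, 60000, 75000]

r = pearson_r(vehicles, gdp)
print(round(r, 4), round(r ** 2, 4))

# Coefficient of determination from the reported r = 0.7573.
print(round(0.7573 ** 2, 4))
```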
3.3.Hypothesis testing
Lastly, we will conduct hypothesis testing. The aim of this test is to support
our findings in the previous sections and to investigate our research question
in a different way. To do so, we will first decide which type of hypothesis
test is appropriate for our data. To make this decision, we will use the
results of the previous sections.
3.3.1.1. Proper test selection and defining the hypothesis
There are several types of hypothesis testing in statistics. Some
of them are used to compare the means of samples and some can
be used to compare variances. The chi-square test, the t-test and
the F-test are the most common types of hypothesis testing.
In our research we are investigating the existence of correlation.
Therefore, we should perform a t-test, since it is a statistical
test that can be applied to the correlation coefficient.
The correlation coefficient r is tested by this method. Before
performing the test, let's write our hypotheses first:
H0: r = 0
HA: r is not equal to 0
From the previous section, we know that if r is zero there is no
correlation, and if it is close to -1 or 1 it indicates a strong
correlation.
Therefore, the null hypothesis (H0) of the t-test states that there
is no correlation between the dependent and independent variables
(GDP and number of vehicle), whereas the alternative
hypothesis (HA) states that there is a correlation. Now
we are ready to perform the hypothesis test.
3.3.1.2. T-test
To find t we need to know the degrees of freedom and the
coefficient of correlation, because:
$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
We already know these values from the previous sections:
r = 0.7573
Degrees of freedom = n - 2 = 20 - 2 = 18
(the number of observations minus the number of variables)
R^2 = 0.5735
Therefore t=4.91976 (by GDC ? or website)
Now we need to find the p-value. This is the probability of
observing a value more extreme than t = 4.91976, multiplied by
two since we are performing a two-sided test. To find it we may
use a t-table or a t-test calculator. Using a calculator, our
p-value is 0.00011.
In this test, if we choose a 0.05 significance level (α = 0.05), our
test result can be stated as follows:
- p < α, so we have the right to reject the null
hypothesis.
- The null hypothesis is r = 0, and rejecting it indicates a
non-zero r.
- A non-zero r leads to the conclusion that there is a
correlation between GDP per capita and the number of vehicles
owned per 1,000 citizens.
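The test can be checked with a short Python sketch that uses only the standard library. It assumes the usual test statistic for a correlation coefficient, t = r·sqrt(n - 2)/sqrt(1 - r²), and obtains the two-sided p-value by numerically integrating the Student's t density with Simpson's rule. This is a stand-in for the calculator used in the report, and the function names are my own.

```python
import math


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2) for testing a correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density from 0 to |t|."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * area


t = t_statistic(0.7573, 20)
p = two_sided_p(t, 18)
print(round(t, 3), p < 0.05)
```

Because r is rounded to four decimals here, the computed t differs from the reported 4.91976 only in the later decimal places, and the decision at the 0.05 level is unaffected.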
4. Conclusion
In this research I investigated the correlation between GDP and number of
vehicle. In section 2, I applied some exploratory data analysis methods and
obtained summary statistics to understand the data better. I also carried out
an outlier analysis in order to obtain a better result. The goal of section 2
was to prepare the data for the correlation analysis. The results obtained in
that section helped me in the rest of the analysis.
In section 3, I investigated the relationship. Firstly, I found the line of
best fit between the GDP and vehicle variables, and its positive slope
indicated that there is a positive relationship. Secondly, to see this
relationship better, I calculated the linear correlation coefficient r, which
is 0.7573. Having a correlation coefficient bigger than 0.5 helped me to
conclude that there is a strong positive correlation between GDP and number of
vehicle. Lastly, I conducted a hypothesis test in order to reject the null
hypothesis and to support the indication of correlation. The hypothesis test
gave a similar result, with a small p-value. Therefore, after my investigation
I can conclude that there is a strong positive correlation between GDP and the
number of vehicles owned by citizens.
5. Evaluation
Limitations
Strength
6. Further Research
In this paper, I worked with 20 randomly selected observations. Also, I only
used two variables from our data set, GDP and number of vehicle. For further
research, we may divide our data set into two by using the categorical
variable in our data set, the status of the countries, namely “developed” and
“developing”.
By conducting similar analyses separately for developed countries and
developing countries, we may obtain two different results.