Causality Analysis Between Earthquake
Magnitudes and Seismic Stations
Ercan Gök
5 January 2021
Introduction
Earthquakes are an unavoidable part of life and can cause lasting social trauma in the
communities and countries they strike. To limit their negative effects, we need to take
precautions and, where possible, anticipate earthquakes before they occur.
Earthquakes are recorded by a seismographic network. Each seismic station in the network
measures the movement of the ground at its site. The slip of one block of rock over
another in an earthquake releases energy that makes the ground vibrate. That vibration
pushes the neighboring piece of ground and causes it to vibrate in turn, so the energy travels
outward from the earthquake hypocenter as a wave. Magnitude is the most common measure of an
earthquake's size. It measures the size of the earthquake source and is the same number
no matter where you are or what the shaking feels like. We want to find out whether there is a
causal relationship between an earthquake's magnitude and the number of stations that report it.
To analyze this, we use linear regression: first a simple linear regression model, and then a
multiple linear regression model with some robustness checks.
Data
Description: Our data set gives the locations of 1000 seismic events of MB > 4.0. The events
occurred in a cube near Fiji since 1964. The variables are:
- lat: latitude of the event
- long: longitude of the event
- depth: depth (km)
- mag: Richter magnitude
- stations: number of stations reporting
##data set, variables
data(quakes)
head(quakes)
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12
summary(quakes)
##       lat              long           depth            mag
##  Min.   :-38.59   Min.   :165.7   Min.   : 40.0   Min.   :4.00
##  1st Qu.:-23.47   1st Qu.:179.6   1st Qu.: 99.0   1st Qu.:4.30
##  Median :-20.30   Median :181.4   Median :247.0   Median :4.60
##  Mean   :-20.64   Mean   :179.5   Mean   :311.4   Mean   :4.62
##  3rd Qu.:-17.64   3rd Qu.:183.2   3rd Qu.:543.0   3rd Qu.:4.90
##  Max.   :-10.72   Max.   :188.1   Max.   :680.0   Max.   :6.40
##     stations
##  Min.   : 10.00
##  1st Qu.: 18.00
##  Median : 27.00
##  Mean   : 33.42
##  3rd Qu.: 42.00
##  Max.   :132.00
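Before modelling, it can help to confirm the basic structure of the data described above. The following is a minimal sketch using only base R; it assumes nothing beyond the built-in quakes data set.

```r
# Quick structural checks on the built-in quakes data (base R only).
data(quakes)

dim(quakes)               # number of rows and columns
str(quakes)               # variable names and types
colSums(is.na(quakes))    # count of missing values per column
range(quakes$mag)         # smallest and largest recorded magnitude
```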
Theoretical Framework
In a linear regression model, we are interested in whether there is a linear relationship
between a predictor variable (in our case, "mag") and a response variable (in our case,
"stations"). Using our data set, we examine whether there is a linear relationship
between an earthquake's magnitude and the number of stations that reported it. The
intuition is that as the magnitude of a quake changes, the number of stations that
report it changes in a predictable way. To get a better feel for the quakes data, we start
with a scatter plot. A scatter plot shows the general shape of the data and hints at what
the relationship between the magnitude and stations variables might be.
attach(quakes)
plot(jitter(mag, amount = 0.05), stations,
pch = 20,
ylab = "# of Stations Reporting",
xlab = "Magnitude",
main = "Fiji Earthquakes Magnitude and Reporting",
col = rgb(0.1, 0.2, 0.8, 0.3))
[Figure: scatter plot "Fiji Earthquakes Magnitude and Reporting"; x-axis: Magnitude; y-axis: # of Stations Reporting]
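The strength of the linear association seen in the scatter plot can also be summarised with a single number. The sketch below computes the Pearson correlation between magnitude and the number of reporting stations; the variable name r is only illustrative, and the data are referenced through quakes$ so the snippet does not depend on attach().

```r
data(quakes)

# Pearson correlation between magnitude and number of reporting stations
r <- cor(quakes$mag, quakes$stations)
r

# Same quantity with a confidence interval and a test of zero correlation
cor.test(quakes$mag, quakes$stations)
```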
The scatter plot shows that magnitude and the number of stations move together. It gives us
an initial idea of the relationship between the magnitude of an earthquake and the number of
stations that report it: generally speaking, as magnitude rises, the number of stations
reporting increases. We give our model a name that we can refer to later,
quake.linear.regression, and regress the number of stations on magnitude:
quake.linear.regression <- lm(stations ~ mag)
quake.linear.regression
##
## Call:
## lm(formula = stations ~ mag)
##
## Coefficients:
## (Intercept)          mag
##     -180.42        46.28
$$\widehat{\text{Stations}} = -180.42 + 46.28 \times \text{Magnitude}$$
Our linear regression model provides a numerical relationship, based on our sample, between
the magnitude of an earthquake and the number of stations that reported it. From the slope
coefficient, we estimate that a 1-unit increase on the Richter scale is associated, on
average, with 46.28 more reporting stations. Because the slope is positive, the model
estimates a positive association between magnitude and the number of reporting stations.
The intercept tells us that an earthquake of magnitude zero would be reported by -180.42
stations. This value has no physical meaning: a station count cannot be negative, and
magnitude zero lies far outside our data, where every event has a magnitude of at least 4.
Starting from the scatter plot above, we can add the regression line by calling abline with
the intercept and slope coefficients.
plot(jitter(mag, amount = 0.05), stations,
pch = 20,
ylab = "# of Stations Reporting",
xlab = "Magnitude",
main = "Fiji Earthquakes Magnitude and Reporting",
col = rgb(0.1, 0.2, 0.8, 0.3))
abline(-180.42, 46.28, col="red", lwd = 2)
[Figure: scatter plot "Fiji Earthquakes Magnitude and Reporting" with the fitted regression line in red; x-axis: Magnitude; y-axis: # of Stations Reporting]
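The fitted equation can be used to predict the number of reporting stations for a given magnitude, and the line can be drawn from the model object rather than from rounded coefficients. The following is a sketch under stated assumptions: the names fit, new_quake, pred, and by_hand are illustrative, the magnitude 5.0 is an arbitrary example value, and the model is refitted with data = quakes so the snippet is self-contained.

```r
data(quakes)
fit <- lm(stations ~ mag, data = quakes)   # same model as above, fitted with data=

# Predicted number of reporting stations for a magnitude 5.0 event
new_quake <- data.frame(mag = 5.0)
pred <- predict(fit, newdata = new_quake)
pred

# The same value computed by hand from the estimated intercept and slope
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["mag"]] * 5.0
by_hand

# Draw the fitted line directly from the model object
plot(stations ~ mag, data = quakes, pch = 20,
     xlab = "Magnitude", ylab = "# of Stations Reporting")
abline(fit, col = "red", lwd = 2)
```

Passing the model to abline avoids retyping the coefficients, so the plotted line cannot drift from the fitted one through rounding.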
We can see that the regression line follows the data fairly well. But this positive relationship
could be correlation rather than causation. Before drawing any conclusions, I check the
assumptions of the linear regression model (linearity, homoscedasticity, normally distributed
errors, and independent errors) to make sure I am only using this model under appropriate
circumstances. Homoscedasticity means that the variance of our residuals is constant across all
earthquake magnitudes. Put another way, the variance of our residuals is independent of our
predictor variable.
plot(mag, stations,
pch = 20,
ylab = "# of Stations Reporting",
xlab = "Magnitude",
main = "Fiji Earthquakes Magnitude and Reporting",
col = rgb(0.1, 0.2, 0.8, 0.3))
abline(-185, 56, col= "green", lwd = 2)
abline(-270, 56, col= "green", lwd = 2)
[Figure: scatter plot "Fiji Earthquakes Magnitude and Reporting" with two parallel green reference lines; x-axis: Magnitude; y-axis: # of Stations Reporting]
Comparing the spread of the data, the variation appears relatively constant across
the plot except at the lowest earthquake magnitudes.
The primary tool for checking the variance of the residuals is a residual plot with the model's
fitted values on the X-axis and the residuals on the Y-axis. Compared with the scatter plot,
the residual plot shows changes in the variance of the residuals across magnitudes more clearly.
I stored the residuals (quake.residuals) and the fitted values (quake.fitted.values) in new
objects and then constructed the residual plot with a horizontal line at Y = 0 as a reference
for the variation in the residuals.
quake.residuals <- quake.linear.regression$residuals
quake.fitted.values <- quake.linear.regression$fitted.values
# The x-axis shows fitted values (predicted station counts), so it is labelled accordingly
plot(quake.fitted.values, quake.residuals,
pch = 20,
xlab= "Fitted Values (# of Stations)",
ylab= "Residual",
main= "Residual Plot",
col = rgb(0.1, 0.2, 0.8, 0.3))
abline(h = 0, col="brown", lwd = 2.5)
[Figure: "Residual Plot"; x-axis: fitted values (number of stations); y-axis: Residual; horizontal reference line at zero]
Comparing the variation at different magnitudes, the residual variation appears slightly
lower for magnitudes 4.0-4.25 than for magnitudes greater than 4.25. While the
homoscedasticity assumption is not perfectly met, the data are reasonable enough to let us
proceed without disregarding our model's findings.
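The visual comparison of residual spread across magnitudes can be backed up with a simple numerical summary. The sketch below groups the events into magnitude bins and computes the standard deviation of the residuals in each bin; the bin edges and the names fit, mag_bin, spread, and counts are illustrative choices, not part of the original analysis.

```r
data(quakes)
fit <- lm(stations ~ mag, data = quakes)

# Group events into magnitude bins (the bin edges are an illustrative choice)
mag_bin <- cut(quakes$mag,
               breaks = c(4, 4.25, 4.5, 5, 5.5, 6.5),
               include.lowest = TRUE)

# Residual standard deviation and number of events within each bin
spread <- tapply(resid(fit), mag_bin, sd)
counts <- table(mag_bin)
spread
counts
```

Roughly similar standard deviations across bins would be consistent with homoscedasticity, while a clear trend would suggest the variance changes with magnitude. Bins with few events give unstable estimates, which is why the counts are printed alongside.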
The third assumption for linear models is that our residuals follow a normal distribution.
We check the following characteristics of the normal distribution:
- Bell-shaped curve
- Majority of data within one and two standard deviations of the mean
- Symmetrical in shape
The standard deviation of the residuals is:
sd(quake.residuals)
## [1] 11.49485
An easy way to see whether our residuals follow a normal distribution is to show them in
a histogram.
hist(quake.residuals, breaks=25,
xlab="Residual Value",
ylab="Frequency",
main="Histogram of Residuals",
col="blue")
[Figure: "Histogram of Residuals"; x-axis: Residual Value; y-axis: Frequency]
We can see from the histogram above, together with the residual standard deviation of
approximately 11.5, that about 99 percent of our data points fall within two standard
deviations.
The fourth assumption of our linear regression model is that each error term is independent
of the others. For our data set, there is little reason to believe that the residual number
of stations reporting one earthquake would depend on the residual for another earthquake.
Based on our basic knowledge of earthquake reporting, we therefore assume that independence
between residuals holds and continue with our linear model.
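The normality and independence checks discussed above can also be carried out numerically rather than read off a plot. The following is a sketch in base R: it computes the share of residuals within one and two standard deviations of zero, draws a normal Q-Q plot, and computes the lag-1 correlation of the residuals. The lag-1 check assumes that the row order of quakes is meaningful (for example, chronological), which the data description does not state; the variable names are illustrative.

```r
data(quakes)
fit <- lm(stations ~ mag, data = quakes)
res <- resid(fit)
s <- sd(res)

# Share of residuals within one and two standard deviations of zero
within_1sd <- mean(abs(res) <= s)
within_2sd <- mean(abs(res) <= 2 * s)
c(within_1sd = within_1sd, within_2sd = within_2sd)

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(res, pch = 20)
qqline(res, col = "red", lwd = 2)

# Lag-1 correlation of residuals in row order
# (assumption: row order carries meaning, e.g. time of occurrence)
lag1 <- cor(res[-1], res[-length(res)])
lag1
```

For normally distributed residuals, roughly 68 percent and 95 percent would be expected within one and two standard deviations, and a lag-1 correlation near zero would be consistent with independent errors.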
To measure the usefulness of our model, we can begin by looking at R-squared, also known as
the coefficient of determination. With one predictor and one response, R-squared is the
proportion of total variation in the response variable that is explained by the linear
regression model. A higher R-squared indicates that the predictor variable predicts the
response variable well.
In our case, R-squared is the proportion of variation in the number of stations reporting
that is explained by quake.linear.regression. R-squared is provided by the summary command,
whose only required argument is the model. Multiple R-squared, located near the bottom of
the summary output, can be interpreted as follows: 72.45 percent of the total variation in the
number of stations reporting a quake is explained by our linear model. R-squared values
range from 0 to 1, so 72.45 percent is a substantial share.
summary(quake.linear.regression)
##
## Call:
## lm(formula = stations ~ mag)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.871  -7.102  -0.474   6.783  50.244
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.4243     4.1899  -43.06   <2e-16 ***
## mag           46.2822     0.9034   51.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared:  0.7245, Adjusted R-squared:  0.7242
## F-statistic:  2625 on 1 and 998 DF,  p-value: < 2.2e-16
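To make the definition of R-squared concrete, it can be computed by hand as one minus the ratio of unexplained to total variation, R^2 = 1 - SSE/SST, and compared with the value reported by summary. This is a minimal sketch; the names fit, sse, sst, r2_by_hand, and r2_from_summary are illustrative.

```r
data(quakes)
fit <- lm(stations ~ mag, data = quakes)

# Unexplained variation: sum of squared residuals
sse <- sum(resid(fit)^2)

# Total variation: squared deviations of the response from its mean
sst <- sum((quakes$stations - mean(quakes$stations))^2)

r2_by_hand <- 1 - sse / sst
r2_from_summary <- summary(fit)$r.squared

c(by_hand = r2_by_hand, from_summary = r2_from_summary)
```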
Hypothesis Testing for Regression Coefficients
An important part of assessing the relationship is evaluating whether our estimates are
statistically significant. We can do this in our linear regression model through hypothesis
tests based on the t and F distributions. For both tests, we use an alpha of 0.05.
The goal of the hypothesis test is to determine whether there is a linear relationship between
our response and predictor variables. We test this by examining the slope of the regression
model.
The null hypothesis is that the slope equals zero. If we fail to reject the null, we have no
evidence of a linear relationship between quake magnitude and the number of stations
reporting. If instead the data indicate that the slope does not equal zero, we reject the
null and conclude that there is a linear relationship. (We could perform a one-sided
test, but the standard two-sided test serves our purpose of testing significance best.) We
can again use the summary function introduced in the last section to read off the hypothesis
test results.
summary(quake.linear.regression)
##
## Call:
## lm(formula = stations ~ mag)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.871  -7.102  -0.474   6.783  50.244
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.4243     4.1899  -43.06   <2e-16 ***
## mag           46.2822     0.9034   51.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared:  0.7245, Adjusted R-squared:  0.7242
## F-statistic:  2625 on 1 and 998 DF,  p-value: < 2.2e-16
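The t test on the slope can be reproduced step by step from the coefficient table, which makes the link between the estimate, its standard error, and the p-value explicit. The sketch below is one way to do this in base R; the names fit, coefs, t_slope, p_slope, alpha, and reject_null are illustrative.

```r
data(quakes)
fit <- lm(stations ~ mag, data = quakes)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# t statistic for H0: slope = 0 is the estimate divided by its standard error
t_slope <- coefs["mag", "Estimate"] / coefs["mag", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_slope <- 2 * pt(-abs(t_slope), df = df.residual(fit))

alpha <- 0.05
reject_null <- p_slope < alpha
c(t = t_slope, p = p_slope)
reject_null

# 95 percent confidence intervals for the intercept and slope
confint(fit, level = 0.95)
```

A confidence interval for the slope that excludes zero leads to the same decision as the two-sided test at the matching alpha.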
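In the output above, the t statistic for the slope is 51.23 and the F statistic is 2625, both
with p-values reported as below 2.2e-16. These are far below our alpha of 0.05, so we reject
the null hypothesis and conclude that there is a linear relationship between magnitude and the
number of reporting stations.
As we have seen, checking the assumptions of the linear regression model has objective and
subjective components, which ultimately leaves the decision to proceed with the model in
our hands. Now that we understand the basics of our linear model, we can go deeper.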
Multiple Regression Analysis
So far we have used simple linear regression, but there could be other factors that are
correlated with magnitude and that drive the positive relationship with the number of
reporting seismic stations. If we use only simple linear regression, the slope estimate may
suffer from omitted variable bias. To reduce this problem, we should add other factors
that are correlated with magnitude and that might also be related to the number of
stations. For instance, the depth, latitude, and longitude of an event in the Fiji region
could affect the number of stations, and they may be correlated with magnitude. We calculate
their pairwise correlations with magnitude:
cor.test(mag,lat)
##
##  Pearson's product-moment correlation
##
## data:  mag and lat
## t = -1.5962, df = 998, p-value = 0.1108
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.11210404  0.01156762
## sample estimates:
##         cor
## -0.05046165
cor.test(mag,depth)
##
##  Pearson's product-moment correlation
##
## data:  mag and depth
## t = -7.488, df = 998, p-value = 1.535e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2885057 -0.1710909
## sample estimates:
##        cor
## -0.2306377
cor.test(mag,long)
##
##  Pearson's product-moment correlation
##
## data:  mag and long
## t = -5.5512, df = 998, p-value = 3.637e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2325652 -0.1122788
## sample estimates:
##        cor
## -0.1730673
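The three pairwise correlations above can also be obtained in a single call, together with each variable's correlation with the response. This is a minimal sketch; the names vars and cor_matrix are illustrative.

```r
data(quakes)

# Correlation matrix of the predictor candidates and the response
vars <- quakes[, c("mag", "lat", "long", "depth", "stations")]
cor_matrix <- cor(vars)
round(cor_matrix, 3)
```

The mag row of the matrix reproduces the cor.test estimates, while the stations column shows how strongly each candidate control is related to the response.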
quake.multiple.regression <- lm(stations ~ mag + lat + long + depth)
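One possible way to inspect the multiple regression and compare it with the simple model is sketched below. It assumes nothing beyond base R; the names simple_fit and multiple_fit are illustrative, and both models are refitted with data = quakes so the snippet is self-contained. No results are asserted here, since the original analysis does not report them.

```r
data(quakes)
simple_fit   <- lm(stations ~ mag, data = quakes)
multiple_fit <- lm(stations ~ mag + lat + long + depth, data = quakes)

# Coefficients, standard errors, and fit statistics of the larger model
summary(multiple_fit)

# How much the magnitude slope moves once the controls are added
c(simple   = coef(simple_fit)[["mag"]],
  multiple = coef(multiple_fit)[["mag"]])

# F test of whether lat, long, and depth jointly improve on the simple model
anova(simple_fit, multiple_fit)
```

If the magnitude slope changes little after adding the controls, that would suggest limited omitted variable bias from these three variables, although it says nothing about factors that are not in the data set.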