Causality Analysis Between Earthquake Magnitudes and Seismic Stations Ercan Gök 5 Ocak 2021 Introduction Earthquakes are indispensable part of our life, and can cause some sociological trauma in our society or in our countries. In order to avoid the negative effects of earthquakes, we need to take some precautions, and if possible, we want to predict before those occur. Earthquakes are recorded by a seismographic network. Each seismic station in the network calculates the movement of the floor/ground at that site. The slip of one block of rock over another in an earthquake gives off energy that makes the ground vibration. That vibration pushes the neighbor piece of ground and leads it to vibrate, and therefore the energy travels out from the earthquake hypocenter in a wave. Magnitude is the most common measure of an earthquake’s size. It is a measure of the size of the earthquake source and is the same number no matter where you are or what the shaking feels like. We want to find whether there is a causal relationship between magnitude and station or not, in order to analyze this, we will use a linear regression model. Firstly, we will use simple linear regression model, and then we will go into deep with multiple linear regression model analysis with some robustness checks. Data Description: Our data set give the locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964. . lat, Latitude of Event . long, Longitude . depth, Depth (km) . mag, Richter Magnitude . stations, Number of stations reporting ##data set, variables data(quakes) head(quakes) ## lat long depth mag stations ## 1 -20.42 181.62 562 4.8 41 ## 2 -20.62 181.03 650 4.2 15 1 ## ## ## ## 3 4 5 6 -26.00 -17.97 -20.42 -19.68 184.10 181.66 181.96 184.31 42 626 649 195 5.4 4.1 4.0 4.0 43 19 11 12 summary(quakes) ## ## ## ## ## ## ## ## ## ## ## ## ## ## lat Min. :-38.59 1st Qu.:-23.47 Median :-20.30 Mean :-20.64 3rd Qu.:-17.64 Max. :-10.72 stations Min. : 10.00 1st Qu.: 18.00 Median : 27.00 Mean : 33.42 3rd Qu.: 42.00 Max. :132.00 long Min. :165.7 1st Qu.:179.6 Median :181.4 Mean :179.5 3rd Qu.:183.2 Max. :188.1 depth Min. : 40.0 1st Qu.: 99.0 Median :247.0 Mean :311.4 3rd Qu.:543.0 Max. :680.0 mag Min. :4.00 1st Qu.:4.30 Median :4.60 Mean :4.62 3rd Qu.:4.90 Max. :6.40 Theoretical Framework In a linear regression model, we are keenly interested in seeing if there is a linear relationship between a predictor variable (in our case, this is ”mag”) and a response variable (in our case, this is ”stat”). From our data set, we’re going to be examining if there is a linear relationship between an earthquake’s magnitude and the number of stations that reported the activity. The intuition here is that as the magnitude of a quake changes, so does the number of stations that report it in some sort of predictable manner. Firstly, in order to get a better feel for the quakes data with a scatter plot. A scatter plot shows us the general shape of the data and can provide us some hints to what the relationship between magnitude variable and stations variable might be. attach(quakes) plot(jitter(mag, amount = 0.05), stations, pch = 20, ylab = "# of Stations Reporting", xlab = "Magnitude", main = "Fiji Earthquakes Magnitude and Reporting", col = rgb(0.1, 0.2, 0.8, 0.3)) 2 120 20 40 60 80 # of Stations Reporting Fiji Earthquakes Magnitude and Reporting 4.0 4.5 5.0 5.5 6.0 Magnitude We can see that we have a fancy scatter plot, magnitude and number of stations are moving together in some sort of way. We are trying to develop some initial ideas about the relationship between the magnitude of an earthquake and the number of stations that report that earthquake. Generally speaking, as magnitude rises, the number of stations reporting increases. We start by giving our model a specific name that we can refer later on: quake.linear.regression, then we run a regression the number of stations on magnitude: quake.linear.regression <- lm(stations ~ mag) quake.linear.regression ## ## ## ## ## ## ## Call: lm(formula = stations ~ mag) Coefficients: (Intercept) -180.42 mag 46.28 d Stations = −180.42 + 46.28(M agnitude) Our linear regression model gives us, as we said before, it provides a numerical relationship, based on our sample dataset, between magnitude of an earthquake and the number of stations that reported the earthquake. From the slope coefficient, we deduce that 1 unit change on the Richter scale will, on average, change the number of reporting stations by 46.28 unit. Because our slope is positive, our linear regression model estimates that there is a positive association between magnitude and the number of reporting stations. Our intercept coefficient tells us if magnitude of the earthquake was zero (but couldn’t be since reported earthquake magnitude should be 4 at minimum.) -180.42 stations would report it. We start with our scatter plot above, we could include the abline code with our parameters, slope and intercept coefficients, to add our regression line. 3 plot(jitter(mag, amount = 0.05), stations, pch = 20, ylab = "# of Stations Reporting", xlab = "Magnitude", main = "Fiji Earthquakes Magnitude and Reporting", col = rgb(0.1, 0.2, 0.8, 0.3)) abline(-180.42, 46.28, col="red", lwd = 2) 120 20 40 60 80 # of Stations Reporting Fiji Earthquakes Magnitude and Reporting 4.0 4.5 5.0 5.5 6.0 6.5 Magnitude We can see that the regression line follows the data fairly well. But this positive relation could be correlation not causation. To check whether there is a causation or not, I explore the assumptions of the linear regression model (homoscedasticity, normally distributed errors, and independent errors) to make sure I am only using this model during appropriate circumstances. Homoscedasticity means that the variance of our residuals is constant across all earthquake magnitudes. Put another way, the variance of our residuals is independent from our predictor variable. plot(mag, stations, pch = 20, ylab = "# of Stations Reporting", xlab = "Magnitude", main = "Fiji Earthquakes Magnitude and Reporting", col = rgb(0.1, 0.2, 0.8, 0.3)) abline(-185, 56, col= "green", lwd = 2) abline(-270, 56, col= "green", lwd = 2) 4 120 20 40 60 80 # of Stations Reporting Fiji Earthquakes Magnitude and Reporting 4.0 4.5 5.0 5.5 6.0 Magnitude As we compare the spread of the data, the variation appears to be relatively constant across the plot except for the lowest earthquake magnitudes. Primary option for checking the variance in residuals is a residual plot with our model’s fitted values on the X-axis and residual size on the Y-axis. This residual plot allows us more clearly to see changes in the variance of the residuals across all magnitudes compared to the scatter plot. I created another function for residuals (quake.residuals) and fitted values (quake.fitted.values) and then construct residual plot with a horizontal line at Y equals zero as a reference point for the variation in residuals. quake.residuals <- quake.linear.regression$residuals quake.fitted.values <- quake.linear.regression$fitted.values plot(quake.fitted.values, quake.residuals, pch = 20, xlab= "Magnitude", ylab= "Residual", main= "Residual Plot", col = rgb(0.1, 0.2, 0.8, 0.3)) abline(0,0,col="brown", lwd = 2.5) 5 20 0 −40 Residual 40 Residual Plot 20 40 60 80 100 120 Magnitude As we compare variation for different magnitudes, it appears that the residual variation is slightly lower for magnitudes 4.0-4.25 compared to magnitudes greater than 4.25. While this assumption is not perfectly met, the data is reasonable enough to allow us to proceed and not completely disregard our model’s findings. The third assumption for linear models is that our residuals follow a normal distribution. We should check characteristics about the normal distribution: Bell-Shaped Curve Majority of data within one and two standard deviations of the mean sd(quake.residuals) ## [1] 11.49485 Symmetrical in shape An easy way to see if our residuals follows a normal distribution is by showing the residuals in a histogram. hist(quake.residuals, breaks=25, xlab="Residual Value", ylab="Frequency", main="Histogram of Residuals", col="blue") 6 100 50 0 Frequency 150 Histogram of Residuals −40 −20 0 20 40 Residual Value We can see from histogram above that our residuals’ standard error is 11.5 approximately, and our 99 percent of data points fall in that two standard error region. Fourth assumption in our linear regression model is each error term is independent of other error terms. For our data set, there is little reason to believe that the residual number of stations reporting an earthquake for a given magnitude would be dependent on the residual of another predictor/response variable combination. Given this, we can claim that this assumption is met and continue with our linear model. Based on our simple knowledge of earthquake reporting, we assume that independence between residuals holds. If we truly want a measure of the usefulness of our model, we can begin by looking at R-Squared. R-Squared is known as the simple coefficient of determination. When comparing two variables, R-squared represents the proportion of total variation in the response variable that is explained by the linear regression model. Naturally, a higher R-squared shows that the predictor variable predicts the response variable well. In our case, we can see the proportion of variation in the number of stations reporting that can be explained by our Quake.mod. R-Squared is provided by the summary command, for which the only argument is the model name. Multiple R-squared, located at the bottom of the summary output can be interpreted as follows: 72.45 percent of the total variation in the number of stations reporting a quake can be explained by our linear model. R-squared values range from 0-1, so 72.45 percent is noteworthy for sure. summary(quake.linear.regression) ## ## ## ## ## ## ## Call: lm(formula = stations ~ mag) Residuals: Min 1Q -48.871 -7.102 Median -0.474 3Q 6.783 Max 50.244 7 ## ## ## ## ## ## ## ## ## ## ## Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -180.4243 4.1899 -43.06 <2e-16 *** mag 46.2822 0.9034 51.23 <2e-16 *** --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 Residual standard error: 11.5 on 998 degrees of freedom Multiple R-squared: 0.7245,Adjusted R-squared: 0.7242 F-statistic: 2625 on 1 and 998 DF, p-value: < 2.2e-16 Hypothesis Testing for Regression Coefficients An important part of understanding the causation is evaluating if what we are doing is actually important. We are able to perform this self-analysis in our linear regression model as well through hypothesis tests on the t and F distribution. For both tests, we will use an alpha of 0.05. Our goal of hypothesis testing is to test if there is linear relationship between our response and predictor variable. We test this through examining the slope of the regression model. With our null hypothesis being the slope equals zero, we will fail to reject the null when we believe there is no linear relationship between quake magnitude and the number of stations reporting. On the contrary, when we believe the slope does not equal zero we will reject the null and conclude there is a linear relationship. (Hypothetically, we could perform a one-sided test, but the standard two-sided test will serve our purpose of testing significance the best.) We can use again the summary function we introduced in the last section to analyze the hypothesis test results. summary(quake.linear.regression) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Call: lm(formula = stations ~ mag) Residuals: Min 1Q -48.871 -7.102 Median -0.474 3Q 6.783 Max 50.244 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -180.4243 4.1899 -43.06 <2e-16 *** mag 46.2822 0.9034 51.23 <2e-16 *** --Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 Residual standard error: 11.5 on 998 degrees of freedom Multiple R-squared: 0.7245,Adjusted R-squared: 0.7242 F-statistic: 2625 on 1 and 998 DF, p-value: < 2.2e-16 8 As realized, checking the assumptions of the linear regression model has objective and subjective components, which ultimately can leave to decision to proceed with the model in out hands. Now that we understand the basics of our linear model and we can go deep. Multiple Regression Analysis Thus far actually, we used simple linear regression, but there could be other factors that are strongly correlated with magnitude, and those factors might have caused positive relation with the number of reporting seismic stations. If we utilize only simple linear regression, there can be omitted variable bias. In order to eliminate this serious problem, we should add another factors that is correlated with magnitude, and that might possibly have relation with the number of stations. For instance, depth, latitude, longitude of that Fiji region could affect the number of stations, and there are powerful relations with magnitude. If we calculate their pairwise correlations with magnitude: cor.test(mag,lat) ## ## ## ## ## ## ## ## ## ## ## Pearson’s product-moment correlation data: mag and lat t = -1.5962, df = 998, p-value = 0.1108 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.11210404 0.01156762 sample estimates: cor -0.05046165 cor.test(mag,depth) ## ## ## ## ## ## ## ## ## ## ## Pearson’s product-moment correlation data: mag and depth t = -7.488, df = 998, p-value = 1.535e-13 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.2885057 -0.1710909 sample estimates: cor -0.2306377 cor.test(mag,long) ## ## Pearson’s product-moment correlation ## ## data: mag and long ## t = -5.5512, df = 998, p-value = 3.637e-08 ## alternative hypothesis: true correlation is not equal to 0 9 ## 95 percent confidence interval: ## -0.2325652 -0.1122788 ## sample estimates: ## cor ## -0.1730673 quake.multiple.regression <- lm(stations ~ mag + lat + long + depth) 10