Data Analysis Final Report
TEAM 04
Yoko Arita
Leslie Anne James
Shigenori Kobayashi
Takeshi Kusunoki

Step 1: Determine a concerning Variable
Team Objective of the project: Find conditions for Strong Company

Investopedia explains “Cash Flow”
1. In business as in personal finance, cash flows are essential to solvency. They can be presented as a record of something that has happened in the past, such as the sale of a particular product, or forecasted into the future, representing what a business or a person expects to take in and to spend. Cash flow is crucial to an entity's survival. Having ample cash on hand will ensure that creditors, employees and others can be paid on time. If a business or person does not have enough cash to support its operations, it is said to be insolvent, and a likely candidate for bankruptcy should the insolvency continue.
2. The statement of a business's cash flows is often used by analysts to gauge financial performance. Companies with ample cash on hand are able to invest the cash back into the business in order to generate more cash and profit.

Step 1: Determine a concerning Variable
Does “Cash Flow” fit our goal?
As Investopedia says, cash flow is essential for a company.
- A company needs ample cash on hand to be able to pay employees and creditors.
- It can invest that cash in the company to improve performance and profit.
Also, Cash Flow is often used to gauge financial performance.
Both for practical and analytical reasons, Cash Flow is a strong factor for a company’s success!

Step 2 : Consider the Explanatory variables
Step 2-1 : Dispersion : Get the beautiful Histograms

> summary(Cashflow)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-189900     558    1515    8101    4215 1711000
> summary(log(Cashflow)-0.1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 -0.100   6.489   7.338   7.454   8.356  14.250     159

Step 2 : Consider the Explanatory variables
Step 2-2 : Association : decide an explanatory variable
As we are looking for factors that make a strong company, we decided that profit is a natural indicator of a successful company. So, our first explanatory variable is Ordinary Profit.

Step 2 : Consider the Explanatory variables
Step 2-2 : Association : draw scatter plots

> cor(Cashflow,OrdinaryProfit)
[1] 0.7604991
> cor(log(Cashflow),log(OrdinaryProfit))
[1] 0.814815

Step 2 : Consider the Explanatory variables
Step 2-2 : Association : using the data
Looking at our previous charts, it is clear that the log of our variables gives a clearer picture than the raw values. In addition, the correlation between Cashflow and OrdinaryProfit improves:

cor(Cashflow,OrdinaryProfit)              0.7604991
cor(log(Cashflow),log(OrdinaryProfit))    0.814815
an improvement of                         0.0543159

Step 3 : Construct Single Linear Regression Models

> Cashflow.lm<-lm((Cashflow)~(OrdinaryProfit))
> summary(Cashflow.lm)

Call:
lm(formula = (Cashflow) ~ (OrdinaryProfit))

Residuals:
    Min      1Q  Median      3Q     Max
-360958    -578    2373    3660 1011555

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -3.443e+03  7.356e+02  -4.681 3.03e-06 ***
OrdinaryProfit  2.136e+00  4.021e-02  53.109  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32130 on 2089 degrees of freedom
Multiple R-squared:  0.5745,    Adjusted R-squared:  0.5743
F-statistic:  2821 on 1 and 2089 DF,  p-value: < 2.2e-16

Cashflow = (-3.443e+03) + (2.136)OrdinaryProfit

Step 3 : Construct Single Linear Regression Models

> model2<-lm(log(dat[,29])~log(dat[,37]))
> summary(model2)

Call:
lm(formula = log(dat[, 29]) ~ log(dat[, 37]))

Residuals:
    Min      1Q  Median      3Q     Max
-5.4288 -0.5322 -0.0756  0.4844  4.5932

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     1.83025    0.09951   18.39   <2e-16 ***
log(dat[, 37])  0.78049    0.01304   59.85   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8912 on 1813 degrees of freedom
Multiple R-squared:  0.6639,    Adjusted R-squared:  0.6637
F-statistic:  3582 on 1 and 1813 DF,  p-value: < 2.2e-16

log(Cashflow) = (1.83025) + (0.78049)(log(OrdinaryProfit))

Step 3 : Construct Single Linear Regression Models
Initial results:
We found that using log() of both variables gave us a better fit.
- Tighter scatter plot
- The Adjusted R-Squared improved (0.5743 to 0.6637)
- The residual standard error fell (32130 on 2089 degrees of freedom to 0.8912 on 1813 degrees of freedom), though the two values are on different scales (raw versus log) and are not directly comparable
Also, we found the p-value of OrdinaryProfit to be <2e-16, which is very close to zero.
But the Adjusted R-Squared improved by only 0.0894, so OrdinaryProfit alone is clearly not enough.

Step 3 : Model Checking

> var(log(dat[,29]))
[1] 2.361808
> var(residuals(model2))
[1] 0.7937481
> var(predict(model2))
[1] 1.56806

Step 3 : Model Checking
Initial results and impressions:
The variance results were:

Original     2.361808
Residuals    0.7937481
Predicted    1.56806

These are all fairly high. Even though the residual variance is less than one, it is not close to zero.
Together with the R-Squared results, this tells us that using OrdinaryProfit to explain Cashflow improves the model, but it does not make it a good model.

Step 4 : Improvement of the Obtained Model by changing explanatory variables
After concluding that OrdinaryProfit alone does not explain Cashflow well enough, we have to improve our model!
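The variance figures reported under Step 3 : Model Checking can be cross-checked by hand. For a least-squares fit with an intercept, the variance of the response splits into the variance of the fitted values plus the variance of the residuals, and the ratio of fitted to total variance is the Multiple R-squared. The sketch below first redoes that arithmetic with the three reported numbers, then confirms the identity on simulated data. The simulated `x` and `y` are hypothetical stand-ins, not the team's dataset.

```r
# Values copied from the Step 3 model-checking output (log scale).
var_total  <- 2.361808   # var(log(dat[,29]))
var_resid  <- 0.7937481  # var(residuals(model2))
var_fitted <- 1.56806    # var(predict(model2))

# Fitted variance plus residual variance should reproduce the total.
var_sum   <- var_fitted + var_resid
# Share of variance explained; should agree with Multiple R-squared (0.6639).
r_squared <- var_fitted / var_total
print(c(sum = var_sum, total = var_total, r_squared = r_squared))

# Same identity on simulated data (hypothetical, for illustration only).
set.seed(1)
x   <- rnorm(200)
y   <- 1.8 + 0.8 * x + rnorm(200)
fit <- lm(y ~ x)
decomposition_holds <- isTRUE(all.equal(var(y), var(fitted(fit)) + var(residuals(fit))))
ratio_is_r_squared  <- isTRUE(all.equal(summary(fit)$r.squared, var(fitted(fit)) / var(y)))
print(c(decomposition_holds, ratio_is_r_squared))
```

Read this way, the residual variance of about 0.79 is the roughly one third of the variance in log(Cashflow) that log(OrdinaryProfit) leaves unexplained, which is the gap the additional explanatory variables are meant to close.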
So, we decided to use the following variables because we believe that they are all signs of a strong company:

Role                            Variable             Data column
Concerning variable             Cash Flow            29
Initial explanatory variable    Ordinary Profit      37
New explanatory variables       Personnel Expenses   41
                                Tangible Assets      18
                                Depreciation         28

Step 4 : Improvement of the Obtained Model by changing explanatory variables (1)
First, we just added Personnel Expenses to our previous model:

> model3<-lm(log(dat[,29])~log(dat[,37])+log(dat[,41]))
> summary(model3)

Call:
lm(formula = log(dat[, 29]) ~ log(dat[, 37]) + log(dat[, 41]))

Residuals:
    Min      1Q  Median      3Q     Max
-4.7120 -0.3625  0.0258  0.3959  3.6438

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.98178    0.12307  -7.977 2.63e-15 ***
log(dat[, 37])  0.44945    0.01522  29.527  < 2e-16 ***
log(dat[, 41])  0.59305    0.01953  30.359  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7257 on 1812 degrees of freedom
Multiple R-squared:  0.7772,    Adjusted R-squared:  0.777
F-statistic:  3161 on 2 and 1812 DF,  p-value: < 2.2e-16

Step 4 : Improvement of the Obtained Model by changing explanatory variables (2)
Then, we included Tangible Assets:

> model4<-lm(log(dat[,29])~log(dat[,37])+log(dat[,41])+log(dat[,18]))
> summary(model4)

Call:
lm(formula = log(dat[, 29]) ~ log(dat[, 37]) + log(dat[, 41]) + log(dat[, 18]))

Residuals:
    Min      1Q  Median      3Q     Max
-4.6334 -0.2612  0.0527  0.3206  2.6876

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1.65317    0.09596 -17.228   <2e-16 ***
log(dat[, 37])  0.31375    0.01224  25.634   <2e-16 ***
log(dat[, 41])  0.17090    0.01902   8.987   <2e-16 ***
log(dat[, 18])  0.56570    0.01577  35.881   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.555 on 1811 degrees of freedom
Multiple R-squared:  0.8698,    Adjusted R-squared:  0.8696
F-statistic:  4033 on 3 and 1811 DF,  p-value: < 2.2e-16

Step 4 : Improvement of the Obtained Model by changing explanatory variables (3)
Finally, we included Depreciation:

> model5<-lm(log(dat[,29])~log(dat[,37])+log(dat[,41])+log(dat[,18])+log(dat[,28]))
> summary(model5)

Call:
lm(formula = log(dat[, 29]) ~ log(dat[, 37]) + log(dat[, 41]) + log(dat[, 18]) + log(dat[, 28]))

Residuals:
    Min      1Q  Median      3Q     Max
-4.6901 -0.1523 -0.0261  0.1396  2.7388

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.481898   0.084842   5.680 1.57e-08 ***
log(dat[, 37])  0.307379   0.008691  35.368  < 2e-16 ***
log(dat[, 41]) -0.040860   0.014402  -2.837   0.0046 **
log(dat[, 18]) -0.004568   0.017541  -0.260   0.7946
log(dat[, 28])  0.727838   0.017238  42.224  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.394 on 1810 degrees of freedom
Multiple R-squared:  0.9344,    Adjusted R-squared:  0.9343
F-statistic:  6446 on 4 and 1810 DF,  p-value: < 2.2e-16

Step 4 : Improvement of the Obtained Model by changing explanatory variables (4)
Results of R-Squared:
There was a clear increase in Adjusted R-Squared as we included each variable.
[Bar chart: Adjusted R-Squared (axis 0 to 1) for Cashflow and Ordinary Profits; with Personnel Expenses; with Tangible Assets; with Depreciation. NOTE: all values are log()]
The Adjusted R-Squared increased from 66% to 93%!

Step 4 : Improvement of the Obtained Model by changing explanatory variables (4)
Other Results:
*all variables are log()

                      T-value    P-value
Ordinary Profit        35.368    < 2e-16
Personnel Expenses     -2.837     0.0046
Tangible Assets        -0.260     0.7946
Depreciation           42.224    < 2e-16

The P-values are good and close to zero, except for Tangible Assets.
The T-values vary widely.
Depreciation and Ordinary Profit have very large T-values, and Personnel Expenses has a moderate one.
However, Tangible Assets does not have a good T-value.

Step 4 : Improvement of the Obtained Model by changing explanatory variables (4)
Scatter plots of the Linear Model and the final Multiple Regression Model:

Step 4 : Improvement of the Obtained Model by changing explanatory variables (4)
Result and reflections:
Our goal is to find factors that indicate a strong company, and we decided to focus on Cash Flow.
We also decided that a strong company would have a good Ordinary Profit, so we looked for the correlation between the two.
We found that taking the log() of each value is the best way to present the relationship.
Then, we looked for variables that may contribute to Cash Flow, and at how they improve our model.
By using log() and several explanatory variables, the Multiple R-Squared has been drastically improved!
Original: 0.5745 → Improved: 0.9344

Step 5 : Residual Analysis
Whisker Plots of our models:

> boxplot(residuals(Cashflow.lm),residuals(Cashflow.lm4))

[Box plots: Residuals of Simple, Residuals of Multiple]
The standard deviation of the residuals is smaller in the multiple model.

Step 5 : Residual Analysis
The way to improve the model…
Check whether the outlying data share any remarkable characteristics, starting with companies whose residuals lie above the standard deviation (residual > 0.6):

> dat[residuals(model5)>0.6,1]
[1] HEIWA CONSTRUCTION Sekisui House Hokuriku SDK ENGINEERING DAITO TRUST CONSTRUCTION
[5] SHINNIHON Sumitomo Forestry CHUDENKO ATAKA CONSTRUCTION & ENGINEERING
[9] Taikisha DAI-DAN Hibiya Engineering Snow Brand Seed
[13] CHUYU BOSO OIL & FAT NIHON SHOKUHIN KAKO SHINOBU FOODS PRODUCTS
[17] Fuji Spinning Teikoku Sen-i CAROLINA KANBO PRAS
[21] ITARIYARD Kureha Chemical Industry Nippon Chemical Industrial Plas-Tech
[25] JAPAN CARLIT Nippon Steel Chemical ONO PHARMACEUTICAL SANTEN PHARMACEUTICAL
[29] NIPPON SEIRO KINUGAWA RUBBER INDUSTRIAL ISHIZUKA GLASS Sumitomo Osaka Cement
[33] Harima Ceramic Pacific Metals Japan Metals & Chemicals Optec Dai-Ichi Denko
[37] Japan Bridge KOMAI TEKKO TOSHIBA TUNGALOY TOYO MACHINERY & METAL
[41] Hitachi Zosen Tomioka Machinery Kurita Water Industries YUKEN KOGYO Shimpo Industrial
[45] Heiwa SANKYO Shinko Electric Denyo
[49] MABUCHI MOTOR TAMURA ELECTRIC WORKS HIROSE ELECTRIC OHKURA ELECTRIC
[53] KEYENCE MELCO KOMATSU ZENOAH Mazda Motor
[57] IKEDA BUSSAN ECHO TRADING NAKAYAMAFUKU Harima-Kyowa
[61] DOSHISHA TOKIMEC TOKYO SEIMITSU MITSUMURA PRINTING
[65] TSUTSUMI JEWELRY Nintendo NAGASE & CO. G-NET
[69] Tokyo Electron OSAKA UOICHIBA RYOYO ELECTRO TOKAI BUSSAN
[73] KUWAZAWA Trading TOKYO STYLE UNI·CHARM CENTRAL AUTOMOTIVE PRODUCTS
[77] Ryosan DENKYOSHA SHIMACHU CHIYODA
[81] CHUO SUBARU Kansai Sekiwa Real Estate SEIBU RAILWAY HOKKAIDO CHUO BUS
[85] MEIDEN ENGINEERING KOEI NIPPON KANZAI INES
[89] BIKEN TECHNO Juel Verite Ohkubo SHINDEN NIKKU SANGYO
[93] FAST RETAILING

There is no overall trend, but there are many companies that deal with materials and electronics.

Step 5 : Residual Analysis
The way to improve the model…
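A note on the row selection used in these residual look-ups. `residuals()` returns one value per observation actually used in the fit, so if `lm()` dropped rows with missing values (the log models report fewer degrees of freedom than the first model, which suggests it did), a logical vector built from the residuals may be shorter than the data and may not line up with its rows. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `name`, `y`, `x`) to show two ways of keeping residuals aligned with rows. It is an illustration under those assumptions, not a re-run of the team's analysis.

```r
# Hypothetical toy data: two rows have a missing response and are dropped by lm().
toy <- data.frame(
  name = paste0("Company", 1:10),
  y    = c(5, NA, 7, 9, 6, 8, NA, 10, 4, 12),
  x    = 1:10,
  stringsAsFactors = FALSE
)

fit <- lm(y ~ x, data = toy)
res <- residuals(fit)   # one residual per row used, named by original row
print(length(res))      # shorter than nrow(toy) because of the NA rows

# Option 1: select by the row names carried on the residual vector.
picked_by_name <- toy[names(res)[res > 0.6], "name"]

# Option 2: refit with na.exclude so residuals() is padded with NA to full length.
fit_padded    <- lm(y ~ x, data = toy, na.action = na.exclude)
res_padded    <- residuals(fit_padded)
picked_padded <- toy$name[which(res_padded > 0.6)]

print(picked_by_name)
print(picked_padded)
```

Both options return the same companies, and either could be substituted for the direct `dat[residuals(model5) > 0.6, 1]` style of indexing if the row counts differ.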
Check whether the outlying data share any remarkable characteristics, now for companies whose residuals lie below the standard deviation (residual < -0.6):

> dat[residuals(model5)< -0.6,1]
[1] Arabian Oil FUJIKO ZENITAKA Sumitomo Construction
[5] Daiwa Construction DAI NIPPON CONSTRUCTION HAZAMA KOKUNE
[9] TADA Arai-Gumi KUMAGAI GUMI Asakawagumi
[13] KOMATSU CONSTRUCTION TSUKEN Chugai Ro SANKO METAL INDUSTRIAL
[17] KYODO SHIRYO EZAKI GLICO SETTSU OIL MILL KATOKICHI
[21] Shoei MIYUKI KEORI ATSUGI NYLON INDUSTRIAL TOMOEGAWA PAPER
[25] Settsu Nippon Denko Shimura Kako ASAKA INDUSTRIAL
[29] KATO SPRING WORKS KOYO IRON WORKS & CONSTRUCTION ANEST IWATA FUJITSU GENERAL
[33] UNIDEN OKAYA ELECTRIC INDUSTRIES SHIZUKI ELECTRIC ZOJIRUSHI
[37] TAKASHIMA & CO. DAIKO DENSHI TSUSHIN LECIEN TOKYO SOIR
[41] DAITO GYORUI Kinsho-Mataichi YUASA TRADING CANOX
[45] MOONBAT ZETT NAGAHORI YAOHAN JAPAN
[49] Footwork International OAK Maruzen SHOWA LINE
[53] DAIICHI CHUO KISEN Ga-jo-en Kanko

Again, there is no overall trend, but there are many electric, construction, and oil companies.

Data Analysis Conclusion: R-Squared
In these variables, you can see the strong relationship between CashFlow…

Model                            R-Squared    Improvement
With “OrdinaryProfit”             0.6639
Adding “PersonnelExpence”         0.7772       0.1133
Adding “TangibleTotalAsset”       0.8698       0.0926
Adding “Depreciation”             0.9344       0.0646

Compared with the first simple regression, R-squared improved from 66.4% to 93.4% in our final multiple regression model.
All the coefficients except “TangibleFixedAsset” are highly significant, since their p-values are very small. So to improve this model further, we aim to improve on this variable.

Data Analysis Conclusion: Other Values
Looking back and analyzing more data…
*all variables are log()

                      T-value    P-value
OrdinaryProfit         35.368    < 2e-16
Personnel Expenses     -2.837     0.0046
Tangible Assets        -0.260     0.7946
Depreciation           42.224    < 2e-16

Ordinary Profit and Depreciation have high T-values, so we can say that they are very effective in supporting Cash Flow.
Personnel Expenses does as well, but its effect is much weaker. However, the T-value of Tangible Assets does not show any support for Cash Flow.
For P-values, again, Ordinary Profit and Depreciation had a strong result: their very small P-values are strong evidence that they improve the model. The evidence for Personnel Expenses is weaker, although its P-value (0.0046) is still significant at the 1% level. On the other hand, there was no evidence that Tangible Assets helps Cash Flow.

Data Analysis Conclusion
Issues and Improvements:
By looking at R-Squared, P-Value, and T-Value, we found that all of our variables were good explanatory variables for Cash Flow and helped improve the model in some way.
However, there were three problems:
- Depreciation improved R-Squared, but by less than the other variables did
- Tangible Assets had an insignificant effect on Cash Flow
- There was little evidence (P-value 0.7946) that Tangible Assets supports Cash Flow

Data Analysis Conclusion
Verify if Tangible Assets is necessary
Run Automatic Variables selection
Tangible Assets has no impact on our model

Data Analysis Conclusion
Cash flow model is improved by…
- Ordinary Profit
- Personnel Expenses
- Depreciation

Team 04
Thank you for listening! Any questions?

APPENDIX
Multiple regression model without Tangible Assets

QA from last presentation
Q1. Some of the ordinary profit records have negative values. How did you deal with this problem?
Q2. Based on the answer to Q1, the chart on slide 4 appears to use the wrong data.
Q3. In the right-hand chart on slide 4, is there any reason for adding -0.1?
Q4. How did you find the explanatory variables on slide 14?
<MEMO> This page was added after the final presentation. The charts on slide 4 were also replaced.
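The "Verify if Tangible Assets is necessary" and "Run Automatic Variables selection" steps above can be sketched as follows. The slides do not say which selection procedure was used, so backward elimination with R's `step()` (AIC-based) is an assumption here, as is the partial F-test with `anova()`. The data frame `sim` and its column names are synthetic stand-ins for the log-transformed variables, not the team's data.

```r
# Synthetic stand-in data: hypothetical columns, not the team's dataset.
set.seed(4)
n <- 300
log_profit       <- rnorm(n, mean = 7, sd = 1.5)
log_personnel    <- 3 + 0.6 * log_profit + rnorm(n)
log_depreciation <- 0.5 * log_profit + 0.4 * log_personnel + rnorm(n, sd = 0.5)
log_tangible     <- 2 + log_depreciation + rnorm(n, sd = 0.3)
log_cashflow     <- 0.5 + 0.3 * log_profit + 0.7 * log_depreciation +
                    rnorm(n, sd = 0.4)
sim <- data.frame(log_cashflow, log_profit, log_personnel,
                  log_tangible, log_depreciation)

# Full model with all four explanatory variables, and the same model
# with the tangible-assets term removed.
full    <- lm(log_cashflow ~ log_profit + log_personnel +
                log_tangible + log_depreciation, data = sim)
reduced <- update(full, . ~ . - log_tangible)

# Partial F-test: does dropping the term significantly worsen the fit?
comparison <- anova(reduced, full)
print(comparison)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
selected <- step(full, direction = "backward", trace = 0)
print(formula(selected))
print(c(full = summary(full)$r.squared, reduced = summary(reduced)$r.squared))
```

On real data, a large p-value in the `anova()` comparison together with almost no drop in R-squared would support leaving Tangible Assets out, which is the conclusion the slides reach.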