Table of Contents

Introduction
Major Findings
Part A: Business and economic data evaluation
  1. Data collection methods
  2. Source of Data
  3. Method for analyzing data
Part B: Communicate findings using appropriate charts / tables
  1. Cleaning the data
  2. Summary statistics, tables, charts to explore each variable
  3. The relationship between variables
  4. Summary quantitative variable classifying by qualitative variables
  5. Evaluation of various types of tables and charts
Part C: Analysing and evaluating "House Price Data Project" data
  1. T-test
  2. Regression analysis
  3. Evaluate the use of summary statistics
  4. Differences between regression analysis and correlation coefficients
References

Introduction

This research has three aims. The first is to evaluate business and economic data using the case study titled "House Price Data Project." The second is to locate and analyze the company's data using a number of different approaches. The third is to communicate the most important findings using the most appropriate charts and tables.

Major Findings

Part A: Business and economic data evaluation

1. Data collection methods:

Any company may benefit from data. However, data must be collected using appropriate methods, and it should be as consistent, dependable, and accurate as feasible. Quantitative data and qualitative data are the two fundamental types of data gathered.

Quantitative data is information expressed as an amount or range, so that it can be counted and measured exactly. It is widely used in statistical analysis. Questionnaires, surveys, document reviews, sampling, observations, and in-person or online interviews are some of the ways that such data may be collected (Barrow, 2017):

o Surveys and questionnaires: This method is convenient because it can be carried out either in person or over the internet. Platforms such as Typeform, Qualtrics, and SurveyMonkey allow anybody to collect quantitative data. Furthermore, this technique simplifies and quantifies the thoughts and behaviors of participants.
o Document review: This method gathers data that already exists inside or outside an organization, which makes it a relatively efficient source of information. It is, however, primarily a secondary data source, and it covers three categories of documents: public documents, private documents, and physical evidence.

o Sampling: Instead of analyzing the entire data set, researchers may select a smaller, representative group to investigate. The two main types are probability (random) sampling and non-probability sampling. Data analysts may also use programming languages and other procedures to uncover patterns in very large data sets.

o Observations: This is a straightforward and effective approach in which the observer simply watches activities and events as they occur in their natural setting. Participants make judgments and respond to challenges in a context closer to their everyday lives than a controlled environment such as a laboratory or focus group.

o Interviews: Several types of interview are available today, along with technologies that support the interview process. Researchers prepare a standard set of questions, and individuals may be interviewed over the phone or by video conference.

Qualitative data: Qualitative data are descriptive results that help the analyst dig deeper into hypotheses and numerical findings. In statistics, qualitative data is non-numerical, observable data that is classified by an attribute of an object or phenomenon, for instance the perspectives and experiences of individual patients and their families.
Researchers may collect qualitative data through in-depth interviews, focus group discussions, secondary research, or observations (McClave, Benson and Sincich, 2014):

o In-depth interview: This method obtains information from respondents flexibly, through one-to-one engagement and conversation with a research subject. Open-ended questions are usually used so that respondents can give the fullest information.

o Focus group: Interaction and discussion with a group of research volunteers yield multidimensional and objective results from several perspectives.

o Secondary research: This collects information that is already available, whether as text, images, audio recordings, or video. Two related research approaches are longitudinal studies and case studies. A longitudinal study collects data from the same source repeatedly over time in order to establish a correlation between the subjects being researched. A case study, in contrast, observes individuals within a specific setting.

o Observations: Researchers record what they see, hear, or experience in thorough field notes.

2. Source of Data:

Data and statistics that are obtained and stored are used for a variety of analysis and evaluation purposes. Data is classified into two types: primary and secondary. Primary data is collected directly by the researchers in line with their own aims and methodologies. Secondary data is gathered from the documentation, research, and analysis of others (McClave, Benson and Sincich, 2014):

Secondary data: This data has already been researched, reviewed, or tested, and it is obtained from reputable data stores. Because each company maintains its own database, such data can be preserved.
Examples include data on customers, sales, profits, payroll, and bonuses, as well as data on employees and their personal information. In addition, the information available from organizations and groups is quite diverse, and the Internet is an extremely helpful resource. Readers can, for instance, quickly search the Internet for basic information such as a company's sales, product pricing, and commercial plans.

Primary data: This is an excellent source of data, but to get the most out of it the observer needs to take notes in the field. Observers collect one or more variables for statistical analysis, and the results of that analysis are incorporated in the findings (Anderson et al., 2018). For instance, to determine which smartphone is the most popular purchase at Mobile World, the company collects information on the purchasing patterns of a representative sample of the store's customers. Using this data, managers may study customers' preferences and purchase times, and gain insight into how customers' wishes affect their spending in order to develop successful business strategies.

Advantages and disadvantages

Secondary data
  Advantages:
    o Cost-effective
    o Fast
    o High-quality data gathered by specialists
    o Data collection methods vary
  Disadvantages:
    o Data could be out of date
    o Not tailored to your requirements
    o Data can be skewed in favor of the person gathering it

Primary data
  Advantages:
    o Might be more accurate
    o You have more control over the data
    o Privacy is maintained
    o You have a better understanding of the data
  Disadvantages:
    o It takes a lot of time to build this data
    o It costs a lot of money
    o It needs more labor during the survey

3. Method for analyzing data:

In statistics, there are two methods for analyzing data: descriptive analysis and inferential analysis.
Descriptive analysis uses frequency tables, cross tables, and graphs to describe qualitative variables. Quantitative variables are summarized by means, medians, modes, standard deviations, and variances, or shown graphically. Its purpose is to produce a summary of the data. In descriptive analysis the variables are treated separately, and no model is fitted of how one influences another. Because analysts are concerned with describing and examining the data rather than establishing relationships between the variables, this method is straightforward to implement and requires little effort.

Inferential analysis is the second type of statistical analysis. Because it draws conclusions and makes forecasts about a population from a sample, it is statistically more advanced. The researcher works with estimates and a significance level (α = 0.05). Estimation falls into two categories: point estimates, which give a single value, and interval estimates, which give a range bounded by two values. In addition, hypotheses about both qualitative and quantitative variables can be tested in order to forecast and draw conclusions about trends in large populations. This method helps the analyst discover the links between variables. As a consequence, businesses can rely on statistical data to develop hypotheses and practical solutions for their business problems.

Differences between descriptive and inferential statistics

Descriptive statistics                        Inferential statistics
--------------------------------------------  --------------------------------------------
Used with data of a modest size               Used with data of a huge size
Gives a meaningful presentation of all of     Analysis, comparison, and forecasting of
the data, expressed as a percentage           the data
Simple procedure to carry out                 Complex procedure that calls for the use of
                                              many different methods
Errors are infrequent                         Errors are frequent

Part B: Communicate findings using appropriate charts / tables.

1. Cleaning the data.

The frequencies procedure is used to identify missing values in the data set.

Statistics
Variable      N Valid   N Missing
Type          501       0
Price         501       0
Bedrooms      501       0
Bathrooms     501       0
Area          500       1
Furnished     501       0
Level         501       0

There are 501 observations and one missing value, in the Area variable. Because a single missing value is negligible, the observation is retained at this stage.

The Price variable has a great many extreme values. Because Price is the dependent variable in the analyses in the next two parts, its outliers are deleted. The Bedrooms and Bathrooms variables have low standard deviations, which means that outliers do not significantly affect them, so these variables are left as they are. Area also has many outliers, but unlike Price it is not used much in the following two parts, so its outliers do not affect most of the analysis and they are kept.

Statistics
                 Price      Bedrooms   Bathrooms   Area (m2)
N Valid          501        501        501         500
N Missing        0          0          0           1
Percentile 25    850000     2.00       2.00        120.000
Percentile 50    1837000    3.00       2.00        160.000
Percentile 75    3067500    3.00       3.00        199.750
Q3 + 1.5 IQR     6393750                           319.375

The IQR method is used: the upper fence Q3 + 1.5 IQR is computed, where IQR = Q3 - Q1, and observations above this value are treated as outliers and deleted. For Price, IQR = 3067500 - 850000 = 2217500, so the fence is 3067500 + 1.5 x 2217500 = 6393750.

Statistics
Variable      N Valid   N Missing
Type          479       0
Price         479       0
Bedrooms      479       0
Bathrooms     479       0
Area          479       0
Furnished     479       0
Level_group   479       0

To summarize, after removing the outliers and cleaning the data, 479 observations remain for each variable, with no missing values.

2. Summary statistics, tables, charts to explore each variable

Qualitative variable

a.
Level group

Level_group
            Frequency   Percent   Valid Percent   Cumulative Percent
1.00        284         59.3      59.3            59.3
2.00        195         40.7      40.7            100.0
Total       479         100.0     100.0

Group 1 accounts for 59.3% of the houses and group 2 for 40.7%.

b. Type

Type
            Frequency   Percent   Valid Percent   Cumulative Percent
Apartment   436         91.0      91.0            91.0
Duplex      27          5.6       5.6             96.7
Penthouse   10          2.1       2.1             98.7
Studio      6           1.3       1.3             100.0
Total       479         100.0     100.0

The pie chart shows the different types of home. Apartments clearly make up the largest share, 91.0% with 436 units. There are 27 duplex homes, which account for only 5.6% of the total. Finally, there are 10 penthouses and 6 studios, at 2.1% and 1.3% respectively.

c. Furnished

Furnished
            Frequency   Percent   Valid Percent   Cumulative Percent
No          295         61.6      61.6            61.6
Yes         184         38.4      38.4            100.0
Total       479         100.0     100.0

Unfurnished homes account for 61.6% of the total, and furnished homes for 38.4%.

Quantitative variables

a. Price

Descriptive Statistics
                     N     Range     Minimum   Maximum   Mean         Std. Deviation   Variance
Price                479   6127000   35000     6162000   1999718.33   1407692.078      1981596986002.352
Valid N (listwise)   479

The price of each home depends not only on its area and type but also on whether or not it is furnished. As a result, the range of this variable is large, about 6,127,000: the highest value is 6,162,000 and the lowest is 35,000. The mean price of a house is approximately 1,999,718.33 dollars, which may reflect that many of the homes purchased are spacious. The standard deviation is large, and the histogram is skewed to the right.

b.
Bedrooms

Descriptive Statistics
                     N     Range   Minimum   Maximum   Mean   Std. Deviation   Variance
Bedrooms             479   4       1         5         2.75   .606             .367
Valid N (listwise)   479

Because the range of this variable is small, the standard deviation is also small, at 0.606. Across the 479 houses, the mean is 2.75, or about 3 bedrooms. In contrast to the price histogram, this one is roughly symmetrical.

c. Bathrooms

Descriptive Statistics
                     N     Range   Minimum   Maximum   Mean   Std. Deviation   Variance
Bathrooms            479   3       1         4         2.08   .716             .513
Valid N (listwise)   479

Because the range of this variable is small, the standard deviation is also small, at 0.716. Across the 479 houses, the mean is 2.08, or about 2 bathrooms. In contrast to the price histogram, this one is roughly symmetrical.

d. Area

Descriptive Statistics
                     N     Range   Minimum   Maximum   Mean      Std. Deviation   Variance
Area                 479   296.0   20.0      316.0     157.107   52.7270          2780.132
Valid N (listwise)   479

The range of this variable is fairly large, about 296: the highest value is 316 and the lowest is 20. The mean area of a house is approximately 157.107 m2, which suggests that many of the homes purchased are spacious. The standard deviation is relatively low, and the histogram is roughly symmetrical.

3. The relationship between variables

Correlations (Pearson correlation; Sig. (2-tailed) = .000 and N = 479 for every pair)
             Price    Bedrooms   Bathrooms   Area
Price        1        .313**     .567**      .449**
Bedrooms     .313**   1          .533**      .661**
Bathrooms    .567**   .533**     1           .689**
Area         .449**   .661**     .689**      1
**. Correlation is significant at the 0.01 level (2-tailed).

For price and bedrooms, r = 0.313, which is in the range of 0.3 to 0.5.
It can therefore be said that there is a low positive correlation between price and the number of bedrooms in a house.

For price and bathrooms, r = 0.567, which is in the range of 0.5 to 0.7 and indicates a moderate positive correlation between the number of bathrooms and the price of the house.

For price and area, r = 0.449, which is in the range of 0.3 to 0.5, so there is a low positive correlation between sale price and area.

For all three pairs, Sig. (2-tailed) = P value = 0.000 < α = 0.05, so price is correlated with area and with the number of rooms, and the correlations are significant at the 1% level.

The scatter plot of selling price against total area shows that price tends to change as area changes. The dense band of points suggests a positive link between selling price and area, consistent with the significant correlation. However, there are a few sparsely populated regions of the plot where price depends little on floor area. In general, larger residences cost more, while smaller dwellings are more affordable.

The scatter plot of price against the number of bedrooms shows little visible relationship, because prices vary widely regardless of how many bedrooms there are. This agrees with the low correlation coefficient (r = 0.313), although that correlation is still statistically significant. Some houses with only a few bedrooms are still priced high because of their type and location, while some houses with several bedrooms sell for lower prices.

In the scatter plot of price against the number of bathrooms, prices again vary widely at each bathroom count, so the plot shows little visible link between price and number of bathrooms.
This visual impression is not a test of significance: the correlation table reports r = 0.567 for price and bathrooms with Sig. = 0.000, a moderate and statistically significant correlation. Some houses with few bathrooms are still priced high for their type and location, while some houses with many bathrooms are less costly.

4. Summary quantitative variable classifying by qualitative variables

Price by Type
Type        Count   Mean      Median    Mode       Standard Deviation
Apartment   436     1928151   1700000   2500000    1362388
Duplex      27      2876723   3340000   265000a    1721394
Penthouse   10      3255766   3140473   1250000a   1436892
Studio      6       1160333   1200000   1200000    483885
a. Multiple modes exist. The smallest value is shown.

The table shows that most of the homes sold were in the "Apartment" category (436 units). Their average selling price was 1928151, the most common (modal) price was 2500000, and the standard deviation was 1362388. Only 27 "Duplex" homes were sold, perhaps because they are more luxurious than apartments. Their average price was 2876723, the smallest modal price was 265000, and the standard deviation was 1721394. Penthouses are similar to duplexes: 10 were sold at an average price of 3255766, with a smallest modal price of 1250000 and a standard deviation of 1436892. Only six studios were sold, at a mean price of 1160333, with a median and mode of 1200000 and a standard deviation of 483885.
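The grouped summary above (count, mean, median, mode, and standard deviation of price within each category) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report's figures come from SPSS, the function name summarize_by_group is hypothetical, and the records below are made-up values, not rows from the project data.

```python
from collections import defaultdict
from statistics import mean, median, multimode, stdev

# Hypothetical (type, price) records for illustration only;
# these are NOT observations from the House Price Data Project.
sales = [
    ("Apartment", 1_500_000),
    ("Apartment", 2_500_000),
    ("Apartment", 2_500_000),
    ("Duplex", 3_400_000),
    ("Duplex", 2_300_000),
    ("Studio", 1_200_000),
    ("Studio", 1_200_000),
    ("Studio", 900_000),
]


def summarize_by_group(records):
    """Return count, mean, median, smallest mode and sample SD of price per category."""
    groups = defaultdict(list)
    for category, price in records:
        groups[category].append(price)

    summary = {}
    for category, prices in groups.items():
        summary[category] = {
            "count": len(prices),
            "mean": mean(prices),
            "median": median(prices),
            # When several modes exist, report the smallest one,
            # mirroring the footnote under the table above.
            "mode": min(multimode(prices)),
            # The sample standard deviation needs at least two observations.
            "sd": stdev(prices) if len(prices) > 1 else float("nan"),
        }
    return summary


summary = summarize_by_group(sales)
for category, stats in summary.items():
    print(category, stats)
```

Reporting the smallest of several modes is a choice made here to match the table footnote. With only a handful of observations per group, as for the studios, the mode and standard deviation are unstable and should be read with caution.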
Price by Furnished and Level_group
                     Count   Mean      Median    Mode       Standard Deviation
Furnished     No     295     2039651   1837000   2200000    1359741
              Yes    184     1935696   1650000   1500000    1482878
Level_group   1.00   284     2125213   2000000   2500000    1432771
              2.00   195     1816947   1550000   1600000a   1353242

The table shows that most of the homes sold were not furnished (295 units). Their average selling price was 2039651, the modal price was 2200000, and the standard deviation was 1359741. The 184 furnished homes sold at an average price of 1935696 dollars, with a modal price of 1500000 and a standard deviation of 1482878 dollars. Level group 1 accounted for 284 units at an average price of 2125213, with a modal price of 2500000 and a standard deviation of 1432771. Level group 2 accounted for 195 units at an average price of 1816947, with a smallest modal price of 1600000 and a standard deviation of 1353242.

5. Evaluation of various types of tables and charts

Qualitative variables

Frequency tables, bar charts, and pie charts are the most useful tools for summarizing qualitative variables. A frequency table presents the number of observations in each distinct category. In this project the categories are the property type, the furnished status, and the level group. Bar charts and pie charts help viewers compare the categories quickly, and several variables can be set side by side to show the pattern of each.

Quantitative variables

Quantitative variables are summarized by the minimum, maximum, mean, median, mode, and standard deviation of the data.
All of the quantitative variables described above (price, number of bedrooms, number of bathrooms, and area) are examined in this way. Histograms are typically used to graph quantitative variables because they let readers quickly see how the values are distributed.

Bivariate qualitative variables

The relationship between two qualitative variables can be presented in a cross table, from which the reader can see how the categories of one variable are distributed across the categories of another. A clustered bar chart can also be used to draw attention to these relationships.

Bivariate quantitative variables

A scatter plot is a graph used to examine the direction and strength of the relationship between an independent and a dependent variable by looking at the pattern of the points. The correlation coefficient expresses this relationship as a single numerical value.

Part C: Analysing and evaluating "House Price Data Project" data

1. T-test

Based on the table, two hypotheses about the variances are put forward:

H0: σ²(furnished) = σ²(not furnished)
H1: σ²(furnished) ≠ σ²(not furnished)

F = 2.325, Sig. (F) = 0.128, α = 0.05

Sig. (F) = 0.128 > 0.05 => do not reject H0: σ²(furnished) = σ²(not furnished). The variances of the two groups are not significantly different.

The hypotheses about the means are as follows:

H0: µ(furnished) = µ(not furnished)
H1: µ(furnished) ≠ µ(not furnished)

Sig. (2-tailed) = 0.432, α = 0.05

Sig. (2-tailed) = 0.432 > 0.05 => do not reject H0: µ(furnished) = µ(not furnished). This shows that there is no statistically significant difference between the mean prices of furnished and unfurnished houses.
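As a cross-check, the equal-variance t statistic for the furnished comparison can be recomputed from the counts, means, and standard deviations reported in Part B. The sketch below is not the SPSS procedure used in the report. The function name pooled_t_test is hypothetical, and the p-value uses a standard normal approximation, which is an assumption that is reasonable only because the degrees of freedom (477) are large.

```python
from math import sqrt
from statistics import NormalDist


def pooled_t_test(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t test computed from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    # Assumption: with df this large, the t distribution is approximated
    # by the standard normal distribution for the two-sided p-value.
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_two_sided


# Summary statistics taken from the "Price by Furnished" table in Part B.
t_stat, df, p_value = pooled_t_test(
    n1=295, mean1=2039651, sd1=1359741,  # not furnished
    n2=184, mean2=1935696, sd2=1482878,  # furnished
)
print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.3f}")
```

Run on these inputs, the statistic comes out at roughly 0.79 with a two-sided p-value of roughly 0.43, in line with the reported Sig. (2-tailed) = 0.432, so the decision not to reject H0 is unchanged.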
Based on the table, two hypotheses about the variances are put forward:

H0: σ²(group 1) = σ²(group 2)
H1: σ²(group 1) ≠ σ²(group 2)

F = 1.052, Sig. (F) = 0.306, α = 0.05

Sig. (F) = 0.306 > 0.05 => do not reject H0: σ²(group 1) = σ²(group 2). The variances of the two groups are not significantly different.

The hypotheses about the means are as follows:

H0: µ(group 1) = µ(group 2)
H1: µ(group 1) ≠ µ(group 2)

Sig. (2-tailed) = 0.018, α = 0.05

Sig. (2-tailed) = 0.018 < 0.05 => reject H0, so µ(group 1) ≠ µ(group 2). This shows that there is a statistically significant difference between the mean prices of group 1 and group 2 houses.

2. Regression analysis

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .580a   .336       .329                1153106.999
a. Predictors: (Constant), Level_dummy, Bedrooms, Furished_dummy, Bathrooms, Area

The change in the dependent variable (price) is explained using five independent variables: the number of bedrooms, the number of bathrooms, the area, the furnished dummy, and the level dummy. R = 0.580 is the multiple correlation coefficient between the observed and predicted prices. R Square = 0.336 indicates that these variables together explain 33.6% of the variation in selling price. The remaining 66.4% is due to factors outside the model.

ANOVAa
Model 1      Sum of Squares        df    Mean Square          F        Sig.
Regression   318276188523905.000   5     63655237704781.000   47.873   .000b
Residual     628927170785219.600   473   1329655752188.625
Total        947203359309124.600   478
a. Dependent Variable: Price
b. Predictors: (Constant), Level_dummy, Bedrooms, Furished_dummy, Bathrooms, Area

H0: The model is overall insignificant
H1: The model is overall significant

P Value = 0.000 < α = 0.05 => reject H0. The model is overall significant.

Coefficientsa
                   Unstandardized Coefficients   Standardized Coefficients
Model 1            B              Std. Error     Beta                        t        Sig.
(Constant)         -201980.818    256987.751                                 -.786    .432
Bedrooms           -93063.600     117497.280     -.040                       -.792    .429
Bathrooms          957095.887     103242.323     .487                        9.270    .000
Area               3856.473       1583.556       .144                        2.435    .015
Furished_dummy     -222031.397    109761.620     -.077                       -2.023   .044
Level_dummy        -130040.013    108835.403     -.045                       -1.195   .233
a. Dependent Variable: Price

To test the significance of each β value, there are two hypotheses:

H0: β = 0
H1: β ≠ 0

If Sig. > 0.05 => do not reject H0
If Sig. < 0.05 => reject H0

Based on the coefficients table, the P Value of each variable is:

P Value of Bedrooms = 0.429 > 0.05 => do not reject H0
P Value of Bathrooms = 0.000 < 0.05 => reject H0
P Value of Area = 0.015 < 0.05 => reject H0
P Value of Furnished_dummy = 0.044 < 0.05 => reject H0
P Value of Level_dummy = 0.233 > 0.05 => do not reject H0

The P Values of three variables, namely the number of bathrooms, the Area, and the Furnished dummy variable, are all lower than the alpha value. These variables therefore have a statistically significant effect on Price in the regression model. If a variable's P Value is greater than 0.05, it does not play a significant role in the regression model. In other words, that variable (here Bedrooms and Level_dummy) does not have a statistically significant effect on Price.

We have the formula:

Y = B0 + B1X1 + B2X2 + B3X3

in which
B0: Regression constant
B1, B2, B3: Regression coefficients of the 3 variables (Number of bathrooms, Area, Furnished_dummy)
X1, X2, X3: the 3 independent variables

\[ \widehat{\mathrm{Price}} = -201980.818 + 957095.887\,\mathrm{Bathrooms} + 3856.473\,\mathrm{Area} - 222031.397\,\mathrm{Furnished\_dummy} \]

B1 = 957095.887: when the number of bathrooms increases by 1, the predicted house price increases by 957095.887, holding the other variables constant.
B2 = 3856.473: when the area of the house increases by 1 m2, the predicted house price increases by 3856.473, holding the other variables constant.
B3 = -222031.397: when the house is furnished (Furnished_dummy = 1), the predicted house price is 222031.397 lower than for an otherwise identical unfurnished house.

3.
Evaluate the use of summary statistics

In descriptive statistics, the variables in the data set (price, number of bedrooms, number of bathrooms, area, property type, furnished status, and level group) are analyzed and summarized. This approach helps readers understand the information because the data is presented succinctly in tables and charts. However, it does not produce or test any hypotheses or predictions. For that, the data analyst can use inferential statistics to test a hypothesis and reach a conclusion that is supported by the data. This approach, however, requires a higher level of expertise from the analyst.

4. Differences between regression analysis and correlation coefficients

Analysts use the correlation coefficient to investigate the relationship between two variables, such as the selling price and the area of a home. It shows whether the relationship between the two variables is positive or negative, and how strong it is. Regression analysis considers the same variables, but it also shows researchers how much the dependent variable changes in response to a change in an independent variable. For instance, in the regression above, when the area increases by one square meter the predicted price rises by 3856.473, holding the other variables constant.

References

Anderson, D.R., Sweeney, D.J., Williams, T.A., Camm, J.D. and Cochran, J.J. (2018). Statistics for business & economics. Boston, MA: Cengage Learning.

Barrow, M. (2017). Statistics for economics, accounting and business studies. Pearson Education Limited.

Kalish, C. and Thevenow-Harrison, J. (2014). Descriptive and Inferential Problems of Induction. Psychology of Learning and Motivation, pp.1-39.

Kaur, P., Stoltzfus, J. and Yellapu, V. (2018). Descriptive statistics. International Journal of Academic Medicine, 4(1), p.60.

McClave, J.T., Benson, P.G. and Sincich, T. (2014). Statistics for business and economics. Boston: Pearson.