Robert Gordon University
CMM703 – Data Analysis 2022

Module leader: Dr. Sameera Viswakula
Unit: Coursework
Weighting: 85% for Report, 15% for In-class presentation

Learning Outcomes Covered in this Assignment:
LO1 Critically appraise data transformation methods for statistical analysis.
LO2 Justify analysis methods and conclusions by selective and critical use of relevant theories.
LO3 Design, implement and evaluate the data analytic lifecycle stages: clean, transform, analyse and visualise.
LO4 Communicate conclusions, insights and recommendations to a wider audience by tailoring them at different levels of detail.

Handed Out: 30th of Mar 2022
Due Date: TBA - for the Report. Each individual will get a 15-minute time slot to demonstrate his or her solutions in a viva defense.
Expected deliverables: One electronic file containing the report as specified below.
Method of Submission: Online via Moodle – Report
Type of Feedback and Due Date: Written feedback and marks - TBA

Coursework Description (Subject to RGU validation)

Q1. Find a dataset with economic indicators for at least 10 recent years. Select two commonly used economic indicators (e.g. GDP, inflation, GDP to Debt ratio) and visualize the behavior of those two indicators for the following countries. Provide only two plots, i.e. one plot for each indicator. [LO1, LO3 and LO4]
• Sri Lanka
• Argentina
• Bangladesh
• Singapore
• Venezuela

Q2. The insuranceData.csv file contains a sample of 1471 records of insurance contractors currently enrolled in an insurance plan in a famous insurance company in Sri Lanka. It includes some characteristics of the patient and the premium decided by the insurance company.
The features are:
• age: age of the insurance contractor
• sex: gender of the insurance contractor
• bmi: body mass index of the contractor (kg/m^2)
• num_kids: number of children covered by health insurance
• smoking_status: smoking status of the insurance contractor
• district: residential area of the beneficiary (Colombo, Galle, Badulla or Trinco)
• premium: monthly insurance premium decided by the health insurance company

a. Do a thorough descriptive analysis and identify the patterns and potentially significant variables. Use appropriate plots and tables. [LO1]
b. Perform a cluster analysis on the dataset and identify potential clusters. [LO2]
c. Fit a suitable predictive model to predict the insurance premium of a contractor. [LO2]
d. Improve the model in part c) by meaningfully transforming or recoding variables. [LO3]
e. Check the model assumptions. [LO2]
f. Validate your model using a testing dataset. [LO4]

Q3. Write an R function to do the following tasks. When a dataset is fed to the function, your function should:
a. Identify qualitative and quantitative variables in the dataset. [LO1]
b. Count the missing values in each variable. Impute the missing values using: [LO3]
   i. the mean value of the variable if it is numeric.
   ii. the mode of the variable if it is categorical.
c. Identify univariate outliers for each numeric variable. [LO2]
d. Summarize each variable using a proper visualization tool for the respective variable (e.g. histogram, boxplot). [LO4]
e. When the response variable is specified as an argument, it should run the best predictive model for that response category (consider only continuous and binary response variables) and select features considering all other meaningful variables. Your function should print relevant diagnostic metrics and plots for the selected model. [LO3]

The Report
You need to formulate solutions clearly, explaining your R codes and specifying outputs where necessary.
Titles/captions should be given to any figures or tables. For the descriptive summary, a small interpretation/description should be given after each figure/table. Your report should be no longer than 2000 words.

Presentation
You need to explain your codes and outputs using slides and/or by running codes for each question.

Coursework Marking scheme
The Coursework will be marked based on the following marking criteria:

Question        Marks      Marks provided   Comments
Q1              15x2=30
Q2 a)           5
Q2 b)           5
Q2 c)           5
Q2 d)           15
Q2 e)           5
Q2 f)           10
Q3 a)           5
Q3 b)           5
Q3 c)           5
Q3 d)           10
Q3 e)           15
Presentation    15
Total           130