Uploaded by Tharindu Samarakoon

01 CMM703-Data Analysis-Coursework-2022

Robert Gordon University
CMM703 – Data Analysis 2022
Module leader: Dr. Sameera Viswakula
Unit: Coursework
Weighting: 85% for Report; 15% for In-class presentation
Learning Outcomes Covered in this Assignment:
LO1 Critically appraise data transformation methods for statistical analysis.
LO2 Justify analysis methods and conclusions by selective and critical use of relevant theories.
LO3 Design, implement and evaluate the data analytic lifecycle stages: clean, transform, analyse and visualise.
LO4 Communicate conclusions, insights and recommendations to a wider audience by tailoring them at different levels of detail.
Handed Out: 30th of Mar 2022
Due Date: TBA - for the Report. Each individual will get a 15-minute time slot to demonstrate his or her solutions in a viva defence.
Expected deliverables: One electronic file containing the report as specified below.
Method of Submission: Online via Moodle – Report
Type of Feedback and Due Date: Written feedback and marks - TBA
Coursework Description (Subject to RGU validation)
Q1. Find a dataset with economic indicators for at least 10 recent years. Select two commonly
used economic indicators (e.g. GDP, inflation, debt-to-GDP ratio) and visualize the
behavior of those two indicators for the following countries. You are supposed to provide
only two plots, i.e. one plot for each indicator. [LO1, LO3 and LO4]
• Sri Lanka
• Argentina
• Bangladesh
• Singapore
• Venezuela
Q2. The insuranceData.csv file contains a sample of 1471 records of insurance contractors
currently enrolled in an insurance plan with a well-known insurance company in Sri Lanka. It
includes some characteristics of each contractor and the premium decided by the insurance
company. The features are:
age: age of the insurance contractor
sex: gender of the insurance contractor
bmi: body mass index of the contractor (kg / m ^ 2)
num_kids: number of children covered by health insurance
smoking_status: smoking status of the insurance contractor
district: residential area of the beneficiary; Colombo, Galle, Badulla and Trinco.
premium: monthly insurance premium decided by health insurance company
a. Do a thorough descriptive analysis and identify the patterns and potentially
significant variables. Use appropriate plots and tables.
[LO1]
b. Perform a cluster analysis on the dataset and identify potential clusters.
[LO2]
c. Fit a suitable predictive model to predict the insurance premium of a contractor.
[LO2]
d. Improve the model in part c) by meaningfully transforming, recoding etc. the variables.
[LO3]
e. Check the model assumptions.
[LO2]
f. Validate your model using a testing dataset.
[LO4]
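One possible way to approach the validation in part f, shown only as a minimal sketch: hold out a random share of the rows, fit the model on the remaining rows, and report the root mean squared error (RMSE) on the held-out rows. The helper name holdout_rmse, the 80/20 split, the use of lm() and the choice of RMSE are all assumptions made for illustration; none of them is prescribed by this brief. The built-in mtcars data stands in for insuranceData.csv.

```r
# Hold-out validation sketch. `holdout_rmse` is a hypothetical helper name.
# Assumes the response on the left of the formula is an untransformed column
# of `df` and that `df` has no missing values.
holdout_rmse <- function(df, formula, test_frac = 0.2, seed = 1) {
  set.seed(seed)                           # reproducible split
  n <- nrow(df)
  n_test <- max(1, floor(test_frac * n))   # keep at least one test row
  test_idx <- sample(n, size = n_test)

  train <- df[-test_idx, , drop = FALSE]
  test <- df[test_idx, , drop = FALSE]

  fit <- lm(formula, data = train)         # fit on training rows only
  pred <- predict(fit, newdata = test)     # predict the unseen rows

  observed <- test[[all.vars(formula)[1]]] # first variable = response
  sqrt(mean((observed - pred)^2))          # RMSE on the test set
}

# Stand-in example with a data set that ships with R.
rmse <- holdout_rmse(mtcars, mpg ~ wt + hp)
print(rmse)

# With the coursework data the call might look like this (illustrative only):
# insurance <- read.csv("insuranceData.csv")
# holdout_rmse(insurance, premium ~ age + bmi + smoking_status)
```

A single random split can be noisy on a sample of this size, so repeating the split with different seeds, or using k-fold cross-validation, may give a steadier estimate.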
Q3. Write an R function to do the following tasks. When a dataset is fed to the function, your
function should:
a. Identify qualitative and quantitative variables in the dataset.
[LO1]
b. Count the missing values in each variable. Impute the missing values using:
i. the mean value of the variable if it is numeric.
ii. the mode of the variable if it is categorical.
[LO3]
c. Identify univariate outliers for each numeric variable.
[LO2]
d. Summarize each variable using a visualization suited to the type of that
variable (e.g. histogram, boxplot etc.).
[LO4]
e. When the response variable is specified as an argument, it should run the best
predictive model for that response category (consider only continuous and binary
response variables) and select features considering all other meaningful variables.
Your function should print relevant diagnostic metrics and plots for the selected
model.
[LO3]
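As a rough illustration of the logic behind parts a to c, the sketch below classifies columns, imputes missing values by mean or mode, and flags univariate outliers. All helper names (classify_vars, stat_mode, impute_missing, iqr_outliers) are hypothetical. Treating every numeric column as quantitative and using the 1.5 x IQR rule for outliers are assumptions; the brief does not prescribe either choice.

```r
# Part a (assumption): numeric columns are quantitative, all others qualitative.
classify_vars <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  list(quantitative = names(df)[is_num], qualitative = names(df)[!is_num])
}

# Most frequent non-missing value (ties: the first value encountered wins).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Part b: count missing values per column, then impute with the mean
# (numeric columns) or the mode (all other columns).
impute_missing <- function(df) {
  na_counts <- colSums(is.na(df))
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing) || all(missing)) next  # nothing to do / nothing to learn from
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- stat_mode(df[[col]])
    }
  }
  list(data = df, na_counts = na_counts)
}

# Part c (assumption): flag values beyond k * IQR from the quartiles.
# Returns the positions of the flagged values.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  which(x < q[1] - fence | x > q[2] + fence)
}

# Toy data, made up purely to exercise the helpers.
toy <- data.frame(
  age = c(20, NA, 40, 30),
  sex = c("F", "F", NA, "M"),
  stringsAsFactors = FALSE
)
vars <- classify_vars(toy)
result <- impute_missing(toy)
print(vars)
print(result$na_counts)
print(result$data)
print(iqr_outliers(c(1, 2, 3, 4, 100)))
```

In a full solution these pieces would be called from one wrapper function, with the plotting (part d) and model selection (part e) added on top.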
The Report
You need to formulate solutions, clearly explaining your R code and specifying outputs where
necessary. Titles/captions should be given to all figures and tables. For the descriptive summary,
a short interpretation/description should be given after each figure/table.
Your report should be no longer than 2000 words.
Presentation
You need to explain your code and outputs using slides and/or by running the code for each
question.
Coursework Marking scheme
The Coursework will be marked based on the following marking criteria:
Question        Marks      Marks provided    Comments
Q1              15x2=30
Q2 a)           5
Q2 b)           5
Q2 c)           5
Q2 d)           15
Q2 e)           5
Q2 f)           10
Q3 a)           5
Q3 b)           5
Q3 c)           5
Q3 d)           10
Q3 e)           15
Presentation    15
Total           130