R programming

Exercise Problems – R Programming
Assignment-3
Name : Boobathy A
Reg.no : 191921047
Subject Code : ITA0411
Date of Submission: Tue, 11 May 2021
1. Briefly describe your respective dataset. Identify which package the dataset
belongs to? Display the structure of your dataset and also print the variables (fields)
included in it. Import the dataset into the R environment before data analysis. The
working directory should be set with your registration number.
Solution:
Description:
The classic Box & Jenkins airline data: monthly totals of international airline passengers,
1949 to 1960. This is the AirPassengers series from R's built-in datasets package.
Format:
A monthly time series, in thousands.
Program:
#Getting and Setting the Working Directory
getwd()
setwd('C:/Users/BOOBATHY A/Documents/191921047')
getwd()
Output:
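#Import the CSV file into R using the read.csv() function.
#Once the file has been read, the data frame (df) appears in the Environment pane.
#View(df) opens the data frame in a spreadsheet-style viewer (note the capital V).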
#str()-function displays the internal structure of the dataset.
#dim()-function returns the number of rows and columns of the dataset, in that order.
#class()-function returns the class of an object (for example "data.frame"), which
determines the methods that apply to it.
#head()-function returns the first 6 rows of the dataset by default.
#tail()-function returns the last 6 rows of the dataset by default.
#names()-function gives the variable (column) names in the dataset.
#summary()-function produces summary statistics for each column of a data frame (and
result summaries for fitted models).
# $- is used to access a single column of a dataset or data frame.
#unique()-function returns the distinct values in a particular column.
#length()-function returns the number of elements in a vector.
Here,
Number of unique values in the time column: 144
Number of unique values in the value column: 118
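The inspection steps above can be sketched as follows. This is only an illustration under one assumption: the imported CSV mirrors R's built-in AirPassengers series, so the data frame df (with the time and value columns named above) is rebuilt from that series rather than read from the author's file with read.csv().

```r
# Assumption: the CSV mirrors the built-in AirPassengers series,
# so df is rebuilt here instead of being read with read.csv().
df <- data.frame(
  time  = as.numeric(time(AirPassengers)),  # decimal year, e.g. 1949.083
  value = as.numeric(AirPassengers)         # passengers, in thousands
)

str(df)      # internal structure of the data frame
dim(df)      # number of rows, then number of columns
class(df)    # "data.frame"
head(df)     # first 6 rows
tail(df)     # last 6 rows
names(df)    # column names
summary(df)  # per-column summary statistics

# Number of distinct values in each column
n_time  <- length(unique(df$time))
n_value <- length(unique(df$value))
print(c(time = n_time, value = n_value))
```

Every month has its own time stamp, so the time column has as many distinct values as rows; the value column has fewer because some passenger totals repeat.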
2. Perform data analysis by computing the measures of central tendency as well as
measures of dispersion (for a minimum of 2 variables under consideration). Substantiate
your answers with appropriate reasoning.
Solution:
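#mean()-function calculates the sum of the values divided by the number of values in
a data series.
#median()-function returns the middle value of a sorted data series.
#mode-the value with the highest number of occurrences in a set of data. Base R's mode()
returns the storage mode of an object, not this statistic, so it is computed with table().
#na.rm = TRUE is an argument that drops missing values before the statistic is computed.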
Here,
my dataset has no NA values, so na.rm was not needed.
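A minimal sketch of the central-tendency functions and of what na.rm changes. It assumes df can be rebuilt from the built-in AirPassengers series; the short vector x is a made-up example, not data from the assignment.

```r
# Assumption: df mirrors the built-in AirPassengers series.
df <- data.frame(
  time  = as.numeric(time(AirPassengers)),
  value = as.numeric(AirPassengers)
)

# Central tendency of the passenger counts
mean_value   <- mean(df$value)
median_value <- median(df$value)
print(c(mean = mean_value, median = median_value))

# The data has no missing values, so na.rm makes no difference here
print(anyNA(df$value))

# Hypothetical vector with a missing value, to show the effect of na.rm
x <- c(10, NA, 30)
print(mean(x))                # NA: one missing value propagates
print(mean(x, na.rm = TRUE))  # 20: the NA is dropped first
```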
Program:
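# Frequency table of each distinct passenger count
y <- table(df$value)
print(y)
# Statistical mode: the value(s) with the highest frequency.
# Named mode_value so that base R's mode() function is not masked.
mode_value <- names(y)[y == max(y)]
print(mode_value)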
Output:
3. What is the significance of the measures you explored with respect to your
data?
Write a report on the same.
Report on Dataset
#sd()-function computes the sample standard deviation (the square root of var()).
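The dispersion measures asked for in question 2 could be computed along these lines. This is a sketch that assumes df can be rebuilt from the built-in AirPassengers series; the variable names are illustrative.

```r
# Assumption: df mirrors the built-in AirPassengers series.
df <- data.frame(
  time  = as.numeric(time(AirPassengers)),
  value = as.numeric(AirPassengers)
)

# Dispersion of the passenger counts (in thousands)
value_sd    <- sd(df$value)     # sample standard deviation
value_var   <- var(df$value)    # sample variance (sd squared)
value_range <- range(df$value)  # minimum and maximum
value_iqr   <- IQR(df$value)    # spread of the middle 50% of values

print(c(sd = value_sd, var = value_var, iqr = value_iqr))
print(value_range)
```

The standard deviation is in the same unit as the data (thousands of passengers), which makes it easier to interpret alongside the mean than the variance.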
Additional:
Program:
library(moments)
png(file = "values.png")
print(skewness(df$value))
hist(df$value)
dev.off()
Program:
library(moments)
# Use a separate file name so this plot does not overwrite values.png
png(file = "time.png")
print(skewness(df$time))
boxplot(df$time)
dev.off()
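To make the skewness figures easier to interpret, here is a base-R sketch of the usual moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 3/2). The helper skew_manual is hypothetical and written for illustration; it is expected to agree with skewness() from the moments package, but that has not been verified here. df is assumed to mirror the built-in AirPassengers series.

```r
# Assumption: df mirrors the built-in AirPassengers series.
df <- data.frame(
  time  = as.numeric(time(AirPassengers)),
  value = as.numeric(AirPassengers)
)

# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
skew_manual <- function(x) {
  x  <- x[!is.na(x)]          # ignore missing values
  d  <- x - mean(x)           # deviations from the mean
  m2 <- mean(d^2)             # second central moment
  m3 <- mean(d^3)             # third central moment
  m3 / m2^1.5
}

print(skew_manual(df$value))  # positive: long right tail
print(skew_manual(df$time))   # about 0: evenly spaced time stamps
```

A positive coefficient for value means that a few very high monthly totals pull the mean above the median; the time column is evenly spaced, so it is symmetric and a boxplot of it carries little information.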
ggplot:
ggplot2 is an open-source data visualization package for the statistical programming
language R.
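A short sketch of a ggplot2 line chart for this data. It assumes the ggplot2 package is installed and that df mirrors the built-in AirPassengers series; the axis labels and title are illustrative choices.

```r
library(ggplot2)

# Assumption: df mirrors the built-in AirPassengers series.
df <- data.frame(
  time  = as.numeric(time(AirPassengers)),
  value = as.numeric(AirPassengers)
)

# Line chart of monthly passenger totals over time
p <- ggplot(df, aes(x = time, y = value)) +
  geom_line() +
  labs(
    x = "Year",
    y = "Passengers (thousands)",
    title = "Monthly international airline passengers"
  )

print(p)
```

A line chart suits a time series better than a histogram or boxplot because it keeps the ordering of the observations, so the upward trend and the yearly seasonal pattern are both visible.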
4. Find out how many missing values (NA) there are in the dataset. Create a subset of
your dataset after removing the missing values.
Solution:
#subset()-function in R is used to create subsets of a Dataset or dataframe.
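Counting and removing missing values could look like this. The sketch assumes df mirrors the built-in AirPassengers series, which has no NA values, so two values are blanked out in a copy (df_na, a hypothetical name) purely to show the subset step doing something.

```r
# Assumption: df mirrors the built-in AirPassengers series.
df <- data.frame(
  time  = as.numeric(time(AirPassengers)),
  value = as.numeric(AirPassengers)
)

# Count missing values overall and per column
print(sum(is.na(df)))
print(colSums(is.na(df)))

# Hypothetical copy with two values blanked out, for illustration only
df_na <- df
df_na$value[c(5, 20)] <- NA
n_missing <- sum(is.na(df_na))
print(n_missing)

# Keep only the rows where value is present
df_clean <- subset(df_na, !is.na(value))
print(dim(df_clean))

# na.omit() is an alternative that drops rows with an NA in any column
df_clean2 <- na.omit(df_na)
```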
5. Create a contingency table using crosstab between the two most influential columns
and calculate the correlation between them.
#cor.test()- (base R) tests the association between two paired variables; corr.test()
from the psych package does the same for two or more variables at once.
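One possible way to approach this question with base R is sketched below. Both columns are continuous, so they are grouped first: year and level are hypothetical derived variables introduced only for the cross-tabulation, and table() stands in for a dedicated crosstab function. df is assumed to mirror the built-in AirPassengers series.

```r
# Assumption: df mirrors the built-in AirPassengers series.
df <- data.frame(
  time  = as.numeric(time(AirPassengers)),
  value = as.numeric(AirPassengers)
)

# Hypothetical grouping variables for the contingency table
year  <- floor(df$time + 1e-6)  # small offset guards against rounding
level <- ifelse(df$value > median(df$value), "above median", "at/below median")

# Contingency table: months per year falling in each level
tab <- table(year, level)
print(tab)

# Pearson correlation between time and passenger count
r <- cor(df$time, df$value)
print(r)

# Significance test for the same correlation
ct <- cor.test(df$time, df$value)
print(ct)
```

The table shows the early years concentrated at or below the median and the later years above it, which is consistent with a strong positive correlation between time and passenger count.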