Computer Assignments

DEPARTMENT OF STATISTICS
General Procedures
R Statistical Programming Language
R was created by statisticians Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand.
R is a versatile language that empowers users to analyze, visualize, and
interpret data effectively, making it a cornerstone in the toolkit of data
analysts, statisticians, researchers, and professionals across industries.
Why R?
1. Statistical Analysis: R provides a rich set of tools for performing various statistical analyses.
2. Data Visualization: R offers powerful and customizable data visualization capabilities.
3. Data Manipulation: R includes functions for data cleaning, transformation, and manipulation, making it suitable for preparing data for analysis.
4. Statistical Modeling: R supports various statistical modeling techniques.
5. Custom Functions: Users can create their own functions and packages, enabling the development of custom solutions tailored to specific analytical needs.
6. Open Source: R is open-source software, which means it's freely available and can be modified to suit individual requirements.
7. Cross-Platform: R is compatible with various operating systems, making it accessible to users across different platforms.
8. Interoperability: R can interface with other programming languages like Python, C++, and Java, enhancing its flexibility and integration with existing workflows.
Comprehensive R Archive Network (CRAN)
Community and Packages: CRAN (Comprehensive R Archive Network) is a centralized hub where users
access and distribute packages. These packages extend R's capabilities, covering diverse areas such
as genetics, economics, finance, and social sciences.
Packages can be installed with a single command in the R console:
install.packages("Name")
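As a minimal sketch, a package can be installed only when it is missing and then loaded. The helper name `ensure_package` is hypothetical and not part of R; `stats` is used only because it ships with every R installation, so nothing is downloaded.

```r
# Hypothetical helper (not part of base R): install a package from CRAN
# only if it is not already available, then attach it.
ensure_package <- function(name) {
  if (!requireNamespace(name, quietly = TRUE)) {
    install.packages(name)  # needs internet access and a CRAN mirror
  }
  library(name, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call does not download anything.
ensure_package("stats")
```

For a package that is not yet installed, the same call would run install.packages() first.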
RStudio is an integrated development environment (IDE) for R, a programming language for
statistical computing and graphics. It is developed by Posit, PBC, a public-benefit corporation
founded by J. J. Allaire.
https://en.wikipedia.org/wiki/RStudio
Scholar: RStudio
Scholar is a small computer cluster, suitable for classroom learning about high
performance computing (HPC). It consists of 7 interactive login servers and 28
batch worker nodes.
https://www.rcac.purdue.edu/compute/scholar
1. Log in with BoilerKey.
2. Select the RStudio Server link under Interactive Apps.
3. Launch RStudio Server.
4. Wait in the queue until the status changes.
5. Click Connect to RStudio Server.
The RStudio Server front end reboots every Saturday night around midnight Purdue time.
IT suggests:
"To avoid your carriage turning into a pumpkin, save your work before midnight
on Saturday and take a 15 minute break."
If you don't, you may get spurious error messages because Scholar gets confused.
When you are done or taking a long break:
Exit Scholar and free up the compute nodes!
STAT 350
COMPUTER ASSIGNMENT 1
Dataset
The dataset provided for student use has been modified and
sourced primarily from a German Credit Status Prediction
Dataset hosted at the University of California at Irvine
Machine Learning Repository.

Column  Variable Name          Description
1       creditAmountDollar     Credit line of the account in 1994 US dollars
2       duration               The repayment duration of the loan in months
3       riskStatus             The credit risk classification status
4       installmentCommitment  Installment rate in percentage of disposable income
5       purpose                Purpose for opening line of credit
6       otherParties           Co-applicant or Guarantor
7       age                    The age in years of the debtor
8       personalStatus         Sex + Marital Status
9       numDependents          The number of dependent children of the debtor
10      ownTelephone           Has a working telephone number
11      employment             Current employment in years
12      residenceSince         Current residence in number of years
13      foreignWorker          Status as domestic or foreign
14      job                    Skill level of current job
15      housing                Status of housing
16      propertyMagnitude      Relative size of the property ownership of the debtor
17      checkingStatus         Status of checking account
18      savingsStatus          Status of savings account and bonds
19      creditHistory          Prior credit history
20      existingCredits        The number of existing lines of credit including the current
21      otherPaymentPlans      Other existing forms of credit with financial institutions
22      lamount                The log credit line of the account in 1994 US dollars. Note that the first character is a small 'el' to denote log
23      lamountPred            The predicted log credit line of the account in 1994 US dollars based on an internal statistical model. Note that the first character is a small 'el' to denote log

Hofmann, Hans. (1994). Statlog (German Credit Data). UCI Machine Learning Repository.
https://doi.org/10.24432/C5NC77.
Computer Assignment #1
Data workflow: Load, Exploration, Cleaning, Manipulations/Transformations, Save
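The Load and Save steps might look like the sketch below. The slides do not name the course data file, so a toy data frame with made-up values and a temporary CSV file stand in for it; `creditData` and `csv_path` are illustrative names.

```r
# Toy stand-in for the course data; the values are made up for illustration.
creditData <- data.frame(
  creditAmountDollar = c(1169, 5951, 2096),
  duration = c(6, 48, 12),
  age = c(67, 22, 49)
)

# Save: write the data frame to a CSV file (a temporary file here).
csv_path <- tempfile(fileext = ".csv")
write.csv(creditData, csv_path, row.names = FALSE)

# Load: read the CSV back into a data frame.
creditData_loaded <- read.csv(csv_path)
```

With the real assignment file, `csv_path` would instead point to wherever that file is stored.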
Exploration
View(data)
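Besides View(), a few base R functions give a quick text summary of a data frame. The sketch below uses a toy data frame with made-up values; only the column names come from the dataset description.

```r
# Toy data frame; values are invented for illustration only.
creditData <- data.frame(
  creditAmountDollar = c(1169, 5951, 2096, 7882),
  duration = c(6, 48, 12, 42),
  age = c(67, 22, 49, 45)
)

dim(creditData)          # number of rows and columns
head(creditData, n = 2)  # first two rows
str(creditData)          # column types and a preview of the values
summary(creditData$age)  # five-number summary plus the mean

# View() opens a spreadsheet-style viewer, so run it only interactively.
if (interactive()) View(creditData)
```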
Cleaning
complete.cases(data_table)
is.na(data_table)
help("is.na")
help("complete.cases")
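One way these functions could be combined to drop incomplete rows is sketched below. The data values are made up, and the name `creditData_clean` is borrowed from the transformation slide; whether the assignment expects exactly this approach is an assumption.

```r
# Toy data with missing values; numbers are invented for illustration.
data_table <- data.frame(
  creditAmountDollar = c(1169, NA, 2096, 7882),
  age = c(67, 22, NA, 45)
)

is.na(data_table)           # logical matrix: TRUE where a value is missing
colSums(is.na(data_table))  # number of missing values in each column

# complete.cases() returns TRUE for rows that contain no missing values.
keep <- complete.cases(data_table)
creditData_clean <- data_table[keep, ]
```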
Manipulations/Transformations
log(creditData_clean$lamount)
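As a hedged sketch of a log transformation, the code below builds a log column from the dollar amount. It assumes that `lamount` is the natural log of `creditAmountDollar`; the slides do not state the base, and the values are made up.

```r
# Toy cleaned data; values are invented for illustration.
creditData_clean <- data.frame(creditAmountDollar = c(1169, 5951, 2096))

# log() computes the natural logarithm by default (assumed base for lamount).
creditData_clean$lamount <- log(creditData_clean$creditAmountDollar)

# exp() undoes the transformation and recovers the dollar amounts.
back <- exp(creditData_clean$lamount)
```

For other bases, log10() or log(x, base = 2) could be used.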
How will computer assignments be graded?
1. Code Cut/Paste (Input TextBox)
2. Output (File Upload)
3. Interpretations (Type up detailed Answers) (Input TextBox)
Output uploads should be screenshots (Snip Tool).
BAD: P value = 0.5383
GOOD: p-value = 0.5383
Open and Utilize Everything
Do not re-type output; it should be snipped!
Highlight the output.
Open everything up, and copy/paste or snip between the open windows as you are working.