DEPARTMENT OF STATISTICS

Computer Assignments General Procedures

R Statistical Programming Language

R was created by statisticians Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. It is a versatile language for analyzing, visualizing, and interpreting data, which has made it a standard tool for data analysts, statisticians, researchers, and professionals across industries.

Why R?

1. Statistical Analysis: R provides a rich set of tools for performing various statistical analyses.
2. Data Visualization: R offers powerful and customizable data visualization capabilities.
3. Data Manipulation: R includes functions for data cleaning, transformation, and manipulation, making it suitable for preparing data for analysis.
4. Statistical Modeling: R supports various statistical modeling techniques.
5. Custom Functions: Users can create their own functions and packages, enabling custom solutions tailored to specific analytical needs.
6. Open Source: R is open-source software, so it is freely available and can be modified to suit individual requirements.
7. Cross-Platform: R runs on various operating systems, making it accessible to users across different platforms.
8. Interoperability: R can interface with other programming languages such as Python, C++, and Java, which makes it easier to integrate with existing workflows.

Comprehensive R Archive Network (CRAN)

Community and Packages: CRAN (the Comprehensive R Archive Network) is a centralized hub where users access and distribute packages. These packages extend R's capabilities in diverse areas such as genetics, economics, finance, and social sciences. A package can be installed with a single command in the R console:
    install.packages("Name")

RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. It is developed by Posit, PBC, a public-benefit corporation founded by J. J. Allaire.
https://en.wikipedia.org/wiki/RStudio

Scholar: R-studio

Scholar is a small computer cluster, suitable for classroom learning about high performance computing (HPC). It consists of 7 interactive login servers and 28 batch worker nodes.
https://www.rcac.purdue.edu/compute/scholar

1. Log in with BoilerKey.
2. Select the RStudio Server link under Interactive Apps.
3. Launch RStudio Server.
4. Wait in the queue until the status changes.
5. Click Connect to RStudio Server.

The RStudio Server does a front-end reboot every Saturday night around midnight Purdue time. This is the suggestion from IT: "To avoid your carriage turning into a pumpkin, save your work before midnight on Saturday and take a 15 minute break." If you don't, you could get some bogus error messages because Scholar gets confused.

When done or taking a long break: exit Scholar and free up compute nodes!

STAT 350 COMPUTER ASSIGNMENT 1

The dataset provided for student use was modified and sourced primarily from a German Credit Status Prediction dataset hosted at the University of California, Irvine (UCI) Machine Learning Repository.

Variable Name          Dataset Column  Description
creditAmountDollar     1               Credit line of the account in 1994 US dollars
duration               2               The repayment duration of the loan in months
riskStatus             3               The credit risk classification status
installmentCommitment  4               Installment rate in percentage of disposable income
purpose                5               Purpose for opening line of credit
otherParties           6               Co-applicant or guarantor
age                    7               The age in years of the debtor
personalStatus         8               Sex + marital status
numDependents          9               The number of dependent children of the debtor
ownTelephone           10              Has a working telephone number
employment             11              Current employment in years
residenceSince         12              Current residence in number of years
foreignWorker          13              Status as domestic or foreign
job                    14              Skill level of current job
housing                15              Status of housing
propertyMagnitude      16              Relative size of the property ownership of the debtor
checkingStatus         17              Status of checking account
savingsStatus          18              Status of savings account and bonds
creditHistory          19              Prior credit history
existingCredits        20              The number of existing lines of credit including the current
otherPaymentPlans      21              Other existing forms of credit with financial institutions
lamount                22              The log credit line of the account in 1994 US dollars. Note that the first character is a lowercase 'el' to denote log
lamountPred            23              The predicted log credit line of the account in 1994 US dollars, based on an internal statistical model. Note that the first character is a lowercase 'el' to denote log

Hofmann, Hans. (1994). Statlog (German Credit Data). UCI Machine Learning Repository. https://doi.org/10.24432/C5NC77.

Computer Assignment #1

Load Data -> Exploration -> Cleaning -> Manipulations/Transformations -> Save

Exploration

    View(data)

Cleaning

    complete.cases(data_table)
    is.na(data_table)
    help("is.na")
    help("complete.cases")

Manipulations/Transformations

    log(creditData_clean$lamount)

How will computer assignments be graded?

1. Code: cut/paste (Input TextBox)
2. Output (File Upload)
3. Interpretations: type up detailed answers (Input TextBox)

Output uploads should be screenshots (Snip Tool).

BAD:  p-value = 0.5383
GOOD: P value = 0.5383

Open and Utilize Everything

Do not re-type output; it should be snipped! Highlight the output.

OPEN EVERYTHING UP: copy/paste or snip between them as you are working.
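
As a rough sketch of the Computer Assignment #1 workflow (load, explore, clean, transform, save), the R snippet below uses a small hand-built data frame as a hypothetical stand-in for the assignment data, since the real file name and values are not given here. The column names come from the data dictionary; the natural-log definition of lamount, the sample values, and the output file name are assumptions made for illustration only.

```r
# Hypothetical stand-in for the assignment data (values are made up).
creditData <- data.frame(
  creditAmountDollar = c(1169, 5951, NA, 7882),
  duration           = c(6, 48, 12, 42),
  age                = c(67, 22, 49, NA)
)

# Exploration: structure and summary of each column.
# (View(creditData) opens a spreadsheet-style viewer in RStudio.)
str(creditData)
summary(creditData)

# Cleaning: count missing values per column, then keep only complete rows.
colSums(is.na(creditData))
creditData_clean <- creditData[complete.cases(creditData), ]

# Transformation: natural log of the credit amount
# (assumed here to be how lamount is defined).
creditData_clean$lamount <- log(creditData_clean$creditAmountDollar)

# Save: write the cleaned data to a temporary CSV (hypothetical file name).
out_file <- file.path(tempdir(), "creditData_clean.csv")
write.csv(creditData_clean, out_file, row.names = FALSE)

# Load: read the saved file back in, as you would at the start of a session.
creditData_reloaded <- read.csv(out_file)
```

Here complete.cases() returns one TRUE/FALSE value per row, so indexing with it drops every row that has at least one NA, while is.na() reports missingness cell by cell.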