Open Academic Analytics Initiative
Marist College

Extraction, Transformation, Load using Kettle

We used Pentaho Data Integration, also known as Kettle, to preprocess the data obtained from different institutions. The raw data is cleaned up in Kettle by running it through a series of SQL commands in a job or a transformation. This document describes how to create each of them in detail.

We used transformations to import the raw data from MS Access into MSSQL Server, and jobs to preprocess the data once it had been imported. We created separate transformation and job files for each semester of data received so far. Please check the attached flows for reference.

Steps to create a transformation:

1) Click on File -> New -> Transformation.
2) Click on Input in the Design tab and drag the “Microsoft Access Input” tool to the transformation layout. Click on the Microsoft Access Input to change its properties.
   o For the File or directory option, click on Browse, select the path of the input file to be loaded, and then click on Add.
   o Click on the Content menu and select the table from the MS Access database that has to be imported to MSSQL Server.
   o Click on the Fields menu and then on the “Get fields” button to retrieve the fields from the loaded database. The name or data type of any field can be changed by editing the retrieved fields.
   o You can click on “Preview rows” to view the data that would be imported to MSSQL Server, and then click on OK.
3) Click on Output in the Design tab and drag the “Table Output” tool to the transformation layout.
4) Create a hop between the “Microsoft Access Input” and the “Table Output”. Either select both, right-click and choose “create new hop”, or hover over the “Microsoft Access Input” until the => shows up and connect it to the Table Output.
5) To change the properties of the “Table Output”, click on it and set the following:
   Step Name: Give the desired name.
   Connection: Click on New to create a connection. In the new connection window, select MSSQL Server and give the database properties.
   o Connection Name: Give the desired name.
   o Host Name: 10.128.247.167
   o Database Name: Specify the database in which the table should be created. All our intermediate tables were created in OAAIDBtest and the final results tables were created in OAAIDBEitel.
   o Specify the username and password, test the connection, and click on OK.
   Specify the “Table name” for the new table to be created and select the truncate table checkbox.
   Click on SQL and execute the SQL script, then click on OK.
   Save the transformation and run it. The table should be generated in the specified database.

Steps to create a Job in Kettle:

Creating a job is very similar to creating a transformation. We used multiple “SQL” tools and linked them by creating hops between them. Each SQL tool holds a set of SQL queries to be executed, and the final preprocessed table is saved back to the “OAAIDBEitel” database.

1) Click on File -> New -> Job.
2) Click on Scripting in the Design tab and drag the “SQL” tool to the job layout.
3) Place as many SQL tools as needed in the layout and connect them by hops as described above.
   o To change the properties of an SQL tool, click on it and set the following:
   o Job Entry Name: Give the desired name.
   o Connection: The database connection created in the transformation above can be shared with all other flows. To share it, click on View -> Database Connections, right-click on the database connection that has to be shared, and click on Share.
4) Enter the SQL script that has to be executed and click on OK. (Note: each SQL script should be terminated with a semicolon.)
5) Save the job and run it.
SQL Scripts Used in Preprocessing the Raw Data:

Assigning Primary Keys for the Bio File to eliminate duplicates:

-- Primary key fields have to be NOT NULL, so the column definition is changed
ALTER TABLE [OAAIDBtest].[dbo].[G10FBio] ALTER COLUMN ALTERNATIVE_ID varchar(36) NOT NULL;

-- Eliminating the duplicate records (multiple records with the same ALTERNATIVE_ID)
DROP TABLE G10FDistBio;
SELECT * INTO G10FDistBio
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY ALTERNATIVE_ID ORDER BY ALTERNATIVE_ID) rn
      FROM dbo.G10FBio) t
WHERE rn = 1;

-- Primary key assigned
ALTER TABLE dbo.G10FDistBio DROP COLUMN rn;
ALTER TABLE [dbo].[G10FDistBio] ADD CONSTRAINT [PK_G10FDistBio] PRIMARY KEY CLUSTERED
( [ALTERNATIVE_ID] ASC )
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY];

Adding Aptitude Score to the Bio Table:

The Aptitude score is created to replace the SAT and ACT scores given in the raw file. Students reported either the SAT or the ACT, and because two different fields are used to report these scores, there are many NULL values. To reduce the number of NULL values, a single Aptitude score is calculated for each student from whichever scores were reported.
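The scoring rule described above can also be sketched outside the database. The following is a minimal Python sketch, not part of the original Kettle job: the names `ACT_TO_SAT` and `aptitude_score` are hypothetical, and the conversion values are copied from the UPDATE statement in this document. It assumes, as that statement does, that an ACT composite between 11 and 36 takes precedence over the SAT scores, and that the SAT sum is used only when both parts are reported.

```python
from typing import Optional

# ACT composite -> SAT-scale equivalent, copied from the CASE expression
# in the UPDATE statement of this document.
ACT_TO_SAT = {
    36: 1600, 35: 1560, 34: 1510, 33: 1460, 32: 1420, 31: 1380,
    30: 1340, 29: 1300, 28: 1260, 27: 1220, 26: 1190, 25: 1150,
    24: 1110, 23: 1070, 22: 1030, 21: 990, 20: 950, 19: 910,
    18: 870, 17: 830, 16: 790, 15: 740, 14: 690, 13: 640,
    12: 590, 11: 530,
}


def aptitude_score(act: Optional[int],
                   sat_verbal: Optional[int],
                   sat_math: Optional[int]) -> Optional[int]:
    """Return a single aptitude score, or None when nothing usable was reported."""
    # A mapped ACT composite wins, mirroring the order of the WHEN branches.
    if act in ACT_TO_SAT:
        return ACT_TO_SAT[act]
    # Otherwise fall back to the combined SAT score if both parts exist.
    if sat_verbal is not None and sat_math is not None:
        return sat_verbal + sat_math
    # No branch matched: the SQL CASE would leave the column NULL.
    return None
```

An ACT composite outside the 11 to 36 range falls through to the SAT branch, which is also what the CASE expression does.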
-- Adding a new field named APTITUDE_SCORE to the Bio table
ALTER TABLE dbo.G10FDistBio ADD APTITUDE_SCORE int;

-- APTITUDE_SCORE assigned based on the scores reported by the students
UPDATE dbo.G10FDistBio SET APTITUDE_SCORE = CASE
    WHEN ACT_COMPOSITE_SCORE = 36 THEN 1600
    WHEN ACT_COMPOSITE_SCORE = 35 THEN 1560
    WHEN ACT_COMPOSITE_SCORE = 34 THEN 1510
    WHEN ACT_COMPOSITE_SCORE = 33 THEN 1460
    WHEN ACT_COMPOSITE_SCORE = 32 THEN 1420
    WHEN ACT_COMPOSITE_SCORE = 31 THEN 1380
    WHEN ACT_COMPOSITE_SCORE = 30 THEN 1340
    WHEN ACT_COMPOSITE_SCORE = 29 THEN 1300
    WHEN ACT_COMPOSITE_SCORE = 28 THEN 1260
    WHEN ACT_COMPOSITE_SCORE = 27 THEN 1220
    WHEN ACT_COMPOSITE_SCORE = 26 THEN 1190
    WHEN ACT_COMPOSITE_SCORE = 25 THEN 1150
    WHEN ACT_COMPOSITE_SCORE = 24 THEN 1110
    WHEN ACT_COMPOSITE_SCORE = 23 THEN 1070
    WHEN ACT_COMPOSITE_SCORE = 22 THEN 1030
    WHEN ACT_COMPOSITE_SCORE = 21 THEN 990
    WHEN ACT_COMPOSITE_SCORE = 20 THEN 950
    WHEN ACT_COMPOSITE_SCORE = 19 THEN 910
    WHEN ACT_COMPOSITE_SCORE = 18 THEN 870
    WHEN ACT_COMPOSITE_SCORE = 17 THEN 830
    WHEN ACT_COMPOSITE_SCORE = 16 THEN 790
    WHEN ACT_COMPOSITE_SCORE = 15 THEN 740
    WHEN ACT_COMPOSITE_SCORE = 14 THEN 690
    WHEN ACT_COMPOSITE_SCORE = 13 THEN 640
    WHEN ACT_COMPOSITE_SCORE = 12 THEN 590
    WHEN ACT_COMPOSITE_SCORE = 11 THEN 530
    WHEN SAT_MATH_SCORE IS NOT NULL AND SAT_VERBAL_SCORE IS NOT NULL THEN SAT_VERBAL_SCORE + SAT_MATH_SCORE
    -- This branch can never be reached, because the previous condition already covers it
    WHEN SAT_MATH_SCORE IS NOT NULL AND SAT_VERBAL_SCORE IS NOT NULL AND ACT_COMPOSITE_SCORE IS NOT NULL THEN SAT_VERBAL_SCORE + SAT_MATH_SCORE
END;

Assigning Primary Keys to the Course Data:

The primary key for the Course data consists of the ALTERNATIVE_ID and COURSE fields. COURSE is not given directly in the raw data, so we create the field from the subject, course number and section. The ONLINE_FLAG field is created to indicate whether the course is an online course or a ground course.
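The way the COURSE key and its derived fields are built can be sketched in Python. This is a minimal sketch that mirrors the string arithmetic of the SQL in this document; the function names and the sample values ("CMSC", "120L", "711") are hypothetical and not taken from the source. It assumes non-NULL inputs and a COURSE string of at least nine characters.

```python
TERM_SUFFIX = "_10F"  # semester suffix used by the scripts in this document


def build_course(subject: str, courseno: str, section: str) -> str:
    """Mirror SUBJECT + '_' + COURSENO + '_' + SECTION + '_10F'."""
    return f"{subject}_{courseno}_{section}{TERM_SUFFIX}"


def course_num(course: str) -> str:
    """Mirror SUBSTRING(COURSE, 1, LEN(COURSE) - 9): drop the last 9 characters."""
    return course[: len(course) - 9]


def online_flag(course: str) -> int:
    """Mirror SUBSTRING(COURSE, LEN(COURSE) - 6, 1) = '7'.

    SQL positions are 1-based, so position LEN - 6 is Python index len - 7.
    """
    return 1 if course[len(course) - 7] == "7" else 0


if __name__ == "__main__":
    course = build_course("CMSC", "120L", "711")  # hypothetical sample values
    print(course, course_num(course), online_flag(course))
```

With three-character sections, the character tested by `online_flag` is the first character of the section, so under that assumption the flag marks sections whose number starts with 7.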
-- New columns created in the Course data
ALTER TABLE dbo.G10FCourse DROP COLUMN COURSE;
ALTER TABLE dbo.G10FCourse DROP COLUMN COURSENUM;
ALTER TABLE dbo.G10FCourse DROP COLUMN ONLINE_FLAG;
ALTER TABLE dbo.G10FCourse ADD COURSE varchar(17), COURSENUM varchar(17), ONLINE_FLAG int;

-- Script to assign the COURSE, COURSE NUMBER and ONLINE STATUS
UPDATE [OAAIDBtest].[dbo].[G10FCourse] SET COURSE = (SELECT SUBJECT + '_' + COURSENO + '_' + SECTION + '_10F');
UPDATE [OAAIDBtest].[dbo].[G10FCourse] SET COURSENUM = (SELECT SUBSTRING([COURSE], 1, LEN([COURSE]) - 9));
UPDATE [OAAIDBtest].[dbo].[G10FCourse] SET ONLINE_FLAG = CASE
    WHEN SUBSTRING(COURSE, LEN(COURSE) - 6, 1) = '7' THEN 1
    WHEN SUBSTRING(COURSE, LEN(COURSE) - 6, 1) != '7' THEN 0
END;

-- Assigning primary key
DROP TABLE dbo.G10FDistCourse;
SELECT * INTO dbo.G10FDistCourse
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY ALTERNATIVE_ID, COURSE ORDER BY COURSE) rn
      FROM dbo.G10FCourse) t
WHERE rn = 1;
ALTER TABLE dbo.G10FDistCourse DROP COLUMN rn;
ALTER TABLE [OAAIDBtest].[dbo].[G10FDistCourse] ALTER COLUMN [ALTERNATIVE_ID] varchar(36) NOT NULL;
ALTER TABLE [OAAIDBtest].[dbo].[G10FDistCourse] ALTER COLUMN [COURSE] varchar(17) NOT NULL;
ALTER TABLE [dbo].[G10FDistCourse] ADD CONSTRAINT [PK_G10FDistCourse] PRIMARY KEY CLUSTERED
( [ALTERNATIVE_ID] ASC, [COURSE] ASC )
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY];

Altering Grades Table to assign scores for each of the Student Submissions:

-- Adding SCORE column to the Grades table
ALTER TABLE dbo.Grades10F DROP COLUMN SCORE;
ALTER TABLE dbo.Grades10F ADD SCORE float(7);

-- Calculating scores for each of the submissions in the file based on Earned Points and Max Points
UPDATE dbo.Grades10F SET SCORE = EARNED_POINTS / MAX_POINTS;

Calculating the RMN Score based on the Score and Weight in the Grade Table:

-- Intermediate steps in calculating the RMN score
SELECT COURSE, ALTERNATIVE_ID, SUM(WEIGHT) AS TOTAL_WEIGHT, SUM(SCORE * WEIGHT) AS WEIGHTED_SCORE
INTO Aux1_AggScores_10F
FROM dbo.Grades10F
GROUP BY COURSE, ALTERNATIVE_ID
ORDER BY COURSE, ALTERNATIVE_ID;

SELECT COURSE, ALTERNATIVE_ID, TOTAL_WEIGHT, WEIGHTED_SCORE,
       CASE WHEN TOTAL_WEIGHT != 0 THEN WEIGHTED_SCORE / TOTAL_WEIGHT ELSE 0 END AS EFFWT_SCORE
INTO Aux2_AggScores_10F
FROM Aux1_AggScores_10F;

SELECT COURSE, AVG(EFFWT_SCORE) AVG_EFFWT_SCORE
INTO Aux_AVGCourse_effwtscore_10F
FROM Aux2_AggScores_10F
GROUP BY COURSE
ORDER BY COURSE;

-- Calculating the actual RMN score
DROP TABLE Grades10F_Scores;
SELECT A.*, AVG_EFFWT_SCORE, EFFWT_SCORE / AVG_EFFWT_SCORE * 100 AS RMN_SCORE
INTO Grades10F_Scores
FROM Aux2_AggScores_10F A, Aux_AVGCourse_effwtscore_10F M
WHERE A.COURSE = M.COURSE AND AVG_EFFWT_SCORE > 0
ORDER BY AVG_EFFWT_SCORE;

Creating the PersonalCourse Table by inner joining the Bio and the Course Table:

SELECT P.ALTERNATIVE_ID, C.COURSE, C.COURSENUM, C.SUBJECT, C.ONLINE_FLAG, C.SECTION, C.ENROLLMENT, C.LETTER_GRADE,
       P.CLASS_RANK, P.PERCENTILE, P.SAT_VERBAL_SCORE, P.SAT_MATH_SCORE, P.ACT_COMPOSITE_SCORE, P.APTITUDE_SCORE,
       P.AGE, P.GENDER, P.FTPT, P.CLASS, P.CUM_GPA, P.SEM_GPA, P.ACADEMIC_STANDING
INTO dbo.G10FPC
FROM dbo.G10FDistBio P
INNER JOIN dbo.G10FDistCourse C ON P.ALTERNATIVE_ID = C.ALTERNATIVE_ID;

Cleaning Up the PersonalCourse Table:

-- Eliminating the NULL values in the PersonalCourse table
SELECT * INTO G10FNtnullPC FROM G10FPC
WHERE [ALTERNATIVE_ID] IS NOT NULL
  AND [COURSE] IS NOT NULL
  AND [LETTER_GRADE] IS NOT NULL
  AND [GENDER] IS NOT NULL
  AND [FTPT] IS NOT NULL
  AND [CLASS] IS NOT NULL
  AND [CUM_GPA] IS NOT NULL
  AND [SEM_GPA] IS NOT NULL
  AND [ACADEMIC_STANDING] IS NOT NULL;

-- Deleting invalid grades
DELETE FROM dbo.G10FNtnullPC
WHERE LETTER_GRADE IN ('W', 'WF', 'AU', 'P', 'NC', 'X', 'I');

-- Recoding all the varchar fields with int values
UPDATE P SET P.GENDER = REPLACE(P.GENDER, 'F', '1') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.GENDER = REPLACE(P.GENDER, 'M', '2') FROM dbo.G10FNtnullPC P;
UPDATE dbo.G10FNtnullPC SET GENDER = NULL WHERE GENDER = 'N';

UPDATE P SET P.FTPT = REPLACE(P.FTPT, 'F', '1') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.FTPT = REPLACE(P.FTPT, 'P', '2') FROM dbo.G10FNtnullPC P;
UPDATE dbo.G10FNtnullPC SET FTPT = NULL WHERE FTPT = 'o';

UPDATE P SET P.CLASS = REPLACE(P.CLASS, 'GR', '5') FROM dbo.G10FNtnullPC P;

UPDATE P SET P.ONLINE_FLAG = REPLACE(P.ONLINE_FLAG, 'Y', '1') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.ONLINE_FLAG = REPLACE(P.ONLINE_FLAG, 'N', '2') FROM dbo.G10FNtnullPC P;

-- The minus grades are recoded first so that the plain-letter replacements below do not touch them
UPDATE dbo.G10FNtnullPC SET LETTER_GRADE = '3.7' WHERE LETTER_GRADE = 'A-';
UPDATE dbo.G10FNtnullPC SET LETTER_GRADE = '2.7' WHERE LETTER_GRADE = 'B-';
UPDATE dbo.G10FNtnullPC SET LETTER_GRADE = '1.7' WHERE LETTER_GRADE = 'C-';
UPDATE P SET P.LETTER_GRADE = REPLACE(P.LETTER_GRADE, 'A', '4.0') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.LETTER_GRADE = REPLACE(P.LETTER_GRADE, 'B+', '3.3') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.LETTER_GRADE = REPLACE(P.LETTER_GRADE, 'B', '3.0') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.LETTER_GRADE = REPLACE(P.LETTER_GRADE, 'C+', '2.3') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.LETTER_GRADE = REPLACE(P.LETTER_GRADE, 'C', '2.0') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.LETTER_GRADE = REPLACE(P.LETTER_GRADE, 'D+', '1.3') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.LETTER_GRADE = REPLACE(P.LETTER_GRADE, 'D', '1.0') FROM dbo.G10FNtnullPC P;
UPDATE P SET P.LETTER_GRADE = REPLACE(P.LETTER_GRADE, 'F', '0.0') FROM dbo.G10FNtnullPC P;

-- Creating fields to hold the int values and eliminating the
-- fields with the varchar data type
ALTER TABLE dbo.G10FNtnullPC ADD RC_GENDER int, RC_FTPT int, RC_CLASS int, RC_LETTERGRADE float, ACADEMIC_RISK int;

UPDATE dbo.G10FNtnullPC SET RC_GENDER = CAST(GENDER AS int);
ALTER TABLE dbo.G10FNtnullPC DROP COLUMN GENDER;
UPDATE dbo.G10FNtnullPC SET RC_FTPT = CAST(FTPT AS int);
ALTER TABLE dbo.G10FNtnullPC DROP COLUMN FTPT;
UPDATE dbo.G10FNtnullPC SET RC_CLASS = CAST(CLASS AS int);
ALTER TABLE dbo.G10FNtnullPC DROP COLUMN CLASS;
UPDATE dbo.G10FNtnullPC SET RC_LETTERGRADE = CAST(LETTER_GRADE AS float);
ALTER TABLE dbo.G10FNtnullPC DROP COLUMN LETTER_GRADE;

UPDATE dbo.G10FNtnullPC SET ACADEMIC_RISK = CASE WHEN RC_LETTERGRADE < 2 THEN 1 ELSE 2 END;

Inner joining the Scores table with the PersonalCourse table:

SELECT P.*, C.RMN_SCORE
INTO dbo.G10FPCS
FROM dbo.G10FNtnullPC P
INNER JOIN dbo.Grades10F_Scores C ON P.ALTERNATIVE_ID = C.ALTERNATIVE_ID AND P.COURSE = C.COURSE;

Eliminating Null Records in the PCS table:

SELECT * INTO dbo.G10FntNullPCS FROM dbo.G10FPCS WHERE RMN_SCORE IS NOT NULL;

Scripts to generate the Metrics Table:

-- Scripts to calculate the ratio of the number of occurrences of each of the events
SELECT S.COURSE, RC_CLASS, P.ALTERNATIVE_ID, EVENT, COUNT(*) AS QTY
INTO G10FmvCourseStudentEventQty
FROM dbo.SAKAIEVENTS10F S, dbo.G10FntNullPC P
WHERE S.ALTERNATIVE_ID = P.ALTERNATIVE_ID
GROUP BY S.COURSE, RC_CLASS, P.ALTERNATIVE_ID, EVENT
ORDER BY S.COURSE, RC_CLASS, P.ALTERNATIVE_ID, EVENT;

SELECT COURSE, EVENT, SUM(QTY) AS TQTY, AVG(QTY * 1.0) AS MEAN
INTO dbo.G10FmvCourseEventAvg
FROM dbo.G10FmvCourseStudentEventQty
GROUP BY COURSE, EVENT;

SELECT v1.COURSE, v1.ALTERNATIVE_ID, v1.EVENT, v1.QTY, v2.MEAN, v1.QTY / v2.MEAN AS RATIO
INTO dbo.G10FmvCourseStudentEventRatio
FROM dbo.G10FmvCourseStudentEventQty AS v1
INNER JOIN dbo.G10FmvCourseEventAvg AS v2 ON v1.COURSE = v2.COURSE AND v1.EVENT = v2.EVENT;

-- Individual metrics table for each of the 9 metrics
SELECT COURSE, ALTERNATIVE_ID, QTY
AS Q_ASSMT_TAKE, RATIO AS R_ASSMT_TAKE
INTO G10FASSMT_TAKE
FROM dbo.G10FmvCourseStudentEventRatio
WHERE EVENT = 'sam.assessment.take';

SELECT COURSE, ALTERNATIVE_ID, QTY AS Q_ASN_READ, RATIO AS R_ASN_READ
INTO G10FASN_READ
FROM dbo.G10FmvCourseStudentEventRatio
WHERE EVENT = 'asn.read.assignment';

SELECT COURSE, ALTERNATIVE_ID, QTY AS Q_ASN_SUB, RATIO AS R_ASN_SUB
INTO G10FASN_SUB
FROM dbo.G10FmvCourseStudentEventRatio
WHERE EVENT = 'asn.submit.submission';

SELECT COURSE, ALTERNATIVE_ID, QTY AS Q_CONTENT_READ, RATIO AS R_CONTENT_READ
INTO G10FCONTENT_READ
FROM dbo.G10FmvCourseStudentEventRatio
WHERE EVENT = 'content.read';

SELECT COURSE, ALTERNATIVE_ID, QTY AS Q_LESSONS_VIEW, RATIO AS R_LESSONS_VIEW
INTO G10FLESSONS_VIEW
FROM dbo.G10FmvCourseStudentEventRatio
WHERE EVENT = 'melete.section.read';

SELECT COURSE, ALTERNATIVE_ID, SUM(QTY) AS Q_FORUM_POST, SUM(QTY) / SUM(MEAN) AS R_FORUM_POST
INTO G10FFORUM_POST
FROM dbo.G10FmvCourseStudentEventRatio
WHERE EVENT IN ('forums.new', 'forums.newtopic', 'forums.response')
GROUP BY COURSE, ALTERNATIVE_ID;

SELECT COURSE, ALTERNATIVE_ID, QTY AS Q_FORUM_READ, RATIO AS R_FORUM_READ
INTO G10FFORUM_READ
FROM dbo.G10FmvCourseStudentEventRatio
WHERE EVENT = 'forums.read';

SELECT COURSE, ALTERNATIVE_ID, QTY AS Q_SESSIONS, RATIO AS R_SESSIONS
INTO G10FSESSIONS
FROM dbo.G10FmvCourseStudentEventRatio
WHERE EVENT = 'pres.begin';

Left Joining each of the Metrics with the PCS table:

SELECT C.ALTERNATIVE_ID, C.COURSE, C.COURSENUM, C.SUBJECT, C.ONLINE_FLAG, C.ENROLLMENT,
       C.RC_LETTERGRADE, C.CLASS_RANK, C.PERCENTILE, C.SAT_VERBAL_SCORE, C.SAT_MATH_SCORE,
       C.APTITUDE_SCORE, C.AGE, C.RC_GENDER, C.RC_FTPT, C.RC_CLASS, C.CUM_GPA,
       C.SEM_GPA, C.ACADEMIC_STANDING, C.RMN_SCORE,
       CT.R_CONTENT_READ, ASN.R_ASN_SUB, FP.R_FORUM_POST, FR.R_FORUM_READ, LS.R_LESSONS_VIEW,
       ASSMT.R_ASSMT_TAKE, ASMS.R_ASSMT_SUB, ASNR.R_ASN_READ, SES.R_SESSIONS,
       C.ACADEMIC_RISK
INTO G10FPCSM
FROM dbo.G10FNtnullPCS C
LEFT JOIN G10FCONTENT_READ CT ON C.COURSE = CT.COURSE AND C.ALTERNATIVE_ID = CT.ALTERNATIVE_ID
LEFT JOIN dbo.G10FSESSIONS SES ON C.COURSE = SES.COURSE AND C.ALTERNATIVE_ID = SES.ALTERNATIVE_ID
LEFT JOIN dbo.G10FASN_SUB ASN ON C.COURSE = ASN.COURSE AND C.ALTERNATIVE_ID = ASN.ALTERNATIVE_ID
LEFT JOIN G10FFORUM_POST FP ON C.COURSE = FP.COURSE AND C.ALTERNATIVE_ID = FP.ALTERNATIVE_ID
LEFT JOIN G10FFORUM_READ FR ON C.COURSE = FR.COURSE AND C.ALTERNATIVE_ID = FR.ALTERNATIVE_ID
LEFT JOIN G10FLESSONS_VIEW LS ON C.COURSE = LS.COURSE AND C.ALTERNATIVE_ID = LS.ALTERNATIVE_ID
LEFT JOIN dbo.G10FASSMT_TAKE ASSMT ON C.COURSE = ASSMT.COURSE AND C.ALTERNATIVE_ID = ASSMT.ALTERNATIVE_ID
-- G10FASSMT_SUB is not created by any script listed in this document, so it must already exist before this join runs
LEFT JOIN dbo.G10FASSMT_SUB ASMS ON C.COURSE = ASMS.COURSE AND C.ALTERNATIVE_ID = ASMS.ALTERNATIVE_ID
LEFT JOIN dbo.G10FASN_READ ASNR ON C.COURSE = ASNR.COURSE AND C.ALTERNATIVE_ID = ASNR.ALTERNATIVE_ID;
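The event-ratio calculation behind the metrics tables can be illustrated with a small Python sketch. It is not the authors' implementation: the function name `event_ratios` and the tuple layout are hypothetical, and it covers only the counting, per-course averaging and ratio steps (QTY, MEAN and RATIO), not the join against the PersonalCourse table. As in the SQL, the mean is taken only over students who have at least one event of a given type in a course.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, str]  # (course, student id, event name)


def event_ratios(events: Iterable[Event]) -> Dict[Event, Tuple[int, float]]:
    """Return {(course, student, event): (QTY, RATIO)}.

    QTY is the number of occurrences for that student, and RATIO is QTY
    divided by the mean QTY for the same course and event.
    """
    # Step 1: count occurrences per course, student and event (QTY).
    qty = Counter(events)

    # Step 2: collect the per-student counts for each course and event.
    per_course_event = defaultdict(list)
    for (course, _student, event), n in qty.items():
        per_course_event[(course, event)].append(n)

    # Step 3: mean count per course and event (MEAN).
    mean = {key: sum(counts) / len(counts)
            for key, counts in per_course_event.items()}

    # Step 4: ratio of each student's count to the course mean (RATIO).
    return {key: (n, n / mean[(key[0], key[2])]) for key, n in qty.items()}
```

A student with no events of a given type has no entry in the result, which corresponds to the NULL left behind by the LEFT JOIN for that metric.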