Center for African Studies The MasterCard Foundation Scholars Program Database Project Final Presentation PRESENTED BY TEAM 3 MIN LIN MINCHAO LIN TIAN LIU YIFAI NG HAOCHEN SHAN YIGANG WANG GUO YU YING YANG YE ZHONG Client Overview • UC Berkeley Center for African Studies (CAS) • The MasterCard Foundation Program • Provide comprehensive support for Education • For economically disadvantaged students from developing countries in Africa • Financial, academic, social, and career counseling • 2012-2020, 25 new students per year • Client Need • A new database help better manage the program • Specialized in tracking financial transactions • Also store data and information related to the program • Financial, academic, social, career Project Review DP I DP II • Familiarized with clients and understood their needs • Created 1st version of Simplified EER diagram • Set up schedule for the project • Expanded and Revised EER diagram • Conceptualized 5 queries • Developed Relational Schema DP III Final Presentation • Finalized EER Diagram • Created database in Access and implemented relations into Access • Altered Queries and developed SQL code • Implemented Queries into Access • Modified relational Schema • Did Normalization Analysis EER Diagram Access Relationships Relational Schema 1. Person (PID, Lname, Fname, MI, Birth_Date, Nationality, Gender, Primary_Phone_No, Address) 1a. Alumni (Alumini_PID1, Occupation, Company, Degree, Graduation_Date, Admission_Date) 1b. Student (Student_PID1, SID) 1bb. Pre_Student (Student_PID1, Financial_Background, SAT_Score, TOFEL_Score, High_School_GPA, No_HighSchool_Honor, No_HighSchool_Awards, HID9) 1bc. Student_current_status (SID, Mentor_Employee_PID1, Class_Level, Expected_Graduation_Date, App_Date, College_GPA) 1c. Employee (Employee_PID1, Position) 2. TransitionAccount (TID, Amount, Time, Year, Month, Date, PC_ID3) 2a. Withdrawal (W_TID2) 2aa. Check (Check_TID2, Description, Responsible_Person, Account_No) 2aaa. Account_Detail (Account_No, Routing_No) 2ab. Bearbuy (Bear_TID2 ) 2ac. Cash (C_TID2, Responsible_Person) 2ad. BlueCard (BC_TID2) 2b. Deposit (D_TID2) 3. Personal_Card (PC_ID, Year, Month, Day, Time, Amount, Reimbursed, Employee_PID1) 4. Expense (EID, Amount, Year, Month, Day, Description, PMID5) 4a. InternalExpense (I_EID4) 4ab. Tuition (T_EID4, Semester, Degree_Class) 4ac. CourseMaterial (C_EID4, Course_Name, ISBN) 4b. ExternalExpense (E_EID4, OID5) 4ba. Summer_Winter_Housing (SWH_EID4, Address, Start_Date, Finish_Date) 4baa. Summer_Winter_Housing_Timeline (Start_Date, Finish_Date, Duration) 4bab. Housing_Information(Address, Landlord, Agent,Contact_Phone_No) 4bb. External_Course_Material (EC_EID4, Course_Name, ISBN) 4bc. SEVIS_Fee (S_EID4) 4bd. Travel_Airline_Ticket (TAT_EID4, Flight_No, Airline_Name) 4bda. Ticket_Detail (Flight_No, Airline_Name, Date, Departure_Location, Destination, Class) 4be. Office_Supplies (O_EID4, Supply_Name, Quantity, Lead_Time, Price, Discount) 5. Order (OID, Total_Amount, Description, Product_Name, Quantity, Price, Date,Req_Student_PID1) 5a. OnlineOrder (OOID5, Tracking_No, Discount, OSID) 5b. NormalOrder (NOID5) 6. Item (NOID5, IID, Iname, Description) 7. OnlineShop (OSID7, Website, Shipping_Method) 7a.Website_Email_Address (Website, Supplier_Email_Address ,Description) Relational Schema (Continue) 8. Course (CID, Semester, Professor, Final_Date, Description, Ctitle) 8a. GE (GE_CID8, Category) 8b. LowerDivision (L_CID8) 8c. UpperDivision (U_CID8) 8d. Sections(CID, Section_Number) 8e. Ctitle (CID, Ctitle) 9.High_School(HID, Name, School_Type, Year_Founded, Relidious_Affiliation, Academic_Calender, Setting, Student_Faculty_Ratio, Total_Enrollment_No, Gender_Ratio, College_Enrollment_Rate) 10.Job (JID, Position, No_Employee, Start_Date, Finish_date, Work_On_Alumini_PID1, Work_on_Student_PID1) 10a.Job_Detail (Position, Job_function, Location, Salary, Major_Req, Skill_Need, Description, Hour_Per_Week, Language_Need, Visa_Req) 10b.Job-Timeline (Start_Date, Finish_date, Duration) 10c. Full_1Time (F_JID10, Degree_Level, Exp_Req, Pre_I_JID10c) 10d. Part_Time (P_JID10, Degree_Level, Exp_Req) 10e. Internship (I_JID10, Class_Level, Referrer) 11.Company (Company_ID, Company_Name, Description, Company_Type, Contact_Phone_No, Size, Industry_Type, Email_Address, Website, City, Street, Zip_Code) 12.Country (Name, No_Student) 13.Event (Event_ID, title, Start_Date, Finish_date, Duration, No_People, Sponsor, City) 13a. CAS_Event (CAS_Event_ID13, Description, Cost) 13b. Other_Event (O_Event_ID13, Description) 13ba. Recruiting_Event. (R_Event_ID13, Industry_Type, Job_Class, Major_Preferred) 14. Grade (Student_PID1, CID8, Grade_Option, Grade, Semester) 15. Survey (Survey_ID, title, date, description) 15a. Alumni_Survey (A_Survey_ID15, Alumini_PID1) 15b. Student_Survey (S_Survey_ID15, Student_PID1) N to N Relationships: 16.Withdrawal_Pay_Expense (W_TID2, EID4) 17.Employee_Check_Inventory (Employee_PID1, OID5) 18.Student_Take_Course (Student_PID1, CID8, Grade, Grade_Option, Semesters) 18a. Ctitle(CID8, Ctitle) 19.Student_Participate_Event (Student_PID1, Event_ID13) 20.Company_Provide_Job (Company_ID11, JID10) 21.Company_Participate_Recuriting_Event (Company_ID11, Recuriting_Event_ID13) 22.Company_Located_Country (Company_ID11, Country_Name12) 23.Event_Located_Country (Event_ID13, Country_Name12) Multivalue: 24.Person_Email (PID1, E-mail_Address) 25.Alumni_Major (Alumni_PID1a, Major) 26.Student_Major (Student_PID1b, Major) 27.Course_Midterm_Date (CID8, Year, Month, Day) NORMALIZATION ANALYSIS Decomposing to 1NF and 2NF Course (CID, Semester, Professor, Final_Date, Section_Number, Description, Ctitle) To 1NF Course (CID, Semester, Professor, Final_Date, Description, Ctitle) Sections(CID, Section_Number) Student_Take_Course Student_PID CID Grade Grade_Option Semester To 2NF Student_Take_Course (Student_PID, CID, Grade, Grade_Option, Semesters) CourseTite(CID, Ctitle) Ctitle Decomposing to 3NF Check Check_TID2 Description Responsible_Person Account_No Routing_No To 3NF Check (Check_TID2, Description, Responsible_Person, Account_No) Account_Detail (Account_No, Routing_No) Decomposing to BCNF Course CID Semester Professor Final_Date Description Ctitle To BCNF Course (CID, Semester, Professor, Final_Date, Section_Number, Description) Ctitle (CID, Ctitle) QUERIES Query 1: Demand Forecasting and EOQ Object Benefits Forecast student’s demands for each type of orders in the future based on historical records, then use this data to determine the best order quantity and time interval to place orders on online shops. 1. Well understand the students’ need for each type of supplies on a timely basis. 2. Predict the order period and amount so that CAS could plan ahead. Query 1: Demand Forecasting - SQL SELECT Order.ProductName AS Product, sum(Order.Quantity) AS SepTotalQuantity FROM [Order] WHERE Order.Date like "8/*/2013” GROUP BY Order.ProductName; Step 1: Extract the Data from Access. Get the order quantity of a specific product over a period. Sample output: A 3*3 matrix include all the product’s order quantity for a Specific period of a specific year Query 1: Demand Forecasting - Process Step 2: Calculate the seasonal factor and monthly demand forecast with a calculator program wrote by Java. Step 3: Put the result back into SQL to get the Economic order quantity (EOQ) model. Query 1: Demand Forecasting - Process Part of Java code Query 1: EOQ - SQL SELECT DISTINCT Order.ProductName, IIF(Item.Quantity=0,“Yes”,“No”) AS StockOut, Round(Sqr(2*(OnlineOrder.ShippingFee)*(OnlineOrder.Monthly Demand)/(0.1*Order.Price))) AS OptimalOrderQ, IIf(OnlineOrder.ShippingTime>0,Round((OnlineOrder.ShippingTi me)*(OnlineOrder.MonthlyDemand)/(0.1*Order.Price)),0) AS ReorderPointQuantity, Round(((Sqr(2*(OnlineOrder.ShippingFee)*(OnlineOrder.Monthl yDemand)/(0.1*Order.Price))/10))*30) AS OrderCycleDays, OnlineShop.Website AS Website, Order.Date AS OrderDate, Order.Date +Round(((Sqr(2*(OnlineOrder.ShippingFee) *(OnlineOrder.MonthlyDemand)/(0.1*Order.Price))/10))*30) AS NextOrderDate FROM Order, OnlineOrder, OnlineShop, Item WHERE Order.OID=OnlineOrder.OID AND OnlineOrder.OSID=OnlineShop.OSID AND Item.OID=Order.OID ORDER BY Order.Date; Check whether the inventory is stocked out or not Calculate optimal order quantity Calculate Reorder point Calculate order cycle and dates Calculate Next order date Query 1: EOQ - Output Query 2: Academic Performance Object What are the factors that affect students’ academic performance (i.e. GPA) and to what extent? Classes, parents’ education level, Traveling Expense, Course material fee, etc. 1. Understand factors that may affect students’ performance Benefits 2. Wisely Allocate the fund according to the analysis 3. Help student get more successful by planning events accordingly Query 2: Academic Performance - SQL Select specific SELECT Student.SID,1.714+0.589*A.Indicatorvariable and 0.00632*IIf(Student.Degree=“Undergraduate”,1,0)+0.0165*IIf(Student.Gender= combine with “Male”,1,0)+0.000644*Student.SATScore0.0147*B.NumberEventAttend+0.0528*IIF(Student.Research=“Yes”,1,0) AS coefficients to obtain result by ExpGPA linear regression From Student, model (SELECT Student.SID,Count(internship.JID) AS Indicator FROM Student, Job,Internship Determine Where Student.PID=Job.PID AND Job.JID=Internship.JID whether a student have Group BY Student.SID ever attend Union any internships Select Student.SID,0 or not (binary From Student variable) Where Student.PID NOT IN(SELECT student.PID Defined as table A From Student,Job,Internship Where Student.PID=Job.PID AND Job.JID=Internship.JID))As A, Query 2: Academic Performance – SQL (Cont.) (SELECT Student.SID,Count(StudentParticipateEvent.EventID) AS NumberEventAttend From Student,StudentParticipateEvent Where Student.PID=StudentParticipateEvent.PID Group BY Student.SID) AS B Where A.SID=Student.SID AND B.SID=Student.SID; Determine how many events a student have attend (numerical value) Defined as table B Step 2: Run linear regression over all the variables, then use Akaike Information Criterion to reduce the model to the most efficient model. Implement with R Query 3: High School Comparison Object Is there significant difference between each aspect of two high schools? 1. Compare the quality of two high schools based on the records of admitted students. Benefits 2. Help with decision making when comparing applicants with similar qualifications from the perspective of their high school strengths Query 3: High School Comparison - SQL SELECT Student.PID, Student.HID, Student.CollegeGPA, Student.SATScore, Student.TOFELScore, Student.[HighSchoolEvents#], Student.[HighSchoolAward#] FROM Student WHERE (((Student.HID)=1)) OR (((Student.HID)=2)); Step 1: Extract the data from Access by SQL Step 2: Calculate the mean of each corresponding category of all admitted students from these two high schools, then use t-test with unequal variances to get the p-value Implement with Excel Query 3: High School Comparison - Output Step3: Use HolmBonferroni method to find out if each difference is significant. Implement with MATLAB. Step4: Sample output from MATLAB [corrected_p, h]=bonf_holm([0.38 0.414 0.0513 0.334 0.257] ,0.5) corrected_p = 1.0020 0.7600 0.2565 1.0280 1.0280 h= 0 0 1 0 0 Query 4: Category Expense Object What is the distribution of expense on four major categories of external expenses (i.e. Course Material, Travel, Office Supply and Housing, ) that is paid by the program? Generate distribution chart and use statistical tools to analyze these distributions. 1. Track expenses related to students. Benefits 2. Decide the expense constraints for students on each category. 3. Generate clear expense report, could be included in the annual report for the MasterCard Foundation. Query 4: Category Expense - SQL SELECT Student.PID, Sum(Expense.Amount) AS AmountOfSum FROM (OfficeSupply INNER JOIN Expense ON OfficeSupply.EID = Expense.EID) INNER JOIN Student ON Expense.PID = Student.PID GROUP BY Student.PID ORDER BY Student.PID; Step 1: Find the total expense of each student SELECT Count([OfficeSupply Query].PID) AS CountOfPID, Partition([AmountOfSum],0,1100,100) AS Expr1 FROM [OfficeSupply Query] GROUP BY Partition([AmountOfSum],0,1100,100); Step 2: Generate the data for histogram Step 3: Use Report function to generate graphs, and use Access toolbox to generate the Mean, and Standard Deviation of the distribution. Step 4: If, in most cases, the distribution is bell-distributed, we could use 68-95,99.7 rule, aka Three-sigma rule, to set up expense constraints for students. Query 4: Category Expense - Output Query Results after the first step & The toolbox that could be used to calculate average and variance quickly Results after the second step Count the number of PID to generate a histogram Query 4: Category Expense - Output Expense Summary Report (Based on Sample Data) Query 5: Alumni & Employment Object What are the relationships among an alumni’s employment status, alumni’s GPA, alumni’s education level, number of events an alumni has attend etc? 1. Based on the analysis result, help current students find jobs. Benefits 2. Find out the most important factor that affect a student’s future employment. Query 5: Alumni & Employment - SQL SELECT Alumni.PID, IIF(Alumni.Degree="Undergraduate",1,0) AS Degree,count(StudentParticipateEvent.EID) AS NumEventAttend, Step 1: Extract the IIF(AlumniMajor.school_of_medicine = "Yes", 1, 0) As Sch_Medicine, data from Access IIF(AlumniMajor.school_of_law = "Yes", 1, 0) As Sch_Law, by SQL IIF(AlumniMajor.college_of_engineering = "Yes", 1, 0) As Sch_Engi, IIF(AlumniMajor.school_of_optometry = "Yes", 1, 0) As Sch_Opt IIF(AlumniMajor.college_of_natural_resource = "Yes", 1, 0) As Sch_Nat,IIF(AlumniMajor.college_of_letter_science = "Yes", 1, 0) As Sch_Science, IIF(AlumniMajor.school_of_information = "Yes", 1, 0) As Sch_Inf, IIF(AlumniMajor.school_of_social_welfare = "Yes", 1, 0) As Sch_welfare, IIF(AlumniMajor.haas_business_school = "Yes", 1, 0) As Sch_has, Company.c_latitude, Company.c_longitude FROM Alumni, StudentParticipateEvent, AlumniMajor, Company, FullTime WHERE FullTime.JID = Job.JID AND Company.CompanyID=Job.CompanyID AND Job.PID=Alumni.PID AND Alumni.PID=StudentParticipateEvent.PID AND AlumniMajor.PID=Alumni.PID; Query 5: Alumni & Employment - Process Step 2: Fitting Logistic Regression Model with R • this model will give the result of the probability for predicted variable to be 1 ( which means this person will get employed or not) • Potentially 20 candidate models Step 3: Model Selection • Cross-Validation Employment = GPA + Event + Degree • AIC ( Akaike Information Criterion) Employment = GPA + Event + Degree + School of Information • BIC ( Bayesian Information Criterion) Employment = GPA • Deviance Selection Employment = GPA + Event + Degree Query 5: Alumni & Employment - Process Step 4: Cut-off Selection • Find a cut of the predicted probability which will let us judge if the predicted value is 1 or 0 • Method: Building Confusion Matrix • Choosing a cut off probability first • Using confusion Matrix to find the best cut off • Base on the result we choose 0.58 Query 5: Alumni & Employment - Process Step 6: Plot Query 5: Alumni & Employment - Process Step 7: Creating XML concatenate with KML plot Alumni’s Company on Google Earth • Creating an KML plot for Alumni who graduated from different college • Plot those coordinates on Google earth • Get intuition employment status geographically for each college in UC Berkeley, which will give us an intuition where has higher employment rate for corresponding college’s current student. • Example for School of Medicine Query 5: Alumni & Employment - Output Step 8: Implement into GoogleEarth Future Work & Improvements • Create forms & reports • Make our database more user-friendly • Create additional queries: • Other useful Queries • Such as monthly balance check Q&A Thank you for listening.