Team3-IEOR115-ProjectPres

Center for African Studies
The MasterCard Foundation Scholars Program
Database Project Final Presentation
PRESENTED BY TEAM 3
MIN LIN
MINCHAO LIN
TIAN LIU
YIFAI NG
HAOCHEN SHAN
YIGANG WANG
GUO YU
YING YANG
YE ZHONG
Client Overview
• UC Berkeley Center for African Studies (CAS)
• The MasterCard Foundation Program
• Provide comprehensive support for Education
• For economically disadvantaged students from developing countries in Africa
• Financial, academic, social, and career counseling
• 2012-2020, 25 new students per year
• Client Need
• A new database to help better manage the program
• Specialized in tracking financial transactions
• Also stores data and information related to the program
• Financial, academic, social, career
Project Review
DP I
• Became familiar with the client and understood their needs
• Created 1st version of simplified EER diagram
• Set up schedule for the project
DP II
• Expanded and revised EER diagram
• Conceptualized 5 queries
• Developed relational schema
DP III
• Finalized EER diagram
• Created database in Access and implemented relations in Access
• Altered queries and developed SQL code
Final Presentation
• Implemented queries in Access
• Modified relational schema
• Did normalization analysis
EER Diagram
Access Relationships
Relational Schema
1. Person (PID, Lname, Fname, MI, Birth_Date, Nationality,
Gender, Primary_Phone_No, Address)
1a. Alumni (Alumini_PID1, Occupation, Company, Degree,
Graduation_Date, Admission_Date)
1b. Student (Student_PID1, SID)
1bb. Pre_Student (Student_PID1, Financial_Background,
SAT_Score, TOFEL_Score, High_School_GPA,
No_HighSchool_Honor, No_HighSchool_Awards, HID9)
1bc. Student_current_status (SID, Mentor_Employee_PID1,
Class_Level, Expected_Graduation_Date, App_Date, College_GPA)
1c. Employee (Employee_PID1, Position)
2. TransitionAccount (TID, Amount, Time, Year, Month, Date,
PC_ID3)
2a. Withdrawal (W_TID2)
2aa. Check (Check_TID2, Description,
Responsible_Person, Account_No)
2aaa. Account_Detail (Account_No, Routing_No)
2ab. Bearbuy (Bear_TID2 )
2ac. Cash (C_TID2, Responsible_Person)
2ad. BlueCard (BC_TID2)
2b. Deposit (D_TID2)
3. Personal_Card (PC_ID, Year, Month, Day, Time, Amount,
Reimbursed, Employee_PID1)
4. Expense (EID, Amount, Year, Month, Day, Description, PMID5)
4a. InternalExpense (I_EID4)
4ab. Tuition (T_EID4, Semester, Degree_Class)
4ac. CourseMaterial (C_EID4, Course_Name, ISBN)
4b. ExternalExpense (E_EID4, OID5)
4ba. Summer_Winter_Housing (SWH_EID4, Address,
Start_Date, Finish_Date)
4baa. Summer_Winter_Housing_Timeline (Start_Date,
Finish_Date, Duration)
4bab. Housing_Information(Address, Landlord,
Agent,Contact_Phone_No)
4bb. External_Course_Material (EC_EID4, Course_Name, ISBN)
4bc. SEVIS_Fee (S_EID4)
4bd. Travel_Airline_Ticket (TAT_EID4, Flight_No,
Airline_Name)
4bda. Ticket_Detail (Flight_No, Airline_Name, Date,
Departure_Location, Destination, Class)
4be. Office_Supplies (O_EID4, Supply_Name, Quantity,
Lead_Time, Price, Discount)
5. Order (OID, Total_Amount, Description, Product_Name,
Quantity, Price, Date,Req_Student_PID1)
5a. OnlineOrder (OOID5, Tracking_No, Discount, OSID)
5b. NormalOrder (NOID5)
6. Item (NOID5, IID, Iname, Description)
7. OnlineShop (OSID7, Website, Shipping_Method)
7a.Website_Email_Address (Website,
Supplier_Email_Address ,Description)
Relational Schema (Continued)
8. Course (CID, Semester, Professor, Final_Date, Description, Ctitle)
8a. GE (GE_CID8, Category)
8b. LowerDivision (L_CID8)
8c. UpperDivision (U_CID8)
8d. Sections(CID, Section_Number)
8e. Ctitle (CID, Ctitle)
9.High_School(HID, Name, School_Type, Year_Founded,
Relidious_Affiliation, Academic_Calender, Setting,
Student_Faculty_Ratio, Total_Enrollment_No, Gender_Ratio,
College_Enrollment_Rate)
10.Job (JID, Position, No_Employee, Start_Date, Finish_date,
Work_On_Alumini_PID1, Work_on_Student_PID1)
10a.Job_Detail (Position, Job_function, Location, Salary,
Major_Req, Skill_Need, Description, Hour_Per_Week,
Language_Need, Visa_Req)
10b.Job-Timeline (Start_Date, Finish_date, Duration)
10c. Full_1Time (F_JID10, Degree_Level, Exp_Req, Pre_I_JID10c)
10d. Part_Time (P_JID10, Degree_Level, Exp_Req)
10e. Internship (I_JID10, Class_Level, Referrer)
11.Company (Company_ID, Company_Name, Description,
Company_Type, Contact_Phone_No, Size, Industry_Type,
Email_Address, Website, City, Street, Zip_Code)
12.Country (Name, No_Student)
13.Event (Event_ID, title, Start_Date, Finish_date, Duration,
No_People, Sponsor, City)
13a. CAS_Event (CAS_Event_ID13, Description, Cost)
13b. Other_Event (O_Event_ID13, Description)
13ba. Recruiting_Event. (R_Event_ID13, Industry_Type, Job_Class,
Major_Preferred)
14. Grade (Student_PID1, CID8, Grade_Option, Grade,
Semester)
15. Survey (Survey_ID, title, date, description)
15a. Alumni_Survey (A_Survey_ID15, Alumini_PID1)
15b. Student_Survey (S_Survey_ID15, Student_PID1)
N to N Relationships:
16.Withdrawal_Pay_Expense (W_TID2, EID4)
17.Employee_Check_Inventory (Employee_PID1, OID5)
18.Student_Take_Course (Student_PID1, CID8, Grade,
Grade_Option, Semesters)
18a. Ctitle(CID8, Ctitle)
19.Student_Participate_Event (Student_PID1, Event_ID13)
20.Company_Provide_Job (Company_ID11, JID10)
21.Company_Participate_Recuriting_Event (Company_ID11,
Recuriting_Event_ID13)
22.Company_Located_Country (Company_ID11,
Country_Name12)
23.Event_Located_Country (Event_ID13, Country_Name12)
Multivalue:
24.Person_Email (PID1, E-mail_Address)
25.Alumni_Major (Alumni_PID1a, Major)
26.Student_Major (Student_PID1b, Major)
27.Course_Midterm_Date (CID8, Year, Month, Day)
NORMALIZATION ANALYSIS
Decomposing to 1NF and 2NF
Course (CID, Semester, Professor, Final_Date, Section_Number, Description, Ctitle)
To 1NF
Course (CID, Semester, Professor, Final_Date, Description, Ctitle)
Sections(CID, Section_Number)
Student_Take_Course (Student_PID, CID, Grade, Grade_Option, Semester, Ctitle)
To 2NF
Student_Take_Course (Student_PID, CID, Grade, Grade_Option, Semesters)
CourseTitle (CID, Ctitle)
Decomposing to 3NF
Check (Check_TID2, Description, Responsible_Person, Account_No, Routing_No)
To 3NF
Check (Check_TID2, Description, Responsible_Person, Account_No)
Account_Detail (Account_No, Routing_No)
Decomposing to BCNF
Course (CID, Semester, Professor, Final_Date, Description, Ctitle)
To BCNF
Course (CID, Semester, Professor, Final_Date, Section_Number, Description)
Ctitle (CID, Ctitle)
QUERIES
Query 1: Demand Forecasting and EOQ
Object
Forecast students' demand for each type of order based on historical records, then use this data to determine the best order quantity and time interval for placing orders with online shops.
Benefits
1. Understand the students' need for each type of supply on a timely basis.
2. Predict the order period and amount so that CAS can plan ahead.
Query 1: Demand Forecasting - SQL
SELECT [Order].ProductName AS Product,
Sum([Order].Quantity) AS SepTotalQuantity
FROM [Order]
WHERE [Order].Date LIKE "8/*/2013"
GROUP BY [Order].ProductName;
Step 1: Extract the data from Access. Get the order quantity of a specific product over a period.
Sample output: a 3*3 matrix that includes every product's order quantity for a specific period of a specific year.
Query 1: Demand Forecasting - Process
Step 2: Calculate the seasonal factor and monthly demand forecast with a calculator program written in Java.
Step 3: Put the result back into SQL to evaluate the economic order quantity (EOQ) model.
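The Java calculator from Step 2 is not reproduced in these slides. As a rough illustration only, the sketch below computes seasonal factors with a simple ratio-to-average method (each month's average demand divided by the overall monthly average) and scales an annual forecast by them. The function names, the choice of method, and the sample figures are all assumptions, not the team's code.

```python
from statistics import mean

def seasonal_factors(history):
    """history: {year: [12 monthly quantities]} -> list of 12 seasonal factors.

    A factor above 1 marks a month with above-average demand.
    """
    years = list(history.values())
    overall = mean(q for year in years for q in year)
    return [mean(year[m] for year in years) / overall for m in range(12)]

def monthly_forecast(annual_demand, factors):
    """Spread an annual demand forecast over 12 months using the factors."""
    base = annual_demand / 12
    return [base * f for f in factors]

# Hypothetical order quantities for one product (not data from the project).
history = {
    2012: [10, 10, 10, 10, 10, 10, 10, 30, 20, 10, 10, 10],
    2013: [12, 12, 12, 12, 12, 12, 12, 34, 24, 12, 12, 12],
}
factors = seasonal_factors(history)
forecast = monthly_forecast(180, factors)
```

The monthly forecast for a product would then play the role of MonthlyDemand in the EOQ query.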
Query 1: Demand Forecasting - Process
Part of Java code
Query 1: EOQ - SQL
SELECT DISTINCT [Order].ProductName,
IIf(Item.Quantity=0,"Yes","No") AS StockOut,
Round(Sqr(2*(OnlineOrder.ShippingFee)*(OnlineOrder.MonthlyDemand)/(0.1*[Order].Price))) AS OptimalOrderQ,
IIf(OnlineOrder.ShippingTime>0,Round((OnlineOrder.ShippingTime)*(OnlineOrder.MonthlyDemand)/(0.1*[Order].Price)),0) AS ReorderPointQuantity,
Round(((Sqr(2*(OnlineOrder.ShippingFee)*(OnlineOrder.MonthlyDemand)/(0.1*[Order].Price))/10))*30) AS OrderCycleDays,
OnlineShop.Website AS Website, [Order].Date AS OrderDate,
[Order].Date+Round(((Sqr(2*(OnlineOrder.ShippingFee)*(OnlineOrder.MonthlyDemand)/(0.1*[Order].Price))/10))*30) AS NextOrderDate
FROM [Order], OnlineOrder, OnlineShop, Item
WHERE [Order].OID=OnlineOrder.OID AND OnlineOrder.OSID=OnlineShop.OSID AND Item.OID=[Order].OID
ORDER BY [Order].Date;
• Check whether the inventory is stocked out
• Calculate the optimal order quantity
• Calculate the reorder point
• Calculate the order cycle and dates
• Calculate the next order date
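For readers who want to check the EOQ arithmetic outside Access, here is a minimal sketch of the textbook formulas. It assumes the shipping fee is the fixed cost per order and that the monthly holding cost is 10% of the unit price, matching the 0.1 factor in the SQL. The cycle length and reorder point use the standard textbook forms (Q/D, and demand over the lead time), which may differ from the expressions in the query. All function and parameter names are hypothetical.

```python
import math

def eoq(order_cost, monthly_demand, unit_price, holding_rate=0.1):
    """Economic order quantity: sqrt(2 * K * D / h), with h = holding_rate * price."""
    holding_cost = holding_rate * unit_price
    return math.sqrt(2 * order_cost * monthly_demand / holding_cost)

def order_cycle_days(quantity, monthly_demand, days_per_month=30):
    """Days between orders when ordering `quantity` units each time."""
    return quantity / monthly_demand * days_per_month

def reorder_point(monthly_demand, lead_time_days, days_per_month=30):
    """Units consumed during the shipping lead time."""
    return monthly_demand * lead_time_days / days_per_month

# Hypothetical inputs: $5 shipping fee, 40 units per month, $10 unit price.
q = eoq(order_cost=5.0, monthly_demand=40, unit_price=10.0)
cycle = order_cycle_days(q, monthly_demand=40)
rop = reorder_point(monthly_demand=40, lead_time_days=3)
```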
Query 1: EOQ - Output
Query 2: Academic Performance
Object
What factors affect students' academic performance (i.e. GPA), and to what extent?
Candidate factors: classes, parents' education level, travel expenses, course material fees, etc.
Benefits
1. Understand factors that may affect students' performance
2. Allocate funds wisely according to the analysis
3. Help students become more successful by planning events accordingly
Query 2: Academic Performance - SQL
Select specific variables and combine them with the coefficients to obtain the result of the linear regression model.
Table A determines whether a student has ever attended any internship (binary variable).
SELECT Student.SID,
1.714+0.589*A.Indicator-0.00632*IIf(Student.Degree="Undergraduate",1,0)+0.0165*IIf(Student.Gender="Male",1,0)+0.000644*Student.SATScore-0.0147*B.NumberEventAttend+0.0528*IIf(Student.Research="Yes",1,0) AS ExpGPA
FROM Student,
(SELECT Student.SID, Count(Internship.JID) AS Indicator
FROM Student, Job, Internship
WHERE Student.PID=Job.PID AND Job.JID=Internship.JID
GROUP BY Student.SID
UNION
SELECT Student.SID, 0
FROM Student
WHERE Student.PID NOT IN (SELECT Student.PID
FROM Student, Job, Internship
WHERE Student.PID=Job.PID AND Job.JID=Internship.JID)) AS A,
Query 2: Academic Performance – SQL (Cont.)
(SELECT Student.SID,Count(StudentParticipateEvent.EventID) AS
NumberEventAttend
From Student,StudentParticipateEvent
Where Student.PID=StudentParticipateEvent.PID
Group BY Student.SID) AS B
Where A.SID=Student.SID AND B.SID=Student.SID;
Table B determines how many events a student has attended (numerical value).
Step 2: Run linear regression over all the variables, then use the Akaike Information Criterion to reduce the model to the most efficient one.
Implemented with R
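The R code for the model reduction is not shown in the slides. To illustrate how an AIC comparison works, the sketch below scores least-squares models by n*ln(RSS/n) + 2k, which is the AIC of a Gaussian linear model up to an additive constant, and keeps the lowest score. The candidate names, residual sums of squares, and parameter counts are made-up placeholders. R's own AIC() also counts the error variance as a parameter, so its absolute values will differ even when the ranking agrees.

```python
import math

def aic_least_squares(rss, n, k):
    """AIC (up to a constant) for a least-squares fit with k parameters on n rows."""
    return n * math.log(rss / n) + 2 * k

def best_by_aic(candidates, n):
    """candidates: {name: (rss, k)} -> name of the model with the lowest AIC."""
    def score(name):
        rss, k = candidates[name]
        return aic_least_squares(rss, n, k)
    return min(candidates, key=score)

# Hypothetical fits on 50 students: (residual sum of squares, parameter count).
candidates = {
    "full": (10.0, 7),      # all predictors
    "reduced": (10.2, 4),   # a few predictors dropped, fit barely worse
    "gpa_only": (20.0, 2),  # much worse fit
}
best = best_by_aic(candidates, n=50)
```

The penalty term 2k is what lets the smaller model win when dropping predictors barely changes the fit.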
Query 3: High School Comparison
Object
Is there a significant difference between two high schools in each aspect?
Benefits
1. Compare the quality of two high schools based on the records of admitted students.
2. Help with decision making when comparing applicants with similar qualifications, from the perspective of their high schools' strengths.
Query 3: High School Comparison - SQL
SELECT Student.PID, Student.HID, Student.CollegeGPA,
Student.SATScore, Student.TOFELScore,
Student.[HighSchoolEvents#], Student.[HighSchoolAward#]
FROM Student
WHERE (((Student.HID)=1)) OR (((Student.HID)=2));
Step 1: Extract the data from Access by SQL
Step 2: Calculate the mean of each corresponding category for all admitted students from these two high schools, then use a t-test with unequal variances to get the p-value.
Implemented with Excel
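The unequal-variance t-test in Step 2 is Welch's test. As a sketch of what the spreadsheet computes, the code below returns the t statistic and the Welch-Satterthwaite degrees of freedom for two samples. The p-value itself comes from the t distribution with that many degrees of freedom and is left to Excel or a statistics library. The sample values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical values of one measure for students from two high schools.
school_1 = [1.0, 2.0, 3.0]
school_2 = [2.0, 3.0, 4.0]
t_stat, dof = welch_t(school_1, school_2)
```

One such test is run per compared category (GPA, SAT score, and so on), which is why a multiple-comparison correction follows.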
Query 3: High School Comparison - Output
Step 3: Use the Holm-Bonferroni method to find out whether each difference is significant.
Implemented with MATLAB.
Step 4: Sample output from MATLAB
[corrected_p, h]=bonf_holm([0.38 0.414 0.0513 0.334 0.257] ,0.5)
corrected_p = 1.0020 0.7600 0.2565 1.0280 1.0280
h= 0 0 1 0 0
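For reference, the standard Holm-Bonferroni step-down procedure can be sketched in a few lines. This version multiplies the i-th smallest p-value by (m - i + 1), enforces monotonicity, and caps adjusted values at 1. Because of that cap, it will not reproduce the MATLAB figures above exactly. It is a sketch of the textbook method, not a port of the bonf_holm function, and the example p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Step-down multiplier: m for the smallest p, then m-1, ..., 1.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject

# Hypothetical p-values from three category comparisons.
adjusted, reject = holm_bonferroni([0.01, 0.04, 0.03], alpha=0.05)
```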
Query 4: Category Expense
Object
What is the distribution of expenses across the four major categories of external expenses (i.e. Course Material, Travel, Office Supply, and Housing) that are paid by the program?
Generate distribution charts and use statistical tools to analyze these distributions.
Benefits
1. Track expenses related to students.
2. Decide the expense constraints for students in each category.
3. Generate a clear expense report that could be included in the annual report for the MasterCard Foundation.
Query 4: Category Expense - SQL
SELECT Student.PID, Sum(Expense.Amount) AS AmountOfSum
FROM (OfficeSupply INNER JOIN Expense ON OfficeSupply.EID =
Expense.EID) INNER JOIN Student ON Expense.PID = Student.PID
GROUP BY Student.PID
ORDER BY Student.PID;
Step 1: Find the total expense of each student
SELECT Count([OfficeSupply Query].PID) AS CountOfPID,
Partition([AmountOfSum],0,1100,100) AS Expr1
FROM [OfficeSupply Query]
GROUP BY Partition([AmountOfSum],0,1100,100);
Step 2: Generate the data for the histogram
Step 3: Use the Report function to generate graphs, and use the Access toolbox to compute the mean and standard deviation of the distribution.
Step 4: If, in most cases, the distribution is bell-shaped, we could use the 68-95-99.7 rule (the three-sigma rule) to set up expense constraints for students.
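Step 4 can be sketched as follows, assuming the per-student totals are roughly normally distributed. The code computes the mean and sample standard deviation, then returns the one-, two-, and three-sigma bands that would cover about 68%, 95%, and 99.7% of students. The function names and the sample totals are hypothetical, and which band to use as the spending limit is a policy choice the slides leave open.

```python
from statistics import mean, stdev

def sigma_bands(totals):
    """Return {k: (lower, upper)} for k = 1, 2, 3 standard deviations."""
    mu, sigma = mean(totals), stdev(totals)
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}

def over_limit(amount, bands, k=2):
    """True if a student's total exceeds the upper k-sigma bound."""
    return amount > bands[k][1]

# Hypothetical per-student office supply totals in dollars.
totals = [400.0, 500.0, 600.0]
bands = sigma_bands(totals)
```

With these placeholder totals the two-sigma band is 300 to 700, so a student total of 750 would be flagged for review.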
Query 4: Category Expense - Output
Query results after the first step, and the toolbox that can be used to calculate the average and variance quickly
Results after the second step: count the number of PIDs to generate a histogram
Query 4: Category Expense - Output
Expense Summary Report
(Based on Sample Data)
Query 5: Alumni & Employment
Object
What are the relationships among an alumnus's employment status, GPA, education level, number of events attended, etc.?
Benefits
1. Based on the analysis results, help current students find jobs.
2. Find the most important factor that affects a student's future employment.
Query 5: Alumni & Employment - SQL
Step 1: Extract the data from Access by SQL
SELECT Alumni.PID, IIf(Alumni.Degree="Undergraduate",1,0) AS Degree,
Count(StudentParticipateEvent.EID) AS NumEventAttend,
IIf(AlumniMajor.school_of_medicine="Yes",1,0) AS Sch_Medicine,
IIf(AlumniMajor.school_of_law="Yes",1,0) AS Sch_Law,
IIf(AlumniMajor.college_of_engineering="Yes",1,0) AS Sch_Engi,
IIf(AlumniMajor.school_of_optometry="Yes",1,0) AS Sch_Opt,
IIf(AlumniMajor.college_of_natural_resource="Yes",1,0) AS Sch_Nat,
IIf(AlumniMajor.college_of_letter_science="Yes",1,0) AS Sch_Science,
IIf(AlumniMajor.school_of_information="Yes",1,0) AS Sch_Inf,
IIf(AlumniMajor.school_of_social_welfare="Yes",1,0) AS Sch_welfare,
IIf(AlumniMajor.haas_business_school="Yes",1,0) AS Sch_has,
Company.c_latitude, Company.c_longitude
FROM Alumni, StudentParticipateEvent, AlumniMajor, Company, Job, FullTime
WHERE FullTime.JID=Job.JID AND Company.CompanyID=Job.CompanyID
AND Job.PID=Alumni.PID AND Alumni.PID=StudentParticipateEvent.PID
AND AlumniMajor.PID=Alumni.PID
GROUP BY Alumni.PID, Alumni.Degree, AlumniMajor.school_of_medicine, AlumniMajor.school_of_law,
AlumniMajor.college_of_engineering, AlumniMajor.school_of_optometry, AlumniMajor.college_of_natural_resource,
AlumniMajor.college_of_letter_science, AlumniMajor.school_of_information, AlumniMajor.school_of_social_welfare,
AlumniMajor.haas_business_school, Company.c_latitude, Company.c_longitude;
Query 5: Alumni & Employment - Process
Step 2: Fit a logistic regression model with R
• The model gives the probability that the predicted variable equals 1 (i.e. that the person is employed)
• Potentially 20 candidate models
Step 3: Model Selection
• Cross-Validation
Employment = GPA + Event + Degree
• AIC (Akaike Information Criterion)
Employment = GPA + Event + Degree + School of Information
• BIC (Bayesian Information Criterion)
Employment = GPA
• Deviance Selection
Employment = GPA + Event + Degree
Query 5: Alumni & Employment - Process
Step 4: Cut-off Selection
• Find a cut-off for the predicted probability that lets us judge whether the predicted value is 1 or 0
• Method: build a confusion matrix
• Choose a cut-off probability first
• Use the confusion matrix to find the best cut-off
• Based on the results, we chose 0.58
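The cut-off search above can be sketched as a loop over candidate thresholds, building a confusion matrix for each and keeping the best one. The slides do not say which criterion was used to rank cut-offs, so this sketch assumes plain accuracy. The predicted probabilities, labels, and candidate cut-offs are hypothetical.

```python
def confusion_matrix(probs, labels, cutoff):
    """Return (tp, fp, tn, fn) when predicting 1 for probabilities >= cutoff."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def best_cutoff(probs, labels, candidates):
    """Pick the candidate cut-off with the highest accuracy (assumed criterion)."""
    def accuracy(cutoff):
        tp, fp, tn, fn = confusion_matrix(probs, labels, cutoff)
        return (tp + tn) / len(labels)
    return max(candidates, key=accuracy)

# Hypothetical fitted probabilities and true employment outcomes.
probs = [0.2, 0.4, 0.55, 0.6, 0.9]
labels = [0, 0, 0, 1, 1]
chosen = best_cutoff(probs, labels, [0.3, 0.5, 0.6, 0.7])
```

If false negatives and false positives carry different costs, accuracy can be swapped for sensitivity, specificity, or a weighted combination.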
Query 5: Alumni & Employment - Process
Step 6: Plot
Query 5: Alumni & Employment - Process
Step 7: Create XML combined with KML to plot alumni's companies on Google Earth
• Create a KML plot for alumni who graduated from each college
• Plot those coordinates on Google Earth
• Get a geographic view of employment for each college in UC Berkeley, which suggests where employment rates are higher for that college's current students
• Example for School of Medicine
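As an illustration of the KML step, the sketch below turns a list of company names and coordinates into a KML document with one Placemark per company, using only the Python standard library. The input shape and the function name are assumptions. The coordinates would come from the c_latitude and c_longitude columns of the Step 1 query. Note that KML writes coordinates as longitude,latitude,altitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def companies_to_kml(companies):
    """companies: iterable of (name, latitude, longitude) -> KML text."""
    ET.register_namespace("", KML_NS)  # serialize without a namespace prefix
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon in companies:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML order is longitude first, then latitude, then altitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")

# Hypothetical employer location (placeholder name and coordinates).
kml_text = companies_to_kml([("Example Clinic", 37.87, -122.27)])
```

Saving the returned text to a .kml file gives a layer that Google Earth can open, with one file per college if separate layers are wanted.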
Query 5: Alumni & Employment - Output
Step 8: Load into Google Earth
Future Work & Improvements
• Create forms & reports
• Make our database more user-friendly
• Create additional queries:
• Other useful Queries
• Such as monthly balance check
Q&A
Thank you for listening.