IIMT 2641 Introduction to Business Analytics
Class 1: Introduction to Business Analytics
2023 Fall

IIMT 2641 Teaching Team
§ Instructor: Feng TIAN
– Email: fengtian@hku.hk
– Office: KKL 1312
– Office Hours: TBD or by appointment
§ TA: Ian Chan
– Email: ikwchan@hku.hk
– Office: KKL 625
– Office Hours: By appointment
§ TA: Yuwen Brian Wang
– Email: byuwen@hku.hk
– Office: KKL 625
– Office Hours: By appointment

Who am I?
§ Partial Economist
– BA in Econ (Nankai)
– MA in Econ (Duke)
§ Failed Mathematician
– Love Mathematics (dreamed of being a mathematician)
– Applied Mathematics, applying mathematics
§ Operations Researcher
– PhD in Technology & Operations (University of Michigan)

Data is powerful and everywhere
§ Data is transforming business, social interactions, and the future of our society.
§ The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 64.2 zettabytes in 2020. (Statista)
– 64.2 zettabytes = 64,200,000,000,000,000,000,000 bytes (1000 Bytes = 0.9766 Kilobytes)
– This is equal to the storage required for more than 4 trillion HD movies
– It would take a person approximately 300 million years to download them all from the internet
– This number is predicted to reach 175 ZB by 2025.
§ Internet users generate about 2.5 quintillion (10^18) bytes of data each day. 90% of all data has been created in the last two years.
§ The ability to process data is also increasing
– Decoding the human genome originally took 10 years to process; now it can be achieved in one week

Data and analytics are useful
§ Analytics is increasingly important in the world today.
§ The global big data analytics market was valued at US$271.83 billion in 2022. The market is projected to grow from US$307.52 billion in 2023 to US$745.15 billion by 2030.
§ 95% of businesses cite the need to manage unstructured data as a problem for their business.

Data and analytics are useful
§ 97.2% of organizations are investing in big data and AI.
– IBM has changed its business focus over the last 100 years from typewriters to mainframes to personal computers to consulting, and now to analytics.
  • IBM has invested over $20 billion since 2005 to grow its analytics business
– Netflix saves $1 billion per year on customer retention using big data
§ Critical in almost every business and industry

What is Analytics?
§ The science of using data to build models that lead to better decisions that add value to individuals, to companies, to institutions

This Course
§ Key Messages:
– Analytics provide a competitive edge to individuals and companies
– Analytics are often critical to the success of a company
§ Methodology:
– Teach analytics techniques through real world examples and real data
– Probability and statistical theories
§ Goal:
– Convince you of the Analytics Edge
  • The power and importance of data
  • How analytics methods work
  • How to interpret and understand the results of analytical models
– Inspire you to use analytics in your career
  • Excel spreadsheet
  • Software package R
  • Not just hearing about analytics, but creating your own models

This lecture
§ Summary of some of the cases we will cover
– Netflix (Movie Recommendation)
– Quality of Wine
– Twitter (Text analytics)
§ Other cases we will cover in this course
– Summer Job Search
– New Product Development
– Healthcare Quality Prediction
– Court Ruling Prediction
– Criminal Justice - Enron
– MRI brain image Segmentation
– Housing Price Prediction

Netflix
§ Subscription services
§ Key aspect is being able to offer customers accurate movie recommendations based on a customer’s own preferences and viewing history

The Netflix Prize
§ From 2006 to 2009 Netflix ran a contest asking the public to submit algorithms to predict user ratings for movies
§ Offered a grand prize of $1,000,000 USD to the team who could beat Netflix’s own algorithm's accuracy by more than 10%
§ Training data set of ~100,000,000 ratings and test data set of
~3,000,000 ratings were provided

Predicting the User Ratings
§ What data could be used to predict user ratings?

Using other users’ ratings: Collaborative Filtering

Using movie information: Content Filtering
§ We saw that Amy liked "Men In Black"
– It was directed by Barry Sonnenfeld
– Classified in the genres of action, adventure, sci-fi and comedy
– It stars actor Will Smith
§ Consider recommending to Amy:
– Barry Sonnenfeld’s movie "Get Shorty"
– "Jurassic Park", which is in the genres of action, adventure, and sci-fi
– Will Smith’s movie "Hitch"

Winners are declared!
§ On September 18, 2009, a winning team was announced
§ BellKor’s Pragmatic Chaos won the competition and the $1,000,000 grand prize

What is the edge?
§ In today’s digital age, businesses often have hundreds of thousands of items to offer their customers
§ Excellent recommendation systems can make or break these businesses
§ Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation systems

Predicting the Quality of Wine
§ Bordeaux is a region in France popular for producing wine.
§ Large differences in price and quality between years, although wine is produced in a similar way.
§ Wines taste better when they are older, so young wines are stored
– So it is hard to tell if a wine will be good when it is on the market
§ Expert tasters predict which ones will be good
§ Can analytics be used to come up with a different system for judging wine?

Predicting the Quality of Wine
§ March 1990 - Orley Ashenfelter, a Princeton economics professor, claims he can predict wine quality without tasting the wine (assessing the aroma, looking at the legs)
§ Ashenfelter used a method called linear regression
– Predicts an outcome variable, or dependent variable
– Predicts using a set of independent variables
§ Dependent variable: typical price in 1990-1991 wine auctions (approximates quality)
§ Independent variables: [shown in a figure on the original slide]

The Expert’s Reaction
§ Robert Parker, the world's most influential wine expert:
– “Ashenfelter is an absolute total sham”
– “rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director”

The Results
§ Parker:
– 1986 is “very good to sometimes exceptional”
§ Ashenfelter:
– 1986 is mediocre
– 1989 will be “the wine of the century” and 1990 will be even better!
§ In wine auctions,
– 1989 sold for more than twice the price of 1986
– 1990 sold for even higher prices!
§ Later, Ashenfelter predicted 2000 and 2003 would be great
§ Parker has stated that “2000 is the greatest vintage Bordeaux has ever produced”

What is the edge?
§ A linear regression model with only a few variables can predict wine prices well
§ In many cases, it outperforms wine experts’ opinions
§ A quantitative approach to a traditionally qualitative problem

Twitter/X
§ A social networking and communication website founded in 2006
§ What can you do with Twitter/X?

Impact of Twitter/X
§ Who is using Twitter, and what do they use it for?

Understanding People
§ Why do companies keep official accounts on Twitter? What can companies do on Twitter?

Using Text as Data
§ Most of the data we deal with is
– Structured
– Numerical
– Categorical
§ Tweets are
– Loosely structured
– Textual
– Sometimes poorly spelled, with non-traditional grammar
– Possibly multilingual

Text Analytics
§ Why do people care about textual data?
§ How do we handle it?
§ Humans can’t keep up with Internet-scale volumes of data
– 350,000 tweets sent per minute
– 500 million tweets sent each day
§ Computers can help
– Need to 'understand' text
– Natural Language Processing
– Understand and derive meaning from human language

Sentiment Analysis
§ Use NLP and text analytics to identify, extract, and study affective states and subjective information.

What is the edge?
§ Twitter and other social media generate large amounts of textual data
– Text analytics (sentiment analysis) deals with massive amounts of unstructured data
§ We’ll see how we can build analytics models using text as our data
§ In general, text analytics (including sentiment analysis) is applied to marketing, customer service, and healthcare…

This lecture
§ Summary of some of the examples we will cover
– Netflix (Movie Recommendation)
– Quality of Wine
– Twitter (Text analytics)
§ Other examples we will cover in this course
– Summer Job Search
– New Product Development
– Healthcare Quality Prediction
– Court Ruling Prediction
– Criminal Justice - Enron
– MRI brain image Segmentation
– Housing Price Prediction

What is Analytics?
§ The science of using data to build models that lead to better decisions that add value to individuals, to companies, to institutions
§ Descriptive analytics: identify patterns in the data
– Summary statistics
– Visualizations
– Clustering
– Text analytics
§ Predictive analytics: predict different outcomes
– Linear Regression
– Logistic Regression
– Classification Trees
§ Prescriptive/Operations analytics: give advice on actions to take
– Decision Analysis
– Linear/Integer Optimization

Course Schedule
COURSE CONTENT AND TENTATIVE TEACHING SCHEDULE
Week | Date | Topic | Cases
1 | 4 Sep | Overview: Business Analytics, Probability (Part 1) |
2 | 11 Sep | Probability (Part 2), Decision Analysis (Part 1) | Summer Job Search
3 | 18 Sep | Decision Analysis (Part 2), Statistical Inference (Part 1) | New Product Development
4 | 25 Sep | Statistical Inference (Part 2), Linear Regression (Part 1) | Wine Quality Prediction
5 | 2 Oct | General Holiday |
6 | 9 Oct | Linear Regression (Part 2), Logistic Regression (Part 1) | Healthcare Quality Assessment
7 | 16 Oct | NO CLASS - Reading Week |
8 | 23 Oct | General Holiday |
9 | 30 Oct | Logistic Regression (Part 2), Clustering (Part 1) | Movie Recommendation
10 | 6 Nov | Clustering (Part 2), Classification Tree (Part 1) | Court Ruling Prediction
11 | 13 Nov | Classification Tree (Part 2), Text Analytics (Part 1) |
12 | 20 Nov | Text Analytics (Part 2) | Sentiment Analysis on Twitter; Crime Investigation - email
13 | 27 Nov | More cases and Course Review |

Course Introduction: Technology
§ Microsoft Excel
§ R
– Will have a brief introduction soon.

Goals
§ Understand the complexity of data and how to deal with data
§ Create your own analytical models
§ Understand, use, and think critically about the results of the models
§ Know what to do next
§ Reach these goals through learning R
§ But ultimately, this is a course about analytics
§ Don’t get lost in R; think about the models and results

Administrative Arrangements

Course Materials
§ Required Materials
– Lecture notes, assignments, practice problems (on Moodle)
§ Recommended Textbook
– The Analytics Edge. Dimitris Bertsimas, Allison K. O'Hair, and William R. Pulleyblank. Dynamic Ideas LLC, 2016.

Manage Lecture Slides
§ For every topic
– One set of before-class slides
§ For every class and topic
– One set of after-class slides

Assessment
Participation + Attendance: 5%
Individual assignments: 30%
Group Projects: 25%
Final Exam: 40%
Total: 100%

Assessment
§ In-Class Participation
– Attending and actively contributing to class discussions.
– In-class practice questions and pop-up surveys (these are not quizzes; only participation is recorded).
– If you participate during the class, please fill in the sheet at the front during the break or after class (you may also grab a snack). If you forget to do this, please email the TA (cc me), including what you asked or answered, after the class and no later than 10 pm that day.
§ Attendance
– Attendance app
§ No disruptions in class
– Laptops are needed throughout
– NO CROSSTALKING
– NO CELL PHONE (including phone calls)
– Tablets or laptops for taking notes are allowed

Assessment
§ Individual assignment policy
– Strict due date/time (posted online) enforced, no excuses.
§ Homework questions are meant to be extensions of what we do in class.
§ You are highly encouraged to do your homework with your classmates.
Do not copy!
§ At most 5 graded assignments.

Assessment
§ Group Project (25%)
– Teams of 4–5 people
– Apply analytics tools to a wide variety of real-world settings
  • Project proposal (middle of the course)
  • Discussion with the instructor
  • Project report (end of the course)
– More details to be announced during the next class.

Assessment
§ Final Exam (40%)
– During the assessment period.
– Homework assignments, practice problems
– TBD

ChatGPT Policy
§ Learning:
– Could be helpful.
§ Assignments:
– Directly copying answers from ChatGPT is prohibited. If caught, the grade for that assignment is zero.
– Relying on GPT is not wise, since you cannot use it in the exams.
§ Group project:
– If you use GenAI, you need to declare how you used it.
– Helpful for polishing your writing and learning fancy tools faster.
§ More discussion in later courses
– It is also new to me.

Academic Conduct
§ Academic dishonesty is ABSOLUTELY NOT TOLERATED.
– No second chance.
§ You are highly encouraged to do your homework with your classmates. Do not copy!

Academic Support
1. Tutorial Sessions (TA): highly recommended, not mandatory.
2. Each other: leverage your classmates and your friends to ask questions and figure things out.
3. Office Hours or By Appointment

What is R?
§ A software environment for data analysis, statistical computing, and graphics
§ Natural to use, complete data analyses in just a few lines
§ Can create almost any analytics model imaginable
§ Significantly more powerful than Excel
§ A programming language
– There is a lot more that can be done in R
§ Don’t worry
– We won't be doing much programming in class, and almost everything we ask you to do can be completed in a few lines.

History of R
§ Originated from S
– A statistical programming language developed by John Chambers at Bell Labs in the 1970s
§ The first version of R was developed by Robert Gentleman and Ross Ihaka at the University of Auckland in the mid-1990s
– Wanted better statistical software for their Macintosh teaching laboratory
– An open-source alternative: encourage others to download and help develop the software
– R packages

Why use R?
§ There are many choices for data analysis software
– SAS, Stata, SPSS, Excel (with add-ons), MATLAB, Minitab...
– So why are we using R?
§ Free (open-source project)
§ Widely used
– More than 2 million users around the world
– New features are being developed all the time
– A lot of community resources
§ Easy to re-run previous work and make adjustments
§ Nice graphics and visualizations

R resources
§ Official page: http://www.r-project.org
§ Download page: http://www.cran.r-project.org
§ Some helpful websites:
• http://www.statmethods.net
• http://www.rseek.org
§ Looking for a command or function? Google it
§ The best way to learn R is through trial and error

RStudio
§ Official page: https://www.rstudio.com
§ Interactive integrated development environment (IDE)
§ Includes a code editor with many R-specific features, a console to execute your code, and other useful panes, including one to show figures

Course objectives
§ This course should make you comfortable using analytics in your career and your life
§ You will know how to work with real data, and will learn many different methodologies
§ We want to convince you of the edge of business analytics

Take a Quick Survey
§ https://forms.gle/h1si1U2nwLkX3kzs8
§ Deadline of the survey: Sep 7, Thursday 6 pm.