BUSF-SHU 101: Statistics for Business and Economics
Introduction
Jiding Zhang
Jan 31, 2023

Plans for Today
- Syllabus
- Course Project
- Basics of Statistics

Meet The Team
- Instructor: Jiding Zhang, Assistant Professor of Operations Management
  - Email: jiding@nyu.edu
  - Office: New Bund Campus S806
  - Office Hours: Thu 4:30 – 6:00 pm
- TA and Recitation Instructor: Chenjie Huang
  - Email: ch1419@nyu.edu
  - Office Hours: TBD
- LA: Lemeng Li
  - Email: ll4755@nyu.edu
  - Office Hours: TBD
- LA: Khaliun Enkhbold
  - Email: ke2110@nyu.edu
  - Office Hours: TBD

Syllabus
- Textbook: McClave, J.T., Benson, P.G. and Sincich, T., Statistics for Business and Economics, Pearson Education, 13th edition
- Software: pick up on your own; a brief R bootcamp is provided
- Grading: curved
- Recitation: mandatory
- Homework [30%]: approx. 1 problem set for each chapter, completed independently; no late submissions will be accepted; lowest grade dropped
- Final exam [30%]: cumulative, closed book, one cheat sheet allowed
- Group course project [40%]

Email Policies
- For questions regarding the material, email the LA/TA first, or come to office hours
- Use the subject “[BUSF-SHU 101 Questions]”
- Limit your e-mail inquiries to issues that (1) cannot be answered by reading the syllabus and (2) have not been clarified in class or on Brightspace.

The Course Project: Sign Up
- Choose from 4 datasets provided by the instructor.
- Max 6 people per group.
- Feel free to pick your teammates.
- No more than X groups can work on the same dataset: first come, first served! (X announced after week 2)
- Sign up through the Google Sheet.
- Sign-ups are final: you cannot switch datasets or teammates afterwards.
- Deadline for signing up: February 17 (Friday)

The Course Project: Proposal
- 6-minute presentation in week 7.
- Expectations:
  - Descriptive analysis of the dataset. Any interesting patterns?
  - Tentative research questions: What do you plan to study?
- Evaluation: by the instructor, 10% of the final grade. All team members within a group share the same grade.
The Course Project: Final Presentation
- 15-minute presentation in week 14.
- Expectations:
  - A brief introduction to the background/setting of the problem (2 min)
  - A clear statement of the research question (1 min)
  - Motivating the question (3 min): Why should we care? What do we gain from your work?
  - Statistical analysis to answer the question you raise (8 min): Explain your analysis carefully.
  - Conclusion/Takeaway (1 min): What should we learn and remember from your work?

The Course Project: Final Presentation
- Presentation slides, including a detailed appendix summarizing research methods and findings, should also be submitted by week 14.
- Evaluation: 20% by the instructor, 5% by within-group peer evaluation, and 5% by cross-group peer evaluation
  - Within-group: rate each of your teammates
  - Cross-group: rate the work of other groups according to their presentations
  - Please be honest and fair. Act with academic integrity.
- Criteria: (1) interestingness/significance of the research question; (2) rigor of analysis; (3) quality of delivery

The Course Project: Final Presentation
- Scope: there is no restriction on scope.
- Tools for analysis are not limited to what we have learned in class. However, if you use techniques not discussed in class, it is your responsibility to explain them to us (assume we know nothing).
- Also, feel free to supplement your analysis with data from other sources. However, the main analysis should still focus on the dataset provided by the instructor. Supplementing is allowed and encouraged; substituting is not.
- Reproducibility of results: All results should be reproducible. Hence, all code used for analysis should be submitted along with the paper. If Microsoft Excel is used for analysis, all formulas should be kept in the worksheet so the results can be reproduced.
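The grading weights stated on the syllabus and project slides (homework 30%, final exam 30%, proposal 10%, final presentation 20% + 5% + 5%) can be combined into a single raw score. The following is a minimal sketch in Python, not the official grading procedure: it assumes every component is scored on a 0-100 scale and that all problem sets count equally once the lowest is dropped. The function names and the example student are hypothetical, and the result is the raw score before any curve.

```python
# Weights taken from the syllabus and project slides (fractions of the final grade).
WEIGHTS = {
    "homework": 0.30,
    "final_exam": 0.30,
    "proposal": 0.10,
    "final_instructor": 0.20,
    "within_group_peer": 0.05,
    "cross_group_peer": 0.05,
}


def homework_average(scores):
    """Average homework score after dropping the single lowest grade.

    Assumption: every problem set carries equal weight.
    """
    if len(scores) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(scores)[1:]  # drop the lowest score
    return sum(kept) / len(kept)


def raw_course_score(homework_scores, component_scores):
    """Weighted raw score on a 0-100 scale, before any curve is applied."""
    scores = dict(component_scores)
    scores["homework"] = homework_average(homework_scores)
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student; the numbers are made up for illustration only.
example = raw_course_score(
    homework_scores=[60, 90, 80, 100],
    component_scores={
        "final_exam": 80,
        "proposal": 90,
        "final_instructor": 85,
        "within_group_peer": 100,
        "cross_group_peer": 70,
    },
)
print(round(example, 2))
```

The six weights sum to 1, which matches the 30/30/40 split: the 40% project share is the 10% proposal plus the 30% final presentation.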
What is Statistics
- Statistics: the science of data: collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical and categorical information
- Descriptive statistics: explore (look for patterns, summarize information) vs. Inferential statistics: estimate, predict, generalize

Birth of Modern Statistics
- Francis Galton and Karl Pearson: transformed statistics into a rigorous mathematical discipline used for analysis

  success = talent + luck
  big success = a bit more talent + a lot more luck
  (Daniel Kahneman, Thinking, Fast and Slow)

- William Sealy Gosset, Ronald Fisher: developed better design-of-experiments models, hypothesis testing, and techniques for use with small data samples

Statistics in the Time of Cholera
- Cholera: transmitted through air or (contaminated) water?
- John Snow (father of epidemiology): compared changes in death rates in two areas served by two water companies
  - Originally: both drew water from the Thames
  - In 1852, one company moved its waterworks upriver ⇒ death rate fell

Basic Terms of Statistics
- Experimental/observational unit: an object upon which we collect data
- Population: a set of units that we are interested in studying (all, everyone)
- Variable: a characteristic or property of an individual unit
- Sample: a subset of the units of a population
- Statistical inference: an estimate, prediction, or generalization about a population based on information from a sample
- Measure of reliability: the degree of uncertainty associated with a statistical inference
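The terms above can be tied together in a few lines of code. The sketch below is in Python and uses a synthetic, made-up population rather than any course dataset. It draws a sample, uses the sample mean as the inference about the population mean, and reports the estimated standard error as one common measure of reliability. The seed, population size, and sample size are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(101)  # fixed seed so the sketch is repeatable

# Population: one made-up numeric variable recorded for 10,000 units.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Sample: a subset of the population's units, drawn without replacement.
sample = rng.sample(population, 100)

# Statistical inference: use the sample mean to estimate the population mean.
estimate = statistics.mean(sample)

# Measure of reliability: the estimated standard error of that sample mean.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean (normally unknown): {statistics.mean(population):.2f}")
print(f"sample estimate: {estimate:.2f} +/- {standard_error:.2f} (one standard error)")
```

In practice the population mean is not observable, which is why the estimate is reported together with its uncertainty.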
Conducting Statistical Analysis
- Designed experiment: full control over the characteristics of the experimental units sampled
- Observational study: the experimental units sampled are observed in their natural setting (i.e., no attempt to control them)
- Representative sample: exhibits characteristics typical of those possessed by the population of interest
- Simple random sample (of size n): every different sample of size n has an equal chance of selection

How Numbers Lie
- Selection bias: a subset of units in the population has little or no chance of being selected for the sample
- Measurement error: inaccuracies in the values of the data collected
- Spurious correlation

Statistics: Why you should learn it
- Ask for money
  - Higher success rate for women? Greenberg and Mollick (2016)
- Make money (or lose)
  - Predict computing power and carbon footprint? Prat and Walter (2019)
- Be entrepreneurial
  - Randomized markdown? Moon, Bimpikis, Mendelson (2017)
- Be influential
  - Spread of true and false news? Vosoughi, Roy, Aral (2018)
- Many more reasons...
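Returning to the definitions of a simple random sample and selection bias: the contrast can be seen in a small simulation. This is a sketch in Python on synthetic data, assuming a made-up population of values between 0 and 100. The biased sample is built by giving every unit below 50 no chance of selection, which is only one of many ways selection bias can arise.

```python
import random
import statistics

rng = random.Random(2023)  # fixed seed so the sketch is repeatable

# Synthetic population of 5,000 made-up values between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(5_000)]
true_mean = statistics.mean(population)

# Simple random sample of size n: every subset of size n is equally likely.
n = 200
srs = rng.sample(population, n)

# Biased sample: units with values below 50 have no chance of being selected.
reachable = [x for x in population if x >= 50]
biased = rng.sample(reachable, n)

print(f"population mean:           {true_mean:.1f}")
print(f"simple random sample mean: {statistics.mean(srs):.1f}")
print(f"biased sample mean:        {statistics.mean(biased):.1f}")
```

Both samples have the same size, so the gap between the two sample means comes from how the units were selected, not from how many were selected.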
Reward-Based Crowdfunding
- Data on more than 100,000 crowdfunding projects on Kickstarter from 2009 to 2014
- Data source: crowdBerkeley
- For each project, we have: title, category, goal, amount pledged, state, start date, end date, location, founder, backers count, description, url
- One single tsv file, around 40.5 MB

Crypto Mining
- Data on blocks mined for Bitcoin, from 2009 to 2021
- Data source: BTC.com
- For each successful attempt, we have: Height (index), Host, Volume, Stripped Size, Size, Weight, Average Trans Fee Per transaction, Block Reward, Block Reward Tip, Time
- One single csv file, around 58 MB

Online Marketplaces: Successes and Failures
- Data on listings on Beepi and Carvana in July 2016
- Beepi: for each listing, we have: date (of scraping), SaleID, VIN, Mileage, BodyType, Year, ModelID, ModelName, MakeID, MakeName, Trim, Cylinders, EngineSize, NumDoors, Transmission, Price, DealerPrice, EdmundsPrice, SaleBoolean
- Carvana: for each listing, we have: date (of scraping), SaleID, BodyType, Year, SalePending, DaysUntilAvailable, ModelName, StockNumber, Mileage, Trim, Price, KBBPrice
- 2 csv files, around 20 MB in total

Social Media: Information and User Activity
- Data on user activity on Facebook pages in February 2019: New York Times, 538, infowars (renowned “Fake News” site)
- For each Facebook page, at the time of scraping, we have information on all (?) posts on the focal page, including: url, Likes, Reposts, Comments, (Title) Text for each post. The time of scraping can be found in each file's information (Date Modified).
- Around 3000 xlsx files (hourly snapshots of each Facebook page), 2.5 GB in total

Next Time...
- R Bootcamp (optional). If you plan to come, please bring your laptop with R and RStudio installed.
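A note on the social media dataset: since the scraping time is available only as each file's Date Modified, it has to be read from the file system rather than from the file contents. A minimal sketch in Python is shown below. The helper name `scrape_time` is hypothetical, and the demonstration uses a throwaway temporary file because the actual snapshot file names are not given here.

```python
import os
import tempfile
from datetime import datetime


def scrape_time(path):
    """Return the file's last-modified time as a local datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path))


# Demonstration on a throwaway file; a real snapshot path would go here instead.
with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp:
    demo_path = tmp.name

try:
    demo_time = scrape_time(demo_path)
    print(demo_time.isoformat(timespec="seconds"))
finally:
    os.remove(demo_path)  # clean up the temporary file
```

Copying, unzipping, or re-saving a file may change its modification time, so it is worth recording these timestamps before doing anything else with the files.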