Lecture 1

BUSF-SHU 101: Statistics for Business and Economics
Introduction
Jiding Zhang
Jan 31, 2023
1 / 26
Plans for Today
I Syllabus
I Course Project
I Basics of Statistics
2 / 26
Meet The Team
I Instructor: Jiding Zhang, Assistant Professor of Operations
Management
I Email: jiding@nyu.edu
I Office: New Bund Campus S806
I Office Hour: Thu 4:30 – 6:00 pm
I TA and Recitation Instructor: Chenjie Huang
I Email: ch1419@nyu.edu
I Office Hour: TBD
I LA: Lemeng Li
I Email: ll4755@nyu.edu
I Office Hour: TBD
I LA: Khaliun Enkhbold
I Email: ke2110@nyu.edu
I Office Hour: TBD
3 / 26
Syllabus
I Textbook: McClave, J.T., Benson, P.G. and Sincich, T.,
Statistics for Business and Economics, Pearson Education,
13th edition
I Software: pick up on your own, a brief R bootcamp provided
I Grading: curved
I Recitation: mandatory
I Homework [30%]: approx. 1 problem set for each chapter,
completed independently, no late submission will be accepted,
lowest grade dropped
I Final exam [30%]: cumulative, closed book, one cheat sheet
allowed
I Group course project [40%]
4 / 26
Email Policies
I For questions regarding the material, email LA/TA first, or
come to the office hour
I Use the subject “[BUSF-SHU 101 Questions]”
I Limit your e-mail inquiries to issues that (1) cannot be
answered by reading the syllabus and (2) have not been
clarified in class or on Brightspace.
5 / 26
The Course Project: Sign Up
I Choose from 4 datasets provided by the instructor.
I Max 6 people per group.
I Feel free to pick your teammates.
I No more than X groups can work on the same dataset:
first come, first served! (X announced after week 2)
I Sign up through Google Sheet.
I Sign-ups are final: you cannot switch between datasets or
teammates afterwards.
I Deadline for signing up: February 17 (Friday)
6 / 26
The Course Project: Proposal
I 6-minute presentation in week 7.
I Expectations:
I Descriptive analysis of the dataset. Any interesting patterns?
I Tentative research questions: What do you plan to study?
I Evaluation: by the instructor, 10% of the final grade. All
team members within a group receive the same grade.
7 / 26
The Course Project: Final Presentation
I 15-minute presentation in week 14.
I Expectations:
I A brief introduction of the background/setting of the problem
(2 min)
I A clear statement of the research question (1 min)
I Motivating the question (3 min): Why should we care? What
do we gain from your work?
I Statistical analysis to answer the question you raise (8 min):
Explain your analysis carefully.
I Conclusion/Takeaway (1 min): What should we learn and
remember from your work?
8 / 26
The Course Project: Final Presentation
I The presentation slides, including a detailed appendix
summarizing research methods and findings, should also be
submitted by Week 14.
I Evaluation: 20% by the instructor, 5% by within-group peer
evaluation, and 5% by cross-group peer evaluation
I Within-group: rate each of your teammates
I Cross-group: rate the work of other groups according to their
presentations
I Please be honest and fair. Act with academic integrity.
I Criteria: (1) interestingness/significance of research
question; (2) rigor of analysis; (3) quality of delivery
9 / 26
The Course Project: Final Presentation
I Scope: There is no limit on scope.
I Tools for analysis are not limited to what we have learned in
class. However, it is your responsibility to explain them to us
(assume we know nothing), if you use techniques not discussed
in class.
I Also, feel free to supplement your analysis with data from
other sources. However, the main analysis should still be
focused on the dataset provided by the instructor. It is allowed
and encouraged to supplement, but not to substitute.
I Reproducibility of results: All results should be reproducible.
Hence, all code used for analysis should be submitted along
with the paper. If Microsoft Excel is used for analysis, all
formulas should be kept in the worksheet for reproducing
results.
10 / 26
What is Statistics
I Statistics: the science of data, i.e., collecting, classifying,
summarizing, organizing, analyzing, and interpreting numerical
and categorical information
I Descriptive statistics: explore (look for patterns, summarize
information) vs. Inferential statistics: estimate, predict,
generalize
11 / 26
Birth of Modern Statistics
I Francis Galton and Karl Pearson: transformed statistics into a
rigorous mathematical discipline used for analysis
success = talent + luck
big success = a bit more talent + a lot more luck
— Daniel Kahneman, Thinking, Fast and Slow
I William Sealy Gosset, Ronald Fisher: developed better
design-of-experiments models, hypothesis testing, and
techniques for use with small data samples
12 / 26
Statistics in the Time of Cholera
I Cholera: transmitted through air or (contaminated) water?
I John Snow (father of epidemiology): Compare changes in
death rate in two areas serviced by two water companies
I Originally: both companies drew water from the Thames
I In 1852, one company moved its water works upriver ⇒ death
rate fell
13 / 26
Basic Terms of Statistics
I Experimental/observational unit: an object upon which we
collect data
I Population: A set of units that we are interested in studying
(all, everyone)
I Variable: a characteristic or property of an individual unit
I Sample: a subset of the units of a population
I Statistical inference: an estimate or prediction or
generalization about a population based on information from
a sample
I Measure of reliability: a statement of the degree of
uncertainty associated with a statistical inference
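
The terms above can be tied together in a short simulation. This is only a sketch under stated assumptions: the population is simulated (it is not course data), and the 1.96 multiplier assumes a normal approximation for a rough 95% interval.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated order values (not course data).
rng = random.Random(42)
population = [rng.gauss(100, 20) for _ in range(10_000)]

# Sample: a subset of the units of the population.
sample = rng.sample(population, 200)

# Statistical inference: estimate the population mean from the sample.
estimate = statistics.mean(sample)

# Measure of reliability: standard error and a rough 95% interval
# (the 1.96 multiplier assumes a normal approximation).
std_error = statistics.stdev(sample) / len(sample) ** 0.5
interval = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(f"population mean: {statistics.mean(population):.2f}")
print(f"sample estimate: {estimate:.2f} +/- {1.96 * std_error:.2f}")
```

In practice the population mean is unknown; it is printed here only because the population is simulated, so the estimate can be compared against it.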
14 / 26
Conducting Statistical Analysis
I Designed experiment: the analyst has full control over the
characteristics of the experimental units sampled
I Observational study: experimental units sampled are observed
in their natural setting (i.e., no attempt is made to control them)
I Representative sample: exhibits characteristics typical of those
possessed by the population of interest
I Simple random sample (of size n): every different sample of
size n has an equal chance of selection
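
As a rough illustration of the simple random sample definition, the sketch below lists every possible sample of size n from a tiny population and then draws one. The five unit labels are hypothetical.

```python
import itertools
import random

units = ["A", "B", "C", "D", "E"]  # hypothetical population of 5 units
n = 2

# Every distinct sample of size n; under simple random sampling
# each of them has the same chance of being selected.
all_samples = list(itertools.combinations(units, n))
chance = 1 / len(all_samples)

# random.sample draws n distinct units without replacement,
# so every subset of size n is equally likely.
rng = random.Random(0)
draw = rng.sample(units, n)

print(len(all_samples), chance)
print(draw)
```

With 5 units and n = 2 there are 10 possible samples, so each has a 1-in-10 chance of being the one drawn.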
15 / 26
How Numbers Lie
I Selection bias: a subset of units in population has little or no
chance of being selected for the sample
I Measurement error: inaccuracies in the values of the data
collected
I Spurious correlation
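
Selection bias can be seen in a small simulation. The sketch below is illustrative only: the customer population, the 70/30 online/offline split, and the spending levels are all made-up assumptions.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: 5,000 customers. About 30% shop offline
# only and, by assumption, tend to spend less.
population = []
for _ in range(5_000):
    online = rng.random() < 0.7
    spend = rng.gauss(120, 15) if online else rng.gauss(60, 15)
    population.append((online, spend))

true_mean = statistics.mean(s for _, s in population)

# Simple random sample: every unit can be selected.
srs = rng.sample(population, 500)
srs_mean = statistics.mean(s for _, s in srs)

# Biased sample: an online survey has no chance of reaching
# offline-only customers.
reachable = [unit for unit in population if unit[0]]
biased = rng.sample(reachable, 500)
biased_mean = statistics.mean(s for _, s in biased)

print(f"true mean:   {true_mean:.1f}")
print(f"SRS mean:    {srs_mean:.1f}")
print(f"biased mean: {biased_mean:.1f}")
```

The biased sample overstates average spending because a subset of the population could never be selected, no matter how large the sample is.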
16 / 26
Statistics: Why you should learn it
I Ask for money
I Higher success rate for women? Greenberg and Mollick (2016)
I Make money (or lose)
I Predict computing power and carbon footprint? Prat and
Walter (2019)
I Be entrepreneurial
I Randomized markdown? Moon, Bimpikis, Mendelson (2017)
I Be influential
I Spread of true and false news? Vosoughi, Roy, Aral (2018)
I Many more reasons...
17 / 26
Reward-Based Crowdfunding
I data of more than 100,000 crowdfunding projects from 2009
to 2014 on Kickstarter
I data source: crowdBerkeley
I for each project, we have: title, category, goal,
amount pledged, state, start date, end date, location, founder,
backers count, description, url
I a single TSV file, around 40.5 MB
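
A first descriptive pass over a TSV file like this one might look as follows. This is a sketch, not the course's method: the exact column headers (`category`, `state`), the value `successful`, and the toy rows are assumptions for illustration and may not match the actual file.

```python
import csv
import io
from collections import Counter

def success_rate_by_category(tsv_file, category_col="category",
                             state_col="state", success_value="successful"):
    """Share of projects per category whose state equals success_value."""
    totals, wins = Counter(), Counter()
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        totals[row[category_col]] += 1
        if row[state_col] == success_value:
            wins[row[category_col]] += 1
    return {cat: wins[cat] / totals[cat] for cat in totals}

# Toy rows made up for illustration; not taken from the course dataset.
toy = io.StringIO(
    "title\tcategory\tgoal\tstate\n"
    "Project A\tGames\t5000\tsuccessful\n"
    "Project B\tGames\t8000\tfailed\n"
    "Project C\tMusic\t1000\tsuccessful\n"
)
rates = success_rate_by_category(toy)
print(rates)
```

For the real file, the same function could be given a handle from `open(path, newline="", encoding="utf-8")`, where the path and the encoding are assumptions to check against the data.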
19 / 26
Crypto Mining
I data of blocks mined for Bitcoin, from 2009 to 2021
I data source: BTC.com
I for each successful attempt, we have: Height (index), Host,
Volume, Stripped Size, Size, Weight, Average Trans
Fee Per transaction, Block Reward, Block Reward Tip, Time
I one single csv file, around 58 MB
21 / 26
Online Marketplaces: Successes and Failures
I data of listings on Beepi and Carvana in July 2016
I Beepi: for each listing, we have: date (of scraping), SaleID,
VIN, Mileage, BodyType, Year, ModelID, ModelName,
MakeID, MakeName, Trim, Cylinders, EngineSize, NumDoors,
Transmission, Price, DealerPrice, EdmundsPrice, SaleBoolean
I Carvana: for each listing, we have: date (of scraping), SaleID,
BodyType, Year, SalePending, DaysUntilAvailable,
ModelName, StockNumber, Mileage, Trim, Price, KBBPrice
I 2 csv files, around 20 MB in total
23 / 26
Social Media: Information and User Activity
I data of user activities on Facebook pages in February 2019:
New York Times, 538, infowars (renowned “Fake News” site)
I for each Facebook page, at the time of scraping, we have:
information of all (?) posts on the focal page, including: url,
Likes, Reposts, Comments, (Title) Text for each post. Also,
the time of scraping can be accessed by looking at the file
information - Date Modified
I around 3000 xlsx files (hourly snapshots of each Facebook
page), 2.5 GB in total
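
Since the scraping time is stored only in each file's Date Modified, one possible way to recover it is sketched below. The function name `scrape_times` and the folder layout are hypothetical.

```python
from datetime import datetime
from pathlib import Path

def scrape_times(folder):
    """Map each .xlsx file name in folder to its last-modified timestamp."""
    times = {}
    for path in sorted(Path(folder).glob("*.xlsx")):
        # st_mtime is the file's modification time in seconds since the epoch.
        times[path.name] = datetime.fromtimestamp(path.stat().st_mtime)
    return times
```

Modification times can change when files are copied or re-saved, so it is worth spot-checking a few timestamps against the original files before relying on them.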
25 / 26
Next Time...
I R Bootcamp (optional). If you plan to come, please bring
your laptop with R and RStudio installed.
26 / 26