Bentley160 Lecture -1-5

Dr. Hema Seshadri
Faculty Profile: https://dataworksai.com/?page_id=9

Upcoming Book
Analytics for Business Success: A Guide to Analytics Fitness™ by Hema Seshadri, Ph.D.
Book Synopsis: https://dataworksai.com/?page_id=529
Book Blurb: https://dataworksai.com/?page_id=27

Bentley Honor Code
1. Academic Integrity System Structure
2. Faculty and Student Responsibilities and Rights in the Academic Integrity System
3. Violation Levels Defined and Recommended Sanctions
4. Academic Integrity Incident Reports and Consequences
5. Academic Integrity Hearing
https://catalog.bentley.edu/undergraduate/academic-policiesprocedures/academic-integrity/

Academic Support
1. Academic Learning Centers and Labs
2. Academic Skills Assistance
3. First Year Seminar
4. Peer Tutoring Assistance
https://catalog.bentley.edu/undergraduate/academic-programsresources/academic-support-services/

Disability Services
● Learning disabilities
● Attention Deficit/Hyperactivity Disorders
● Mobility, visual and hearing impairments
● Medical conditions
● Psychiatric/psychological disabilities
Services include:
● Academic accommodations
● Assistance with accessibility issues
● Community education
● Individual coaching and support
https://catalog.bentley.edu/undergraduate/academicprograms-resources/disability-services/

Tableau Accounts Setup
You will need at least two Tableau accounts.
● The first is for the downloadable Tableau Desktop. Sign up here: https://www.tableau.com/academic/teaching
● The second is Tableau Public: https://public.tableau.com/en-us/s/

CIS Sandbox
● This semester we have four tutors in the CIS Sandbox who can help with CS 160: Grace Hesterberg, John Giaquinto, Tomas Hahn, and Chris Hagedorn.
● Together they are working about 20 hours per week of drop-in time (the schedule has not been finalized yet).
● Tutor and student responsibilities

CIS Sandbox
● Sandbox tutors for CS150, 350, 605 will be able to help with Tableau, Lucid Charts, ERD and SQL.
● When you come to see a tutor, either for drop-in or appointment help, be prepared for the meeting.
● Check the CIS Sandbox home page (http://cissandbox.com) to see when Grace, John, Tomas, or Chris are working. The schedule should be finalized by the end of next week.
● Contact the tutors to set up an appointment during their regular hours or outside of them.
● Tutors have limited availability, so prepare for your session by providing as much background information as possible in the email that you send them.

Tableau
https://www.tableau.com/why-tableau/what-is-tableau
https://www.tableau.com/learn/training/20222
● Getting Started
● Tableau Prep
● Connecting to Data

Lucid Charts
https://lucid.app/documents#/dashboard
https://www.lucidchart.com/blog/getting-started-in-lucidchart
https://www.youtube.com/watch?v=kM1B-jQUeVI

Welcome to CS160
The primary objective of this course is to expose the student to the breadth, depth, versatility and usefulness of data and databases in problem solving. This course will develop the students’ foundational competencies related to data management that allow them to critically analyze complex problems using a variety of data sources and tools and to effectively present their ideas to others.

The key learning objectives of this course are:
1. Understanding how data can support effective problem solving and decision making in specific problem contexts,
2. Understanding how data are stored, organized, managed, and how data can support effective problem solving and decision making in specific problem contexts,
3. Acquiring, cleaning, and structuring data for analysis and decision support,
4. Analyzing the data with relevant tools, and
5. Presenting the results of the analysis effectively to various stakeholder groups.

Data analyst salary and job outlook
● The average base salary for a data analyst in the US was $69,517 as of December 2021, according to Glassdoor.
● This can vary depending on your seniority, where in the US you’re located, and other factors.
● Data analysts are in high demand. The World Economic Forum listed the role as number two in growing jobs in the US [1].
● The Bureau of Labor Statistics also reports related occupations as having extremely high growth rates.
● From 2020 to 2030, operations research analyst positions are expected to grow by 25 percent, market research analysts by 22 percent, and mathematicians and statisticians by 33 percent.
● That’s a lot higher than the total employment growth rate of 7.7 percent.
https://www.coursera.org/articles/data-analytics-projects-for-beginners

Types of Data Analysts
People who perform data analysis might have other titles such as:
● Medical and health care analyst
● Market research analyst
● Business analyst
● Business intelligence analyst
● Operations research analyst
● Intelligence analyst
https://www.coursera.org/articles/data-analytics-projects-for-beginners

What do Data Analysts do?
● Analysts are data storytellers.
● Their mandate is to summarize interesting facts and to use data for inspiration.
● In some organizations those facts and that inspiration become input for human decision-makers.
● But in more sophisticated data operations, data-driven inspiration gets flagged for proper statistical follow-up.
● Good analysts have unwavering respect for the one golden rule of their profession: do not come to conclusions beyond the data (and prevent your audience from doing it, too).
https://hbr.org/2018/12/what-great-data-analysts-do-and-why-every-organization-needsthem#:~:text=Analysts%20are%20data%20storytellers.,input%20for%20human%20decision%2Dmakers.

Analytics for decision-making
● Analysts should lay out the story they’re tempted to tell and poke it from several angles with follow-up investigations to see if it holds water before bringing it to decision-makers.
● The decision-maker should then function as a filter between exploratory data analytics and statistical rigor.
https://hbr.org/2018/12/what-great-data-analysts-do-and-why-every-organization-needsthem#:~:text=Analysts%20are%20data%20storytellers.,input%20for%20human%20decision%2Dmakers.

What do Data Analysts do?
● Good analysts use softened, hedging language. For example, not “we conclude” but “we are inspired to wonder”.
● They also discourage leaders’ overconfidence by emphasizing a multitude of possible interpretations for every insight.
● As long as analysts stick to the facts (saying only “This is what is here.”) and don’t take themselves too seriously, the worst crime they could commit is wasting someone’s time when they run it by them.
https://hbr.org/2018/12/what-great-data-analysts-do-and-why-every-organization-needsthem#:~:text=Analysts%20are%20data%20storytellers.,input%20for%20human%20decision%2Dmakers.

Analytics for decision-making
● If someone with decision responsibility finds the analyst’s exploration promising for a decision they have to make, they then can sign off on a statistician spending the time to do a more rigorous analysis.
● This process indicates why just telling analysts to get better at statistics misses the point in an important way.
● Not only are the two activities separate, but another person sits in between them, meaning it’s not necessarily any more efficient for one person to do both things.
https://hbr.org/2018/12/what-great-data-analysts-do-and-why-every-organization-needsthem#:~:text=Analysts%20are%20data%20storytellers.,input%20for%20human%20decision%2Dmakers.

Excellence in analytics: speed
● The best analysts are lightning-fast coders who can surf vast datasets quickly, encountering and surfacing potential insights faster than those other specialists can say “whiteboard.”
● Speed is their highest virtue, closely followed by the ability to identify potentially useful gems.
● A mastery of visual presentation of information helps, too: beautiful and effective plots allow the mind to extract information faster, which pays off in time-to-potential-insights.
https://hbr.org/2018/12/what-great-data-analysts-do-and-why-every-organization-needsthem#:~:text=Analysts%20are%20data%20storytellers.,input%20for%20human%20decision%2Dmakers.

Project Portfolio
● If you’re getting ready to launch a new career as a data analyst, you will find that job listings ask for experience. But how do you get experience if you’re looking for your first data analyst job?
● This is where your portfolio comes in.
● The projects you include in your portfolio demonstrate your skills and experience (even if it’s not from a previous data analytics job) to hiring managers and interviewers.
https://www.coursera.org/articles/what-does-a-data-analyst-do-a-career-guide
● What do I put in my portfolio if I don’t have work experience? If you’re just starting out and don’t yet have work experience as a data analyst, include projects you’ve completed on your own or as part of your coursework.
● Start with small projects, and add them as you go.
Once you learn how to scrape a website, for example, you can add a screenshot of your code, as well as a short paragraph explaining what you did.
https://www.coursera.org/articles/what-does-a-data-analyst-do-a-career-guide

How to build a data analytics portfolio
What to include in your portfolio
A simple portfolio should include at least two sections:
● An “About me” section
● Data analytics projects
https://www.coursera.org/articles/what-does-a-data-analyst-do-a-career-guide

About me
The “About me” page gives you an opportunity to introduce prospective employers to who you are, what you do, and why it’s important to you. You can use this section to explain:
● How you got started in data analysis
● What about data interests you most
● Where your passions lie in relation to data analytics
This is also a great place to include your contact details (if you don’t have them on a separate page) and links to your social media accounts.
https://www.coursera.org/articles/what-does-a-data-analyst-do-a-career-guide

Projects
● Visualize data to tell a story: Create a chart, map, graph, or other visualization to make your data easier to understand.
● Communicate complex ideas: Consider writing a blog post that outlines your process or explains a difficult data concept to highlight your communication skills.
● Collaborate with others: If you’ve worked on a group project, be sure to include it.
● Use data analysis tools: Share projects that show off your ability to use SQL, Python, R, Tableau, etc.
https://www.coursera.org/articles/what-does-a-data-analyst-do-a-career-guide

Projects
The bulk of your portfolio will likely comprise a series of projects and case studies that demonstrate your key skills. In general, your portfolio should showcase your best or latest work.
Try to include projects that highlight your ability to:
● Scrape data from websites: Show your code, and use hashed comments to explain your thinking.
● Clean data: Take a dataset with missing, duplicate, or other problematic data, and walk through your data cleaning process.
● Perform different types of analysis: Use data to perform diagnostic, descriptive, predictive, and prescriptive analysis.
https://www.coursera.org/articles/what-does-a-data-analyst-do-a-career-guide

Excellence in analytics: speed
● As analysts mature, they’ll begin to get the hang of judging what’s important in addition to what’s interesting, allowing decision-makers to step away from the middleman role.
● The result is that the business gets a finger on its pulse and eyes on previously-unknown unknowns.
● This generates the inspiration that helps decision-makers select valuable quests to send statisticians and ML engineers on, saving them from mathematically-impressive excavations of useless rabbit holes.
https://hbr.org/2018/12/what-great-data-analysts-do-and-why-every-organization-needsthem#:~:text=Analysts%20are%20data%20storytellers.,input%20for%20human%20decision%2Dmakers.

Business Objectives
● What are the business objective(s)?
● What are the business metrics to measure?
● What are the business outputs and outcomes?
● Why are they important?
● How should you answer the question(s)?
● How do you know when the question(s) are answered?
https://hbr.org/2018/12/what-great-data-analysts-do-and-why-every-organization-needsthem#:~:text=Analysts%20are%20data%20storytellers.,input%20for%20human%20decision%2Dmakers.
https://www.dailymail.co.uk/news/article-3168408/Where-s-sun-s-lolly-Rising-temperatures-lead-boomsales-icecreams-beer-cider.html

Ice Cream Sales (Monthly)
http://www.mas.ncl.ac.uk/~nag48/teaching/MAS1403/notes7slr.pdf

Sales Distribution
1. The sales distribution of various categories relative to each other
2. Their respective profit margins
3. Each category’s sub-category product sales
4. Sales growth of categories over the years
https://www.analyticsvidhya.com/blog/2017/07/data-visualisation-made-easy/

PROJECT
As part of this course, students will undertake a real-world data project. The project will consist of addressing several questions and requirements using data and analytic approaches and tools. The project will be carried out in multiple phases, each requiring a mandatory in-class presentation. The following questions will help you to consider your project:
● What are the business objective(s)?
● What are the business metrics to measure?
● What are the business outputs and outcomes?
● Why are they important?
● How should you answer the question(s)?
● How do you know when the question(s) are answered?

Class Project and Presentations
Throughout the course, we will be using the same dataset, as this gives you the opportunity to become well versed in the data.
With this dataset you will complete the following:
Phase I
● Initial analysis in Tableau that will provide the foundation for your subsequent analyses
● Tools: Group Tableau Dashboard, slides and presentation
Phase II
● Incorporating business objectives, consulting for context, storytelling and visualization concepts
● Tools: Group Tableau Dashboard, Lucid Charts, slides and presentation
Phase III
● Final Project Presentation
● Final Group Dashboard (Tableau, Lucid Charts, slides and mandatory in-class presentation)

Dataset criteria
● At least two data sets with a common key to be joined*
● A data set that needs some level of cleaning
● At least one field of one of the data sets would need some prep
● A data set that lends itself to some basic stats
● The data sets should have more than seven data points (variables/attributes) and at least one thousand rows/records
● Data sets that provide a rich story to be told, that is, rich in patterns and/or some statistical correlation
https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/c07.xhtml#usec0003

Assignment 1 [Phase I]
● Tools: PowerPoint, Tableau
Purpose of Assignment (WHY)
● Communicating complex data information and insights through storytelling with data visualizations using dashboards, scorecards, spatial data representations and use of annotations is an important aspect of data analytics.
● In various roles you must be able to evaluate, propose and implement appropriate visualizations for a specified audience using key informational design concepts.
● In this assignment you will conduct an initial analysis of the data set of your choosing.
● Your group will be using the same dataset, as this gives you the opportunity to become well versed in the data.

Assignment 1 - Group Activity
Tools: Tableau, PowerPoint
Assignment Description (WHAT)
● Once you understand the data, it’s much easier to design a dashboard knowing the key variables.
● First, review the data set and start looking for valuable information to determine key drivers of the analysis.
● Analyze the variables/attributes/columns, information and seek correlations that complete a basic analysis.
● In your analysis, answer the following questions.

Collaborate, Collaborate, Collaborate
● We are using groups for assignments and projects.
● Our colleagues are sometimes our best resource when learning a new theory or application.
● Therefore, it is important that you get to know your group members as soon as possible.
● Use your group as a resource throughout the course when you encounter software, technical, or any other questions about the project.
https://www.coursera.org/articles/data-analytics-projects-for-beginners

In Class Activity
1. Class members introduction
2. Find your group for projects
3. Dataset review
4. Lucid Chart set up

Name:
From:
Degree:
Class:
[ADD PHOTO HERE]
Most Interesting or Unusual Job:
Hobbies:
Recent Achievement or News:
Fun Facts:
On the Bucket List:
Favorite Meal:
Major Goals:

Popular Analytics Data Sets
https://careerfoundry.com/en/blog/data-analytics/data-analytics-portfolio-examples/
https://dataworksai.com/?p=389

Part I: Essentials

Concepts and Terminology
Datasets
● Collections or groups of related data are generally referred to as datasets.
● Each group or dataset member (datum) shares the same set of attributes or properties as others in the same dataset.
● Some examples of datasets are:
  • tweets stored in a flat file
  • a collection of image files in a directory
  • an extract of rows from a database table stored in a CSV formatted file
  • historical weather observations that are stored as XML files
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec2

Analytics Terminology
● Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia.
● A dataset is a structured collection of data generally associated with a unique body of work.
● A database is an organized collection of data stored as multiple datasets. Those datasets are generally stored and accessed electronically from a computer system that allows the data to be easily accessed, manipulated, and updated.
https://www.usgs.gov/faqs/what-are-differences-between-data-dataset-and-database

Analytics Terminology
Data Science vs. Data Analytics
Data science is the process of building, cleaning, and structuring datasets to analyze and extract meaning. Data analytics, on the other hand, refers to the process and practice of analyzing data to answer questions, extract insights, and identify trends. You can think of data science as a precursor to data analysis. If your dataset isn’t structured, cleaned, and wrangled, how will you be able to draw accurate, insightful conclusions? Below is a deeper dive into each field’s role in business.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-95482430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4

Data Analytics in Business
The main goal of business analytics is to extract meaningful insights from data that an organization can use to inform its strategy and, ultimately, reach its objectives.
Business analytics can be used for:
● Budgeting and forecasting: By assessing a company’s historical revenue, sales, and costs data alongside its goals for future growth, an analyst can identify the budget and investments required to make those goals a reality.
● Risk management: By understanding the likelihood of certain business risks occurring (and their associated expenses), an analyst can make cost-effective recommendations to help mitigate them.
● Marketing and sales: By understanding key metrics, such as lead-to-customer conversion rate, a marketing analyst can identify the number of leads their efforts must generate to fill the sales pipeline.
● Product development (or research and development): By understanding how customers reacted to product features in the past, an analyst can help guide product development, design, and user experience in the future.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-95482430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4

Analytics Terminology
● Data preparation (data wrangling) is the process of cleaning and transforming raw data prior to processing and analysis. It often involves reformatting data, making corrections to data, and combining data sets to enrich data.
● Data preparation is often a lengthy undertaking for data professionals or business users, but it is an essential prerequisite for putting data in context, turning it into insights, and eliminating bias resulting from poor data quality.
● For example, the data preparation process usually includes standardizing data formats, enriching source data, and/or removing outliers.
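The data preparation steps named above (standardizing formats, making corrections, removing outliers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the order records, the field names, the two date formats, and the 1.5 × IQR outlier rule are all hypothetical choices made for the example, not part of the course material or of any particular tool.

```python
from datetime import datetime
import statistics

# Hypothetical raw records: mixed date formats, one duplicate row,
# and one amount that looks like a data-entry error.
raw = [
    {"order_id": "A1", "date": "2022-01-05", "amount": "120.50"},
    {"order_id": "A2", "date": "01/06/2022", "amount": "98.00"},
    {"order_id": "A2", "date": "01/06/2022", "amount": "98.00"},
    {"order_id": "A3", "date": "2022-01-07", "amount": "110.25"},
    {"order_id": "A4", "date": "01/08/2022", "amount": "105.00"},
    {"order_id": "A5", "date": "2022-01-09", "amount": "9999.00"},
]


def parse_date(text):
    """Standardize the two date formats assumed in this sample to YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def prepare(records):
    # Step 1: standardize formats (dates to ISO strings, amounts to numbers).
    cleaned = [
        {
            "order_id": r["order_id"],
            "date": parse_date(r["date"]),
            "amount": float(r["amount"]),
        }
        for r in records
    ]

    # Step 2: correct the data by dropping repeated order IDs (keep the first).
    seen = set()
    unique = []
    for r in cleaned:
        if r["order_id"] not in seen:
            seen.add(r["order_id"])
            unique.append(r)

    # Step 3: remove outliers using the 1.5 * IQR rule (one common convention).
    amounts = [r["amount"] for r in unique]
    q1, _, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [r for r in unique if low <= r["amount"] <= high]


prepared = prepare(raw)
for row in prepared:
    print(row)
```

In Tableau Prep the same steps would be done with clean and filter operations rather than code; the point of the sketch is only the order of operations, since outlier detection is more reliable once formats are standardized and duplicates are gone.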
https://www.talend.com/resources/what-is-data-preparation/

Reporting vs Analysis
https://blog.adobe.com/en/publish/2010/10/19/reporting-vs-analysis-whats-the-difference#gs.ndyrvp
● Reporting translates raw data into information.
● Reporting helps companies to monitor their online business and be alerted when data falls outside of expected ranges.
● Reporting raises questions about the business from its end users.
● Reporting provides no or limited context about what’s happening in the data.
● Analysis transforms data and information into insights.
● The goal of analysis is to answer questions by interpreting the data at a deeper level and providing actionable recommendations.
● Through the process of performing analysis you may raise additional questions, but the goal is to identify answers, or at least potential answers that can be tested.
● Context is critical to good analysis.
In summary, reporting shows you what is happening. Analysis focuses on explaining why it is happening and what you can do about it.
https://blog.adobe.com/en/publish/2010/10/19/reporting-vs-analysis-whats-the-difference#gs.ndyrvp

Data Driven Organization
https://www.oreilly.com/radar/being-data-driven-its-all-about-the-culture/

Analytics Value Chain
https://www.oreilly.com/radar/being-data-driven-its-all-about-the-culture/

Driving Value
● Think of the data-driven stages (data > reporting > analysis > decision > action > value) as a series of dominoes.
● If you remove a domino, it can be more difficult or impossible to achieve the desired value.
● Analysis provides specific guidance (recommendations) on what actions to take based on the key insights found in the data.
● Once a recommendation has been made, follow-up is another potent outcome because recommendations demand decisions to be made (go/no go/explore further).
● Decisions precede action.
● Action precedes value.
https://www.oreilly.com/radar/being-data-driven-its-all-about-the-culture/

Analytics Value Chain
Types of Reporting: Canned reports, dashboards, and alerts.
Canned reports
● These are the out-of-the-box and custom reports that you can access within the analytics tool or which can also be delivered on a recurring basis to a group of end users.
● Canned reports are fairly static with fixed metrics and dimensions.
● In general, some canned reports are more valuable than others, and a report’s value may depend on how relevant it is to an individual’s role (e.g., SEO specialist vs. web producer).
https://blog.adobe.com/en/publish/2010/10/19/reporting-vs-analysis-whats-the-difference#gs.njnk2d
Dashboards
● These custom-made reports combine different KPIs and reports to provide a comprehensive, high-level view of business performance for specific audiences.
● Dashboards may include data from various data sources.
● Dashboards can be static or dynamic.
https://blog.adobe.com/en/publish/2010/10/19/reporting-vs-analysis-whats-thedifference#gs.njnk2d
Alerts
● These conditional reports are triggered when data falls outside of expected ranges or some other pre-defined criteria is met.
● Once people are notified of what happened, they can take appropriate action as necessary.
https://blog.adobe.com/en/publish/2010/10/19/reporting-vs-analysis-whats-thedifference#gs.njnk2d

Analytics Value Chain
Ad hoc responses
● Analysts receive requests to answer a variety of business questions, which may be spurred by questions raised by the reporting.
● Typically, these urgent requests are time sensitive and demand a quick turnaround.
● The analytics team may have to juggle multiple requests at the same time.
● As a result, the analyses cannot go as deep or wide as the analysts may like, and the deliverable is a short and concise report, which may or may not include any specific recommendations.
https://blog.adobe.com/en/publish/2010/10/19/reporting-vs-analysis-whats-thedifference#gs.njnk2d
Analysis presentations
● Some business questions are more complex in nature and require more time to perform a comprehensive, deep-dive analysis.
● These analysis projects result in a more formal deliverable, which includes two key sections: key findings and recommendations.
● The key findings section highlights the most meaningful and actionable insights gleaned from the analyses performed.
● The recommendations section provides guidance on what actions to take based on the analysis findings.
https://blog.adobe.com/en/publish/2010/10/19/reporting-vs-analysis-whats-thedifference#gs.njnk2d

Analytics Categories
Descriptive Analytics
● Descriptive analysis is the simplest type of analysis.
● It describes and summarizes a dataset quantitatively.
● It characterizes the sample of data at hand and does not attempt to describe anything about the population from which it comes.
● It can often form the data that is displayed in dashboards, such as number of new members this week or bookings year to date.
https://learning.oreilly.com/library/view/creating-a-datadriven/9781491916902/ch05.html#chap_analysis
● The goal of descriptive analysis is to describe the key features of the sample numerically.
● It should shed light on the key numbers that summarize distributions within the data.
● It may describe or show the relationships among variables with metrics that describe association, or by tables that cross-tabulate counts.
https://learning.oreily.com/library/view/creating-a-data-driven/9781491916902/ch05.html#chap_analysis

Descriptive Analytics Metrics
The simplest but one of the most important measures is:
● Sample size: The number of data points or records in the sample.
Other descriptive metrics include:
● Mean (average): The arithmetic mean of the data: sum of values divided by number of values.
● Median: The 50th percentile.
● Mode: The most frequently occurring value.
● Minimum: The smallest value in the sample (0th percentile).
● Q1: The 25th percentile. The value such that one quarter of the sample has values below this value. Also known as the lower hinge.
● Q3: The 75th percentile. Also known as the upper hinge.
● Maximum: The largest value in the sample (100th percentile).
● Interquartile range: The central 50% of data; that is, Q3 - Q1.
● Range: Maximum minus minimum.
● Standard deviation: Measure of the dispersion from the arithmetic mean of a sample. It is the square root of variance, and its units are the same as the sample data.
● Variance: Another measure of dispersion. It is the average squared difference from the arithmetic mean, which is the square of the standard deviation. Its units are the square of those of the data.
https://learning.oreily.com/library/view/creating-a-data-driven/9781491916902/ch05.html#chap_analysis

Descriptive Analytics
Descriptive analytics seeks answers about:
● "What is happening now?"
● "What has happened in the past?"
Descriptive analytics, also called business intelligence (BI) or operational analytics, is the gateway to advanced analytics.
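The metrics defined above can be computed with Python's standard `statistics` module. The following is a minimal sketch, not a prescribed method: the `describe` helper and the wage figures are hypothetical, the variance follows the "average squared difference" definition given above (population variance, dividing by n; many tools default to the sample version, which divides by n - 1), and quartile conventions differ slightly between tools, so Excel or Tableau may report slightly different Q1 and Q3 values for the same data.

```python
import statistics


def describe(values):
    """Return the descriptive metrics from the lecture for a list of numbers."""
    data = sorted(values)
    # Quartiles using the "inclusive" convention; other tools may differ slightly.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    # Population variance: the average squared difference from the mean.
    variance = statistics.pvariance(data)
    return {
        "sample_size": len(data),
        "mean": statistics.mean(data),
        "median": median,
        "mode": statistics.mode(data),
        "minimum": data[0],
        "q1": q1,
        "q3": q3,
        "maximum": data[-1],
        "interquartile_range": q3 - q1,
        "range": data[-1] - data[0],
        "variance": variance,
        "standard_deviation": variance ** 0.5,
    }


# Hypothetical sample: hourly wages for eight occupations (made-up numbers).
wages = [15, 18, 18, 22, 25, 30, 41, 55]
summary = describe(wages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note how the mean and the median differ for this sample: a few high wages pull the mean upward, which is a typical reason to report both when summarizing pay data.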
In Class Activity: Data Exploration
Data: US Bureau of Labor Statistics https://www.bls.gov/oes/current/oes_nat.htm
Goals: Data exploration, descriptive analytics, presentation, collaboration, communication, knowledge sharing
● Group activity (in person or via Zoom)
● Combine your processes and data
● Add content to Lecture 2 activity file lname1_lname2_cs160_Assignment2
● Save file as lname1_lname2_cs160_Assignment3
● Present results in the classroom on the designated date (PPT slides)

Locate Descriptive Analytics Metrics
● Download Data: downloadable XLS file.
● Locate the metrics defined under Descriptive Analytics Metrics: sample size, mean (average), median, mode, minimum, Q1, Q3, maximum, interquartile range, range, standard deviation, and variance.
https://learning.oreily.com/library/view/creating-a-data-driven/9781491916902/ch05.html#chap_analysis

Descriptive Analytics Activity
● Download Data: downloadable XLS file.
● Partner with one other person (in person or via Zoom)
● Introduce yourselves
● Review/identify 5-10 descriptive metrics covered in the Excel sheet
● Document values
● Find 1-3 metrics you find interesting that were not covered in the lecture
● Google the definitions
● Combine your processes and data
https://learning.oreily.com/library/view/creating-a-data-driven/9781491916902/ch05.html#chap_analysis
● Open a document on your Google Drive or local drive
● Add content to activity from Lecture 2
  ○ lname1_lname2_cs160_Assignment3
  ○ Add title, your name, activity title
  ○ Summarize the metrics in the Excel sheet that you identified from the lectures
  ○ Summarize 1-3 metrics in the Excel sheet that were not covered
  ○ Add definitions for the 1-3 metrics above

Predictive Analytics
● The goal of predictive analytics is to learn about relationships among variables from an existing training dataset and develop a statistical model that can predict values of attributes for new, incomplete, or future data points.
● Predictive analysis can then be used to generate forecasts, that is, future predictions in a time series, which in turn can be used to generate plans of when to manufacture or buy goods, how many to make or buy, when to have them shipped to stores, and so on.
https://learning.oreilly.com/library/view/creating-a-datadriven/9781491916902/ch05.html
● Predictive analysis can also make predictions about which class an object might fall into.
● For instance, given a person’s salary information, credit card purchase history, and history of paying (or not paying) bills, we can predict their credit risk.
● Given a set of tweets that each contain a short movie review and have been labeled by a human as positive ("I loved this movie.") or negative ("This movie sucked."), we can develop a model that predicts the sentiment, positive or negative, for new tweets that the model was not trained on, such as "The movie's special effects were awesome."
https://learning.oreilly.com/library/view/creating-a-data-driven/9781491916902/ch05.html
Predictive Analytics
● Good recommendations of who to date
● Stock prediction software (caveat emptor!): By tracking movements in stock prices and identifying patterns, algorithms can attempt to buy low, sell high, and maximize returns.
● Content apps: Good recommendations of what to watch (Netflix) lead to higher retention and lower churn.
● Social networking: LinkedIn's "People You May Know" increases the user's network effect and provides greater value to the user and more valuable data to the service.
https://learning.oreilly.com/library/view/creating-a-data-driven/9781491916902/ch05.html
Predictive Analytics
Predictions that can drive higher conversion and basket sizes:
● Cross-sell and upsell: Even simple association-based recommendations, such as "Customers Who Bought the Frozen DVD Also Bought The Little Mermaid" (Amazon), lead to higher sales and, for some, make holiday shopping quicker and easier.
● Ads and coupons: Learning an individual's history and predicting an individual's state, interest, or intent can drive more relevant display ads or more effective supermarket coupons.
https://learning.oreilly.com/library/view/creating-a-data-driven/9781491916902/ch05.html
Prescriptive Analytics
● Prescriptive analytics is the upper echelon in the advanced analytics taxonomy. Prescriptive analytics seeks to answer "What should I do now?" or "How should I do it?" or "What caused this to happen (root cause analysis)?"
● Predictive and prescriptive analytics are collectively called advanced analytics and AI, or strategic analytics.
Analytics Strategy
● An analytics strategy is a blueprint for using analytics. It details the analytics capabilities, organizational capabilities, management systems, resources, communication, alignment, execution, and other efforts that enable an organization to reach its goals.
Part II: Applying the Essentials
Activity - Finding your data
● Google Advanced Search Operators https://www.spyfu.com/blog/google-search-operators/
● Tableau Data sets for everyone https://www.tableau.com/learn/articles/free-public-data-sets
● Kaggle Data Sets https://www.kaggle.com/datasets
Activity - Finding data sets
Find data sets for the following data topics (pick any three):
● Trends in digital transformation
● Trends in AI and automation
● Job growth in your area/field of study
● Job openings for data positions
● Trends in cloud computing
● Investments in data and analytics
● Types of analytics used in your field of study
● Others that interest you
Add content to the Data Exploration and Data Blending activity
○ lname1_lname2_cs160_Assignment3
Concepts and Terminology
Data Analysis
● Data analysis is the process of examining data to find facts, relationships, patterns, insights and/or trends.
● The overall goal of data analysis is to support better decision-making.
● Carrying out data analysis helps establish patterns and relationships among the data being analyzed.
https://www.analyticsvidhya.com/blog/2017/07/data-visualisation-made-easy/
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec3
https://www.dailymail.co.uk/news/article-3168408/Where-s-sun-s-lolly-Rising-temperatures-lead-boomsales-icecreams-beer-cider.html
Ice Cream Sales (Monthly)
http://www.mas.ncl.ac.uk/~nag48/teaching/MAS1403/notes7slr.pdf
Sales Distribution
1. The sales distribution of the various categories relative to each other
2. Their respective profit margins
3. Each category's sub-category product sales
4. Sales growth of categories over the years
https://www.analyticsvidhya.com/blog/2017/07/data-visualisation-made-easy/
Data & Analytics Concepts and Terminology
Part I - Essentials
Concepts and Terminology
Data Analytics
● Different kinds of organizations use data analytics tools and techniques in different ways.
● In business-oriented environments, data analytics results can lower operational costs and facilitate strategic decision-making.
● In the scientific domain, data analytics can help identify the cause of a phenomenon to improve the accuracy of predictions.
● In service-based environments like public sector organizations, data analytics can help strengthen the focus on delivering high-quality services by driving down costs.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec3
Concepts and Terminology
Data Analytics and Data Science
● Enterprises are collecting, procuring, storing, curating and processing increasing quantities of data.
● Find new insights that can drive more efficient and effective operations
● Provide management the ability to steer the business proactively
● Allow the C-suite to better formulate and assess their strategic initiatives
● Look for new ways to gain a competitive edge
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec3
Innovation
● Companies focus outward, looking to find new customers and to keep existing customers from defecting to marketplace competitors.
● They offer new products and services and deliver increased value propositions to customers.
● Organizations can gain a competitive edge via digitization and innovation.
● The need for techniques and technologies that can extract meaningful information and insights has increased.
● Computational approaches, statistical techniques and data warehousing have advanced significantly.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec3
Digitization
● Digital mediums have replaced physical mediums as the de facto communications and delivery mechanism.
● The use of digital artifacts saves both time and cost, as distribution is supported by the vast pre-existing infrastructure of the Internet.
● As consumers connect to a business through their interaction with these digital substitutes, an opportunity arises to collect further "secondary" data.
● Collecting secondary data for data mining (e.g., requesting a customer to provide feedback).
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec3
Affordable Technology
● Technology capable of storing and processing large quantities of diverse data has become increasingly affordable.
● Data solutions often leverage open-source software that executes on commodity hardware, further reducing costs.
● The combination of commodity hardware and open-source software has virtually eliminated the advantage that large enterprises used to hold.
● Technology becomes the platform upon which the business executes.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec3
Affordable Technology
● Data storage prices have declined significantly over the past 20 years.
● From a business standpoint, using affordable technology and commodity hardware to generate analytic results that further optimize the execution of business processes is the path to competitive advantage.
● The use of commodity hardware makes the adoption of data solutions accessible to businesses without large capital investments.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec3
Part II: Applying the Essentials
In Class Activity - I Tableau Set up
You will need at least 2 Tableau accounts.
● The first is for the downloaded Tableau Desktop. Sign up here: https://www.tableau.com/academic/teaching
● The second is Tableau Public: https://public.tableau.com/en-us/s/
In Class Group Activity - 2 Data Exploration
Data: US Bureau of Labor Statistics https://www.bls.gov/oes/current/oes_nat.htm
Goals: Data Exploration, Data Blending, Presentation, Collaboration, Communication, Knowledge sharing
● Work with your assigned group members
● Combine your processes and data
● Save the file as lname1_lname2_cs160_Assignment1
● Present results in the classroom on the designated date (PPT slides)
Data Exploration
Data: US Bureau of Labor Statistics
● May 2020 National Occupational Employment and Wage Estimates
● These estimates are calculated with data collected from employers in all industry sectors in metropolitan and nonmetropolitan areas in every state and the District of Columbia.
● Additional information, including the hourly and annual 10th, 25th, 75th, and 90th percentile wages, is available.
● Download Data: downloadable XLS file.
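The percentile wages in the BLS file correspond to the descriptive metrics defined for this activity (median, Q1, Q3, interquartile range, and so on). The following is a minimal Python sketch of how those metrics can be computed with the standard library. The wage values are made up for illustration and are not BLS figures, the `describe` helper is a hypothetical name, and the quartile method (`"inclusive"`) is an assumption: Excel and Tableau may use slightly different interpolation rules and report slightly different quartiles.

```python
import statistics

# Hypothetical annual wages in USD; illustrative values only, not BLS data.
wages = [42000, 48000, 51000, 51000, 60000, 75000, 98000]


def describe(values):
    """Return the descriptive metrics from the lecture for a list of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    # method="inclusive" is an assumed choice; other tools may interpolate differently.
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "sample_size": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "minimum": ordered[0],
        "q1": q1,
        "q3": q3,
        "maximum": ordered[-1],
        "iqr": q3 - q1,
        "range": ordered[-1] - ordered[0],
        # Average squared difference from the mean (population variance).
        "variance": statistics.pvariance(ordered),
        # Square root of the variance, in the same units as the data.
        "std_dev": statistics.pstdev(ordered),
    }


summary = describe(wages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The sketch uses the population formulas (`pvariance`, `pstdev`) because the lecture defines variance as the average squared difference from the mean; `statistics.variance` and `statistics.stdev` would give the sample versions instead.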
Data Exploration
Data: US Bureau of Labor Statistics
https://blackboard.bentley.edu/webapps/blackboard/execute/content/file?cmd=view&mode=designer&content_id=_1798627_1&course_id=_24334_1
● Find your major: Major Occupational Groups
● Find your occupation: use the text search option
Data Exploration & Blending
Data: NerdWallet
● https://www.nerdwallet.com/cost-of-living-calculator/compare/boston-ma-vs-colorado-springs-co
● Review background (About the data)
● Use the cost of living calculator functionality
○ Where you live
○ Where you want to live
○ Income after graduation
● Enter salary information
● Enter current area of residence (Boston, MA)
● Enter desired area of residence (Colorado Springs, CO)
● Review results
Data Exploration & Blending
Data: NerdWallet & US Bureau of Labor Statistics
● https://www.bls.gov/oes/current/oes_nat.htm
● https://www.nerdwallet.com/cost-of-living-calculator/compare/boston-ma-vs-colorado-springs-co
● Review background (About the data)
● Use the cost of living calculator functionality
● Find your major and the associated salary for your profession
● Provide at least 3 cities with a reasonable cost of living
● Combine your processes and data
Results
● Open a document on your Google Drive
● Call it lname1_lname2_cs160_group#_assignment1
● Combine your processes and data
○ Occupation, salary, career growth, cost-of-living results
Answer the following questions
● What does the salary data tell you?
● What does the cost of living calculator tell you?
● After graduation what steps can you take?
Assignments 1&2&3
Goals: Data Exploration, Data Blending, Descriptive Statistics, Process Mapping, Collaboration, Storytelling, and Presentation
Assignments 1&2
1. Explore information about the data, professions, and entry-level and manager position salaries from the United States Bureau of Labor Statistics (USBLS) https://www.bls.gov/oes/current/oes_nat.htm
2. Explore a total of six descriptive statistics. Learn about three new descriptive statistics in the USBLS data that were not covered in class.
3. Explore information about the data and identify three cities with a reasonable cost of living from NerdWallet https://www.nerdwallet.com/cost-of-living-calculator/compare/boston-ma-vs-colorado-springs-co
4. Tell your team members the story behind your choice of location: the reasoning, benefits, and affordability.
5. Create a process map using Lucid Charts to outline steps 1 through 4.
6. Identify websites with rich datasets available for projects.
7. Include links to 3-5 analytics-rich datasets you can use for future assignments.
8. Prepare a slide deck that includes an "about you" slide and a storytelling narrative for steps 1-5.
9. Present your team results in an in-class presentation (15-minute presentation with 5 minutes for questions).
Homework
Mandatory: Visit the CIS Sandbox in person or attend a review session via Zoom, and get started on the following:
1. Get Tableau set up
2. Get a Lucid account
3. Get SQL server set up
Optional: Visit the Data.World website
Lucid Charts
In Class Activity
Map the dataset 1 and 2 activity process in Lucid Charts: Data Discovery to Presentation
Process Mapping Guidelines
Data Discovery to Presentation
https://www.riversideca.gov/audit/pdf/Process%20Mapping%20Guidelines.pdf
In Class Activity
Map the Lecture 2 activity process in Lucid Charts: Data Discovery to Presentation
Data Exploration - US Bureau of Labor Statistics
● Partner with your group members
● Introduce yourselves
● https://www.bls.gov/oes/current/oes_nat.htm
● Review background (About the data)
● Download the data
● Review the data in Excel
● Navigate to the USBLS web page
● Find your occupation: use the text search option
● Find your major and the associated salary for your profession
● Find the salary associated with the manager position
● Combine your processes and data
NerdWallet data - Data Blending
● https://www.nerdwallet.com/cost-of-living-calculator/compare/boston-ma-vs-colorado-springs-co
● Review background (About the data)
● Use the cost of living calculator functionality
● Find your major and the associated salary for your profession from activity 1a
● Provide at least 3 cities with a reasonable cost of living
● Combine your processes and data
● Summarize
Presenting Results
● Open a document on your Google Drive
● Call it lname1_lname2_cs160A1
● Combine your processes and data from activity 1a and 1b
● Occupation, salary, career growth, cost-of-living results
● What does the salary data tell you?
○ Hint: Information from Activity 1a
● What does the cost of living calculator tell you?
○ Hint: Information from Activity 1a
● After graduation what steps can you take?
○ Hint: Blend information from Activity 1a and 1b
In Class Activity
Map your process in Lucid Charts: Data Discovery to Presentation
Submit lname_fname_lecture4a_CS160
Data Types
● The data processed by Big Data solutions can be human-generated or machine-generated.
● It is ultimately the responsibility of machines to generate the analytic results.
● Human-generated data is the result of human interaction with systems, such as online services and digital devices.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml
Data Types
● Machine-generated data is generated by software programs and hardware devices in response to real-world events.
● For example:
○ a log file captures an authorization decision made by a security service
○ a point-of-sale system generates a transaction against inventory to reflect items purchased by a customer
● From a hardware perspective:
○ an example of machine-generated data is the information conveyed by the numerous sensors in a cellphone, including position and cell tower signal strength
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml
Data Types
● Human-generated and machine-generated data can come from a variety of sources and be represented in various formats or types.
● The primary types of data that are processed by Big Data solutions are:
○ structured data
○ unstructured data
○ semi-structured data
● These data types refer to the internal organization of data and are sometimes called data formats.
● Apart from these three fundamental data types, another important type of data in Big Data environments is metadata.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml
Data Types - Structured Data
● Structured data conforms to a data model or schema and is often stored in tabular form.
● It is used to capture relationships between different entities and is therefore most often stored in a relational database.
● Structured data is frequently generated by enterprise applications and information systems like ERP and CRM systems.
● Due to the abundance of tools and databases that natively support structured data, it rarely requires special consideration with regard to processing or storage.
● Examples of this type of data include banking transactions, invoices, and customer records.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml
Data Types - Unstructured Data
● Data that does not conform to a data model or data schema is known as unstructured data.
● It is estimated that unstructured data makes up 80% of the data within any given enterprise.
● Unstructured data has a faster growth rate than structured data.
● Unstructured data is textual or binary and is often conveyed via files that are self-contained and non-relational.
○ A text file may contain the contents of various tweets or blog postings.
○ Binary files are often media files that contain image, audio or video data.
○ Technically, both text and binary files have a structure defined by the file format itself, but this aspect is disregarded.
○ The notion of being unstructured is in relation to the format of the data contained in the file itself.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml
Data Types - Unstructured Data
● Special-purpose logic is usually required to process and store unstructured data.
● For example, to play a video file, it is essential that the correct codec (coder-decoder) is available.
● Unstructured data cannot be directly processed or queried using SQL.
● If it must be stored within a relational database, it is stored in a table as a Binary Large Object (BLOB).
● Alternatively, a Not-only SQL (NoSQL) database is a non-relational database that can be used to store unstructured data alongside structured data.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml
Data Types - Semi-Structured Data
● Semi-structured data has a defined level of structure and consistency, but is not relational in nature.
● Semi-structured data is hierarchical or graph-based.
● This kind of data is commonly stored in files that contain text.
● XML and JSON files are common forms of semi-structured data.
● Due to the textual nature of this data and its conformance to some level of structure, it is more easily processed than unstructured data.
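To make the point about semi-structured data concrete, here is a small Python sketch that reads the same hierarchical record from JSON and from XML using only the standard library. The customer and order fields are hypothetical and chosen purely for illustration, as are the two helper function names.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record expressed as JSON: nested objects and a list.
json_text = """
{"customer": {"name": "Ada",
              "orders": [{"id": 1, "total": 19.99},
                         {"id": 2, "total": 5.00}]}}
"""

# The same hypothetical record expressed as XML: nested elements with attributes.
xml_text = """
<customer name="Ada">
  <order id="1" total="19.99"/>
  <order id="2" total="5.00"/>
</customer>
"""


def json_order_totals(text):
    """Walk the JSON hierarchy and collect each order total."""
    data = json.loads(text)
    return [order["total"] for order in data["customer"]["orders"]]


def xml_order_totals(text):
    """Walk the XML hierarchy and collect each order total."""
    root = ET.fromstring(text.strip())
    # XML attribute values are strings, so convert them to numbers.
    return [float(order.get("total")) for order in root.findall("order")]


print(json_order_totals(json_text))
print(xml_order_totals(xml_text))
```

Both formats carry their structure in the text itself (keys, tags, nesting), which is why a general-purpose parser is enough to process them, with no fixed table schema required.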
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml
Data Types - Semi-Structured Data
XML example: https://www.javatpoint.com/xml-example
JSON example: https://www.javatpoint.com/json-example
Data Types - Metadata
● Metadata provides information about a dataset's characteristics and structure.
● This type of data is mostly machine-generated and can be appended to data.
● The tracking of metadata is crucial to Big Data processing, storage and analysis because it provides information about the pedigree of the data and its provenance during processing.
● Examples of metadata include:
○ XML tags providing the author and creation date of a document
○ Attributes providing the file size and resolution of a digital photograph
● Big Data solutions rely on metadata, particularly when processing semi-structured and unstructured data.
https://www.javatpoint.com/json-example
Database Terminology
● Datasets
○ Collections or groups of related data are generally referred to as datasets. Each group or dataset member (datum) shares the same set of attributes or properties as others in the same dataset. Some examples of datasets are:
○ tweets stored in a flat file
○ a collection of image files in a directory
○ an extract of rows from a database table stored in a CSV-formatted file
○ historical weather observations that are stored as XML files
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec1
Database Terminology
● Tables: Tables contain all the necessary information in the database. They have a format similar to a spreadsheet, with rows (records) and columns (fields).
● Relational databases such as MySQL, Postgres, and Oracle are often used to store structured data.
● RDBMS: The systems that manage relational databases, which store transactional records or data arranged in a predetermined format, are called relational database management systems (RDBMS).
Why Databases - Video
https://www.youtube.com/watch?v=wR0jg0eQsZA
Online Transaction Processing (OLTP)
● An OLTP system is a software system that processes transaction-oriented data.
● The term "online transaction" refers to the completion of an activity in real time, rather than in a batch.
● OLTP systems store operational data that is normalized.
● This data is a common source of structured data and serves as input to many analytic processes.
● OLTP systems, for example a point-of-sale system, execute business processes in support of corporate operations.
● OLTP systems perform transactions against a relational database.
● Big Data analysis results can be used to augment OLTP data stored in the underlying relational databases.
● The queries supported by OLTP systems consist of simple insert, delete and update operations with sub-second response times.
● Examples include ticket reservation systems, banking and point-of-sale systems.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch04.xhtml#ch04lev1sec1
Online Analytical Processing (OLAP)
● Online analytical processing (OLAP) systems are used for processing data analysis queries.
● OLAP systems form an integral part of business intelligence, data mining and machine learning processes.
● They are relevant to Big Data in that they can serve as both a data source and a data sink that is capable of receiving data.
● They are used in diagnostic, predictive and prescriptive analytics.
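The difference between transaction-oriented and analysis-oriented queries can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the two query shapes: SQLite is a small embedded relational database, not an OLAP system, and the `sales` table, its columns, the store names, and both helper functions are hypothetical.

```python
import sqlite3

# In-memory relational database holding a single hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        id         INTEGER PRIMARY KEY,
        store      TEXT    NOT NULL,
        item       TEXT    NOT NULL,
        quantity   INTEGER NOT NULL,
        unit_price REAL    NOT NULL
    )
    """
)


def record_sale(store, item, quantity, unit_price):
    """OLTP-style operation: one small insert, committed as a transaction."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute(
            "INSERT INTO sales (store, item, quantity, unit_price) "
            "VALUES (?, ?, ?, ?)",
            (store, item, quantity, unit_price),
        )


def revenue_by_store():
    """Analysis-style query: aggregate over many rows, grouped by store."""
    rows = conn.execute(
        "SELECT store, SUM(quantity * unit_price) "
        "FROM sales GROUP BY store ORDER BY store"
    ).fetchall()
    return dict(rows)


# Each point-of-sale event is recorded as its own short transaction.
record_sale("Boston", "coffee", 2, 3.50)
record_sale("Boston", "bagel", 1, 2.00)
record_sale("Waltham", "coffee", 4, 3.50)

print(revenue_by_store())
```

The insert touches one row and must be fast, while the aggregate reads every row. That contrast is the reason analytical workloads are usually moved to separate systems whose data structures are organized for analysis.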
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch04.xhtml#ch04lev1sec1
Online Analytical Processing (OLAP)
● OLAP systems perform long-running, complex queries against a multidimensional database whose structure is optimized for performing advanced analytics.
● OLAP systems store historical data that is aggregated and denormalized to support fast reporting capability.
● They further use databases that store historical data in multidimensional structures and can answer complex queries based on the relationships between multiple aspects of the data.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch04.xhtml#ch04lev1sec1
Extract Transform Load
● Extract Transform Load (ETL) is a process of loading data from a source system into a target system.
● The source system can be a database, a flat file, or an application.
● Similarly, the target system can be a database or some other storage system.
● ETL represents the main operation through which data warehouses are fed data.
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch04.xhtml#ch04lev1sec1
Extract Transform Load
● Required data is first obtained or extracted from the sources, after which the extracts are modified or transformed by the application of rules.
● Finally, the data is inserted or loaded into the target system.
Data Warehouses
● A data warehouse is a central, enterprise-wide repository consisting of historical and current data.
● Data warehouses are heavily used by BI to run various analytical queries, and they usually interface with an OLAP system to support multidimensional analytical queries.
● Data pertaining to multiple business entities from different operational systems is periodically extracted, validated, transformed and consolidated into a single denormalized database.
● With periodic data imports from across the enterprise, the amount of data contained in a given data warehouse will continue to increase.
● Over time this leads to slower query response times for data analysis tasks. To resolve this shortcoming, data warehouses usually contain optimized databases, called analytical databases, to handle reporting and data analysis tasks. An analytical database can exist as a separate DBMS, as in the case of an OLAP database.
Data Warehouses
https://learning.oreilly.com/library/vi ew/big-datafundamentals/9780134291185/ch0
Customer Database
https://www.w3schools.com/sql/sql_syntax.asp
NorthWind Database
https://www.w3schools.com/sql/sql_syntax.asp
SQL Data Types
https://www.w3schools.com/sql/sql_datatypes.asp
W3Schools
● Access the NorthWind database
● Go to https://www.w3schools.com/sql/sql_ref_database.asp
Entity Relationship Diagram
https://www.youtube.com/watch?v=QpdhBUYk7Kk
Business Metrics
● So what metrics do companies care about?
● The sole purpose of businesses, according to the Nobel-winning economist Milton Friedman, is to maximize profits for shareholders.
● The ultimate goal of any project within a business is, therefore, to increase profits, either directly or indirectly:
○ Directly, such as increasing sales (conversion rates) and cutting costs
○ Indirectly, such as higher customer satisfaction and increasing time spent on a website
https://learning.oreilly.com/library/view/designing-machine-learning/9781098107956/ch02.html
Business Objectives
● We first need to consider the objectives of the proposed projects.
● When working on a project, data analysts and data scientists tend to care about the metrics they can measure:
○ Performance of their ML models, such as accuracy, F1 score, inference latency, etc.
● They get excited about improving their model's accuracy from 94% to 94.2% and might spend a ton of resources (data, compute, and engineering time) to achieve that.
● But the truth is: most companies don't care about the fancy ML metrics or complex visualization.
● They don't care about increasing a model's accuracy from 94% to 94.2% unless it moves some business metrics.
● A pattern in many short-lived ML projects and visualization projects is that the data analysts and data scientists become too focused on hacking ML metrics or complex visualization without paying attention to business metrics.
● Their managers, however, only care about business metrics and, after failing to see how an ML project can help push their business metrics, kill the projects prematurely (and possibly let go of the data science team involved).
https://learning.oreilly.com/library/view/designing-machine-learning/9781098107956/ch02.html
Business Metrics - ML metrics example
● Imagine that you work for an ecommerce site that cares about purchase-through rate.
● You want to move your recommender system from batch prediction to online prediction.
● You might reason that online prediction will enable recommendations more relevant to users right now, which can lead to a higher purchase-through rate.
● You can even do an experiment to show that online prediction can improve your recommender system's predictive accuracy by X%.
● Historically on your site, each percent increase in the recommender system's predictive accuracy led to a certain increase in purchase-through rate.
Business Metrics - ML metrics example
● One of the reasons why predicting ad click-through rates and fraud detection are among the most popular use cases for ML today is that it's easy to map ML models' performance to business metrics:
○ Every increase in click-through rate results in actual ad revenue, and every fraudulent transaction stopped results in actual money saved.
https://learning.oreilly.com/library/view/designing-machine-learning/9781098107956/ch02.html
Mapping Business Metrics to ML metrics
● Many companies create their own metrics to map business metrics to ML metrics.
● For example, Netflix measures the performance of their recommender system using take-rate: the number of quality plays divided by the number of recommendations a user sees.
● The higher the take-rate, the better the recommender system.
● Netflix also put a recommender system's take-rate in the context of their other business metrics, like total streaming hours and subscription cancellation rate.
● They found that a higher take-rate also results in higher total streaming hours and lower subscription cancellation rates.
PROJECT PLANNING
Even the most complex project can be tackled successfully if broken down into a series of steps. This sequence will help you to consider your project in stages:
1. Understand Overall Project
2. Detail Objectives
3. Brainstorm Resources to Complete Deliverable
4. Establish Project Timeline in Phases
5. Research: Identify What Resources Are Available
6. Analyze Research/Findings
7. Outline Finished Product
8. Write/Compile Document Presentation
Exploratory Data Analysis (EDA)
● Classical statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples.
● In 1962, John W. Tukey called for a reformation of statistics in his seminal paper "The Future of Data Analysis" [Tukey-1962].
● He proposed a new scientific discipline called data analysis that included statistical inference as just one component.
● Tukey forged links to the engineering and computer science communities (he coined the terms bit, short for binary digit, and software), and his original tenets are surprisingly durable and form part of the foundation for data science.
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html
Additional Data Camp resources
Follow this link to learn about Data Camp.
These practice assignments can help you learn, practice and build your datasheet skills in Tableau.
These Data Camp Training Resources can help you build other Tableau skills.
You will have to create a user in Data Camp before beginning.
Popular Analytics Data Sets
● Google Dataset Search: Sample dataset: Global price of coffee, 1990-present
● Kaggle: Sample dataset: Daily temperature of major cities
● Data.Gov: Sample dataset: Lobster Report for Transshipment and Sales
● Datahub.io: Sample dataset: Average mass of glaciers since 1945
● UCI Machine Learning Repository: Sample dataset: Behavior of urban traffic in Sao Paulo, Brazil
● Earth Data: Sample dataset: Environmental conditions during fall moose hunting season in Alaska, 2000-2016
● FBI Crime Data Explorer: FBI Crime Data Explorer
https://careerfoundry.com/en/blog/data-analytics/where-to-find-free-datasets/

Popular Analytics Data Sets
● Pro Football Reference: Contains all-time statistics from NFL players and teams (Pro Football Stats, History, Scores, Standings, Playoffs, Schedule & Records | Pro-Football-Reference.com)
● Kaggle: Contains a library of various user-submitted data sets (Find Open Datasets and Machine Learning Projects | Kaggle)
● Data.gov: Contains datasets from various US agencies (Data.gov)
● Academic Torrents: Provides datasets from academic papers (Search - Academic Torrents)
● Nasdaq Data: Provides financial and economic datasets (Search | Nasdaq Data Link)
● World Bank Open Data: economic datasets from around the world, ranging across several topics https://data.worldbank.org/
● NASA: datasets covering the agency’s scientific/astronomic discoveries https://data.nasa.gov/
● WHO Global Health Observatory: provides datasets from medical research https://www.who.int/data/gho
● Spotify: datasets on preferences in music genres/podcasts https://research.atspotify.com/datasets/
● OECD: datasets on different countries’ state of living and financial states https://stats.oecd.org/
Shared by Ampie S (CS160 Spring2022)

PROJECT PHASE I

Module 2 Learning objectives
● Describe what Tableau is
● Explain the basic functions in Tableau
https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/c07.xhtml#usec0003

Tableau
https://www.tableau.com/learn/get-started
1. Click on Free training
2. Enter login credentials or create one to access Getting Started
https://www.tableau.com/learn/tutorials/on-demand/getting-started?playlist=391099

Connect to Data
https://www.tableau.com/learn/tutorials/on-demand/getting-started-part3?playlist=391099
https://learning.oreilly.com/library/view/practical-tableau/9781491977309/ch08.html#an_introduction_to_aggregation_in_tablea

Data Preparation
Data Preparation – The Data Interpreter (4:29)
See Tableau Public’s ideal data structure, and learn how to use the Data Interpreter to clean data
Covers:
● How your data should (ideally) be structured
● How to clean your data using the Data Interpreter
Data: World Bank CO2 (.xlsx)
https://public.tableau.com/s/resources

Data Preparation
Data Preparation – Pivoting your Data (4:54)
Learn how to pivot your data structure in Tableau
Covers:
● Why you might need to pivot your data structure
● How to use Tableau Public’s pivot function
Data: World Bank CO2 (.xlsx)
https://public.tableau.com/s/resources

Data Preparation
Data Preparation – Splitting your Data (2:26)
Learn how to split a field into multiple fields in Tableau
Covers:
● Why you might need to split a field in Tableau Public
● How to use Tableau Public’s split function
Data: World Bank CO2 (.xlsx)
https://public.tableau.com/s/resources

Data Preparation
Data Preparation – Joins and Unions (6:28)
Learn how to join multiple data sets together in Tableau
Covers:
● What are joins and unions
● How to join two data sets together
● How to union multiple data sets
Data: World Bank CO2 (.xlsx)
https://public.tableau.com/s/resources

Charts
Learn about the logic of how Tableau Public creates charts
Covers:
● Overview of Dimensions and Measures
● Overview of Columns and Rows shelf
● Overview of the Marks card
Data: World Bank CO2 (.xlsx)
https://public.tableau.com/s/resources

World Map Chart
https://public.tableau.com/s/resources

Dashboards
Combining Sheets on a Dashboard (5:27)
See how to combine your visualizations together on a dashboard
Covers:
● How to combine sheets on a dashboard
● How to re-arrange and add items to a dashboard
Data: World Bank CO2 (.xlsx)
Viz: Combine Sheets on a Dashboard
https://public.tableau.com/s/resources

Dashboards
Adding Interactivity to Dashboards (4:30)
Learn how to add interactivity to your dashboards
Covers:
● See how to add filter actions
● See how to add highlight actions
Data: World Bank CO2 (.xlsx)
Viz: Combine Sheets on a Dashboard
https://public.tableau.com/s/resources

DESCRIPTIVE STATISTICS
https://www.pluralsight.com/guides/tableau-worksheet-summary-card:-quick-descriptive-statistics
https://public.tableau.com/s/resources

STORIES
Creating Stories (5:55)
Learn how to turn your data into a cohesive narrative using Story Points
Covers:
● See examples of data stories
● Learn how to create story points
Data: World Bank CO2 (.xlsx)
https://public.tableau.com/s/resources

STORIES
Formatting Story Points (6:54)
Make your stories come to life with these formatting tips
Covers:
● Learn how to fit your dashboards to the story points
● See how to format the story points
● See how to add annotations to your story
Data: World Bank CO2 (.xlsx)
https://public.tableau.com/s/resources

TABLEAU TUTORIAL-1
https://mdl.library.utoronto.ca/technology/tutorials/creating-data-visualizations-using-tableau-desktop-beginner#7
https://public.tableau.com/s/resources

TABLEAU TUTORIAL-2
https://maps.library.utoronto.ca/workshops/Tableau1hrOnline/Demo.pdf
https://maps.library.utoronto.ca/workshops/Tableau1hrOnline/SetupInstructions.pdf
https://public.tableau.com/s/resources

CALCULATED FIELDS
https://learning.oreilly.com/library/view/practical-tableau/9781491977309/ch08.html#an_introduction_to_aggregation_in_tablea

Connect to Data
Watch Video: https://www.tableau.com/learn/tutorials/on-demand/getting-started-part3?playlist=391099

Tableau Workspace
https://help.tableau.com/current/pro/desktop/en-us/environment_workspace.htm
Watch: Tour the Tableau Interface
Dive deeper: The Tableau Workspace

Getting Started with Visual Analytics
VIDEO PLACEHOLDER
QUESTION TO ANSWER:
1. Sales over time
2. Profit over time
3.
Relationship between shipping cost and profit
https://www.tableau.com/learn/tutorials/on-demand/getting-started-visual-analytics
Download Video from BlackBoard or Webpage

ORDERS, PEOPLE, RETURNS
https://www.tableau.com/learn/tutorials/on-demand/getting-started-part3?playlist=391099

JOIN ORDERS and RETURNS
https://www.tableau.com/learn/tutorials/on-demand/getting-started-part3?playlist=391099

Tableau Calculations
Watch
Dive deeper: The Tableau Workspace
Video Placeholder

COMMON CHARTS
https://help.tableau.com/current/pro/desktop/en-us/dataview_examples.htm
● Build an Area Chart
● Build a Bar Chart
● Build a Box Plot
● Build a Bullet Graph
● Build with Density Marks (Heatmap)
● Build a Gantt Chart
● Build a Highlight Table
● Build a Histogram
● Build a Line Chart
● Build a Packed Bubble Chart
● Build a Pie Chart
● Build a Scatter Plot
● Build a Text Table
● Build a Treemap
● Build Combination Charts

AREA CHARTS
https://help.tableau.com/current/pro/desktop/en-us/qs_area_charts.htm

BAR CHARTS
https://help.tableau.com/current/pro/desktop/en-us/buildexamples_bar.htm

Build Visualization
Watch: Build a Visualization
Additional reading: Viz Building Basics
Learn how the 'Show Me' menu can help you.
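
As a rough illustration of the JOIN ORDERS and RETURNS step listed earlier, the sketch below shows what a left join does to the rows, written in plain Python rather than Tableau. The sample rows, the field names (order_id, sales, profit, returned), and the helper left_join are all assumptions made for this example; they are not taken from the course dataset, and the sketch assumes at most one RETURNS row per order.

```python
# Hypothetical sample rows; field names are assumptions, not the course data.
orders = [
    {"order_id": "A-1", "sales": 120.0, "profit": 30.0},
    {"order_id": "A-2", "sales": 80.0, "profit": -5.0},
    {"order_id": "A-3", "sales": 200.0, "profit": 55.0},
]
returns = [
    {"order_id": "A-2", "returned": "Yes"},
]


def left_join(left, right, key):
    """Keep every left row; attach matching right fields, or None if no match."""
    right_by_key = {row[key]: row for row in right}
    right_fields = {field for row in right for field in row if field != key}
    joined = []
    for row in left:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for field in right_fields:
            merged[field] = match.get(field)  # None when the order was not returned
        joined.append(merged)
    return joined


joined = left_join(orders, returns, "order_id")
returned_ids = [row["order_id"] for row in joined if row["returned"] == "Yes"]
print(returned_ids)
```

A left join keeps all orders, so orders that were never returned still appear, with an empty value in the returned field. An inner join would instead drop them, which matters when you later total sales or profit.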
BOX WHISKERS PLOT
https://help.tableau.com/current/pro/desktop/en-us/buildexamples_boxplot.htm

PIE PLOT
https://help.tableau.com/current/pro/desktop/en-us/buildexamples_pie.htm

CREATE DASHBOARDS
https://help.tableau.com/current/pro/desktop/en-us/dashboards.htm
● Best Practices for Effective Dashboards
● Create a Dashboard
● Accelerators for Cloud-based Data
● Size and Lay Out Your Dashboard
● Create Dashboard Layouts for Different Devices
● Build Accessible Dashboards
● Manage Sheets in Dashboards and Stories
● Use Dashboard Extensions

STORIES
https://help.tableau.com/current/pro/desktop/en-us/stories.htm
In Tableau, a story is a sequence of visualizations that work together to convey information. You can create stories to tell a data narrative, provide context, demonstrate how decisions relate to outcomes, or to simply make a compelling case.

Additional resources
● This article, The Ultimate Cheat Sheet on Tableau Charts by Kate Strachnyi, explains why Tableau Desktop is a superior data analysis and visualization tool.
● When you click on this link you will see a Tableau Example. Interviewing data: exploratory graphical analysis
● Please review these free training videos from Tableau

PROJECT PHASE I

PROJECT
As part of this course, students will undertake a real-world data project. The project will consist of addressing several questions and requirements using data and analytic approaches and tools. The project will be carried out in multiple phases, each requiring a mandatory in-class presentation. The following questions will help you to consider your project:
● What are the business question(s)?
● What are the business objective(s)?
● What are the business metrics to measure?
● What are the business output & outcomes?
● Why are they important?
● How should you answer the question(s)?
● How do you know when the question(s) are answered?

Class Project and Presentations
Throughout the course, we will use the same dataset, which gives you the opportunity to become well versed in the data. With this dataset you will complete the following:
Phase I
● Initial analysis in Tableau that will provide the foundation for your subsequent analyses (Assignment I)
a. Group Tableau Dashboard, slides and presentation
Phase II
● Incorporating business objectives, consulting for context, storytelling and visualization concepts
a. Group Tableau Dashboard, slides and presentation

PHASE I PROJECT - Guidelines
Phase I
● Initial analysis in Tableau that will provide the foundation for your subsequent analyses (Assignment I)
● Group Tableau Dashboard, slides and presentation
● In-class mandatory group presentation
● Each member participates in the presentation (under 3 mins per student)
● Slide deck should include the following areas
○ Include “About you” section in the slide deck
○ Understand Overall Project
○ Project Objectives
○ Data set to use for the project

Phase I- Guidelines
Mandatory in-class presentation (under 3 mins per student). Students will walk the class through the Tableau dashboard. We will skip student intro “About you” during in-class presentation.
○ Resources to Complete Deliverable
○ Establish Project Timeline in Phases
○ Create a chart, map, graph, or other visualization to make your data easier to understand.
○ Use Descriptive Analytics
○ Snapshot of Tableau graphs summary in slide deck
○ Each team member presents
● Submit slides individually (fn_ln_Grp#_CS160_month_day_year)
● Submit Tableau workbook individually (fn_ln_Grp#_CS160_month_day_year)

Project Phase II
Phase II- Guidelines
○ Resources to Complete Deliverable
○ Establish Project Timeline in Phases
○ Visualize data to tell a story
○ Incorporate Consulting for Context principles
○ Iterate on Phase I Tableau Dashboard
■ Create a chart, map, graph, or other visualization to make your data easier to understand.
■ Use Descriptive Analytics
■ Snapshot of Tableau graphs
● Submit slides individually (fn_ln_Grp#_ALY6070_1_30_2023)
● Submit Tableau workbook individually (fn_ln_Grp#_ALY6070_1_30_2023)

Project Phase II- Guidelines
● Incorporating business objectives, consulting for context, storytelling and visualization concepts
● Group Tableau Dashboard, slides and presentation
● In-class mandatory group presentation
● Each member participates (under 3 mins per student) in the presentation
● Slide deck should include the following areas
a. Include “About you” section in the slide deck
b. Understand Overall Project
c. Project Objectives
d. Data set to use for the project

Module 2 : Learning Objectives
By the end of this module, you should be able to:
● Explain the importance of data storytelling
● Explain the importance of context
● Differentiate between exploratory and explanatory data visualization approaches
● Tailor data presentation to identified target audience
● Incorporate learnings in Phase II project

Exploratory vs. explanatory analysis
● Exploratory analysis is what you do to understand the data and figure out what might be noteworthy or interesting to highlight to others.
● When we’re at the point of communicating our analysis to our audience, we really want to be in the explanatory space.
● You have a specific thing you want to explain, a specific story you want to tell, probably about those two pieces

Explanatory analysis- Who, what, and how
● When it comes to explanatory analysis, there are a few things to think about and be extremely clear on before visualizing any data or creating content.
● First: To whom are you communicating?
○ It is important to have a good understanding of who your audience is and how they perceive you.
○ This can help you to identify common ground that will help you ensure they hear your message.

Explanatory analysis- Who, what, and how
● Second: What do you want your audience to know or do?
○ You should be clear how you want your audience to act.
○ Take into account how you will communicate to them.
○ Consider the overall tone that you want to set for your communication.
● Third: How can you use data to help make your point?

Consulting for context: questions to ask
● What background information is relevant or essential?
● Who is the audience or decision maker? What do we know about them?
● What biases does our audience have that might make them supportive of or resistant to our message?
● What data is available that would strengthen our case?
● Is our audience familiar with this data, or is it new?

Consulting for context: questions to ask
● Where are the risks: what factors could weaken our case and do we need to proactively address them?
● What would a successful outcome look like?
● If you only had a limited amount of time or a single sentence to tell your audience what they need to know, what would you say?

Storytelling
● Find a subject you care about.
○ It is this genuine caring, and not games with language, which will be the most compelling and seductive element in your style.
● Do not ramble, though.
● Keep it simple.
○ “To be or not to be?” asks Shakespeare’s Hamlet. The longest word is three letters.
● Have the guts to cut.
If a sentence, no matter how excellent, does not illuminate your subject in some new and useful way, scratch it out. ● Sound like yourself. ● Say what you meant to say. ● Pity the readers. Our audience requires us to be sympathetic and patient teachers, ever willing to simplify and clarify. https://learning.oreilly.com/library/view/storytelling-with-data/9781119002253/c07.xhtml#c7_2 Constructing the story- The beginning ● The setting: When and where does the story take place? ● The main character: Who is driving the action? (This should be framed in terms of your audience!) ● The imbalance: Why is it necessary, what has changed? ● The balance: What do you want to see happen? ● The solution: How will you bring about the changes? Constructing the story- The middle ● Further develop the situation or problem by covering relevant background. ● Incorporate external context or comparison points. ● Give examples that illustrate the issue. ● Include data that demonstrates the problem. ● Articulate what will happen if no action is taken or no change is made. ● Discuss potential options for addressing the problem. ● Illustrate the benefits of your recommended solution. ● Make it clear to your audience why they are in a unique position to make a decision or drive action. Constructing the story- The end ● End with a call to action: make it totally clear to your audience what you want them to do with the new understanding or knowledge that you’ve imparted to them. ● One classic way to end a story is to tie it back to the beginning. ● At the beginning of our story, we set up the plot and introduced the dramatic tension. ● To wrap up, you can think about recapping this problem and the resulting need for action, reiterating any sense of urgency and sending your audience off ready to act. 3-minute story ● The 3-minute story is exactly that: if you had only three minutes to tell your audience what they need to know, what would you say? 
● This is a great way to ensure you are clear on and can articulate the story you want to tell.
● Being able to do this removes you from dependence on your slides or visuals for a presentation.
● This is useful in the situation where your boss asks you what you’re working on or if you find yourself in an elevator with one of your stakeholders and want to give her the quick rundown.
● Or if your half-hour on the agenda gets shortened to ten minutes, or to five.
● If you know exactly what it is you want to communicate, you can make it fit the time slot you’re given, even if it isn’t the one for which you are prepared.

Big Idea
● The Big Idea boils the so-what down even further: to a single sentence.
● The Big Idea has three components:
○ It must articulate your unique point of view;
○ It must convey what’s at stake; and
○ It must be a complete sentence.

Storyboarding
● The storyboard establishes a structure for your communication.
● It is a visual outline of the content you plan to create.
● It can be subject to change as you work through the details.
● Establishing a structure early on will set you up for success.
● When you can (and as makes sense), get acceptance from your client or stakeholder at this step.
● It will help ensure that what you’re planning is in line with the need.
● Use a whiteboard, Post-it notes, or plain paper.
● It’s much easier to put a line through an idea on a piece of paper or recycle a Post-it note without feeling the same sense of loss as when you cut something you’ve spent time creating with your computer.
● With Post-it notes it’s easy to rearrange ideas and explore different narrative flows.

Storyboarding
● Who are the users of this screen?
● What is the page showing?
● What questions will the page answer?
● What actions will that enable?

HOW TO STORYBOARD
1. Get large sticky notes or sheets of paper.
2. Brainstorm the elements in your story.
3. Play around with sequences that seem right.
4.
Illustrate ideas that show what the step is about. STORYBOARD EXAMPLE We will create a storyboard for the feature in mapping software, such as Google Maps, where a user can share their real-time route information with others: ● Character: Susan is a sales representative. She uses mapping software on her phone all the time to find her way to clients. She is a busy, single mom to Simon. ● Context: She is in her car, stuck in traffic. Her phone is in its holder on the dashboard. Simon, the second character, is waiting at school. https://docs.google.com/presentation/d/14LXsWyo4fjnS28K9eHc0dnvy3ijyN307Y_USbAodWA/edit#slide=id.g1996323a360_0_356 STORYBOARD EXAMPLE ● Plot: Susan is rushing back from a client to fetch Simon from school (Characters). ● She sees that she is late. Because of the unexpected traffic, she will be very late fetching Simon. ● Simon is waiting at school, anxious and upset (Struggle). ● Susan remembers that she can send her real-time location to Simon with the click of a button on her phone. ● Then he will know that she is coming. ● She is relieved (Trigger). Simon gets the notification and is immediately relieved because he knows when Susan will be arriving and that she is thinking of him (Climax + Resolution). https://docs.google.com/presentation/d/14LXsWyo4fjnS28K9eHc0dnvy3ijyN307Y_USbAodWA/edit#slide=id.g1996323a360_0_356 STORYBOARD EXAMPLE The next step is to separate the story into panels. 1. Susan driving from an appointment to fetch her son, Simon, from school. She is late and the traffic is bad. She is upset and worried. 2. Closeup of Susan realizing that she can share her location with Simon, so he can see where she is and know when she's going to arrive. 3. Simon waiting at school. He is sad and anxious because he doesn't know when Susan will arrive and she is late. His phone buzzes with a message. 4. Simon sees Susan's progress and ETA on his phone. He is happy because he sees that Susan is on her way and knows when she will arrive. 
He feels cared for because he knows she thought of him.

SETTING THE SCENE FOR YOUR DATA STORY
https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/c07.xhtml#usec0003

PHASE II PROJECT
https://learning.oreilly.com/library/view/effective-data-storytelling/9781119615712/c07.xhtml#usec0003

Module 3
As we move into this module, we will cover the following topics:
● The importance of choosing an effective visual display as well as decluttering your visuals.
● We will also examine what constitutes graphical integrity.

Module 3
By the end of this module, you should be able to:
● Recognize appropriate data visualizations that help to communicate the meaning of data
● Explain how to improve a visual to better communicate the meaning of data
● Describe Graphical Integrity

DIRECTORY OF VISUALIZATIONS Amounts
● The most common approach to visualizing amounts (i.e., numerical values shown for some set of categories) is using bars, either vertically or horizontally arranged.
● However, instead of using bars, we can also place dots at the location where the corresponding bar would end.
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts

DIRECTORY OF VISUALIZATIONS- Distributions
● Histograms and density plots provide the most intuitive visualizations of a distribution, but both require arbitrary parameter choices and can be misleading.
● Cumulative densities and quantile-quantile (q-q) plots always represent the data faithfully but can be more difficult to interpret.
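
The point about arbitrary parameter choices can be made concrete with a small sketch. The values below are made up for illustration, and the helper names histogram and ecdf are assumptions for this example, not functions from Tableau or from the cited book. The same data is binned twice with different bin widths, then summarized as an empirical cumulative distribution, which needs no bin width at all.

```python
from collections import Counter

# Hypothetical sample values, chosen only for illustration.
values = [1.0, 1.2, 1.9, 2.1, 2.2, 3.8, 4.0, 4.1, 4.9, 5.0]


def histogram(data, bin_width, start=0.0):
    """Count values per bin; the picture depends on the chosen bin_width."""
    counts = Counter(int((x - start) // bin_width) for x in data)
    return {start + i * bin_width: counts[i] for i in sorted(counts)}


def ecdf(data):
    """Empirical cumulative distribution: a step of 1/n at each observation."""
    ordered = sorted(data)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


coarse = histogram(values, bin_width=2.5)  # three wide bins
fine = histogram(values, bin_width=1.0)    # five narrow bins
steps = ecdf(values)                       # no parameter to choose

print(coarse)
print(fine)
print(steps[-1])
```

With the wide bins the data looks like it tails off steadily, while the narrow bins show two separate clusters. The cumulative version keeps every observation, which is why it represents the data faithfully even though it takes more practice to read.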
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts

DIRECTORY OF VISUALIZATIONS- Distributions
● Boxplots, violin plots, strip charts, and sina plots are useful when we want to visualize many distributions at once and/or if we are primarily interested in overall shifts among the distributions.
● Stacked histograms and overlapping densities allow a more in-depth comparison of a smaller number of distributions, though stacked histograms can be difficult to interpret and are best avoided.
● Ridgeline plots can be a useful alternative to violin plots and are often useful when visualizing very large numbers of distributions or changes in distributions over time.
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts

DIRECTORY OF VISUALIZATIONS- Proportions
● Proportions can be visualized as pie charts, side-by-side bars, or stacked bars.
● As for amounts, when we visualize proportions with bars, the bars can be arranged either vertically or horizontally.
● Pie charts emphasize that the individual parts add up to a whole and highlight simple fractions.
● However, the individual pieces are more easily compared in side-by-side bars.
● Stacked bars look awkward for a single set of proportions, but can be useful when comparing multiple sets of proportions.
● When visualizing multiple sets of proportions or changes in proportions across conditions, pie charts tend to be space-inefficient and often obscure relationships.
● Grouped bars work well as long as the number of conditions compared is moderate, and stacked bars can work for large numbers of conditions.
● Stacked densities are appropriate when the proportions change along a continuous variable.
● When proportions are specified according to multiple grouping variables, mosaic plots, treemaps, or parallel sets are useful visualization approaches.
● Mosaic plots assume that every level of one grouping variable can be combined with every level of another grouping variable. Treemaps do not make such an assumption.
● Treemaps work well even if the subdivisions of one group are entirely distinct from the subdivisions of another.
● Parallel sets work better than either mosaic plots or treemaps when there are more than two grouping variables.
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts

DIRECTORY OF VISUALIZATIONS :x-y relationships
● Scatterplots represent the archetypical visualization when we want to show one quantitative variable relative to another.
● If we have three quantitative variables, we can map one onto the dot size, creating a variant of the scatterplot called a bubble chart.
● For paired data, where the variables along the x and y axes are measured in the same units, it is generally helpful to add a line indicating x = y (see “Paired Data”).
● Paired data can also be shown as a slopegraph of paired points connected by straight lines.
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts DIRECTORY OF VISUALIZATIONS :x-y relationships ● For large numbers of points, regular scatterplots can become uninformative due to overplotting. ● In this case, contour lines, 2D bins, or hex bins may provide an alternative https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts DIRECTORY OF VISUALIZATIONS :x-y relationships ● When we want to visualize more than two quantities, on the other hand, we may choose to plot correlation coefficients in the form of a correlogram instead of the underlying raw data (see “Correlograms”). https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts DIRECTORY OF VISUALIZATIONS :x-y relationships ● When the x axis represents time or a strictly increasing quantity such as a treatment dose, we commonly draw line graphs. ● If we have a temporal sequence of two response variables we can draw a connected scatterplot. https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts DIRECTORY OF VISUALIZATIONS :x-y relationships ● Connected scatterplot, where we first plot the two response variables in a scatterplot and then connect dots corresponding to adjacent time points (see “Time Series of Two or More Response Variables”). ● We can use smooth lines to represent trends in a larger dataset https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts DIRECTORY OF VISUALIZATIONS :GeoSpatial data ● The primary mode of showing geospatial data is in the form of a map (Chapter 15). A map takes coordinates on the globe and projects them onto a flat surface, such that shapes and distances on the globe are approximately represented by shapes and distances in the 2D representation. In addition, we can show data values in different regions by coloring those regions in the map according to the data. 
Such a map is called a choropleth (see “Choropleth Mapping”). In some cases, it may be helpful to distort the different regions according to some other quantity (e.g., population number) or simplify each region into a square. Such visualizations are called cartograms (see “Cartograms”).
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts

DIRECTORY OF VISUALIZATIONS :Uncertainty
Error bars are meant to indicate the range of likely values for some estimate or measurement. They extend horizontally and/or vertically from some reference point representing the estimate or measurement (Chapter 16). Reference points can be shown in various ways, such as by dots or by bars. Graded error bars show multiple ranges at the same time, where each range corresponds to a different degree of confidence. They are in effect multiple error bars with different line thicknesses plotted on top of each other.
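
To see where the nested ranges of a graded error bar come from, here is a minimal sketch that computes intervals around a sample mean at three confidence levels. The measurements are invented, the function name graded_intervals is hypothetical, and the normal approximation is an assumption made to keep the example short; for a sample this small a t-distribution would give somewhat wider ranges.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical measurements, for illustration only.
measurements = [9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 10.3, 9.7]


def graded_intervals(data, levels=(0.80, 0.95, 0.99)):
    """Return {confidence level: (low, high)} around the sample mean.

    Assumes a normal approximation for the sampling distribution of the mean.
    """
    center = mean(data)
    std_error = stdev(data) / len(data) ** 0.5
    intervals = {}
    for level in levels:
        # Two-sided interval: put (1 - level) / 2 of the probability in each tail.
        z = NormalDist().inv_cdf(0.5 + level / 2)
        intervals[level] = (center - z * std_error, center + z * std_error)
    return intervals


bars = graded_intervals(measurements)
for level, (low, high) in bars.items():
    print(f"{level:.0%}: {low:.3f} to {high:.3f}")
```

Each higher confidence level produces a wider range around the same center, which is why graded error bars can be drawn as bars of decreasing thickness stacked on one reference point.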
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts

DIRECTORY OF VISUALIZATIONS :Uncertainty
To achieve a more detailed visualization than is possible with error bars or graded error bars, we can visualize the actual confidence or posterior distributions (Chapter 16). Confidence strips provide a visual sense of uncertainty but are difficult to read accurately. Eyes and half-eyes combine error bars with approaches to visualize distributions (violins and ridgelines, respectively), and thus show both precise ranges for some confidence levels and the overall uncertainty distribution. A quantile dot plot can serve as an alternative visualization of an uncertainty distribution (see “Framing Probabilities as Frequencies”). Because it shows the distribution in discrete units, the quantile dot plot is not as precise but can be easier to read than the continuous distribution shown by a violin or ridgeline plot.
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts

DIRECTORY OF VISUALIZATIONS :Uncertainty
For smooth line graphs, the equivalent of an error bar is a confidence band (see “Visualizing the Uncertainty of Curve Fits”). It shows a range of values the line might pass through at a given confidence level. Like with error bars, we can draw graded confidence bands that show multiple confidence levels at once. We can also show individual fitted draws in lieu of or in addition to the confidence bands.
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch05.html#amounts

PROJECT-Phase II
Iterate on Assignment 1 (PowerPoint and Tableau Workbook) and incorporate the consulting, storytelling, and visualization concepts you learnt.
● What are the business question(s)?
What are the business Objectives(s)? What are the business metrics to measure? What are the business output & outcomes? Why are they important? How should you answer the question(s)? How do you know when the question(s) are answered?Wh PROJECT-Phase II-Submission guideline ● In class mandatory presentation Jan 30, 2023 ● Submit slides individually (fn_ln_Grp#_ALY6070_1_30_2023_prj_2) ● Submit Tableau workbook individually (fn_ln_Grp#_ALY6070_1_23_2023_prj_2) Chart Design principles There are so many different types of charts. However, just because data can be made into a chart doesn’t necessarily mean that it should be turned into one. Before creating a chart, stop and ask: Does a visualized data pattern really matter to your story? Sometimes a simple table, or even text alone, can communicate the idea more effectively to your audience. Creating a well-designed chart requires time and effort, so make sure it enhances your data story. Although not a science, data visualization comes with a set of principles and best practices that serve as a foundation for creating truthful and eloquent charts. In this section, we’ll identify some important rules about chart design. You may be surprised to learn that some rules are less rigid than others and can be “broken” https://learning.oreilly.com/library/view/hands-on-datavisualization/9781492085997/ch06.html#idm45750379312520 Chart Design principles ● There are so many different types of charts. ● However, just because data can be made into a chart doesn’t necessarily mean that it should be turned into one. ● Before creating a chart, stop and ask: ● Does a visualized data pattern really matter to your story? ● Sometimes a simple table, or even text alone, can communicate the idea more effectively to your audience. ● Creating a well-designed chart requires time and effort, so make sure it enhances your data story. 
Exploratory Data Analysis (EDA)
● Classical statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples.
● In 1962, John W. Tukey called for a reformation of statistics in his seminal paper “The Future of Data Analysis” [Tukey-1962].
● He proposed a new scientific discipline called data analysis that included statistical inference as just one component.
● Tukey forged links to the engineering and computer science communities (he coined the terms bit, short for binary digit, and software).
● His original tenets are surprisingly durable and form part of the foundation for data science.
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html
Exploratory Data Analysis (EDA)
● The field of exploratory data analysis was established with Tukey’s now-classic 1977 book Exploratory Data Analysis [Tukey-1977].
● Tukey presented simple plots (e.g., boxplots, scatterplots) that, along with summary statistics (mean, median, quantiles, etc.), help paint a picture of a data set.
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html
Types of Data
There are two basic types of structured data: numeric and categorical.
Numeric: Data that are expressed on a numeric scale.
● Continuous: Data that can take on any value in an interval. (Synonyms: interval, float, numeric)
○ Examples: wind speed or time duration.
● Discrete: Data that can take on only integer values, such as counts. (Synonyms: integer, count)
○ Example: the count of the occurrence of an event.
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html
Types of Data- Categorical
Categorical: Data that can take on only a specific set of values representing a set of possible categories. (Synonyms: enums, enumerated, factors, nominal)
● Binary: A special case of categorical data with just two categories of values, e.g., 0/1, true/false. (Synonyms: dichotomous, logical, indicator, boolean)
● Ordinal: Categorical data that has an explicit ordering. (Synonym: ordered factor)
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html
Types of Data- Categorical
Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Identifying data as categorical has three benefits:
● Knowing that data is categorical can act as a signal telling software how statistical procedures, such as producing a chart or fitting a model, should behave.
○ In R, ordinal data can be represented as an ordered.factor, preserving a user-specified ordering in charts, tables, and models.
○ In Python, scikit-learn supports ordinal data with sklearn.preprocessing.OrdinalEncoder.
● Storage and indexing can be optimized (as in a relational database).
● The possible values a given categorical variable can take are enforced in the software (like an enum).
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html
Types of Data- Categorical
● The third “benefit” can lead to unintended or unexpected behavior.
● In R: in versions before 4.0.0, the default behavior of data import functions (e.g., read.csv) is to automatically convert a text column into a factor. Since R 4.0.0, text columns stay as character strings unless stringsAsFactors = TRUE is set.
○ Once a column is a factor, subsequent operations on it assume that the only allowable values are the ones originally imported, and assigning a new text value will introduce a warning and produce an NA (missing value).
● In Python: the pandas package will not make such a conversion automatically.
○ However, you can specify a column as categorical explicitly in the read_csv function.
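The pandas behavior described above can be tried on a small in-memory file. This is a minimal sketch: the CSV text, the column names (screen, size), and the chosen size ordering are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "screen,size\nLCD,small\nLED,large\nplasma,medium\nLED,small\n"

# Ask read_csv to store the 'screen' column as categorical instead of text.
df = pd.read_csv(io.StringIO(csv_text), dtype={"screen": "category"})

# For ordinal data, declare the allowed values and their order explicitly.
size_order = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
df["size"] = df["size"].astype(size_order)

print(df.dtypes)
print(df["size"].min(), "<", df["size"].max())
```

Because the size column is an ordered categorical, comparisons such as min and max follow the declared order rather than alphabetical order.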
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html
Types of Data
https://learning.oreilly.com/library/view/fundamentals-of-data/9781492031079/ch02.html
EDA terminology
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html#Percentiles
Descriptive Statistics
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html#Percentiles
EDA
● Documentation on data frames in R
● Documentation on data frames in Python
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html#Percentiles
EDA
● Variables with measured or count data might have thousands of distinct values.
● A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).
● At first glance, summarizing data might seem fairly trivial: just take the mean of the data.
● While the mean is easy to compute and expedient to use, it may not always be the best measure for a central value.
● For this reason, statisticians have developed and promoted several alternative estimates to the mean.
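To see why the mean is not always the best "typical value", the sketch below compares it with two commonly used alternatives, the median and a trimmed mean. The income figures are hypothetical, and trimmed_mean is a small helper written for this example rather than a library function.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands); the last value is an outlier.
incomes = [32, 35, 38, 40, 41, 43, 45, 47, 52, 400]


def trimmed_mean(values, trim_count):
    """Mean after dropping trim_count values from each end of the sorted data."""
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return mean(kept)


print("mean:", mean(incomes))
print("median:", median(incomes))
print("trimmed mean:", trimmed_mean(incomes, 1))
```

The single outlier pulls the mean far above most of the data, while the median and the trimmed mean stay close to where the bulk of the values sit.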
https://learning.oreilly.com/library/view/practical-statistics-for/9781492072935/ch01.html#Percentiles
https://www.ibm.com/blogs/internet-of-things/what-is-the-iot/
Internet of Things (IoT)
● The Internet of Things, or IoT, refers to the billions of physical devices around the world that are now connected to the internet, all collecting and sharing data.
● Pretty much any physical object can be transformed into an IoT device if it can be connected to the internet to be controlled or communicate information.
● A lightbulb that can be switched on using a smartphone app is an IoT device, as is a motion sensor or a smart thermostat in your office or a connected streetlight.
https://www.zdnet.com/article/what-is-the-internet-of-things-everything-you-need-to-know-about-the-iot-right-now/
Data and Analytics Concepts and Terminology -Advanced
Internet of Things (IoT)
https://learning.oreilly.com/library/view/big-data-fundamentals/9780134291185/ch01.xhtml#ch01lev2sec3
● The broadening coverage of the Internet and the proliferation of cellular and Wi-Fi networks have enabled more people and their devices to be continuously active in virtual communities.
● Coupled with the proliferation of Internet-connected sensors, this forms the underpinnings of the Internet of Things (IoT), a vast collection of smart Internet-connected devices.
● This in turn has resulted in a massive increase in the number of available data streams.
https://www.zdnet.com/article/what-is-the-internet-of-things-everything-you-need-to-know-about-the-iot-right-now/
Cloud Computing
https://aws.amazon.com/what-is-cloud-computing/
● Cloud computing advancements have led to the creation of environments that are capable of providing highly scalable, on-demand IT resources.
● Cloud computing environments can be leased via pay-as-you-go models.
● Businesses can leverage the infrastructure, storage, and processing capabilities of these environments to build out scalable data analytics solutions for large-scale automated analysis.
xaaS- Cloud product offerings
https://learning.oreilly.com/library/view/kubernetes-application-developer/9781484280324/html/511560_1_En_1_Chapter.xhtml
Data Ecosystem (Unified Analytics Platform)
● The term data ecosystem refers to the programming languages, packages, algorithms, cloud-computing services, and general infrastructure an organization uses to collect, store, analyze, and leverage data.
● No two organizations leverage the same data in the same way.
● As such, each organization has a unique data ecosystem.
● A data ecosystem is also referred to as a unified analytics platform (UAP).
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
Data Life Cycle
● While the data ecosystem encompasses everything that handles, organizes, and processes data, the data life cycle describes the path data takes from when it’s first generated to when it’s interpreted into actionable insights.
● This life cycle can be split into eight steps: generation, collection, processing, storage, management, analysis, visualization, and interpretation.
● A data project’s steps are often described as a cycle because the lessons learned and insights gleaned from one project typically inform the next.
● In this way, the final step of the process feeds back into the first, enabling you to start again with new goals and learnings.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
7 Data & Analytics Skills You Need
Critical Thinking
● If you’re interested in using data to solve business problems, you need to be adept at thinking critically about challenges and solutions.
● While data can provide many answers, it’s nothing without a human’s discerning eye.
“From the first steps of determining the quality of a data source to determining the success of an algorithm, critical thinking is at the heart of every decision data scientists, and those who work with them, make,” Tingley (HBS)
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
7 Data & Analytics Skills You Need
Hypothesis Formation and Testing
● At the heart of data and analytics is the desire to answer questions.
● The proposed explanations for these leading questions are called hypotheses, which must be formed before analysis takes place.
● An example of a hypothesis is, “I predict that a person’s likelihood of recommending our product is directly proportional to their reported satisfaction with the product.”
● You predict the data will show this trend and must prove or disprove the hypothesis through analysis.
● Without a hypothesis, your analysis has no clear direction.
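A first check of the example hypothesis above could look like the following sketch. The survey scores are invented for illustration, and pearson_r is a helper defined here. A strong positive correlation would be consistent with the hypothesis but would not by itself establish strict proportionality.

```python
from math import sqrt
from statistics import mean

# Hypothetical survey responses on 1-10 scales, invented for illustration.
satisfaction = [3, 5, 6, 7, 8, 9, 4, 6, 8, 10]
recommend = [2, 5, 5, 8, 7, 9, 3, 6, 9, 10]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


r = pearson_r(satisfaction, recommend)
print(f"correlation between satisfaction and likelihood to recommend: {r:.2f}")
```

Stating the hypothesis first determines which columns to collect and which statistic to compute, which is the "clear direction" the slide refers to.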
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
7 Data & Analytics Skills You Need
Data Wrangling (Data Preparation)
● Data wrangling is the process of cleaning raw data in preparation for analysis. It involves identifying and resolving mistakes, filling in missing data, and organizing and transferring it into an easily understandable format.
● This is an important skill for anyone dealing with data to acquire because it leads to a more efficient and organized data analysis process.
● You can extract valuable insights from data more quickly when it’s cleaned and in its optimal viewing format.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
7 Data & Analytics Skills You Need
Mathematical Ability
● You don’t have to be a mathematician to become data literate, but strong math skills become increasingly important as you deal with more complex analyses.
● A seasoned data professional needs a solid understanding of statistics, probability, linear algebra, and multivariable calculus.
● Data scientists often call on statistical methods to find structure in data and make predictions, and linear algebra and calculus can make machine-learning algorithms easier to comprehend.
● If you’re not a data scientist or analyst, your work may not require you to understand the more complex mathematical concepts, but having a basic understanding of statistics can go a long way.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
7 Data & Analytics Skills You Need
Data Visualization
● It’s crucial to know how to transform raw data into compelling visuals that tell a story.
● Rather than simply presenting a list of values to your stakeholders, it’s more effective to visually communicate data in a way that’s easily digestible.
● Data visualization tools, a form of software designed to present data, allow you to input a dataset and visually manipulate it.
● Most, but not all, come with built-in templates you can use to generate basic visualizations (pie charts, bar charts, and histograms).
● Examples include Microsoft Excel and Power BI, Google Charts, Tableau, Zoho Analytics, Datawrapper, and Infogram.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
7 Data & Analytics Skills You Need
Machine Learning (ML)
● As artificial intelligence (AI) grows in popularity, machine learning is a highly valuable skill for professionals working with big data.
● Machine learning refers to the use of computer algorithms that automatically learn from and adapt in response to data.
● Some business applications of machine learning include risk management, performance analysis, trading, and automation.
● Even if you’re not responsible for writing code, knowing the basics of machine learning can help you gain a deeper understanding of your organization and boost efficiency through automation.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
Data Science in Business
● In business, data science is used to collect, organize, and maintain data, often to write algorithms that make large-scale analysis possible.
● When designed correctly and tested thoroughly, algorithms can catch information or trends that humans miss.
● They can also significantly speed up the processes of gathering and analyzing data.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
Data Science in Business
You can use data science to:
● Gain customer insights:
○ Data about your customers can reveal details about their habits, demographics, preferences, and aspirations.
○ A foundational understanding of data science can help you make sense of this data and leverage it to improve user experiences and inform retargeting efforts.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
Data Science in Business
You can use data science to:
● Increase security:
○ Data science can be used to increase your business’s security and protect sensitive information.
○ For example, ML algorithms can detect bank fraud faster and with greater accuracy than humans, simply because of the sheer volume of data generated every day.
● Inform internal finances:
○ Your organization’s financial team can utilize data science to create reports, generate forecasts, and analyze financial trends.
○ Data on a company’s cash flows, assets, and debts is constantly gathered, and financial analysts use it to detect trends in financial growth or decline, either manually or algorithmically.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4
Data Science in Business
You can use data science to:
● Streamline manufacturing:
○ Manufacturing machines gather data from production processes at high volumes.
○ In cases where the volume of data collected is too high for a human to manually analyze it, an algorithm can be written to clean, sort, and interpret it quickly and accurately to gather insights that drive cost-saving improvements.
● Predict future market trends:
○ Collecting and analyzing data on a larger scale can enable you to identify emerging trends in your market.
○ By staying up to date on the behaviors of your target market, you can make business decisions that allow you to get ahead of the curve.
https://online.hbs.edu/Documents/a-beginners-guide-to-data-and-analytics.pdf?hsCtaTracking=2bb079d4-1f8a-4052-9548-2430ccb52d48%7C4d888017-3b60-48fb-abd2-754f4106abb4