Course Outline

Course: Foundations
Skillsets:
- Using data in everyday life
- Thinking analytically
- Applying tools from the data analytics toolkit
- Showing trends and patterns with data visualizations
- Ensuring your data analysis is fair

Course: Ask
Skillsets:
- Asking SMART and effective questions
- Structuring how you think
- Summarizing data
- Putting things into context
- Managing team and stakeholder expectations
- Problem-solving and conflict-resolution

Course: Prepare
Skillsets:
- Ensuring ethical data analysis practices
- Addressing issues of bias and credibility
- Accessing databases and importing data
- Writing simple queries
- Organizing and protecting data
- Connecting with the data community (optional)

Course: Process
Skillsets:
- Connecting business objectives to data analysis
- Identifying clean and dirty data
- Cleaning small datasets using spreadsheet tools
- Cleaning large datasets by writing SQL queries
- Documenting data-cleaning processes

Course: Analyze
Skillsets:
- Sorting data in spreadsheets and by writing SQL queries
- Filtering data in spreadsheets and by writing SQL queries
- Converting data
- Formatting data
- Substantiating data analysis processes
- Seeking feedback and support from others during data analysis

Course: Share
Skillsets:
- Creating visualizations and dashboards in Tableau
- Addressing accessibility issues when communicating about data
- Understanding the purpose of different business communication tools
- Telling a data-driven story
- Presenting to others about data
- Answering questions about data

Course: Act
Skillsets:
- Coding in R
- Writing functions in R
- Accessing data in R
- Cleaning data in R
- Generating data visualizations in R
- Reporting on data analysis to stakeholders

Course: Capstone
Skillsets:
- Building a portfolio
- Increasing your employability
- Showcasing your data analytics knowledge, skill, and technical expertise
- Sharing your work during an interview
- Communicating your unique value proposition to a potential employer

Course 1 – Foundations: Data, Data, Everywhere

Course 1, Week 1

Transforming data into insights

Data – a collection of facts that can be used to draw conclusions, make predictions, and assist in
decision-making. Data needs to be controlled by businesses so they can use it to improve processes, identify opportunities and trends, launch new products, serve customers, and make thoughtful decisions.

Analysis – turning data into insights.

Data Analysis – the collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making.

Data Analysis Process:
o Ask
o Prepare
o Process
o Analyze
o Share
o Act

Data Science branches into:
a) Machine Learning / AI
   - used for automation / making many decisions under uncertainty
   - excellence requires high performance (e.g., success rate, accuracy)
b) Statistics
   - used for making a few important decisions under uncertainty
   - excellence requires care and rigor, with the intent to protect decision makers from coming to the wrong conclusion
c) Analytics
   - used for uncovering the unknown and figuring out inspiration for decisions (i.e., when it is unclear how many decisions need to be made)
   - excellence requires speed amidst ambiguity

Business Analytics – the use of math and statistics to collect, analyze, and interpret data to make better business decisions.
o Descriptive Analytics – the interpretation of historical data to identify trends and patterns (“what happened?”)
o Diagnostic Analytics – used to identify root causes of problems and correlations between variables (“why did this happen?”)
o Predictive Analytics – taking interpreted data and using it to forecast future outcomes to inform business strategies (“what might happen in the future?”)
o Prescriptive Analytics – used to determine which outcome will yield the best result given a scenario (“what should we do next?”)

Business Analytics vs.
Data Science

Business Analytics – main goal is to extract meaningful insights from data to guide organizational decisions (tasks such as budgeting, forecasting, product development)
Data Science – focused on turning raw data into meaningful conclusions by using algorithms and statistical models (tasks such as data wrangling*, programming, statistical modeling)
(https://online.hbs.edu/blog/post/importance-of-business-analytics)
(https://online.hbs.edu/blog/post/business-analytics-examples)

* Data Wrangling – also called data cleaning, data remediation, or data munging – refers to a variety of processes designed to transform data into more readily used formats. Can be manual or automated (e.g., merging multiple data sources into a single dataset for analysis, identifying gaps in data, deleting data that’s either unnecessary or irrelevant to the project being worked on, identifying extreme outliers in data and either explaining the discrepancies or removing them).
o Discovery – familiarizing yourself with the data to conceptualize how to employ it
o Structuring – transforming data so that it can be readily used
o Cleaning – removing inherent errors in data that might distort the analysis
o Enriching – determining whether to enrich or augment existing data
o Verifying – confirming that data is consistent and of high quality
o Publishing – making data available for analysis
(https://online.hbs.edu/blog/post/data-wrangling)

5 Business Analytics Skills for Professionals
1. Data Literacy*
   Familiarity with the language of data, including different types, sources, tools, and techniques
2. Data Collection
   Examples include existing datasets, customer surveys, interviews, questionnaires, and focus groups
3. Statistical Analysis
   Methods include:
   - Hypothesis Testing (statistical means of testing an assumption)
   - Linear Regression Analysis (used to evaluate the relationship between two variables)
   - Multiple Regression Analysis (used to evaluate the relationship between three or more variables)
4. Communication
   Includes oral communication and presentation skills, and written communication in the form of reports
5. Data Visualization
   Allows findings to be presented in easily digestible formats for those who may not be as data literate
   More effective to distill findings into key takeaways and present them in a manner that’s easy to understand
(https://online.hbs.edu/blog/post/business-analytics-skills)

*Data Literacy Skills & Concepts

Data Analysis
- Descriptive Analysis – seeks to explain or describe what has happened
- Diagnostic Analysis – seeks to explain or diagnose why something has happened
- Predictive Analysis – seeks to forecast what might happen
- Prescriptive Analysis – seeks to prescribe a course of action that might lead to a desired outcome
Data Wrangling
- Act of transforming data from its raw state into something that can be readily used
- Also known as data munging or data cleaning
Data Visualization
- Process of creating a graphical or visual representation of data; often a crucial piece of effectively communicating insights both inside and outside the organization
Data Ecosystem
- Refers to all of the components an organization leverages to collect, store, and analyze data
- Includes physical infrastructure like server space and cloud storage solutions, and non-physical components like data sources, programming languages, code packages, algorithms, and software
Data Governance
- The processes and practices an organization uses to formally manage its data assets
- Typically broken
down into:
   - Quality: how to ensure data remains accurate, trustworthy, and complete
   - Security: how to secure data from unauthorized access
   - Privacy: how to protect sensitive information collected and stored
   - Stewardship: how to ensure data processes are followed appropriately
Data Team
- Data Scientists – leverage advanced mathematics, programming, and tools to conduct and manage large-scale analyses
- Data Engineers – responsible for building and maintaining datasets that are leveraged in data projects
- Data Analysts – conduct the majority of the analyses an organization requires
(https://online.hbs.edu/blog/post/data-literacy)

Top Data Science Skills:
1. Critical Thinking – ability to recognize business problems, conduct testing, and swiftly identify trends in data
2. Mathematical Ability – including statistics, probability, linear algebra, multivariable calculus
3. Data Visualization – transforming data into compelling visuals that tell a story
4. Programming Skills – including Python, R, SQL
5. Data Wrangling – cleaning up data in preparation for analysis
6. Business Fluency – necessary to understand what information drives business decisions
7. Communication – with the help of data visualization
8. Machine Learning – the use of computer algorithms that automatically learn and adapt from data. Uses include risk management, performance analysis, trading, and automation
9. Ethical Skills
(https://online.hbs.edu/blog/post/data-science-skills)

4 Ways to Improve Analytical Skills
(https://online.hbs.edu/blog/post/how-to-improve-analytical-skills)

6 Steps to Analyzing Datasets
1. Clean up data
   Data wrangling – the process of uncovering and correcting, or eliminating, inaccurate or repeated records from a dataset, transforming raw data into a useful format for analysis.
2. Identify the right questions
   Questions should be easily measurable and closely related to specific business problems.
   Know what should be learned, what is expected to be learned, and how the information will be used.
3. Break down data into segments
   Break datasets into smaller, defined groups
4. Visualize the data
   Data visualization* – the process of creating graphical representations of data to help easily identify any trends or patterns and obvious outliers. Engaging visuals also help in effectively communicating findings to key stakeholders.
5. Use data to answer questions
   If results are inconclusive, revisit previous steps in the process
6. Supplement with qualitative data
   Pair quantitative findings with qualitative information (which may be captured via questionnaires, interviews, or testimonials). Datasets tend to be useful in understanding “what”, while qualitative information tends to give insights on “why.”

*Data Visualization – the process of creating a visual representation of the information within a dataset. The idea is to make data more accessible across an organization and to better communicate with external parties.
Most common techniques include:
- Pie Charts
- Bar Charts
- Histograms
- Gantt Charts
- Heat Maps
- Box-and-Whisker Plots
- Waterfall Charts
- Area Charts
- Scatter Plots
- Infographics
- Maps
(https://online.hbs.edu/blog/post/how-to-analyze-datasets)
Data Visualization Techniques (https://online.hbs.edu/blog/post/data-visualization-techniques)

Data Visualization Tools:
1. Microsoft Excel (and Power BI)
2. Google Charts
3. Tableau
4. Zoho Analytics
5. HubSpot
6. Databox
7. Datawrapper
8. Infogram
(https://online.hbs.edu/blog/post/data-visualization-tools)

Understanding the data ecosystem

Ecosystem – a group of elements that interact with one another
Data Ecosystem – a combination of hardware tools, software tools, and human resources that interact with one another in order to produce, manage, store, organize, analyze, and share data*
Cloud – a place to keep data online
*The job of data analysts is to harness the power of the data ecosystem to find the right information, and provide analysis that helps make smart decisions

Data Scientist vs. Data Analyst
Data Science – creating new ways of modeling and understanding the unknown by using raw data
Data Scientists – create new questions using data
Data Analysts – find answers to existing questions by creating insights from data sources

Data Analysis vs. Data Analytics
Data Analysis – collection, transformation, and organization of data in order to draw conclusions that help drive informed decision-making
Data Analytics – the science of data, a broad concept that encompasses everything from the job of managing and using data to the tools and methods that data workers use every day

Data-driven decision-making – using facts to guide business strategy
- Involves figuring out business needs, finding relevant data, analyzing it, then using it to uncover trends, patterns, and relationships
- Most powerful when data is combined with human experience, observation, and intuition (in certain cases)
- Insights from subject matter experts must be included, as they can identify inconsistencies, make sense of gray areas, and eventually validate choices being made

Data Analysis Process:
o Ask questions and define the problem
o Prepare data by collecting and storing the information
o Process data by cleaning and checking the information
o Analyze data to find patterns, relationships, and trends
o Share data with your audience
o Act on
the data and use the analysis results
* Blending data with business knowledge, plus sometimes a touch of gut instinct, is a common part of the process.
* Good questions to ask when figuring out how much business knowledge and gut instinct should be involved in each project:
  (a) What kind of results are needed?
  (b) Who will be informed?
  (c) Am I answering the question being asked?
  (d) How quickly does a decision need to be made?

Data Analysis Life Cycle
*No single defined structure of phases, but the fundamentals are always shared.
(https://online.hbs.edu/blog/post/data-life-cycle)
(https://www.informit.com/articles/article.aspx?p=2473128&seqNum=11&ranMID=24808)

Google’s process:
1. Ask: Business Challenge/Objective/Question
2. Prepare: Data generation, collection, storage, and data management
3. Process: Data cleaning/data integrity
4. Analyze: Data exploration, visualization, and analysis
5. Share: Communicating and interpreting results
6. Act: Putting your insights to work to solve the problem

Dell’s process:
1. Discovery
2. Pre-processing data
3. Model planning
4. Model building
5. Communicate results
6. Operationalize

Project-based data analytics life cycle:
1. Identifying the problem
2. Designing data requirements
3. Pre-processing data
4. Performing data analysis
5. Visualizing data
(http://pingax.com/understanding-data-analytics-project-life-cycle/#google_vignette)

Big Data analytics life cycle:
1. Business case evaluation
2. Data identification
3. Data acquisition and filtering
4. Data extraction
5. Data validation and cleaning
6. Data aggregation and representation
7. Data analysis
8. Data visualization
9. Utilization of analysis results
(https://www.informit.com/articles/article.aspx?p=2473128&seqNum=11&ranMID=24808)

Course 1, Week 2

Embracing Data Analyst Skills

Analytical Skills – qualities and characteristics associated with solving problems using facts.
1. Curiosity – wanting to learn something
2. Understanding Context – the condition in which something exists or happens
3. Having a Technical Mindset – the ability to break things down into smaller steps or pieces, and work with them in an orderly and logical way
4. Data Design – the skill of organizing information
5. Data Strategy – the management of people, processes, and tools used in data analysis
   a. People need to know how to use the right data to find solutions to the problem
   b. Processes ensure the path to the solution is clear and accessible
   c. The right technological tools need to be used for the job

Thinking About Analytical Thinking

Analytical Thinking – identifying and defining a problem and then solving it by using data in an organized, step-by-step manner

Five Key Aspects to Analytical Thinking:
1. Visualization
   - The graphical representation of information that allows information to be understood and explained more effectively
   - Examples include graphs, maps, or other design elements
2. Strategy
   - Necessary to stay focused and on track amidst a large quantity of data
   - Involves understanding what needs to be achieved with data and identifying how to get there
   - Improves quality and usefulness of data collected
3. Problem-Orientation
   - Using a problem-oriented approach in identifying, describing, and solving problems
   - Keeping the problem top of mind throughout the entire project
4. Correlation
   - Identifying relationships between two or more pieces of data
   - REMINDER: CORRELATION DOES NOT EQUAL CAUSATION. Just because pieces of data are trending in the same direction does not necessarily mean they’re related
5. Big-Picture / Detail-Oriented Thinking
   a. Big-Picture Thinking
      - Looking at the whole picture without getting stuck on every tiny piece of information
      - Important to zoom out and see possibilities and opportunities
   b. Detail-Oriented Thinking
      - Figuring out all of the aspects that will help execute plans

Thinking Methods:
1. Analytical Thinking
2. Critical Thinking
3. Creative Thinking

Usual questions by Data Analysts:
1. “What is the root cause of the problem?”
   - Can be addressed through the Five Whys
   - Ask “why?” five times to get to the root cause of the problem
2. “Where are the gaps in our process?”
   - Can be addressed through Gap Analysis
   - Examination and evaluation of how a process currently works in order to achieve ideal future improvement
   - General approach is to understand where the process is now in comparison to where it should ideally be, then identify the gap and how to bridge the current and future state
3. “What did we not consider before?”
   - A great way to think about information or procedures that might be missing from a process, in order to improve future decision-making and strategy

Thinking About Outcomes

Data-driven decision-making allows businesses to:
- Gain valuable insights
- Verify theories or assumptions
- Better understand opportunities and challenges
- Support objectives
- Help make plans
- Allows for greater confidence in choices and ability to address business challenges
- Allows for proactivity when opportunities present themselves
- Allows saving of time and effort when working towards a goal

Practicing necessary skills:
- Curiosity and Context – curiosity about patterns and relationships in everyday life, then using context to make predictions, research answers, and draw conclusions
- Having a Technical Mindset – building on gut feelings and using a technical approach to explore them (seek out facts, analyze, then use insights to make informed decisions)
- Data Design – actively designing day-to-day data so that it is organized in a logical way that makes it easy to access, understand, and make the most of
- Data Strategy – making sure others are on board with the procedures in place and the technology being used in gathering and using data

Course 1, Week 3

Data Life Cycle

Six Stages of Data Life Cycle:
1. Plan
   When it’s decided what kind of data is needed, how it will be managed, and who will be responsible for it
2. Capture
   When data is collected from a variety of sources
   Data can be collected from outside resources (e.g., publicly available datasets) or from an internal database*
   *When maintaining an internal database, ensuring data integrity, credibility, and privacy is important
3. Manage
   When data is cared for and maintained. Includes determining how and where it is stored and the tools used to do so. Integral to the process of data cleansing
4. Analyze
   When data is used to solve problems, make informed decisions, and support business goals
5. Archive
   When data is stored for the long term and for future reference
6. Destroy
   When data is removed from storage and any shared copies deleted to protect a company’s private information and private data about its customers
   To destroy data on hard drives, secure data erasure software will be used. To destroy paper files, they will be shredded
(https://www.sfmagazine.com/articles/2018/july/the-data-life-cycle/?psso=true)
(https://online.hbs.edu/blog/post/data-life-cycle)

Data Analysis Process Outline

Six Phases of Data Analysis Life Cycle:
1. Ask: Define the problem and confirm stakeholder expectations
   - “What is the problem that we’re trying to solve?”
   - “What is the purpose of this analysis?”
   - “What are we hoping to learn?”
2. Prepare: Collect and store data for analysis
   - Think about what kind of data we need based on what is learned from the Ask phase (e.g., qualitative vs. quantitative, cross-sectional / points in time vs. longitudinal over a long period of time)
   - Think about how to collect data (e.g., existing data? Brand new data?)
3. Process: Clean and transform data to ensure integrity
   - Begins with cleaning data, and understanding its structure, quirks, nuances, and its potential to answer business questions
   - Involves quality assurance checks (e.g., checking if all data anticipated is available, minimizing missing data or gaps in data collection effort, identifying outliers) to ensure data can be analyzed appropriately and responsibly
4. Analyze: Use data analysis tools to draw conclusions
   - Must be objective and unbiased
   - Involves a series of analyses that are planned as early as the Ask phase
   - Involves looking for patterns (while being mindful of personal intuition)
5. Share: Interpret and communicate results to others to make data-driven decisions
   - May begin by sharing high-level findings with leadership, followed by gradual digging and sharing of information with the rest of the organization
6. Act: Put your insights to work in order to solve the original problem
   - Based on results of analysis, decide on interventions not only on an organizational level but also on the team level

Data Analysis Toolbox

Most common tools:
1. Spreadsheets
   A digital worksheet that allows you to:
   o Collect, store, organize, and sort information
   o Identify patterns and piece the data together in a way that works for each specific data project
   o Create excellent data visualizations, like graphs and charts
   Two most popular ones are Microsoft Excel and Google Sheets
   Useful features include Formulas and Functions
   o Formula – a set of instructions that performs a specific calculation using data in a spreadsheet (e.g., MDAS, average, sum of values that meet a particular rule)
   o Function – a preset command that automatically performs a specific process or task using the data in a spreadsheet, allowing for efficiency
2. Query Languages
   A computer programming language that allows you to:
   o Isolate specific information from a database(s)*
   o Make it easier to learn and understand requests made to databases
   o Select, create, add, or download data from a database for analysis
   *A database is a collection of structured data stored in a computer system
   Most popular is SQL (Structured Query Language), often pronounced “sequel”
   Akin to requesting the database to act on a command (e.g.,
insert, delete, select, or update data)
3. Visualization Tools
   Using graphical representations of information (e.g., graphs, maps, tables) to better communicate insights to others
   Most popular ones include Tableau and Looker
   o Tableau – simple drag-and-drop feature lets users create interactive graphs in dashboards and worksheets
   o Looker – communicates directly with a database, connecting the data straight to the chosen visual tool
   Allows you to:
   o Turn complex numbers into a story that people can understand
   o Help stakeholders come up with conclusions that lead to informed decisions and effective business strategies
4. Programming Languages
   Most common languages used by Data Analysts include R and Python, both used for statistical analysis, visualization, and other data analysis

Choosing the right tools

During the Share phase of Data Analysis, Data Visualization tools are mostly used to create complex and eye-catching visualizations
During the Prepare, Process, and Analyze phases of Data Analysis, Spreadsheets and Query Languages are most useful. The differences between spreadsheets and databases are outlined below:

Spreadsheets | Databases
Software applications | Data stores, accessed using a query language (e.g., SQL)
Structure data in a row and column format | Structure data using rules and relationships
Organize information in cells | Organize information in complex collections
Provide access to a limited amount of data | Provide access to huge amounts of data
Manual data entry | Strict and consistent data entry
Generally one user at a time | Multiple users
Controlled by the user | Controlled by a database management system

Course 1, Week 4

Mastering Spreadsheet Basics

Main features of a spreadsheet:
1. Cell
2. Column
   Column labels are called Attributes (also referred to as column names, column labels, headers, or header row)
3. Row
   Also called an Observation
4. Formulas
   A set of instructions that performs specific actions using the data in the spreadsheet
   Use cell references for the values calculated
   Always begin with an “=” sign
5. Functions

Training references:
Google Sheets Training and Help
https://support.google.com/a/users/answer/9282959?visit_id=637361702049227170-1815413770&rd=1
Google Sheets Cheat Sheet
https://support.google.com/a/users/answer/9300022
Microsoft Excel Video Training
https://support.microsoft.com/en-us/office/excel-video-training-9bc05390-e94c-46af-a5b3-d7c22f6990bb

Structured Query Language (SQL)

- Useful for storing, organizing, and analyzing data just like spreadsheets, but at a larger scale (like a supersized spreadsheet)
- Needs a database that understands its language, and its queries* are universal
*Query – a request for data or information from a database

Syntax
o A unique set of guidelines followed by programming languages (like SQL)
o The predetermined structure of a language that includes all required words, symbols, and punctuation, as well as their proper placement

SQL Syntax:
o SELECT – use to choose the columns you want to return
o FROM – use to choose the tables where the columns you want are located
o WHERE – use to filter for certain information

Sample structure:
    #2 SELECT [choose the column(s) you want]
    #1 FROM [the appropriate table the data lives on]
    #3 WHERE [a certain condition is met]
Fill in the information in sequence #1-3; the suggested order is to start big (data table) and go small (specific conditions)
Putting each clause on a new line and indenting it is not required by SQL, but it keeps the query readable

Example (pulling data on customers with the first name Tony):
    SELECT first_name
    FROM customer_data.customer_name
    WHERE first_name = 'Tony'

Multiple columns in a query structure:
    SELECT columnA, columnB, columnC
    FROM Table where the data lives
    WHERE Certain condition is met

Example (pulling data on customers with the first name Tony):
    SELECT customer_id, first_name, last_name
    FROM customer_data.customer_name
    WHERE first_name = 'Tony'
In general, it is more efficient to select only the columns that are needed (such as the additional fields that the WHERE clause actually uses)

Multiple columns and WHERE clause in a query structure:
    SELECT columnA, columnB, columnC
    FROM Table where the data lives
    WHERE Condition 1
        AND Condition 2
        AND Condition 3
- The SELECT command uses commas to separate fields/variable parameters (no comma after the last one)
- The WHERE command uses the AND statement to connect conditions
- There are other connectors/operators for the WHERE command, such as OR and NOT

Example (pulling data on customers with multiple conditions):
    SELECT customer_id, first_name, last_name
    FROM customer_data.customer_name
    WHERE customer_id > 0
        AND first_name = 'Tony'
        AND last_name = 'Magnolia'

SQL Guide

A. Capitalization, indentation, and semicolons
o SQL queries can be written in all lower case and with extra spaces between words
o Capitalization and indentation can help read information more easily (better formatting)
o A semicolon may be required as a statement terminator
   - Part of the American National Standards Institute (ANSI) SQL-92 standard, in which the semicolon is used as common syntax
   - If a statement works without a semicolon, it’s fine

B. WHERE conditions
o SELECT clause – identifies the column you want to pull data from
o FROM clause – identifies the table where the column is located
o WHERE clause – narrows the query so that the database returns only the data with an exact value match or the data that matches the input condition
o LIKE – can be used to tell the database to look for certain patterns
o % (percent sign) – used as a wildcard to match zero or more characters; some databases use * (asterisk) instead
o <> – used to create conditions meaning “does not equal”
o Example:
    Specific query: WHERE field1 = 'Chavez'
    Pattern query: WHERE field1 LIKE 'Ch%'
    Pattern query (databases that use the asterisk): WHERE field1 LIKE 'Ch*'

C. SELECT all columns
o SELECT * – selects all columns in the table
o Although a correct SQL statement from a syntax point of view, it should be used sparingly and with caution as it can cause a query to run slowly

D. Comments
o Used when tables aren’t designed with descriptive enough naming conventions
o Good practice to save time and energy when trying to understand previously written queries
o Comments are text placed between the characters /* and */, or after two dashes (--)
o Can be placed outside of a statement or within a statement
o Example:
    SELECT field1 /* this is the last name column */
    FROM table -- this is the customer data table
    WHERE field1 LIKE 'Ch%';
o Example:
    -- This is an important query used later to join with the accounts table
    SELECT
        rowkey, -- key used to join with account_id
        info.date, -- date is in string format YYYY-MM-DD HH:MM:SS
        info.code -- e.g., 'pub-###'
    FROM Publishers

E. Aliases
o Assigns a new name or alias to the column or table names to make them easier to work with
o Uses the SQL “AS” clause
o Can be used to avoid the need for comments
o Only good for the duration of the query
o Doesn’t change the actual name of a column or table in the database
o Example:
    field1 AS last_name -- Alias to make my work easier
    table AS customer -- Alias to make my work easier
    SELECT last_name
    FROM customer
    WHERE last_name LIKE 'Ch%'

F. SQL Tutorials:
(https://www.w3schools.com/sql/default.asp)
(https://www.sqltutorial.org/sql-cheat-sheet/)

Data Visualization

- Graphical representation of information
- Allows data to be easily understood and interesting to look at

Steps to plan a data visualization:
1. Explore the data for patterns
   Reviewing basic information, behaviors, numerical data (e.g., sales, basket size), qualitative data (e.g., gender, mobile/desktop), geographical information, etc.
2. Plan your visuals
   Identify which data, findings, and patterns should be included in the visualization
   Ex.
   Show sales over time, connect sales to location, show the relationship between sales and website use, show which customers fuel growth
3. Create your visuals
   Creating the right visualization for a presentation is a process which involves trying different visualization formats and making adjustments as necessary
   The idea is to create the most compelling story for stakeholders
   - Line charts – can track sales over time
   - Maps – can connect sales to locations
   - Donut charts – can show customer segments
   - Bar charts – can compare total visitors and visitors that make a purchase

Data Visualization Toolkit:
- Can use built-in visualization tools in spreadsheets
- Can use more advanced tools such as Tableau that integrate data into dashboard-style visualizations
- With the programming language R, can use visualization tools in RStudio (an independent integrated development environment / IDE) for visualization needs
- Choice will be driven by a variety of factors, including size of data and the process used for analyzing data (e.g., spreadsheets, databases/queries, or programming languages)

Course 1, Week 5

Issue – A topic or subject to investigate
Question – Designed to discover information
Problem – An obstacle or complication that needs to be worked out
Business Task – The question or problem data analysis answers for a business, e.g.,
“Analyze weather data from the last decade to identify predictable patterns”
Data-driven decision-making – Using facts observed from data analysis to guide business strategy
Fairness – Ensuring that analysis doesn’t create or reinforce bias

Course 2 – Ask Questions to Make Data-Driven Decisions

Course 2, Week 1

Problem-solving and effective questioning

Structured thinking
- The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying the options
- Used to address a vague, complex problem by breaking it down into smaller steps that will lead to logical solutions

Take action with data

Structured Thinking:
- Breaking the data analysis process into smaller, manageable parts (four basic activities):
   o Recognizing the current problem or situation
   o Organizing available information
   o Revealing gaps and opportunities
   o Identifying options

Phase: Ask
Primary Goal: Figure out what problem is being solved
Objectives/Considerations:
- Define the problem
- Understand stakeholder’s expectations
- Focus on the actual problem and avoid distractions
- Collaborate with stakeholders and keep an open line of communication
- Take a step back and see the whole situation in context
Relevant Questions:
1. What are my stakeholders saying their problems are?
2. How can I help the stakeholders resolve their questions?

Phase: Prepare
Primary Goal: Decide what data needs to be collected and how to organize it in order to answer questions / resolve problems
Objectives/Considerations:
- What metrics to measure
- Locate data in database
- Create security measures to protect data
Relevant Questions:
1. What do I need to figure out how to solve this problem?
2. What research do I need to do?

Phase: Process
Primary Goal: Clean data and get rid of possible errors, inaccuracies, or inconsistencies
Objectives/Considerations:
- Use spreadsheet functions to find incorrectly entered data
- Use SQL functions to check for extra spaces
- Remove repeated entries
- Check as much as possible for bias in the data
Relevant Questions:
1. What data errors or inaccuracies might get in my way of getting the best possible answer to the problem I am trying to solve?
2. How can I clean my data so the information I have is more consistent?

Phase: Analyze
Primary Goal: Think analytically about data (make sure it’s sorted and formatted for easier use)
Objectives/Considerations:
- Perform calculations
- Combine data from multiple sources
- Create tables with your results
Relevant Questions:
1. What story is my data telling me?
2. How will my data help me solve this problem?
3. Who needs my company’s products or services? What type of person is most likely to use it?

Phase: Share
Primary Goal: Summarize results with clear and enticing visuals to show stakeholders how to solve problems and how the answers were reached
Relevant Questions:
1. How can I make what I present to the stakeholders engaging and easy to understand?
2. What would help me understand this if I were the listener?

Phase: Act
Primary Goal: Take everything learned from data analysis and put it to use (e.g., provide stakeholders with recommendations based on findings in order to make data-driven decisions)
Relevant Questions:
1. How can I use the feedback I received during the share phase (step 5) to actually meet the stakeholder’s needs and expectations?

Objectives/Considerations (Share and Act):
- Solve problems with data
- Make better decisions
- Make more informed decisions
- Lead to stronger outcomes
- Successfully communicate findings

Common Problem Types:
1. Making predictions
   Using data to make an informed business decision about how things may be in the future (e.g., a company that wants to know the best advertising method to bring in new customers)
2. Categorizing things
   Assigning information to different groups or clusters based on common features (e.g., to improve customer satisfaction, classify customer service calls based on keywords or scores to identify top performers and help correlate certain actions taken with higher customer satisfaction scores)
3. Spotting something unusual
   Identifying data that is different from the norm (e.g.,
   Smart watches analyzing aggregated health data can help product developers determine the right algorithms to spot or set off alarms when certain data doesn’t trend normally)
4. Identifying themes
   Grouping categorized information into broader concepts (e.g. UX designers needing help to identify themes in order to prioritize the right product features for improvement. Examples of themes in a user study include beliefs, practices, and needs)
5. Discovering connections
   Finding similar challenges faced by different entities and combining data and insights to address them (e.g. a third-party logistics (3PL) company working with another company to get shipments delivered to customers on time by analyzing wait times at shipping hubs)
6. Finding patterns
   Using historical data to understand what happened in the past and is therefore likely to happen again (e.g. minimizing downtime caused by machine failure by analyzing maintenance data to discover when / why most failures happen)
Craft effective questions
- Avoid asking leading questions – Questions that lead respondents to answer in a certain way
- Avoid asking closed-ended questions – Questions that can be answered with a yes or no
- Avoid asking vague questions – Questions that are too vague and lack context
SMART Questions
1. Specific
   Specific questions are simple, significant, and focused on a single topic or a few closely related ideas
   Bad Question: “Are kids getting enough exercise these days?”
   Good Question: “What percentage of kids achieve the recommended 60 minutes of physical activity at least five days a week?”
2. Measurable
   Measurable questions can be quantified and assessed
   Bad Question: “Why did our recent video go viral?”
   Good Question: “How many times was our video shared on social channels the first week it was posted?”
3. Action-oriented
   Action-oriented questions encourage change
   Bad Question: “How can we get customers to recycle our product packaging?”
   Good Question: “What design features will make our packaging easier to recycle?”
4. Relevant
   Relevant questions matter, are important, and have significance to the problem you’re trying to solve
   Bad Question: “Why does it matter that Pine Barrens tree frogs started disappearing?”
   Good Question: “What environmental factors changed in Durham, North Carolina, between 1983 and 2004 that could cause Pine Barrens tree frogs to disappear from the Sandhills Regions?”
5. Time-bound
   Time-bound questions specify the time to be studied
   Bad Question: “Why does it matter that Pine Barrens tree frogs started disappearing?”
   Good Question: “What environmental factors changed in Durham, North Carolina, between 1983 and 2004 that could cause Pine Barrens tree frogs to disappear from the Sandhills Regions?”
Common topics for SMART questions:
- Objectives (e.g. “What are the goals of the deep dive? What, if any, questions are expected to be answered by this deep dive?”)
- Audience (e.g. “Who are the stakeholders? Who is interested or concerned about the results of this deep dive? Who is the audience for the presentation?”)
- Time (e.g. “What is the time frame for completion? By what date does this need to be done?”)
- Resources (e.g. “What resources are available to accomplish the deep dive's goals?”)
- Security (e.g. “Who should have access to the information?”)
Questions should always take fairness into account
- Fairness means ensuring that questions don’t create or reinforce bias
- Examples that fail this test because they presume a positive answer: “These are the best sandwiches ever, aren’t they?” and “What do you love most about our exhibits?”
Sample questions:
Take good notes
- The ideal process is to ask questions, clarify understanding of responses, and then briefly record them in notes
- If a question is worth asking, then the answer is worth recording
Important aspects of the conversation to note include:
- Facts – Write down any concrete piece of information, such as dates, times, names, and other specifics
- Context – Facts without context are useless.
  Note any relevant details that are needed in order to understand the information being gathered
- Unknowns – Sometimes important questions are missed during a conversation. Make a note of them when that happens so answers can be figured out later
Sample notes:
A good guideline to think about:
- Stakeholder’s business goals; in this case, the stakeholder is the person you had a conversation with
- Identifying the data needed to answer the SMART questions
- Exploring what data the stakeholder already has
- Determining the data that you don’t have, but need in order to answer the questions
Course 2, Week 2
Understand the Power of Data
Data-Driven Decision Making
- Finding patterns and important insights from a collection of facts to make informed business decisions
Data-Inspired Decision Making
- Exploring different data sources to find out what they have in common
Algorithm – A process or set of rules to be followed for a specific task
Quantitative Data – Specific and objective measures of numerical facts
- What? How many? How often? / Things that can be measured
- Examples of measurable questions: “How many negative reviews are there?” “What’s the average rating?” “How many of these reviews use the same keywords?”
Qualitative Data – Subjective or explanatory measures of qualities and characteristics
- Things that can’t be measured by numerical data
- Great for helping answer “why” questions
- Examples of non-measurable questions: “Why are customers unsatisfied?” “How can we improve their experience?”
Qualitative Data Tools
- Focus Groups
- Social Media Text Analysis
- In-Person Interviews
Quantitative Data Tools
- Structured Interviews
- Surveys
- Polls
Follow the Evidence
Two common data presentation tools include:
1. Reports – A static collection of data given to stakeholders periodically
2. Dashboards – Live reflection of incoming data.
   Organizes information from multiple datasets into one central location

Data Presentation Tool: Reports
Description: A static collection of data given to stakeholders periodically
Pros:
- High-level historical data
- Easy to design
- Static, pre-cleaned and sorted data
Cons:
- Continual maintenance
- Less visually appealing

Data Presentation Tool: Dashboards
Description: Monitors live, incoming data
Pros:
- Dynamic, automatic, and interactive
- Shows more data
- More stakeholder access
- Low maintenance
- Visually appealing
Cons:
- Labor-intensive design
- Can be confusing
- Not suitable if the data report will not be used very often
- Potentially uncleaned data
- Interface is susceptible to breaking

Pivot Table – A data summarization tool that is used in data processing. Pivot tables are used to summarize, sort, reorganize, group, count, total, or average data stored in a database.
Metrics – A single quantifiable type of data that can be used for measurement
- Usually involves simple math and can be combined into formulas that we can plug our numerical data into, e.g.
  o Revenue by individual salesperson = (number of individual sales) x (sales price)
  o ROI (Return on Investment) = (net profit over a period of time) / (cost of investment)
  o Customer Retention Rate (ability to keep customers over time) = (customers at the end of the period − new customers acquired during the period) / (customers at the beginning of the period)
Metric Goal – A measurable goal set by a company and evaluated using metrics
Benefits of using dashboards for analysts and stakeholders:
Creating a dashboard
Here is a process you can follow to create a dashboard:
1. Identify the stakeholders who need to see the data and how they will use it
   To get started with this, you need to ask effective questions. Check out this Requirements Gathering Worksheet to explore a wide range of good questions you can use to identify relevant stakeholders and their data needs. This is a great resource to help guide you through this process again and again.
2. Design the dashboard (what should be displayed)
   Use these tips to help make your dashboard design clear, easy to follow, and simple:
   - Use a clear header to label the information
   - Add short text descriptions to each visualization
   - Show the most important information at the top
3. Create mock-ups if desired
   This is optional, but a lot of data analysts like to sketch out their dashboards before creating them.
4. Select the visualizations you will use on the dashboard
   You have a lot of options here and it all depends on what data story you are telling. If you need to show a change of values over time, line charts or bar graphs might be the best choice. If your goal is to show how each part contributes to the whole amount being reported, a pie or donut chart is probably a better choice. To learn more about choosing the right visualizations, check out Tableau’s galleries:
   - For more samples of area charts, column charts, and other visualizations, visit Tableau’s Viz Gallery. This gallery is full of great examples that were created using real data; explore this resource on your own to get some inspiration.
   - Explore Tableau’s Viz of the Day to see visualizations curated by the community. These are visualizations created by Tableau users and are a great way to learn more about how other data analysts are using data visualization tools.
5. Create filters as needed
   Filters show certain data while hiding the rest of the data in a dashboard. This can be a big help to identify patterns while keeping the original data intact. It is common for data analysts to use and share the same dashboard, but manage their part of it with a filter. To dig deeper into filters and find an example of filters in action, you can visit Tableau’s page on Filter Actions. This is a useful resource to save and come back to when you start practicing using filters in Tableau on your own.
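As a rough illustration of two ideas from this section, the Python sketch below computes simple metrics and applies a dashboard-style filter. Everything in it is an assumption made for illustration: the `SALES` records, their field names, and the helper functions `apply_filter`, `revenue_by_salesperson`, and `roi` are hypothetical and are not part of Tableau or any spreadsheet tool. The point it tries to show is that a filter returns a narrowed view while the original data stays intact.

```python
from collections import defaultdict

# Hypothetical sales records; every value here is invented for illustration.
SALES = [
    {"salesperson": "Ana", "region": "North", "units": 3, "price": 20.0},
    {"salesperson": "Ben", "region": "South", "units": 5, "price": 20.0},
    {"salesperson": "Ana", "region": "South", "units": 2, "price": 20.0},
]


def apply_filter(rows, **criteria):
    """Return a new list holding only the rows that match every criterion.

    The input list is never modified, so the original data stays intact.
    """
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def revenue_by_salesperson(rows):
    """Metric: revenue per salesperson = (number of sales) x (sales price)."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["salesperson"]] += r["units"] * r["price"]
    return dict(totals)


def roi(net_profit, cost_of_investment):
    """Metric: ROI = (net profit over a period) / (cost of investment)."""
    return net_profit / cost_of_investment


# A filtered view for one region, computed without touching SALES.
south_only = apply_filter(SALES, region="South")
south_revenue = revenue_by_salesperson(south_only)
```

Filtering on `region="South"` leaves `SALES` with its original three rows, which loosely mirrors how several analysts can share one dashboard while each works with their own filtered view.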
Types of Dashboards
- Strategic – focuses on long-term goals and strategies at the highest level of metrics
- Operational – short-term performance tracking and intermediate goals
- Analytical – consists of the datasets and the mathematics used in these sets
Strategic Dashboards
- Used in evaluating and aligning strategic goals, providing information over the longest time frame (from a single financial quarter to years)
- Typically contain information used for enterprise-wide decision-making
Operational Dashboards
- Arguably the most common type of dashboard, containing information on a time scale of days, weeks, or months and giving near real-time insight into performance
- Allow businesses to track and maintain immediate operational processes in light of strategic goals
Analytical Dashboards
- Contain vast amounts of data used by data analysts. They also contain details involved in the usage, analysis, and predictions made by data scientists.
- Created and maintained by data science teams and rarely shared with upper management, as they are difficult to understand
Connecting the Data Dots
Mathematical Thinking
- Looking at a problem and logically breaking it down step-by-step to see the relationship of patterns in data in order to better analyze problems
- Helps figure out the best tools to use for analysis (e.g. different sizes of datasets will require different tools)
Small Data
- Specific
- Short time-period
- Day-to-day decisions
Big Data
- Large and less-specific
- Long time-period
- Big decisions
Course 2, Week 3
Working with Spreadsheets
Common Spreadsheet Math Functions:
- Sum
- Average
- Count
- Min
- Max
Spreadsheet Tasks:
- Organize your data
  o Pivot table
  o Sort and filter
- Calculate your data
  o Formulas
  o Functions
Spreadsheets and the Data Life Cycle
Sample Open Data Sources:
- World Bank (https://data.worldbank.org/)
- World Health Organization
- Google Public Data Explorer
- U.S. Census Bureau
- Philippine Statistics Authority (https://www.psa.gov.ph/)
Formulas in Spreadsheets
Formula – A set of instructions that performs a specific calculation
Operator – A symbol that names the type of operation or calculation to be performed
Cell Reference – A single cell or a range of cells in a worksheet that can be used in a formula
Range – A collection of two or more cells
Functions in Spreadsheets
Function – A preset command that automatically performs a specific process or task using the data, e.g. SUM, AVERAGE, COUNT, MIN, MAX
Save Time with Structured Thinking
Problem Domain – The specific area of analysis that encompasses every activity affecting or affected by the problem
Structured Thinking – The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying the options
Scope of Work (SOW) – An agreed-upon outline of the work you’re going to perform on a project
- Usually includes work details, schedules, and reports
- For data analysis, normally includes data preparation, validation, analysis of quantitative and qualitative datasets, initial results, and reporting visuals
To keep data collection objective, ask the following about the data collected:
- Who
- What
- When
- Where
- How
- Why
Course 2, Week 4
Balance Team and Stakeholder Needs
Stakeholders – People who have invested time, interest, and resources into the projects to be worked on
Project Managers – Responsible for planning and executing projects
Common Project Stakeholders:
Working Effectively with Stakeholders:
Focus on what matters:
1. Who are the primary and secondary stakeholders?
2. Who is managing the data?
3. Where can we go for help?
Communication is Key
- Different teams have different expectations on communication
- It’s normal to feel lost at first, but it’s important to keep learning as you go and not be afraid to ask for clarification
- Set realistic timelines and prepare for roadblocks
- Flag potential delays to stakeholders as early as possible
Good writing habits for emails:
- Use complete sentences with proper spelling and punctuation
- Write clearly enough that anyone could understand
- Read emails out loud
- Don’t write too much; be clear and concise
- Keep it short and to the point; polite and well-written
- Reply promptly
- If the discussion is too long, set up a meeting instead
Communication skills needed by data analysts:
- Listening
- Speaking
- Presenting
- Writing
Communication tip: Know your audience (can be used as a guideline for email flow)
1. Who is your audience?
2. What do they already know?
3. What do they need to know?
4. How can you best communicate what they need to know?
When approached with a request that is hasty:
1. Reframe question
2. Problems
3. Challenges
4. Solutions
5. Timelines
e.g. “I can certainly check out the rates of completion, but I sense there may be more to the story here. Could you give me two days to run some reports and learn what’s really going on?”
Limitations of Data:
When thinking about communicating findings, consider the following:
1. Does the analysis answer the original question?
2. Are there angles that haven’t been considered?
3. Can we answer questions that may get asked about the data and analysis?
4. How detailed should we be when sharing the results? Would high-level analysis be okay?
5. Does the analysis help the team make better, more informed decisions?
Amazing Teamwork
Meeting Best Practices:
Do’s:
- Come prepared
  o Bring what you need
  o Read the meeting agenda
  o Prepare notes and presentations
  o Be ready to answer questions
- Be on time
- Pay attention
- Ask questions
Don’ts:
- Show up unprepared
- Arrive late
- Be distracted
- Dominate the conversation (give others a chance to talk and let them finish speaking)
- Talk over others
- Distract people with unfocused discussion
Other tips:
- Every meeting should focus on making a clear decision and should include the person needed to make the decision
- Schedule meetings immediately if decisions need to be made
- Try to keep meeting participants under 10
- When leading a meeting:
  o Make sure to build and send an agenda beforehand
  o Try to keep everyone involved
  o Let everyone know the floor is open for questions after the meeting
  o Take notes
  o Afterwards, follow up on questions and send updates
  o Try to have everyone put their phones or computers on silent when not speaking
Leading great meetings:
Conflict resolution:
- The most common reasons for conflict are mismatched expectations and miscommunications
- When conflicts arise, instead of focusing on who’s at fault, it’s best to reframe the problem, e.g. “How can I best help you reach your goal?”
- Find opportunities for the team to work together instead of feeling frustrated by the problem
- Discussion is key to conflict resolution
- Start a conversation
- Understand the context
Course 3, Week 1
Collecting Data
Differentiate Between Data Formats & Structures
Explore Data Types, Fields, and Values