Course 2: Data in Action

From issue to action: The six data analysis phases

There are six data analysis phases that will help you make seamless decisions: ask, prepare, process, analyze, share, and act. Keep in mind, these are different from the data life cycle, which describes the changes data goes through over its lifetime. Let's walk through the steps to see how they can help you solve problems you might face on the job.

Step 1: Ask
It's impossible to solve a problem if you don't know what it is. These are some things to consider:
Define the problem you're trying to solve
Make sure you fully understand the stakeholder's expectations
Focus on the actual problem and avoid any distractions
Collaborate with stakeholders and keep an open line of communication
Take a step back and see the whole situation in context
Questions to ask yourself in this step:
What are my stakeholders saying their problems are?
Now that I've identified the issues, how can I help the stakeholders resolve their questions?

Step 2: Prepare
You will decide what data you need to collect in order to answer your questions and how to organize it so that it is useful. You might use your business task to decide:
What metrics to measure
Where to locate data in your database
What security measures to create to protect that data
Questions to ask yourself in this step:
What do I need to figure out how to solve this problem?
What research do I need to do?

Step 3: Process
Clean data is the best data, and you will need to clean up your data to get rid of any possible errors, inaccuracies, or inconsistencies. This might mean:
Using spreadsheet functions to find incorrectly entered data
Using SQL functions to check for extra spaces
Removing repeated entries
Checking as much as possible for bias in the data
Questions to ask yourself in this step:
What data errors or inaccuracies might get in my way of getting the best possible answer to the problem I am trying to solve?
How can I clean my data so the information I have is more consistent?

Step 4: Analyze
You will want to think analytically about your data. At this stage, you might sort and format your data to make it easier to:
Perform calculations
Combine data from multiple sources
Create tables with your results
Questions to ask yourself in this step:
What story is my data telling me?
How will my data help me solve this problem?
Who needs my company's product or service? What type of person is most likely to use it?

Step 5: Share
Everyone shares their results differently, so be sure to summarize your results with clear and enticing visuals of your analysis, using tools like graphs or dashboards. This is your chance to show the stakeholders you have solved their problem and how you got there. Sharing will certainly help your team:
Make better decisions
Make more informed decisions
Lead to stronger outcomes
Successfully communicate your findings
Questions to ask yourself in this step:
How can I make what I present to the stakeholders engaging and easy to understand?
What would help me understand this if I were the listener?

Step 6: Act
Now it's time to act on your data. You will take everything you have learned from your data analysis and put it to use.
This could mean providing your stakeholders with recommendations based on your findings so they can make data-driven decisions.
Questions to ask yourself in this step:
How can I use the feedback I received during the share phase (step 5) to actually meet the stakeholder's needs and expectations?

These six steps can help you to break the data analysis process into smaller, manageable parts, which is called structured thinking. This process involves four basic activities:
Recognizing the current problem or situation
Organizing available information
Revealing gaps and opportunities
Identifying your options
When you are starting out in your career as a data analyst, it is normal to feel pulled in a few different directions with your role and expectations. Following processes like the ones outlined here and using structured thinking skills can help get you back on track, fill in any gaps, and let you know exactly what you need.

In a previous video, I shared how data analysis helped a company figure out where to advertise its services. An important part of this process was strong problem-solving skills. As a data analyst, you'll find that problems are at the center of what you do every single day, but that's a good thing. Think of problems as opportunities to put your skills to work and find creative and insightful solutions. Problems can be small or large, simple or complex. No problem is like another, and each requires a slightly different approach, but the first step is always the same: understanding what kind of problem you're trying to solve. That's what we're going to talk about now. Data analysts work with a variety of problems. In this video, we're going to focus on six common types. These include: making predictions, categorizing things, spotting something unusual, identifying themes, discovering connections, and finding patterns. Let's define each of these now. First, making predictions. This problem type involves using data to make an informed decision about how things may be in the future. For example, a hospital system might use remote patient monitoring to predict health events for chronically ill patients. The patients would take their health vitals at home every day, and that information combined with data about their age, risk factors, and other important details could enable the hospital's algorithm to predict future health problems and even reduce future hospitalizations. The next problem type is categorizing things. This means assigning information to different groups or clusters based on common features. An example of this problem type is a manufacturer that reviews data on shop floor employee performance. An analyst may create a group for employees who are most and least effective at engineering, a group for those most and least effective at repair and maintenance, a group for those most and least effective at assembly, and many more groups or clusters. Next, we have spotting something unusual. In this problem type, data analysts identify data that is different from the norm. An instance of spotting something unusual in the real world is a school system that has a sudden increase in the number of students registered, maybe as big as a 30 percent jump in the number of students. A data analyst might look into this upswing and discover that several new apartment complexes had been built in the school district earlier that year. They could use this analysis to make sure the school has enough resources to handle the additional students.
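A sketch can make the school example more concrete. The query below is only an illustration, not part of the course scenario: the student_enrollment table, its column names, and the choice of what counts as a suspicious jump are all assumptions.

-- Hedged sketch: the table name, column names, and the size of a "suspicious" jump
-- are hypothetical. The idea is simply to compare each year with the year before.
SELECT
  school_year,
  enrolled_students,
  LAG(enrolled_students) OVER (ORDER BY school_year) AS prior_year_students,
  ROUND(
    100.0 * (enrolled_students - LAG(enrolled_students) OVER (ORDER BY school_year))
    / LAG(enrolled_students) OVER (ORDER BY school_year),
    1
  ) AS pct_change
FROM student_enrollment
ORDER BY school_year;
-- A pct_change far above the norm, such as the 30 percent jump described above,
-- is the "something unusual" a data analyst would investigate further.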
Identifying themes is the next problem type. Identifying themes takes categorization a step further by grouping information into broader concepts. Let's go back to our manufacturer that has just reviewed data on the shop floor employees. First, these people are grouped by types and tasks. But now a data analyst could take those categories and group them into the broader concepts of low productivity and high productivity. This would make it possible for the business to see who is most and least productive, in order to reward top performers and provide additional support to those workers who need more training. Now, the problem type of discovering connections enables data analysts to find similar challenges faced by different entities, and then combine data and insights to address them. Here's what I mean: say a scooter company is experiencing an issue with the wheels it gets from its wheel supplier. That company would have to stop production until it could get safe, quality wheels back in stock. But meanwhile, the wheel company is encountering a problem with the rubber it uses to make wheels; it turns out its rubber supplier could not find the right materials either. If all of these entities could talk about the problems they're facing and share data openly, they would find a lot of similar challenges and, better yet, be able to collaborate to find a solution. The final problem type is finding patterns. Data analysts use data to find patterns by using historical data to understand what happened in the past and is therefore likely to happen again. Ecommerce companies use data to find patterns all the time. Data analysts look at transaction data to understand customer buying habits at certain points in time throughout the year. They may find that customers buy more canned goods right before a hurricane, or they purchase fewer cold-weather accessories like hats and gloves during warmer months. The ecommerce companies can use these insights to make sure they stock the right amount of products at these key times. Alright, you've now learned six basic problem types that data analysts typically face. As a future data analyst, this is going to be valuable knowledge for your career. Coming up, we'll talk a bit more about these problem types and I'll provide even more examples of them being solved by data analysts. Personally, I love real-world examples. They really help me better understand new concepts. I can't wait to share even more actual cases with you. See you there. Six problem types Data analytics is so much more than just plugging information into a platform to find insights. It is about solving problems. To get to the root of these problems and find practical solutions, there are lots of opportunities for creative thinking. No matter the problem, the first and most important step is understanding it. From there, it is good to take a problem-solver approach to your analysis to help you decide what information needs to be included, how you can transform the data, and how the data will be used. Data analysts typically work with six problem types: 1. Making predictions 2. Categorizing things 3. Spotting something unusual 4. Identifying themes 5. Discovering connections 6. Finding patterns A video, Common problem types, introduced the six problem types with an example for each. The examples are summarized below for review. Making predictions A company that wants to know the best advertising method to bring in new customers is an example of a problem requiring analysts to make predictions.
Analysts with data on location, type of media, and number of new customers acquired as a result of past ads can't guarantee future results, but they can help predict the best placement of advertising to reach the target audience. Categorizing things An example of a problem requiring analysts to categorize things is a company's goal to improve customer satisfaction. Analysts might classify customer service calls based on certain keywords or scores. This could help identify top-performing customer service representatives or help correlate certain actions taken with higher customer satisfaction scores. Spotting something unusual A company that sells smart watches that help people monitor their health would be interested in designing their software to spot something unusual. Analysts who have analyzed aggregated health data can help product developers determine the right algorithms to spot and set off alarms when certain data doesn't trend normally. Identifying themes User experience (UX) designers might rely on analysts to analyze user interaction data. Similar to problems that require analysts to categorize things, usability improvement projects might require analysts to identify themes to help prioritize the right product features for improvement. Themes are most often used to help researchers explore certain aspects of data. In a user study, user beliefs, practices, and needs are examples of themes. By now you might be wondering if there is a difference between categorizing things and identifying themes. The best way to think about it is: categorizing things involves assigning items to categories; identifying themes takes those categories a step further by grouping them into broader themes. Discovering connections A third-party logistics company working with another company to get shipments delivered to customers on time is a problem requiring analysts to discover connections. By analyzing the wait times at shipping hubs, analysts can determine the appropriate schedule changes to increase the number of on-time deliveries. Finding patterns Minimizing downtime caused by machine failure is an example of a problem requiring analysts to find patterns in data. For example, by analyzing maintenance data, they might discover that most failures happen if regular maintenance is delayed by more than a 15-day window. Key takeaway As you move through this program, you will develop a sharper eye for problems and you will practice thinking through the problem types when you begin your analysis. This method of problem solving will help you figure out solutions that meet the needs of all stakeholders. You've been learning about six common problem types data analysts encounter: making predictions, categorizing things, spotting something unusual, identifying themes, discovering connections, and finding patterns. Let's think back to our real-world example from a previous video. In that example, Anywhere Gaming Repair wanted to figure out how to bring in new customers. So the problem was how to determine the best advertising method for Anywhere Gaming Repair's target audience. To help solve this problem, the company used data to envision what would happen if it advertised in different places. Now, nobody can see the future, but the data helped them make an informed decision about how things would likely work out. So, their problem type was making predictions. Now let's think about the second problem type, categorizing things. Here's an example of a problem that involves categorization.
Let's say a business wants to improve its customer satisfaction levels. Data analysts could review recorded calls to the company's customer service department and evaluate the satisfaction levels of each caller. They could identify certain key words or phrases that come up during the phone calls and then assign them to categories such as politeness, satisfaction, dissatisfaction, empathy, and more. Categorizing these key words gives us data that lets the company identify top performing customer service representatives, and those who might need more coaching. This leads to happier customers and higher customer service scores. Okay, now let's talk about a problem that involves spotting something unusual. Some of you may have a smart watch, my favorite app is for health tracking. These apps can help people stay healthy by collecting data such as their heart rate, sleep patterns, exercise routine, and much more. There are many stories out there about health apps actually saving people's lives. One is about a woman who was young, athletic, and had no previous medical problems. One night she heard a beep on her smartwatch, a notification said her heart rate had spiked. Now in this example think of the watch as a data analyst. The watch was collecting and analyzing health data. So when her resting heart rate was suddenly 120 beats per minute, the watch spotted something unusual because according to its data, the rate was normally around 70. Thanks to the data her smart watch gave her, the woman went to the hospital and discovered she had a condition which could have led to life threatening complications if she hadn't gotten medical help. Now let's move on to the next type of problem: identifying themes. We see a lot of examples of this in the user experience field. User experience designers study and work to improve the interactions people have with products they use every day. Let's say a user experience designer wants to see what customers think about the coffee maker his company manufactures. This business collects anonymous survey data from users, which can be used to answer this question. But first to make sense of it all, he will need to find themes that represent the most valuable data, especially information he can use to make the user experience even better. So the problem the user experience designer's company faces, is how to improve the user experience for its coffee makers. The process here is kind of like finding categories for keywords and phrases in customer service conversations. But identifying themes goes even further by grouping each insight into a broader theme. Then the designer can pinpoint the themes that are most common. In this case he learned users often couldn't tell if the coffee maker was on or off. He ended up optimizing the design with improved placement and lighting for the on/off button, leading to the product improvement and happier users. Now we come to the problem of discovering connections. This example is from the transportation industry and uses something called third party logistics. Third party logistics partners help businesses ship products when they don't have their own trucks, planes or ships. A common problem these partners face is figuring out how to reduce wait time. Wait time happens when a truck driver from the third party logistics provider arrives to pick up a shipment but it's not ready. So she has to wait. That costs both companies time and money and it stops trucks from getting back on the road to make more deliveries. 
So how can they solve this? Well, by sharing data the partner companies can view each other's timelines and see what's causing shipments to run late. Then they can figure out how to avoid those problems in the future, so a problem for one business doesn't cause a negative impact for the other. For example, if shipments are running late because one company only delivers Mondays, Wednesdays, and Fridays, and the other company only delivers Tuesdays and Thursdays, then the companies can choose to deliver on the same day to reduce wait time for customers. All right, we've come to our final problem type, finding patterns. Oil and gas companies are constantly working to keep their machines running properly. So the problem is how to stop machines from breaking down. One way data analysts can do this is by looking at patterns in the company's historical data. For example, they could investigate how and when a particular machine broke down in the past and then generate insights into what led to the breakage. In this case, the company saw a pattern indicating that machines began breaking down at faster rates when maintenance wasn't kept up in 15-day cycles. They can then keep track of current conditions and intervene if any of these issues happen again. Pretty cool, right? I'm always amazed to hear about how data helps real people and businesses make meaningful change. I hope you are too. See you soon. Now that we've talked about six basic problem types, it's time to start solving them. To do that, data analysts start by asking the right questions. In this video, we're going to learn how to ask effective questions that lead to key insights you can use to solve all kinds of problems. As a data analyst, I ask questions constantly. It's a huge part of the job. If someone requests that I work on a project, I ask questions to make sure we're on the same page about the plan and the goals. And when I do get a result, I question it. Is the data showing me something superficial? Is there a conflict somewhere that needs to be resolved? The more questions you ask, the more you'll learn about your data and the more powerful your insights will be at the end of the day. Some questions are more effective than others. Let's say you're having lunch with a friend and they say, "These are the best sandwiches ever, aren't they?" Well, that question doesn't really give you the opportunity to share your own opinion, especially if you happen to disagree and didn't enjoy the sandwich very much. This is called a leading question because it's leading you to answer in a certain way. Or maybe you're working on a project and you decide to interview a family member. Say you ask your uncle, did you enjoy growing up in Malaysia? He may reply, "Yes." But you haven't learned much about his experiences there. Your question was closed-ended. That means it can be answered with a yes or no. These kinds of questions rarely lead to valuable insights. Now what if someone asks you, do you prefer chocolate or vanilla? Well, what are they specifically talking about? Ice cream, pudding, coffee flavoring or something else? What if you like chocolate ice cream but vanilla in your coffee? What if you don't like either flavor? That's the problem with this question. It's too vague and lacks context. Knowing the difference between effective and ineffective questions is essential for your future career as a data analyst. After all, the data analysis process starts with the ask phase. So it's important that we ask the right questions.
Effective questions follow the SMART methodology. That means they're specific, measurable, action-oriented, relevant and time-bound. Let's break that down. Specific questions are simple, significant and focused on a single topic or a few closely related ideas. This helps us collect information that's relevant to what we're investigating. If a question is too general, try to narrow it down by focusing on just one element. For example, instead of asking a closed-ended question, like, are kids getting enough physical activity these days? Ask what percentage of kids achieve the recommended 60 minutes of physical activity at least five days a week? That question is much more specific and can give you more useful information. Now, let's talk about measurable questions. Measurable questions can be quantified and assessed. An example of an unmeasurable question would be, why did a recent video go viral? Instead, you could ask how many times was our video shared on social channels the first week it was posted? That question is measurable because it lets us count the shares and arrive at a concrete number. Okay, now we've come to action-oriented questions. Action-oriented questions encourage change. You might remember that problem solving is about seeing the current state and figuring out how to transform it into the ideal future state. Well, action-oriented questions help you get there. So rather than asking, how can we get customers to recycle our product packaging? You could ask, what design features will make our packaging easier to recycle? This brings you answers you can act on. All right, let's move on to relevant questions. Relevant questions matter, are important and have significance to the problem you're trying to solve. Let's say you're working on a problem related to a threatened species of frog. And you asked, why does it matter that Pine Barrens tree frogs started disappearing? This is an irrelevant question because the answer won't help us find a way to prevent these frogs from going extinct. A more relevant question would be, what environmental factors changed in Durham, North Carolina between 1983 and 2004 that could cause Pine Barrens tree frogs to disappear from the Sandhills region? This question would give us answers we can use to help solve our problem. That's also a great example for our final point, time-bound questions. Time-bound questions specify the time to be studied. The time period we want to study is 1983 to 2004. This limits the range of possibilities and enables the data analyst to focus on relevant data. Okay, now that you have a general understanding of SMART questions, there's something else that's very important to keep in mind when crafting questions: fairness. We've touched on fairness before, but as a quick reminder, fairness means ensuring that your questions don't create or reinforce bias. To talk about this, let's go back to our sandwich example. There we had an unfair question because it was phrased to lead you toward a certain answer. This made it difficult to answer honestly if you disagreed about the sandwich quality. Another common example of an unfair question is one that makes assumptions. For instance, let's say a satisfaction survey is given to people who visit a science museum. If the survey asks, what do you love most about our exhibits? This assumes that the customer loves the exhibits, which may or may not be true. Fairness also means crafting questions that make sense to everyone.
It's important for questions to be clear and have a straightforward wording that anyone can easily understand. Unfair questions also can make your job as a data analyst more difficult. They lead to unreliable feedback and missed opportunities to gain some truly valuable insights. You've learned a lot about how to craft effective questions, like how to use the SMART framework while creating your questions and how to ensure that your questions are fair and objective. Moving forward, you'll explore different types of data and learn how each is used to guide business decisions. You'll also learn more about visualizations and how metrics or measures can help create success. It's going to be great! More about SMART questions Companies in lots of industries today are dealing with rapid change and rising uncertainty. Even well-established businesses are under pressure to keep up with what is new and figure out what is next. To do that, they need to ask questions. Asking the right questions can help spark the innovative ideas that so many businesses are hungry for these days. The same goes for data analytics. No matter how much information you have or how advanced your tools are, your data won’t tell you much if you don’t start with the right questions. Think of it like a detective with tons of evidence who doesn’t ask a key suspect about it. Coming up, you will learn more about how to ask highly effective questions, along with certain practices you want to avoid. Highly effective questions are SMART questions: Examples of SMART questions Here's an example that breaks down the thought process of turning a problem question into one or more SMART questions using the SMART method: What features do people look for when buying a new car? Specific: Does the question focus on a particular car feature? Measurable: Does the question include a feature rating system? Action-oriented: Does the question influence creation of different or new feature packages? Relevant: Does the question identify which features make or break a potential car purchase? Time-bound: Does the question validate data on the most popular features from the last three years? Questions should be open-ended. This is the best way to get responses that will help you accurately qualify or disqualify potential solutions to your specific problem. So, based on the thought process, possible SMART questions might be: On a scale of 1-10 (with 10 being the most important) how important is your car having four-wheel drive? What are the top five features you would like to see in a car package? What features, if included with four-wheel drive, would make you more inclined to buy the car? How much more would you pay for a car with four-wheel drive? Has four-wheel drive become more or less popular in the last three years? Things to avoid when asking questions Leading questions: questions that only have a particular response Example: This product is too expensive, isn’t it? This is a leading question because it suggests an answer as part of the question. A better question might be, “What is your s of this product?” There are tons of answers to that question, and they could include information about usability, features, accessories, color, reliability, and popularity, on top of price. Now, if your problem is actually focused on pricing, you could ask a question like “What price (or price range) would make you consider purchasing this product?” This question would provide a lot of different measurable responses. 
Closed-ended questions: questions that ask for a one-word or brief response only Example: Were you satisfied with the customer trial? This is a closed-ended question because it doesn't encourage people to expand on their answer. It is really easy for them to give one-word responses that aren't very informative. A better question might be, "What did you learn about customer experience from the trial?" This encourages people to provide more detail besides "It went well." Vague questions: questions that aren't specific or don't provide context Example: Does the tool work for you? This question is too vague because there is no context. Is it about comparing the new tool to the one it replaces? You just don't know. A better inquiry might be, "When it comes to data entry, is the new tool faster, slower, or about the same as the old tool? If faster, how much time is saved? If slower, how much time is lost?" These questions give context (data entry) and help frame responses that are measurable (time). Hi, I'm Evan. I'm a learning portfolio manager here at Google, and I have one of the coolest jobs in the world where I get to look at all the different technologies that affect big data and then work them into training courses like this one for students to take. I wish I had a course like this when I was first coming out of college or high school. Honestly, a data analyst course that's geared the way this one is, as you've seen if you've already taken some of the videos, really prepares you to do anything you want. It will open all of those doors that you want for any of those roles inside of the data curriculum. Well, what are some of those roles? There are so many different career paths for someone who's interested in data. Generally, if you're like me, you'll come in through the door as a data analyst, maybe working with spreadsheets, maybe working with small, medium, and large databases, but all you have to remember is three different core roles. Now, there are many specialties within each of these different careers, but the three core roles start with the data analyst, who is generally someone who works with SQL, spreadsheets, and databases, and might work on a business intelligence team creating dashboards. Now where does all that data come from? Generally, a data analyst will work with a data engineer to turn that raw data into actionable pipelines. So you have data analysts, data engineers, and then lastly, you might have data scientists, who basically say: the data engineers have built these beautiful pipelines (sometimes the analysts do that too), and the analysts have provided us with clean and actionable data. The data scientists then work to turn it into really cool machine learning models or statistical inferences that are well beyond anything you could have imagined. We'll share a lot of resources and links for ways that you can get excited about each of these different roles. And the best part is, if you're like me when I went into school and didn't know what I wanted to do, you don't have to know at the outset which path you want to go down. Try 'em all. See what you really, really like. It's very personal. Becoming a data analyst is so exciting. Why? Because it's not just a means to an end. It's taking a career path where so many bright people have gone before and have made the tools and technologies that much easier for you and me today.
For example, when I was starting to learn SQL, or the structured query language that you're going to be learning as part of this course, I was doing it on my local laptop, and each of the queries would take like 20, 30 minutes to run, and it was very hard for me to keep track of the different SQL statements that I was writing or share them with somebody else. That was about 10 or 15 years ago. Now, through all the different companies and all the different tools that are making data analysis tools and technologies easier for you, you're going to have a blast creating these insights with a lot less of the overhead that I had when I first started out. So I'm really excited to hear what you think and what your experience is going to be. We've talked a lot about what data is and how it plays into decision-making. What do we know already? Well, we know that data is a collection of facts. We also know that data analysis reveals important patterns and insights about that data. Finally, we know that data analysis can help us make more informed decisions. Now, we'll look at how data plays into the decision-making process and take a quick look at the differences between data-driven and data-inspired decisions. Let's look at a real-life example. Think about the last time you searched "restaurants near me" and sorted the results by rating to help you decide which one looks best. That was a decision you made using data. Businesses and other organizations use data to make better decisions all the time. There are two ways they can do this: with data-driven or data-inspired decision-making. We'll talk more about data-inspired decision-making later on, but here's a quick definition for now. Data-inspired decision-making explores different data sources to find out what they have in common. Here at Google, we use data every single day, in very surprising ways too. For example, we use data to help cut back on the amount of energy spent cooling our data centers. After analyzing years of collected data with artificial intelligence, we were able to make decisions that helped reduce the energy we use to cool our data centers by over 40 percent. Google's People Operations team also uses data to improve how we hire new Googlers and how we get them started on the right foot. We wanted to make sure we weren't passing over any talented applicants and that we made their transition into their new roles as smooth as possible. After analyzing data on applications, interviews, and new hire orientation processes, we started using an algorithm. An algorithm is a process or set of rules to be followed for a specific task. With this algorithm, we reviewed applicants that didn't pass the initial screening process to find great candidates. Data also helped us determine the ideal number of interviews that leads to the best possible hiring decisions. We've created new onboarding agendas to help new employees get started at their new jobs. Data is everywhere. Today, we create so much data that scientists estimate 90 percent of the world's data has been created in just the last few years. Think of the potential here. The more data we have, the bigger the problems we can solve and the more powerful our solutions can be. But responsibly gathering data is only part of the process. We also have to turn data into knowledge that helps us find better solutions. I'm going to let fellow Googler, Ed, talk more about that. Just having tons of data isn't enough. We have to do something meaningful with it. Data in itself provides little value.
To quote Jack Dorsey, the founder of Twitter and Square, "Every single action that we do in this world is triggering off some amount of data, and most of that data is meaningless until someone adds some interpretation of it or someone adds a narrative around it." Data is straightforward: facts collected together, values that describe something. Individual data points become more useful when they're collected and structured, but they're still somewhat meaningless by themselves. We need to interpret data to turn it into information. Look at Michael Phelps' time in a 200-meter individual medley swimming race: one minute, 54 seconds. By itself, that doesn't tell us much. When we compare it to his competitors' times in the race, however, we can see that Michael came in first place and won the gold medal. Our analysis took data, in this case, a list of Michael's races and times, and turned it into information by comparing it with other data. Context is important. We needed to know that this race was an Olympic final and not some other random race to determine that this was a gold medal finish. But this still isn't knowledge. When we consume information, understand it, and apply it, that's when data is most useful. In other words, Michael Phelps is a fast swimmer. It's pretty cool how we can turn data into knowledge that helps us in all kinds of ways, whether it's finding the perfect restaurant or making environmentally friendly changes. But keep in mind, there are limitations to data analytics. Sometimes we don't have access to all of the data we need, or data is measured differently across programs, which can make it difficult to find concrete examples. We'll cover these more in detail later on, but it's important that you start thinking about them now. Now that you know how data drives decision-making, you know how key your role as a data analyst is to the business. Data is a powerful tool for decision-making, and you can help provide businesses with the information they need to solve problems and make new decisions, but before that, you will need to learn a little more about the kinds of data you'll be working with and how to deal with it. Data trials and triumphs This reading focuses on why accurate interpretation of data is key to data-driven decisions. You have been learning why data is such a powerful business tool and how data analysts help their companies make data-driven decisions for great results. As a quick reminder, the goal of all data analysts is to use data to draw accurate conclusions and make good recommendations. That all starts with having complete, correct, and relevant data. But keep in mind, it is possible to have solid data and still make the wrong choices. It is up to data analysts to interpret the data accurately. When data is interpreted incorrectly, it can lead to huge losses. Consider the examples below. Coke launch failure In 1985, New Coke was launched, replacing the classic Coke formula. The company had done taste tests with 200,000 people and found that test subjects preferred the taste of New Coke over Pepsi, which had become a tough competitor. Based on this data alone, classic Coke was taken off the market and replaced with New Coke. This was seen as the solution to take back the market share that had been lost to Pepsi. But as it turns out, New Coke was a massive flop and the company ended up losing tens of millions of dollars. How could this have happened with data that seemed correct? It is because the data wasn't complete, which made it inaccurate.
The data didn't consider how customers would feel about New Coke replacing classic Coke. The company’s decision to retire classic Coke was a data-driven decision based on incomplete data. Mars orbiter loss In 1999, NASA lost the $125 million Mars Climate Orbiter, even though it had good data. The spacecraft burned to pieces because of poor collaboration and communication. The Orbiter’s navigation team was using the SI or metric system (newtons) for their force calculations, but the engineers who built the spacecraft used the English Engineering Units system (pounds) for force calculations. No one realized a problem even existed until the Orbiter burst into flames in the Martian atmosphere. Later, a NASA review board investigating the root cause of the problem figured out that the issue was isolated to the software that controlled the thrusters. One program calculated the thrusters’ force in pounds; another program looking at the data assumed it was in newtons. The software controllers were making data-driven decisions to adjust the thrust based on 100% accurate data, but these decisions were wrong because of inaccurate assumptions when interpreting it. A conversion of the data from one system of measurement to the other could have prevented the loss. When data is used strategically, businesses can transform and grow their revenue. Consider the examples below. Crate and Barrel At Crate and Barrel, online sales jumped more than 40% during stay-at-home orders to combat the global pandemic. Currently, online sales make up more than 65% of their overall business. They are using data insights to accelerate their digital transformation and bring the best of online and offline experiences together for customers. BigQuery enables Crate and Barrel to "draw on ten times [as many] information sources (compared to a few years ago) which are then analyzed and transformed into actionable insights that can be used to influence the customer’s next interaction. And this, in turn, drives revenue." Read more about Crate and Barrel's data strategy in How one retailer’s data strategy powers seamless customer experiences. PepsiCo Since the days of the New Coke launch, things have changed dramatically for beverage and other consumer packaged goods (CPG) companies. PepsiCo "hired analytical talent and established cross-functional workflows around an infrastructure designed to put consumers’ needs first. Then [they] set up the right processes to make critical decisions based on data and technology use cases. Finally, [they] invested in the right technology stack and platforms so that data could flow into a central cloud-based hub. This is critical. When data comes together, [they] develop a holistic understanding of the consumer and their journeys." Read about how PepsiCo is delivering a more personal and valuable experience to customers using data in How one of the world’s biggest marketers ripped up its playbook and learned to anticipate intent. Key skills for triumphant results As a data analyst, your own skills and knowledge will be the most important part of any analysis project. It is important for you to keep a data-driven mindset, ask lots of questions, experiment with many different possibilities, and use both logic and creativity along the way. You will then be prepared to interpret your data with the highest levels of care and accuracy. Note that there is a difference between making a decision with incomplete data and making a decision with a small amount of data. 
You learned that making a decision with incomplete data is dangerous. But sometimes accurate data from a small test can help you make a good decision. Stay tuned. You will learn about how much data to collect later in the program. Hi again. When it comes to decision-making, data is key. But we've also learned that there are a lot of different kinds of questions that data might help us answer, and these different questions generate different kinds of data. There are two kinds of data that we'll talk about in this video: quantitative and qualitative. Quantitative data is all about the specific and objective measures of numerical facts. This can often be the what, how many, and how often about a problem. In other words, things you can measure, like how many commuters take the train to work every week. As a financial analyst, I work with a lot of quantitative data. I love the certainty and accuracy of numbers. On the other hand, qualitative data describes subjective or explanatory measures of qualities and characteristics or things that can't be measured with numerical data, like your hair color. Qualitative data is great for helping us answer why questions. For example, why people might like a certain celebrity or snack food more than others. With quantitative data, we can see numbers visualized as charts or graphs. Qualitative data can then give us a more high-level understanding of why the numbers are the way they are. This is important because it helps us add context to a problem. As a data analyst, you'll be using both quantitative and qualitative analysis, depending on your business task. Reviews are a great example of this. Think about a time you used reviews to decide whether you wanted to buy something or go somewhere. These reviews might have told you how many people disliked that thing and why. Businesses read these reviews too, but they use the data in different ways. Let's look at an example of a business using data from customer reviews to see qualitative and quantitative data in action. Now, say a local ice cream shop has started using their online reviews to engage with their customers and build their brand. These reviews give the ice cream shop insights into their customers' experiences, which they can use to inform their decision-making. The owner notices that their rating has been going down. He sees that lately his shop has been receiving more negative reviews. He wants to know why, so he starts asking questions. First are measurable questions. How many negative reviews are there? What's the average rating? How many of these reviews use the same keywords? These questions generate quantitative data, numerical results that help confirm their customers aren't satisfied. This data might lead them to ask different questions. Why are customers unsatisfied? How can we improve their experience? These are questions that lead to qualitative data. After looking through the reviews, the ice cream shop owner sees a pattern: 17 of the negative reviews use the word "frustrated." That's quantitative data. Now we can start collecting qualitative data by asking why this word is being repeated. He finds that customers are frustrated because the shop is running out of popular flavors before the end of the day. Knowing this, the ice cream shop can change its weekly order to make sure it has enough of what the customers want. With both quantitative and qualitative data, the ice cream shop owner was able to figure out his customers were unhappy and understand why.
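The measurable questions above map naturally onto simple queries. The following is a hedged sketch only: the reviews table, its column names, the rating cutoff that counts as "negative," and the date filter are all assumptions, and COUNTIF here is BigQuery-style syntax (other databases would use SUM with a CASE expression).

-- Hedged sketch: reviews, rating, review_text, and review_date are hypothetical names.
SELECT
  COUNT(*) AS total_reviews,                                 -- how many reviews overall
  COUNTIF(rating <= 2) AS negative_reviews,                  -- how many negative reviews are there?
  ROUND(AVG(rating), 2) AS average_rating,                   -- what's the average rating?
  COUNTIF(LOWER(review_text) LIKE '%frustrated%') AS mentions_frustrated  -- keyword count
FROM reviews
WHERE review_date >= DATE '2024-01-01';                      -- hypothetical review period

Counting a keyword this way is the quantitative half; reading the matching reviews to understand why the word keeps appearing is the qualitative half.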
Having both types of data made it possible for him to make the right changes and improve his business. Now that you know the difference between quantitative and qualitative data, you know how to get different types of data by asking different questions. It's your job as a data detective to know which questions to ask to find the right solution. Then you can start thinking about cool and creative ways to help stakeholders better understand the data. For example, interactive dashboards, which we'll learn about soon. Your analysis of the historical data shows that the 7:30 PM showtime was the most popular and had the greatest attendance, followed by the 7:15 PM and 9:00 PM showtimes. You may suggest replacing the current 8:00 PM showtime that has lower attendance with an 8:30 PM showtime. But you need more data to back up your hunch that people would be more likely to attend the later show. Evening movie-goers are the largest source of revenue for the theater. Therefore, you also decide to include a question in your online survey to gain more insight. Qualitative data for all three trends plus ticket pricing Since you know that the theater is planning to raise ticket prices for evening showtimes in a few months, you will also include a question in the survey to get an idea of customers' price sensitivity. Your final online survey might include these questions for qualitative data: 1. What went into your decision to see a movie in our theater today? (movie attendance) 2. What do you think about the quality and value of your purchases at the concession stand? (concession stand profitability) 3. Which showtime do you prefer, 8:00 PM or 8:30 PM, and why do you prefer that time? (evening movie-goer preferences) 4. Under what circumstances would you choose a matinee over a nighttime showing? (ticket price increase) Summing it up Data analysts will generally use both types of data in their work. Usually, qualitative data can help analysts better understand their quantitative data by providing a reason or more thorough explanation. In other words, quantitative data generally gives you the what, and qualitative data generally gives you the why. By using both quantitative and qualitative data, you can learn when people like to go to the movies and why they chose the theater. Maybe they really like the reclining chairs, so your manager can purchase more recliners. Maybe the theater is the only one that serves root beer. Maybe a later show time gives them more time to drive to the theater from where popular restaurants are located. Maybe they go to matinees because they have kids and want to save money. You wouldn't have discovered this information by analyzing only the quantitative data for attendance, profit, and showtimes. In the last video, we learned how you can visualize your data using reports and dashboards to show off your findings in interesting ways. In one of our examples, the company wanted to see the sales revenue of each salesperson. That specific measurement of data is done using metrics. Now, I want to tell you a little bit more about the difference between data and metrics, and how metrics can be used to turn data into useful information. A metric is a single, quantifiable type of data that can be used for measurement. Think of it this way. Data starts as a collection of raw facts, until we organize them into individual metrics that represent a single type of data.
Metrics can also be combined into formulas that you can plug your numerical data into. In our earlier sales revenue example, all that data doesn't mean much unless we use a specific metric to organize it. So let's use revenue by individual salesperson as our metric. Now we can see whose sales brought in the highest revenue. Metrics usually involve simple math. Revenue, for example, is the number of sales multiplied by the sales price. Choosing the right metric is key. Data contains a lot of raw details about the problem we're exploring. But we need the right metrics to get the answers we're looking for. Different industries will use all kinds of metrics to measure things in a data set. Let's look at some more ways businesses in different industries use metrics, so you can see how you might apply metrics to your collected data. Ever heard of ROI? Companies use this metric all the time. ROI, or return on investment, is essentially a formula designed using metrics that lets a business know how well an investment is doing. ROI is made up of two metrics: the net profit over a period of time and the cost of the investment. By comparing these two metrics, profit and cost of investment, the company can analyze the data they have to see how well their investment is doing. This can then help them decide how to invest in the future and which investments to prioritize. We see metrics used in marketing too. For example, metrics can be used to help calculate customer retention rates, or a company's ability to keep its customers over time. Customer retention rates let the company compare the number of customers at the beginning and the end of a period to see how many it kept. This way the company knows how successful their marketing strategies are and if they need to research new approaches to bring back more repeat customers. Different industries use all kinds of different metrics. But there's one thing they all have in common: they're all trying to meet a specific goal by measuring data. A metric goal is a measurable goal set by a company and evaluated using metrics. And just like there are a lot of possible metrics, there are lots of possible goals too. Maybe an organization wants to meet a certain number of monthly sales, or maybe a certain percentage of repeat customers. By using metrics to focus on individual aspects of your data, you can start to see the story your data is telling. Metric goals and formulas are great ways to measure and understand data. But they're not the only ways. We'll talk more about how to interpret and understand data throughout this course. So far, you've learned a lot about how to think like a data analyst. We've explored a few different ways of thinking. And now, I want to take that one step further by using a mathematical approach to problem-solving. Mathematical thinking is a powerful skill you can use to help you solve problems and see new solutions. So, let's take some time to talk about what mathematical thinking is, and how you can start using it.
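Before moving on, a quick sketch may help make the metrics just described more tangible. It is an illustration only: the sales and investments tables and every column name are assumptions, and the arithmetic simply follows the definitions above (revenue as the number of sales multiplied by the sales price, and ROI as net profit compared with the cost of the investment).

-- Hedged sketch: table and column names are hypothetical.
-- Metric: revenue by individual salesperson.
SELECT
  salesperson,
  SUM(units_sold * sale_price) AS revenue        -- number of sales multiplied by sales price
FROM sales
GROUP BY salesperson
ORDER BY revenue DESC;

-- Metric comparison: ROI as net profit over a period relative to the cost of investment.
SELECT
  SUM(net_profit) / SUM(investment_cost) AS roi
FROM investments;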
Using a mathematical approach doesn't mean you have to suddenly become a math whiz. It means looking at a problem and logically breaking it down step-by-step, so you can see the relationships and patterns in your data and use them to analyze your problem. This kind of thinking can also help you figure out the best tools for analysis because it lets us see the different aspects of a problem and choose the best logical approach. There are a lot of factors to consider when choosing the most helpful tool for your analysis. One way you could decide which tool to use is by the size of your dataset. When working with data, you'll find that there's big and small data. Small data can be really small. These kinds of data tend to be made up of datasets concerned with specific metrics over a short, well-defined period of time, like how much water you drink in a day. Small data can be useful for making day-to-day decisions, like deciding to drink more water. But it doesn't have a huge impact on bigger frameworks like business operations. You might use spreadsheets to organize and analyze smaller datasets when you first start out. Big data, on the other hand, has larger, less specific datasets covering a longer period of time. They usually have to be broken down to be analyzed. Big data is useful for looking at large-scale questions and problems, and it helps companies make big decisions. When you're working with data on this larger scale, you might switch to SQL. Let's look at an example of how a data analyst working in a hospital might use mathematical thinking to solve a problem with the right tools. The hospital might find that they're having a problem with over- or under-use of their beds. Based on that, the hospital could make bed optimization a goal. They want to make sure that beds are available to patients who need them, but not waste hospital resources like space or money on maintaining empty beds. Using mathematical thinking, you can break this problem down into a step-by-step process to help you find patterns in their data. There are a lot of variables in this scenario. But for now, let's keep it simple and focus on just a few key ones. There are metrics that are related to this problem that might show us patterns in the data: for example, maybe the number of beds open and the number of beds used over a period of time. There's actually already a formula for this. It's called the bed occupancy rate, and it's calculated using the total number of inpatient days and the total number of available beds over a given period of time. What we want to do now is take our key variables and see how their relationship to each other might show us patterns that can help the hospital make a decision. To do that, we have to choose the tool that makes sense for this task. Hospitals generate a lot of patient data over a long period of time. So logically, a tool that's capable of handling big datasets is a must. SQL is a great choice. In this case, you discover that the hospital always has unused beds. Knowing that, they can choose to get rid of some beds, which saves them space and money that they can use to buy and store protective equipment. By considering all of the individual parts of this problem logically, mathematical thinking helped us see new perspectives that led us to a solution. Well, that's it for now. Great job. You've covered a lot of material already.
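As a last look at the hospital example, here is a hedged sketch of the bed occupancy rate described above. The bed_stays table, its columns, the 500-bed figure, and the one-year window are all assumptions; the formula itself, total inpatient days divided by the total available bed days for the period, follows the description in the scenario.

-- Hedged sketch: table name, column names, bed count, and date range are hypothetical.
SELECT
  SUM(inpatient_days) AS total_inpatient_days,
  500 * 365 AS total_available_bed_days,                        -- available beds * days in the period
  ROUND(100.0 * SUM(inpatient_days) / (500 * 365), 1) AS bed_occupancy_rate_pct
FROM bed_stays
WHERE stay_date BETWEEN DATE '2024-01-01' AND DATE '2024-12-31';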
You've learned about how empowering data can be in decision-making, the difference between quantitative and qualitative analysis, using reports and dashboards for data visualization, metrics, and using a mathematical approach to problem-solving. Coming up next, we'll be tackling spreadsheet basics. You'll get to put what you've learned into action and learn a new tool to help you along the data analysis process. See you soon. Big and small data As a data analyst, you will work with data both big and small. Both kinds of data are valuable, but they play very different roles. Whether you work with big or small data, you can use it to help stakeholders improve business processes, answer questions, create new products, and much more. But there are certain challenges and benefits that come with big data and the following table explores the differences between big and small data. Small data Big data Describes a data set made up of specific metrics over a short, well-defined time period Describes large, less-specific data sets that Usually organized and analyzed in spreadsheets Usually kept in a database and queried Likely to be used by small and midsize businesses Likely to be used by large organizations Simple to collect, store, manage, sort, and visually represent Takes a lot of effort to collect, store, manag Usually already a manageable size for analysis Usually needs to be broken into smaller pie and analyzed effectively for decision-makin Challenges and benefits Here are some challenges you might face when working with big data: A lot of organizations deal with data overload and way too much unimportant or irrelevant information. Important data can be hidden deep down with all of the non-important data, which makes it harder to find and use. This can lead to slower and more inefficient decision-making time frames. The data you need isn’t always easily accessible. Current technology tools and solutions still struggle to provide measurable and reportable data. This can lead to unfair algorithmic bias. There are gaps in many big data business solutions. Now for the good news! Here are some benefits that come with big data: When large amounts of data can be stored and analyzed, it can help companies identify more efficient ways of doing business and save a lot of time and money. Big data helps organizations spot the trends of customer buying patterns and satisfaction levels, which can help them create new products and solutions that will make customers happy. By analyzing big data, businesses get a much better understanding of current market conditions, which can help them stay ahead of the competition. As in our earlier social media example, big data helps companies keep track of their online presence—especially feedback, both good and bad, from customers. This gives them the information they need to improve and protect their brand. The three (or four) V words for big data When thinking about the benefits and challenges of big data, it helps to think about the three Vs: volume, variety, and velocity. Volume describes the amount of data. Variety describes the different kinds of data. Velocity describes how fast the data can be processed. Some data analysts also consider a fourth V: veracity. Veracity refers to the quality and reliability of the data. These are all important considerations related to processing huge, complex data sets. Volume Variety Velocity Veracit The amount of data The different kinds of data How fast the data can be processed The qu Hi, again. I'm glad you're back. 
In this part of the program, we'll revisit the spreadsheet. Spreadsheets are a powerful and versatile tool, which is why they're a big part of pretty much everything we do as data analysts. There's a good chance a spreadsheet will be the first tool you reach for when trying to answer data-driven questions. After you've defined what you need to do with the data, you'll turn to spreadsheets to help build evidence that you can then visualize, and use to support your findings. Spreadsheets are often the unsung heroes of the data world. They don't always get the appreciation they deserve, but as a data detective, you'll definitely want them in your evidence collection kit. I know spreadsheets have saved the day for me more than once. I've added data for purchase orders into a sheet, set up formulas in one tab, and had the same formulas do the work for me in other tabs. This frees up time for me to work on other things during the day. I couldn't imagine not using spreadsheets. Math is a core part of every data analyst's job, but not every analyst enjoys it. Luckily, spreadsheets can make calculations more enjoyable, and by that, I mean easier. Let's see how. Spreadsheets can do both basic and complex calculations automatically. Not only does this help you work more efficiently, but it also lets you see the results and understand how you got them. Here's a quick look at some of the functions that you'll use when performing calculations. Many functions can be used as part of a math formula as well. Functions and formulas also have other uses, and we'll take a look at those too. We'll take things one step further with exercises that use real data from databases. This is your chance to reorganize a spreadsheet, do some actual data analysis, and have some fun with data. You have been learning a lot about spreadsheets and all kinds of time-saving calculations and organizational features they offer. One of the most valuable spreadsheet features is a formula. As a quick reminder, a formula is a set of instructions that does a specific calculation using the data in a spreadsheet. Formulas make it easy for data analysts to do powerful calculations automatically, which helps them analyze data more effectively. Below is a quick-reference guide to help you get the most out of formulas. Formulas The basics When you write a formula in math, it generally ends with an equal sign (2 + 3 = ?). But spreadsheet formulas always start with one instead (=A2+A3). The equal sign tells the spreadsheet that what follows is part of a formula, not just a word or number in a cell. After you type the equal sign, most spreadsheet applications will display an autocomplete menu that lists valid formulas, names, and text strings. This is a great way to create and edit formulas while avoiding typing and syntax errors. A fun way to learn new formulas is just by typing an equal sign and a single letter of the alphabet. Choose one of the options that pops up and you will learn what that formula does. Mathematical operators The mathematical operators used in spreadsheet formulas include: Subtraction – minus sign ( - ) Addition – plus sign ( + ) Division – forward-slash ( / ) Multiplication – asterisk ( * ) Auto-filling The lower-right corner of each cell has a fill handle. It is a small green square in Microsoft Excel and a small blue square in Google Sheets. Click the fill handle for a cell and drag it down a column to auto-fill other cells in the column with the same value or formula in that cell.
Click the fill handle for a cell and drag it across a row to auto-fill other cells in the row with the same value or formula in that cell. If you want to create a numbered sequence in a column or row, do the following: 1) Fill in the first two numbers of the sequence in two adjacent cells, 2) Select to highlight the cells, and 3) Drag the fill handle to the last cell to complete the sequence of numbers. For example, to insert 1 through 100 in each row of column A, enter 1 in cell A1 and 2 in cell A2. Then, select to highlight both cells, click the fill handle in cell A2, and drag it down to cell A100. This auto-fills the numbers sequentially so you don't have to type them in each cell. Absolute referencing Absolute referencing is marked by a dollar sign ($). For example, =$A$10 has absolute referencing for both the column and the row value. Relative references (which is what you normally use, e.g. =A10) will change anytime the formula is copied and pasted. They are in relation to where the referenced cell is located. For example, if you copied =A10 to the cell to the right, it would become =B10. With absolute referencing, =$A$10 copied to the cell to the right would remain =$A$10. But if you copied $A10 to the cell below, it would change to $A11 because the row value isn't an absolute reference. Absolute references will not change when you copy and paste the formula in a different cell. The cell being referenced is always the same. To easily switch between absolute and relative referencing in the formula bar, highlight the reference you want to change and press the F4 key; for example, if you want to change the absolute reference, $A$10, in your formula to a relative reference, A10, highlight $A$10 in the formula bar and then press the F4 key to make the change. Data range When you click into your formula, the colored ranges let you see which cells are being used in your spreadsheet. There are different colors for each unique range in your formula. In a lot of spreadsheet applications, you can press the F2 (or Enter) key to highlight the range of data in the spreadsheet that is referenced in a formula. Click the cell with the formula, and then press the F2 (or Enter) key to highlight the data in your spreadsheet. Combining with functions COUNTIF() combines a formula and a function. This means the function runs based on criteria set by the formula. In this case, COUNT is performed only IF the condition you create is true. For example, you could use =COUNTIF(A1:A16, "7") to count only the cells that contain the number 7. Combining formulas and functions allows you to do more work with a single command. Quick reference: Functions in spreadsheets As a quick refresher, a function is a preset command that automatically performs a specific process or task using the data in a spreadsheet. Functions give data analysts the ability to do calculations, which can be anything from simple arithmetic to complex equations. Use this reading to help you keep track of some of the most useful options. Functions The basics Just like formulas, start all of your functions with an equal sign; for example, =SUM. The equal sign tells the spreadsheet that what follows is part of a function, not just a word or number in a cell. After you type the equal sign, most spreadsheet applications will display an autocomplete menu that lists valid functions, names, and text strings. This is a great way to create and edit functions while avoiding typing and syntax errors.
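Because this course also introduces SQL for larger datasets, it can help to see how a conditional count like the COUNTIF example above translates to a query. This is only a rough sketch; the table and column names (survey_responses, score) are hypothetical.

-- Spreadsheet: =COUNTIF(A1:A16, "7") counts the cells in A1:A16 that contain 7.
-- SQL: count the rows in a table where a column equals 7.
SELECT COUNT(*) AS score_of_seven
FROM survey_responses
WHERE score = 7;

Either way, the idea is the same: the count includes only the values that meet your condition.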
A fun way to learn new functions is by simply typing an equal sign and a single letter of the alphabet. Choose one of the options that pops up and learn what that function does. Difference between formulas and functions A formula is a set of instructions used to perform a calculation using the data in a spreadsheet. A function is a preset command that automatically performs a specific process or task using the data in a spreadsheet. Popular functions A lot of people don't realize that keyboard shortcuts like cut, save, and find are actually functions. These functions are built into an application and are amazing time-savers. Using shortcuts lets you do more with less effort. They can make you more efficient and productive because you are not constantly reaching for the mouse and navigating menus. The following table shows some of the most popular shortcuts for Chromebook, PC, and Mac:
Command | Chromebook | PC | Mac
Create new workbook | Control+N | Control+N | Command+N
Open workbook | Control+O | Control+O | Command+O
Save workbook | Control+S | Control+S | Command+S
Close workbook | Control+W | Control+W | Command+W
Undo | Control+Z | Control+Z | Command+Z
Redo | Control+Y | Control+Y | Command+Y
Copy | Control+C | Control+C | Command+C
Cut | Control+X | Control+X | Command+X
Paste | Control+V | Control+V | Command+V
Paste values only | Control+Shift+V | Control+Shift+V | Command+Shift+V
Find | Control+Shift+F | Control+F | Command+F
Find and replace | Control+H | Control+H | Command+
Insert link | Control+K | Control+K | Command+K
Bold | Control+B | Control+B | Command+B
Italicize | Control+I | Control+I | Command+I
Underline | Control+U | Control+U | Command+U
Zoom in | Control+Plus (+) | Control+Plus (+) | Option+Com
Zoom out | Control+Minus (-) | Control+Minus (-) | Option+Com
Select column | Control+Spacebar | Control+Spacebar | Command+
Select row | Shift+Spacebar | Shift+Spacebar | Up Arrow+
Select all cells | Control+A | Control+A | Command+A
Edit the current cell | Enter | F2 | F2
Comment on a cell | Ctrl + Alt + M | Alt+I+M | Option+Com
Insert column to the left | Ctrl + Alt + = (with existing column selected) | Alt+Shift+I, then C | ⌘ + Option + = (with existing column selected)
Insert column to the right | Alt + I, then O | Alt+Shift+I, then O | Ctrl + Optio
Insert row above | Ctrl + Alt + = (with existing row selected) | Alt+Shift+I, then R | ⌘ + Option + = (with existing row selected)
Insert row below | Alt + I, then R, then B | Alt+Shift+I, then B | Ctrl + Optio
Auto-filling The lower-right corner of each cell has a fill handle. It is a small green square in Microsoft Excel and a small blue square in Google Sheets. Click the fill handle for a cell and drag it down a column to auto-fill other cells in the column with the same formula or function used in that cell. Click the fill handle for a cell and drag it across a row to auto-fill other cells in the row with the same formula or function used in that cell. Relative, absolute, and mixed references Relative references (cells referenced without a dollar sign, like A2) will change when you copy and paste the function into a different cell. With relative references, the location of the cell that contains the function determines the cells used by the function. Absolute references (cells fully referenced with a dollar sign, like $A$2) will not change when you copy and paste the function into a different cell. With absolute references, the cells referenced always remain the same. Mixed references (cells partially referenced with a dollar sign, like $A2 or A$2) will change when you copy and paste the function into a different cell.
With mixed references, the location of the cell that contains the function determines the cells used by the function, but only the row or column is relative (not both). In spreadsheets, you can press the F4 key to toggle between relative, absolute, and mixed references in a function. Click the cell containing the function, highlight the referenced cells in the formula bar, and then press F4 to toggle between and select relative, absolute, or mixed referencing. Data ranges When you click a cell that contains a function, colored data ranges in the formula bar indicate which cells are being used in the spreadsheet. There are different colors for each unique range in a function. Colored data ranges help prevent you from getting lost in complex functions. In spreadsheets, you can press the F2 key to highlight the range of data used by a function. Click the cell containing the function, highlight the range of data used by the function in the formula bar, and then press F2. The spreadsheet will go to and highlight the cells specified by the range. Data ranges evaluated for a condition COUNTIF is an example of a function that returns a value based on a condition that the data range is evaluated for. The function counts the number of cells that meet the criteria. For example, in an expense spreadsheet, use COUNTIF to count the number of cells that contain a reimbursement for "airfare." For more information, refer to: Microsoft Support's page for COUNTIF Google Help Center's documentation for COUNTIF where you can copy a sheet with COUNTIF examples (click "Use Template" if you click the COUNTIF link provided on this page) Conclusion There are a lot more functions that can help you make the most of your data. This is just the start. You can keep learning how to use functions to help you solve complex problems efficiently and accurately throughout your entire career. Activity overview You have been learning about the role of a data analyst and how to manage, analyze, and visualize data. Now, you will consider a valuable tool to help you practice structured thinking and avoid mistakes: a scope-of-work (SOW). In this activity, you’ll get practical experience developing an SOW document with the help of a handy template. You will then complete an example SOW for an imaginary project of your choosing and learn how analysts outline the work they are going to perform. By the time you complete this activity, you will be familiar with an essential, industry-standard tool, and gain comfort asking the right questions to develop an SOW. Before you get started, take a minute to think about the main ideas, goals, and target audiences of SOW documents. Scope of work: What you need to know As a data analyst, it’s hard to overstate the importance of an SOW document. A well-defined SOW keeps you, your team, and everyone involved with a project on the same page. It ensures that all contributors, sponsors, and stakeholders share the same understanding of the relevant details. Why do you need an SOW? The point of data analysis projects is to complete business tasks that are useful to the stakeholders. Creating an SOW helps to make sure that everyone involved, from analysts and engineers to managers and stakeholders, shares the understanding of what those business goals are, and the plan for accomplishing them. Clarifying requirements and setting expectations are two of the most important parts of a project. Recall the first phase of the Data Analysis Process—asking questions. 
As you ask more and more questions to clarify requirements, goals, data sources, stakeholders, and any other relevant info, an SOW helps you formalize it all by recording all the answers and details. In this context, the word “ask” means two things. Preparing to write an SOW is about asking questions to learn the necessary information about the project, but it’s also about clarifying and defining what you’re being asked to accomplish, and what the limits or boundaries of the “ask” are. After all, if you can’t make a distinction between the business questions you are and aren’t responsible for answering, then it’s hard to know what success means! What is a good SOW? There’s no standard format for an SOW. They may differ significantly from one organization to another, or from project to project. However, they all have a few foundational pieces of content in common. Deliverables: What work is being done, and what things are being created as a result of this project? When the project is complete, what are you expected to deliver to the stakeholders? Be specific here. Will you collect data for this project? How much, or for how long? Avoid vague statements. For example, “fixing traffic problems” doesn’t specify the scope. This could mean anything from filling in a few potholes to building a new overpass. Be specific! Use numbers and aim for hard, measurable goals and objectives. For example: “Identify top 10 issues with traffic patterns within the city limits, and identify the top 3 solutions that are most cost-effective for reducing traffic congestion.” Milestones: This is closely related to your timeline. What are the major milestones for progress in your project? How do you know when a given part of the project is considered complete? Milestones can be identified by you, by stakeholders, or by other team members such as the Project Manager. Smaller examples might include incremental steps in a larger project like “Collect and process 50% of required data (100 survey responses)”, but may also be larger examples like ”complete initial data analysis report” or “deliver completed dashboard visualizations and analysis reports to stakeholders”. Timeline: Your timeline will be closely tied to the milestones you create for your project. The timeline is a way of mapping expectations for how long each step of the process should take. The timeline should be specific enough to help all involved decide if a project is on schedule. When will the deliverables be completed? How long do you expect the project will take to complete? If all goes as planned, how long do you expect each component of the project will take? When can we expect to reach each milestone? Reports: Good SOWs also set boundaries for how and when you’ll give status updates to stakeholders. How will you communicate progress with stakeholders and sponsors, and how often? Will progress be reported weekly? Monthly? When milestones are completed? What information will status reports contain? At a minimum, any SOW should answer all the relevant questions in the above areas. Note that these areas may differ depending on the project. But at their core, the SOW document should always serve the same purpose by containing information that is specific, relevant, and accurate. If something changes in the project, your SOW should reflect those changes. What is in and out of scope? SOWs should also contain information specific to what is and isn’t considered part of the project. 
The scope of your project is everything that you are expected to complete or accomplish, defined to a level of detail that doesn’t leave any ambiguity or confusion about whether a given task or item is part of the project or not. Notice how the previous example about studying traffic congestion defined its scope as the area within the city limits. This doesn’t leave any room for confusion — stakeholders need only to refer to a map to tell if a stretch of road or intersection is part of the project or not. Defining requirements can be trickier than it sounds, so it’s important to be as specific as possible in these documents, and to use quantitative statements whenever possible. For example, assume that you’re assigned to a project that involves studying the environmental effects of climate change on the coastline of a city: How do you define what parts of the coastline you are responsible for studying, and which parts you are not? In this case, it would be important to define the area you’re expected to study using GPS locations, or landmarks. Using specific, quantifiable statements will help ensure that everyone has a clear understanding of what’s expected. Completing your own SOW Now that you know the basics, you can practice creating your own mock SOW for a project of your choice. To get started, first access the scope-of-work template. What you will need To use the template for this course item, click the link below and select “Use Template.” Link to template: Data Analysis Project Scope-Of-Work (SOW) Template OR If you don’t have a Google account, you can download the template directly from the attachment below. Scope-Of-Work Template DOCX File Download file Fill the template in for an imaginary project Spend a few minutes thinking about a plausible data analysis project. Come up with a problem domain, and then make up the relevant details to help you fill out the template. Take some time to fill out the template. Treat this exercise as if you were writing your first SOW in your new career as a data analyst. Try to be thorough, specific, and concise! The specifics here aren’t important. The goal is to get comfortable identifying and formalizing requirements and using those requirements in a professional manner by creating SOWs. Compare your work to a strong example Once you’ve filled out your template, consider the strong example below and compare it to yours. Link to the strong example: Data Analysis Project Scope-of-Work (SOW) Strong Example OR You can download the template directly from the attachment below. Scope-Of-Work Exemplar.pdf PDF File Open file Confirmation and reflection When you created a complete and thorough mock SOW, which foundational pieces of content did you include? Select all that apply. The importance of context Context is the condition in which something exists or happens. Context is important in data analytics because it helps you sift through huge amounts of disorganized data and turn it into something meaningful. The fact is, data has little value if it is not paired with context. Image of a hand putting the final puzzle piece in a 4-piece puzzle Understanding the context behind the data can help us make it more meaningful at every stage of the data analysis process. For example, you might be able to make a few guesses about what you're looking at in the following table, but you couldn't be certain without more context. 
2010 | 28000
2005 | 18000
2000 | 23000
1995 | 10000
On the other hand, if the first column was labeled to represent the years when a survey was conducted, and the second column showed the number of people who responded to that survey, then the table would start to make a lot more sense. Take this a step further, and you might notice that the survey is conducted every 5 years. This added context helps you understand why there are five-year gaps in the table.
Years (Collected every 5 years) | Respondents
2010 | 28000
2005 | 18000
2000 | 23000
1995 | 10000
Context can turn raw data into meaningful information. It is very important for data analysts to contextualize their data. This means giving the data perspective by defining it. To do this, you need to identify: Who: The person or organization that created, collected, and/or funded the data collection What: The things in the world that data could have an impact on Where: The origin of the data When: The time when the data was created or collected Why: The motivation behind the creation or collection How: The method used to create or collect it This is an image of an unlabeled graph with 3 dashed lines (red, blue, and yellow) with a star on the yellow line Understanding and including the context is important during each step of your analysis process, so it is a good idea to get comfortable with it early in your career. For example, when you collect data, you'll also want to ask questions about the context to make sure that you understand the business and business process. During organization, the context is important for your naming conventions, how you choose to show relationships between variables, and what you choose to keep or leave out. And finally, when you present, it is important to include contextual information so that your stakeholders understand your analysis. It's normal for conflict to come up in your work life. A lot of what you've learned so far, like managing expectations and communicating effectively, can help you avoid conflict, but sometimes you'll run into conflict anyway. If that happens, there are ways to resolve it and move forward. In this video, we will talk about how conflict could happen and the best ways you can practice conflict resolution. A conflict can pop up for a variety of reasons. Maybe a stakeholder misunderstood the possible outcomes for your project; maybe you and your team member have very different work styles; or maybe an important deadline is approaching and people are on edge. Mismatched expectations and miscommunications are some of the most common reasons conflicts happen. Maybe you weren't clear on who was supposed to clean a dataset and nobody cleaned it, delaying a project. Or maybe a teammate sent out an email with all of your insights included, but didn't mention it was your work. While it can be easy to take conflict personally, it's important to try to be objective and stay focused on the team's goals. Believe it or not, tense moments can actually be opportunities to re-evaluate a project and maybe even improve things. So when a problem comes up, there are a few ways you can flip the situation to be more productive and collaborative. One of the best ways you can shift a situation from problematic to productive is to just re-frame the problem. Instead of focusing on what went wrong or who to blame, change the question you're starting with. Try asking, how can I help you reach your goal?
This creates an opportunity for you and your team members to work together to find a solution instead of feeling frustrated by the problem. Discussion is key to conflict resolution. If you find yourself in the middle of a conflict, try to communicate, start a conversation, or ask things like, are there other important things I should be considering? This gives your team members or stakeholders a chance to fully lay out their concerns. But if you find yourself feeling emotional, give yourself some time to cool off so you can go into the conversation with a clearer head. If I need to write an email during a tense moment, I'll actually save it to drafts and come back to it the next day to reread it before sending to make sure that I'm being level-headed. If you find you don't understand what your team member or stakeholder is asking you to do, try to understand the context of their request. Ask them what their end goal is, what story they're trying to tell with the data or what the big picture is. By turning moments of potential conflict into opportunities to collaborate and move forward, you can resolve tension and get your project back on track. Instead of saying, "There's no way I can do that in this time frame," try to re-frame it by saying, "I would be happy to do that, but it will just take this amount of time. Let's take a step back so I can better understand what you'd like to do with the data, and we can work together to find the best path forward." With that, we've reached the end of this section. Great job. Learning how to work with new team members can be a big challenge in starting a new role or a new project, but with the skills you've picked up in these videos, you'll be able to start on the right foot with any new team you join. So far, you've learned about balancing the needs and expectations of your team members and stakeholders. You've also covered how to make sense of your team's roles and focus on the project objective, the importance of clear communication and communication expectations in a workplace, and how to balance the limitations of data with stakeholder asks. Finally, we covered how to have effective team meetings and how to resolve conflicts by thinking collaboratively with your team members. Hopefully now you understand how important communication is to the success of a data analyst. These communication skills might feel a little different from some of the other skills you've been learning in this program, but they're also an important part of your data analyst toolkit and your success as a professional data analyst. Just like all of the other skills you're learning right now, your communication skills will grow with practice and experience. Limitations of data Data is powerful, but it has its limitations. Has someone's personal opinion found its way into the numbers? Is your data telling the whole story? Part of being a great data analyst is knowing the limits of data and planning for them. This reading explores how you can do that. If you have incomplete or nonexistent data, you might realize during an analysis that you don't have enough data to reach a conclusion. Or, you might even be solving a different problem altogether! For example, suppose you are looking for employees who earned a particular certificate but discover that certification records go back only two years at your company. You can still use the data, but you will need to make the limits of your analysis clear.
You might be able to find an alternate source of the data by contacting the company that led the training. But to be safe, you should be up front about the incomplete dataset until that data becomes available. If you're collecting data from other teams and using existing spreadsheets, it is good to keep in mind that people use different business rules. So one team might define and measure things in a completely different way than another. For example, if a metric is the total number of trainees in a certificate program, you could have one team that counts every person who registered for the training, and another team that counts only the people who completed the program. In cases like these, establishing how to measure things early on standardizes the data across the board for greater reliability and accuracy. This will make sure comparisons between teams are meaningful and insightful. Dirty data refers to data that contains errors. Dirty data can lead to productivity loss, unnecessary spending, and unwise decision-making. A good data cleaning effort can help you avoid this. As a quick reminder, data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When you find and fix the errors, while tracking the changes you made, you can avoid a data disaster. You will learn how to clean data later in the training. Avinash Kaushik, a Digital Marketing Evangelist for Google, has lots of great tips for data analysts in his blog: Occam's Razor. Below are some of the best practices he recommends for good data storytelling: Compare the same types of data: Data can get mixed up when you chart it for visualization. Be sure to compare the same types of data and double-check that any segments in your chart definitely display different metrics. Visualize with care: A 0.01% drop in a score can look huge if you zoom in close enough. To make sure your audience sees the full story clearly, it is a good idea to set your Y-axis to 0. Leave out needless graphs: If a table can show your story at a glance, stick with the table instead of a pie chart or a graph. Your busy audience will appreciate the clarity. Test for statistical significance: Sometimes two datasets will look different, but you will need a way to test whether the difference is real and important. So remember to run statistical tests to see how much confidence you can place in that difference. Pay attention to sample size: Gather lots of data. If a sample size is small, a few unusual responses can skew the results. If you find that you have too little data, be careful about using it to form judgments. Look for opportunities to collect more data, then chart those trends over longer periods. In any organization, a big part of a data analyst's role is making sound judgments. When you know the limitations of your data, you can make judgment calls that help people make better decisions supported by the data. Data is an extremely powerful tool for decision-making, but if it is incomplete, misaligned, or hasn't been cleaned, then it can be misleading. Take the necessary steps to make sure that your data is complete and consistent. Clean the data before you begin your analysis to save yourself and possibly others a great amount of time and effort. Data modeling levels and techniques This reading introduces you to data modeling and different types of data models. Data models help keep data consistent and enable people to map out how data is organized.
A basic understanding makes it easier for analysts and other stakeholders to make sense of their data and use it in the right ways. Important note: As a junior data analyst, you won't be asked to design a data model. But you might come across existing data models your organization already has in place. What is data modeling? Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole. Levels of data modeling Each level of data modeling has a different level of detail. 1. Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details. 2. Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables. That's the job of a physical data model. 3. Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database. More information can be found in this comparison of data models. Data-modeling techniques There are a lot of approaches when it comes to developing data models, but two common methods are the Entity Relationship Diagram (ERD) and the Unified Modeling Language (UML) diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships. As a junior data analyst, you will need to understand that there are different data modeling techniques, but in practice, you will probably be using your organization’s existing technique. You can read more about ERD, UML, and data dictionaries in this data modeling techniques article. Data analysis and data modeling Data modeling can help you explore the high-level details of your data and how it is related across the organization’s information systems. Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data. And finally, data models make it easier for everyone in your organization to understand and collaborate with you on your data. This is important for you and everyone on your team! By now you've learned a lot about data. From generated data, to collected data, to data formats, it's good to know as much as you can about the data you'll use for analysis. In this video, we'll talk about another way you can describe data: the data type. A data type is a specific kind of data attribute that tells what kind of value the data is. 
In other words, a data type tells you what kind of data you're working with. Data types can be different depending on the query language you're using. For example, SQL allows for different data types depending on which database you're using. For now though, let's focus on the data types that you'll use in spreadsheets. To help us out, we'll use a spreadsheet that's already filled with data. We'll call it "Worldwide Interests in Sweets through Google Searches." Now a data type in a spreadsheet can be one of three things: a number, a text or string, or a Boolean. You might find spreadsheet programs that classify them a bit differently or include other types, but these value types cover just about any data you'll find in spreadsheets. We'll look at all of these in just a bit. Looking at columns B, D, and F, we find number data types. Each number represents the search interest for the terms "cupcakes," "ice cream," and "candy" for a specific week. The closer a number is to 100, the more popular that search term was during that week. One hundred represents peak popularity. Keep in mind that in this case, 100 is a relative value, not the actual number of searches. It represents the maximum number of searches during a certain time. Think of it like a percentage on a test. All other searches are then also valued out of 100. You might notice this in other data sets as well. Gold star for 100! If you needed to, you could change the numbers into percents or other formats, like currency. These are all examples of number data types. In column H, the data shows the most popular treat for each week, based on the search data. So as we'll find in cell H4 for the week beginning July 28th, 2019, the most popular treat was ice cream. This is an example of a text data type, or a string data type, which is a sequence of characters and punctuation that contains textual information. In this example, that information would be the treats and people's names. These can also include numbers, like phone numbers or numbers in street addresses. But these numbers wouldn't be used for calculations. In this case they're treated like text, not numbers. In columns C, E, and G, it seems like we've got some text. But the text here isn't a text or string data type. Instead, it's a Boolean data type. A Boolean data type is a data type with only two possible values: true or false. Columns C, E, and G show Boolean data for whether the search interest for each week, is at least 50 out of 100. Here's how it works. To get this data, we've created a formula that calculates whether the search interest data in columns B, D, and F is 50 or greater. In cell B4, the search interest is 14. In cell C4, we find the word false because, for this week of data, the search interest is less than 50. For each cell in columns C, E, and G, the only two possible values are true or false. We could change the formula so other words appear in these cells instead, but it's still Boolean data. You'll get a chance to read more about the Boolean data type soon. Let's talk about a common issue that people encounter in spreadsheets: mistaking data types with cell values. For example, in cell B57, we can create a formula to calculate data in other cells. This will give us the average of the search interests in cupcakes across all weeks in the dataset, which is about 15. The formula works because we calculated using a number data type. But if we tried it with a text or string data type, like the data in column C, we'd get an error. 
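As mentioned above, SQL also uses data types, and the exact names vary by database. Here is a minimal sketch of how the three spreadsheet value types in this example might map to SQL column types; the table and column names are hypothetical, and some databases use slightly different type names (for example, STRING and BOOL instead of VARCHAR and BOOLEAN).

-- Hypothetical table for the sweets search data
CREATE TABLE sweet_searches (
  week_start DATE,              -- the week each row describes
  cupcake_interest INTEGER,     -- number data type: search interest from 0 to 100
  top_treat VARCHAR(50),        -- text or string data type: the most popular treat
  interest_at_least_50 BOOLEAN  -- Boolean data type: only TRUE or FALSE
);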
Error values usually happen if a mistake is made in entering the values in the cells. The more you know your data types and which ones to use, the fewer errors you'll run into. There you have it, a data type for everyone. We're not done yet. Coming up, we'll go deeper into the relationship between data types, fields, and values. See you soon. Understanding Boolean logic In this reading, you will explore the basics of Boolean logic and learn how to use multiple conditions in a Boolean statement. These conditions are created with Boolean operators, including AND, OR, and NOT. These operators are similar to mathematical operators and can be used to create logical statements that filter your results. Data analysts use Boolean statements to do a wide range of data analysis tasks, such as creating queries for searches and checking for conditions when writing programming code. Boolean logic example Imagine you are shopping for shoes, and are considering certain preferences: You will buy the shoes only if they are pink and grey You will buy the shoes if they are entirely pink or entirely grey, or if they are pink and grey You will buy the shoes if they are grey, but not if they have any pink Below are Venn diagrams that illustrate these preferences. AND is the center of the Venn diagram, where two conditions overlap. OR includes either condition. NOT includes only the part of the Venn diagram that doesn't contain the exception. The AND operator Your condition is "If the color of the shoe has any combination of grey and pink, you will buy them." The Boolean statement would break down the logic of that statement to filter your results by both colors. It would say "IF (Color="Grey") AND (Color="Pink") then buy them." The AND operator lets you stack multiple conditions. Below is a simple truth table that outlines the Boolean logic at work in this statement. In the Color is Grey column, there are two pairs of shoes that meet the color condition. And in the Color is Pink column, there are two pairs that meet that condition. But in the If Grey AND Pink column, there is only one pair of shoes that meets both conditions. So, according to the Boolean logic of the statement, there is only one pair marked true. In other words, there is one pair of shoes that you can buy.
Color is Grey | Color is Pink | If Grey AND Pink, then Buy | Boolean logic
Grey/True | Pink/True | True/Buy | True AND True = True
Grey/True | Black/False | False/Don't buy | True AND False = False
Red/False | Pink/True | False/Don't buy | False AND True = False
Red/False | Green/False | False/Don't buy | False AND False = False
The OR operator The OR operator lets you move forward if either one of your two conditions is met. Your condition is "If the shoes are grey or pink, you will buy them." The Boolean statement would be "IF (Color="Grey") OR (Color="Pink") then buy them." Notice that any shoe that meets either the Color is Grey or the Color is Pink condition is marked as true by the Boolean logic. According to the truth table below, there are three pairs of shoes that you can buy.
Color is Grey | Color is Pink | If Grey OR Pink, then Buy | Boolean logic
Red/False | Black/False | False/Don't buy | False OR False = False
Black/False | Pink/True | True/Buy | False OR True = True
Grey/True | Green/False | True/Buy | True OR False = True
Grey/True | Pink/True | True/Buy | True OR True = True
The NOT operator Finally, the NOT operator lets you filter by subtracting specific conditions from the results. Your condition is "You will buy any grey shoe except for those with any traces of pink in them."
Your Boolean statement would be "IF (Color="Grey") AND (Color=NOT "Pink") then buy them." Now, all of the grey shoes that aren't pink are marked true by the Boolean logic for the NOT Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink condition. Only one pair of shoes is excluded in the truth table below.
Color is Grey | Color is Pink | Boolean logic for NOT Pink | If Grey AND (NOT Pink), then Buy
Grey/True | Red/False | Not False = True | True/Buy
Grey/True | Black/False | Not False = True | True/Buy
Grey/True | Green/False | Not False = True | True/Buy
Grey/True | Pink/True | Not True = False | False/Don't buy
The power of multiple conditions For data analysts, the real power of Boolean logic comes from being able to combine multiple conditions in a single statement. For example, if you wanted to filter for shoes that were grey or pink, and waterproof, you could construct a Boolean statement such as: "IF ((Color = "Grey") OR (Color = "Pink")) AND (Waterproof="True")." Notice that you can use parentheses to group your conditions together. Whether you are doing a search for new shoes or applying this logic to your database queries, Boolean logic lets you create multiple conditions to filter your results. And now that you know a little more about how Boolean logic is used, you can start using it! Additional Reading/Resources Learn about who pioneered Boolean logic in this historical article: Origins of Boolean Algebra in the Logic of Classes. Find more information about using AND, OR, and NOT from these tips for searching with Boolean operators. In this reading, you will explore how data is transformed and the differences between wide and long data. Data transformation is the process of changing the data's format, structure, or values. As a data analyst, there is a good chance you will need to transform data at some point to make it easier for you to analyze it. Data transformation usually involves: Adding, copying, or replicating data Deleting fields or records Standardizing the names of variables Renaming, moving, or combining columns in a database Joining one set of data with another Saving a file in a different format. For example, saving a spreadsheet as a comma-separated values (CSV) file. Why transform data? Goals for data transformation might be: Data organization: better organized data is easier to use Data compatibility: different applications or systems can then use the same data Data migration: data with matching formats can be moved from one system to another Data merging: data with the same organization can be merged together Data enhancement: data can be displayed with more detailed fields Data comparison: apples-to-apples comparisons of the data can then be made Data transformation example: data merging Mario is a plumber who owns a plumbing company. After years in the business, he buys another plumbing company. Mario wants to merge the customer information from his newly acquired company with his own, but the other company uses a different database. So, Mario needs to make the data compatible. To do this, he has to transform the format of the acquired company's data. Then, he must remove duplicate rows for customers they had in common. When the data is compatible and together, Mario's plumbing company will have a complete and merged customer database. Data transformation example: data organization (long to wide) To make it easier to create charts, you may also need to transform long data to wide data.
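Before working through the stock-price example below, it may help to see how the combined shoe condition from the Boolean logic reading above could look in a query. This is only a rough sketch; the shoes table and its color and waterproof columns are hypothetical.

-- Hypothetical table: shoes(shoe_id, color, waterproof)
-- AND and OR: shoes that are grey or pink, and also waterproof
SELECT shoe_id, color
FROM shoes
WHERE (color = 'Grey' OR color = 'Pink')
  AND waterproof = TRUE;

-- NOT: exclude any shoe whose color is pink
SELECT shoe_id, color
FROM shoes
WHERE NOT color = 'Pink';

The parentheses group the OR condition, just as they do in the Boolean statement above.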
Consider the following example of transforming stock prices (collected as long data) to wide data. Long data is data where each row contains a single data point for a particular item. In the long data example below, individual stock prices (data points) have been collected for Apple (AAPL), Amazon (AMZN), and Google (GOOGL) (particular items) on the given dates. Long data example: Stock prices Wide data is data where each row contains multiple data points for the particular items identified in the columns. Wide data example: Stock prices With data transformed to wide data, you can create a chart comparing how each company's stock changed over the same period of time. You might notice that all the data included in the long format is also in the wide format. But wide data is easier to read and understand. That is why data analysts typically transform long data to wide data more often than they transform wide data to long data. The following table summarizes when each format is preferred:
Wide data is preferred when: Creating tables and charts with a few variables about each subject; Comparing straightforward line graphs
Long data is preferred when: Storing a lot of variables about each subject (for example, many years of rates for each bank); Performing advanced statistical analysis or graphing
Activity overview By now, you've learned a lot about different data types and data structures. In this activity, you will work with datasets from Kaggle, an online community of people passionate about data. To start this activity, you'll create a Kaggle account, set up a profile, and explore Kaggle notebooks. Every data analyst has a data community that they rely on for help, support, and inspiration. Kaggle can help you build your own data community. Kaggle has millions of users in all stages of their data career, from beginners to data scientists with decades of experience. The Kaggle community brings people together to develop their data analysis skills, share datasets and interactive notebooks, and collaborate on solving real-life data problems. Check out this brief introductory video to learn more about Kaggle. By the time you complete this activity, you will be able to use many of Kaggle's key features. This will enable you to create notebooks and browse data, which is important for completing and sharing data projects in your career as a data analyst. Create a Kaggle account To get started, follow these steps to create a Kaggle account. Note: Kaggle frequently updates its user interface. The latest changes may not be reflected in the screenshots, but the principles in this activity remain the same. Adapting to changes in software updates is an essential skill for data analysts, and we encourage you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help. 1. Go to kaggle.com 2. Click on the Register button at the top-right of the Kaggle homepage. You can register with your Google credentials or your email address. Screenshot of the Kaggle homepage. The Register button is highlighted 3. Once you're registered and logged in to Kaggle, click on the Account icon at the top-right of your screen. In the menu that opens, click the Your Profile button. 4. On your profile page, click on the Edit Profile button. Enter any information you'd like to share with the Kaggle community. Your profile will be public, so only enter the information you're comfortable sharing. 5.
If you want some inspiration, check out the profile of Kaggle's Community Advocate, Jesse Mostipak! Explore Kaggle notebooks Now that you've created an account and set up your profile, you can check out some notebooks on Kaggle. Kagglers use notebooks to share datasets and data analyses. Step 1: Go to the Code home page First, go to the Navigation bar on the left side of your screen. Then, click on the Code icon. This takes you to the Code home page. Step 2: Review Kaggler contributions On the Code home page, you'll notice links to notebooks created by other Kagglers. To begin, feel free to scroll through the list and click on notebooks that interest you. As you explore, you may come across unfamiliar terms and new information. That's fine! Kagglers come from diverse backgrounds and focus on different areas of data analysis, data science, machine learning, and deep learning. Step 3: Narrow your search Once you're familiar with the Code home page, you can narrow your search results by typing a word in the search bar or by using the filter feature. For example, type Beginner in the search bar to show notebooks tagged as beginner-friendly. Or, click on the Filter icon, the triangle shape on the right side of the search bar. You can filter results by tags, programming language, output, and other options. Filter to Datasets to show notebooks that use one of the tens of thousands of public datasets available on Kaggle. Step 4: Review suggested notebooks If you're looking for specific suggestions, check out the following notebooks: gganimate by Meg Risdal Getting staRted in R by Rachael Tatman Writing Hamilton Lyrics with TensorFlow/R by Ana Sofia Uzsoy Dive into dplyr (tutorial #1) by Jesse Mostipak Spend some time checking out a couple of notebooks to get an idea of the work that Kagglers share online, and that you'll be able to create by the time you've finished this course! Edit a notebook Now, take a look at a specific notebook: Dive into dplyr (tutorial #1) by Jesse Mostipak. Follow these steps to learn how to edit notebooks: 1. Click on the link to open up the notebook. It contains the dataset you'll work with later on. 2. Click on the Copy and Edit button at the top-right to make a copy of the notebook in your account. Now, the notebook appears in Edit mode. Edit mode lets you make changes to the notebook if you want. Screenshot of a notebook viewer page. Introductory text has been copied and pasted This notebook is private. If you want to share your work, you can choose to make it public. When you copy and edit another Kaggler's work, always make meaningful changes to the notebook before publishing it. That way, you're not misrepresenting someone else's work as your own. 3. Take a moment to explore the Edit mode of the notebook. Some of this may seem unfamiliar, and that's just fine. By the end of this course, you'll know how to create a notebook like this from scratch! What is data anonymization? You have been learning about the importance of privacy in data analytics. Now, it is time to talk about data anonymization and what types of data should be anonymized. Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person's identity. Data anonymization is the process of protecting people's private or sensitive data by eliminating that kind of information.
Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values. Your role in data anonymization Organizations have a responsibility to protect their data and the personal information that data might contain. As a data analyst, you might be expected to understand what data needs to be anonymized, but you generally wouldn't be responsible for the data anonymization itself. A rare exception might be if you work with a copy of the data for testing or development purposes. In this case, you could be required to anonymize the data before you work with it. What types of data should be anonymized? Healthcare and financial data are two of the most sensitive types of data. These industries rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why data in these two industries usually goes through de-identification, which is a process used to wipe data clean of all personally identifying information. Data anonymization is used in just about every industry. That is why it is so important for data analysts to understand the basics. Here is a list of data that is often anonymized: Telephone numbers Names License plates and license numbers Social security numbers IP addresses Medical records Email addresses Photographs Account numbers For some people, it just makes sense that this type of data should be anonymized. For others, we have to be very specific about what needs to be anonymized. Imagine a world where we all had access to each other’s addresses, account numbers, and other identifiable information. That would invade a lot of people’s privacy and make the world less safe. Data anonymization is one of the ways we can keep data private and secure! The open-data debate Just like data privacy, open data is a widely debated topic in today’s world. Data analysts think a lot about open data, and as a future data analyst, you need to understand the basics to be successful in your new role. What is open data? In data analytics, open data is part of data ethics, which has to do with using data ethically. Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to: Be available and accessible to the public as a complete dataset Be provided under terms that allow it to be reused and redistributed Allow universal participation so that anyone can use, reuse, and redistribute the data Data can only be considered open when it meets all three of these standards. The open data debate: What data should be publicly available? One of the biggest benefits of open data is that credible databases can be used more widely. Basically, this means that all of that good data can be leveraged, shared, and combined with other data. This could have a huge impact on scientific collaboration, research advances, analytical capacity, and decision-making. But it is important to think about the individuals being represented by the public, open data, too. Third-party data is collected by an entity that doesn’t have a direct relationship with the data. You might remember learning about this type of data earlier. For example, third parties might collect information about visitors to a certain website. Doing this lets these third parties create audience profiles, which helps them better understand user behavior and target them with more effective advertising. 
Personally identifiable information (PII) is data that is reasonably likely to identify a person and make information known about them. It is important to keep this data safe. PII can include a person's address, credit card information, social security number, medical records, and more. Everyone wants to keep personal information about themselves private. Because third-party data is readily available, it is important to balance the openness of data with the privacy of individuals. Balancing security and analytics The battle between security and data analytics Data security means protecting data from unauthorized access or corruption by putting safety measures in place. Usually the purpose of data security is to keep unauthorized users from accessing or viewing sensitive data. Data analysts have to find a way to balance data security with their actual analysis needs. This can be tricky: we want to keep our data safe and secure, but we also want to use it as soon as possible so that we can make meaningful and timely observations. In order to do this, companies need to find ways to balance their data security measures with their data access needs. Luckily, there are a few security measures that can help companies do just that. The two we will talk about here are encryption and tokenization. Encryption uses a unique algorithm to alter data and make it unusable by users and applications that don't know the algorithm. This algorithm is saved as a "key" which can be used to reverse the encryption; so if you have the key, you can still use the data in its original form. Tokenization replaces the data elements you want to protect with randomly generated data referred to as a "token." The original data is stored in a separate location and mapped to the tokens. To access the complete original data, the user or application needs to have permission to use the tokenized data and the token mapping. This means that even if the tokenized data is hacked, the original data is still safe and secure in a separate location. Encryption and tokenization are just some of the data security options out there. There are a lot of others, like using authentication devices for AI technology. As a junior data analyst, you probably won't be responsible for building out these systems. A lot of companies have entire teams dedicated to data security or hire third-party companies that specialize in data security to create these systems. But it is important to know that all companies have a responsibility to keep their data secure, and to understand some of the potential systems your future employer might use. Signing up Signing up with LinkedIn is simple. Just follow these steps: 1. Browse to linkedin.com 2. Click Join now or Join with resume. If you clicked Join now: 1. Enter your email address and a password and click Agree & Join (or click Join with Google to link to a Google account). 2. Enter your first and last name and click Continue. 3. Enter your country/region, your postal code, and your location within the area (this helps LinkedIn find job opportunities near you). 4. Enter your most recent job title, or select I'm a student. 5. If you entered your most recent job title, select your employment type and enter the name of your most recent company. 6. If you selected self-employed or freelance, LinkedIn will ask for your industry. 7. Click confirm your email address. You will receive an email from LinkedIn. 8. To confirm your email address, click Agree & Confirm in your email. 9.
LinkedIn will then ask if you are looking for a job. Click the answer that applies. If you select Yes, LinkedIn will help you start looking for job opportunities. If you clicked Join with resume: 1. Click Upload your resume and select the file to upload. 2. Follow any of the steps under Join Now that are relevant. The Join with resume option saves you some time because it auto-fills most of the information from your resume. And just like that, your initial profile is now ready! Including basic information in your profile It is a good idea to take your time filling out every section of your profile. This helps recruiters find your profile and helps people you connect with get to know you better. Start with your photo. Here are some tips to help you choose a great picture for your new profile: Choose an image that looks like you: You want to make sure that your profile is the best representation of you and that includes your photo. You want a potential connection or potential employer to be able to recognize you from your profile picture if you were to meet. Use your industry as an example: If you are having trouble deciding what is appropriate for your profile image, look at other profiles in the same industry or from companies you are interested in to get a better sense of what you should be doing. Choose a high-resolution image: The better the resolution, the better impression it makes, so make sure the image you choose isn’t blurry. The ideal image size for a LinkedIn profile picture is 400 x 400 pixels. Use a photo where your face takes up at least 60% of the space in the frame. Remember to smile: Your profile picture is a snapshot of who you are as a person so it is okay to be serious in your photo. But smiling helps put potential connections and potential employers at ease. Adding connections Connections are a great way to keep up to date with your previous coworkers, colleagues, classmates, or even companies you want to work with. The world is a big place with a lot of people. So here are some tips to help get you started. 1. Connect to people you know personally. 2. Add a personal touch to your invitation message. Instead of just letting them know you would like to connect, let them know why. 3. Make sure your profile picture is current so people can recognize you. 4. Add value. Provide them with a resource, a website link, or even some content they might find interesting in your invitation to connect. Finding leaders and influencers LinkedIn is a great place to find great people and great ideas. From technology to marketing, and everything in between, there are all kinds of influencers and thought leaders active on LinkedIn. If you have ever wanted to know the thoughts of some of the most influential and respected minds in a certain field, LinkedIn is a great place to start. Following your favorite people takes only a few minutes. You can search for people or companies individually, or you can use these lists as starting points. Top influencers on LinkedIn LinkedIn Top Voices 2020: Data Science & AI Looking for a new position On LinkedIn, letting recruiters and potential employers know that you are in the market for a new job is simple. Just follow these steps: 1. Click the Me icon at the top of your LinkedIn homepage. 2. Click View profile. 3. Click the Add profile section drop-down and under Intro, select Looking for a new job. 
Make sure to select the appropriate filters for the new positions you might be looking for and update your profile to better fit the role that you are applying for. Keeping your profile up to date Add to your profile to keep it complete, current, and interesting. For example, remember to add the Google Data Analytics Certificate to your profile after you complete the program! Building connections on LinkedIn Using LinkedIn to connect A connection is someone you know and trust on a personal or professional basis. Your connections are who make up your network. And when it comes to your network, it is important to remember quality over quantity. So don’t focus on how many connections you have. Instead, make sure that everyone you connect with adds value to your network, and vice versa. Inviting those you know versus making cold requests Adding connections on LinkedIn is easy. You invite people to join your network, and they accept your invitation. When you send an invitation, you can attach a personal note. Personal notes are highly recommended. A great way to increase the number of your connections is to invite classmates, friends, teachers, or even members of a club or organization you are in. LinkedIn also gives suggestions for connections based on your profile information. Here's an example (template) that you can use to connect with a former co-worker: The message: Hi <fill in name here>, Please accept my invitation to connect. It has been a while since we were at <fill in company name here> and I look forward to catching up with you. I’m looking for job opportunities and would love to hear about what you’re doing and who is hiring in your organization. Best regards, <fill in your name here> Cold requests on LinkedIn are invitations to connect with people you don’t know personally or professionally. When you start to build your network, it is best to connect with people you already know. But cold requests might be the only way to connect with people who work at companies you are interested in. You can learn a lot about a company’s culture and job openings from current employees. As a best practice, send cold requests rarely and only when there is no other way to connect. Asking for recommendations (references) Recommendations on LinkedIn are a great way to have others vouch for you. Ask people to comment on your past performance, how you handled a challenging project, or your strengths as a data analyst. You can choose to accept, reject, show, or hide recommendations in your profile. Here are some tips for asking for a recommendation: Reach out to a variety of people for a 360-degree view: supervisors, coworkers, direct reports, partners, and clients Personalize the recommendation request with a custom message Suggest strengths and capabilities they can highlight as part of your request Be willing to write a recommendation in return Read the recommendation carefully before you accept it into your profile Sometimes the hardest part of getting a recommendation is creating the right request message. Here's an example (template) that you can use to ask for a recommendation: Hi <fill in name here>, How are you? I hope you are well. I’m preparing for a new job search and would appreciate it if you could write a recommendation that highlights my <insert your specific skill here>. Our experience working on <insert project here> is a great example and I would be happy to provide other examples if you need them. Please let me know if I can write a recommendation for you. 
I would be very glad to return the favor. Thanks in advance for your support! <fill in your name here> Ask a few connections to recommend you and highlight why you should be hired. Recommendations help prospective employers get a better idea of who you are and the quality of your work. Summing it up When you write thoughtful posts and respond to others genuinely, people in and even outside your network will be open and ready to help you during your job search. More about data integrity and compliance This reading illustrates the importance of data integrity using an example of a global company’s data. Definitions of terms that are relevant to data integrity will be provided at the end. Scenario: calendar dates for a global company Calendar dates are represented in a lot of different short forms. Depending on where you live, a different format might be used. In some countries,12/10/20 (DD/MM/YY) stands for October 12, 2020. In other countries, the national standard is YYYY-MM-DD so October 12, 2020 becomes 2020-10-12. In the United States, (MM/DD/YY) is the accepted format so October 12, 2020 is going to be 10/12/20. Now, think about what would happen if you were working as a data analyst for a global company and didn’t check date formats. Well, your data integrity would probably be questionable. Any analysis of the data would be inaccurate. Imagine ordering extra inventory for December when it was actually needed in October! A good analysis depends on the integrity of the data, and data integrity usually depends on using a common format. So it is important to double-check how dates are formatted to make sure what you think is December 10, 2020 isn’t really October 12, 2020, and vice versa. Here are some other things to watch out for: Data replication compromising data integrity: Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates. Data transfer compromising data integrity: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity. Data manipulation compromising data integrity: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness. Conclusion Fortunately, with a standard date format and compliance by all people and systems that work with the data, data integrity can be maintained. But no matter where your data comes from, always be sure to check that it is valid, complete, and clean before you begin any analysis. 
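To tie the date example above to something concrete, here is a minimal sketch in BigQuery-style SQL showing how a string recorded as DD/MM/YY can be parsed and re-emitted in the YYYY-MM-DD standard. The table and column names (orders, order_date_text) are hypothetical and used only for illustration.

-- Parse a DD/MM/YY string such as '12/10/20' into a proper DATE,
-- then format it consistently as YYYY-MM-DD (2020-10-12).
SELECT
  FORMAT_DATE('%Y-%m-%d', PARSE_DATE('%d/%m/%y', order_date_text)) AS standardized_date
FROM
  orders;

Agreeing on one storage format up front, and converting at the point of entry, helps avoid the December-versus-October mix-up described above.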
Reference: Data constraints and examples
As you progress in your data journey, you'll come across many types of data constraints (or criteria that determine validity). The list below offers definitions and examples of data constraint terms you might come across.

Data type: Values must be of a certain type: date, number, percentage, Boolean, etc. Example: if the data type is a date, a single number like 30 would fail the constraint and be invalid.
Data range: Values must fall between predefined maximum and minimum values. Example: if the data range is 10-20, a value of 30 would fail the constraint and be invalid.
Mandatory: Values can't be left blank or empty. Example: if age is mandatory, that value must be filled in.
Unique: Values can't have a duplicate. Example: two people can't have the same mobile phone number within the same service area.
Regular expression (regex) patterns: Values must match a prescribed pattern. Example: a phone number must match ###-###-#### (no other characters allowed).
Cross-field validation: Certain conditions for multiple fields must be satisfied. Example: values are percentages and the values from multiple fields must add up to 100%.
Primary-key: (Databases only) Value must be unique per column. A primary key is an identifier in a column in which each value is unique. Example: a database table can't have two rows with the same primary key value. More about primary and foreign keys is provided later in the program.
Set-membership: (Databases only) Values for a column must come from a set of discrete values. Example: the value for a column must be set to Yes, No, or Not Applicable.
Foreign-key: (Databases only) Values for a column must be unique values coming from a column in another table. Example: in a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate States table.
Accuracy: The degree to which the data conforms to the actual entity being measured or described. Example: if values for zip codes are validated by street location, the accuracy of the data goes up.
Completeness: The degree to which the data contains all desired components or measures. Example: if data for personal profiles required hair and eye color and both are collected, the data is complete.
Consistency: The degree to which the data is repeatable from different points of entry or collection. Example: if a customer has the same address in the sales and repair databases, the data is consistent.

Hey there, it's good to remember to check for data integrity. It's also important to check that the data you use aligns with the business objective. This adds another layer to the maintenance of data integrity because the data you're using might have limitations that you'll need to deal with. The process of matching data to business objectives can actually be pretty straightforward. Here's a quick example. Let's say you're an analyst for a business that produces and sells auto parts. If you need to address a question about the revenue generated by the sale of a certain part, then you'd pull up the revenue table from the dataset. If the question is about customer reviews, then you'd pull up the reviews table to analyze the average ratings. But before digging into any analysis, you need to consider a few limitations that might affect it. If the data hasn't been cleaned properly, then you won't be able to use it yet. You would need to wait until a thorough cleaning has been done.

Now, let's say you're trying to find how much an average customer spends. You notice the same customer's data showing up in more than one row. This is called duplicate data. To fix this, you might need to change the format of the data, or you might need to change the way you calculate the average. Otherwise, it will seem like the data is for two different people, and you'll be stuck with misleading calculations.
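To make that duplicate-data point concrete, here is a minimal SQL sketch. It assumes a hypothetical transactions table with customer_id and amount columns; the table and column names are illustrative only and not part of the course dataset.

-- Average spend per customer, counting each customer once
-- even if their transactions appear in multiple rows.
SELECT
  SUM(amount) / COUNT(DISTINCT customer_id) AS avg_spend_per_customer
FROM
  transactions;

If the same customer appeared in three rows, averaging over the rows would treat them as three people; dividing total revenue by the count of distinct customers avoids that distortion.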
You might also realize there's not enough data to complete an accurate analysis. Maybe you only have a couple of months' worth of sales data. There's a slim chance you could wait for more data, but it's more likely that you'll have to change your process or find alternate sources of data while still meeting your objective.

I like to think of a dataset like a picture. Take this picture. What are we looking at? Unless you're an expert traveler or know the area, it may be hard to pick out from just these two images. Visually, it's very clear when we aren't seeing the whole picture. When you get the complete picture, you realize... you're in London! With incomplete data, it's hard to see the whole picture to get a real sense of what is going on. We sometimes trust data because if it comes to us in rows and columns, it seems like everything we need is there if we just query it. But that's just not true.

I remember a time when I found out I didn't have enough data and had to find a solution. I was working for an online retail company and was asked to figure out how to shorten customer purchase-to-delivery time. Faster delivery times usually lead to happier customers. When I checked the dataset, I found very limited tracking information. We were missing some pretty key details. So the data engineers and I created new processes to track additional information, like the number of stops in a journey. Using this data, we reduced the time it took from purchase to delivery and saw an improvement in customer satisfaction. That felt pretty great! Learning how to deal with data issues while staying focused on your objective will help set you up for success in your career as a data analyst. And your path to success continues. Next step, you'll learn more about aligning data to objectives. Keep it up!

Well-aligned objectives and data
You can gain powerful insights and make accurate conclusions when data is well-aligned to business objectives. As a data analyst, alignment is something you will need to judge. Good alignment means that the data is relevant and can help you solve a business problem or determine a course of action to achieve a given business objective. In this reading, you will review the business objectives associated with three scenarios. You will explore how clean data and well-aligned business objectives can help you come up with accurate conclusions. On top of that, you will learn how new variables discovered during data analysis can cause you to set up data constraints so you can keep the data aligned to a business objective.

Clean data + alignment to business objective = accurate conclusions
Business objective: Account managers at Impress Me, an online content subscription service, want to know how soon users view content after their subscriptions are activated. To start off, the data analyst verifies that the data exported to spreadsheets is clean and confirms that the data needed (when users access content) is available. Knowing this, the analyst decides there is good alignment of the data to the business objective. All that is missing is figuring out exactly how long it takes each user to view content after their subscription has been activated.
Here are the data processing steps the analyst takes for a user from an account called V&L Consulting. (These steps would be repeated for each subscribing account, and for each user associated with that account.)

Step 1. Data-processing step: Look up the activation date for V&L Consulting. Source of data: Account spreadsheet. Result: October 21, 2019.
Step 2. Data-processing step: Look up the name of a user belonging to the V&L Consulting account. Source of data: Account spreadsheet. Result: Maria Ballantyne.
Step 3. Data-processing step: Find the first content access date for Maria B. Source of data: Content usage spreadsheet. Result: October 31, 2019.
Step 4. Data-processing step: Calculate the time between activation and first content usage for Maria B. Source of data: New spreadsheet. Result: 10 days.

Pro tip 1: In the above process, the analyst could use VLOOKUP to look up the data in Steps 1, 2, and 3 to populate the values in the spreadsheet in Step 4. VLOOKUP is a spreadsheet function that searches for a certain value in a column to return a related piece of information. Using VLOOKUP can save a lot of time; without it, you have to look up dates and names manually. Refer to the VLOOKUP page in the Google Help Center for how to use the function in Google Sheets.

Pro tip 2: In Step 4 of the above process, the analyst could use the DATEDIF function to automatically calculate the difference between the dates in column C and column D. The function can calculate the number of days between two dates. Refer to the Microsoft Support DATEDIF page for how to use the function in Excel. The DAYS360 function does the same thing in accounting spreadsheets that use a 360-day year (twelve 30-day months). Refer to the DATEDIF page in the Google Help Center for how to use the function in Google Sheets.
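As a quick illustration of those two pro tips, here is a minimal sketch of what the formulas might look like. It assumes a hypothetical layout in the new spreadsheet where column A holds the user's name, column C the activation date, and column D the first content access date; the sheet name and cell references are illustrative only, not part of the course files.

=VLOOKUP(A2, 'Content usage'!A:B, 2, FALSE)   (looks up the name in A2 in the first column of the content usage sheet and returns the matching access date from the second column)
=DATEDIF(C2, D2, "D")   (returns the number of whole days between the activation date in C2 and the first access date in D2)

With formulas like these filled down the sheet, the analyst gets the activation-to-first-use time for every user without looking anything up by hand.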
Alignment to business objective + additional data cleaning = accurate conclusions
Business objective: Cloud Gate, a software company, recently hosted a series of public webinars as free product introductions. The data analyst and webinar program manager want to identify companies that had five or more people attend these sessions. They want to give this list of companies to sales managers who can follow up for potential sales.

The webinar attendance data includes the fields shown below:
Name: <First name> <Last name> (required information attendees had to provide)
Email Address: xxxxx@company.com (required information attendees had to provide)
Company: <Company name> (optional information attendees could provide)

Data cleaning: The webinar attendance data seems to align with the business objective. But the data analyst and program manager decide that some data cleaning is needed before the analysis. They think data cleaning is required because:
The company name wasn't a mandatory field. If the company name is blank, it might be found from the email address. For example, if the email address is username@google.com, the company field could be filled in with Google for the data analysis. This data cleaning step assumes that people with company-assigned email addresses attended a webinar for business purposes.
Attendees could enter any name. Since attendance across a series of webinars is being looked at, they need to validate names against unique email addresses. For example, if Joe Cox attended two webinars but signed in as Joe Cox for one and Joseph Cox for the other, he would be counted as two different people. To prevent this, they need to check his unique email address to determine that he was the same person. After the validation, Joseph Cox could be changed to Joe Cox to match the other instance.

Alignment to business objective + newly discovered variables + constraints = accurate conclusions
Business objective: An after-school tutoring company, A+ Education, wants to know if there is a minimum number of tutoring hours needed before students have at least a 10% improvement in their assessment scores. The data analyst thinks there is good alignment between the data available and the business objective because:
Students log in and out of a system for each tutoring session, and the number of hours is tracked
Assessment scores are regularly recorded

Data constraints for new variables: After looking at the data, the data analyst discovers that there are other variables to consider. Some students had consistent weekly sessions while other students had scheduled sessions more randomly, even though their total number of tutoring hours was the same. The data doesn't align as well with the original business objective as first thought, so the analyst adds a data constraint to focus only on the students with consistent weekly sessions. This modification helps to get a more accurate picture about the enrollment time needed to achieve a 10% improvement in assessment scores.

Key takeaways
Hopefully these examples give you a sense of what to look for to know if your data aligns with your business objective. When there is clean data and good alignment, you can get accurate insights and make conclusions the data supports. If there is good alignment but the data needs to be cleaned, clean the data before you perform your analysis. If the data only partially aligns with an objective, think about how you could modify the objective, or use data constraints to make sure that the subset of data better aligns with the business objective.

What to do when you find an issue with your data
When you are getting ready for data analysis, you might realize you don't have the data you need or you don't have enough of it. In some cases, you can use what is known as proxy data in place of the real data. Think of it like substituting oil for butter in a recipe when you don't have butter. In other cases, there is no reasonable substitute and your only option is to collect more data. Consider the following data issues and suggestions on how to work around them.

Data issue 1: no data
Possible solution: Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. Example: if you are surveying employees about what they think of a new performance and bonus plan, use a sample for a preliminary survey. Then, ask for another three weeks to collect the data from the rest of the company.
Possible solution: If there isn't time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround. Example: if you are analyzing peak travel times for commuters but don't have the data for a particular city, use the data from another city with a similar size and demographic.

Data issue 2: too little data
Possible solution: Do the analysis using proxy data along with actual data. Example: if you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors.
Possible solution: Adjust your analysis to align with the data you already have. Example:
if you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older.

Data issue 3: wrong data, including data with errors*
Possible solution: If you have the wrong data because requirements were misunderstood, communicate the requirements again. Example: if you need the data for female voters and received the data for male voters, restate your needs.
Possible solution: Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. Example: if your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.
Possible solution: If you can't correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won't cause systematic bias. Example: if your dataset was translated from a different language and some of the translations don't make sense, ignore the data with bad translation and go ahead with the analysis using the other data.
* Important note: sometimes data with errors can be a warning sign that the data isn't reliable. Use your best judgment.
Use the following decision tree as a reminder of how to deal with data errors or not enough data:

Calculating sample size
Before you dig deeper into sample size, familiarize yourself with these terms and definitions:
Population: The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.
Sample: A subset of your population. Just like a food sample, it is called a sample because it is only a part of the whole. If your company is too large to survey every individual, you can survey a representative sample of your population.
Margin of error: Since a sample is used to represent a population, the sample's results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.
Confidence level: How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is chosen before you start your study because it will affect how big your margin of error is at the end of your study.
Confidence interval: The range of possible values that the population's result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
Statistical significance: The determination of whether your result could be due to random chance or not. The greater the significance, the less due to chance.

Things to remember when determining the size of your sample
When figuring out a sample size, here are things to keep in mind:
Don't use a sample size less than 30. It has been statistically proven that 30 is the smallest sample size where an average result of a sample starts to represent the average result of a population.
The confidence level most commonly used is 95%, but 90% can work in some cases.
Increase the sample size to meet specific needs of your project:
For a higher confidence level, use a larger sample size
To decrease the margin of error, use a larger sample size
For greater statistical significance, use a larger sample size
Note: Sample size calculators use statistical formulas to determine a sample size. More about these are coming up in the course! Stay tuned.

Why a minimum sample of 30?
This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and statistics.
As sample size increases, the results more closely resemble the normal (bell-shaped) distribution from a large number of samples. A sample of 30 is the smallest sample size for which the CLT is still valid. Researchers who rely on regression analysis – statistical methods to determine the relationships between controlled and dependent variables – also prefer a minimum sample of 30. Still curious? Without getting too much into the math, check out these articles: Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit Theorem and briefly describes how it can apply to an analysis of a stock index. Sample Size Formula: This article by Statistics Solutions provides a little more detail about why some researchers use 30 as a minimum sample size. Sample sizes vary by business problem Sample size will vary based on the type of business problem you are trying to solve. For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like? Would 200 be alright if the people surveyed represented every district in the city? Answer: It depends on the stakes. A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey. Larger sample sizes have a higher cost You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results. Knowing the basics is helpful Knowing the basics will help you make the right choices when it comes to sample size. You can always raise concerns if you come across a sample size that is too small. A sample size calculator is also a great tool for this. Sample size calculators let you enter a desired confidence level and margin of error for a given population size. They then calculate the sample size needed to statistically achieve those results. Refer to the Determine the Best Sample Size video for a demonstration of a sample size calculator, or refer to the Sample Size Calculator reading for additional information. What to do when there is no data Earlier, you learned how you can still do an analysis using proxy data if you have no data. You might have some questions about proxy data, so this reading will give you a few more examples of the types of datasets that can serve as alternate data sources. Proxy data examples Sometimes the data to support a business objective isn’t readily available. This is when proxy data is useful. 
Take a look at the following scenarios and where proxy data comes in for each example:

Business scenario: A new car model was just launched a few days ago and the auto dealership can't wait until the end of the month for sales data to come in. They want sales projections now. How proxy data can be used: The analyst proxies the number of clicks to the car specifications on the dealership's website as an estimate of potential sales at the dealership.
Business scenario: A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years. How proxy data can be used: The analyst proxies the sales data for a similar meat substitute made out of tofu that has been on the market for several years.
Business scenario: The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren't publicly available yet. How proxy data can be used: The analyst proxies the historical data for travel to the city one to three months after a similar campaign was run six months earlier.

Open (public) datasets
If you are part of a large organization, you might have access to lots of sources of data. But if you are looking for something specific or a little outside your line of business, you can also make use of open or public datasets. (You can refer to this Towards Data Science article for a brief explanation of the difference between open and public data.)

Here's an example. A nasal version of a vaccine was recently made available. A clinic wants to know what to expect for contraindications, but it has just started collecting first-party data from its patients. A contraindication is a condition that may cause a patient not to take a vaccine due to the harm it would cause them if taken. To estimate the number of possible contraindications, a data analyst proxies an open dataset from a trial of the injection version of the vaccine. The analyst selects a subset of the data with patient profiles most closely matching the makeup of the patients at the clinic.

There are plenty of ways to share and collaborate on data within a community. Kaggle (kaggle.com), which we previously introduced, has datasets in a variety of formats including the most basic type, comma-separated values (CSV) files.

CSV, JSON, SQLite, and BigQuery datasets
CSV: Check out this Credit card customers dataset, which has information from 10,000 customers including age, salary, marital status, credit card limit, credit card category, etc. (CC0: Public Domain, Sakshi Goyal).
JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain, Mitchell J).
SQLite: Check out this SQLite dataset for 24 years worth of U.S. wildfire data (CC0: Public Domain, Rachael Tatman).
BigQuery: Check out this Google Analytics 360 sample dataset from the Google Merchandise Store (CC0: Public Domain, Google BigQuery).
Refer to the Kaggle documentation for datasets for more information, and search for and explore datasets on your own at kaggle.com/datasets.

As with all other kinds of datasets, be on the lookout for duplicate data and 'Null' in open datasets. Null most often means that a data field was unassigned (left empty), but sometimes Null can be interpreted as the value 0. It is important to understand how Null was used before you start analyzing a dataset with Null data.

Sample size calculator
In this reading, you will learn the basics of sample size calculators, how to use them, and how to understand the results. A sample size calculator tells you how many people you need to interview (or things you need to test) to get results that represent the target population.
Let’s review some terms you will come across when using a sample size calculator: Confidence level: The probability that your sample size accurately reflects the greater population. Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population. Population: This is the total number you hope to pull your sample from. Sample: A part of a population that is representative of the population. Estimated response rate: If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey. How to use a sample size calculator In order to use a sample size calculator, you need to have the population size, confidence level, and the acceptable margin of error already decided so you can input them into the tool. If this information is ready to go, check out these sample size calculators below: Sample size calculator by surveymonkey.com Sample size calculator by raosoft.com What to do with the results After you have plugged your information into one of these calculators, it will give you a recommended sample size. Keep in mind, the calculated sample size is the minimum number to achieve what you input for confidence level and margin of error. If you are working with a survey, you will also need to think about the estimated response rate to figure out how many surveys you will need to send out. For example, if you need a sample size of 100 individuals and your estimated response rate is 10%, you will need to send your survey to 1,000 individuals to get the 100 responses you need for your analysis. Now that you have the basics, try some calculations using the sample size calculators and refer back to this reading if you need a refresher on the definitions. What is dirty data? Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve. 
This reading summarizes:
Types of dirty data you may encounter
What may have caused the data to become dirty
How dirty data is harmful to businesses

Types of dirty data

Duplicate data
Description: Any data record that shows up more than once.
Possible causes: Manual data entry, batch data imports, or data migration.
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval.

Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information.
Possible causes: People changing roles or companies, or software and systems becoming obsolete.
Potential harm to businesses: Inaccurate insights, decision-making, and analytics.

Incomplete data
Description: Any data that is missing important fields.
Possible causes: Improper data collection or incorrect data entry.
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services.

Incorrect/inaccurate data
Description: Any data that is complete but inaccurate.
Possible causes: Human error introduced during data input, fake information, or mock data.
Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss.

Inconsistent data
Description: Any data that uses different formats to represent the same thing.
Possible causes: Data stored incorrectly or errors introduced during data transfer.
Potential harm to businesses: Contradictory data points leading to confusion or inability to classify or segment customers.

Business impact of dirty data
For further reading on the business impact of dirty data, enter the term "dirty data" into your preferred browser's search bar to bring up numerous articles on the topic. Here are a few impacts cited for certain industries from a previous search:
Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
Marketing and sales: 8 out of 10 companies have said that dirty data hinders sales campaigns (source).
Healthcare: Duplicate records can be 10% and even up to 20% of a hospital's electronic health records (source).

Common data-cleaning pitfalls
In this reading, you will learn the importance of data cleaning and how to identify common mistakes. Some of the errors you might come across while cleaning your data could include:

Common mistakes to avoid
Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named "John" whose name has been input incorrectly as "Jon" in some places. The spreadsheet's spellcheck probably won't flag this, so if you don't double-check for spelling errors and catch this, your analysis will have mistakes in it.
Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven't been formatted correctly.
If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn’t work. Not checking for misfielded values: A misfielded value happens when the values are entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren’t careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points. Making sure your data has been entered correctly is key to accurate, complete analysis. Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions were missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency. Only looking at a subset of the data: It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention. Losing track of business objectives: When you are cleaning data, you might make new and interesting discoveries about your dataset-- but you don’t want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn’t related to the question you are trying to answer right now. Being curious is great! But try not to let it distract you from the task at hand. Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone’s progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run. Not analyzing the system prior to data cleaning: If we want to clean our data and avoid future errors, we need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, you figure out where the errors come from. Maybe it is from a data entry error, not setting up a spell check, lack of formats, or from duplicates. 
Then, once you understand where bad data comes from, you can control it and keep your data clean.
Not backing up your data prior to data cleaning: It is always good to be proactive and create your data backup before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work-- and most importantly, a headache.
Not accounting for data cleaning in your deadlines/process: All good things take time, and that includes data cleaning. It is important to keep that in mind when going through your process and looking at your deadlines. When you set aside time for data cleaning, it helps you get a more accurate estimate for ETAs for stakeholders, and can help you know when to request an adjusted ETA.

Activity overview
You've learned about cleaning data and its importance in meeting good data science standards. In this activity, you'll do some data cleaning with spreadsheets, then transpose the data. By the time you complete this activity, you will be able to perform some basic cleaning methods in spreadsheets. This will enable you to clean and transpose data, which is important for making data more specific and accurate in your career as a data analyst.

What you will need
To get started, first access the data spreadsheet. To use the spreadsheet for this course item, click the link below and select "Use Template."
Link to data spreadsheet: Cleaning with spreadsheets
OR
If you don't have a Google account, you can download the template directly from the attachment below: Data Spreadsheet for Cleaning with Spreadsheets (XLSX file).

Select and remove blank cells
The first technique we'll use is to select and eliminate rows containing blank cells by using filters. To eliminate rows with blank cells:
1. Highlight all cells in the spreadsheet. You can highlight Columns A-H by clicking on the header of Column A, holding Shift, and clicking on the header of Column H.
2. Click on the Data tab and pick the Create a filter option. In Microsoft Excel, this is called Filter.
3. Every column now shows a green triangle in the first row next to the column title. Click the green triangle in Column B to access a new menu.
4. On that new menu, click Filter by condition and open the dropdown menu to select Is empty. Click OK. In Excel, click the dropdown, then Filter..., then make sure only (Blanks) is checked. Click OK. You can then review a list of all the rows with blank cells in that column.
5. Select all these cells and delete the rows, except the row of column headers.
6. Return to the Filter by condition and return it to None. In Excel, click Clear Filter from 'Column'. Note: You will now notice that any row that had an empty cell in Column A will be removed (including the extra empty rows after the data).
7. Repeat this for Columns B-H. All the rows that had blank cells are now removed from the spreadsheet.

Transpose the data
The second technique you will practice will help you convert the data from the current long format (more rows than columns) to the wide format (more columns than rows). This action is called transposing. To transpose your data:
1. Highlight and copy the data that you want to transpose, including the column labels. You can do this by highlighting Columns A-H. In Excel, highlight only the relevant cells (A1-H45) instead of the headers.
2. Right-click on cell I1.
This is where you want the transposed data to start.
3. Hover over Paste Special from the right-click menu. Select the Transposed option. In Excel, select the Transpose icon under the paste options. You should now find the data transformed into the new wide format. At this point, you should remove the original long data from the spreadsheet.
4. Delete the previous long data. The easiest way to do this is to click on Column A, so the entire column is highlighted. Then, hold down the Shift key and click on Column H. You should find these columns highlighted. Right-click on the highlighted area and select Delete Columns A - H.

Get rid of extra spaces in cells with string data
Now that you have transposed the data, eliminate the extra spaces in the values of the cells.
1. Highlight the data in the spreadsheet.
2. Click on the Data tab, then hover over Data cleanup and select Trim whitespace. In Excel, you can use the TRIM function to get rid of white spaces: in any space beneath your data (such as cell A10), type =TRIM(A1), then drag the bottom right corner of the cell across the data range to return the data without the white spaces.
Now all the extra spaces in the cells have been removed.

Change text to lowercase, uppercase, or proper case
Next, you'll process string data. The easiest way to clean up string data will depend on the spreadsheet program you are using. If you are using Excel, you'll use a simple formula. If you are using Google Sheets, you can use an add-on to do this with a few clicks. Follow the steps in the relevant section below.

Microsoft Excel
If you are using Microsoft Excel, this documentation explains how to use a formula to change the case of a text string. Follow these instructions to clean the string text and then move on to the confirmation and reflection section of this activity.

Google Sheets
If you're completing this exercise using Google Sheets, you'll need to install an add-on that will give you the functionality needed to easily clean string data and change cases. Google Sheets add-on instructions:
1. Click on the Add-ons option at the top of Google Sheets.
2. Click on Get add-ons.
3. Search for ChangeCase.
4. Click on Install to install the add-on. It may ask you to log in or verify the installation permissions.
Once you have installed the add-on successfully, you can access it by clicking on the Add-ons menu again. Now, you can change the case of text data that shows up. To change the text in Column C to all uppercase:
1. Click on Column C. Be sure to deselect the column header, unless you want to change the case of that as well (which you don't).
2. Click on the Add-ons tab and select ChangeCase. Select the option All uppercase. Notice the other options that you could have chosen if needed.

Delete all formatting
If you want to clear the formatting for any or all cells, you can find the command in the Format tab. To clear formatting:
1. Select the data for which you want to delete the formatting. In this case, highlight all the data in the spreadsheet by clicking and dragging over Rows 1-8.
2. Click the Format tab and select the Clear formatting option. In Excel, go to the Home tab, then hover over Clear and select Clear Formats.
You will notice that all the cells have had their formatting removed.

Workflow automation
In this reading, you will learn about workflow automation and how it can help you work faster and more efficiently.
Basically, workflow automation is the process of automating parts of your work. That could mean creating an event trigger that sends a notification when a system is updated. Or it could mean automating parts of the data cleaning process. As you can probably imagine, automating different parts of your work can save you tons of time, increase productivity, and give you more bandwidth to focus on other important aspects of the job. What can be automated? Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some parts of the job that can’t be automated. Let's take a look at some things we can automate and some things that we can’t. Can it be automated? Why? No Communication is key to understanding the needs of yo you complete the tasks you are working on. There is no person communications. Presenting your findings No Presenting your data is a big part of your job as a data a accessible and understandable to stakeholders and crea be automated for the same reasons that communication Preparing and cleaning data Partially Some tasks in data preparation and cleaning can be aut processes, like using a programming script to automati Data exploration Partially Sometimes the best way to understand data is to see it. tools available that can help automate the process of vi speed up the process of visualizing and understanding itself still needs to be done by a data analyst. Modeling the data Yes Data modeling is a difficult process that involves lots of there are tools that can completely automate the differe Task Communicating with your team and stakeholders More about automating data cleaning One of the most important ways you can streamline your data cleaning is to clean data where it lives. This will benefit your whole team, and it also means you don’t have to repeat the process over and over. For example, you could create a programming script that counted the number of words in each spreadsheet file stored in a specific folder. Using tools that can be used where your data is stored means that you don’t have to repeat your cleaning steps, saving you and your team time and energy. More resources There are a lot of tools out there that can help automate your processes, and those tools are improving all the time. Here are a few articles or blogs you can check out if you want to learn more about workflow automation and the different tools out there for you to use: Towards Data Science’s Automating Scientific Data Analysis MIT News’ Automating Big-Data Analysis TechnologyAdvice’s 10 of the Best Options for Workflow Automation Software As a data analyst, automation can save you a lot of time and energy, and free you up to focus more on other parts of your project. The more analysis you do, the more ways you will find to make your processes simpler and more streamlined. Learning Log: Develop your approach to cleaning data Overview By this point, you have started working with real data. And you may have noticed that data is often messy-- you can expect raw, primary data to be imperfect. In this learning log, you will develop an approach to cleaning data by creating a cleaning checklist, considering your preferred methods for data cleaning, and deciding on a data cleaning motto. By the time you complete this entry, you will have a stronger understanding of how to approach the data cleaning process methodically. This will help you save time cleaning data in the future and ensure that your data is clean and usable. 
Fill out the Data Cleaning Approach Table The problem with data cleaning is that it usually requires a lot of time, energy, and attention from a junior data analyst. One of the best ways to lessen the negative impacts of data cleaning is to have a plan of action or a specific approach to cleaning the data. In order to help you develop your own approach, you’ll use the instructions from this learning log to fill out a Data Cleaning Approach Table in your learning log template. The table will appear like this in the template: Once you have completed your Data Cleaning Approach Table, you will spend some time reflecting on the data cleaning process and your own approach. Access your learning log To use the learning log for this course item, click the link below and select “Use Template.” Link to learning log template: Develop your approach to data cleaning OR If you don’t have a Google account, you can download the template directly from the attachment below. Learning Log Template_ Develop your approach to cleaning dataDOCX File Download file Step 1: Create your checklist You can start developing your personal approach to cleaning data by creating a standard checklist to use before your data cleaning process. Think of this checklist as your default "what to search for" list. With a good checklist, you can efficiently and, hopefully, swiftly identify all the problem spots without getting sidetracked. You can also use the checklist to identify the scale and scope of the dataset itself. Some things you might include in your checklist: Size of the data set Number of categories or labels Missing data Unformatted data The different data types You can use your own experiences so far to help you decide what else you want to include in your checklist! Step 2: List your preferred cleaning methods After you have compiled your personal checklist, you can create a list of activities you like to perform when cleaning data. This list is a collection of procedures that you will implement when you encounter specific issues present in the data related to your checklist or every time you clean a new dataset. For example, suppose that you have a dataset with missing data, how would you handle it? Moreover, if the data set is very large, what would you do to check for missing data? Outlining some of your preferred methods for cleaning data can help save you time and energy. Step 3: Choose a data cleaning motto Now that you have a personal checklist and your preferred data cleaning methods, you can create a data cleaning motto to help guide and explain your process. The motto is a short one or two sentence summary of your philosophy towards cleaning data. For example, here are a few data cleaning mottos from other data analysts: 1. "Not all data is the same, so don't treat it all the same." 2. "Be prepared for things to not go as planned. Have a backup plan.” 3. "Avoid applying complicated solutions to simple problems." The data you encounter as an analyst won’t always conform to your checklist or activities list regardless of how comprehensive they are. Data cleaning can be an involved and complicated process, but surprisingly most data has similar problems. A solid personal motto and explanation can make the more common data cleaning tasks easy to understand and complete. Reflection Now that you have completed your Data Cleaning Approach Table, take a moment to reflect on the decisions you made about your data cleaning approach. 
Write 1-2 sentences (20-40 words) answering each of the following questions: What items did you add to your data cleaning checklist? Why did you decide these were important to check for? How have your own experiences with data cleaning affected your preferred cleaning methods? Can you think of an example where you needed to perform one of these cleaning tasks? How did you decide on your data cleaning motto? Using SQL as a junior data analyst In this reading, you will learn more about how to decide when to use SQL, or Structured Query Language. As a data analyst, you will be tasked with handling a lot of data, and SQL is one of the tools that can help make your work a lot easier. SQL is the primary way data analysts extract data from databases. As a data analyst, you will work with databases all the time, which is why SQL is such a key skill. Let's follow along as a junior data analyst uses SQL to solve a business task. The business task and context The junior data analyst in this example works for a social media company. A new business model was implemented on February 15, 2020, and the company wants to understand how its user growth compares to the previous year. Specifically, the data analyst was asked to find out how many users have joined since February 15, 2020. Spreadsheet functions and formulas or SQL queries? Before they can address this question, this data analyst needs to choose what tool to use. First, they have to think about where the data lives. If it is stored in a database, then SQL is the best tool for the job. But if it is stored in a spreadsheet, then they will have to perform their analysis in that spreadsheet. In that scenario, they could create a pivot table of the data and then apply specific formulas and filters to their data until they were given the number of users that joined after February 15th. It isn't a complicated process, but it would involve a lot of steps. In this case, the data is stored in a database, so they will have to work with SQL. And this data analyst knows they could get the same results with a single SQL query:

SELECT COUNT(DISTINCT user_id) AS count_of_unique_users
FROM table
WHERE join_date >= '2020-02-15'

Spreadsheets and SQL both have their advantages and disadvantages:

Features of spreadsheets | Features of SQL databases
Smaller data sets | Larger datasets
Enter data manually | Access tables across a database
Create graphs and visualizations in the same program | Prepare data for further analysis in another software
Built-in spell check and other useful functions | Fast and powerful functionality
Best when working solo on a project | Great for collaborative work and tracking queries run by all users

When it comes down to it, where the data lives will decide which tool you use. If you are working with data that is already in a spreadsheet, that is most likely where you will perform your analysis. And if you are working with data stored in a database, SQL will be the best tool for you to use for your analysis. You will learn more about SQL coming up, so that you will be ready to tackle any business problem with the best tool possible. Optional: Upload the store transactions dataset to BigQuery In the next video, the instructor uses a specific dataset. The instructions in this reading are provided for you to upload the same dataset in your BigQuery console so you can follow along.
You must have a BigQuery account to follow along. If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account. Prepare for the next video First, download the CSV file from the attachment below. Lauren's Furniture Store Transaction TableCSV File Download file Next, complete the steps below in your BigQuery console to upload the Store Transaction dataset. Note: These steps will be different from what you performed before. In previous instances, you selected the Auto detect check box to allow BigQuery to auto-detect the schema. This time, you will choose to create the schema by editing it as text. This method can be used when BigQuery doesn't automatically set the desired type for a particular field. In this case, you will specify STRING instead of FLOAT as the type for the purchase_price field. Step 1: Open your BigQuery console and click on the project you want to upload the data to. If you already created a customer_data dataset for your project, jump to step 5; otherwise, continue with step 2. Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your project name and select Create dataset. Step 3: Enter customer_data for the Dataset ID. Step 4: Click CREATE DATASET (blue button) to add the dataset to your project. Step 5: In the Explorer, click to expand your project, and then click the customer_data dataset. Step 6: Click the Actions icon (three vertical dots) next to customer_data and select Open. Step 7: Click the blue + icon at the top right to open the Create table window. Step 8: Under Source, for the Create table from selection, choose where the data will be coming from. Select Upload. Click Browse to select the Store Transaction Table CSV file you downloaded. Choose CSV from the file format drop-down. Step 9: For Table name, enter customer_purchase if you plan to follow along with the video. Step 10: For Schema, click the toggle switch for Edit as text. This opens up a box for the text. Step 11: Copy and paste the following text into the box. Be sure to include the opening and closing brackets. They are required. [ { "description": "date", "mode": "NULLABLE", "name": "date", "type": "DATETIME" }, { "description": "transaction id", "mode": "NULLABLE", "name": "transaction_id", "type": "INTEGER" }, { "description": "customer id", "mode": "NULLABLE", "name": "customer_id", "type": "INTEGER" }, { "description": "product name", "mode": "NULLABLE", "name": "product", "type": "STRING" }, { "description": "product_code", "mode": "NULLABLE", "name": "product_code", "type": "STRING" }, { "description": "product color", "mode": "NULLABLE", "name": "product_color", "type": "STRING" }, { "description": "product price", "mode": "NULLABLE", "name": "product_price", "type": "FLOAT" }, { "description": "quantity purchased", "mode": "NULLABLE", "name": "purchase_size", "type": "INTEGER" }, { "description": "purchase price", "mode": "NULLABLE", "name": "purchase_price", "type": "STRING" }, { "description": "revenue", "mode": "NULLABLE", "name": "revenue", "type": "FLOAT" } ] Step 12: Scroll down and expand the Advanced options section. Step 13: For the Header rows to skip field, enter 1. Step 14: Click Create table (blue button). You will now see the customer_purchase table under your customer_data dataset in your project. Step 15: Click the customer_purchase table and in the Schema tab, confirm that the schema matches the schema shown below. 
Step 16: Click the Preview tab and confirm that your data matches the data shown below. Congratulations, you are now ready to follow along with the video! How to get to the BigQuery console In your browser, go to console.cloud.google.com/bigquery. Note: Going to console.cloud.google.com in your browser takes you to the main dashboard for the Google Cloud Platform. To navigate to BigQuery from the dashboard, do the following: Click the Navigation menu icon (Hamburger icon) in the banner. Scroll down to the BIG DATA section. Click BigQuery and select SQL workspace. Watch the How to use BigQuery video for an introduction to each part of the BigQuery SQL workspace. (Optional) Explore a BigQuery public dataset You will be exploring a public dataset in an upcoming activity, so you can perform these steps later if you prefer. Refer to these step-by-step instructions. (Optional) Upload a CSV file to BigQuery These steps are provided so you can work with a dataset on your own at this time. You will upload CSV files to BigQuery later in the program. Refer to these step-by-step instructions. Getting started with other databases (if not using BigQuery) It is easier to follow along with the course activities if you use BigQuery, but if you are connecting to and practicing SQL queries on other database platforms instead of BigQuery, here are similar getting started resources: Getting started with MySQL: This is a guide to setting up and using MySQL. Getting started with Microsoft SQL Server: This is a tutorial to get started using SQL Server. Getting started with PostgreSQL: This is a tutorial to get started using PostgreSQL. Getting started with SQLite: This is a quick start guide for using SQLite. It's so great to have you back. Now that we know some basic SQL queries and spent some time working in a database, let's apply that knowledge to something else we've been talking about: preparing and cleaning data. You already know that cleaning and completing your data before you analyze it is an important step. So in this video, I'll show you some ways SQL can help you do just that, including how to remove duplicates, as well as four functions to help you clean string variables. Earlier, we covered how to remove duplicates in spreadsheets using the Remove duplicates tool. In SQL, we can do the same thing by including DISTINCT in our SELECT statement. For example, let's say the company we work for has a special promotion for customers in Ohio. We want to get the customer IDs of customers who live in Ohio. But some customer information has been entered multiple times. We can get these customer IDs by writing SELECT customer_id FROM customer_data.customer_address. This query will give us duplicates if they exist in the table. If customer ID 9080 shows up three times in our table, our results will have three of that customer ID. But we don't want that. We want a list of all unique customer IDs. To do that, we add DISTINCT to our SELECT statement by writing, SELECT DISTINCT customer_id FROM customer_data.customer_address. In those cases, you'll need to clean them before you can analyze them. So here are some functions you can use in SQL to handle string variables. You might recognize some of these functions from when we talked about spreadsheets. Now it's time to see them work in a new way. Pull up the data set we shared right before this video. And you can follow along step-by-step with me during the rest of this video. 
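For reference, here is how the two queries just described might look when written out. This is only a sketch, using the customer_data.customer_address table named in the transcript:

-- Returns every customer_id, including duplicates
SELECT customer_id
FROM customer_data.customer_address

-- Adding DISTINCT returns each customer_id only once
SELECT DISTINCT customer_id
FROM customer_data.customer_address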
Play video starting at :2:42 and follow transcript2:42 The first function I want to show you is LENGTH, which we've encountered before. If we already know the length our string variables are supposed to be, we can use LENGTH to double-check that our string variables are consistent. For some databases, this query is written as LEN, but it does the same thing. Let's say we're working with the customer_address table from our earlier example. We can make sure that all country codes have the same length by using LENGTH on each of these strings. So to write our SQL query, let's first start with SELECT and FROM. We know our data comes from the customer_address table within the customer_data data set. So we add customer_data.customer_address after the FROM clause. Then under SELECT, we'll write LENGTH, and then the column we want to check, country. To remind ourselves what this is, we can label this column in our results as letters_in_country. So we add AS letters_in_country, after LENGTH(country). The result we get is a list of the number of letters in each country listed for each of our customers. It seems like almost all of them are 2s, which means the country field contains only two letters. But we notice one that has 3. That's not good. We want our data to be consistent. Play video starting at :4:32 and follow transcript4:32 So let's check out which countries were incorrectly listed in our table. We can do that by putting the LENGTH(country) function that we created into the WHERE clause. Because we're telling SQL to filter the data to show only customers whose country contains more than two letters. So now we'll write SELECT country FROM customer_data.customer_address WHERE LENGTH(country) greater than 2. Play video starting at :5:8 and follow transcript5:08 When we run this query, we now get the two countries where the number of letters is greater than the 2 we expect to find. Play video starting at :5:17 and follow transcript5:17 The incorrectly listed countries show up as USA instead of US. If we created this table, then we could update our table so that this entry shows up as US instead of USA. But in this case, we didn't create this table, so we shouldn't update it. We still need to fix this problem so we can pull a list of all the customers in the US, including the two that have USA instead of US. The good news is that we can account for this error in our results by using the substring function in our SQL query. To write our SQL query, let's start by writing the basic structure, SELECT, FROM, WHERE. We know our data is coming from the customer_address table from the customer_data data set. So we type in customer_data.customer_address, after FROM. Next, we tell SQL what data we want it to give us. We want all the customers in the US by their IDs. So we type in customer_id after SELECT. Finally, we want SQL to filter out only American customers. So we use the substring function after the WHERE clause. We're going to use the substring function to pull the first two letters of each country so that all of them are consistent and only contain two letters. To use the substring function, we first need to tell SQL the column where we found this error, country. Then we specify which letter to start with. We want SQL to pull the first two letters, so we're starting with the first letter, so we type in 1. Then we need to tell SQL how many letters, including this first letter, to pull. Since we want the first two letters, we need SQL to pull two total letters, so we type in 2. 
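Written out, the two LENGTH checks just described might look like this. This is a sketch only, and the substring query the transcript is building continues below:

-- How many letters does each country value contain?
SELECT LENGTH(country) AS letters_in_country
FROM customer_data.customer_address

-- Show only the country values with more than two letters
SELECT country
FROM customer_data.customer_address
WHERE LENGTH(country) > 2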
This will give us the first two letters of each country. We want US only, so we'll set this function to equal 'US'. When we run this query, we get a list of all customer IDs of customers whose country is the US, including the customers that had USA instead of US. Going through our results, it seems like we have a couple of duplicates where the customer ID is shown multiple times. Remember how we get rid of duplicates? We add DISTINCT before customer_id. Play video starting at :8:5 and follow transcript8:05 So now when we run this query, we have our final list of customer IDs of the customers who live in the US. Finally, let's check out the TRIM function, which you've come across before. This is really useful if you find entries with extra spaces and need to eliminate those extra spaces for consistency. Play video starting at :8:26 and follow transcript8:26 For example, let's check out the state column in our customer_address table. Just like we did for the country column, we want to make sure the state column has a consistent number of letters. So let's use the LENGTH function again to learn if we have any state that has more than the two letters we would expect to find in our data table. Play video starting at :8:49 and follow transcript8:49 We start writing our SQL query by typing the basic SQL structure of SELECT, FROM, WHERE. We're working with the customer_address table in the customer_data data set. So we type in customer_data.customer_address after FROM. Next, we tell SQL what we want it to pull. We want it to give us any state that has more than two letters, so we type in state, after SELECT. Finally, we want SQL to filter for states that have more than two letters. This condition is written in the WHERE clause. So we type in LENGTH(state), and that it must be greater than 2 because we want the states that have more than two letters. Play video starting at :9:49 and follow transcript9:49 We want to figure out what the incorrectly listed states look like, if we have any. When we run this query, we get one result. We have one state that has more than two letters. But hold on, how can this state that seems like it has two letters, O and H for Ohio, have more than two letters? We know that there are more than two characters because we used the LENGTH(state) > 2 statement in the WHERE clause when filtering out results. So that means the extra character that SQL is counting must be a space. There must be a space after the H. This is where we would use the TRIM function. The TRIM function removes any spaces. So let's write a SQL query that accounts for this error. Let's say we want a list of all customer IDs of the customers who live in "OH" for Ohio. We start with the basic SQL structure: SELECT, FROM, WHERE. We know the data comes from the customer_address table in the customer_data data set, so we type in customer_data.customer_address after FROM. Next, we tell SQL what data we want. We want SQL to give us the customer IDs of customers who live in Ohio, so we type in customer_id after SELECT. Since we know we have some duplicate customer entries, we'll go ahead and type in DISTINCT before customer_id to remove any duplicate customer IDs from appearing in our results. Finally, we want SQL to give us the customer IDs of the customers who live in Ohio. We're asking SQL to filter the data, so this belongs in the WHERE clause. Here's where we'll use the TRIM function.
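As a quick recap before the TRIM walkthrough continues, the finished query for US-based customers might look like this. It is a sketch that assumes the substring function is written as SUBSTR, the name BigQuery uses:

-- Keep rows whose first two country letters are 'US'; DISTINCT drops duplicate IDs
SELECT DISTINCT customer_id
FROM customer_data.customer_address
WHERE SUBSTR(country, 1, 2) = 'US'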
To use the TRIM function, we tell SQL the column we want to remove spaces from, which is state in our case. And we want only Ohio customers, so we type in = 'OH'. That's it. We have all customer IDs of the customers who live in Ohio, including that customer with the extra space after the H. Play video starting at :12:14 and follow transcript12:14 Making sure that your string variables are complete and consistent will save you a lot of time later by avoiding errors or miscalculations. That's why we clean data in the first place. Hopefully functions like length, substring, and trim will give you the tools you need to start working with string variables in your own data sets. Next up, we'll check out some other ways you can work with strings and more advanced cleaning functions. Then you'll be ready to start working in SQL on your own. See you soon. In this video, we'll discuss how to begin the process of verifying your data-cleaning efforts. Play video starting at ::7 and follow transcript0:07 Verification is a critical part of any analysis project. Without it you have no way of knowing that your insights can be relied on for data-driven decision-making. Think of verification as a stamp of approval. Play video starting at ::20 and follow transcript0:20 To refresh your memory, verification is a process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable. It also involves manually cleaning data to compare your expectations with what's actually present. The first step in the verification process is going back to your original unclean data set and comparing it to what you have now. Review the dirty data and try to identify any common problems. For example, maybe you had a lot of nulls. In that case, you check your clean data to ensure no nulls are present. To do that, you could search through the data manually or use tools like conditional formatting or filters. Play video starting at :1:6 and follow transcript1:06 Or maybe there was a common misspelling like someone keying in the name of a product incorrectly over and over again. In that case, you'd run a FIND in your clean data to make sure no instances of the misspelled word occur. Play video starting at :1:21 and follow transcript1:21 Another key part of verification involves taking a big-picture view of your project. This is an opportunity to confirm you're actually focusing on the business problem that you need to solve and the overall project goals and to make sure that your data is actually capable of solving that problem and achieving those goals. Play video starting at :1:41 and follow transcript1:41 It's important to take the time to reset and focus on the big picture because projects can sometimes evolve or transform over time without us even realizing it. Maybe an e-commerce company decides to survey 1000 customers to get information that would be used to improve a product. But as responses begin coming in, the analysts notice a lot of comments about how unhappy customers are with the e-commerce website platform altogether. So the analysts start to focus on that. While the customer buying experience is of course important for any e-commerce business, it wasn't the original objective of the project. The analysts in this case need to take a moment to pause, refocus, and get back to solving the original problem. Play video starting at :2:32 and follow transcript2:32 Taking a big picture view of your project involves doing three things. 
First, consider the business problem you're trying to solve with the data. Play video starting at :2:41 and follow transcript2:41 If you've lost sight of the problem, you have no way of knowing what data belongs in your analysis. Taking a problem-first approach to analytics is essential at all stages of any project. You need to be certain that your data will actually make it possible to solve your business problem. Second, you need to consider the goal of the project. It's not enough just to know that your company wants to analyze customer feedback about a product. What you really need to know is that the goal of getting this feedback is to make improvements to that product. On top of that, you also need to know whether the data you've collected and cleaned will actually help your company achieve that goal. And third, you need to consider whether your data is capable of solving the problem and meeting the project objectives. That means thinking about where the data came from and testing your data collection and cleaning processes. Play video starting at :3:35 and follow transcript3:35 Sometimes data analysts can be too familiar with their own data, which makes it easier to miss something or make assumptions. Play video starting at :3:43 and follow transcript3:43 Asking a teammate to review your data from a fresh perspective and getting feedback from others is very valuable in this stage. Play video starting at :3:51 and follow transcript3:51 This is also the time to notice if anything sticks out to you as suspicious or potentially problematic in your data. Again, step back, take a big picture view, and ask yourself, do the numbers make sense? Play video starting at :4:7 and follow transcript4:07 Let's go back to our e-commerce company example. Imagine an analyst is reviewing the cleaned up data from the customer satisfaction survey. The survey was originally sent to 1,000 customers, but what if the analyst discovers that there are more than a thousand responses in the data? This could mean that one customer figured out a way to take the survey more than once. Or it could also mean that something went wrong in the data cleaning process, and a field was duplicated. Either way, this is a signal that it's time to go back to the data-cleaning process and correct the problem. Play video starting at :4:42 and follow transcript4:42 Verifying your data ensures that the insights you gain from analysis can be trusted. It's an essential part of data-cleaning that helps companies avoid big mistakes. This is another place where data analysts can save the day. Play video starting at :4:55 and follow transcript4:55 Coming up, we'll go through the next steps in the data-cleaning process. See you there.
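As a side note before the next video: if survey data like the example above lived in a database, a couple of quick queries could back up those manual verification checks. This is only a sketch; the survey_responses table and its columns are hypothetical and not part of the course datasets:

-- Hypothetical table: survey_responses(respondent_id, rating)
-- Confirm that no null ratings remain after cleaning
SELECT COUNT(*) AS null_ratings
FROM survey_responses
WHERE rating IS NULL

-- Confirm the response count does not exceed the 1,000 surveys that were sent
SELECT COUNT(*) AS total_responses
FROM survey_responses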
Hey there. In this video, we'll continue building on the verification process. As a quick reminder, the goal is to ensure that our data-cleaning work was done properly and the results can be counted on. You want your data to be verified so you know it's 100 percent ready to go. It's like car companies running tons of tests to make sure a car is safe before it hits the road. You learned that the first step in verification is returning to your original, unclean dataset and comparing it to what you have now. This is an opportunity to search for common problems. After that, you clean up the problems manually. For example, by eliminating extra spaces or removing an unwanted quotation mark. But there are also some great tools for fixing common errors automatically, such as TRIM and remove duplicates. Earlier, you learned that TRIM is a function that removes leading, trailing, and repeated spaces in data. Remove duplicates is a tool that automatically searches for and eliminates duplicate entries from a spreadsheet. Now, sometimes you'll have an error that shows up repeatedly, and it can't be resolved with a quick manual edit or a tool that fixes the problem automatically. In these cases, it's helpful to create a pivot table. A pivot table is a data summarization tool that is used in data processing. Pivot tables sort, reorganize, group, count, total or average data stored in a database. We'll practice that now using the spreadsheet from a party supply store. Let's say this company was interested in learning which of its four suppliers is most cost-effective. An analyst pulled this data on the products the business sells, how many were purchased, which supplier provides them, the cost of the products, and the ultimate revenue. The data has been cleaned. But during verification, we noticed that one of the suppliers' names was keyed in incorrectly. Play video starting at :2:16 and follow transcript2:16 We could just correct the word as "plus," but this might not solve the problem because we don't know if this was a one-time occurrence or if the problem's repeated throughout the spreadsheet. There are two ways to answer that question. The first is using Find and replace. Find and replace is a tool that looks for a specified search term in a spreadsheet and allows you to replace it with something else. We'll choose Edit. Then Find and replace. We're trying to find P-L-O-S, the misspelling of "plus" in the supplier's name.
In some cases you might not want to replace the data. You just want to find something. No problem. Just type the search term, leave the rest of the options at their defaults, and click "Done." But right now we do want to replace it with P-L-U-S. We'll type that in here. Then click "Replace all" and "Done." Play video starting at :3:20 and follow transcript3:20 There we go. Our misspelling has been corrected. That was of course the goal. But for now let's undo our Find and replace so we can practice another way to determine if errors are repeated throughout a dataset, like with the pivot table. We'll begin by selecting the data we want to use. Choose column C. Select "Data." Then "Pivot Table." Choose "New Sheet" and "Create." Play video starting at :3:59 and follow transcript3:59 We know this company has four suppliers. If we count the suppliers and the number doesn't equal four, we know there's a problem. First, add a row for suppliers. Play video starting at :4:13 and follow transcript4:13 Next, we'll add a value for our suppliers and summarize by COUNTA. COUNTA counts the total number of values within a specified range. Here we're counting the number of times a supplier's name appears in column C. Note that there's also a function called COUNT, which only counts the numerical values within a specified range. If we used it here, the result would be zero, which is not what we have in mind. But in other situations, COUNT would give us exactly the information we want. As you continue learning more about formulas and functions, you'll discover more interesting options. If you want to keep learning, search online for spreadsheet formulas and functions. There's a lot of great information out there. Our pivot table has counted the number of misspellings, and it clearly shows that the error occurs just once. Otherwise our four suppliers are accurately accounted for in our data. Now we can correct the spelling, and we've verified that the rest of the supplier data is clean. This is also useful practice when querying a database. If you're working in SQL, you can address misspellings using a CASE statement. The CASE statement goes through one or more conditions and returns a value as soon as a condition is met. Let's discuss how this works in real life using our customer_name table. Check out how our customer, Tony Magnolia, shows up as Tony and Tnoy. Tony's name was misspelled. Let's say we want a list of our customer IDs and the customer's first names so we can write personalized notes thanking each customer for their purchase. We don't want Tony's note to be addressed incorrectly to "Tnoy." Here's where we can use the CASE statement. We'll start our query with the basic SQL structure: SELECT, FROM, and WHERE. We know that data comes from the customer_name table in the customer_data dataset, so we can add customer underscore data dot customer underscore name after FROM. Next, we tell SQL what data to pull in the SELECT clause. We want customer_id and first_name. We can go ahead and add customer underscore id after SELECT. But for our customer's first names, we know that Tony was misspelled, so we'll correct that using CASE. We'll add CASE, then WHEN, and type first underscore name equals "Tnoy." Next we'll use the THEN command and type "Tony," followed by the ELSE command. Here we will type first underscore name, followed by END AS, and then we'll type cleaned underscore name. Finally, we're not filtering our data, so we can eliminate the WHERE clause.
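Written out, the query just described might look like this. It is a sketch based on the customer_data.customer_name table from the transcript:

SELECT
  customer_id,
  CASE
    WHEN first_name = 'Tnoy' THEN 'Tony'  -- correct the known misspelling
    ELSE first_name                       -- otherwise keep the name as entered
  END AS cleaned_name
FROM customer_data.customer_name

Each additional misspelling you want to handle gets its own WHEN ... THEN line before the ELSE.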
As I mentioned, a CASE statement can cover multiple cases. If we wanted to search for a few more misspelled names, our statement would look similar to the original, with some additional names like this. Play video starting at :8:6 and follow transcript8:06 There you go. Now that you've learned how you can use spreadsheets and SQL to fix errors automatically, we'll explore how to keep track of our changes next. Data-cleaning verification: A checklist This reading will give you a checklist of common problems you can refer to when doing your data cleaning verification, no matter what tool you are using. When it comes to data cleaning verification, there is no one-size-fits-all approach or a single checklist that can be universally applied to all projects. Each project has its own organization and data requirements that lead to a unique list of things to run through for verification. Keep in mind, as you receive more data or a better understanding of the project goal(s), you might want to revisit some or all of these steps. Correct the most common problems Make sure you identified the most common problems and corrected them, including:
Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
Null data: Did you search for NULLs using conditional formatting and filters?
Misspelled words: Did you locate all misspellings?
Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
Misleading variable labels (columns): Did you name your columns meaningfully?
Truncated data: Did you check for truncated or missing data that needs correction?
Business logic: Did you check that the data makes sense given your knowledge of the business?
Review the goal of your project Once you have finished these data cleaning tasks, it is a good idea to review the goal of your project and confirm that your data is still aligned with that goal. This is a continuous process that you will do throughout your project, but here are three steps you can keep in mind while thinking about this:
Confirm the business problem
Confirm the goal of the project
Verify that data can solve the problem and is aligned to the goal
Embrace changelogs What do engineers, writers, and data analysts have in common? Change. Engineers use engineering change orders (ECOs) to keep track of new product design details and proposed changes to existing products. Writers use document revision histories to keep track of changes to document flow and edits. And data analysts use changelogs to keep track of data transformation and cleaning. Here are some examples of these: Automated version control takes you most of the way Most software applications have a kind of history tracking built in. For example, in Google Sheets, you can check the version history of an entire sheet or an individual cell and go back to an earlier version. In Microsoft Excel, you can use a feature called Track Changes. And in BigQuery, you can view the history to check what has changed.
Here's how it works:
Google Sheets: 1. Right-click the cell and select Show edit history. 2. Click the left-arrow (<) or right-arrow (>) to move backward and forward through the history as needed.
Microsoft Excel: 1. If Track Changes has been enabled for the spreadsheet, click Review. 2. Under Track Changes, click the Accept/Reject Changes option to accept or reject any change made.
BigQuery: Bring up a previous version (without reverting to it) and figure out what changed by comparing it to the current version.
Changelogs take you down the last mile A changelog can build on your automated version history by giving you an even more detailed record of your work. This is where data analysts record all the changes they make to the data. Here is another way of looking at it. Version histories record what was done in a data change for a project, but don't tell us why. Changelogs are super useful for helping us understand the reasons changes have been made. Changelogs have no set format and you can even make your entries in a blank document. But if you are using a shared changelog, it is best to agree with other data analysts on the format of all your log entries. Typically, a changelog records this type of information:
Data, file, formula, query, or any other component that changed
Description of what changed
Date of the change
Person who made the change
Person who approved the change
Version number
Reason for the change
Let's say you made a change to a formula in a spreadsheet because you observed it in another report and you wanted your data to match and be consistent. If you found out later that the report was actually using the wrong formula, an automated version history would help you undo the change. But if you also recorded the reason for the change in a changelog, you could go back to the creators of the report and let them know about the incorrect formula. If the change happened a while ago, you might not remember who to follow up with. Fortunately, your changelog would have that information ready for you! By following up, you would ensure data integrity outside your project. You would also be showing personal integrity as someone who can be trusted with data. That is the power of a changelog! Finally, a changelog is important for when lots of changes to a spreadsheet or query have been made. Imagine an analyst made four changes and the change they want to revert to is change #2. Instead of clicking the undo feature three times to undo change #2 (and losing changes #3 and #4), the analyst can undo just change #2 and keep all the other changes. Now, our example was for just 4 changes, but try to think about how important that changelog would be if there were hundreds of changes to keep track of. What also happens IRL (in real life) A junior analyst probably only needs to know the above with one exception. If an analyst is making changes to an existing SQL query that is shared across the company, the company most likely uses what is called a version control system. An example might be a query that pulls daily revenue to build a dashboard for senior management. Here is how a version control system affects a change to a query: 1. A company has official versions of important queries in their version control system. 2. An analyst makes sure the most up-to-date version of the query is the one they will change. This is called syncing. 3. The analyst makes a change to the query. 4. The analyst might ask someone to review this change. This is called a code review and can be done informally or formally.
An informal review could be as simple as asking a senior analyst to take a look at the change. 5. After a reviewer approves the change, the analyst submits the updated version of the query to a repository in the company's version control system. This is called a code commit. A best practice is to document exactly what the change was and why it was made in a comments area. Going back to our example of a query that pulls daily revenue, a comment might be: Updated revenue to include revenue coming from the new product, Calypso. 6. After the change is submitted, everyone else in the company will be able to access and use this new query when they sync to the most up-to-date queries stored in the version control system. 7. If the query has a problem or business needs change, the analyst can undo the change to the query using the version control system. The analyst can look at a chronological list of all changes made to the query and who made each change. Then, after finding their own change, the analyst can revert to the previous version. 8. The query is back to what it was before the analyst made the change. And everyone at the company sees this reverted, original query, too. Advanced functions for speedy data cleaning In this reading, you will learn about some advanced functions that can help you speed up the data cleaning process in spreadsheets. Below is a table summarizing three functions and what they do: IMPORTRANGE: Syntax: =IMPORTRANGE(spreadsheet_url, range_string) Menu Options: Paste Link (copy the data first) Primary Use: Imports (pastes) data from one sheet to another and keeps it automatically updated QUERY: Syntax: =QUERY(Sheet and Range, "Select *") Menu Options: Data > From Other Sources > From Microsoft Query Primary Use: Enables pseudo SQL (SQL-like) statements or a wizard to import the data. FILTER: Syntax: =FILTER(range, condition1, [condition2, ...]) Menu Options: Filter(conditions per column) Primary Use: Displays only the data that meets the specified conditions. Keeping data clean and in sync with a source The IMPORTRANGE function in Google Sheets and the Paste Link feature (a Paste Special option in Microsoft Excel) both allow you to insert data from one sheet to another. Using these on a large amount of data is more efficient than manual copying and pasting. They also reduce the chance of errors being introduced by copying and pasting the wrong data. They are also helpful for data cleaning because you can “cherry pick” the data you want to analyze and leave behind the data that isn’t relevant to your project. Basically, it is like canceling noise from your data so you can focus on what is most important to solve your problem. This functionality is also useful for day-to-day data monitoring; with it, you can build a tracking spreadsheet to share the relevant data with others. The data is synced with the data source so when the data is updated in the source file, the tracked data is also refreshed. If you are using IMPORTRANGE in Google sheets, data can be pulled from another spreadsheet, but you must allow access to the spreadsheet the first time it pulls the data. The URL shown below is for syntax purposes only. Don't enter it in your own spreadsheet. Replace it with a URL to a spreadsheet you have created so you can control access to it by clicking the Allow access button. Refer to the Google support page for IMPORTRANGE for the sample usage and syntax. 
Example of using IMPORTRANGE An analyst monitoring a fundraiser needs to track and ensure that matching funds are distributed. They use IMPORTRANGE to pull all the matching transactions into a spreadsheet containing all of the individual donations. This enables them to determine which donations eligible for matching funds still need to be processed. Because the total number of matching transactions increases daily, they simply need to change the range used by the function to import the most up-to-date data. On Tuesday, they use the following to import the donor names and matched amounts:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/1cOsHnBDzm9tBb8Hk_aLYfq3-o5FZ6DguPYRJ57992_Y", "Matched Funds!A1:B4001")

On Wednesday, another 500 transactions were processed. They increase the range used by 500 to easily include the latest transactions when importing the data to the individual donor spreadsheet:

=IMPORTRANGE("https://docs.google.com/spreadsheets/d/1cOsHnBDzm9tBb8Hk_aLYfq3-o5FZ6DguPYRJ57992_Y", "Matched Funds!A1:B4501")

Note: The above examples are for illustrative purposes only. Don't copy and paste them into your spreadsheet. To try it out yourself, you will need to substitute your own URL (and sheet name if you have multiple tabs) along with the range of cells in the spreadsheet that you have populated with data. Pulling data from other data sources The QUERY function is also useful when you want to pull data from another spreadsheet. The QUERY function's SQL-like ability can extract specific data within a spreadsheet. For a large amount of data, using the QUERY function is faster than filtering data manually. This is especially true when repeated filtering is required. For example, you could generate a list of all customers who bought your company's products in a particular month using manual filtering. But if you also want to figure out customer growth month over month, you have to copy the filtered data to a new spreadsheet, filter the data for sales during the following month, and then copy those results for the analysis. With the QUERY function, you can get all the data for both months without a need to change your original dataset or copy results. The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet by name and the range of data that you want to query from, and then use the SQL SELECT command to select the specific columns. You can also add specific criteria after the SELECT statement by including a WHERE statement. But remember, all of the SQL code you use has to be placed between the quotes! Google Sheets runs the Google Visualization API Query Language across the data. Excel spreadsheets use a query wizard to guide you through the steps to connect to a data source and select the tables. In either case, you are able to be sure that the data imported is verified and clean based on the criteria in the query. Examples of using QUERY Check out the Google support page for the QUERY function with sample usage, syntax, and examples you can download in a Google sheet. Link to make a copy of the sheet: QUERY examples Real life solution Analysts can use SQL to pull a specific dataset into a spreadsheet. They can then use the QUERY function to create multiple tabs (views) of that dataset. For example, one tab could contain all the sales data for a particular month and another tab could contain all the sales data from a specific region. This solution illustrates how SQL and spreadsheets are used well together.
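To make the QUERY syntax described above concrete, here is a small, hedged example; the cell range and column letters are placeholders rather than references to the course files: =QUERY(A1:D100, "SELECT A, B WHERE D = 'October'"). Everything inside the quotation marks is the SQL-like part of the formula, and swapping 'October' for another month returns the second month's rows without touching or copying the original data.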
Filtering data to get what you want The FILTER function is fully internal to a spreadsheet and doesn’t require the use of a query language. The FILTER function lets you view only the rows (or columns) in the source data that meet your specified conditions. It makes it possible to pre-filter data before you analyze it. The FILTER function might run faster than the QUERY function. But keep in mind, the QUERY function can be combined with other functions for more complex calculations. For example, the QUERY function can be used with other functions like SUM and COUNT to summarize data, but the FILTER function can't. Example of using FILTER Check out the Google support page for the FILTER function with sample usage, syntax, and examples you can download in a Google sheet. Link to make a copy of the sheet: FILTER examples Keeping data organized with sorting and filters You have learned about four phases of analysis: Organize data Format and adjust data Get input from others Transform data The organization of datasets is really important for data analysts. Most of the datasets you will use will be organized as tables. Tables are helpful because they let you manipulate your data and categorize it. Having distinct categories and classifications lets you focus on, and differentiate between, your data quickly and easily. Data analysts also need to format and adjust data when performing an analysis. Sorting and filtering are two ways you can keep things organized when you format and adjust data to work with it. For example, a filter can help you find errors or outliers so you can fix or flag them before your analysis. Outliers are data points that are very different from similarly collected data and might not be reliable values. The benefit of filtering the data is that after you fix errors or identify outliers, you can remove the filter and return the data to its original organization. In this reading, you will learn the difference between sorting and filtering. You will also be introduced to how a particular form of sorting is done in a pivot table. Sorting versus filtering Left image of a pair of hands sorting letters and numbers. Right image is a hand holding a filter sorting numbers and letters Sorting is when you arrange data into a meaningful order to make it easier to understand, analyze, and visualize. It ranks your data based on a specific metric you choose. You can sort data in spreadsheets, SQL databases (when your dataset is too large for spreadsheets), and tables in documents. For example, if you need to rank things or create chronological lists, you can sort by ascending or descending order. If you are interested in figuring out a group’s favorite movies, you might sort by movie title to figure it out. Sorting will arrange the data in a meaningful way and give you immediate insights. Sorting also helps you to group similar data together by a classification. For movies, you could sort by genre -- like action, drama, sci-fi, or romance. Filtering is used when you are only interested in seeing data that meets a specific criteria, and hiding the rest. Filtering is really useful when you have lots of data. You can save time by zeroing in on the data that is really important or the data that has bugs or errors. Most spreadsheets and SQL databases allow you to filter your data in a variety of ways. Filtering gives you the ability to find what you are looking for without too much effort. 
For example, if you are only interested in finding out who watched movies in October, you could use a filter on the dates so only the records for movies watched in October are displayed. Then, you could check out the names of the people to figure out who watched movies in October. To recap, the easiest way to remember the difference between sorting and filtering is that you can use sort to quickly order the data, and filter to display only the data that meets the criteria that you have chosen. Use filtering when you need to reduce the amount of data that is displayed. It is important to point out that, after you filter data, you can sort the filtered data, too. If you revisit the example of finding out who watched movies in October, after you have filtered for the movies seen in October, you can then sort the names of the people who watched those movies in alphabetical order. Sorting in a pivot table Items in the row and column areas of a pivot table are sorted in ascending order by any custom list first. For example, if your list contains days of the week, the pivot table allows weekday and month names to sort like this: Monday, Tuesday, Wednesday, etc. rather than alphabetically like this: Friday, Monday, Saturday, etc. If the items aren’t in a custom list, they will be sorted in ascending order by default. But, if you sort in descending order, you are setting up a rule that controls how the field is sorted even after new data fields are added. Hey, great to see you again. Earlier we talked about why you should organize your data, no matter what part of the lifecycle it's in. Just like any collection, it's easier to manage and care for a group of things when there's structure around them. Play video starting at ::15 and follow transcript0:15 Now we should keep in mind that organization isn't just about making things look orderly. It's also about making it easier to search and locate the data you need in a quick and easy way. As a data analyst, you'll find yourself rearranging and sifting through databases pretty often. Two of the most common ways of doing this are with sorting and filtering. We've briefly discussed sorting and filtering before, and it's important you know exactly what each one does. Sorting is when you arrange data into a meaningful order to make it easier to understand, analyze, and visualize. Sorting ranks your data based on a specific metric that you can choose. You can sort data in spreadsheets and databases that use SQL. We'll get to all the cool functions you can use in both a little later on. A common way to sort items when you're shopping on a website is from lowest to highest price, but you can also sort by alphabetical order, like books in a library. Or you can sort from newest to oldest, like the order of text messages in a phone. Or nearest to furthest away, like when you're searching for restaurants online. Another way to organize information is with a filter. Filtering is showing only the data that meets a specific criteria while hiding the rest. Typically you can use filters when you want to narrow down the amount of data you want to sift through. Say you're searching for green sneakers online. To save time, you filter for green shoes only. Using a filter slims down larger data sets to smaller subsets that are relevant to what you need. Sorting and filtering are two actions you probably perform a lot online. 
Whether you're sorting movie showtimes from earliest to latest, or filtering your search results to just images, you're probably already familiar with how helpful they can be for making sense of data. Now let's take that knowledge and apply it. When it comes to sifting through large, disorganized piles of data, filters are your friend. You might remember from a previous video that you can use filters and spreadsheet programs, like Excel and Sheets, to only display data from rows that match the range or condition you've set. You can also filter data in SQL using the WHERE clause. The WHERE clause works similarly to filtering in a spreadsheet because it returns rows based on a condition you name. Let's learn how you can use a WHERE clause in a database. We'll use BigQuery to access the database and run our query. If you're joining us, open up your tool of choice for using SQL and reference the earlier resource on how to access the dataset. Otherwise, watch as the WHERE clause does its thing. Here's the database. Play video starting at :3:5 and follow transcript3:05 You might recognize it from past videos. Basically, it's a long list of movies. Each row includes an entry for the columns named Movie_Title, Release_Date, Genre, Director, Cast_Members, Budget, and Total_Revenue. It also includes a link to the film's Wikipedia page. If you scroll down the list, the list goes on for a long time. Of course, we won't need to go through everything to find the data we want. That's the beauty of a filter! In this case, we'll use the WHERE clause to filter the database and narrow down the list to movies in the comedy genre. To start, we'll use the SELECT command followed by an asterisk. In SQL, an asterisk selects all of the data. On a new line, we'll type FROM and the name of the database: movie_data.movies. To filter the movies by comedy, we're going to type WHERE, then list the condition, which is Genre. Play video starting at :4:5 and follow transcript4:05 Genre is a column in the dataset, and we only want to select rows where the cell in the Genre column exactly matches "Comedy." Next we'll type the equals sign and write the specific genre we're filtering for, which is comedy. Since the data in the Genre column is a string format, we have to use single or double quotations when writing it. And keep in mind that capitalization matters here, so we have to make sure that the letter casing matches the column name exactly. And now we can click Run to check out the results. What we're left with is a shorter list of comedy movies. Pretty cool, right? Here's something else you should know. You can apply multiple filters to a database. You can even sort and filter data at the same time for even more precise results. As a data analyst, knowing how to sort and filter data will make you a superstar. That's all for now. Coming up, we'll get down to the nitty-gritty of sorting functions in spreadsheets. See you there! Sorting and filtering in Sheets and Excel In this reading, we will describe the sorting and filtering options in Google Sheets and Microsoft Excel. Both offer basic sorting and filtering from set menu options. But, if you need more advanced sorting and filtering capabilities, you can use their respective SORT and FILTER functions. Sorting and filtering in Sheets Sorting in Google Sheets helps you quickly spot trends in numbers. One trend might be gross revenue by sales region. 
In this case, you could sort the gross revenue column in descending (Z to A) order to spot the top performing regions at the top, or sort the gross revenue column in ascending (A-Z) order to spot the lowest performing regions at the top. Although an alphabetical order is implied, these sorting options do sort numbers, as our gross revenue example highlighted. If you want to learn more about the set menu options for sorting and filtering, start with these resources: Sort and filter data (Google Help Center): instructions to sort data in alphabetical or numerical order and create filter views Sort data by selecting a range of data in a column: video of steps to achieve the task Sort a range of data using sort criteria for multiple columns: technical tip video to sort data across multiple columns In addition to the standard menu options, there is a SORT function for more advanced sorting. Use this function to create a custom sort. You can sort the rows of a given range of data by the values in one or more columns. And you get to set the sort criteria per column. Refer to the SORT function page for the syntax. And like the SORT function, you can use the FILTER function to filter by any matching criteria you like. This creates a custom filter. You might recall that you can filter data and then sort the filtered results. Using the FILTER and SORT functions together in a range of cells can programmatically and automatically achieve these results for you. Sorting and filtering in Excel You can also sort in ascending (A-Z) and descending (Z-A) order in Microsoft Excel. Excel offers Smallest to Largest and Largest to Smallest sorting when you are working with numbers. Similar to the SORT function in Google Sheets, Excel includes custom sort capabilities that are available from the menu. After you select the data range, click the Sort & Filter button to select the criteria for sorting. You can even sort by the data in rows instead of by the data in columns if you select Sort left to right under Options. (Sort top to bottom is the default setting to sort the data in columns.) If you want to learn more about sorting and filtering in Excel, start with these resources: Sort data in a range or table (Microsoft Support): instructions and video to perform sorting in 11 different use cases Excel training: sort and filter data (Microsoft Support): sorting and filtering videos with transcripts Excel: sorting data: video of how to use the Sort & Filter and Data menu options for sorting Excel also has SORT, SORTBY, and FILTER functions. Explore how you can use these functions to automatically sort and filter your data in spreadsheets without having to select any menu options at all. Hello there! If you're hoping to learn about sorting—in SQL this time— you've definitely come to the right place. So far, we've sorted spreadsheets through the menu and with a written function. Which brings us to the next part of our learning: more sort functions, but this time in SQL. Data analysts love playing with the way data is presented. Sorting is a useful way to rearrange data because it can help you understand the data you have in a different light. As you've probably already noticed, a lot of things you can do in spreadsheets can also be done in SQL. Sorting is one of those things. We've talked about using SQL with large datasets before. When a spreadsheet has too much data, you can get error messages, or it can cause your program to crash. That's definitely something we want to avoid. 
SQL shortens processes that would otherwise take a very long time or be impossible to complete in a spreadsheet. Personally, I use SQL to pull and combine different data tables. It's much quicker than a spreadsheet, and that usually comes in handy. Here's something pretty helpful you can do with SQL. You can use the ORDER BY clause to sort results returned in a query. Let's go back to our movie spreadsheet to get a better idea of how this works. Feel free to follow along in a SQL tool of your choice as we go. As a quick refresher, we have a database of movies listed with data like release date, director, and more. We can sort this table in lots of different ways using the ORDER BY function. For this example, let's sort by release date. First, we have the SELECT function and an asterisk. Play video starting at :1:51 and follow transcript1:51 Keep in mind that the asterisk means all columns are selected. Then we have FROM and the name of the database and table we're in right now. Now let's check out the next line. It's empty, but that's where we'll write our ORDER BY function. The ORDER BY command is usually the last clause in your query. Back to the actual sorting! We'll type ORDER BY with the space. With this clause, you can choose to order data by fields in a certain column. Because we want to sort by release date, we'll type Release_Date. By default, the ORDER BY clause sorts data in ascending order. If you run the query as it is right now, the movies will be sorted from oldest to the most recent release dates. Let's run the query and see what we've got. You can also sort the release dates in the reverse order from the most recent dates to the oldest. To do this, just specify the descending order in the ORDER BY command written as DESC, D-E-S-C. Let's run this query. Play video starting at :3:6 and follow transcript3:06 As you'll notice, the most recently released films are now at the top of the database. In spreadsheets, you can combine sorts and filters to display information differently. You can do something similar in SQL too. You might remember that while sorting puts data in a specific order, filters narrow down data, so you only see data that fits the filter. For example, let's say we want to filter movies by genre so that we're only working with comedies. But we still want release dates to be sorted in descending order, from most recent to oldest films. We can do this with the WHERE clause. Let's try that now. First, we'll check that the ORDER BY clause is always the last line. That makes sure that all the results of the query you're running are sorted by that clause. Then, we'll add a new line for the WHERE clause after FROM and before ORDER BY. Play video starting at :4:9 and follow transcript4:09 Here's what we've got so far. From there, we want to type the column we're filtering for. In this case, we want to filter the database for comedies. After the WHERE clause, we'll type the column list's name as Genre. Now, we'll add an equal sign after Genre because we only want to include genres that match what we're filtering for. In this case, we're filtering for comedy, so we'll type Comedy between two apostrophes. Now, if you check out the entire query as a whole, you'll notice that we're selecting all columns, and we know it's all columns because that's what an asterisk means. The FROM clause specifies the name of the movie database we're using, and the WHERE clause filters the data to include entries whose genre is specified as comedy. 
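Assembled, the filter-plus-sort query from this walkthrough looks roughly like this (a sketch using the table and column names mentioned in the videos):
SELECT * -- the asterisk selects every column
FROM movie_data.movies
WHERE Genre = 'Comedy' -- keep only rows whose Genre value is exactly Comedy
ORDER BY Release_Date DESC -- most recent releases first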
Then in the last line, we have the ORDER BY clause, which will sort the data we've chosen to filter by release dates in descending order. This means when we run the query, we'll only have comedy movies listed from newest releases to oldest releases. Let's run it and figure out if that's the case. Play video starting at :5:25 and follow transcript5:25 Cool. Check out all those comedy movies and the way those dates are sorted. Play video starting at :5:33 and follow transcript5:33 Now, let's take this query a step further. We'll filter for two conditions at once using the AND filter. Working off the query we've been using, we'll add a second condition in the WHERE clause. We'll keep the sorting the same. Let's say you wanted to filter by comedy movies and movies that earned over 300 million in the box office. In this case, after the AND function, you'd add the revenue condition by typing Revenue. From there, you'll specify that you only want to return films with revenues over $300 million. To do that, type the greater than sign and then the complete number of 300 million without commas. Now let's run the query. Play video starting at :6:23 and follow transcript6:23 Here, the data only shows comedy movies with revenues of over $300 million, and it's sorted in descending order by release date. It looks really good. You just filtered and sorted a database like it's your job. And with practice, one day it can be. Just like that, you've finished another step in your data analyst journey. By now, you really dug and learned about the analysis process with a special emphasis on how organization can change how you go through your data. You also learned about both spreadsheets and SQL, and how to sort and filter data in both types of programs. To help you get more comfortable using spreadsheet and SQL features, you'll be getting some materials you can use as a resource. Coming up, we'll check out how an organizational mindset can take your analytical skills even further. We'll also cover converting, formatting, and adjusting data to combine information in a way that makes sense. Learning those skills early on can make your work as a data analyst much more efficient and effective in the long run. See you soon. Hands-On Activity: Analyze weather data in BigQuery Total points 2 1. Question 1 Activity overview Previously, you learned how to use BigQuery to clean data and prepare it for analysis. Now you will query a dataset and save the results into a new table. This is a useful skill when the original data source changes continuously and you need to preserve a specific dataset for continued analysis. It’s also valuable when you are dealing with a large dataset and know you’ll be doing more than one analysis using the same subset of data. In this scenario, you’re a data analyst at a local news station. You have been tasked with answering questions for meteorologists about the weather. You will work with public data from the National Oceanic and Atmospheric Administration (NOAA), which has data for the entire United States. This is why you will need to save a subset of the data in a separate table. By the time you complete this activity, you will be able to use SQL queries to create new tables when dealing with complex datasets. This will greatly simplify your analysis in the future. Access the public dataset For this activity you will need the NOAA weather data from BigQuery’s public datasets. 1. Click on the + ADD DATA button in the Explorer menu pane and select Explore public datasets. 
This will open a new menu where you can search public datasets that are already available through Google Cloud. If you have already loaded the BigQuery public datasets into your console, you can just search noaa_gsod in your Explorer menu and skip these steps. 2. Type noaa_gsod into the search bar. You'll find the GSOD (Global Surface Summary of the Day) Weather Data. 3. Click the GSOD dataset to open it. This will provide you with more detailed information about the dataset if you're interested. Click VIEW DATASET to open this dataset in your console. 4. Search noaa_gsod in your Explorer menu pane to find the dataset. Click the dropdown menu to explore the tables in this dataset. Scroll down to gsod2020 and open the table menu by clicking the three vertical dots. 5. Check the table's schema and preview it to get familiar with the data. Once you're ready, you can click COMPOSE NEW QUERY to start querying the dataset. Querying the data The meteorologists who you're working with have asked you to get the temperature, wind speed, and precipitation for stations La Guardia and JFK, for every day in 2020, in descending order by date, and ascending order by Station ID. Use the following query to request this information:
SELECT
  date,
  stn,
  -- Use the IF function to replace 9999.9 values, which the dataset description explains is the default value when temperature is missing, with NULLs instead.
  IF(temp=9999.9, NULL, temp) AS temperature,
  -- Use the IF function to replace 999.9 values, which the dataset description explains is the default value when wind speed is missing, with NULLs instead.
  IF(wdsp="999.9", NULL, CAST(wdsp AS Float64)) AS wind_speed,
  -- Use the IF function to replace 99.99 values, which the dataset description explains is the default value when precipitation is missing, with NULLs instead.
  IF(prcp=99.99, 0, prcp) AS precipitation
FROM
  `bigquery-public-data.noaa_gsod.gsod2020`
WHERE
  stn="725030" -- La Guardia
  OR stn="744860" -- JFK
ORDER BY
  date DESC,
  stn ASC
The meteorologists also asked you a couple questions while they were preparing for the nightly news: They want the average temperature in June 2020 and the average wind_speed in December 2020. Instead of rewriting similar, but slightly different, queries over and over again, there is an easier approach: Save the results from the original query as a table for future queries. Save a new table In order to make this subset of data easier to query from, you can save the table from the weather data into a new dataset. 1. From your Explorer pane, click the three vertical dots next to your project and select Create dataset. You can name this dataset demos and leave the rest of the default options. Click CREATE DATASET. 2. Open your new dataset and select COMPOSE NEW QUERY. Input the following query to get the temperature, wind speed, and precipitation for the La Guardia and JFK stations for every day in 2020, in descending order by date, and ascending order by Station ID:
SELECT
  stn,
  date,
  -- Use the IF function to replace 9999.9 values, which the dataset description explains is the default value when temperature is missing, with NULLs instead.
  IF(temp=9999.9, NULL, temp) AS temperature,
  -- Use the IF function to replace 999.9 values, which the dataset description explains is the default value when wind speed is missing, with NULLs instead.
  IF(wdsp="999.9", NULL, CAST(wdsp AS Float64)) AS wind_speed,
  -- Use the IF function to replace 99.99 values, which the dataset description explains is the default value when precipitation is missing, with NULLs instead.
  IF(prcp=99.99, 0, prcp) AS precipitation
FROM
  `bigquery-public-data.noaa_gsod.gsod2020`
WHERE
  stn="725030" -- La Guardia
  OR stn="744860" -- JFK
ORDER BY
  date DESC,
  stn ASC
3. Before you run the query, select the MORE menu from the Query Editor and open the Query Settings menu. In the Query Settings menu, select Set a destination table for query results. Set the dataset option to demos and name the table nyc_weather. 4. Run the query from earlier; now it will save as a new table in your demos dataset. 5. Return to the Query settings menu by using the MORE dropdown menu. Reset the settings to Save query results in a temporary table. This will prevent you from accidentally adding every query as a table to your new dataset. Query your new table Now that you have the subset of this data saved in a new table, you can query it more easily. Use the following query to find the average temperature from the meteorologists' first question:
SELECT
  AVG(temperature)
FROM
  `airy-shuttle-315515.demos.nyc_weather` -- remember to change the project name to your project before running this query
WHERE
  date BETWEEN '2020-06-01' AND '2020-06-30'
You can also use this syntax to find the average wind_speed or any other information from this subset of data you're interested in. Try constructing a few more queries to answer the meteorologists' questions! The ability to save your results into a new table is a helpful trick when you know you're only interested in a subset of a larger complex dataset that you plan on querying multiple times, such as the weather data for just La Guardia and JFK. This also helps minimize errors during your analysis. Confirmation and reflection What was the average temperature at JFK and La Guardia stations between June 1, 2020 and June 30, 2020? 1 / 1 point 72.883 92.099 87.671 74.909 Correct The average was 72.883. To find out the average temperature during this time period, you successfully created a new table using a query and ran another query against that table. Going forward, you will be able to use this skill to create tables with specific subsets of your data to query. This will help you draw insights from multiple data sources in the future. Converting data in spreadsheets In this reading, you will learn about converting data from one format to another. One of the ways to help ensure that you have an accurate analysis of your data is by putting all of it in the correct format. This is true even if you have already cleaned and processed your data. As a part of getting your data ready for analysis, you will need to convert and format your data early on in the process. As a data analyst, there are lots of scenarios when you might need to convert data in a spreadsheet: String to date How to convert text to date in Excel: Transforming a series of numbers into dates is a common scenario you will encounter. This resource will help you learn how to use Excel functions to convert text and numbers to dates, and how to turn text strings into dates without a formula.
Google Sheets: Change date format: If you are working with Google Sheets, this resource will demonstrate how to convert your text strings to dates and how to apply the different date formats available in Google Sheets. String to numbers How to convert text to number in Excel: Even though you will have values in your spreadsheet that resemble numbers, they may not actually be numbers. This conversion is important because it will allow your numbers to add up and be used in formulas without errors in Excel. How to convert text to numbers in Google Sheets: This resource is useful if you are working in Google Sheets; it will demonstrate how to convert text strings to numbers in Google Sheets. It also includes multiple formulas you can apply to your own sheets, so you can find the method that works best for you. Combining columns Convert text from two or more cells: Sometimes you may need to merge text from two or more cells. This Microsoft Support page guides you through two distinct ways you can accomplish this task without losing or altering your data. It also includes a step-by-step video tutorial to help guide you through the process. How to split or combine cells in Google Sheets: This guide will demonstrate how to to split or combine cells using Google Sheets specifically. If you are using Google Sheets, this is a useful resource to reference if you need to combine cells. It includes an example using real data. Number to percentage Format numbers as percentages: Formatting numbers as percentages is a useful skill to have on any project. This Microsoft Support page will provide several techniques and tips for how to display your numbers as percentages. TO_PERCENT: This Google Sheets support page demonstrates how to use the TO_PERCENT formula to convert numbers to percentages. It also includes links to other formulas that can help you convert strings. Pro tip: Keep in mind that you may have lots of columns of data that require different formats. Consistency is key, and best practice is to make sure an entire column has the same format. Additional resources If you find yourself needing to convert other types of data, you can find resources on Microsoft Support for Excel or Google Docs Editor Help for Google Sheets. Converting data is quick and easy, and the same functions can be used again and again. You can also keep these links bookmarked for future use, so you will always have them ready in case any of these issues arise. Now that you know how to convert data, you are on your way to becoming a successful data analyst. Transforming data in SQL Data analysts usually need to convert data from one format to another to complete an analysis. But what if you are using SQL rather than a spreadsheet? Just like spreadsheets, SQL uses standard rules to convert one type of data to another. If you are wondering why data transformation is an important skill to have as a data analyst, think of it like being a driver who is able to change a flat tire. Being able to convert data to the right format speeds you along in your analysis. You don’t have to wait for someone else to convert the data for you. In this reading, you will go over the conversions that can be done using the CAST function. There are also more specialized functions like COERCION to work with big numbers, and UNIX_DATE to work with dates. UNIX_DATE returns the number of days that have passed since January 1, 1970 and is used to compare and work with dates across multiple time zones. You will likely use CAST most often. 
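Because UNIX_DATE counts whole days from January 1, 1970, a quick way to see it in action is a minimal sketch like this (BigQuery syntax; the date is just an example):
SELECT UNIX_DATE(DATE '1970-01-02') AS days_since_epoch -- returns 1
Subtracting the UNIX_DATE values of two dates gives the number of whole days between them, which is what makes the function handy for comparing dates recorded across different time zones.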
Common conversions The following list summarizes some of the more common conversions made with the CAST function. Refer to Conversion Rules in Standard SQL for a full list of functions and associated rules.
Starting with Numeric (number), CAST can convert to: Integer, Numeric (number), Big number, Floating integer, String.
Starting with String, CAST can convert to: Boolean, Integer, Numeric (number), Big number, Floating integer, String, Bytes, Date, Date time, Time, Timestamp.
Starting with Date, CAST can convert to: String, Date time, Timestamp.
The CAST function (syntax and examples) CAST is an American National Standards Institute (ANSI) function used in lots of programming languages, including BigQuery. This section provides the BigQuery syntax and examples of converting the data types listed above. The syntax for the CAST function is as follows: CAST (expression AS typename) Where expression is the data to be converted and typename is the data type to be returned. Converting a number to a string The following CAST statement returns a string from a numeric identified by the variable MyCount in the table called MyTable. SELECT CAST (MyCount AS STRING) FROM MyTable In the above SQL statement, the following occurs: SELECT indicates that you will be selecting data from a table CAST indicates that you will be converting the data you select to a different data type AS comes before and identifies the data type which you are casting to STRING indicates that you are converting the data to a string FROM indicates which table you are selecting the data from Converting a string to a number The following CAST statement returns an integer from a string identified by the variable MyVarcharCol in the table called MyTable. (An integer is any whole number.) SELECT CAST(MyVarcharCol AS INT) FROM MyTable In the above SQL statement, the following occurs: SELECT indicates that you will be selecting data from a table CAST indicates that you will be converting the data you select to a different data type AS comes before and identifies the data type which you are casting to INT indicates that you are converting the data to an integer FROM indicates which table you are selecting the data from Converting a date to a string The following CAST statement returns a string from a date identified by the variable MyDate in the table called MyTable. SELECT CAST (MyDate AS STRING) FROM MyTable In the above SQL statement, the following occurs: SELECT indicates that you will be selecting data from a table CAST indicates that you will be converting the data you select to a different data type AS comes before and identifies the data type which you are casting to STRING indicates that you are converting the data to a string FROM indicates which table you are selecting the data from Converting a date to a datetime Datetime values use the YYYY-MM-DD hh:mm:ss format, so date and time are retained together. The following CAST statement returns a datetime value from a date. SELECT CAST (MyDate AS DATETIME) FROM MyTable In the above SQL statement, the following occurs: SELECT indicates that you will be selecting data from a table CAST indicates that you will be converting the data you select to a different data type AS comes before and identifies the data type which you are casting to DATETIME indicates that you are converting the data to a datetime value FROM indicates which table you are selecting the data from The SAFE_CAST function Using the CAST function in a query that fails returns an error in BigQuery. To avoid errors in the event of a failed query, use the SAFE_CAST function instead.
The SAFE_CAST function returns a value of Null instead of an error when a query fails. The syntax for SAFE_CAST is the same as for CAST. Simply substitute the function directly in your queries. The following SAFE_CAST statement returns a string from a date. SELECT SAFE_CAST (MyDate AS STRING) FROM MyTable Optional: Prepare to use the bike sharing dataset in BigQuery The next video demonstrates how to use CONCAT in a SQL query to return data from two columns in a single column. If you would like to follow along with the instructor, you will need to log in to your BigQuery account to use the open (public) dataset called new_york_citibike. If you need a refresher, the reading Using BigQuery in the Prepare Data for Exploration course explains how to set up a BigQuery account. Prepare for the next video Step 1: In the BigQuery Explorer, enter citibike in the search bar to locate the new_york_citibike dataset under bigquery-public-data. Step 2: Click the citibike_trips table, then click the Preview tab to view the data in the table. What to expect from the query You will be using CONCAT to combine the data in the start_station_name column with the data in the end_station_name column to create the route information in another column; for example, the route from Station 509 to Station 442 in the first row of the table above would be 9 Ave & W 22 St to W 27 St & 7 Ave, a combination of the start and end station names. Great to see you back. In this video, we'll build on what we've learned about CONCATENATE and IMPORTRANGE by exploring a new SQL query: CONCAT. You might remember that CONCATENATE is a function that joins together two or more text strings. As a quick reminder, a text string is a group of characters within a cell most often composed of letters. You've seen how that works within a single spreadsheet. But there's a similar function in SQL that allows you to join multiple text strings from multiple sources: CONCAT. Let's use CONCAT to combine strings from multiple tables to create new strings. For this example, we'll use open data from Citi Bike, which is a public bicycle sharing system in New York. As you've learned earlier, open data initiatives have created a ton of data for analysts to use. Openness or open data is free access, usage, and sharing of data. It's a great resource if you want to practice or experiment with the data analysis tools you've been learning here. You have open access to the New York City bike-sharing data, which has information about the use of shared bikes across the city. Now we can use CONCAT to pull and concatenate data from different columns stored here. The first thing we need to do is figure out which columns we need. That way we can tell SQL where the strings we want are. For example, the bike-sharing company has two different kinds of customers: one-time paying customers and subscribers. Let's say we want to find out what routes are most popular with different user types. To do that, we need to create strings of recognizable route names that we can count and sort. We know that the information we need is in the stations and trips table. We'll start building our query from there. First, we'll input SELECT usertype to let SQL know that we want the user type as a column. Then we'll use CONCAT to combine the names of the beginning and ending stations for each trip in a new column. This will create one column based on the routes people take. We also need to input a title for this new column. We'll type in AS route to name the route column using those beginning and ending station names we combined with CONCAT.
This will make these route names easy for us to read and understand. After that, we want SQL to count the number of trips. So we'll input COUNT to do that. We can use an asterisk to tell it to count up the number of rows in the data we're selecting. In this case, each row represents a trip, which is why we can just count all of the rows we've selected. We'll name this output as num_trips. Play video starting at :2:46 and follow transcript2:46 Now let's also get the average trip duration for each route. In this case, we don't need the exact average, so we can use the ROUND function to round up. We'll put that first and then in the parentheses use average to get the average trip duration. We'll also want this data to be in integer form for this calculation, so we'll input cast as int 64. Big query stores numbers in a 64-bit memory system, which is why there's a 64 after integer in this case. Next, we'll divide it by the number of rows and tell it how far we want it to round, two decimal places. We'll name this output as duration. We'll need to tell SQL where this information is stored. We'll use FROM and the location we're pulling it from. Play video starting at :3:42 and follow transcript3:42 Since we're using COUNT and AVERAGE functions in our select clause, we have to use GROUP BY to group together summary rows. Let's group by the start station, the end station, and the user type for this query. Finally, we'll use ORDER BY to tell it how we want to organize this data. For this, we want to figure out the most common trips so we can input the number of trips column and use DESC to put it in descending order. Finally, we only want the top 10, so let's add LIMIT 10. Now thanks to CONCAT, we can easily read these route names and trace them back to real places. We can see which kinds of customers are taking which routes, which can help the bike-sharing company understand their user base in different parts of the city and where to keep more bikes for people to rent. Being able to combine multiple pieces of data can give you new ways to organize and analyze data. There's a lot of different tools to help you do that. Now you've seen CONCAT in action, and later you will come across another similar query, JOIN. But up next, we'll talk more about working with strings. See you soon. Manipulating strings in SQL Knowing how to convert and manipulate your data for an accurate analysis is an important part of a data analyst’s job. In this reading, you will learn about different SQL functions and their usage, especially regarding string combinations. A string is a set of characters that helps to declare the texts in programming languages such as SQL. SQL string functions are used to obtain various information about the characters, or in this case, manipulate them. One such function, CONCAT, is commonly used. Review the table below to learn more about the CONCAT function and its variations. Function Usage Example CONCAT A function that adds strings together to create new text strings that can be used as unique keys CONCAT (‘Google’, ‘.com’); CONCAT_WS A function that adds two or more strings together with a separator CONCAT_WS (‘ . 
’, ‘www’, ‘google’, ‘com’). The separator (the period) gets input before and after google when the SQL function is run. CONCAT with + A function that adds two or more strings together using the + operator, for example ‘Google’ + ‘.com’ CONCAT at work When adding two strings together such as ‘Data’ and ‘analysis’, it will be input like this: SELECT CONCAT (‘Data’, ‘analysis’); The result will be: Dataanalysis Sometimes, depending on the strings, you will need to add a space character, so your function should actually be: SELECT CONCAT (‘Data’, ‘ ‘, ‘analysis’); And the result will be: Data analysis The same rule applies when combining three strings together. For example, SELECT CONCAT (‘Data’, ‘ ‘, ‘analysis’, ‘ ‘, ‘is’, ‘ ‘, ‘awesome!’); And the result will be Data analysis is awesome! Practice makes perfect W3 Schools is an excellent resource for interactive SQL learning, and the following links will guide you through transforming your data using SQL: SQL functions: This is a comprehensive list of functions to get you started. Click on each function, where you will learn about the definition, usage, examples, and even be able to create and run your own query for practice. Try it out for yourself! SQL Keywords: This is a helpful SQL keywords reference to bookmark as you increase your knowledge of SQL. This list of keywords contains reserved words that you will use as your need to perform different operations in the database grows. While this reading went through the basics of each of these functions, there is still more to learn, and you can even combine your own strings. 1. Practice using CONCAT 2. Practice using CONCAT_WS 3. Practice using CONCAT with + Pro tip: The functions presented in the resources above may be applied in slightly different ways depending on the database that you are using (e.g., MySQL versus SQL Server). But, the general description provided for each function will prepare you to customize how you use these functions as needed. Hi there. Data analysts spend a lot of time problem-solving, and that means there's going to be times when you get stuck, but the trick is knowing what to do when that happens. In this video, we'll talk about the importance of knowing how to get help, whether that means asking someone else for help or searching the internet for answers. Asking other people about a problem you're having can help you find new solutions that move a project forward. It's always a good idea to reach out to your peers and mentors, especially if they're working with you on that project. Your team members have valuable knowledge and insight that can help you find the solution you need to get unstuck. Sometimes we spend a lot of time spinning our wheels saying, "I can do this myself," but we can be way more productive if we engage with other people, find new resources to lean on and try to get as many voices as we can involved. For example, let's say you're working with the bike trip time data from the previous videos. Maybe you're trying to find the average time between bike rides in a given month. Calculating the difference between bike rides before midnight is easy, but you can run into a problem if the elapsed time crosses into the next day. If someone went on a bike ride at 11:00 PM, but the next ride wasn't until 06:00 AM, your formula would return a negative number because the end time is less than the start time.
You know that you can add one minus the start time if two bike rides start and end on different days, but that formula won't work on times that happened in the same day, and it's pretty inefficient to scroll through every bike ride to pinpoint these special cases. You need to find a way to build a conditional formula, but you aren't sure how. You decide to check in with other analysts working on your team to see if they have any ideas. You could send them a quick email, or stop by their desk, to find out if they have a minute to talk it over with you. Turns out they had a similar problem on a previous project, and they're able to show you a conditional formula that you could use to speed up your calculations. Great! They suggest using an IF formula like this. This basically says that, "if the end time is smaller than the start time, replace the standard end time minus start time formula with one minus start time plus end time." Now it's also possible that your team members don't have an answer; that's okay too. There's definitely someone else with the same problem asking the same questions online. Knowing how to find solutions online is an incredibly valuable problem-solving tool for data analysis. There's also all kinds of forums where spreadsheet users can ask questions, and you never know what you can turn up with just a basic search. For example, let's say you search "calculate number of hours between times" for spreadsheets and find a helpful walk-through for a more complicated formula using MOD. This flips the negative values into positive ones, solving your calculation problem. Whether you're asking someone you know or searching the internet for answers, reaching out for help can give you some really interesting solutions and new ways to solve problems for future analysis. Coming up, we'll learn even more about searching for solutions online. See you soon. Advanced spreadsheet tips and tricks Like a lot of the things you're learning in this program, spreadsheets will get easier the more you practice. This reading provides you with a list of resources that may help advance your knowledge and experience with spreadsheet functions and functionality. The goal is to provide you with access to a variety of advanced tips and tricks that will help make you more efficient and effective when working with spreadsheets to perform data analysis. Review the description of each resource below, click the links to learn more, and save or bookmark any links that are useful to you. You can immediately start practicing anything that you learn to build your understanding and familiarity with spreadsheets. This reading provides a range of resources, so feel free to explore the ones that are applicable to you and skip the ones that aren't. Google Sheets Keyboard shortcuts for Google Sheets: This is a great resource for quickly learning a range of keyboard shortcuts that can make regular tasks quicker and easier, like navigating your spreadsheet or accessing formulas and functions. This list contains shortcuts for the desktop and mobile versions of Google Sheets so that you can apply them to your work no matter what device you are using. List of Google Sheets functions: This is a comprehensive list of the Google Sheets functions and syntax. Each function is listed with a link to learn more. 20 Google Sheets Formulas You Must Know: This blog article summarizes and describes 20 of the most useful Google Sheets formulas.
18 Google Sheets Formula Tips and Techniques: These are tips for using Google Sheets shortcuts when working with formulas. Excel Keyboard shortcuts in Excel: Earlier in this list, you were provided with a resource for keyboard shortcuts in Google Sheets. Similarly, this resource provides a list of keyboard shortcuts in Excel that will make performing regular spreadsheet tasks more efficient. This includes keyboard shortcuts for both desktop and mobile versions of Excel, so you can apply them no matter what platform you are working on. 222 Excel shortcuts: A compilation of shortcuts includes links to more detailed explanations about how to use them. This is a great way to quickly reference keyboard shortcuts. The list has been organized by functionality, so you can go directly to the sections that are most useful to you. List of spreadsheet functions: This is a comprehensive list of Excel spreadsheet functions with links to more detailed explanations. This is a useful resource to save so that you can reference it often; that way, you’ll have access to functions and examples that you can apply to your work. List of spreadsheet formulas: Similar to the previous resource, this comprehensive list of Excel spreadsheet formulas with links to more detailed explanations and can be saved and referenced any time you need to check out a formula for your analysis. Essential Excel Skills for Analyzing Data: This blog post includes more advanced functionalities of some spreadsheet tools that you have previously learned about, like pivot tables and conditional formatting. These skills have been identified as particularly useful for data analysis. Each section includes a how-to video that will take you through the process of using these functions step-by-step, so that you can apply them to your own analysis. Advanced Spreadsheet Skills: Mark Jhon C. Oxillo’s presentation starts with a basic overview of spreadsheet but also includes advanced functions and exercises to help you apply formulas to actual data in Excel. This is a great way to review some basic concepts and practice the skills you have been learning so far. There are lots of resources online about advanced spreadsheet tips and tricks. You'll probably discover new resources and tools on your own, but this list is a great starting point as you become more familiar with spreadsheets. Question 1 Overview Now that you are learning how to convert and format data for analysis, you can pause for a moment and think about what you are learning. In this self-reflection, you will consider how you can seek help while you learn, then respond to brief questions. This self-reflection will help you develop insights into your own learning and prepare you to ask the data analytics community on Stack Overflow about what you’re learning. As you answer questions— and come up with questions of your own—you will consider concepts, practices, and principles to help refine your understanding and reinforce your learning. You’ve done the hard work, so make sure to get the most out of it: This reflection will help your knowledge stick! Seeking help on Stack Overflow Stack Overflow is an online platform where programmers ask code-related questions and peers are available to suggest answers. You can ask questions about programming languages such as SQL and R (which you will learn about in Course 7), data tools, and much more. Follow the steps below to get started on Stack Overflow. Sign up for an account To sign up for Stack Overflow: 1. 
Click on the Sign up button in the upper right corner 2. Follow the on-screen prompts to enter your desired login information. 3. Click the Sign up button. Explore Stack Overflow From the home page, click the dropdown in the upper left corner and click Questions. The Questions page provides different categories of questions for you to choose. Some examples include the “Newest” and “Active” categories. Read some of the questions under the different categories. Tags will help you find questions. On the left pane, click on Tags. On the Tags page, type in a tag name and then press Enter or Return. Next, you can click on a tag to view questions that have that particular tag. Use the Search bar at the top of the web page to search for keywords and questions. If you would like to view only questions that have a certain tag, include the tag name in brackets with your search. For example, if you want to only find questions that have the tag “SQL,” then type [SQL] in the search field, along with your keywords or question. See the example below. To learn more about searching, read these instructions about how to search. For a quick guide on syntax structures, check out this list of search types and search syntax. Write your own question When asking a question on Stack Overflow, keep it specific. Don’t use Stack Overflow to ask questions with opinion-based answers. For example, “Which SQL function can I use to add two numbers together?” is an appropriate question. “Which SQL function is your favorite?” is not. It is a best practice to search the Stack Overflow website for your question in case someone has already asked it. This reduces redundant questions on the site and saves you the time it would take to wait for an answer. Write clear and concise questions in complete sentences. Then people are more likely to understand what you ask and give you helpful answers. To begin asking a question, click the blue Ask Question button on this page. The form for asking a question has three sections: Title, Body, and Tags. Title: This is where you ask your question. Body: Summarize your problem and include expected and actual results. Include any error codes. If you think that inserting code into the Body section will help, press Ctrl+K (Windows) or Cmd+K (Mac OS) on your keyboard. Then type your code. Tags: Tags include specific keywords, like program names. They help other people find your question. You can add up to five tags. Check out this list of existing tags for examples of what tags to use. Note: Stack Overflow is a public forum. Do not post any confidential company information or code that could impact the company you work for or yourself. When in doubt, first ask your manager whether you may post your question and code excerpt on Stack Overflow. Weekly challenge 2 Latest Submission Grade 37.5% 1. Question 1 An analyst working for a British school system just downloaded a dataset that was created in the United States. The numerical data is correct but it is formatted as U.S. dollars, and the analyst needs it to be in British pounds. What spreadsheet tool can help them select the right format? 0 / 1 point Format as Pounds Format as Currency EXCHANGE CURRENCY Incorrect Review the video on formatting data for a refresher. 2. Question 2 You are using a spreadsheet to organize a list of upcoming home repairs. Column A contains the list of repairs, and column B notes the priority of each item on the list: High Priority or Low Priority. 
What spreadsheet tool can you use to create a drop-down list of priorities for each cell in column B? 0 / 1 point Pop-up menus Conditional formatting Data validation Find Incorrect Review the video on spreadsheet features for formatting data for a refresher. 3. Question 3 A data analyst in human resources uses a spreadsheet to keep track of employees’ work anniversaries. They add color to any employee who has worked for the company for more than 10 years. Which spreadsheet tool changes how cells appear when values equal 10 or more? 0 / 1 point Data validation CONVERT Conditional formatting Add color Incorrect Review the video on spreadsheet features for formatting data for a refresher. 4. Question 4 You are analyzing data about the capitals of different countries. In your SQL database, you have one column with the names of the countries and another column with the names of the capitals. What function can you use in your query to combine the countries and capitals into a new column? 0 / 1 point CONCAT COMBINE GROUP JOIN Incorrect Review the video on joining text strings for a refresher. 5. Question 5 You are querying a database of ice cream flavors to determine which stores are selling the most mint chip. For your project, you only need the first 80 records. What clause should you add to the following SQL query? 1 / 1 point LIMIT_80 LIMIT,80 LIMIT = 80 LIMIT 80 Correct To return only the first 80 records, type LIMIT 80. 6. Question 6 Fill in the blank: A data analyst is working with a spreadsheet that has very long text strings. They use the LEN function to count the number of _____ in the text strings. 1 / 1 point values substrings fields characters Correct They use the LEN function to count the number of characters in a text string. 7. Question 7 Spreadsheet cell E13 contains the text string “Database”. To return the substring “Data”, what is the correct syntax? 1 / 1 point =RIGHT(E13, 4) =LEFT(E13, 4) =RIGHT(4,E13) =LEFT(4,E13) Correct The function =LEFT(E13, 4) will return “Data” The LEFT function returns a set number of characters from the left side of a text string. In this case, it returns a four-character substring from the end of the string in E13, starting from the left. 8. Question 8 When working with a spreadsheet, data analysts use the FIND function to locate specific characters in a string. FIND is case-sensitive, so it’s necessary to input the substring exactly how it appears. 0 / 1 point True False Incorrect Review the video on strings in spreadsheets for a refresher. VLOOKUP core concepts Functions can be used to quickly find information and perform calculations using specific values. In this reading, you will learn about the importance of one such function, VLOOKUP, or Vertical Lookup, which searches for a certain value in a spreadsheet column and returns a corresponding piece of information from the row in which the searched value is found. When do you need to use VLOOKUP? Two common reasons to use VLOOKUP are: Populating data in a spreadsheet Merging data from one spreadsheet with data in another VLOOKUP syntax A VLOOKUP function is available in both Microsoft Excel and Google Sheets. You will be introduced to the general syntax in Google Sheets. (You can refer to the resources at the end of this reading for more information about VLOOKUP in Microsoft Excel.) Here is the syntax. search_key The value to search for. For example, 42, "Cats", or I24. The range to consider for the search. 
The first column in the range is searched to locate data matching the value specified by search_key. index The column index of the value to be returned, where the first column in range is numbered 1. If index is not between 1 and the number of columns in range, #VALUE! is returned. is_sorted Indicates whether the column to be searched (the first column of the specified range) is sorted. TRUE by default. It's recommended to set is_sorted to FALSE. If set to FALSE, an exact match is returned. If there are multiple matching values, the content of the cell corresponding to the first value found is returned, and #N/A is returned if no such value is found. If is_sorted is TRUE or omitted, the nearest match (less than or equal to the search key) is returned. If all values in the search column are greater than the search key, #N/A is returned. What if you get #N/A? As you have just read, #N/A indicates that a matching value can't be returned as a result of the VLOOKUP. The error doesn't mean that anything is actually wrong with the data, but people might have questions if they see the error in a report. You can use the IFNA function to replace the #N/A error with something more descriptive, like "Does not exist." Here is the syntax: IFNA(value, value_if_na). value This is a required value. The function checks whether the cell value matches a value such as #N/A. value_if_na This is a required value. The function returns this value if the cell value matches the value in the first argument; it returns this value when the cell value is #N/A. Helpful VLOOKUP reminders TRUE means an approximate match, FALSE means an exact match on the search key. If the data used for the search key is sorted, TRUE can be used. You want the column that matches the search key in a VLOOKUP formula to be on the left side of the data. VLOOKUP only looks at data to the right after a match is found. In other words, the index for VLOOKUP indicates columns to the right only. This may require you to move columns around before you use VLOOKUP. After you have populated data with the VLOOKUP formula, you may copy and paste the data as values only to remove the formulas so you can manipulate the data again. VLOOKUP resources for Microsoft Excel VLOOKUP may differ slightly in Microsoft Excel, but the overall concepts can still be generally applied. Refer to the following resources if you are working with Excel. How to use VLOOKUP in Excel: This tutorial includes a video to help you get a general understanding of how the VLOOKUP function works in Excel, as well as practical examples to look through. VLOOKUP in Excel tutorial: Follow along in this video lesson and learn how to write a VLOOKUP formula in Excel and master time-saving useful tips and tricks. 23 things you should know about VLOOKUP in Excel: Explore this list of 23 VLOOKUP facts as well as challenges you might run into, and start to learn how to master them. How to use Excel's VLOOKUP function: This article shares a specific example around how to apply VLOOKUP in your searches. VLOOKUP in Excel vs Google Sheets: This guide offers a VLOOKUP comparison of Excel and Google Sheets. Optional: Upload the employee dataset to BigQuery The next video demonstrates how to use JOINs to merge and return data from two tables based on a common attribute used in both tables. If you would like to follow along with the instructor, you will need to log in to your BigQuery account and upload the employee data provided as two CSV files.
If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account. Prepare for the next video First, download the CSV files from the attachments below: Employees Table - Understanding JOINSCSV File Download file Departments Table - Understanding JOINSCSV File Download file Next, complete the following steps in your BigQuery console to upload the employees and departments tables. Step 1: Open your BigQuery console and click on the project you want to upload the data to. Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your project name and select Create dataset. Step 3: Enter employee_data for the Dataset ID. Step 4: Click CREATE DATASET (blue button) to add the dataset to your project. Step 5: In the Explorer on the left, click to expand your project, and then click the employee_data dataset you just created. Step 6: Click the Actions icon (three vertical dots) next to employee_data and select Open. Step 7: Click the blue + icon at the top right to open the Create table window. Step 8: Under Source, for the Create table from selection, choose where the data will be coming from. Select Upload. Click Browse to select the Employees Table CSV file you downloaded. Choose CSV from the file format drop-down. Step 9: For Table name, enter employees if you plan to follow along with the video. Step 10: For Schema, click the Auto detect check box. Step 11: Click Create table (blue button). You will now see the employees table under your employee_data dataset in your project. Step 12: Click the employee_data dataset again. Step 13: Click the icon to open the Create table window again. Step 14: Under Source, for the Create table from selection, choose where the data will be coming from. Select Upload. Click Browse to select the Departments Table CSV file you downloaded. Choose CSV from the file format drop-down. Step 15: For Table name, enter departments if you plan to follow along with the video. Step 16: For Schema, click the Auto detect check box. Step 17: Click Create table (blue button). You will now see the departments table under your employee_data dataset in your project. Step 18: Click the employees table and click the Preview tab to verify that you have the data shown below. Step 19: Click the departments table and click the Preview tab to verify that you have the data shown below. If your data previews match, you are ready to follow along with the next video. Hey, welcome back. So far we've checked out a few different tools you can use to aggregate data within spreadsheets. In this video, we'll cover how to use JOIN in SQL to aggregate data in databases. First, I'll tell you a little bit about what a JOIN actually is, and then we'll explore some of the most common JOINs in action. Let's get started. JOIN is a SQL clause that's used to combine rows from two or more tables based on a related column. Basically, you can think of a JOIN as a SQL version of VLOOKUP which we just covered. There are four common JOINs data analysts use, inner, left, right, and outer. Here's a handy visualization of what each JOIN actually does. We'll use these to help us understand these functions. JOINs help you combine matching or related columns from different tables. When we learned about relational databases, we refer to these values as primary and foreign keys. Primary keys reference columns in which each value is unique to that table. 
But that table can have multiple foreign keys which are primary keys in other tables. For example, in a table about employees, the employee ID is a primary key and the office ID is a foreign key. JOINs use these keys to identify relationships and corresponding values. An inner JOIN is a function that returns records with matching values in both tables. If we think about our tables as the circles of this Venn diagram, then an inner JOIN would return the records that exist where the tables are overlapping. For the records to appear in the results table, there have to be matching key values in both tables. The records will only merge if there are matches in both tables. When we input JOIN into SQL, it usually defaults to inner JOIN. A lot of analysts will use JOIN as shorthand instead of typing the whole query. A LEFT JOIN is a function that will return all the records from the left table and only the matching records from the right table. Here's how you can figure out which table is left or right. In English and SQL we read from left to right. The table mentioned first is left and the table mentioned second is right. You can also think of left as a table name to the left of the JOIN statement and right as a table name to the right of the JOIN statement. In this diagram, you'll notice that the entire left table is colored in, as is the overlap with the right table, which shows us that the left table and the records it shares with the right table are being selected. Each row in the left table appears in the results even if there are no matches in the right table. RIGHT JOIN does the opposite. It will return all records from the right table and only the matching records from the left. You can get the same results if you flip the order of the tables and use a left JOIN. For example, SELECT from table A, LEFT JOIN table B is the same as SELECT from table B, RIGHT JOIN table A. Finally, there's OUTER JOIN. OUTER JOIN combines RIGHT and LEFT JOIN to return all records from both tables, matched wherever possible. This means it will return all records in both tables. If there are records in one table without a match, it'll create a record with no values for the other table. Using JOINs can make working with multiple data sources a lot easier and it can make relationships between tables more clear. Here's an example. Let's say we're working with employee data across multiple departments. We have an employees table and a departments table which both have some columns like department ID. We can use different JOIN clauses to help us pull different data from our tables and aggregate it. Maybe we want to get a list of employees with their department name, excluding any employee without a department ID. Because the department ID record is used in both tables, we can use an INNER JOIN to return a list with only those employees. As a quick reminder, analysts will sometimes just input JOIN for an INNER JOIN but for this example, we'll write it out. To build this query, we'll start with SELECT and AS to tell SQL how we want the columns titled. Then we'll use FROM to tell it where we're getting this data, in this case the employees table. Then we'll input INNER JOIN and the other table we're using, which is departments. We can specify which column in each table will contain the matching JOIN key by writing ON employees.department_id equals departments.department_id. Now, let's run it, and there.
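Assembled, the INNER JOIN query described above looks roughly like this (a sketch; the exact columns selected in the video may differ, but the employee_data dataset, the employees and departments tables, and the department_id key are the ones set up earlier):
SELECT
  employees.name AS employee_name,
  employees.department_id AS department_id,
  departments.name AS department_name
FROM
  employee_data.employees AS employees
INNER JOIN
  employee_data.departments AS departments
  ON employees.department_id = departments.department_id -- only rows with a department_id match in both tables are returned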
Now we've got a list of employee names and department IDs for the employees that have those IDs. But we could use a LEFT or RIGHT JOIN to return a list of all employee names and their departments when available. Let's try both really quickly. This will start similar to the last query; we'll put in SELECT, AS, and FROM again. But this time we'll say LEFT JOIN and use ON like we did with the last query. When we execute the query, we get back this new list with the employee names and departments. But you'll notice there are null values. These are places where the right table, which is departments in this case, didn't have corresponding values. Let's try RIGHT JOIN just to test it out. This query will be almost the same. The only difference is that we'll use the RIGHT JOIN clause to return all the rows from the right table, whether they have matching values in the table to the left of the JOIN statement or not. In this case, the right table is departments. Now, let's try out one last JOIN: OUTER. OUTER JOIN will fetch all of the employee names and departments. Again, this query will start a lot like the other ones we've done; we'll use SELECT, AS, and FROM to choose what data we want and how. We'll grab this from the employees table, and put FULL OUTER JOIN with the departments table to get all of the records from both. We'll also use ON again here. Now we can run this, and we'll get all of the employee names and departments from these tables. There will be nulls in the department name, employee name, and role columns wherever the joined tables don't have matching values. And there. Now you know how JOINs work. JOINs are super useful when you need to work with data from multiple related tables. They give you a lot of flexibility with how you combine and view that data. If you ever have trouble remembering what INNER, RIGHT, LEFT, or OUTER JOIN do, just think back to our Venn diagram. We'll keep learning about aggregating data in SQL next time. See you soon.

Secret identities: The importance of aliases

In this reading, you will learn about using aliasing to simplify your SQL queries. Aliases are used in SQL queries to create temporary names for a column or table. Aliases make referencing tables and columns in your SQL queries much simpler when you have table or column names that are too long or complex to use in queries. Imagine a table name like special_projects_customer_negotiation_mileages. That would be difficult to retype every time you use that table. With an alias, you can create a meaningful nickname that you can use for your analysis. In this case, "special_projects_customer_negotiation_mileages" can be aliased to simply "mileage." Instead of having to write out the long table name, you can use a meaningful nickname that you decide.

Basic syntax for aliasing

Aliasing is the process of using aliases. In SQL queries, aliases are implemented with the AS command. To alias a table, AS is preceded by the table name and followed by the new nickname. It is a similar approach to aliasing a column: AS follows the column name and is followed by the nickname. In both cases, you now have a new name that you can use to refer to the column or table that was aliased.

Alternate syntax for aliases

If using AS results in an error when running a query because the SQL database you are working with doesn't support it, you can leave it out.
Without AS, the alternate syntax for aliasing a table or column in the previous examples would simply be:

FROM table_name alias_name
SELECT column_name alias_name

The key takeaway is that queries can run with or without using AS for aliasing, but using AS has the benefit of making queries more readable. It helps to make aliases stand out more clearly.

Aliasing in action

Let's check out an example of a SQL query that uses aliasing. Let's say that you are working with two tables: one of them has employee data and the other one has department data. The FROM statement to alias those tables could be: FROM work_day.employees AS employees. These aliases still let you know exactly what is in these tables, but now you don't have to manually input those long table names. Aliases can be really helpful for long, complicated queries. It is easier to read and write your queries when you have aliases that tell you what is included within your tables.

For more information

If you are interested in learning more about aliasing, here are some resources to help you get started:
SQL Aliases: This tutorial on aliasing is a really useful resource to have when you start practicing writing queries and aliasing tables on your own. It also demonstrates how aliasing works with real tables.
SQL Alias: This detailed introduction to aliasing includes multiple examples. This is another great resource to reference if you need more examples.
Using Column Aliasing: This guide focuses on column aliasing specifically. Generally, you will be aliasing entire tables, but if you find yourself needing to alias just a column, this is a great resource to have bookmarked.

Using JOINs effectively

In this reading, you will review how JOINs are used and will be introduced to some resources that you can use to learn more about them. A JOIN combines tables by using a primary or foreign key to align the information coming from both tables in the combination process. JOINs use these keys to identify relationships and corresponding values across tables. If you need a refresher on primary and foreign keys, refer to the glossary for this course, or go back to Databases in data analytics.

The general JOIN syntax

The JOIN statement is part of the FROM clause of the query. JOIN in SQL indicates that you are going to combine data from two tables. ON in SQL identifies how the tables are to be matched for the correct information to be combined from both.

Types of JOINs

There are four general ways to conduct JOINs in SQL queries: INNER, LEFT, RIGHT, and FULL OUTER. In the Venn diagram often used to illustrate them, the circles represent the left and right tables, and the records returned by the JOIN are highlighted. Here is what these different JOIN queries do.

INNER JOIN

INNER is optional in this SQL query because it is the default as well as the most commonly used JOIN operation. You may see this as JOIN only. INNER JOIN returns records if the data lives in both tables. For example, if you use INNER JOIN for the 'customers' and 'orders' tables and match the data using the customer_id key, you would combine the data for each customer_id that exists in both tables. If a customer_id exists in the customers table but not the orders table, data for that customer_id isn't joined or returned by the query.
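A minimal sketch of the query described above, using the table and column names from this example (customers, orders, customer_name, product_id, ship_date, and the customer_id key), might be:

SELECT
  customers.customer_name,
  orders.product_id,
  orders.ship_date
FROM
  customers
INNER JOIN
  orders
  ON customers.customer_id = orders.customer_id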
The results from the query might look like the following, where customer_name is from the customers table and product_id and ship_date are from the orders table:

customer_name            product_id    ship_date
Martin's Ice Cream       043998        ...
Beachside Treats         872012        ...
Mona's Natural Flavors   724956        ...
...etc.                  ...etc.       ...etc.

The data from both tables was joined together by matching the customer_id common to both tables. Notice that customer_id doesn't show up in the query results. It is simply used to establish the relationship between the data in the two tables so the data can be joined and returned.

LEFT JOIN

You may see this as LEFT OUTER JOIN, but most users prefer LEFT JOIN. Both are correct syntax. LEFT JOIN returns all the records from the left table and only the matching records from the right table. Use LEFT JOIN whenever you need the data from the entire first table and values from the second table, if they exist. For example, a LEFT JOIN of the customers and sales tables will return customer_name with the corresponding sales_rep, if it is available. If there is a customer who did not interact with a sales representative, that customer would still show up in the query results but with a NULL value for sales_rep. The results from the query might look like the following, where customer_name is from the customers table and sales_rep is from the sales table. Again, the data from both tables was joined together by matching the customer_id common to both tables, even though customer_id wasn't returned in the query results.

customer_name            sales_rep
Martin's Ice Cream       Luis Reyes
Beachside Treats         NULL
Mona's Natural Flavors   Geri Hall
...etc.                  ...etc.

RIGHT JOIN

You may see this as RIGHT OUTER JOIN or RIGHT JOIN. RIGHT JOIN returns all records from the right table and the corresponding records from the left table. Practically speaking, RIGHT JOIN is rarely used. Most people simply switch the tables and stick with LEFT JOIN. Using the previous example for LEFT JOIN, you could write an equivalent query with RIGHT JOIN by switching the order of the tables. The query results are the same as the previous LEFT JOIN example.

customer_name            sales_rep
Martin's Ice Cream       Luis Reyes
Beachside Treats         NULL
Mona's Natural Flavors   Geri Hall
...etc.                  ...etc.

FULL OUTER JOIN

You may sometimes see this as FULL JOIN. FULL OUTER JOIN returns all records from the specified tables. You can combine tables this way, but remember that this can potentially be a large data pull as a result. FULL OUTER JOIN returns all records from both tables even if data isn't populated in one of the tables. For example, a FULL OUTER JOIN of the customers and orders tables returns all customers and their products' shipping dates. Because you are using a FULL OUTER JOIN, you may get customers returned without corresponding shipping dates, or shipping dates without corresponding customers. A NULL value is returned if corresponding data doesn't exist in either table. The results from the query might look like the following.

customer_name            ship_date
Martin's Ice Cream       2021-02-23
Beachside Treats         2021-02-25
NULL                     2021-02-25
The Daily Scoop          NULL
Mountain Ice Cream       NULL
Mona's Natural Flavors   2021-02-28
...etc.                  ...etc.

For more information

JOINs are going to be useful for working with relational databases and SQL, and you will have plenty of opportunities to practice them on your own. Here are a few other resources that can give you more information about JOINs and how to use them:
SQL JOINs: This is a good basic explanation of JOINs with examples.
If you need a quick reminder of what the different JOINs do, this is a great resource to bookmark and come back to later.
Database JOINs - Introduction to JOIN Types and Concepts: This is a really thorough introduction to JOINs. Not only does this article explain what JOINs are and how to use them, but it also explains in more detail the various scenarios in which you would use the different JOINs and why. This is a great resource if you are interested in learning more about the logic behind JOINing.
SQL JOIN Types Explained in Visuals: This resource has a visual representation of the different JOINs. This is a really useful way to think about JOINs if you are a visual learner, and it can be a really useful way to remember the different JOINs.
SQL JOINs: Bringing Data Together One Join at a Time: Not only does this resource have a detailed explanation of JOINs with examples, but it also provides example data that you can use to follow along with their step-by-step guide. This is a useful way to practice JOINs with some real data.
SQL JOIN: This is another resource that provides a clear explanation of JOINs and uses examples to demonstrate how they work. The examples also combine JOINs with aliasing. This is a great opportunity to see how JOINs can be combined with other SQL concepts that you have been learning about in this course.

Optional: Upload the warehouse dataset to BigQuery

The next video demonstrates how to use COUNT and COUNT DISTINCT in SQL to count and return the number of certain values in a dataset. If you would like to follow along with the instructor, you will need to log in to your BigQuery account and upload the warehouse data provided as two CSV files. If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account.

Prepare for the next video

First, download the two CSV files from the attachments below:
Warehouse Orders - Warehouse (CSV file)
Warehouse Orders - Orders (CSV file)

Next, complete the following steps in your BigQuery console to upload the Warehouse Orders dataset with the two Warehouse and Orders tables.

Step 1: Open your BigQuery console and click on the project you want to upload the data to.
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your project name and select Create dataset.
Step 3: In the upcoming video, the name "warehouse_orders" will be used for the dataset. If you plan to follow along with the video, enter warehouse_orders for the Dataset ID.
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.
Step 5: In the Explorer on the left, click to expand your project, and then click the warehouse_orders dataset you just created.
Step 6: Click the Actions icon (three vertical dots) next to warehouse_orders and select Open.
Step 7: Click the blue + icon at the top right to open the Create table window.
Step 8: Under Source, for the Create table from selection, choose where the data will be coming from. Select Upload. Click Browse to select the Warehouse Orders - Warehouse CSV file you downloaded. Choose CSV from the file format drop-down.
Step 9: For Table name, enter Warehouse if you plan to follow along with the video.
Step 10: For Schema, click the Auto detect check box.
Step 11: Click Create table (blue button). You will now see the Warehouse table under your warehouse_orders dataset in your project.
Step 12: Click the warehouse_orders dataset again.
Step 13: Click the icon to open the Create table window again.
Step 14: Under Source, for the Create table from selection, choose where the data will be coming from. Select Upload. Click Browse to select the Warehouse Orders - Orders CSV file you downloaded. Choose CSV from the file format drop-down.
Step 15: For Table name, enter Orders if you plan to follow along with the video.
Step 16: For Schema, click the Auto detect check box.
Step 17: Click Create table (blue button). You will now see the Orders table under your warehouse_orders dataset in your project.
Step 18: Click the Warehouse table and click the Preview tab to verify that you have 10 rows of data.
Step 19: Click the Orders table and click the Preview tab to verify that you have the data shown below.

If your data previews match, you are ready to follow along with the next video.

Hi, it's great to have you back. By now we've discovered that spreadsheets and SQL have a lot of tools in common. Earlier in this program, we learned about COUNT in spreadsheets. Now it's time to look at similar tools in SQL: COUNT and COUNT DISTINCT. In this video, we'll talk about when you'd use these queries and check out an example. Let's get started. COUNT can be used to count the total number of numerical values within a specific range in spreadsheets. COUNT in SQL does the same thing. COUNT is a query that returns the number of rows in a specified range, but COUNT DISTINCT is a little different. COUNT DISTINCT is a query that only returns the distinct values in that range. Basically, this means COUNT DISTINCT doesn't count repeating values. As a data analyst, you'll use COUNT and COUNT DISTINCT anytime you want to answer questions about how many. Like how many customers did this? Or how many transactions were there this month? Or how many dates are in this dataset? And you'll use them throughout the data analysis process at different stages. For example, you might need them while you're cleaning data to check how many rows are left in your dataset. Or you might use COUNT and COUNT DISTINCT during the actual analysis to answer a "how many" question. You'll run into these kinds of questions a lot, so COUNT and COUNT DISTINCT are really useful to know. But let's check out an example to see COUNT and COUNT DISTINCT in action. For this example, we're working with a company that manufactures socks. We have two tables: Warehouse and Orders. Let's take a quick look at these tables before we start querying. First, we'll check out the Warehouse table. You can see the columns here: warehouse ID, warehouse alias, the maximum capacity, the total number of employees, and the state the warehouse is located in. We'll pull up the top 100 rows of the Orders table next. We can use LIMIT here to limit the number of rows returned. This is useful if you're working with large datasets, especially if you just want to explore a small sample of that dataset. For this query, we're actually going to start with a FROM statement so that we can alias our tables. Aliasing is when you temporarily name a table or column in your query to make it easier to read and write. Because these names are temporary, they only last for the given query. We can use our FROM statement to write in what our tables' aliases are going to be, to save us some time in other parts of the query. So we'll start with FROM and use aliasing to name the Warehouse Orders table just "orders."
Let's say we need both the warehouse details and the order details because we want to report on the distribution of orders by state. We're going to JOIN these two tables together since we want data from both of them, and alias our warehouse table in the process. In this case, we're using JOIN as shorthand for INNER JOIN because we want corresponding data from both tables. And now that we have the aliases in place, let's build out the SELECT statement that comes before FROM. Let's run that. And there. Now we have data from both tables joined together, and we know how to create these handy aliases. Now, we want to count how many states are in our order data. To do that, we'll use COUNT and COUNT DISTINCT. We can try a simple COUNT query first. We'll JOIN the Orders and Warehouse tables in our FROM statement. And in this case we'll start with SELECT and COUNT the number of states. Let's run this query and see what we get. Wait, that's not quite right. This query returned over 9,000 states because we counted every single row that included a state. But we actually want to count the distinct states. Let's try this again with COUNT DISTINCT. This query is going to look similar to the last one, but we'll use DISTINCT to cut out the repeated instances we got the last time. We'll use the query we just built, but replace COUNT with COUNT DISTINCT in our SELECT statement. Let's try this query. That's more like it. According to these results, we have three distinct states in our Orders data. Let's check out what happens when we group by the state column in the warehouse table, which we'll call warehouse dot state. We'll use JOIN and GROUP BY in our FROM statement. Let's start there again. Then GROUP BY warehouse state. Now let's build out our SELECT statement on top of that. We're still going to use COUNT DISTINCT. Let's run it. Now we have three rows, one for each state represented in the Orders data. And our COUNT DISTINCT of the number of orders across these rows adds up to the count we ran earlier: 9,999. You'll find yourself using COUNT and COUNT DISTINCT during every stage of the data analysis process. Understanding what these queries are and how they are different is key. Great job, and I'll see you again soon!

SQL functions and subqueries: A functional friendship

In this reading, you will learn about SQL functions and how they are sometimes used with subqueries. SQL functions are tools built into SQL to make it possible to perform calculations. A subquery (also called an inner or nested query) is a query within another query.

How do SQL functions, function?

SQL functions are what help make data aggregation possible. (As a reminder, data aggregation is the process of gathering data from multiple sources in order to combine it into a single, summarized collection.) So, how do SQL functions work? Going back to W3Schools, let's review some of these functions to get a better understanding of how to run these queries:
SQL HAVING: This is an overview of the HAVING clause, including what it is and a tutorial on how and when it works.
SQL CASE: Explore the usage of the CASE statement and examples of how it works.
SQL IF: This is a tutorial on the IF function and offers examples that you can practice with.
SQL COUNT: The COUNT function is just as important as all the rest, and this tutorial offers multiple examples to review.
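To tie these functions back to the warehouse example from the video, the final COUNT DISTINCT query with GROUP BY might look roughly like the following sketch. The dataset and table names come from the setup steps above, but the column names order_id and warehouse_id are assumptions based on how the transcript describes the tables.

SELECT
  warehouse.state AS state,
  COUNT(DISTINCT orders.order_id) AS distinct_order_count
FROM
  warehouse_orders.Orders AS orders
JOIN
  warehouse_orders.Warehouse AS warehouse
  ON orders.warehouse_id = warehouse.warehouse_id
GROUP BY
  warehouse.state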
Subqueries - the cherry on top

Think of a query as a cake. A cake can have multiple layers contained within it and even layers within those layers. Each of these layers is a subquery, and when you put all of the layers together, you get a cake (query). Usually, you will find subqueries nested in the SELECT, FROM, and/or WHERE clauses. There is no general syntax for subqueries, but the syntax for a basic subquery is as follows:

SELECT account_table.*
FROM (
    SELECT *
    FROM transaction.sf_model_feature_2014_01
    WHERE day_of_week = 'Friday'
) account_table
WHERE account_table.availability = 'YES'

You will find that within the first SELECT clause is another SELECT clause. The second SELECT clause marks the start of the subquery in this statement. There are many different ways in which you can make use of subqueries, and the resources referenced below will provide additional guidance as you learn. But first, let's recap the subquery rules. There are a few rules that subqueries must follow:
Subqueries must be enclosed within parentheses.
A subquery can have only one column specified in the SELECT clause. But if you want a subquery to compare multiple columns, those columns must be selected in the main query.
Subqueries that return more than one row can only be used with multiple value operators, such as the IN operator, which allows you to specify multiple values in a WHERE clause.
A subquery can't be nested in a SET command. The SET command is used with UPDATE to specify which columns (and values) are to be updated in a table.

Additional resources

The following resources offer more guidance into subqueries and their usage:
SQL subqueries: This detailed introduction includes the definition of a subquery, its purpose in SQL, when and how to use it, and what the results will be.
Writing subqueries in SQL: Explore the basics of subqueries in this interactive tutorial, including examples and practice problems that you can work through.

As you continue to learn more about using SQL, functions, and subqueries, you will realize how much time you can truly save once you memorize these tips and tricks.

Welcome back! One of the first calculations most kids learn how to do is counting. Soon after, they learn adding, and that doesn't go away. No matter what age we are, we're always counting or adding something, whether it's change at the grocery store or measurements in a recipe. Data analysts do a lot of counting and adding too. And with the amount of data you'll come across as a data analyst, you'll be grateful to have functions that can do the counting and adding for you. So let's learn how these functions, COUNTIF and SUMIF, can help you do calculations for your analysis more easily and accurately. We'll start with the COUNTIF function. You might remember COUNTIF from some of the earlier videos about data cleaning. COUNTIF returns the number of cells that match a specified value. Earlier, we showed how COUNTIF can be used to find and count errors in a data set. Here we'll only be counting. Just a reminder though, while we won't be actively searching for errors in this video, you'll still want to watch out for any data that doesn't look right when doing your own analysis. As a data analyst, you'll look for and fix errors every step of the way. For this example, we'll look at a sample of data from an online kitchen supplies retailer.
Our stakeholders have asked us to answer a few questions about the data to understand more about customer transactions, including the revenue they're bringing in. We've added the questions we need to answer to the spreadsheet. We'll set up a simple summary table, which is a table used to summarize statistical information about data. We'll use the questions to create the attributes for our table columns: count, revenue total, and average revenue per transaction. Each of our questions asks about transactions with one item or transactions with more than one item, so those will be the observations for our rows. We'll make Quantity the heading for our observations. We'll also add borders to make the summary table nice and clear. The first question asks: How many transactions include exactly one item? To answer this, we'll add a formula using the COUNTIF function in cell G11. We'll begin with an equal sign, COUNTIF, and an open parenthesis. Column B has data about quantity, so we'll select cells B3 through B50, followed by a comma. Next, we need to tell the formula the value that we're looking for in the cells we've selected. We want the formula to count the number of transactions that equal 1. In this case, between quotation marks, we'll type an equal sign and the number 1 because that's the exact value we need to count. When we add a closed parenthesis and press Enter, we get the total count for transactions with only one item, which is 25. We can follow the same steps to count values greater than one. But this time, because we only want values greater than 1, we'll type a greater-than sign in our formula instead of an equal sign. Getting this information helps us compare the data about quantity. Okay, now we need to find out how much total revenue each transaction type brought in. Since the data isn't organized by quantity, we'll use the SUMIF function to help us add the revenue for transactions with one item and transactions with more than one item separately. SUMIF is a function that adds numeric data based on one condition. Building a formula with SUMIF is a bit different than one with COUNTIF. They both start the same way, with an equal sign and the function, but a SUMIF formula contains the range of cells to be evaluated by your criteria, and the criteria. In other words, SUMIF has a list of cells to check based on the criteria you set in the formula. Then the range where we want to add the numbers is placed in the formula, if that range is different from the range being evaluated. There are commas between each of these parts. Adding a space after each comma is optional. So let's try this. In cell H11, we'll type our formula. The range to be evaluated is in column B, so we'll select those cells.
The condition we want the data to meet is for the values in the column to be equal to one. So we'll type a comma and then, inside quotes, an equal sign and the number one. Then we'll select the range to be added based on whether the data from our first range is equal to one. This range is in column C, which lists the revenue for each transaction. So every amount of revenue earned from a transaction with only one item will be added together. And there's our total. Since this is revenue, we'll change the format of the number to currency, so it shows up as dollars and cents. So the transactions with exactly one item earned $1,555.00 in revenue. Let's see how much the transactions with more than one item earned. Just like with our COUNTIF examples, the second SUMIF formula will be the same as the first, except for the condition, which will be greater than one. Okay, let's check out the results. When we run the formula, we discover that the revenue total is much higher: $4,735.00. This makes sense, since the revenue is coming from transactions with more than one item. Good news. To complete our objective, we'll do two more quick calculations. First, we'll find the average revenue per transaction by dividing each total by its count. This will show our stakeholders how much of a difference there is in revenue per transaction between one-item and multiple-item transactions. This information could be useful for lots of reasons. For example, figuring out whether to add a discount on purchases with more than one item to encourage customers to buy more. We'll put these calculations in the last column of our summary table. You might remember that we use a slash in a formula as the operator for division calculations. The average revenue for transactions with one item is $62.20. And the average revenue for transactions with more than one item is $205.87. And that's it for our analysis. Our summary table now gives the stakeholders and team members a snapshot of the analysis that's easy to understand. Our COUNTIF and SUMIF functions played a big role here. Using these functions to complete calculations, especially in large datasets, can help speed up your analysis. They can also make counting and adding a little more interesting. Nothing wrong with that. And coming up, we'll explore more functions to make your calculations run smoothly. Bye for now.

Functions with multiple conditions

In this reading, you will learn more about conditional functions and how to construct functions with multiple conditions. Recall that conditional functions and formulas perform calculations according to specific conditions. Previously, you learned how to use functions like SUMIF and COUNTIF that have one condition. You can use the SUMIFS and COUNTIFS functions if you have two or more conditions. You will learn their basic syntax in Google Sheets and check out an example. Refer to the resources at the end of this reading for information about similar functions in Microsoft Excel.
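As a quick recap before moving on to multiple conditions, the single-condition formulas built in the preceding video would look roughly like this. The quantity range B3:B50 is stated in the transcript; the revenue range C3:C50 is an assumption based on the description of column C.

=COUNTIF(B3:B50, "=1")
=COUNTIF(B3:B50, ">1")
=SUMIF(B3:B50, "=1", C3:C50)
=SUMIF(B3:B50, ">1", C3:C50)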
SUMIF to SUMIFS

The basic syntax of a SUMIF function is: =SUMIF(range, criterion, sum_range)
The range is where the function will search for the condition that you have set. The criterion is the condition you are applying, and the sum_range is the range of cells that will be included in the calculation. For example, you might have a table with a list of expenses in column A, their cost in column B, and the date they occurred in column C, with headers in row 1:

Expense   Price     Date
Fuel      $48.00    12/14/2020
Food      $12.34    12/14/2020
Taxi      $21.57    12/14/2020
Coffee    $2.50     12/15/2020
Fuel      $36.00    12/15/2020
Taxi      $15.88    12/15/2020
Coffee    $4.15     12/15/2020
Food      $6.75     12/15/2020

You could use SUMIF to calculate the total price of fuel in this table, with the Expense column as the range, Fuel as the criterion, and the Price column as the sum_range. But you could also build in multiple conditions by using the SUMIFS function. SUMIF and SUMIFS are very similar, but SUMIFS can include multiple conditions. The basic syntax is: =SUMIFS(sum_range, criteria_range1, criterion1, [criteria_range2, criterion2, ...])
The square brackets indicate that the additional criteria ranges and criteria are optional. The ellipsis at the end of the statement lets you know that you can have as many repetitions of these parameters as needed. For example, if you wanted to calculate the sum of the fuel costs for one date in this table, you could create a SUMIFS statement with multiple conditions: the Price column as the sum_range, the Expense column and Fuel as the first criteria_range and criterion, and the Date column and the date as the second. This formula gives you the total cost of every fuel expense from the date listed in the conditions. In this example, C1:C9 is our second criteria_range and the date 12/15/2020 is the second condition. As long as you follow the basic syntax, you can add up to 127 conditions to a SUMIFS statement!

COUNTIF to COUNTIFS

Just like the SUMIFS function, COUNTIFS allows you to create a COUNTIF function with multiple conditions. The basic syntax for COUNTIF is: =COUNTIF(range, criterion)
Just like SUMIF, you set the range and then the condition that needs to be met. For example, if you wanted to count the number of times Food came up in the Expenses column, you could use a COUNTIF function with the Expense column as the range and Food as the criterion. COUNTIFS has the same basic syntax as SUMIFS: =COUNTIFS(criteria_range1, criterion1, [criteria_range2, criterion2, ...])
The criteria_range and criterion are in the same order, and you can add more conditions to the end of the function. So, if you wanted to find the number of times Coffee appeared in the Expenses column on 12/15/2020, you could use COUNTIFS to apply both of those conditions: the Expense column and Coffee as the first criteria_range and criterion, and the Date column and the date as the second. This formula follows the basic syntax to create conditions for "Coffee" and the specific date. Now we can find every instance where both of these conditions are true.

For more information

SUMIFS and COUNTIFS are just two examples of functions with multiple conditions. They help demonstrate how multiple conditions can be built into the basic syntax of a function. But there are other functions with multiple conditions that you can use in your data analysis. There are a lot of resources available online to help you get started with these other functions:
How to use the Excel IFS function: This resource includes an explanation and example of the IFS function in Excel. This is a great reference if you are interested in learning more about IFS. The example is a useful way to understand this function and how it can be used.
VLOOKUP in Excel with multiple criteria: Similar to the previous resource, this resource goes into more detail about how to use VLOOKUP with multiple criteria. Being able to apply VLOOKUP with multiple criteria will be a useful skill, so check out this resource for more guidance on how you can start using it on your own spreadsheet data.
INDEX and MATCH in Excel with multiple criteria: This resource explains how to use the INDEX and MATCH functions with multiple criteria. It also includes an example which helps demonstrate how these functions work with multiple criteria and actual data.
Using IF with AND, OR, and NOT functions in Excel: This resource combines IF with AND, OR, and NOT functions to create more complex functions. By combining these functions, you can perform your tasks more efficiently and cover more criteria at once.

Welcome back. In the last video, we created a pivot table of movie data and revenue calculations to help our manager think through new movie ideas. We used our pivot table to make some initial observations about annual revenue. We also discovered that the average revenue for 2015 was lower than other years, even though more movies were released that year. We hypothesized that this was because more movies that earned less than $10 million in revenue were released in 2015. To test this theory, we created a copy of our original pivot table. Now we are going to apply filters and calculated fields to explore the data more. Let's get started. You'll remember that the filter option lets us view only the values we need. We'll select a cell in our copied pivot table and add a filter to the box office revenue column. The filter will then be applied to the entire table. When we open the status menu, we can choose to filter the data to show specific values. But in our case, we want to filter by condition so we can figure out how many movies in each year earned less than $10 million. The condition we'll use in our filter is less than, and our value will be $10 million, which is why we renamed these columns earlier. We'll type our number in a dollar-and-cents format so the condition matches the data in our pivot table. This might not be necessary, but it prevents potential errors from happening. Now we know that 20 movies released in 2015 made less than $10 million. This seems like a high number compared to the other years. But keep in mind, there were more movies from our data set released in 2015. Before we move on, let's use a calculated field to verify our average, because it was copied from another pivot table before we filtered it. That way we can check that it's correct. We'll create a customized column called a calculated field using our Values menu. A calculated field is a new field within a pivot table that carries out certain calculations based on the values of other fields. You can do this in Excel too, using field settings and the create formula menu. For the formula in our calculated field, we'll use the SUM function and divide the sum of the box office revenue data from our original table by the count of the same data. Because we applied our filter to this pivot table earlier, this formula will only return the average revenue of movies under $10 million. That worked. We were able to check the accuracy of some of our data before analyzing it. Always a good thing. But it's still difficult to tell how much of an impact these lower-earning movies had on the average revenue.
Let's run a quick formula to find the percentage of movies for each year that earned less than $10 million. This will make it easier to compare from year to year. Instead of a calculated field, we'll add this as a formula in a new column; that way we can pull data from both of our pivot tables. We'll put a header for our table in cell G10 and name it Percent of total movies. Then we'll add our formula to the next cell in the column: divide the number of movies in the copied table by the number of movies in the original table. Then we'll use the fill handle in the cell with the formula and drag it to apply the formula to the rest of the years. Finally, we'll format these numbers as percentages. Now our analysis shows that 16 percent of the movies released in 2015 earned less than $10 million in revenue. The other years are all close to 10 percent. This is one possible explanation for why the average revenue is comparatively low in 2015. In real life, we'd most likely need to take our analysis even further, depending on our goals. But for now, we're all set. You've learned how you can use pivot tables to perform data calculations. It will take practice, but pivot tables are worth it because they do more than calculate. They organize and filter data too. Together we've covered functions, formulas, and pivot tables. All great tools to use in analysis. With practice and experience, it will feel like you've used them forever. Just take your time getting to know how they work. Keep exploring these videos and the readings. Great work.

Elements of a pivot table

Previously, you learned that a pivot table is a tool used to sort, reorganize, group, count, total, or average data in spreadsheets. In this reading, you will learn more about the parts of a pivot table and how data analysts use them to summarize data and answer questions about their data. Pivot tables make it possible to view data in multiple ways in order to identify insights and trends. They can help you quickly make sense of larger data sets by comparing metrics, performing calculations, and generating reports. They're also useful for answering specific questions about your data. A pivot table has four basic parts: rows, columns, values, and filters. The rows of a pivot table organize and group the data you select horizontally. For example, in the Working with pivot tables video, the Release Date values were used to create rows that grouped the data by year. The columns organize and display values from your data vertically. Similar to rows, columns can be pulled directly from the data set or created using values. Values are used to calculate and count data. This is where you input the variables you want to measure. This is also how you create calculated fields in your pivot table. As a refresher, a calculated field is a new field within a pivot table that carries out certain calculations based on the values of other fields. In the previous movie data example, the Values editor created columns for the pivot table, including the SUM of Box Office Revenue, the AVERAGE of Box Office Revenue, and the COUNT of Box Office Revenue columns. Finally, the filters section of a pivot table enables you to apply filters based on specific criteria, just like filters in regular spreadsheets! For example, a filter was added to the movie data pivot table so that it only included movies that generated less than $10 million in revenue.
Being able to use all four parts of the pivot table editor will allow you to compare different metrics from your data and execute calculations, which will help you gain valuable insights.

Using pivot tables for analysis

Pivot tables can be a useful tool for answering specific questions about a dataset so you can quickly share answers with stakeholders. For example, a data analyst working at a department store was asked to determine the total sales for each department and the number of products each department sold. They were also interested in knowing exactly which department generated the most revenue. Instead of making changes to the original spreadsheet data, they used a pivot table to answer these questions and easily compare the sales revenue and number of products sold by each department. They used the department as the rows for this pivot table to group and organize the rest of the sales data. Then, they input two Values as columns: the SUM of sales and a count of the products sold. They also sorted the data by the SUM of sales column in order to determine which department generated the most revenue. Now they know that the Toys department generated the most revenue! Pivot tables are an effective tool for data analysts working with spreadsheets because they highlight key insights from the spreadsheet data without having to make changes to the spreadsheet. Coming up, you will create your own pivot table to analyze data and identify trends that will be highly valuable to stakeholders.

In this reading, you will learn how to create and use pivot tables for data analysis. You will also get some resources about pivot tables that you can save for your own reference when you start creating pivot tables yourself. Pivot tables are a spreadsheet tool that lets you view data in multiple ways to find insights and trends. Pivot tables allow you to make sense of large data sets by giving you tools to easily compare metrics, quickly perform calculations, and generate readable reports. You can create a pivot table to help you answer specific questions about your data. For example, if you were analyzing sales data, you could use pivot tables to answer questions like, "Which month had the most sales?" and "What products generated the most revenue this year?" When you need answers to questions about your data, pivot tables can help you cut through the clutter and focus on only the data you need.

Create your pivot table

Before you can analyze data with pivot tables, you will need to create a pivot table with your data. The following includes the steps for creating a pivot table in Google Sheets, but most spreadsheet programs will have similar tools. First, you will open the Data menu from the toolbar; there will be an option for Pivot table. A pop-up menu will appear with options to insert the pivot table into a New sheet or an Existing sheet, along with a Create button. Generally, you will want to create a new sheet for your pivot table to keep your raw data and your analysis separate. You can also store all of your calculations in one place for easy reference. Once you have created your pivot table, there will be a pivot table editor that you can access to the right of your data. This is where you will be able to customize your pivot table, including which variables you want to include for your analysis.

Using your pivot table for analysis

You can perform a wide range of analysis tasks with your pivot tables to quickly draw meaningful insights from your data, including performing calculations, sorting, and filtering your data.
Below is a list of online resources that will help you learn about performing basic calculations in pivot tables, as well as resources for learning about sorting and filtering data in your pivot tables.

Perform calculations

Microsoft Excel:
Calculate values in a pivot table: Microsoft Support's introduction to calculations in Excel pivot tables. This is a useful starting point if you are learning how to perform calculations with pivot tables specifically in Excel.
Pivot table calculated field example: This resource includes a detailed example of a pivot table being used for calculations. This step-by-step process demonstrates how calculated fields work, and provides you with some idea of how they can be used for analysis.
Pivot table calculated fields: step-by-step tutorial: This tutorial for creating your own calculated fields in pivot tables is a really useful resource to save and bookmark for when you start to apply calculated fields to your own spreadsheets.

Google Sheets:
Create and use pivot tables: This covers pivot tables in Google Sheets, including creating calculated fields. It is a resource to save and reference as a quick reminder about calculated fields.
All about calculated fields in pivot tables: A comprehensive guide to calculated fields. If you are working with Sheets and want to learn more about pivot tables, this is a good place to start.
Pivot tables in Google Sheets: This covers the basics of pivot tables and calculated fields and uses examples and how-to videos to demonstrate these concepts.

Sort your data

Microsoft Excel:
Sort data in a pivot table or PivotChart: This is a Microsoft Support how-to guide to sorting data in pivot tables. This is a useful reference if you are working with Excel and are interested in checking out how sorting works in Excel specifically.
Pivot tables - Sorting data: This tutorial for sorting data in pivot tables includes an example with real data that demonstrates how sorting in Excel pivot tables works. This example is a great way to experience the entire process from start to finish.
How to sort a pivot table by value: This source uses an example to explain sorting by value in pivot tables. It includes a video, which is a useful guide if you need a demonstration of the process.

Google Sheets:
Customize a pivot table: This guide focuses on sorting pivot tables in Google Sheets. It is a quick reference if you are working on a project and need a step-by-step guide.
How to sort pivot table columns: This uses real data to demonstrate how the sorting process for Google Sheets pivot tables works. It is a slightly more detailed guide with screenshots of the Sheets environment.
Pivot table ascending and descending order: This beginner's guide is a great way to brush up on sorting in pivot tables if you are interested in a quick overview.

Filter your data

Microsoft Excel:
Filter data in a pivot table: This resource from the Microsoft Support page provides an explanation of filtering data in pivot tables in Excel. If you are working in Excel spreadsheets, this is a great resource to have bookmarked for quick reference.
How to filter Excel pivot table data: This how-to guide for filtering data in pivot tables demonstrates the filtering process in an Excel spreadsheet with data, and includes tips and reminders for when you start using these tools on your own.

Google Sheets:
Customize a pivot table: This is the Google Sheets guide to filtering pivot table data. It is a useful resource if you are working with pivot tables in Google Sheets and need a quick resource to review the process.
Filter multiple values in pivot table: This explains how to filter for multiple values in Google Sheets pivot tables. It expands on some of the functions you have already learned and sets you up to create filters in your own Google Sheets.
Format your data

Microsoft Excel:
Design the layout and format of a PivotTable: This Microsoft Support article describes how to change the format of the PivotTable by applying a predefined style, banded rows, and conditional formatting.

Google Sheets:
Create and edit pivot tables: This article provides information about editing a pivot table to change its style.

Pivot tables are a powerful tool that you can use to quickly perform calculations and gain meaningful insights into your data directly from the spreadsheet file you are working in! By using pivot table tools to calculate, sort, and filter your data, you can immediately make high-level observations about your data that you can share with stakeholders in reports. But, like most tools we have covered in this course, the best way to learn is to practice. This was just a small taste of what you can do with pivot tables, but the more you work with pivot tables, the more you will discover.

Optional: Upload the avocado dataset to BigQuery

Using public datasets is a great way to practice working with SQL. Later in the course, you are going to use historical data on avocado prices to perform calculations in BigQuery. This is a step-by-step guide to help you load this data into your own BigQuery console so that you can follow along with the upcoming video. If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers how to set up a BigQuery account.

Step 1: Download the CSV file from Kaggle

Avocado prices: The publicly available avocado dataset from Kaggle you are going to use (made available by Justin Kiggins under an Open Data Commons license). You can download this data onto your own device and then upload it to BigQuery. There are also other public datasets on Kaggle that you can download and use. You can follow these steps to load them into your console and practice on your own! You will find some more information about the avocado dataset, including the context, content, and original source, on the Kaggle page. For now, you can simply download the file.

Step 2: Open your BigQuery console and create a new dataset

Open BigQuery. After you have downloaded the dataset from Kaggle, you can upload it to your BigQuery console. In the Explorer on the left side of your console, click the project where you want to add a dataset. Note that your project will not be named the same as the one in the example ("oval-flow-286322"). Don't choose "bigquery-public-data" as your project because that's a public project that you can't change. Click the Actions icon (three vertical dots) next to your project and select Create dataset. Here, you will name the dataset; in this case, enter avocado_data. Then, click Create dataset (blue button) at the bottom to create your new dataset. This will add the dataset to the Explorer on the left of your console.

Step 3: Open the new dataset and create a new table

Navigate to the dataset in your console by clicking to expand your project and selecting the correct dataset listed. In this case, it will be avocado_data. Click the Actions icon (three vertical dots) next to your dataset and select Open. Then click the + icon to create a table. Next, do the following: Under Source, for the Create table from selection, select Upload. Click Browse to select the CSV file you just downloaded to your computer from Kaggle.
The file format should automatically change from Avro to CSV when you select the file. For Table name, enter avocado_prices for the table. For Schema, click the Auto detect check box. Then, click Create table (blue button). In the Explorer, the avocado data will appear in the table under the dataset you created. Now you are ready to follow along with the video and learn more about performing calculations with queries!

Further reading

Introduction to loading data: This step-by-step guide is a useful resource that you can bookmark and save for later. You can refer to it the next time you need to load data into BigQuery.

Hi again. Earlier, we covered data validation, a spreadsheet function that adds drop-down lists to cells. Using data validation lets you control what can and can't be entered into your worksheet. One of its uses is protecting structured data and formulas in your spreadsheets. But as useful as it is, the data validation function is just one part of a larger data validation process. This process involves checking and rechecking the quality of your data so that it is complete, accurate, secure, and consistent. While the data validation process is a form of data cleaning, you should use it throughout your analysis. If this all sounds familiar to you, that's good. Ensuring you have good data is super important. And in my opinion, it's kind of fun, because you can pair your knowledge of the business with your technical skills. This will help you understand your data, check that it's clean, and make sure you're aligning with your business objectives. In other words, it's what you do to make sure your data makes sense. Keep in mind, you'll build your business knowledge with time and experience. And here's a pro tip: asking as many questions as possible whenever you need to will make this much easier. Okay, let's say we're analyzing some data for a furniture retailer. We want to check that the values in the purchase price column are always equal to the number of items sold times the product price. So we'll add a formula in a new column to recalculate the purchase prices using a multiplication formula. Now, comparing the totals, there's at least one value that doesn't match the value in the purchase price column. We need to find an answer to help us move forward with our analysis. By doing some research and asking questions, we find that there's a discount of 30% when customers buy five or more of certain items. If we hadn't run this check, we could have missed this completely. You've learned that as an analyst, calculations are a big part of your job. So it's important that whenever you do calculations, you always check to make sure you've done them in the right way. Sometimes you'll run data validation checks that are common-sense checks. For example, let's say you're working on an analysis to figure out the effectiveness of in-store promotions for a business that's only open on weekdays. You check to make sure that there's no sales data for Saturdays and Sundays. If your data does show sales on weekends, it might not be a problem with the data itself.
It might not even be a problem at all. There might be a good reason. Maybe your business hosts special events on Saturdays and Sundays. Then you would have sales for those weekends. You still might want to leave out the weekend sales in your analysis if your objective is only to look at the weekdays. But doing this data validation might save you from miscalculations and other errors in your analysis. You should always do data validation, no matter what analysis tool you're using. In an earlier video, we used SQL to analyze some data about avocados. One of the queries was a check to make sure the data showing the total number of bags was the sum of small, large, and extra-large bags. By running this query, we were able to determine that the total number column was accurate. We compared our two columns briefly in that video. But to be absolutely sure that there are no issues with the data values in those columns, we could have also run another query. In this query, we would select all using the asterisk, and FROM the avocado prices data set. In our WHERE clause, we'd also type out where our calculated total does not equal the total bags column. If no values are returned, we can be sure that the values in the Total Bags column are accurate. And that led us to continue our analysis. But when we tried to find what percent of the total number of bags was small, we ran into a small problem. We received an error message about dividing by zero. We fixed that error by adjusting our query. If we had linked that query to a presentation that went to our stakeholders, it would have shown them the divide-by-zero error instead of the figures we wanted. By building in these types of checks as part of your data validation process, you can avoid errors in your analysis and complete your business objectives to make everyone happy. And trust me, it's a great feeling when you do. And another great feeling is knowing that you've made it through another video and learned something new. And we have more where that came from coming soon. See you.

Types of data validation

This reading describes the purpose, examples, and limitations of six types of data validation. The first five are validation types associated with the data (type, range, constraint, consistency, and structure), and the sixth type focuses on the validation of application code used to accept data from user input. As a junior data analyst, you might not perform all of these validations. But you could ask if and how the data was validated before you begin working with a dataset. Data validation helps to ensure the integrity of data. It also gives you confidence that the data you are using is clean. The following list outlines six types of data validation and the purpose of each, and includes examples and limitations.

Data type validation
Purpose: Check that the data matches the data type defined for a field.
Example: Data values for school grades 1-12 must be a numeric data type.
Limitations: The data value 13 would pass the data type validation but would be an unacceptable value. For this case, data range validation is also needed.

Data range validation
Purpose: Check that the data falls within an acceptable range of values defined for the field.
Example: Data values for school grades should be values between 1 and 12.
Limitations: The data value 11.5 would be in the data range and would also pass as a numeric data type. But it would be unacceptable because there aren't half grades.
For this case, data constraint validation is also needed.
Data constraint
Purpose: Check that the data meets certain conditions or criteria for a field. This includes the type of data entered as well as other attributes of the field, such as number of characters.
Example: Content constraint: Data values for school grades 1-12 must be whole numbers.
Limitations: The data value 13 is a whole number and would pass the content constraint validation. But, it would be unacceptable since 13 isn't a recognized school grade. For this case, data range validation is also needed.
Data consistency
Purpose: Check that the data makes sense in the context of other related data.
Example: Data values for product shipping dates can't be earlier than product production dates.
Limitations: Data might be consistent but still incorrect or inaccurate. A shipping date could be later than a production date and still be wrong.
Data structure
Purpose: Check that the data follows or conforms to a set structure.
Example: Web pages must follow a prescribed structure to be displayed properly.
Limitations: A data structure might be correct with the data still incorrect or inaccurate. Content on a web page could be displayed properly and still contain the wrong information.
Code validation
Purpose: Check that the application code systematically performs any of the previously mentioned validations during user data input.
Example: Common problems discovered during code validation include: more than one data type allowed, data range checking not done, or ending of text strings not well defined.
Limitations: Code validation might not validate all possible variations with data input.
Hello again. Now, if you're like me, you always have sticky notes available nearby to write a reminder or figure out a quick math problem. Sticky notes are useful and important, but they're also disposable since you usually only need them for a short time before you recycle them. Data analysts have their own version of sticky notes when they're working in SQL. They're called temporary tables, and we're here to find out what they're all about. A temporary table is a database table that is created and exists temporarily on a database server. Temp tables, as we call them, store subsets of data from standard data tables for a certain period of time. Then they're automatically deleted when you end your SQL database session. Since temp tables aren't stored permanently, they're useful when you only need a table for a short time to complete analysis tasks, like calculations. For example, you might have a lot of tables you're performing calculations on at the same time. If you have a query that needs to join seven or eight of them, you could join the two or three tables having the fewest number of rows and store their output in a temp table. You could then join this temp table to one of the other bigger tables. Another example is when you have lots of different databases you're running queries on. You can run these initial queries in each separate database, and then use a temp table to collect the results of all of these queries. The final report query would then run on the temporary table. You might not be able to make use of this reporting structure without temporary tables. They're also useful if you've got a large number of records in a table and you need to work with a small subset of those records repeatedly to complete some calculations or other analysis. So instead of filtering the data over and over to return the subset, you can filter the data once and store it in a temporary table.
Then you can run your queries using the temporary table you've created. Imagine that you've been asked to analyze data about the bike sharing system we looked at earlier. You only need to analyze the data for bike trips that were 60 minutes or longer, but you have several questions to answer about that specific data. Using a temporary table will let you run several queries about this data without having to keep filtering it. There are different ways to create temporary tables in SQL, depending on the relational database management system you're using. We'll explore some of these options soon. For this scenario we'll use BigQuery. We'll apply a WITH clause to our query. The WITH clause is a type of temporary table that you can query from multiple times. The WITH clause approximates a temporary table. Basically, this means it creates something that does the same thing as a temporary table. Even if it doesn't add a table to the database you're working in for others to see, you can still see your results, and anyone who needs to review your work can see the code that led to your results. Let's get this query started. We'll start this query with the WITH command. We'll then name our temp table trips, underscore, over, underscore, 1, underscore, hr. Then we'll type the AS command and an open parenthesis. On a new line, we'll use the SELECT-FROM-WHERE structure for our subquery. We'll type SELECT followed by an asterisk. You might remember the asterisk means you're selecting all the columns in the table. Now we'll type the FROM command and name the database that we're pulling from: bigquery, dash, public, dash, data, dot, new, underscore, york, dot, citibike, underscore, trips. Next, we'll add a WHERE clause with the condition that the length of the bike trips we need in our temp table is greater than or equal to 60 minutes. In the query it goes like this: trip duration, space, greater than sign, equal sign, space, 60. Finally, we'll add a close parenthesis on a new line to end our subquery. And that sets up our temporary table. Now we can run queries that'll only return results for trips that lasted 60 minutes or longer. Let's try one. Since we're working in our version of a temp table, we don't need to open a new query. Instead, we'll label our queries before we add our code to describe what we're doing. For this query, we'll type two hashtags. This tells the server that this is a description and not part of the code. Next, we'll add the query description: Count how many trips are 60-plus minutes long. And then we'll add our query. SELECT, then on a new line, COUNT with an asterisk in parentheses, AS followed by cnt to name the column with our COUNT. Next we'll add FROM and the name we're using for our version of a temporary table: trips over one hour.
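Putting the pieces from this walkthrough together, the full query might look something like the sketch below. The exact column name for trip length (tripduration here) is an assumption based on the narration, so check the table's schema before you run it.

WITH trips_over_1_hr AS (
  SELECT
    *
  FROM
    `bigquery-public-data.new_york.citibike_trips`
  WHERE
    tripduration >= 60
)

## Count how many trips are 60+ minutes long
SELECT
  COUNT(*) AS cnt
FROM
  trips_over_1_hr

Running something like this returns a single row, with the number of qualifying trips in the cnt column.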
When we run our query, the results show the total number of bike trips from the dataset that lasted 60 minutes or longer. We can keep running queries on this temp table over and over as long as we're looking to analyze bike trips that were 60 minutes or longer. And if you need to end your session and start a new runtime later, most servers store the code used in temp tables. You'll just need to recreate the table by running the code. When you use temporary tables, you make your own work more efficient. Naming and using temp tables can help you deal with a lot of data in a more streamlined way, so you don't get lost repeating query after query with the same code that you could just include in a temp table. And here's another bonus to using temp tables: they can help your fellow team members too. With temp tables, your code is usually less complicated and easier to read and understand, which your team will appreciate! Once you start to explore temporary tables on your own, you might not be able to stop. Don't say I didn't warn you. Coming up, we'll explore even more things you can do with temp tables. See you soon.
Working with temporary tables
Temporary tables are exactly what they sound like—temporary tables in a SQL database that aren't stored permanently. In this reading, you will learn the methods to create temporary tables using SQL commands. You will also learn a few best practices to follow when working with temporary tables.
A quick refresher on what you have already learned about temporary tables:
They are automatically deleted from the database when you end your SQL session.
They can be used as a holding area for storing values if you are making a series of calculations. This is sometimes referred to as pre-processing of the data.
They can collect the results of multiple, separate queries. This is sometimes referred to as data staging. Staging is useful if you need to perform a query on the collected data or merge the collected data.
They can store a filtered subset of the database. You don't need to select and filter the data each time you work with it. In addition, using fewer SQL commands helps to keep your data clean.
It is important to point out that each database has its own unique set of commands to create and manage temporary tables. We have been working with BigQuery, so we will focus on the commands that work well in that environment. The rest of this reading will go over the ways to create temporary tables, primarily in BigQuery.
Temporary table creation in BigQuery
Temporary tables can be created using different clauses. In BigQuery, the WITH clause can be used to create a temporary table. A sketch of the general syntax for this method appears after the breakdown below. Breaking down this kind of query, notice the following:
The statement begins with the WITH clause followed by the name of the new temporary table you want to create.
The AS clause appears after the name of the new table. This clause instructs the database to put all of the data identified in the next part of the statement into the new table.
The opening parenthesis after the AS clause creates the subquery that filters the data from an existing table. The subquery is a regular SELECT statement along with a WHERE clause to specify the data to be filtered.
The closing parenthesis ends the subquery created by the AS clause.
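Based on that breakdown, a minimal sketch of the pattern might look like this. The names new_table_data and existing_table and the filter condition are placeholders for your own tables and logic, and the final SELECT is included only because a WITH clause is always attached to the query that uses it:

WITH new_table_data AS (
  SELECT
    *
  FROM
    existing_table
  WHERE
    condition
)
-- Query the filtered result as if it were a table
SELECT
  *
FROM
  new_table_data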
When the database executes this query, it will first complete the subquery and assign the values that result from that subquery to "new_table_data," which is the temporary table. You can then run multiple queries on this filtered data without having to filter the data every time.
Temporary table creation in other databases (not supported in BigQuery)
The following method isn't supported in BigQuery, but most other versions of SQL databases support it, including SQL Server and MySQL. Using SELECT and INTO, you can create a temporary table based on conditions defined by a WHERE clause to locate the information you need for the temporary table. The general syntax for this method is as follows:
SELECT *
INTO AfricaSales
FROM GlobalSales
WHERE Region = "Africa"
This SELECT statement uses the standard clauses like FROM and WHERE, but the INTO clause tells the database to store the data that is being requested in a new temporary table named, in this case, "AfricaSales."
User-managed temporary table creation
So far, we have explored ways of creating temporary tables that the database is responsible for managing. But, you can also create temporary tables that you can manage as a user. As an analyst, you might decide to create a temporary table for your analysis that you can manage yourself. You would use the CREATE TABLE statement to create this kind of temporary table. After you have finished working with the table, you would then delete or drop it from the database at the end of your session. Note: BigQuery uses CREATE TEMP TABLE instead of CREATE TABLE, but the general syntax is the same.
CREATE TABLE table_name (
  column1 datatype,
  column2 datatype,
  column3 datatype,
  ....
)
After you have completed working with your temporary table, you can remove the table from the database using the DROP TABLE clause. The general syntax is as follows:
DROP TABLE table_name
Best practices when working with temporary tables
Global vs. local temporary tables: Global temporary tables are made available to all database users and are deleted when all connections that use them have closed. Local temporary tables are made available only to the user whose query or connection established the temporary table. You will most likely be working with local temporary tables. If you have created a local temporary table and are the only person using it, you can drop the temporary table after you are done using it.
Dropping temporary tables after use: Dropping a temporary table is a little different from deleting a temporary table. Dropping a temporary table not only removes the information contained in the rows of the table, but removes the table variable definitions (columns) themselves. Deleting a temporary table removes the rows of the table but leaves the table definition and columns ready to be used again. Although local temporary tables are dropped after you end your SQL session, it may not happen immediately. If a lot of processing is happening in the database, dropping your temporary tables after using them is a good practice to keep the database running smoothly.
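To make that create-then-drop practice concrete, here is a minimal sketch of the user-managed flow using hypothetical table and column names. Keep in mind that in BigQuery the first statement would be CREATE TEMP TABLE rather than CREATE TABLE, and that the exact data types available depend on your database:

-- Create a table you will manage yourself (hypothetical names and types)
CREATE TABLE africa_sales (
  sale_id INT,
  region VARCHAR(50),
  sale_amount DECIMAL(10, 2)
);

-- ...load data into africa_sales and run your analysis queries against it...

-- Remove the table, including its column definitions, when you are finished
DROP TABLE africa_sales;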
For more information
BigQuery Documentation for Temporary Tables: Documentation has the syntax to create temporary tables in BigQuery
How to use temporary tables via WITH in Google BigQuery: Article describes how to use WITH
Introduction to Temporary Tables in SQL Server: Article describes how to use SELECT INTO and CREATE TABLE
SQL Server Temporary Tables: Article describes temporary table creation and removal
Choosing Between Table Variables and Temporary Tables: Article describes the differences between passing variables in SQL statements vs. using temporary tables
Welcome back, future data analyst. As a budding analyst, you'll be exposed to a lot of data. People learn and absorb data in so many different ways, and one of the most effective ways that this can happen is through visualization. Data visualization is the graphic representation and presentation of data. In reality, it's just putting information into an image to make it easier for other people to understand. If you've ever looked at any kind of map, whether it's paper or online, then you know exactly how helpful visuals can be. Data visualizations are definitely having a moment right now. Online we are surrounded by images that show information in all kinds of ways, but the history of data visualization goes back way further than the Web. Visualizing data began long ago with maps, which are the visual representation of geographic data. This map of the known world is from 1502. Map makers continued to improve their visualizations as new lands were charted. New data was collected about those locations, and new methods for visualizing the data were created. Scientists and mathematicians began to truly embrace the idea of arranging data visually in the 1700s and 1800s. This bar graph is from 1821, and it doesn't look too different from bar graphs that we see today. But since the beginning of the digital age of data analytics in the 1990s, the scope and reach of visualizations have grown along with the data they graphically represent. As we keep learning how to more efficiently communicate with visuals, the quality of our insights continues to grow too. Today we can quantify human behavior through data, and we've learned to use computers to collect, analyze, and visualize that data. As an analyst in today's world, you'll probably split your time with data visuals in two ways: looking at visuals in order to understand and draw conclusions about data, or creating visuals from raw data to tell a story. Either way, it's always good to keep in mind that data visualizations will be your key to success. This is especially true once you reach the point where you're ready to present the results of your data analysis to an audience. Getting people to understand your vision and thought process can feel challenging. But a well-made data visualization has the power to change people's minds. Plus, it can help someone who doesn't have the same technical background or experience as you form their own opinions. So here's a quick rule for creating a visualization. Your audience should know exactly what they're looking at within the first five seconds of seeing it. Basically, this means the visual should be clear and easy to follow. In the five seconds after that, your audience should understand the conclusion your visualization is making, even if they aren't totally familiar with the research you've been doing. They might not agree with your conclusion, and that's okay.
You can always use their feedback to adjust your visualization and go back to the data to do further analysis. So now let's talk about what we have to do to create a visualization that's understandable, effective and, most importantly, convincing. Let's start from the beginning. Data visualizations are a helpful tool for fitting a lot of information into a small space. To do this, you first need to structure and organize your thoughts. Think about your objectives and the conclusions you've reached after sorting through data. Then think about the patterns you've noticed in the data, the things that surprised you and, of course, how all of this fits together into your analysis. Identifying the key elements of your findings helps set the stage for how you should organize your presentation. Check out this data visualization made by David McCandless, a well-known data journalist. This graphic includes four key elements: the information or data, the story, the goal and the visual form. It's arranged in a four-part Venn diagram, which tells us that all four elements are needed for a successful visualization. So far, you've learned a lot about the data used in visualizations. That's important because it's a key building block for your visualization. The story or concept adds meaning to the data and makes it interesting. We'll talk more about the importance of data storytelling later, but for now, just remember that the story and the data combined provide an outline of what you're trying to show. The goal or function makes the data both useful and usable, and the visual form creates both beauty and structure. With just two elements, you can create a rough sketch of a visual. This could work if you're at an early stage, but won't give you a complete visualization because you'd be missing other key elements. Even using three elements gets you closer, but you're not quite finished. For example, if you combine information, goal, and visual form without any story, your visual will probably look fine, but it won't be interesting. On their own, each element has value, but visualizations only become truly powerful and effective when you combine all four elements in a way that makes sense. And when you think about all of these elements together, you can create something meaningful for your audience. At Google, I make sure to develop visualizations to tell stories about data that include all four of these elements, and I can tell you that each element is a key to a visualization's success. That's why it's so important for you as the analyst to pay close attention to each element as we move forward. Other people might not know or understand the exact steps you took to come to the conclusions you've made, but that shouldn't stop them from understanding your reasoning. Basically, an effective data visualization should lead viewers to reach the same conclusion you did, but much more quickly. Because of the age we live in, we're constantly being shown different ways to view and absorb information. This means that you've already seen lots of visuals you can reference as you design your own visualizations. You have the power to tell convincing stories that could change opinions and shift mindsets. That's pretty cool. But you also have the responsibility to pay attention to the perspectives of others as you create these stories. So it's important to always keep that in mind. Coming up, we'll start drawing connections between data and images to create a strong foundation for your visual masterpieces.
I can't wait to get started.
A data visualization, sometimes referred to as a "data viz," allows analysts to properly interpret data. A good way to think of data visualization is that it can be the difference between utter confusion and really grasping an issue. Creating effective data visualizations is a complex task; there is a lot of advice out there, and it can be difficult to grasp it all. In this reading, you are going to learn some tips and tricks for creating effective data visualizations.
First, you'll review two frameworks that are useful for thinking about how you can organize the information in your visualization. Second, you'll explore pre-attentive attributes and how they can be used to affect the way people think about your visualizations. From there, you'll do a quick review of the design principles that you should keep in mind when creating your visualization. You will end the reading by reviewing some practices that you can use to avoid creating misleading or inaccurate visualizations.
Frameworks for organizing your thoughts about visualization
Frameworks can help you organize your thoughts about data visualization and give you a useful checklist to reference. Here are two frameworks that may be useful for you as you create your own data viz:
1) The McCandless Method
You learned about the David McCandless method in the first lesson on effective data visualizations, but as a refresher, the McCandless Method lists four elements of good data visualization:
1. Information: the data you are working with
2. Story: a clear and compelling narrative or concept
3. Goal: a specific objective or function for the visual
4. Visual form: an effective use of metaphor or visual expression
Note: One useful way of approaching this framework is to notice the parts of the graphic where there is incomplete overlap between all four elements. For example, visual form without a goal, story, or data could be a sketch or even art. Data plus visual form without a goal or function is eye candy. Data with a goal but no story or visual form is boring. All four elements need to be at work to create an effective visual.
2) Kaiser Fung's Junk Charts Trifecta Checkup
This approach is a useful set of questions that can help consumers of data visualization critique what they are consuming and determine how effective it is. The Checkup has three questions:
1. What is the practical question?
2. What does the data say?
3. What does the visual say?
Note: This checklist helps you think about your data viz from the perspective of your audience and decide if your visual is communicating your data effectively to them or not.
In addition to these frameworks, there are some other building blocks that can help you construct your data visualizations.
Pre-attentive attributes: marks and channels
Creating effective visuals means leveraging what we know about how the brain works, and then using specific visual elements to communicate the information effectively. Pre-attentive attributes are the elements of a data visualization that people recognize automatically without conscious effort. The essential, basic building blocks that make visuals immediately understandable are called marks and channels.
Marks
Marks are basic visual objects like points, lines, and shapes. Every mark can be broken down into four qualities:
1. Position - Where a specific mark is in space in relation to a scale or to other marks
2. Size - How big, small, long, or tall a mark is
3. Shape - Whether a specific object is given a shape that communicates something about it
4. Color - What color the mark is
Channels
Channels are visual aspects or variables that represent characteristics of the data. Channels are basically marks that have been used to visualize data. Channels will vary in terms of how effective they are at communicating data based on three elements:
1. Accuracy - Are the channels helpful in accurately estimating the values being represented?
For example, color is very accurate when communicating categorical differences, like apples and oranges. But it is much less effective when distinguishing quantitative data, like 5 from 5.5.
2. Popout - How easy is it to distinguish certain values from others? There are many ways of drawing attention to specific parts of a visual, and many of them leverage pre-attentive attributes like line length, size, line width, shape, enclosure, hue, and intensity.
3. Grouping - How good is a channel at communicating groups that exist in the data? Consider the proximity, similarity, enclosure, connectedness, and continuity of the channel.
But, remember: the more you emphasize different things, the less that emphasis counts. The more you emphasize one single thing, the more that counts.
Design principles
Once you understand the pre-attentive attributes of data visualization, you can go on to design principles for creating effective visuals. These design principles are important to your work as a data analyst because they help you make sure that you are creating visualizations that communicate your data effectively to your audience. By keeping these rules in mind, you can plan and evaluate your data visualizations to decide if they are working for you and your goals. And, if they aren't, you can adjust them!
Choose the right visual: One of the first things you have to decide is which visual will be the most effective for your audience. Sometimes, a simple table is the best visualization. Other times, you need a more complex visualization to illustrate your point.
Optimize the data-ink ratio: The data-ink ratio entails focusing on the part of the visual that is essential to understanding the point of the chart. Try to minimize non-data ink like boxes around legends or shadows to optimize the data-ink ratio.
Use orientation effectively: Make sure the written components of the visual, like the labels on a bar chart, are easy to read. You can change the orientation of your visual to make it easier to read and understand.
Color: There are a lot of important considerations when thinking about using color in your visuals. These include using color consciously and meaningfully, staying consistent throughout your visuals, being considerate of what colors mean to different people, and using inclusive color scales that make sense for everyone viewing them.
Numbers of things: Think about how many elements you include in any visual. If your visualization uses lines, try to plot five or fewer. If that isn't possible, use color or hue to emphasize important lines. Also, when using visuals like pie charts, try to keep the number of segments to less than seven since too many elements can be distracting.
Avoiding misleading or deceptive charts
As you are considering what kind of visualization to create and how to design it, you will want to be sure that you are not creating misleading or deceptive charts. As you have been learning, data analysis provides people with insights and knowledge they can use to make decisions. So, it is important that the visualizations you create are communicating your data accurately and truthfully.
Here are some common errors to avoid so that your visualizations aren't accidentally misleading:
Cutting off the y-axis: Changing the scale on the y-axis can make the differences between data seem more dramatic, even if the difference is actually quite small.
Misleading use of a dual y-axis: Using a dual y-axis without clearly labeling it in your data visualization can result in misleading charts.
Artificially limiting the scope of the data: If you only consider the part of the data that confirms your analysis, your visualizations can be misleading because they don't take all of the data into account.
Problematic choices in how data is binned or grouped: It is important to make sure that the way you are grouping data isn't misrepresenting your data and disguising important trends and insights.
Using part-to-whole visuals when the totals do not sum up appropriately: If you are using a part-to-whole visual like a pie chart to explain your data, the individual parts should add up to equal 100%. If they don't, your data visualization will be misleading.
Hiding trends in cumulative charts: Creating a cumulative chart can disguise more insightful trends by making the scale of the visualization too large to track any changes over time.
Artificially smoothing trends: Adding smooth trend lines between points in a scatter plot can make the data easier to read, but replacing the points with just the line can actually make the data appear more connected over time than it actually was.
Finally, keep in mind that data visualization is an art form, and it takes time to develop these skills. Over your career as a data analyst, you will not only learn how to design good data visualizations, but you will also learn how to evaluate good data visualizations. Use these tips to think critically about data visualization—both as a creator and as an audience member.
Further reading
The beauty of data visualization: In this video, David McCandless explains the need for design to not just be beautiful, but for it to be meaningful as well. Data visualization must be able to balance function and form for it to be relevant to your audience.
'The McCandless Method' of data presentation: At first glance, this blog appears to be written by a David McCandless fan, and it is. However, it contains very useful information and provides an in-depth look at the 5-step process that McCandless uses to present his data.
Information is beautiful: Founded by McCandless himself, this site serves as a hub of sample visualizations that make use of the McCandless method. Explore data from the news, science, the economy, and so much more and learn how to make visual decisions based on facts from all kinds of sources.
Beautiful daily news: In this McCandless collection, explore uplifting trends and statistics that are beautifully visualized for your creative enjoyment. A new chart is released every day so be sure to visit often to absorb the amazing things happening all over the world.
The Wall Street Journal Guide to Information Graphics: The Dos and Don'ts of Presenting Data, Facts, and Figures: This is a comprehensive guide to data visualization, including chapters on basic data visualization principles and how to create useful data visualizations even when you find yourself in a tricky situation. This is a useful book to add to your data visualization library, and you can reference it over and over again.
Hello again. Earlier we talked about why data visualizations are so important to both analysts and stakeholders. Now we'll discuss the connections you can make between data and images in your visualizations. Visual communication of data is important to those using the data to help make decisions.
To better understand the connection between data and images, let's talk about some examples of data visualizations and how they can communicate data effectively. You may come across lots of these in your daily life. We'll explore them a little bit more here. A good place to start is a bar graph. Bar graphs use size contrast to compare two or more values. The horizontal line of a bar graph, usually placed at the bottom, is called the x-axis, and in bar graphs with vertical bars, the x-axis is used to represent categories, time periods, or other variables. The vertical line of a bar graph, usually placed to the left, is called the y-axis. The y-axis usually has a scale of values for the variables. In this example, the time of day is compared to someone's level of motivation throughout the whole workday. Bar graphs are a great way to clarify trends. Here, it's clear this person's motivation is low at the beginning of the day and gets higher and higher by the end of the workday. This type of visualization makes it very easy to identify patterns. Another example is a line graph. Line graphs are a type of visualization that can help your audience understand shifts or changes in your data. They're usually used to track changes through a period of time, but they can be paired with other factors too. In this line graph, we're using two lines to compare the popularity of cats and dogs over a period of time. With two different line colors, we can immediately tell that dogs are more popular than cats. We'll talk more about using colors and patterns to make visualizations more accessible to audiences later too. Even as a line moves up and down, there's a general trend upwards, and the line for dogs always stays higher than the line for cats. Now let's check out another visualization you'll probably recognize. Say hello to the pie chart. Pie charts show how much each part of something makes up the whole. This pie chart shows us all the activities that make up someone's day. Half of it is spent working, which is shown by the amount of space that the blue section takes up. From a quick scan, you can easily tell which activities make up a good chunk of the day in this pie chart and which ones take up less time. Earlier, we learned how maps help organize data geographically. The great thing about maps is they can hold a lot of location-based information and they're easy for your audience to interpret. This example shows survey data about people's happiness in Europe. The borderlines are well-defined and the colors added make it even easier to tell the countries apart. Understanding the data represented here, which we'll come back to again later, can happen pretty quickly. So data visualization is an excellent tool for making the connection between an image and the information it represents, but it can sometimes be misleading. One way visualizations can be manipulated is with scaling and proportions. Think of a pie chart. Pie charts show proportions and percentages between categories. Each part of the circle, or pie, should reflect its percentage of the whole, which is equal to 100 percent. So if you want to visualize your sales analysis to show the percentage of your company sales that come from online transactions, you could use a pie chart. The size of each slice would be the percentage of total sales that it represents. So if your online sales accounted for 60 percent, the slice would be 60 percent of the whole pie. Now here's a misleading pie chart.
It's supposed to show opinions about pizza toppings, but each slice or segment represents more than one option. They all add up to well over 100 percent. There are lots of ingredients listed below the image that are not even included in the visual data. All of the segments are the same size, even though they're supposed to be showing different values. If a visualization looks confusing, then it probably is confusing. Let's explore another example where the size of the graphic components comes into play, this time with a bar chart. In a truncated bar chart like this one, the values on the y-axis don't start at zero. The data points start at 9,100 and at intervals of 100. This makes it seem like the data, let's say it's for clicks per day on different website links, is fairly wide-ranging. In this view, website E seems to clearly receive way more clicks than website D, which receives more clicks than website C, and so on. While the graph is clear and the elements are easy to understand, the way the data is presented is misleading. Let's try to fix this by changing the graph's y-axis so that it starts at zero instead. Now the differences between the websites' clicks per day don't look nearly as drastic. By making the y-axis start at zero, we're changing the visual proportions to be more accurate and more honest. Some platforms always start their y-axis at zero, but other programs like spreadsheets might not fix the y-axis at zero. So it's important to keep this in mind when creating visualizations. By following the conventions of data analysis, you'll be able to avoid misleading visualizations. You always want your visualization to be clear and easy to understand, but never at the expense of communicating ideas that are true to the data. So we've talked about some effective data-driven visualizations like bar graphs, line graphs, and pie charts, and when to use them. On top of that, we've discussed some things to avoid in your visualizations to keep them from being misleading. Coming up, we'll check out how to make those visualizations reach your target audience. See you then. Hey, there. You're back and ready to learn how to create powerful data visualizations. Coming up, we'll explore how to take our findings and turn them into compelling visuals. Earlier, we discussed the relationship between data and images. Now we'll build on that to explore what visualizations can reveal to your audience and how to make your graphics as effective as possible. One of your biggest considerations when creating a data visualization is where you'd like your audience to focus. Showing too much can be distracting and leave your audience confused. In some cases, restricting data can be a good thing. On the other hand, showing too little can make your visualization unclear and less meaningful. As a general rule, as long as it's not misleading, you should visually represent only the data that your audience needs in order to understand your findings. Now let's talk about what you can show with visualizations. Change over time is a big one. If your analysis involves how the data has changed over a certain period, which could be days, weeks, months, or years, you can set your visualization to show only the time period relevant to your objective. This visualization shows the search interest in news story topics like environment and science and social issues. The viz is set up to show how the search entries change day to day. The bubbles represent the most popular topic on each day in a given part of the US.
As new stories come up, the data changes to reflect the topic of those stories. If we wanted the data for weekly or monthly news cycles, we'd change the interactive feature to show changes by week or month. Another situation is when you need to show how your data is distributed. A histogram resembles a bar graph, but it's a chart that shows how often data values fall into certain ranges. This histogram shows a lot of data and how it's distributed on a narrow range from negative one to positive one. Each bin, or bucket, as the bars are called, contains a certain number of values that fall into one small part of the range. If you don't need to show that much data, other histograms would be more effective, like this one about the length of dinosaurs. Here the bins or buckets of data values are segmented. You can show each value that falls into each part of the range. If your data needs to be ranked, like when ordering the number of responses to survey questions, you should first think about what you want to highlight in your visualization. Bar charts with horizontal bars effectively show data that are ranked, with bars arranged in ascending or descending order. A bar chart should always be ranked by value, unless there's a natural order to the data like age or time, for example. This simple bar chart shows metals like gold and platinum ranked by density. An audience would be able to clearly see the ranking and quickly determine which metals had the highest density, even if this database included a lot more metals. Correlation charts can show relationships among data, but they should be used with caution because they might lead viewers to think that the data shows causation. Causation, or a cause-effect relationship, occurs when an action directly leads to an outcome. Correlation and causation are often mixed up because humans like to find patterns even when they don't exist. If two variables look like they're associated in some way, we might assume that one is dependent on the other. That implies causation, even if the variables are completely independent. If we put that data into a visualization, then it would be misleading. But correlation charts that do show causation can be effective. For example, this correlation chart has one line of data showing the average traffic for Google searches on Tuesdays in Brazil. The other line is for search traffic on a specific date, June 15th. The data is automatically correlated because both lines are representing the same basic information. But the chart also shows one big difference. When a football match, or soccer match for Americans, began on June 15th, the search traffic showed a significant drop. This implies causation. Football is a very popular and important sport for Brazilians, and the data in this chart verifies that. We've now talked about time series charts, histograms, ranked bar charts, and correlation charts. Each of these charts can visualize a different type of analysis. Your business objective and audience will help you figure out which of these common visualizations to choose. Or you may want to check out some other kinds of visualizations out there. There's also a glossary of visualizations that you'll be able to reference later. That wraps up our lesson on creating visualizations. Coming up next, we'll add some more layers to your planning and execution of visuals. Hang on tight.
Correlation and causation
In this reading, you will examine correlation and causation in more detail.
Let's review the definitions of these terms: Correlation in statistics is the measure of the degree to which two variables move in relationship to each other. An example of correlation is the idea that "As the temperature goes up, ice cream sales also go up." It is important to remember that correlation doesn't mean that one event causes another. But, it does indicate that they have a pattern with or a relationship to each other. If one variable goes up and the other variable also goes up, it is a positive correlation. If one variable goes up and the other variable goes down, it is a negative or inverse correlation. If one variable goes up and the other variable stays about the same, there is no correlation. Causation refers to the idea that an event leads to a specific outcome. For example, when lightning strikes, we hear the thunder (sound wave) caused by the air heating and cooling from the lightning strike. Lightning causes thunder.
Why is differentiating between correlation and causation important?
When you make conclusions from data analysis, you need to make sure that you don't assume a causal relationship between elements of your data when there is only a correlation. When your data shows that outdoor temperature and ice cream consumption both go up at the same time, it might be tempting to conclude that hot weather causes people to eat ice cream. But, a closer examination of the data would reveal that every change in temperature doesn't lead to a change in ice cream purchases. In addition, there might have been a sale on ice cream at the same time that the data was collected, which might not have been considered in your analysis. Knowing the difference between correlation and causation is important when you make conclusions from your data since the stakes could be high. The next two examples illustrate the high stakes to health and human services.
Cause of disease
For example, pellagra is a disease with symptoms of dizziness, sores, vomiting, and diarrhea. In the early 1900s, people thought that the disease was caused by unsanitary living conditions. Most people who got pellagra also lived in unsanitary environments. But, a closer examination of the data showed that pellagra was the result of a lack of niacin (Vitamin B3). Unsanitary conditions were related to pellagra because most people who couldn't afford to purchase niacin-rich foods also couldn't afford to live in more sanitary conditions. But, dirty living conditions turned out to be a correlation only.
Distribution of aid
Here is another example. Suppose you are working for a government agency that provides food stamps. You noticed from the agency's Google Analytics that people who qualify for food stamps are browsing the official website, but they are leaving the site without signing up for benefits. You think that the people visiting the site are leaving because they aren't finding the information they need to sign up for food stamps. Google Analytics can help you find clues (correlations), like the same people coming back many times or how quickly people leave the page. One of those correlations might lead you to the actual cause, but you will need to collect additional data, like in a survey, to know exactly why people coming to the site aren't signing up for food stamps. Only then can you figure out how to increase the sign-up rate.
Key takeaways
In your data analysis, remember to:
Critically analyze any correlations that you find
Examine the data's context to determine if a causation makes sense (and can be supported by all of the data)
Understand the limitations of the tools that you use for analysis
Further information
You can explore the following article and training for more information about correlation and causation:
Correlation is not causation: This article describes the impact to a business when correlation and causation are confused.
Correlation and causation (Khan Academy lesson): This lesson describes correlation and causation along with a working example. Follow the examples of the analysis and notice if there is a positive correlation between frostbite and sledding accidents.
Hey, great to see you again. So far we've shown that there are lots of choices you'll make as a data analyst when creating visualizations. Each of your choices should help make sure that your visuals are meaningful and effective. Another choice you'll need to make is whether you want your visualizations to be static or dynamic. Static visualizations do not change over time unless they're edited. They can be useful when you want to control your data and your data story. Any visualization printed on paper is automatically static. Charts and graphs created in spreadsheets are often static too. For example, the owner of this spreadsheet might have to change the data in order for the visualization to update. Now, dynamic visualizations are interactive or change over time. The interactive nature of these graphics means that users have some control over what they see. This can be helpful if stakeholders want to adjust what they're able to view. Let's check out a visualization about happiness that we've created in Tableau. Tableau is a business intelligence and analytics platform that helps people see, understand, and make decisions with data. Visualizations in Tableau are automatically interactive. We'll go into the dashboard to see how the happiness score has changed from 2015 to 2017. We can check this out in our 12th slide, yearly happiness changes. On the left are the country-level changes in happiness score. The countries are sorted from largest increase to largest decrease. On the right, there's a map with overall happiness scores. The color scale moves from blue for the countries with the highest happiness score, to red for those with the lowest. If you look below the map, you'll notice a year-to-view slider where people can choose which year's happiness scores to display on the map. It's currently set for 2016, but if someone wants to know the scores for 2015 or 2017, they can adjust the slider. They could then make note of how the color-coding and score labels change from year to year. Other dynamic visualizations upload new data automatically. These bar graphs continually update data by the minute and second. Other data visuals can do the same by day, week, or month. If you need to, you can show trends in real time. Having an interactive visualization can be useful for both you and the audience you share it with. But it's good to remember that the more power you give the user, the less control you have over the story you want the data to tell. It's something to keep in mind as you learn how to create your own visualizations. You want to find the right balance between interactivity and control. Something else to consider is the choice between using a static or dynamic visualization.
This will usually depend on the data you're visualizing, the audience you're presenting to, and how you're giving your presentation. Now that we've made some decisions about what kind of data viz we want to create, we can start thinking about the design, which is exactly what we'll start talking about next time. See you there.
The wonderful world of visualizations
As a data analyst, you will often be tasked with relaying information and data that your audience might not readily understand. Presenting your data visually is an effective way to communicate complex information and engage your stakeholders. One question to ask yourself is: "what is the best way to tell the story within my data?" This reading includes several options for you to choose from (although there are many more).
Line chart
A line chart is used to track changes over short and long periods of time. When smaller changes exist, line charts are better to use than bar graphs. Line charts can also be used to compare changes over the same period of time for more than one group. Let's say you want to present the graduation frequency for a particular high school between the years 2008-2012. You would input your data in a table like this:
Year | Graduation rate
2008 | 87
2009 | 89
2010 | 92
2011 | 92
2012 | 96
From this table, you are able to present your data in a line chart. Maybe your data is more specific than above. For example, let's say you are tasked with presenting the difference of graduation rates between male and female students. Then your chart would resemble something like this:
Column chart
Column charts use size to contrast and compare two or more values, using height or length to represent the specific values. Below is example data concerning sales of vehicles over the course of 5 months:
Month | Vehicles sold
August | 2,800
September | 3,700
October | 3,750
November | 4,300
December | 4,600
Visually, it would resemble something like this: What would this column chart entail if we wanted to add the sales data for a competing car brand?
Heatmap
Similar to bar charts, heatmaps also use color to compare categories in a data set. They are mainly used to show relationships between two variables and use a system of color-coding to represent different values. The following heatmap plots temperature changes for each city during the hottest and coldest months of the year.
Pie chart
The pie chart is a circular graph that is divided into segments representing proportions corresponding to the quantity it represents, especially when dealing with parts of a whole. For example, let's say you are determining favorite movie categories among avid movie watchers. You have gathered the following data:
Movie category | Preference
Comedy | 41%
Drama | 11%
Sci-fi | 3%
Romance | 17%
Action | 28%
Visually, it would resemble something like this:
Scatter plot
Scatter plots show relationships between different variables. Scatter plots are typically used for two variables for a set of data, although additional variables can be displayed. For example, you might want to show data of the relationship between temperature changes and ice cream sales. It would resemble something like this: As you may notice, the higher the temperature got, the more demand there was for ice cream – so the scatter plot is great for showing the relationship between the two variables.
Distribution graph
A distribution graph displays the spread of various outcomes in a dataset. Let's apply this to real data.
Distribution graph A distribution graph displays the spread of various outcomes in a dataset. Let’s apply this to real data. To account for its supplies, a brand new coffee shop owner wants to measure how many cups of coffee their customers consume, and they want to know if that information is dependent on the days and times of the week. A distribution graph of that data would show how coffee sales spread across the week. From this distribution graph, you may notice that the amount of coffee sales steadily increases from the beginning of the week, reaching the highest point mid-week, and then decreases towards the end of the week. If outcomes are categorized on the x-axis by distinct numeric values (or ranges of numeric values), the distribution becomes a histogram. If data is collected from a customer rewards program, the shop owner could categorize how many customers consume between one and ten cups of coffee per week. The histogram would have ten columns representing the number of cups, and the height of the columns would indicate the number of customers drinking that many cups of coffee per week. Reviewing each of these visual examples, where do you notice that they fit in relation to your type of data? One way to answer this is by evaluating patterns in data. Meaningful patterns can take many forms, such as: Change: This is a trend or instance of observations that become different over time. A great way to measure change in data is through a line or column chart. Clustering: A collection of data points with similar or different values. This is best represented through a distribution graph. Relativity: These are observations considered in relation or in proportion to something else. You have probably seen examples of relativity data in a pie chart. Ranking: This is a position in a scale of achievement or status. Data that requires ranking is best represented by a column chart. Correlation: This shows a mutual relationship or connection between two or more things. A scatter plot is an excellent way to represent this type of data pattern. Studying your data Data analysts are tasked with collecting and interpreting data as well as displaying data in a meaningful and digestible way. Determining how to visualize your data will require studying your data’s patterns and converting it using visual cues. Feel free to practice with your own charts and data in spreadsheets. Simply input your data in the spreadsheet, highlight it, then insert any chart type and view how your data can be visualized based on what you choose. Data grows on decision trees With so many visualization options out there for you to choose from, how do you decide what is the best way to represent your data? A decision tree is a decision-making tool that allows you, the data analyst, to make decisions based on key questions that you can ask yourself. Each question in the visualization decision tree will help you make a decision about critical features for your visualization. Below is an example of a basic decision tree to guide you towards making a data-driven decision about which visualization is the best way to tell your story. Please note that there are many different types of decision trees that vary in complexity and can support more in-depth decisions.
- Does your data have only one numeric variable? Histogram or density plot
- Are there multiple data sets? Line chart or pie chart
- Are you measuring changes over time? Bar chart
- Do relationships between the data need to be shown? Scatter plot or heatmap
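Purely as an optional illustration of the decision tree above (this function and its parameter names are mine, not part of the reading), the same logic can be written as a tiny Python function:

# A rough sketch of the chart-selection decision tree above.
def suggest_chart(one_numeric_variable, multiple_datasets,
                  measuring_change_over_time, showing_relationships):
    """Return a chart suggestion based on the decision-tree questions."""
    if one_numeric_variable:
        return "Histogram or density plot"
    if multiple_datasets:
        return "Line chart or pie chart"
    if measuring_change_over_time:
        return "Bar chart"
    if showing_relationships:
        return "Scatter plot or heatmap"
    return "Revisit the questions: no single chart stands out"

# Example: a dataset where the relationship between two variables matters most.
print(suggest_chart(False, False, False, True))  # prints: Scatter plot or heatmap

Real decision trees are richer than four yes/no questions, but the structure is the same: each answer narrows down the visualization choices.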
Begin with your story Start off by evaluating the type of data you have and go through a series of questions to determine the best visual to use: Does your data have only one numeric variable? If you have data that has one continuous, numerical variable, then a histogram or density plot is the best method of plotting your data. Depending on your type of data, a bar chart can even be appropriate in this case. For example, if you have data pertaining to the height of a group of students, you will want to use a histogram to visualize how many students there are in each height range. Are there multiple datasets? For cases dealing with more than one set of data, consider a line or pie chart for accurate representation of your data. A line chart will connect multiple data sets over a single, continuous line, showing how numbers have changed over time. A pie chart is good for dividing a whole into multiple categories or parts. An example of this is when you are measuring quarterly sales figures of your company, which could be plotted on both a line and pie chart. Are you measuring changes over time? A line chart is usually adequate for plotting trends over time. However, when the changes are larger, a bar chart is the better option. This would apply if, for example, you are measuring the number of visitors to NYC over the past 6 months. Do relationships between the data need to be shown? When you have two variables for one set of data, it is important to point out how one affects the other. Variables that pair well together are best plotted on a scatter plot. However, if there are too many data points, the relationship between variables can be obscured, so a heat map can be a better representation in that case. If you are measuring the population of people across all 50 states in the United States, your data points would number in the millions, so you would use a heat map. If you are simply trying to show the relationship between the number of hours spent studying and its effect on grades, a scatter plot would work well. Additional resources The decision tree example used in this reading is one of many. There are multiple decision trees out there with varying levels of detail that you can use to help guide your visual decisions. If you want more in-depth insight into more visual options, explore the following resources: From data to visualization: This is an excellent analysis of a larger decision tree. With this comprehensive selection, you can search based on the kind of data you have or click on each graphic example for a definition and proper usage. Selecting the best chart: This two-part YouTube video can help take the guesswork out of data chart selection. Depending on the type of data you are aiming to illustrate, you will be guided through when to use, when to avoid, and several examples of best practices. Part 2 of this video provides even more examples of different charts, ensuring that there is a chart for every type of data out there. Data is beautiful At this point, you might be asking yourself: What makes a good visualization? Is it the data you use? Or maybe it is the story that it tells? In this reading, you are going to learn more about what makes data visualizations successful by exploring David McCandless’ elements of successful data visualization and evaluating two examples based on those elements. Data visualization can change our perspective and allow us to notice data in new, beautiful ways.
A picture is worth a thousand words—that’s true in data too! You will have the option to save all of the data visualization examples that are used throughout this reading; these are great examples of successful data visualization that you can use for future inspiration. The reading refers to a Venn diagram of four overlapping ovals: where just two or three ovals overlap, there are different types of incomplete data visualization, and at the center, where all four overlap, are the words “successful visualization.” This visualization stresses the idea that all four elements are necessary to create a successful data visualization. You can also access a PDF version of this visualization and save it for your own reference by clicking the file below: WEB_What-Makes-a-Good-Infoviz.pdf Four elements of successful visualizations The Venn diagram by David McCandless identifies four elements of successful visualizations: Information (data): The information or data that you are trying to convey is a key building block for your data visualization. Without information or data, you cannot communicate your findings successfully. Story (concept): Story allows you to share your data in meaningful and interesting ways. Without a story, your visualization is informative, but not really inspiring. Goal (function): The goal of your data visualization makes the data useful and usable. This is what you are trying to achieve with your visualization. Without a goal, your visualization might still be informative, but can’t generate actionable insights. Visual form (metaphor): The visual form element is what gives your data visualization structure and makes it beautiful. Without visual form, your data is not visualized yet. All four of these elements are important on their own, but a successful data visualization balances all four. For example, if your data visualization has only two elements, like the information and story, you have a rough outline. This can be really useful in your early planning stages, but is not polished or informative enough to share. Even three elements are not quite enough— you need to consider all four to create a successful data visualization. In the next part of this reading, you will use these elements to examine two data visualization examples and evaluate why they are successful. Example 1: Visualization of dog breed comparison The Best in Show visualization uses two axes, popularity and data score, to place different dog breeds on a four-square chart. The squares are labeled “Inexplicably Overrated,” “The Rightly Ignored,” “Hot Dogs!,” and “Overlooked Treasures.” Different dog breeds, visualized with plotted points shaped like dogs, are distributed on the chart based on their popularity and their data score. Save this data visualization as a PDF by clicking the file below: IIB-LICENSED_Best-in-Show.pdf View the data The Best in Show visualization uses data about different dog breeds from the American Kennel Club. The data has been compiled in a spreadsheet. Click the link below and select "Use Template" to view the data. Link to the template: KIB - Best in Show Or, if you don't have a Google account, download the file below: KIB - Best in Show (public) (XLSX file) Examine the four elements This visualization compares the popularity of different dog breeds to a more objective data score. Consider how it uses the elements of successful data visualization: Information (data): If you view the data, you can explore the metrics being illustrated in the visualization.
Story (concept): The visualization shows which dogs are overrated, which are rightly ignored, and those that are really hot dogs! And, the visualization reveals some overlooked treasures you may not have known about previously. Goal (function): The visualization explores the relationship between popularity and the objective data scores for different dog breeds. By comparing these data points, you can learn more about how different dog breeds are perceived. Visual form (metaphor): In addition to the actual four-square structure of this visualization, other visual cues are used to communicate information about the dataset. The most obvious is that the data points are represented as dog symbols. Further, the size of a dog symbol and the direction the dog symbol faces communicate other details about the data. Example 2: Visualization of rising sea levels This visualization demonstrates how much sea levels are projected to rise over the course of 8,000 years. On the y-axis, it lists both the number of years and the sea level in meters. From right to left, starting with the lowest sea level, the chart includes silhouettes of different cities around the world to demonstrate how long it would take for most of the world to be underwater. It also includes inset maps of the continents and how they would appear at different times as sea levels continue to rise. Save this data visualization as a PDF by clicking the file below: IIB-LICENSED_Sea-Levels.pdf Examine the four elements This When Sea Levels Attack visualization illustrates how much sea levels are projected to rise over the course of 8,000 years. The silhouettes of different cities with different sea levels, rising from right to left, help to drive home how much of the world will be affected as sea levels continue to rise. Here is how this data visualization stacks up using the four elements of successful visualization: Information (data): This visualization uses climate data on rising sea levels from a variety of sources, including NASA and the Intergovernmental Panel on Climate Change. In addition to that data, it also uses recorded sea levels from around the world to help illustrate how much rising sea levels will affect the world. Story (concept): The visualization tells a very clear story: Over the course of 8,000 years, much of the world as we know it will be underwater. Goal (function): The goal of this project is to demonstrate how soon rising sea levels are going to affect us on a global scale. Using both data and the visual form, this visualization makes rising sea levels feel more real to the audience. Visual form (metaphor): The city silhouettes in this visualization are a beautiful way to drive home the point of the visualization. They give the audience a metaphor for how rising sea levels will affect the world around them in a way that showing just the raw numbers can’t do. And for a more global perspective, the visualization also uses inset maps. Key takeaways Notice how each of these visualizations balances all four elements of successful visualization. They clearly incorporate data, use storytelling to make that data meaningful, focus on a specific goal, and structure the data with visual forms to make it beautiful and communicative. The more you practice thinking about these elements, the more you will be able to include them in your own data visualizations. Design thinking for visualization improvement Design thinking for data visualization involves five phases: 1.
Empathize: Thinking about the emotions and needs of the target audience for the data visualization 2. Define: Figuring out exactly what your audience needs from the data 3. Ideate: Generating ideas for data visualization 4. Prototype: Putting visualizations together for testing and feedback 5. Test: Showing prototype visualizations to people before stakeholders see them As interactive dashboards become more popular for data visualization, new importance has been placed on efficiency and user-friendliness. In this reading, you will learn how design thinking can improve an interactive dashboard. As a junior analyst, you wouldn’t be expected to create an interactive dashboard on your own, but you can use design thinking to suggest ways that developers can improve data visualizations and dashboards. An example: online banking dashboard Suppose you are an analyst at a bank that has just released a new dashboard in their online banking application. This section describes how you might explore this dashboard like a new user would, consider a user’s needs, and come up with ideas to improve data visualization in the dashboard. The dashboard in the banking application has the following data visualization elements: Monthly spending is shown as a donut chart that reflects different categories like utilities, housing, transportation, education, and groceries. When customers set a budget for a category, the donut chart shows filled and unfilled portions in the same view. Customers can also set an overall spending limit, and the dashboard will automatically assign the budgeted amounts (unfilled areas of the donut chart) to each category based on past spending trends. Empathize First, empathize by putting yourself in the shoes of a customer who has a checking account with the bank. Do the colors and labels make sense in the visualization? How easy is it to set or change a budget? When you click on a spending category in the donut chart, are the transactions in the category displayed? What is the main purpose of the data visualization? If you answered that it was to help customers stay within budget or to save money, you are right! Saving money was a top customer need for the dashboard. Define Now, imagine that you are helping dashboard designers define other things that customers might want to achieve besides saving money. What other data visualizations might be needed? Track income (in addition to spending) Track other spending that doesn’t neatly fit into the set categories (this is sometimes called discretionary spending) Pay off debt Can you think of anything else? Ideate Next, ideate additional features for the dashboard and share them with the software development team. What new data visualizations would help customers? Would you recommend bar charts or line charts in addition to the standard donut chart? Would you recommend allowing users to create their own (custom) categories? Can you think of anything else? Prototype Finally, developers can prototype the next version of the dashboard with new and improved data visualizations. Test Developers can close the cycle by having you (and others) test the prototype before it is sent to stakeholders for review and approval. 
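To picture the kind of donut chart this hypothetical banking dashboard describes, here is a small, optional Python sketch. The spending categories and amounts are invented for illustration, and matplotlib is an assumption rather than a course tool.

# Optional sketch: a donut-style spending chart like the one described above.
# Categories and amounts are invented for illustration only.
import matplotlib.pyplot as plt

categories = ["Utilities", "Housing", "Transportation", "Education", "Groceries"]
spending = [150, 1200, 250, 300, 450]

fig, ax = plt.subplots()
# A wedge width smaller than 1 leaves a hole in the middle, turning the pie into a donut.
ax.pie(spending, labels=categories, autopct="%1.0f%%",
       wedgeprops={"width": 0.4}, startangle=90)
ax.set_title("Monthly spending by category (illustrative)")
plt.show()

A production dashboard would go further, for example by drawing each category's unspent budget as an unfilled portion of the ring, but that detail is beyond this sketch.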
Key takeaways This design thinking example showed how important it is to: Understand the needs of users Generate new ideas for data visualizations Make incremental improvements to data visualizations over time You can refer to the following articles for more information about design thinking: Three Critical Aspects of Design Thinking for Big Data Solutions Data and Design Thinking: Why Use Data in the Design Process? Hello again. We've learned data visualizations are designed to help an audience process information quickly and memorably. You might remember the 5-second rule we covered earlier. Within the first five seconds of seeing a data visualization, your audience should understand exactly what you're trying to convey. Five seconds might seem like a flash, but adding in descriptive wording can really help your audience interpret and understand the data in the right way. Your audience will be less likely to have questions about what you're sharing if you add headlines, subtitles, and labels. One of the easiest ways to highlight key data in your data viz is through headlines. A headline is a line of words printed in large letters at the top of the visualization to communicate what data is being presented. It's the attention-grabber that makes your audience want to read more. Take charts, for example. A chart without a headline is like a report without a title. You want to make it easy to understand what your chart's about. Be sure to use clear, concise language, explaining all information as plainly as possible. Try to avoid using abbreviations or acronyms, even if you think they're common knowledge. The typography and placement of the headline are important too. It's best to keep it simple. Make it bold or a few sizes larger than the rest of the text and place it directly above the chart, aligned to the left. Then, explain your data viz even further with a subtitle. A subtitle supports the headline by adding more context and description. Use a font style that matches the rest of the chart's elements and place the subtitle directly underneath the headline. Now, let's talk about labels. Earlier, we mentioned Dona Wong, a visual journalist who's well known for sharing guidelines on making data viz more effective. She makes a very strong case for using labels directly on the data instead of relying on legends. This is because lots of charts use different visual properties like colors or shapes to represent different values of data. A legend or key identifies the meaning of various elements in a data visualization and can be used as an alternative to labeling data directly. Direct labeling keeps your audience's attention fixed on your graphic and helps them identify data quickly. Legends, on the other hand, force the audience to do more work because a legend is positioned away from the chart's data. The truth is, the more support we provide our audience, the less work they have to do trying to understand what the data is trying to say, and the faster our story will make an impact. Now that we've covered how to make a data viz as effective as possible, next up, we'll figure out how to make it accessible to all. See you in a bit. Pro tips for highlighting key information Headlines, subtitles, labels, and annotations help you turn your data visualizations into more meaningful displays. After all, you want to invite your audience into your presentation and keep them engaged.
When you present a visualization, your audience should be able to process and understand the information you are trying to share in the first five seconds. This reading will teach you what you can do to engage your audience immediately. If you already know what headlines, subtitles, labels, and annotations do, go to the guidelines and style checks at the end of this reading. If you don’t, these next sections are for you. Headlines that pop A headline is a line of words printed in large letters at the top of a visualization to communicate what data is being presented. It is the attention grabber that makes your audience want to read more. Here are some examples: Which Generation Controls the Senate?: This headline immediately generates curiosity. Refer to the subreddit post in the dataisbeautiful community, r/dataisbeautiful, on January 21, 2021. Top 10 coffee producers: This headline immediately informs how many coffee producers are ranked. Read the full article: bbc.com/news/business-43742686. Imagine a line chart with no headline at all. Can you identify what type of data is being represented? Without a headline, it can be hard to figure out what data is being presented. Such a graph could be anything from average rents in the tri-city area, to sales of competing products, or daily absences at the local elementary, middle, and high schools. Turns out, this illustration is showing average rents in the tri-city area. So, let’s add a headline to make that clear to the audience. Adding the headline “Average Rents in the Tri-City Area” above the line chart instantly informs the audience what it is comparing. Subtitles that clarify A subtitle supports the headline by adding more context and description. Adding a subtitle will help the audience better understand the details associated with your chart. Typically, the text for subtitles has a smaller font size than the headline. In the average rents chart, it is unclear from the headline “Average Rents in the Tri-City Area” which cities are being described. There are tri-cities near San Diego, California (Oceanside, Vista, and Carlsbad), tri-cities in the San Francisco Bay Area (Fremont, Newark, and Union City), tri-cities in North Carolina (Raleigh, Durham, and Chapel Hill), and tri-cities in the United Arab Emirates (Dubai, Ajman, and Sharjah). We are actually reporting the data for the tri-city area near San Diego. So adding “Oceanside, Vista, and Carlsbad” becomes the subtitle in this case. This subtitle enables the audience to quickly identify which cities the data reflects. Labels that identify A label in a visualization identifies data in relation to other data. Most commonly, labels in a chart identify what the x-axis and y-axis show. Always make sure you label your axes. We can add “Months (January - June 2020)” for the x-axis and “Average Monthly Rents ($)” for the y-axis in the average rents chart. Data can also be labeled directly in a chart instead of through a chart legend. This makes it easier for the audience to understand data points without having to look up symbols or interpret the color coding in a legend. We can add direct labels in the average rents chart. The audience can then identify the data for Oceanside in yellow, the data for Carlsbad in green, and the data for Vista in blue. Annotations that focus An annotation briefly explains data or helps focus the audience on a particular aspect of the data in a visualization. Suppose in the average rents chart that we want the audience to pay attention to the rents at their highs. Annotating the data points representing the highest average rents will help people focus on those values for each city.
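As an optional illustration of how these elements come together on one chart, here is a short Python/matplotlib sketch. The rent values are invented; only the city names, headline, subtitle, and axis labels come from the example above, and matplotlib itself is an assumption rather than a course tool.

# Optional sketch: headline, subtitle, axis labels, direct labels, and an annotation.
# Rent values are invented for illustration only.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
rents = {                     # hypothetical average monthly rents ($)
    "Oceanside": [1700, 1720, 1750, 1780, 1800, 1820],
    "Vista":     [1600, 1610, 1650, 1660, 1700, 1710],
    "Carlsbad":  [1900, 1920, 1950, 1980, 2000, 2050],
}

fig, ax = plt.subplots()
for city, values in rents.items():
    ax.plot(months, values)
    # Direct label at the end of each line, instead of a legend.
    ax.annotate(city, xy=(len(months) - 1, values[-1]),
                xytext=(5, 0), textcoords="offset points", va="center")

# Headline and subtitle.
fig.suptitle("Average Rents in the Tri-City Area", x=0.125, ha="left", fontweight="bold")
ax.set_title("Oceanside, Vista, and Carlsbad", loc="left", fontsize=10)

# Axis labels.
ax.set_xlabel("Months (January - June 2020)")
ax.set_ylabel("Average Monthly Rents ($)")

# Annotation drawing attention to the highest rent.
ax.annotate("Highest rent", xy=(5, 2050), xytext=(3.0, 2060),
            arrowprops={"arrowstyle": "->"})
plt.show()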
Guidelines and pro tips Refer to the following guidelines and style checks for headlines, subtitles, labels, and annotations in your data visualizations. Think of these guidelines as guardrails. Sometimes data visualizations can become too crowded or busy. When this happens, the audience can get confused or distracted by elements that aren’t really necessary. The guidelines will help keep your data visualizations simple, and the style checks will help make your data visualizations more elegant.
Headlines
- Guidelines: Content: briefly describe the data. Length: usually the width of the data frame. Position: above the data.
- Style checks: Use brief language. Don’t use acronyms. Don’t use humor or sarcasm.
Subtitles
- Guidelines: Content: clarify context for the data. Length: same as or shorter than the headline. Position: directly below the headline.
- Style checks: Use a smaller font size than the headline. Don’t use undefined words. Don’t use acronyms.
Labels
- Guidelines: Content: replace the need for legends. Length: usually fewer than 30 characters. Position: next to data or below or beside axes.
- Style checks: Use a few words only. Use callouts to point to the data. Don’t use all caps, bold, or italic.
Annotations
- Guidelines: Content: draw attention to certain data. Length: varies, limited by open space. Position: immediately next to the data annotated.
- Style checks: Don’t use all caps, bold, or italic text. Don’t distract the viewer.
You want to be informative without getting too detailed. To meaningfully communicate the results of your data analysis, use the right visualization components with the right style. In other words, let simplicity and elegance work together to help your audience process the data you are sharing in five seconds or less. Accessible visualizations Hey, great to have you back, let's dive back in. Over 1 billion people in the world have a disability. That's more than the populations of the United States, Canada, France, Italy, Japan, Mexico, and Brazil combined. Before you design a data viz, it's important to keep that fact in mind. Not everyone has the same abilities, and people take in information in lots of different ways. You might have a viewer who's deaf or hard of hearing and relies on captions, or someone who's color blind might look to specific labeling for more description. We've covered a lot of ways to make a data visualization beautiful and informative. And now it's time to take that knowledge and make it accessible to everyone, including those with disabilities. Accessibility can be defined in a number of different ways. Right from the start, there are a few ways you can incorporate accessibility in your data visualization.
You'll just have to think a little differently. It helps to label data directly instead of relying exclusively on legends, which require color interpretation and more effort by the viewer to understand. This can also just make it a faster read for those with or without disabilities. Check out this data viz: the colors make it challenging to read and the legend is confusing. Now, if we just remove the legend and add in data labels, bam, you've got a clearer presentation. Another way to make your visualizations more accessible is to provide text alternatives, so that the content can be changed into other forms people need, such as large print, braille, or speech. Alternative text provides a textual alternative to non-text content. It allows the content and function of the image to be accessible to those with visual or certain cognitive disabilities. Here's an example that shows additional text describing the chart. And speaking of text, you can make data from charts and diagrams available in a text-based format through an export to Sheets or Excel. You can also make it easier for people to see and hear content by separating foreground from background. Using bright colors that contrast with the background can help those with poor visibility, whether permanent or temporary, clearly see the information conveyed. Another option is to avoid relying solely on color to convey information, and instead distinguish values with different textures and shapes. Another general rule is to avoid overcomplicating data visualizations. Overly complicated data visualizations turn off most audiences because they can't figure out where and what to focus on. That's why breaking down data into simple visualizations is key. A common mistake is including too much information in a single piece, such as long chunks of text or overloaded graphs and charts. This can defeat the whole purpose of your visualization, making it impossible to understand at first glance. Ultimately, designing with an accessibility mindset means thinking about your audience ahead of time. It means focusing on simple, easy-to-understand visuals and, most importantly, creating alternative ways for your audience to access and interact with your data. When you pay attention to these details, you can find solutions that make data visualizations more effective for everyone. So now you've completed your first exploration of data visualization. You've discovered the importance of creating data viz that cater to your audience while keeping focus on the objective. You learned different ways to brainstorm and plan your visualizations, and how to choose the best charts to meet that objective. And you also learned how to incorporate elements of science and even philosophy into your visualizations. Coming up, we'll check out how to take all of these learnings and apply them in Tableau. You'll get to see how this data visualization tool makes your data viz work more efficient and effective. See you soon. Designing a chart in 60 minutes By now, you understand the principles of design and how to think like a designer. Among the many options of data visualization is creating a chart, which is a graphical representation of data. Choosing to represent your data via a chart is usually the simplest and most efficient method. Let’s go through the entire process of creating any type of chart in 60 minutes. The goal here is to develop a prototype or mock-up of your chart that you can quickly present to an audience.
This will also enable you to have a sense of whether or not the chart is communicating the information that you want. The hour breaks down as follows: prep (5 minutes), talk & listen (15 minutes), sketch & design (20 minutes), and prototype & improve (20 minutes). Follow this high-level 60-minute chart to guide your thinking whenever you begin working on a data visualization. Prep (5 min): Create the mental and physical space necessary for an environment of comprehensive thinking. This means allowing yourself room to brainstorm how you want your data to appear while considering the amount and type of data that you have. Talk and listen (15 min): Identify the object of your work by getting to the “ask behind the ask” and establishing expectations. Ask questions and really concentrate on feedback from stakeholders regarding your projects to help you hone how to lay out your data. Sketch and design (20 min): Draft your approach to the problem. Define the timing and output of your work to get a clear and concise idea of what you are crafting. Prototype and improve (20 min): Generate a visual solution and gauge its effectiveness at accurately communicating your data. Take your time and repeat the process until a final visual is produced. It is alright if you go through several visuals until you find the perfect fit. Key takeaway This is a great overview you can use when you need to create a visualization in a short amount of time. As you become more experienced in data visualization, you will find yourself creating your own process. You will get a more detailed description of different visualization options in the next reading, including line charts, bar charts, scatter plots, and more. No matter what you choose, always remember to take the time to prep, identify your objective, take in feedback, design, and create. Welcome back. Mastering online tools like Tableau will make it easier for your audience to understand difficult concepts or identify new patterns in your data. Need to help a news outlet showcase changing real estate prices in regional markets? Check. Want to help a nonprofit use their data in better ways to streamline operations? Check. Need to explore what video game sales look like over the past few decades? Double check. Many different kinds of companies are using Tableau right now to do all of these things and more. This means there's a good chance you'll end up using it at some point in your career. But I'm getting ahead of myself. First, let's talk about what Tableau actually is. You might remember learning that Tableau is a business intelligence and analytics platform that you can use online to help people see, understand, and make decisions with data. But it's not all business all the time. Take this data viz, for example, created by Tableau enthusiast Steve Thomas to record Bigfoot sightings across the US. It's available on Tableau Public, which we'll be using together in our activities in this course. Tableau can help you make and easily share interactive dashboards, maps, and graphs with your data. Without any coding, you can connect to data in lots of formats like Excel, CSV, and Google Sheets. You might also find yourself working with a company that uses another option, like Looker or Google Data Studio, for example. Like Tableau, Looker and Google Data Studio help you take raw data and bring it to life visually, but each does this in different ways.
For example, while Tableau's offered in a variety of formats like browser and desktop, Looker and Google Data Studio are completely browser-based. But here's the great news. Once you learn the fundamentals of Tableau, you'll find they easily transfer to other visualization tools. Ready to get started using it? Then, without further ado, meet Tableau up next. Visualizations in spreadsheets and Tableau This reading summarizes the seven primary chart types: column, line, pie, horizontal bar, area, scatter, and combo. Then, it describes how visualizations in spreadsheets compare to those in Tableau. Primary chart types in spreadsheets In spreadsheets, charts are graphical representations of data from one or more sheets. Although there are many variations to choose from, we will focus on the most broadly applicable charts to give you a sense of what is possible in a spreadsheet. As you review these examples, keep in mind that these are meant to give you an overview of visualizations rather than a detailed tutorial. Another reading in this program will describe the applicable steps and process to create a chart more specifically. When you are in an application, you can always select Help from the menu bar for more information. To create a chart in Google Sheets, select the data cells, click Insert from the main menu, and then select Chart. You can set up and customize the chart in the dialog box on the right. To create a chart in Microsoft Excel, select the data cells, click Insert from the main menu, and then select the chart type. Tip: You can optionally click Recommended Charts to view Excel’s recommendations for the data you selected and then select the chart you like from those shown. These are the primary chart types available: Column (vertical bar): a column chart allows you to display and compare multiple categories of data by their values. Line: a line chart showcases trends in your data over a period of time. The last line chart example is a combo chart, which can include a line chart; refer to the description for the combo chart type. Pie: a pie chart is an easy way to visualize what proportion of the whole each data point represents. Horizontal bar: a bar chart functions similarly to a column chart, but is flipped horizontally. Area: area charts allow you to track changes in value across multiple categories of data. Scatter: scatter plots are typically used to display trends in numeric data. Combo: combo charts use multiple visual markers like columns and lines to showcase different aspects of the data in one visualization. An example is a combo chart that has a column and line chart together. You can find more information about other charts here: Types of charts and graphs in Google Sheets: a Google Help Center page with a list of chart examples you can download. Excel Charts: a tutorial outlining all of the different chart types in Excel, including some subcategories. How visualizations differ in Tableau As you have also learned, Tableau is an analytics platform that helps data analysts display and understand data. Most, if not all, of the charts that you can create in spreadsheets are available in Tableau. But Tableau offers some distinct charts that aren’t available in spreadsheets. These are handy guides to help you select chart types in Tableau: Which chart or graph is right for you? This presentation covers 13 of the most popular charts in Tableau. The Ultimate Cheat Sheet on Tableau Charts: This blog describes 24 chart variations in Tableau and guidelines for use.
The following are visualizations that are more specialized in Tableau, with links to examples or the steps to create them: Highlight tables appear like tables with conditional formatting. Review the steps to build a highlight table. Heat maps show intensity or concentrations in the data. Review the steps to build a heat map. Density maps show concentrations (like a population density map). Refer to instructions to create a heat map for density. Gantt charts show the duration of events or activities on a timeline. Review the steps to build a Gantt chart. Symbol maps display a mark over a given longitude and latitude. Learn more from this example of a symbol map. Filled maps are maps with areas colored based on a measurement or dimension. Explore an example of a filled map. Circle views show comparative strength in data. Learn more from this example of a circle view. Box plots, also known as box-and-whisker charts, show the distribution of values along a chart axis. Refer to the steps to build a box plot. Bullet graphs compare a primary measure with another and can be used instead of dial gauge charts. Review the steps to build a bullet graph. Packed bubble charts display data in clustered circles. Review the steps to build a packed bubble chart. Key takeaway This reading described the chart types you can create in spreadsheets and introduced visualizations that are more unique to Tableau. Misleading visualizations You can create data visualizations in Tableau using a wide variety of charts, colors, and styles. And you have tremendous freedom in the tool to decide how these visualizations will look and how they will present your data. Below is an example of a visualization created in Tableau: a heatmap listing different supplies and order dates (2010, 2011, 2012, and 2013), with the cells colored in yellow, green, and red. Study the visualization and think about these questions: Red normally indicates danger or a warning. Why do you think cells are highlighted in red? Green normally indicates a positive or “go” status. Is it clear why certain cells are highlighted in green? The purpose of the color coding isn’t clear without a legend, but can you guess what might have been the intent? Post your theory of what the colors mean. In the same post, share in 3-5 sentences (150-200 words) how this table could be misleading and how you would improve it to avoid confusion. Then, visit the discussion forum to browse what other learners shared and engage in at least two discussions about the visualization. Participation is optional. Stephen Few, an innovator, author, teacher, and data visualization expert, once said, "Numbers have an important story to tell. They rely on you to give them a clear and convincing voice." Facts and figures are very important in the business world, but they rarely make a lasting impression. To create strong communications that make people think and convince them to take action, you need data storytelling. Data storytelling is communicating the meaning of a data set with visuals and a narrative that are customized for each particular audience. A narrative is another word for a story. In this video, you'll learn about the data storytelling steps. These are: engage your audience, create compelling visuals, and tell the story in an interesting way. Here's an example from the music streaming industry. Some companies send their customers a year-in-review email. It highlights the songs the users have listened to most and sometimes congratulates them for being a top fan of a particular artist.
This is a much more exciting way to share data than just a printout of the customer's activity. It also reminds the listener about how much time they spend enjoying the service, a great way to build customer loyalty. Here's another example: some ride-sharing companies are using data storytelling to show their customers how many miles they've traveled and how that equals spending less money on gas, reducing carbon emissions, and saving time they might otherwise have spent fighting traffic. It makes it really easy for the rider to clearly see the value of the service in a simple and fun visual. Data stories like these keep the customer engaged and make them feel like their choices matter because the companies are taking the time to create something just for them, and importantly, the stories are interesting. Knowing how to reach people in this way is an essential part of data storytelling. Images can draw us in at a subconscious level. This is the concept of engaging people through data visualizations. So far you've been learning about the importance of focusing on your audience. Coming up, you'll keep building on that knowledge. You'll discover that there are three data storytelling steps, and the first is knowing how to engage your audience. Engagement is capturing and holding someone's interest and attention. When your audience is engaged, you're much more likely to connect with them and convince them to see the same story you see. Every data story should start with audience engagement; all successful storytellers consider who's listening first. For instance, when a kindergarten teacher is choosing books for their class, they'll pick ones that are appropriate for five-year-olds. If they were to choose high school level novels, the complex subject matter would probably confuse the kids and they'd get bored and tune out. The second step is to create compelling visuals. In other words, you want to show the story of your data, not just tell it. Visuals should take your audience on a journey of how the data changed over time or highlight the meaning behind the numbers. Here's an example: let's say a cosmetic company keeps track of stores that buy its product and how much they buy. You could communicate the data to others in a spreadsheet, or you could create a colorful visual such as a pie chart, which makes it easy to see which stores are most and least profitable as business partners. That's a much clearer and more visually interesting approach. Now, the third and final step is to tell the story in an interesting narrative. A narrative has a beginning, a middle, and an end. It should connect the data you've collected to the project objective and clearly explain important insights from your analysis. To do this, it's important that your data storytelling is organized and concise. Soon you'll learn how to do that using slides for discussion during a meeting and a formal presentation. We'll discuss how the content, visuals, and tone of your message change depending on the way you're communicating it. And speaking of business communications, one of the many ways that companies use visualization to tell data stories is with word clouds. Word clouds are a pretty simple visualization of data. The words are presented in different sizes based on how often they appear in your data set. It's a great way to get someone's attention and to unlock stories from big blocks of text where each word alone could never be seen. Word clouds can be used in all sorts of ways.
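As an optional aside, here is a minimal Python sketch of how a word cloud can be generated from a block of text. It assumes the third-party wordcloud package and matplotlib are installed (neither is a course requirement), and the sample text is invented for illustration.

# Optional sketch: generating a simple word cloud from a block of text.
# Assumes the third-party "wordcloud" package is installed (pip install wordcloud).
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = (
    "data analytics analysis spreadsheets SQL data visualization "
    "data storytelling dashboards stakeholders data data analysis"
)  # invented sample text

cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Each word's size in the resulting image reflects how often it appears in the text, which is exactly the story-at-a-glance effect word clouds are known for.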
On social media, they can show you which topics show up in posts most often, or you can use them in blogs to highlight the ideas that interest readers the most. This word cloud was created using text from the syllabus of this course. It tells a pretty engaging story where data analytics, analysis, SQL and spreadsheets are, unsurprisingly, some of the lead characters. Let's continue turning the pages of your data analytics story. There's lots of action and adventure to come. Effective data stories In data analytics, data storytelling is communicating the meaning of a dataset with visuals and a narrative that is customized for a particular audience. In data journalism, journalists engage their audience of readers by combining visualizations, narrative, and context into data-driven articles. It turns out that data analysts and data journalists have a lot in common! As a junior data analyst, you might learn a few things about effective storytelling from data journalism. Read further to explore the role and work of a data journalist in telling a good story. Note: This reading refers to an article published in The New Yorker. Non-subscribers may access several free articles each month. If you already reached your monthly limit on free articles, bookmark the article and come back to this reading later. Take a tour of a data-driven article Ben Wellington, a contributing writer for The New Yorker and a professor at the Pratt Institute, used New York City’s open data portal to track down noise complaints from logged service requests. He analyzed the data to gain a more quantitative understanding of where the noise was coming from and which neighborhoods were the noisiest. Then, he presented his findings in the Mapping New York's Noisiest Neighborhoods article. First, click the link above to skim the article and familiarize yourself with the data visualizations. Then, join the bus tour of the data! You will be directed to three visualizations (tour stops) to observe how each visualization helped strengthen the overall storytelling in the article. Tour stop 1: setting context Earlier in the training, you learned how context is important to understand data. Context is the condition in which something exists or happens. Based on the categorization of noise complaints, the data journalist set the context in the article by defining what people considered to be noise. In the article, review the combo table and bar chart that categorizes the noise complaints. Evaluate the visualization: How does the visualization help set the context? The combo table and bar chart is effective in summarizing the noise categories as percentages of the logged complaints. This helps set the context by answering the question, “what is noise?” Notice that the data journalist created a combo table and bar chart instead of a pie chart. With 11 noise categories, a list with a bar chart showing relative proportions is an elegant representation. A pie chart with 11 slices would have been harder to read. How does the visualization help clarify the data? If you add the percentages in the categories in the combo table and bar chart, the total is ninety-eight percent. There is a difference of two percent that can’t be accounted for in the visualization. So, rather than clarifying the data, the visualization actually causes a little confusion. One lesson is to always make sure that your percentages add up correctly. Sometimes rounding decimal places up or down causes percentages to be off so they don’t add up to 100%. 
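To make that rounding point concrete, here is a tiny, hypothetical Python example (the complaint counts are invented): each category's true share is 33.33%, every value rounds down to 33%, and the rounded percentages sum to only 99%.

# A hypothetical illustration of rounding error in percentages.
counts = {"Loud music": 1, "Construction": 1, "Barking dog": 1}   # invented counts
total = sum(counts.values())

rounded = {category: round(100 * count / total) for category, count in counts.items()}
print(rounded)                # {'Loud music': 33, 'Construction': 33, 'Barking dog': 33}
print(sum(rounded.values()))  # 99, so the displayed percentages no longer add up to 100

Common fixes are to show one more decimal place, add a footnote about rounding, or adjust the largest category so the displayed total still reaches 100%.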
Do you notice a data visualization best practice? You learned that a companion table in Tableau shows data in a different way in case some in your audience prefer tables. It appears that the data journalist had the same idea by using a combo table and bar chart. Note: As a refresher, a companion table in Tableau is displayed right next to a visualization. A companion table displays the same data as the visualization, but in a table format. You may replay the Getting Creative video which includes an example of a companion table. Tour stop 2: analyzing variables After setting the context by identifying the noise categories, the data journalist describes his analysis of the noise data. One interesting analysis is the distribution of noise complaints versus the time of day. In the article, review the stacked area chart for the distribution of noise complaints by hour of the day. Evaluate the visualization: How does the visualization perform against the five-second rule? Recall that the five-second rule states that you should understand what is being conveyed within the first five seconds of seeing a chart. We are guessing that this visualization performs quite well! The area charts for loud music and barking dogs help the audience understand that more of these types of noise complaints were made during late night and early morning hours (between 10:00 PM and 2:00 AM). Notice also that the color coding in the legend aligns with the colors in the chart. A chart legend normally has the largest category at the top, but the data journalist chose to order the legend so the largest category, “Loud music or party,” appears at the bottom instead. How much time do you think this alignment saved readers? How does the visualization help clarify the data? Unlike the visualization from the previous tour stop, this visualization does a better job of clearly showing that all percentages add up to 100%. Do you notice a data visualization best practice? As a best practice, both the x-axis and y-axis should be labeled. But the data journalist chose to include % or A.M. and P.M. with each tick on an axis. As a result, labeling the x-axis “Time of Day” and the y-axis “Percentage of Noise Complaints” isn’t required. This demonstrates that a little creativity with labeling can help you achieve a cleaner chart. Tour stop 3: drawing conclusions After describing how the data was analyzed, the data journalist shares which neighborhoods are the noisiest using a variety of visualizations: combo table and bar chart, density map, and neighborhood map. In the article, review the neighborhood map for how close a noisy neighborhood is to a quiet neighborhood. Evaluate the visualization: How does the visualization help make a point? The data journalist observed that one of the noisiest neighborhoods was right next to one of the quietest neighborhoods. The neighborhood map is effective in emphasizing this observation as a dark blue area versus a white area. How does the visualization help clarify the data? The visualization classifies the data by neighborhood and allows the audience to follow along when the journalist focuses specifically on the Williamsburg, East Williamsburg, and North Side/South Side neighborhoods. Do you notice a data visualization best practice? Each neighborhood is directly labeled so a legend isn’t necessary. End of the tour: being inspired We hope you enjoyed your tour of a data journalist’s work! May this inspire your data storytelling to be as engaging as possible.
For additional information about effective data storytelling, read these articles: What is Data Storytelling? The Art of Storytelling in Analytics and Data Science | How to Create Data Stories? Use Data and Analytics to Tell a Story Tell a Meaningful Story With Data Welcome back. When you want to communicate something to others, a great story can help you reach people's hearts and minds and make them more open to what you have to say. In other words, stories make people care. As you learned before, the first of the three data storytelling steps teaches us that for a story to be successful, you need to focus on who's listening. Data analysts do this by making sure that they're engaging their audience. That's what we'll explore together now. First, you need to know your audience. Think back to the example of telling someone a joke they've heard many times before and expecting them to laugh at the punchline. Not likely. To get the response you're seeking, you've got to understand your audience's point of view. That means thinking about how your data project might affect them. It helps to ask yourself a few questions. What role does this audience play? What is their stake in the project? What do they hope to get from the data insights I deliver? Let's say you're analyzing readership data from customers to help a magazine publisher decide if they should switch from quarterly to monthly issues. If your stakeholder audience includes people from the printing company, they're going to care because the change means they have to order paper and ink more frequently. They also might need to assign more staff members to the project. Or if your stakeholders include the magazine authors and editors, you'll want to keep in mind that your recommendations might change the way they work. For instance, they might need to write and edit stories at a faster pace than they're used to. Once you've considered the answers to those questions, it's time to choose your primary message. Every single part of your story flows from this one key point, so it's got to be clear and direct. With that in mind, let's think about the key message for the data project about our pretend magazine. Maybe the readership data from customers shows that print magazine subscriptions have been going down recently. You discover in survey data that this is mainly because readers feel the information is outdated, so this finding suggests that readers would probably appreciate a publication cycle that gets the information into their hands more often. But that's not all. Your reader survey data also shows that readers prefer shorter articles with quick takeaways. The data is generating a lot of possible decision points. The volume and variety of information in front of you may feel challenging. To get the key message, you'll need to take a few steps back and pinpoint only the most useful pieces. Not every piece of data is relevant to the questions you're trying to answer. A big part of being a data analyst is knowing how to eliminate the less important details. One way to do this is with something called spotlighting. Spotlighting is scanning through the data to quickly identify the most important insights. There are many ways to spotlight, but lots of data analysts like to use sticky notes on a whiteboard, like how archaeologists make sense of the artifacts they discover in a dig. To do this, you write each insight from your analysis on a piece of paper, spread them out, and display them on a whiteboard. Then you examine it.
It's important not to get bogged down in every tiny detail. Instead, look for broad universal ideas and messages. Try to find ideas or concepts that keep popping up again and again or numbers and words that are repeated often. Maybe you're finding things that look like they're connecting or forming patterns. Highlight these items or group them together on your whiteboard. Next, explore your discoveries. Find the meaning behind the numbers. The idea is to identify which insights are most likely to help solve your business problem or give you the answers you've been seeking. This is how spotlighting can lead you to your key message. Remember to keep your key message clear and concise, as an overly-long message like this one shown on screen has less chance of conveying the most important conclusion. Here's a clear, concise message that's likely to engage your audience because it's short and to the point. Of course, no matter how much time and effort you put into studying your audience, you can't predict exactly how they'll react to your recommendations. But if you follow the steps we're discussing, you'll be much more likely to have good results. In an upcoming video, you'll learn how to deal with situations that don't go quite according to plan. That's okay. It happens to all of us. Have you ever been driving a car when one of the warning lights on the dashboard suddenly comes on? Maybe the gas gauge starts blinking because you're getting low on fuel. It's handy when you have that alert right in front of you, clearly showing you that you need to pay attention to your gas level. Can you imagine if cars didn't have dashboards? We'd never know if we were about to run out of gas. We'd have no idea if our tire pressure was low or if it was time for an oil change. Without dashboards, if our cars started acting differently, we'd have to pull out the user manual, sift through all that information inside, and try to figure out the problem ourselves. Car dashboards make it easy for drivers to understand and respond to any issues with their vehicles because they're constantly tracking and analyzing the car status. But as you've been learning, dashboards aren't just for cars. Companies also use them to share information, get people engaged with business plans and goals, and uncover potential problems. Just like a car's dashboard, data analytics dashboards take tons of information and bring it to life in a clear, visually-interesting way. This is extremely important when telling a story with data, which is why it's a big part of number two in our three data storytelling steps. You've learned that a dashboard is a tool that organizes information from multiple data sets into one central location for tracking, analysis, and simple visualization through tables, charts, and graphs. Dashboards do this by constantly monitoring live incoming data. As we've been discussing, you can make dashboards that are specifically designed to speak to your stakeholders. You can think about who will be looking at the data and what they need from it and how often they'll use it. Then you can make a dashboard with the perfect information just for them. This is helpful because people can get confused and distracted when they're presented with too much data. A dashboard keeps things neat and tidy and easy to understand. 
When designing a dashboard, it's best to start simple with just the most important data points, and if later on you discover something's missing, you can always go back and tweak your dashboard or create a new one. An important part of dashboard design is the placement or layout of your charts, graphs, and other visuals. These elements need to be cohesive, which means they're balanced and make good use of the space on the dashboard. After you decide what information should be on your dashboard, you might need to resize and reorganize it so it works better for your users. One option in Tableau is choosing between a vertical or horizontal layout. A vertical layout adjusts the height. A horizontal layout resizes the width of the views and objects it contains. Also, as you can see here, evenly distributing the items within your layout helps create a clear and organized data visual. You can select either tiled or floating layouts. Tiled items are part of a single-layer grid that automatically resizes based on the overall dashboard size. Floating items can be layered over other objects. In this example, the map and scatter plots are tiled—they don't overlap. This really helps make clear what the data is all about, which is valuable because the majority of people in the world are visual learners—they process information based on what they see. That's why sharing your dashboards with stakeholders is such a valuable practice. Now there's something important to keep in mind about that. Sharing dashboards with others likely means that you'll lose control of the narrative; in other words, you won't be there to tell the story of your data and share your key messages. Dashboards put storytelling power in the hands of the viewer. That means they'll craft their own narrative and draw their own conclusions, but don't let that scare you away from being collaborative and open. Just understand the risks that come with sharing your dashboards. After all, sharing information and resources means that you'll have more people working on the solution to a big problem or coming up with that next big idea. This leads to more connections, which can result in really exciting new practices and innovations.
Live and static insights
Previously, you learned about data storytelling and interpreting your dataset through a narrative. In this reading, you will explore the difference between live and static insights to make your data even clearer.
[Image: a man driving; his car's dashboard is made up of a bar chart, pie chart, line graph, and heatmap]
Live versus static
Identifying whether data is live or static depends on certain factors:
How old is the data?
How long until the insights are stale or no longer valid to make decisions?
Does this data or analysis need updating on a regular basis to remain valuable?
Static data involves providing screenshots or snapshots in presentations or building dashboards using snapshots of data. There are pros and cons to static data.
PROS
Can tightly control a point-in-time narrative of the data and insight
Allows for complex analysis to be explained in-depth to a larger audience
CONS
Insight immediately begins to lose value and continues to do so the longer the data remains in a static state
Snapshots can't keep up with the pace of data change
Live data means that you can build dashboards, reports, and views connected to automatically updated data.
PROS
Dashboards can be built to be more dynamic and scalable
Gives the most up-to-date data to the people who need it at the time when they need it
Allows for up-to-date curated views into data with the ability to build a scalable "single source of truth" for various use cases
Allows for immediate action to be taken on data that changes frequently
Alleviates time/resources spent on processes for every analysis
CONS
Can take engineering resources to keep pipelines live and scalable, which may be outside the scope of some companies' data resource allocation
Without the ability to interpret data, you can lose control of the narrative, which can cause data chaos (i.e. teams coming to conflicting conclusions based on the same data)
Can potentially cause a lack of trust if the data isn't handled properly
Key takeaways
Analysts need to familiarize themselves with the business and data so they can recommend when an updated static analysis is needed or should be refreshed. Also, this data insight will help you make the case for what sorts of analyses, visualizations, and additional data are recommended for the types of decisions that the business needs to make. Keep this customer survey spreadsheet on hand as it will be useful for the next video.
So far, we've focused a lot on understanding our audience. Whether you're trying to engage people with data storytelling or creating dashboards designed for a certain person or group, understanding your audience is key. As you've learned, you can make dashboards that are tailored to meet different stakeholder requirements. To do this, it's important to think about who will be looking at the data and what they need from it. In this video, we'll continue exploring how to create compelling visuals to tell an interesting and persuasive data story. One great tool for doing this is a filter. You've learned about filters in spreadsheets and queries, but as a refresher, filtering means showing only the data that meets specific criteria while hiding the rest. Filtering works the same way with dashboards—you can apply different filters for different users based on their needs. Tableau lets you limit the data you see based on the criteria you specify. Maybe you want to filter the data in the data set to show only the last six months, or maybe you want to see information from one particular customer. You can even limit the number of rows or columns in a view. To explore these options, let's return to our world happiness example. Say your stakeholders were interested in only a few of the topics that affect overall happiness. Filtering for just gross domestic product, family, generosity, freedom, trust, and health, and then creating individual scatter plots for each would make this possible. You can also use filters to highlight or hide individual data points. For instance, if you have a scatter plot with outliers, you may want to explore what your plot would look like without them. However, note that this is just an example to show you how filters work; it's not okay to drop a data point just because it's an outlier. Outliers could be important observations, sometimes even the most interesting ones, so be sure to put on your data detective hat and investigate that outlier before deciding to remove it from your dashboard. Here's how to do it.
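If it helps to see that refresher outside of any one tool, here is a minimal sketch of the same filtering idea written as SQL. The table and column names (orders, order_date, customer_id) are made up for illustration, and the date function follows BigQuery-style syntax, so other dialects may differ slightly; the Tableau point-and-click steps continue right after this sketch.

-- Hypothetical orders table; keep only the last six months of data
SELECT *
FROM orders
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH);

-- Keep only the rows for one particular customer
SELECT *
FROM orders
WHERE customer_id = 'CUST-001';

Everything that doesn't meet the condition in the WHERE clause is simply left out of the result, which is exactly what a dashboard filter does for the viewer.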
To filter data points from the view, we can choose a single data point or click and drag in the view to select several points. Let's choose just one. Then on the tooltip that appears, we'll select "exclude" to hide it, or we could have chosen to do it the other way by keeping only selected data points. Here's another example. If your data is in a table, you can filter entire rows or columns from your view. To do this, we'll select the rows we want in the view. Then, on the tooltip that appears, we'll choose to keep only those countries. Again, we could have also selected the data points we wanted to exclude and picked that option instead. Or if you like, we can even prefilter a Tableau dashboard. This means that your stakeholders don't have to filter the data themselves. Basically, by doing the filtering for them, you can save them time and effort and direct them to the important data you want them to focus on. Personally, I think the best thing about filters is they let you zero in on what's important. Sometimes I'm working with a huge data set, and I want to concentrate only on a specific area, so I'll add a filter to limit the data displayed on my dashboard. This cuts the clutter and gives me a simple, clear visual. I use filters a lot when working with data about advertising campaign performance. Filters help me isolate specific tactics, such as search or YouTube ads, to see which ones are working best and which ones could be improved. By limiting and customizing the information I'm looking at, it's much easier for me to see the story behind the numbers. And as I'm sure you've noticed, I love a good data story. As a data analyst, you'll often be relying on spreadsheets to create quick visualizations of your data to tell your story. Let's practice building a chart in a spreadsheet. To follow along, use the spreadsheet link in the previous reading, also included in the video. We'll be using Google Sheets, so this might look a little different in other spreadsheet platforms, like Excel. We'll begin by filtering just the data on how many customers purchase the basic, plus, or premium software packages. To start, select the column for the software package and insert a chart. The spreadsheet suggests what it thinks is the best type of chart for our data, but we can choose any type of chart we'd like. Spreadsheet charts also let you assign different styles, axis titles, a legend, and many other options. Feel free to explore the different functionality later on. We'll also cover this more in a reading. There are lots of different options to choose from. Let's say we also have data on which countries our customers are from and their overall satisfaction score for the software they purchased. First, highlight columns A and B, then click on "insert" and then "chart" again. Under "chart type," you want to select the first map option. Voila! Now we have a map that summarizes customer survey scores by country. We can also customize this chart by clicking "customize" in the top right corner. Let's say we wanted to change our colors from red and green to a gradient so it's more accessible. We can do that by clicking "geo" and then changing the min color to the lightest shade of blue, the mid color to the middle shade of blue, and the max color to the darkest shade of blue to show the spectrum of scores from low to high.
Now we have a map chart that shows where respondents are most satisfied with their software in dark blue and least satisfied with their software in light blue. And this will be easier for anyone in our audience with color vision deficiencies to understand. Tableau and spreadsheets are common tools for creating data visualizations. By using their built-in functionalities like filters and charts, you can zero in on what information is most important and create compelling visuals for your audience. And now that we've explored some ways to create visuals, it's time to start preparing our data narrative. Coming up, we're going to talk more about telling stories with data and organizing presentations. I'll see you soon.
Hands-On Activity: Build a dashboard in Tableau
Activity overview
The video you just watched showed you how to create a dashboard in Tableau. Now, you can use the template, dataset, and instructions in this activity to create the visualization yourself. Feel free to refer back to the previous video if you get stuck. In previous activities, you linked data sources and created data visualizations. Now, you'll use what you learned about the process of data visualization to add data to a dashboard. By the time you complete this activity, you will be able to create and use a dashboard to present data in an accessible and interactive way. This will enable you to communicate your work and display dynamic data in professional settings. Note: You will need the Tableau Public Desktop app to import the Dashboards Starter Template in this activity. For more information on downloading the Tableau Public app, see the Reading: Optional: Using Tableau Desktop. If you are unable to download the app to your device, use the two visualizations you created in the last Tableau activities as Sheet 1 and Sheet 2 of this activity.
What you will need
A starter template with a few existing data sources and visualizations and a data set have been provided. Click the link to the folder containing the starter template and data set. If you are logged into your Google Account: Click and drag to highlight both the template and the data set. Then, right-click on the selected files and click Download. If you are not logged into your Google Account: To download both items, click the DOWNLOAD ALL button in the top right corner of the page. You do not need a Google account to download the files. Download the starter template and data set: Starter template and data set
Open the template and load the data
In a business context, data visualizations are most useful when they are presented in a dashboard-style format to stakeholders. Dashboards put all the pertinent information in the same place, making it easier to understand the important takeaways. Many dashboards are also constantly updating to reflect new data, and some are even interactive. No matter what style of dashboard you choose, they can help you deliver the work you've done when creating visualizations. Now it's time to begin the activity. After you download the Dashboards Starter Template, find the file in your storage and open it in Tableau Public Desktop. Upon opening the Tableau project template, your screen should look like this: [Image: a Tableau worksheet with multiple multi-colored lines, each representing different data] The Dashboards Starter Template workbook allows you to explore and manipulate the visualizations found in two sheets: Sheet 1 and Sheet 2. However, the Tableau workbook does not contain the actual dataset.
Next, you will load the dataset. To load the actual dataset: 1. Click the Data Source tab in the bottom left-hand corner of the window. This will open the Datasources folder Tableau Public has created on your computer by default. 2. Navigate to the location on your computer where you downloaded the World Bank CO2 dataset and open it. 3. Locate the My Tableau Repository folder on your computer. This is usually placed in the Documents folder of your local files. If you cannot find the folder, use the search bar in your computer’s file explorer. 4. Double-click the folder My Tableau Repository, then double-click the folder Datasources. 5. Drag your datasets for Tableau from where you downloaded them into the Datasources folder. This will help you keep track of your datasets for various projects and stay organized. Note: As a best practice, you should always move your datasets for Tableau into the Datasources folder. Create a dashboard The example project contains the World Bank CO2 dataset, with two separate visualizations. Click Sheet 1. This visualization shows the average CO2 per capita of each country. Now, click Sheet 2. This visualization is a line chart of the CO2 production of each global region over time. You will use these visualizations to create a dashboard. Click the Add Dashboard button, which is the middle button on the bottom row with a symbol that appears like a spreadsheet with a plus sign. This will open a new dashboard. Your screen should appear like this: Now, you just need to add some visualizations to your dashboard. Add visualizations To add visualizations, drag the appropriate sheets onto the dashboard in the layout that you prefer. In this case, you’ll add the map visualization from Sheet 1 on top of the line graph from Sheet 2. 1. Start by finding Sheet 1 in the Sheets section on the left side of the screen. Click and drag Sheet 1 onto the area that says Drop sheets here. Your screen should appear like this: 2. Click and drag Sheet 2 onto the visualization. You’ll notice that the visualization adjusts to show the layout depending on where you drag the sheet. Place Sheet 2 so that it takes up the bottom half. Clean the dashboard The dashboard currently contains three legends, but only two of them are needed. The topmost legend of grayscale values represents the CO2 Per Capita by size. CO2 per capita is represented by size and color. As such, Tableau creates two legends. To simplify the visualization, your best choice is to delete the topmost legend that corresponds to size. The relationship between small and large emissions can be interpreted by the relative sizes of the circles. However, the color representing the number of emissions per capita is not interpretable without the legend. 1. Delete the topmost legend. To do this, click it and then click the X attached to it to remove it from the dashboard. Now that it’s been removed, you’ll set the remaining legends to float. 2. Click on a legend. 3. Click the arrow pointing downwards for More Options. From there, select Floating. 4. Drag the legend onto the top-right corner of the map visualization. 5. Repeat steps 2-4 and float the remaining legend onto the top-right corner of the bottom graph. Once you’ve done it, your dashboard should appear like this: You’ve now created a basic dashboard. Tableau contains tons of other functionality that allows for dashboards that update in real-time or interactive dashboards and visualizations. Businesses everywhere know the power of using data to solve problems and achieve goals. 
But all the data in the world won't get you anywhere if your stakeholders can't understand it or if they can't stay focused on what you're telling them. So you want to create presentations that are logically organized, interesting, and communicate your key messages clearly. An effective presentation supports your narrative by making it more interesting than words alone. It starts with how you want to organize your data insights. The narrative you share with your stakeholders needs characters, a setting, a plot, a big reveal, and an "aha moment," just like any other story. The characters are the people affected by your story. This could be your stakeholders, customers, clients, and others. When adding information about your characters to your story, you have a great opportunity to include a personal account and bring more human context to the facts that the data has revealed—think about why they care. Next up is the setting, which describes what's going on, how often it's happening, what tasks are involved, and other background information about the data project that describes the current situation. The plot, sometimes called the conflict, is what creates tension in the current situation. This could be a challenge from a competitor, an inefficient process that needs to be fixed, or a new opportunity that the company just can't pass up. This complication of the current situation should reveal the problem your analysis is solving and compel the characters to act. The big reveal, or resolution, is how the data has shown that you can solve the problem the characters are facing by becoming more competitive, improving a process, inventing a new system, or whatever the ultimate goal of your data project may be. Finally, your "aha moment" is when you share your recommendations and explain why you think they'll help your company be successful. When I'm working on a presentation, this is where I like to start, too. Using these basic elements to outline your presentation could be a great place to start, and they can help you organize your findings into a clear story. And once you've decided on these five key parts of your story, it's time to think about how to pair your narrative with interesting visuals because, as you're learning, an interesting and persuasive data story needs interesting and persuasive visuals. Coming up, you'll learn even more about how to be an expert data storyteller. [MUSIC] Hi, my name is Sundas, and I'm an analytical lead at Google. My role is turning data into powerful stories that influence business decisions. I have an untraditional background: there's a six-year gap between my high school and my college career. So for me, when I was trying to start all over again, I started at a community college. That was my first exposure to online learning, and it was perfect because I was managing kids at home. At Google, we talk a lot about imposter syndrome, and I personally relate to it quite a bit. Being the first female in my family to graduate university and also being an immigrant, a lot of times I'm surrounded by people who do not look like me. For example, there was one time where I was presenting to senior leaders in my org, and I was so nervous presenting to them. I was like, I'm going to totally blow this up, and they're going to figure out that I'm totally a fraud and a fake.
One of the things that I changed is that, even though I was the only female on my team, I started networking, I started expanding my network. And I met a lot of women who are from the country where I am from, and they were also immigrants. They also struggled with English, they also looked like me, and they were doing very well in their careers; they were being successful. So when I looked at them, I was like, okay, if they can do it, then so can I. That for me was a very big confidence boost to kind of get over that imposter syndrome feeling. But I struggle with it day to day. I'm struggling with it right now, standing in front of you: do I even deserve to be talking about my journey and my skills? So it's completely normal. There are a few things that I like to do. One is that I like to give myself a pep talk. A pep talk definitely works; just saying you're totally worth it, you deserve it, does wonders for me personally. The second thing I like to do is keep a log of my successes and failures. So when I am at a down point, when I'm feeling down or feeling I do not belong here, I look at all the things that I have achieved from that log, and that kind of helps me. That's a good reminder of the hard work that I put in to get here: I did not get here because of luck, I got here because I worked hard and I earned it. My family is actually really proud of me. After seeing me go to school and graduate with two kids, my younger brother actually went to school with two kids as well, and he graduated. He finished his master's program. And my sister-in-law, who also had two kids she was managing, saw that I could do it and had somebody to look up to. And so my sister-in-law went back to school and finished her degree as well. So I think just being the first in my family was really hard, because I didn't have anybody to look up to. But now I am that person that people in my family can look up to, specifically girls, and they can pursue whatever they put their minds to. Hi again. Earlier in this program, you learned how to keep your audience in mind when communicating your data findings. By making sure that you're thinking about who your audience is and what they need to know, you'll be able to tell your story more effectively. In this video, we'll learn how to use a strategic framework to help your audience understand the most important takeaways from your presentation. To make your data findings accessible to your audience, you'll need a framework to guide your presentation. This helps to create logical connections that tie back to the business tasks and metrics. As a quick reminder, the business task is the question or problem your data analysis answers. The framework you choose gives your audience context to better understand your data. On top of that, it helps keep you focused on the most important information during your presentation. The framework for your presentation starts with your understanding of the business task. Raw data doesn't mean much to most people, but if you present your data in the context of the business task, your audience will have a much easier time connecting with it. This makes your presentation more informative and helps you empower your audience with knowledge. That's why understanding the business task early on is key.
Here's an example. Let's say we're working with a grocery store chain. They've asked us to identify trends in online searches for avocados to help them make seasonal stocking decisions. During our presentation, we want to make sure that we continue focusing on this task and framing our information with it. Let's check out this example slide presentation. We can begin our presentation by framing it with the business task here. In this second slide, I've added goals for the discussion. It starts with "share an overview of historical online avocado searches." Under that, a more detailed explanation: "We'll cover how avocado searches have grown year over year and what that means for your business." Then we'll "examine seasonal trends in online avocado searches using historical data." This is important because "understanding seasonal trends can help forecast stocking needs and inform planning." And finally, "discuss any potential areas for further exploration." This is where we'll address next steps in the presentation. This clearly outlines the presentation so our audience knows what to expect. It also lets them know how the information we share is going to be connected to the business task. You might remember, we talked about telling a story with data before. You can think of this like outlining the narrative. We can do the same thing with our data viz examples. If we're showing this visual graph of annual searches for avocados, we might want to frame it by saying this graph shows the months with the most online searches for avocados last year, so we can expect that this interest in avocados will fall on the same months this year. That can even be used in our speaker notes for the slide. This is a great place to add important points you want to remember during the presentation ahead of time. These notes aren't visible to your audience in presentation mode, so they're great reminders you can refer to as you present. Plus, you could even share your presentation with speaker notes ahead of time to make the content more accessible for your audience. Using this data, the grocery store can anticipate demand and make a plan to stock enough avocados to match their customers' interests. That's just one way we can use the business task to frame our data and make it easier to understand. You also want to make sure you're outlining and connecting with your business metrics. By showcasing what business metrics you used, you can help your audience understand the impact your findings will have. Think about the metrics we used for our avocado presentation. We tracked the number of online searches for avocados from different months over several years to anticipate trends and demand. By explaining this in our presentation, it's easy for our audience to understand how we used our data. These data points alone—the dates or number of searches—aren't useful for our audience, but when we explain how they're combined as metrics, the data we're sharing makes so much more sense. Here's another potential data viz that we want to use.
We can frame it for our audience by including some of our metrics. There's an explanation of what time period this data covers: "Our data shows Google search queries from 2004 to 2018." Where we gathered this data from: "Search queries are limited to the United States only." And a quick explanation of how the trends are being measured: "Google trends scores are normalized at 100." So now that our audience understands the metrics we used to organize this data, they'll be able to understand the graph more clearly. Using a strategic framework to guide your presentation can help your audience understand your findings, which is what the sharing phase of the data analysis process is all about. Coming up, we'll learn even more about how to weave data into your presentations. Hey, great to have you back. So we know how to use our business tasks and metrics to frame our data findings during a presentation. Now let's talk about how you work data into your presentations to help your audience better understand and interpret your findings. First, it's helpful for your audience to understand what data was available during data collection. You can also tell them if any new relevant data has come up, or if you discovered that you need different data. For our analysis, we used data about online searches for avocados over several years. The data we collected includes all searches with the word "avocado," so it includes a lot of different kinds of searches. This helps our audience understand what data they're actually looking at and what questions they can expect it to answer. With the data we collected on searches containing the word avocado, we can answer questions about the general interest in avocados. But if we wanted to know more about something specific, like guacamole, we'd probably need to collect different data to better understand that part of our search data. Next, you'll want to establish the initial hypothesis. Your initial hypothesis is a theory you're trying to prove or disprove with data. In this example, our business task was to compile average monthly prices. Our hypothesis is that this will show clear trends that can help the grocery store chain plan for avocado demand in the coming year. You want to establish your hypothesis early in the presentation. That way, when you present your data, your audience has the right context to put it in. Next, you'll want to explain the solution to your business task using examples and visualizations. A good example is the graph we used last time that clearly visualized the search trend score for the word avocado from year to year. Raw data could take time to sink in, but a good example or visualization can make it much easier for your audience to understand you during a presentation. Keep in mind, presenting your visualizations effectively is just as important as the content, if not more. And that's where the McCandless Method we learned about earlier can help. So let's talk through the steps of this method and then apply them to our own data visualizations. The McCandless Method moves from the general to the specific, like it's building a pyramid. You start with the most basic information: introduce the graphic you're presenting by name. This directs your audience's attention. Let's open the slide deck we were working on earlier.
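One quick aside before we open that deck: the slide above says the trend scores are "normalized at 100." If you're wondering what that could look like in practice, here is a small, hypothetical SQL sketch that rescales raw weekly counts so the busiest week scores 100. The table and column names are made up, and this is only one common way to build a 0-100 index; it is not a claim about how Google Trends computes its own scores.

-- Hypothetical table: weekly_searches(week_start DATE, search_count INT64)
SELECT
  week_start,
  ROUND(100.0 * search_count / MAX(search_count) OVER ()) AS trend_score
FROM weekly_searches
ORDER BY week_start;

Reading the output, a trend_score of 100 marks the peak week, and a score of 50 means roughly half as many searches as that peak, which is the kind of framing your audience needs to interpret the chart.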
We've got the framework we explored last time and our two data viz examples. According to the McCandless Method, we want to introduce our graphic by name. The name of this graph, "yearly avocado search trends," is clearly written here. When we present it, we'll be sure to share that title with our audience so they know where to focus and what the graphic is all about. Next, you'll want to answer the obvious questions your audience might have before they're asked. Start with the high-level information and work your way into the lowest level of detail that's useful to your audience. This way, your audience won't get distracted trying to understand something that could have easily been answered when the graphic was introduced. We added in the information about when, where, and how this data was gathered to frame this data viz. But it also answers the first question many stakeholders will ask, "Where is this data from, and what does it cover?" So going back to the second graph in our presentation, let's think about some obvious questions our audience might have when they see this graph at first. This data viz is really interesting, but it can be hard to understand at a glance, so our audience might have questions about how to read it. Knowing that, we can add an explanation to our speaker notes to answer these questions as soon as this graph is introduced. "This shows time running in a circle with winter months on top and summer on bottom. The farther elements are away from the center, the more queries happened around that time for 'avocado.'" Now some of the answers to these questions are built into our presentation. Once you've answered any potential questions your audience might have, you'll want to state the insight your data viz provides. It's important to get everyone on the same page before you move into the supporting details. We can write in some key takeaways to this slide to help our audience understand the most important insights from the graphic. Here we let the audience know that this data shows us a consistent seasonal trend year over year. We can also see that there's low online interest in avocados from October through December. This is an important insight that we definitely want to share. Even though avocados are a seasonal summer fruit, searches peak in January and February. For a lot of people in the United States, watching the Super Bowl and eating chips with guacamole is popular this time of year. Now our audience knows what takeaways we want them to have before moving on. The fourth step in the McCandless Method is calling out data to support that insight. This is your chance to really wow your audience, so give as many examples as you can. With our avocado graphs, it might be worth pointing to specific examples. In our monthly trends graph, we can point to specific weeks recorded here. "During the week of November 25th, 2018, the search score was around 49, but the week of February 4th the search score was 90. This shows the rise and fall of online search interest, with the help of some of the very cool data in our graphs." Finally, it's time to tell your audience why it matters. This is the "so what" moment. Why is this insight interesting or important to them? This is a good time to present the possible business impact of the solution and clear action stakeholders can take.
You might remember that we outlined this in our framework at the beginning of our presentation. So let's explain what this data helps our grocery store stakeholder do. First, they can account for lower interest in avocados between the months of October and December. They can also prepare for the Super Bowl surge in avocado interest in late January/early February. And they'll be able to consider how to optimize stocking practices during summer and spring. There's a little more detail under each of these points, but this is a basic breakdown of the impact. And that's how we use the McCandless Method to introduce data visualizations during our presentations. I have one more piece of advice. Take a second to self-check and ask yourself, "Does this data point or chart support the point I want people to walk away with?" It's a good reminder to think about your audience every time you add data to a presentation. So now you know how to present data using a framework and weave data into your presentation for your audience. And you got to learn the McCandless Method for data presentation. Coming up, we'll learn some best practices for actually creating presentations. See you soon.
Step-by-step critique of a presentation
This reading provides an orientation to two upcoming videos:
Connor: Messy example of a data presentation
Connor: Good example of a data presentation
To get the most out of these videos, you should watch them together (back to back). In the first video, Connor introduces a presentation that is confusing and hard to follow. In the second video, he returns to talk about what can be done to improve it and help the audience better understand the data and conclusions being shared.
Messy data presentation
In the first video, watch and listen carefully for the specific reasons the "messy" presentation falls short. Here is a preview:
No story or logical flow
No titles
Too much text
Inconsistent format (no theme)
No recommendation or conclusion at the end
Messy presentation: people don't know where to focus their attention
The main problem with the messy presentation is the lack of a logical flow. Notice also how the data visualizations are hard to understand and appear without any introduction or explanation. The audience has no sense of what they are looking at and why. When people in the audience have to figure out what the data means without any help, they can end up being lost, confused, and unclear about any actions they need to take.
Good data presentation
In the second video, numerous best practices are applied to create a better presentation on the same topic. This "good" presentation is so much easier to understand than the messy one! Here is a preview:
Title and date the presentation was last updated
Flow or table of contents
Transition slides
Visual introduction to the data (also used as a repeated theme)
Animated bullet points
Annotations on top of visuals
Logic and progression
Limitations to the data (caveats) - what the data can't tell you
Tip: As you watch this video, take notes about what Connor suggests to create a good presentation. You can keep these notes in your journal. When you create your own presentations, refer back to your notes. This will help you to develop your own thinking about the quality of presentations.
Good presentation: people are logically guided through the data
The good presentation logically guides the audience through the data – from the objectives at the beginning all the way to the conclusions at the end. Notice how the data visualizations are introduced using a common theme and are thoughtfully placed before each conclusion. A good presentation gives people in the audience the facts and data, helps them understand what the data means, and provides takeaways about how they can use their understanding to make a change or do some good.
Up next
Get started with the messy vs. good presentation comparison by viewing the first video: Connor: Messy example of a data presentation.
Hey there. So far we've learned about using a framework to guide your audience through your presentation and how to weave data in. Now I want to talk about why these presentation skills are so important and give you some simple tips you can use during your own presentations. As a data analyst, you have two key responsibilities: analyze data and present your findings effectively. Analyzing data seems pretty obvious. It's in the title "data analyst," after all. But data analysis is all about turning raw information into knowledge. If you can't actually communicate what you've learned during your analysis, then that knowledge can't help anyone. There's plenty of ways data analysts communicate: emails, memos, dashboards, and of course, presentations. Effective presentations start with the things we've already talked about, like creating effective visualizations and organizing your slides, but how you deliver those things can make a big difference in how well your audience understands them. You want to make sure they leave your presentation empowered by the knowledge and ready to make decisions based on your analysis. That's why strong presentation skills are so important as a data analyst.
If the idea of giving a presentation makes you nervous, don't worry—a lot of people feel that way. Here's a secret: it gets easier the more you practice. Now let's look at some tips and tricks you can use when giving your presentations. We'll go over some more advanced ones later, but let's start with the basics for now. It's natural to feel your adrenaline levels rise before giving a presentation. That's just because you're excited to be there. To help keep that excitement in check, try taking deep, controlled breaths to calm your body down. As a bonus, this will also help you channel all that excitement into a presentation style that shows your passion for the work you've done. You might remember we talked earlier about using the McCandless Method to present data visualizations. Well, it's also a good rule of thumb for presentations in general. Start with the broader ideas, the obvious questions your audience might have, and what they need to understand to put your findings in context. Then you can get more specific about your analysis and the insights you've uncovered. Let's go back to our avocado example and imagine how we'd start that presentation. After we introduce ourselves and the title of our presentation, we have a slide with our goals for the discussion. We start with the most general goals and then get more specific. We might say our goal for today is to first provide you all with the state of the world on online avocado searches. Then we'll examine the opportunities and risks of seasonal trends in online avocado searches. We'll move into actionable next steps that can help you start taking advantage of these opportunities, as well as help to mitigate the risks. Finally, we'd love to make the third part a discussion with you about what you think of these next steps. What you'll want to notice here is how our presentation focuses on the general interest in avocados online before getting into specifics about what that means for our stakeholders. We also learned about the five-second rule. As a quick refresher, whenever you introduce a data visualization, you should use the five-second rule and ask two questions. First, wait five seconds after showing a data visualization to let your audience process it, then ask if they understand it. If not, take time to explain it, then give your audience another five seconds to let that sink in before telling them the conclusion you want them to understand. Try not to rush through data visualizations. This will be the first time some of the people in your audience are encountering your data, and it's worth making time in your presentations for them. Here's our first data viz in the avocado presentation. When we get to this slide, we want to introduce our yearly avocado search trends graph and explain the basic background we've included here. After we wait five seconds, we can ask, "Are there any questions about this graph?" Let's say one of our stakeholders asks, "Could you explain Google search trends?" Great. After explaining that, we wait another five seconds, then we can tell them our conclusion: Searches for avocados have been increasing every year. You'll learn more about these concepts later on, but these are some great tips for starting out. Finally, when it comes to presenting data, preparation is key. For some people, that means doing dress rehearsals. For others, it means writing out a script and repeating it in their head. Others find visualizing themselves giving the presentation helps. Try to find a method that works for you. 
The most important thing to remember is that the more prepared you are, the better you'll perform when the lights are on and it's your turn to present. Coming up, we'll cover more best practices for presentations and also look at some examples. Looking forward to it.
Guide: Sharing data findings in presentations
Use this guide to help make your presentation stand out as you tell your data story. Follow the recommended tips and slide sequence in this guide for a presentation that will truly impress your audience. You can also download this guide as a PDF, so you can reference it in the future: Sharing your data findings in presentations _ Tips and Tricks.pdf
Telling your data story (tips and tricks to present your data and results)
Use the following tips and sample layout to build your own presentation.
Tip 1: Know your flow
Just like in any good story, a data story must have a good plot (theme and flow), good dialogue (talking points), and a great ending or big reveal (results and conclusions). One flow could be an overview of what was analyzed followed by resulting trends and potential areas for further exploration. In order to develop the right flow for your presentation, keep your audience in mind. Ask yourself these two questions to help you define the overall flow and build out your presentation.
Who is my audience?
If your intended audience is executives, board members, directors, or other C-level (C-suite) executives, your storytelling should be kept at a high level. This audience will want to hear about your story but might not have time to hear the entire story. Executives tend to focus on endings that encourage improving, correcting, or inventing things. Keep your presentation brief and spend most of your time on your results and recommendations. Refer to an upcoming topic in this reading—Tip 3: End with your recommendations. If your intended audience is stakeholders and managers, they might have more time to learn about how you performed your analysis and they might ask more data-specific questions. Be prepared with talking points about the aspects of your analysis that led you to your final results and conclusions. If your intended audience is other analysts and individual contributors, you will have the most freedom—and perhaps the most time—to go more deeply into the data, processes, and results.
What is the purpose of my presentation?
If the goal of your presentation is to request or recommend something at the end, like a sales pitch, you can have each slide work toward the recommendations at the end. If the goal of your presentation is to focus on the results of your analysis, each slide can help mark the path to the results. Be sure to include plenty of breadcrumbs (views of the data analysis steps) to demonstrate the path you took with the data. If the goal of your presentation is to provide a report on the data analysis, your slides should clearly summarize your data and key findings. In this case, it is alright to let the data be the star or speak for itself.
Tip 2: Prepare talking points and limit text on slides
As you create each slide in your presentation, prepare talking points (also called speaker notes) on what you will say. Don't forget that you will be talking at the same time that your audience is reading your slides. If your slides start becoming more like documents, you should rethink what you will say so that you can remove some text from the slides.
Make it easy for your audience to skim read the slides while still paying attention to what you are saying. In general, follow the five-second rule. Your audience should not be spending more than five seconds reading any block of text on a slide. Knowing exactly what you will say when explaining each slide throughout your presentation also creates a natural flow to your story. Talking points help you avoid awkward pauses between topics. Slides that summarize data can also be repetitive (and boring). If you prepare a variety of interesting talking points about the data, you can keep your audience alert and paying attention to the data and its analysis.
Tip 3: End with your recommendations
When climbing a mountain, getting to the top is the goal. Making recommendations at the end of your presentation is like getting to the mountaintop. Use one slide for your recommendations at the end. Be clear and concise. If you are recommending that something be done, provide next steps and describe what you would consider a successful outcome.
Tip 4: Allow enough time for the presentation and questions
Assume that everyone in your audience is busy. Keep your presentation on topic and as short as possible by:
Being aware of your timing. This applies to the total number of slides and the time you spend on each slide.
Presenting your data efficiently. Make sure that every slide tells a unique and important part of your data story. If a slide isn't that unique, you might think about combining the information on that slide with another slide.
Saving enough time for questions at the end or allowing enough time to answer questions throughout your presentation.
Putting it all together: Your slide deck layout
In this section, we will describe how to put everything together in a sample slide deck layout.
First slide: Agenda
Provide a high-level bulleted list of the topics you will cover and the amount of time you will spend on each. Every company's norms are different, but in general, most presentations run from 30 minutes to an hour at most. Here is an example of a 30-minute agenda:
Introductions (4 minutes)
Project overview and goals (5 minutes)
Data and analysis (10 minutes)
Recommendations (3 minutes)
Actionable steps (3 minutes)
Questions (5 minutes)
Second slide: Purpose
Not everyone will be familiar with your project or know why it is important. They didn't spend the last couple of weeks thinking about the analysis and results of your project like you did. This slide summarizes the purpose of the project and why it is important to the business for your audience. Here is an example of a purpose statement: Service center consolidation is an important cost savings initiative. The aim of this project was to determine the impact of service center consolidation on customer response times.
Third slide: Data/analysis
First, it really is possible to tell your data story in a single slide if you summarize the key things about your data and analysis. You may have supporting slides with additional data or information in an appendix at the end of the presentation. But, if you choose to tell your story using more than one slide, keep the following in mind:
Slides typically have a logical order (beginning, middle, and end) to fully build the story.
Each slide should logically introduce the slide that follows it. Visual cues from the slides or verbal cues from your talking points should let the audience know when you will go on to the next slide.
Remember not to use too much text on the slides.
When in doubt, refer back to the second tip on preparing talking points and limiting the text on slides. The high-level information that people read from the slides shouldn't be the same as the information you provide in your talking points. There should be a nice balance between the two to tell a good story. You don't want to simply read or say the words on the slides. For extra visuals on the slides, use animations. For example, you can:
Fade in one bullet point at a time as you discuss each one on a slide.
Only display the visual that is relevant to what you are talking about (fade out non-relevant visuals).
Use arrows or callouts to point to a specific area of a visual that you are using.
Fourth slide: Recommendations
If you have been telling your story well in the previous slides, the recommendations will be obvious to your audience. This is when you might get a lot of questions about how your data supports your recommendations. Be ready to communicate how your data backs up your conclusion or recommendations in different ways. Having multiple ways to state the same thing also helps if someone is having difficulty with one particular explanation.
Fifth slide: Call to action
Sometimes the call to action can be combined with the recommendations slide. If there are multiple actions or activities recommended, a separate slide is best. Recall our example of a purpose statement: Service center consolidation is an important cost savings initiative. The aim of this project was to determine the impact of service center consolidation on customer response times. Suppose the data analysis showed that service center consolidation negatively impacted customer response times. A call to action might be to examine if processes need to change to bring customer response times back to what they were before the consolidation.
Wrapping it up: Getting feedback
After you present to your audience, think about how you told your data story and how you can get feedback for improvement. Consider asking your manager or another data analyst for candid thoughts about your storytelling and presentation overall. Feedback is great to help you improve. When you have to write a brand new data story (or a sequel to the one you already told), you will be ready to impress your audience even more!
Hey, good to see you again. By now you've learned some ways to organize and incorporate data into your presentations. You've also covered why effective presentation skills are so important as a data analyst. Now you're ready to start presenting like a pro. Coming up, I'll share some pro tips and best practices with you. Let's get started. We've talked about how important your audience is throughout this program, and it's especially important for presentations. It's also important to remember that not everyone can experience your presentations the same way. Sharing your presentation via email and putting some forethought into how accessible your data viz is before your presentation can help ensure your work is accessible and understandable. But during the actual presentation, it can be tempting to focus on what's most interesting and exciting to us and not on what the audience actually needs to hear. Sometimes, even the best audiences can lose focus and get distracted, but here are a few things you can do during your final presentation to help you stay focused on your audience and keep them engaged. First, try to keep in mind that your audience won't always get the steps you took to reach a conclusion.
Your work makes sense to you because you did it—this is called the curse of knowledge. Basically, it means that because you know something, it can be hard to imagine your audience not knowing it. It's important to remember that your audience doesn't have the same context you do, so focus on what information they need to reach the same conclusion you did. Earlier, we covered some useful things you can add to your presentations to help with this. First, answer basic questions about where the data came from and what it covers: How was it collected? Does it focus on a specific time or place? You can also include your guiding hypothesis and the goals that drove your analysis. Adding any assumptions or methods you used to reach your conclusions can also be useful. For example, in our avocado presentation, we grouped months by season and looked at overall trends. And finally, explain your conclusion and how you reached it. Your audience also has a lot on their mind already. They might be thinking about their own work projects or what they want to have for lunch. They aren't trying to be rude, and it doesn't mean they aren't interested; they're just busy people with a lot going on. Try to keep your presentation focused and to the point to keep their minds from wandering. Try not to tell stories that take your audience down an unrelated line of thinking, and try not to go into too much detail about things that don't concern your audience. You might have found a really exciting new SQL database, but unless your presentation is about databases, you can probably leave that out. Your audience can also be easily distracted by information in your presentation. For example, the more you include in a chart, the more your audience will need to explore it. Try to avoid including information in your presentations that you don't think will be productive to discussions with your audience, sharing the right amount of content to keep your audience focused and ready to take action. It's also good to note that how you present information is just as important as what you present, and I have some best practices for delivering presentations. First, pay attention to how you speak. Keep your sentences short. Don't use long words where short words will work. Build in intentional pauses to give your audience time to think about what you've just said. Try to keep the pitch of your sentences level so that your statements aren't confused for questions. Also, try to be mindful of any nervous habits you have. Maybe you talk faster, tap your toes, or touch your hair when you're nervous. That's totally normal—everyone does it—but these habits can be distracting for your audience. When you're presenting, try to stay still and move with purpose. Practice good posture and make positive eye contact with the people in your audience. Finally, remember that you can practice and improve these skills with every presentation. Accept and seek out feedback from people you trust. Feedback is a gift and an opportunity to grow. With that, you've completed another module. The presentation skills you've learned here, like using frameworks, weaving data into your presentation, and best practices you can apply during your actual presentations, are going to help you communicate your findings with audiences effectively. Hello. So let's talk about how you can be sure you're prepared for a Q&A. For starters, knowing the questions ahead of time can make a big difference.
You don't have to be a mind reader, but there are a few things you can do to prepare that'll help. For this example, we'll go back to the presentation we created about health and happiness around the world. We put together these slides, cleaned them up a bit, and now we're getting ready for the actual presentation. Let's go over some ways we can anticipate possible questions before our Q&A to give us more time to think about the answers. Understanding your stakeholders' expectations will help you predict the questions they might ask. As we previously discussed, it's important to set stakeholder expectations early in the project. Keep their expectations in mind while you're planning presentations and Q&A sessions. Make sure you have a clear understanding of the objective and what the stakeholders wanted when they asked you to take on this project. For this project, our stakeholders were interested in what factors contributed to a happier life around the world. Our objective was to identify whether there were geographic, demographic, and/or economic factors that contributed to a happier life. Knowing that, we can start thinking about the potential questions about that objective they might have. At the end of the day, if you misunderstood your stakeholders' expectations or the project objectives, you won't be able to correctly anticipate or answer their questions. Think about these things early and often when planning for a Q&A. Once you feel confident that you fully understand your stakeholders' expectations and the project goals, you can start identifying possible questions. A great way to identify audience questions is to do a test run of your presentation. I like to call this the "colleague test." Show your presentation or your data viz to a colleague who has no previous knowledge of your work, and see what questions they ask you. They might have the same questions your real audience does. We talked about feedback as a gift, so don't be afraid to seek it out and ask colleagues for their opinions. Let's say we ran through our presentation with a colleague: we showed them our data visualizations, then asked them what questions they had. They tell us they weren't sure how we were measuring health and happiness with our data in this slide. That's a great question. We can absolutely work that information into our presentation. Sometimes the questions asked during our colleague tests help us revise our presentation. Other times, they help us anticipate questions that might come up during the presentation, even if we didn't originally want to build that information into the presentation itself. It helps to be prepared to go into detail about your process, but only if someone asks. Either way, their feedback can help take your presentation to the next level. Next, it's helpful to start with zero assumptions. Don't assume that your audience is already familiar with jargon, acronyms, past events, or other necessary background information. Try to explain these things in the presentation, and be ready to explain them further if asked. When we showed our presentation to our colleague, we accidentally assumed that they already knew how health and happiness were measured and left that out of our original presentation. Now, let's look at our second data viz. This graph is showing the relationship between health, wealth, and happiness, but includes GDP to measure the economy.
We don't want to assume that our audience knows what that means, so during the presentation, we'll want to include a definition of GDP. In our speaker notes, we've added a definition of gross domestic product: the total monetary or market value of all the finished goods and services produced within a country's borders in a specific period of time. We'll fully explain what GDP means as soon as this graphic comes up; that way, no one in our audience is confused by that acronym. It helps to work with your team to anticipate questions and draft responses. Together, you'll be able to include their perspectives and coordinate answers so that everyone on your team is prepared and ready to share their unique insights with stakeholders. The team working on the world happiness project with you probably has a lot of great insights about the data, like how it was gathered or what it might be missing. Touch base with them so you don't miss out on their perspective. Finally, be prepared to consider and describe to your stakeholders any limitations in your data. You can do this by critically analyzing the patterns you've discovered in your data for integrity. For example, could the correlations found be explained as coincidence? On top of that, use your understanding of the strengths and weaknesses of the tools you use in your analysis to pinpoint any limitations they may have introduced. While you probably don't have the power to predict the future, you can come pretty close to predicting stakeholder and audience questions by doing a few key things. Remember to focus on stakeholder expectations and project goals, identify possible questions with your team, review your presentation with zero assumptions, and consider the limitations of your data. Sometimes, though, your audience might raise objections to the data during or after your presentation. Coming up, we'll talk about the kind of objections they might have and how you can respond. See you next time. Welcome back. In this video, we'll talk about how you can handle objections about the data you're presenting. Stakeholders might raise objections during or after your presentation. Usually, these objections are about the data, your analysis, or your findings. We'll start by discussing what questions these objections are asking and then talk about how to respond. Objections about the data could mean a few different things. Sometimes, stakeholders might be asking where you got the data and what systems it came from, or they might want to know what transformations happened to it before you worked with it, or how fresh and accurate your data is. You can include all this information in the beginning of your presentation to set up the data context. You can add a more detailed breakdown in your appendix in case there are more questions. When we talked about cleaning data, you learned that keeping a detailed log of data transformations is useful. That log can help you answer the questions we're talking about here, and if you keep it in your presentation's appendix, it'll be easy to reference if any of your stakeholders want more detail during a Q&A. Now, your audience might also have questions or objections about your analysis. They might want to know if your analysis is reproducible, so it helps to keep a change log documenting the steps you took. This way, someone else could follow along and reproduce your process. You can even create a slide in the appendix section of your presentation explaining these steps, if you think it will be necessary.
And it can be useful to keep a clean version of your script if you're working with a programming language like SQL or R, which we'll learn all about later. Also, be prepared to answer questions like, "Who did you get feedback from during this process?" This is especially important when your analysis reveals insights that are the opposite of your audience's gut feelings about the data. Making sure to include lots of perspectives throughout your analysis process will help you back up your findings during your presentation. Finally, you might be faced with objections to the findings themselves. A lot of the time, these will be questions like, "Do these findings exist in previous time periods, or did you control for the differences in your data?" Your audience wants to be sure that your final results accounted for any possible inconsistencies and that they're accurate and useful. Now that you know some of the possible kinds of objections your audience might raise, let's talk about how you can think about responding. First, it can be useful to communicate any assumptions about the data, your analysis, or your findings that might help answer their questions. For example, did your team clean and format your data before analysis? Telling your audience that can clear up any doubts they might have. Second, explain why your analysis might be different than expected. Walk your audience through the variables that change the outcomes to help them understand how you got there. And third, some objections have merit, especially if they bring up something you hadn't thought of before. If that's true, you can acknowledge that those objections are valid and take steps to investigate further. Following up with more details afterwards is great, too. And now you know some of the basic objections you might run into. Understanding that your audience might have questions about your data, your analysis, or your findings can help you prepare responses ahead of time, and walking your audience through any assumptions about the data or unexpected results is a great approach to responding. Coming up, we'll go over even more best practices for responding to questions during a Q&A. Bye for now. Hello again. Earlier we talked about some ways that you can respond to objections during or after your presentations. In this video, I want to share some more Q&A best practices. Let's go back to our world happiness presentation example. Imagine we finished preparing for a Q&A, and it's time to actually answer some of our audience's questions. Let's go over some ways that we can be sure that we're answering questions effectively. We'll start with a really simple one: listen to the whole question. I know this sounds like a given, but it can be really tempting to start thinking about your answer before the person you're talking to has even finished asking their question. On slide 11 of our presentation, we outline our conclusions. After explaining these conclusions, one of our stakeholders asks, "How was happiness measured for this project?" It's important to listen to the whole question and wait to respond until they're done talking. Take a moment to repeat the question. Repeating the question is helpful for a few different reasons. For one, it helps you make sure that you're understanding the question. Second, it gives the person asking it a chance to correct you if you're not.
Anyone who couldn't hear the question will still know what's being asked. Plus, it gives you a moment to get your thoughts together. After listening to the question and repeating it to make sure you understand, you can explain that participants in different countries were given a survey that asked them to rate their happiness, and just like that, your audience has a better understanding of the project because you took the time to listen carefully. Now that they know about the survey, they're interested in knowing more. At this point, we can go into more detail about that data. We have a slide built in here called the appendix. This is a great place to keep extra information that might not be necessary for our presentation but could be useful for answering questions afterwards. This is also a great place for us to have more detailed information about the survey data so we can reference it more easily. As always, make sure you understand the context questions are being asked in. Think about who your audience is and what kinds of concerns or backgrounds they might have. Remember the project goals and your stakeholders' interests in them, and try to keep your answers relevant to that specific context, just like you made sure your presentation itself was relevant to your stakeholders. We have this slide with data about life expectancy as a metric for health. If you're presenting to a group of stakeholders who are in the healthcare industry, they're probably going to be more interested in the medical data and the relationship between overall health and happiness. Knowing this, you can tailor your answers to focus on their interests so that the presentation is relevant and useful to them. When answering, try to involve the whole audience. You aren't just having a one-on-one conversation with the person that asked the question; you're presenting to a group of people who might also have the same question or need to know what that answer is. It's important to not accidentally exclude other audience members. You can also include other voices. If there's someone in your audience or team that might have insight, ask them for their thoughts. Keep your responses short and to the point. Start with a headline response that gives your stakeholders the basic answer. Then if they have more questions, you can go into more detail. This can be difficult as a data analyst. You have all the background information and want to share your hard work, but you don't want to lose your audience with a long and potentially confusing answer. Stay focused on the question itself. This is why listening to the whole question is so important. It keeps the focus on that specific question. Answer the question as directly as possible using the fewest words you can. From there, you can expand on your answer or add color, context, and detail as needed. Like when one of our stakeholders asked how the data measuring happiness was gathered. We started by telling them that a survey was used to measure an individual's happiness, and only when they were interested in hearing more about the survey did we go into more detail. To recap, when you're answering questions during a presentation Q&A, remember to listen to the whole question, repeat the question if necessary, understand the context, involve your whole audience, and keep your responses short. Remember, you don't have to answer every question on the spot.
If it is a tough question that will require additional analysis or research, it's fine to let your audience know that you'll get back to them; just remember to follow up in a timely manner. These tips will make it easier to answer questions and make you seem prepared and professional. Now that you're presentation-ready, it's time to wrap up. We covered a lot about how to consider questions before a Q&A, how to handle different kinds of objections, and some best practices you can use in your next presentation. That's it for now. See you in the next video.
The R-versus-Python debate People often wonder which programming language they should learn first. You might be wondering about this, too. This certificate teaches the open-source programming language, R. R is a great starting point for foundational data analysis, and it has helpful packages that beginners can apply to projects. Python isn't covered in the curriculum, but we encourage you to explore Python after completing the certificate. If you are curious about other programming languages, make every effort to continue learning. Any language a beginner starts to learn will have some advantages and challenges. Let's put this into context by looking at R and Python. The following table is a high-level overview based on a sampling of articles and opinions of those in the field. You can review the information without necessarily picking a side in the R vs. Python debate. In fact, if you check out RStudio's blog article in the Additional resources section, it's actually more about working together than winning a debate.
Common features of both R and Python: open-source; data stored in data frames; formulas and functions readily available; a community for code development and support.
Unique advantages of R: data manipulation, data visualization, and statistics packages; a "scalpel" approach to data: find individual packages to do what you want with the data.
Unique advantages of Python: easy syntax for machine learning; integrates with cloud platforms like Google Cloud and Azure.
Unique challenges of R: inconsistent naming conventions make it harder for beginners to select the right functions; methods for handling variables may be a little complex for beginners to understand.
Unique challenges of Python: many more decisions for beginners to make about data input/output, structure, and variables; a "Swiss army knife" approach to data: figure out a solution to do what you want with the data.
Additional resources For more information on comparing R and Python, refer to these resources: R versus Python, a comprehensive guide for data professionals: This article is written by a data professional with extensive experience using both languages and provides a detailed comparison. R versus Python, an objective comparison: This article provides a comparison of the languages using examples of code use. R versus Python: What's the best language for data science?: This blog article provides RStudio's perspective on the R vs. Python debate. Key takeaways Certain aspects make some programming languages easier to learn than others. But that doesn't make the harder languages impossible for beginners to learn. On the flip side, a programming language's popularity doesn't always make it the best language for beginners either. R has been used by professionals who have a statistical or research-oriented approach to solving problems; among them are scientists, statisticians, and engineers. Python has been used by professionals looking for solutions in the data itself, those who must heavily mine data for answers; among them are data scientists, machine learning specialists, and software developers. As you grow as a data analytics professional, you may need to learn additional programming languages. The skills and competencies you learn from your first programming experience are a good foundation. That's why this course focuses on the basics of R. You can develop the right perspective: that programming languages play an important part in the data analysis process no matter what job title you have. The good news is that many of the concepts and coding principles that you will learn from using R in this course are transferable to other programming languages. You will also learn how to write R code in an Integrated Development Environment (IDE) called RStudio. RStudio allows you to manage projects that use R or Python, or even a combination of the two. Refer to RStudio: A Single Home for R & Python for more information. So, after you have worked with R and RStudio, learning Python or another programming language in the future will be more intuitive. For a better idea of popular programming languages by job role, refer to Ways to learn about programming. The programming languages most commonly used by data analysts, web designers, mobile and web application developers, and game developers are listed, along with links to resources to help you start learning more about those languages. From spreadsheets to SQL to R Although the programming language R might be new to you, it actually has a lot of similarities to the other tools you have explored in this program.
In this reading, you will compare spreadsheet programs, SQL, and R to have a better sense of how to use each moving forward. Spreadsheets, SQL, and R: a comparison As a data analyst, there is a good chance you will work with SQL, R, and spreadsheets at some point in your career. Each tool has its own strengths and weaknesses, but they all make the data analysis process smoother and more efficient. There are two main things that all three have in common: They all use filters: for example, you can easily filter a dataset using any of these tools. In R, you can use the filter function. This performs the same task as a basic SELECT-FROM-WHERE SQL query. In a spreadsheet, you can create a filter using the menu options. They all use functions: In spreadsheets, you use functions in formulas, and in SQL, you include them in queries. In R, you will use functions in the code that is part of your analysis. The table below presents key questions to explore a few more ways that these tools compare to each other. You can use this as a general guide as you begin to navigate R.
What is it? Spreadsheets: a program that uses rows and columns to organize data and allows for analysis and manipulation through formulas, functions, and built-in features. SQL: a database programming language used to communicate with databases to conduct an analysis of data. R: a general-purpose programming language used for statistical analysis, visualization, and other data analysis.
What is a primary advantage? Spreadsheets: includes a variety of visualization tools and features. SQL: allows users to manipulate and reorganize data as needed to aid analysis. R: provides an accessible language to organize, modify, and clean data, and to create insightful data visualizations.
Which datasets does it work best with? Spreadsheets: smaller datasets. SQL: larger datasets. R: larger datasets.
What is the source of the data? Spreadsheets: entered manually or imported from an external source. SQL: accessed from an external database. R: loaded with R, imported from your computer, or loaded from external sources.
Where is the data from my analysis usually stored? Spreadsheets: in a spreadsheet file on your computer. SQL: inside tables in the accessed database. R: in an R file on your computer.
Do I use formulas and functions? Spreadsheets: yes. SQL: yes. R: yes.
Can I create visualizations? Spreadsheets: yes. SQL: yes, by using an additional tool like a database management system (DBMS) or a business intelligence (BI) tool. R: yes.
When to use RStudio As a data analyst, you will have plenty of tools to work with in each phase of your analysis. Sometimes, you will be able to meet your objectives by working in a spreadsheet program or using SQL with a database. In this reading, you will go through some examples of when working in R and RStudio might be your better option instead. Why RStudio? One of your core tasks as an analyst will be converting raw data into insights that are accurate, useful, and interesting. That can be tricky to do when the raw data is complex. R and RStudio are designed to handle large data sets, which spreadsheets might not be able to handle as well. RStudio also makes it easy to reproduce your work on different datasets. When you input your code, it's simple to just load a new dataset and run your scripts again. You can also create more detailed visualizations using RStudio. When RStudio truly shines When the data is spread across multiple categories or groups, it can be challenging to manage your analysis, visualize trends, and build graphics. And the more groups of data that you need to work with, the harder those tasks become. That's where RStudio comes in. For example, imagine you are analyzing sales data for every city across an entire country. That is a lot of data from a lot of different groups–in this case, each city has its own group of data. Here are a few ways RStudio could help in this situation: Using RStudio makes it easy to take a specific analysis step and perform it for each group using basic code. In this example, you could calculate the yearly average sales data for every city (a short sketch of this follows below). RStudio also allows for flexible data visualization. You can visualize differences across the cities effectively using plotting features like facets–which you'll learn more about later on. You can also use RStudio to automatically create an output of summary stats— or even your visualized plots—for each group.
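To make that per-group calculation easier to picture, here is a minimal sketch of what it could look like in code. The sales data frame, its column names, and the use of the dplyr package (which is introduced later in the program) are assumptions for illustration only, not part of the course materials.
library(dplyr)  # assumed; provides group_by(), summarize(), and the %>% pipe
# A hypothetical data frame of yearly sales by city
sales <- data.frame(
  city = c("Austin", "Austin", "Denver", "Denver"),
  year = c(2020, 2021, 2020, 2021),
  amount = c(150, 175, 90, 120)
)
# One short block of code repeats the same step for every group (each city)
sales %>%
  group_by(city) %>%
  summarize(avg_yearly_sales = mean(amount))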
As you learn more about R and RStudio moving forward in this program, you'll get a better understanding of when RStudio should be your data analysis tool of choice. For more information The Advantages of RStudio: This web page explains some of the reasons why RStudio is many analysts' preferred choice for interfacing with R. You'll learn about the advantages of using RStudio for data analysis, from ease of use to accessibility of graphics and more. Data analysis and R programming: This online introduction to data analysis and R programming is a good starting point for R and RStudio users. It also includes a list of detailed explanations about the advantages of using R and RStudio. You'll also find a helpful guide for getting set up with RStudio. Hey there. Anytime you're learning a new skill, from cooking to driving to dancing, you should always start with the fundamentals. Programming with R is no different. To build this foundation, you'll get familiar with the basic concepts of R, including functions, comments, variables, data types, vectors, and pipes. Some of these terms might sound familiar. For example, we've come across functions in spreadsheets and SQL. As a quick refresher, a function is a body of reusable code used to perform specific tasks in R. Functions begin with function names like print or paste, and are usually followed by one or more arguments in parentheses. An argument is information that a function in R needs in order to run. Here's a simple function in action. Feel free to join in and try it yourself in RStudio using your cloud account. Check out the reading for more details on how to get started. You can pause the video anytime you need to. We'll open RStudio Cloud to get started. We'll start our function in the console with the function name print. This function will return whatever values we include in the parentheses. We'll type an open parenthesis followed by a quotation mark. Both the close parenthesis and end quote automatically pop up because RStudio recognizes this syntax. Now we just have to add the text string. We'll type Coding in R. Then we'll press enter. Success! The code returns the words "Coding in R." If you want to find out more about the print function or any function, all you have to do is type a question mark, the function name, and a set of parentheses. This returns a page in the Help window, which helps you learn more about the functions you're working with. Keep in mind that functions are case-sensitive, so typing Print with a capital P brings back an error message.
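For reference, here is roughly what those console commands look like, collected in one place. This is only a recap sketch of the walkthrough above, with the output shown as comments.
print("Coding in R")    # returns: [1] "Coding in R"
?print()                # opens the Help page for the print function
# Print("Coding in R")  # would return an error, because function names are case-sensitive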
Functions are great, but it can be pretty time-consuming to type out lots of values. To save time, we can use variables to represent the values. This lets us call out the values any time we need to with just the variable. Earlier, we learned about variables in SQL. A variable is a representation of a value in R that can be stored for use later during programming. Variables can also be called objects. As a data analyst, you'll find variables are very useful when programming. For example, if you want to filter a dataset, just assign a variable to the function you used to filter the data. That way, all you have to do is use that variable to filter the data later. When naming a variable in R, you can use a short phrase. A variable name should start with a letter and can also contain numbers and underscores. So the variable 5penguin wouldn't work well because it starts with a number. Also, just like functions, variable names are case-sensitive. Using all lowercase letters is good practice whenever possible. Now, before we get to coding a variable, let's add a comment. Comments are helpful when you want to describe or explain what's going on in your code. Use them as much as possible so that you and everyone else reading your code can understand the reasoning behind it. Comments should be used to make an R script more readable. A comment shouldn't be treated as code, so we'll put a # in front of it. Then we'll add our comment. Here's an example of a variable. Now let's go ahead with our example. It makes sense to use a variable name that connects to what the variable is representing. So we'll type the variable name first_variable. Then, after the variable name, we'll type a < sign, followed by a -. This is the assignment operator. It assigns the value to the variable. It looks like an arrow, which makes sense, since it's pointing from the value to the variable. There are other assignment operators that work too, but it's always good to stick with just one type in your code. Next, we'll add the value that our variable will represent. We'll use the text, "This is my variable." If we type the variable and hit Run, it will return the value that the variable represents. This is a very basic way of using a variable. You'll learn more ways of using variables in your code soon. For now, let's assign a variable to a different data type, numeric. We'll name this second_variable, and type our assignment operator. We'll give it the numeric value 12.5. The Environment pane in the upper-right part of our workspace now shows both of our variables and their values. There are other data types in R like logical, date, and date time. R has a few options for dealing with these data types. We'll explore them later. With functions, comments, variables, and data types, you've got a good foundation for working with R. We'll revisit these throughout this program, and show you how they're used in different ways during analysis.
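Pulling the variable walkthrough together, the console commands look something like this. The names and values mirror the ones used in the video, and the output is shown as comments.
# Here's an example of a variable (comments like this one start with #)
first_variable <- "This is my variable"  # <- is the assignment operator
first_variable                           # returns: [1] "This is my variable"
second_variable <- 12.5                  # a numeric value
second_variable                          # returns: [1] 12.5
# Both variables and their values now appear in the Environment pane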
Let's finish up with two more fundamental concepts, vectors and pipes. Simply put, a vector is a group of data elements of the same type stored in a sequence in R. You can make a vector using the combine function. In R, this function is just the letter c followed by the values you want in your vector inside parentheses. All right, let's create a vector. Imagine this vector is for measurement data that we need to analyze. We'll start our code with the variable vec_1 to assign to the vector. Then we'll type c and the open parenthesis. Then we'll type our list of numbers separated by commas. We'll then close our parentheses and press enter. This time when we type our variable and press enter, it returns our vector. We can use this vector anywhere in our analysis with only its variable name vec_1. The values in the vector will automatically be applied to our analysis. That brings us to the last of our fundamentals, pipes. A pipe is a tool in R for expressing a sequence of multiple operations. A pipe is represented by a % sign, followed by a > sign, and another % sign. It's used to apply the output of one function into another function. Pipes can make your code easier to read and understand. For example, this pipe filters and sorts the data. Later, we'll learn how each part of the pipe works. So there they are, the super six fundamentals: functions, comments, variables, data types, vectors, and pipes. They all work together as a foundation for using R. It's a lot to take in, so feel free to watch any of these videos again if you need a refresher. When you're ready, there's so much more to know about R and RStudio. So let's get to it.
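Here is a rough sketch of what the vector and pipe examples described above could look like as code. The numbers in vec_1 and the measurements data frame are made up for illustration, and the pipe example assumes the dplyr package, which supplies the %>% operator along with the filter() and arrange() functions.
library(dplyr)  # assumed; provides %>%, filter(), and arrange()
# A vector of measurement data, created with the combine function c()
vec_1 <- c(13, 48.5, 71, 101.5)
vec_1  # returns: [1]  13.0  48.5  71.0 101.5
# A hypothetical data frame used to demonstrate a pipe
measurements <- data.frame(id = 1:4, value = c(13, 48.5, 71, 101.5))
# A pipe that filters and then sorts the data; each step's output feeds the next function
measurements %>%
  filter(value > 40) %>%
  arrange(desc(value))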
Vectors and lists in R You can save this reading for future reference. In programming, a data structure is a format for organizing and storing data. Data structures are important to understand because you will work with them frequently when you use R for data analysis. The most common data structures in the R programming language include: Vectors Data frames Matrices Arrays Think of a data structure like a house that contains your data. This reading will focus on vectors. Later on, you'll learn more about data frames, matrices, and arrays. There are two types of vectors: atomic vectors and lists. Coming up, you'll learn about the basic properties of atomic vectors and lists, and how to use R code to create them. Atomic vectors First, we will go through the different types of atomic vectors. Then, you will learn how to use R code to create, identify, and name the vectors. Earlier, you learned that a vector is a group of data elements of the same type, stored in a sequence in R. You cannot have a vector that contains both logicals and numerics. There are six primary types of atomic vectors: logical, integer, double, character (which contains strings), complex, and raw. The last two–complex and raw–aren't as common in data analysis, so we will focus on the first four. Together, integer and double vectors are known as numeric vectors because they both contain numbers. This table summarizes the four primary types:
Logical: true or false values
Integer: positive and negative whole values
Double: decimal values
Character: string/character values
A diagram illustrates the hierarchy of relationships among these four main types: integer and double are numeric types; logical, numeric, and character are all atomic types; and atomic vectors are one kind of vector. Creating vectors One way to create a vector is by using the c() function (called the "combine" function). The c() function in R combines multiple values into a vector. In R, this function is just the letter "c" followed by the values you want in your vector inside the parentheses, separated by a comma: c(x, y, z, …). For example, you can use the c() function to store numeric data in a vector. c(2.5, 48.5, 101.5) To create a vector of integers using the c() function, you must place the letter "L" directly after each number. c(1L, 5L, 15L) You can also create a vector containing characters or logicals. c("Sara", "Lisa", "Anna") c(TRUE, FALSE, TRUE) Determining the properties of vectors Every vector you create will have two key properties: type and length. You can determine what type of vector you are working with by using the typeof() function. Place the code for the vector inside the parentheses of the function. When you run the function, R will tell you the type. For example: typeof(c("a", "b")) #> [1] "character" Notice that the output of the typeof function in this example is "character". Similarly, if you use the typeof function on a vector with integer values, then the output will include "integer" instead: typeof(c(1L, 3L)) #> [1] "integer" You can determine the length of an existing vector–meaning the number of elements it contains–by using the length() function. In this example, we use an assignment operator to assign the vector to the variable x. Then, we apply the length() function to the variable. When we run the function, R tells us the length is 3. x <- c(33.5, 57.75, 120.05) length(x) #> [1] 3 You can also check if a vector is a specific type by using an is function: is.logical(), is.double(), is.integer(), is.character(). In this example, R returns a value of TRUE because the vector contains integers. x <- c(2L, 5L, 11L) is.integer(x) #> [1] TRUE In this example, R returns a value of FALSE because the vector does not contain characters; rather, it contains logicals. y <- c(TRUE, TRUE, FALSE) is.character(y) #> [1] FALSE Naming vectors All types of vectors can be named. Names are useful for writing readable code and describing objects in R. You can name the elements of a vector with the names() function. As an example, let's assign the variable x to a new vector with three elements. x <- c(1, 3, 5) You can use the names() function to assign a different name to each element of the vector. names(x) <- c("a", "b", "c") Now, when you run the code, R shows that the first element of the vector is named a, the second b, and the third c. x #> a b c #> 1 3 5 Remember that an atomic vector can only contain elements of the same type. If you want to store elements of different types in the same data structure, you can use a list. Creating lists Lists are different from atomic vectors because their elements can be of any type—like dates, data frames, vectors, matrices, and more. Lists can even contain other lists. You can create a list with the list() function.
Similar to the c() function, the list() function is just list followed by the values you want in your list inside parentheses: list(x, y, z, …). In this example, we create a list that contains four different kinds of elements: character ("a"), integer (1L), double (1.5), and logical (TRUE). list("a", 1L, 1.5, TRUE) Like we already mentioned, lists can contain other lists. If you want, you can even store a list inside a list inside a list—and so on. list(list(list(1, 3, 5))) Determining the structure of lists If you want to find out what types of elements a list contains, you can use the str() function. To do so, place the code for the list inside the parentheses of the function. When you run the function, R will display the data structure of the list by describing its elements and their types. Let's apply the str() function to our first example of a list. str(list("a", 1L, 1.5, TRUE)) When we run the function, R tells us that the list contains four elements, and that the elements consist of four different types: character (chr), integer (int), number (num), and logical (logi). #> List of 4 #> $ : chr "a" #> $ : int 1 #> $ : num 1.5 #> $ : logi TRUE Let's use the str() function to discover the structure of our second example. First, let's assign the list to the variable z to make it easier to input in the str() function. z <- list(list(list(1, 3, 5))) Let's run the function. str(z) #> List of 1 #> $ :List of 1 #> ..$ :List of 3 #> .. ..$ : num 1 #> .. ..$ : num 3 #> .. ..$ : num 5 The indentation of the $ symbols reflects the nested structure of this list. Here, there are three levels (so there is a list within a list within a list). Naming lists Lists, like vectors, can be named. You can name the elements of a list when you first create it with the list() function: list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3) $Chicago [1] 1 $`New York` [1] 2 $`Los Angeles` [1] 3 Additional resource To learn more about vectors and lists, check out R for Data Science, Chapter 20: Vectors. R for Data Science is a classic resource for learning how to use R for data science and data analysis. It covers everything from cleaning to visualizing to communicating your data. If you want to get more details about the topic of vectors and lists, this chapter is a great place to start.