Course 2
Data in Action
From issue to action: The six data analysis phases
There are six data analysis phases that will help you make seamless decisions: ask, prepare, process,
analyze, share, and act. Keep in mind, these are different from the data life cycle, which describes the
changes data goes through over its lifetime. Let’s walk through the steps to see how they can help you
solve problems you might face on the job.
Step 1: Ask
It’s impossible to solve a problem if you don’t know what it is. These are some things to consider:
Define the problem you’re trying to solve
Make sure you fully understand the stakeholder’s expectations
Focus on the actual problem and avoid any distractions
Collaborate with stakeholders and keep an open line of communication
Take a step back and see the whole situation in context
Questions to ask yourself in this step:
What are my stakeholders saying their problems are?
Now that I’ve identified the issues, how can I help the stakeholders resolve their questions?
Step 2: Prepare
You will decide what data you need to collect in order to answer your questions and how to organize it
so that it is useful. You might use your business task to decide:
Which metrics to measure
Where to locate the data in your database
How to create security measures to protect that data
Questions to ask yourself in this step:
What do I need to figure out how to solve this problem?
What research do I need to do?
Step 3: Process
Clean data is the best data, so you will need to clean up your data to get rid of any possible errors,
inaccuracies, or inconsistencies. This might mean:
Using spreadsheet functions to find incorrectly entered data
Using SQL functions to check for extra spaces (see the sketch after this list)
Removing repeated entries
Checking as much as possible for bias in the data
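For example, a minimal SQL sketch of that kind of cleanup might look like this, assuming a hypothetical customer_orders table with customer_name and order_id columns (adjust the names to your own data):

-- Preview values with stray leading or trailing spaces, next to a trimmed version.
SELECT
  customer_name,
  TRIM(customer_name) AS customer_name_cleaned
FROM customer_orders
WHERE customer_name <> TRIM(customer_name);

-- Spot repeated entries: any order_id appearing more than once is a candidate duplicate.
SELECT
  order_id,
  COUNT(*) AS times_entered
FROM customer_orders
GROUP BY order_id
HAVING COUNT(*) > 1;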
Questions to ask yourself in this step:
What data errors or inaccuracies might get in my way of getting the best possible answer to the problem
I am trying to solve?
How can I clean my data so the information I have is more consistent?
Step 4: Analyze
You will want to think analytically about your data. At this stage, you might sort and format your data to
make it easier to:
Perform calculations
Combine data from multiple sources (see the sketch after this list)
Create tables with your results
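As a rough sketch of combining sources and then performing a calculation, assuming hypothetical online_orders and in_store_orders tables that both have order_date and order_total columns:

-- Stack two sources into one combined set, then summarize by channel.
WITH all_sales AS (
  SELECT 'online' AS channel, order_date, order_total FROM online_orders
  UNION ALL
  SELECT 'in_store' AS channel, order_date, order_total FROM in_store_orders
)
SELECT
  channel,
  COUNT(*) AS number_of_orders,
  SUM(order_total) AS total_revenue
FROM all_sales
GROUP BY channel;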
Questions to ask yourself in this step:
What story is my data telling me?
How will my data help me solve this problem?
Who needs my company’s product or service? What type of person is most likely to use it?
Step 5: Share
Everyone shares their results differently, so be sure to summarize your results with clear and enticing
visuals of your analysis, using tools like graphs or dashboards. This is your chance to show the
stakeholders you have solved their problem and how you got there. Sharing will certainly help your
team:
Make better decisions
Make more informed decisions
Lead to stronger outcomes
Successfully communicate your findings
Questions to ask yourself in this step:
How can I make what I present to the stakeholders engaging and easy to understand?
What would help me understand this if I were the listener?
Step 6: Act
Now it’s time to act on your data. You will take everything you have learned from your data analysis and
put it to use. This could mean providing your stakeholders with recommendations based on your
findings so they can make data-driven decisions.
Questions to ask yourself in this step:
How can I use the feedback I received during the share phase (step 5) to actually meet the stakeholder’s
needs and expectations?
These six steps can help you to break the data analysis process into smaller, manageable parts, which is
called structured thinking. This process involves four basic activities:
Recognizing the current problem or situation
Organizing available information
Revealing gaps and opportunities
Identifying your options
When you are starting out in your career as a data analyst, it is normal to feel pulled in a few different
directions with your role and expectations. Following processes like the ones outlined here and using
structured thinking skills can help get you back on track, fill in any gaps and let you know exactly what
you need.
In a previous video,
I shared how data analysis helped a company
figure out where to advertise its services.
An important part of this process
was strong problem-solving skills.
As a data analyst,
you'll find that problems are at the center
of what you do every single day,
but that's a good thing.
Think of problems as opportunities to put your skills to
work and find creative and insightful solutions.
Problems can be small or large,
simple or complex.
No problem is like another, and they all require
a slightly different approach,
but the first step is always the same:
understanding what kind of problem you're trying to
solve. And that's what we're going to talk about now.
Data analysts work with a variety of problems.
In this video, we're going to focus on six common types.
These include: making predictions, categorizing things,
spotting something unusual, identifying themes,
discovering connections, and finding patterns.
Let's define each of these now.
First, making predictions.
This problem type involves using data to make
an informed decision about
how things may be in the future.
For example, a hospital system might use
remote patient monitoring to
predict health events for chronically ill patients.
The patients would take
their health vitals at home every day,
and that information combined with data about their age,
risk factors, and other important details could enable
the hospital's algorithm to predict
future health problems and
even reduce future hospitalizations.
The next problem type is categorizing things.
This means assigning information to
different groups or clusters based on common features.
An example of this problem type is
a manufacturer that reviews data on
shop floor employee performance.
An analyst may create a group for employees
who are most and least effective at engineering,
a group for employees who are most and least
effective at repair and maintenance,
another for those most and least effective at assembly,
and many more groups or clusters.
Next, we have spotting something unusual.
In this problem type,
data analysts identify data
that is different from the norm.
An instance of spotting something
unusual in the real world is
a school system that has
a sudden increase in the number of students registered,
maybe as big as
a 30 percent jump in the number of students.
A data analyst might look into
this upswing and discover that
several new apartment complexes had been
built in the school district earlier that year.
They could use this analysis to make sure the school has
enough resources to handle the additional students.
Identifying themes is the next problem type.
Identifying themes takes categorization a step
further by grouping information into broader concepts.
Going back to our manufacturer that has just
reviewed data on the shop floor employees:
first, these people are grouped by types and tasks.
But now a data analyst could
take those categories and group them into
the broader concept of
low productivity and high productivity.
This would make it possible for the business to
see who is most and least productive,
in order to reward top performers and
provide additional support to
those workers who need more training.
Now, the problem type of discovering connections enables
data analysts to find
similar challenges faced by different entities,
and then combine data and insights to address them.
Here's what I mean;
say a scooter company is experiencing
an issue with the wheels it gets from its wheel supplier.
That company would have to stop production until it could
get safe, quality wheels back in stock.
But meanwhile, the wheel company is encountering
a problem with the rubber it uses to make wheels;
it turns out its rubber supplier could
not find the right materials either.
If all of these entities could talk about
the problems they're facing and share data openly,
they would find a lot of
similar challenges and better yet,
be able to collaborate to find a solution.
The final problem type is finding patterns.
Data analysts use data to find
patterns by using historical data to
understand what happened in
the past and is therefore likely to happen again.
Ecommerce companies use data to
find patterns all the time.
Data analysts look at transaction data to understand
customer buying habits at
certain points in time throughout the year.
They may find that customers buy more
canned goods right before a hurricane,
or they purchase fewer cold-weather accessories
like hats and gloves during warmer months.
The ecommerce companies can
use these insights to make sure
they stock the right amount of
products at these key times.
Alright, you've now learned six basic problem types
that data analysts typically face.
As a future data analyst,
this is going to be valuable knowledge for your career.
Coming up, we'll talk a bit more
about these problem types and I'll
provide even more examples of them
being solved by data analysts.
Personally, I love real-world examples.
They really help me better understand new concepts.
I can't wait to share
even more actual cases with you. See you there.
Six problem types
Data analytics is so much more than just plugging information into a platform to find
insights. It is about solving problems. To get to the root of these problems and find practical
solutions, there are lots of opportunities for creative thinking. No matter the problem, the
first and most important step is understanding it. From there, it is good to take a problem-solver approach to your analysis to help you decide what information needs to be included,
how you can transform the data, and how the data will be used.
Data analysts typically work with six problem types
1. Making predictions
2. Categorizing things
3. Spotting something unusual
4. Identifying themes
5. Discovering connections
6. Finding patterns
A video, Common problem types, introduced the six problem types with an example for
each. The examples are summarized below for review.
Making predictions
A company that wants to know the best advertising method to bring in new customers is an
example of a problem requiring analysts to make predictions. Analysts with data on
location, type of media, and number of new customers acquired as a result of past ads can't
guarantee future results, but they can help predict the best placement of advertising to
reach the target audience.
Categorizing things
An example of a problem requiring analysts to categorize things is a company's goal to
improve customer satisfaction. Analysts might classify customer service calls based on
certain keywords or scores. This could help identify top-performing customer service
representatives or help correlate certain actions taken with higher customer satisfaction
scores.
Spotting something unusual
A company that sells smart watches that help people monitor their health would be
interested in designing their software to spot something unusual. Analysts who have
analyzed aggregated health data can help product developers determine the right
algorithms to spot and set off alarms when certain data doesn't trend normally.
Identifying themes
User experience (UX) designers might rely on analysts to analyze user interaction data.
Similar to problems that require analysts to categorize things, usability improvement
projects might require analysts to identify themes to help prioritize the right product
features for improvement. Themes are most often used to help researchers explore certain
aspects of data. In a user study, user beliefs, practices, and needs are examples of themes.
By now you might be wondering if there is a difference between categorizing things and
identifying themes. The best way to think about it is: categorizing things involves assigning
items to categories; identifying themes takes those categories a step further by grouping
them into broader themes.
Discovering connections
A third-party logistics company working with another company to get shipments delivered
to customers on time is a problem requiring analysts to discover connections. By analyzing
the wait times at shipping hubs, analysts can determine the appropriate schedule changes
to increase the number of on-time deliveries.
Finding patterns
Minimizing downtime caused by machine failure is an example of a problem requiring
analysts to find patterns in data. For example, by analyzing maintenance data, they might
discover that most failures happen if regular maintenance is delayed by more than a 15-day
window.
Key takeaway
As you move through this program, you will develop a sharper eye for problems and you
will practice thinking through the problem types when you begin your analysis. This
method of problem solving will help you figure out solutions that meet the needs of all stakeholders.
You've been learning about the six common problem types that data analysts encounter:
making predictions, categorizing things, spotting something unusual,
identifying themes, discovering connections, and finding patterns.
Let's think back to our real world example from a previous video.
In that example,
Anywhere Gaming Repair wanted to figure out how to bring in new customers.
So the problem was, how to determine the best advertising method for
Anywhere Gaming Repair's target audience.
To help solve this problem, the company used data to envision
what would happen if it advertised in different places.
Now nobody can see the future but the data helped them make an informed
decision about how things would likely work out.
So, their problem type was making predictions.
Now let's think about the second problem type, categorizing things.
Here's an example of a problem that involves categorization.
Let's say a business wants to improve its customer satisfaction levels.
Data analysts could review recorded calls to the company's customer
service department and evaluate the satisfaction levels of each caller.
They could identify certain key words or phrases that come up during
the phone calls and then assign them to categories such as politeness,
satisfaction, dissatisfaction, empathy, and more.
Categorizing these key words gives us data that lets the company
identify top performing customer service representatives, and
those who might need more coaching.
This leads to happier customers and higher customer service scores.
Okay, now let's talk about a problem that involves spotting something unusual.
Some of you may have a smart watch; my favorite app is for health tracking.
These apps can help people stay healthy by collecting data such as their heart rate,
sleep patterns, exercise routine, and much more.
There are many stories out there about health apps actually saving
people's lives.
One is about a woman who was young, athletic, and
had no previous medical problems.
One night she heard a beep on her smartwatch;
a notification said her heart rate had spiked.
Now in this example think of the watch as a data analyst.
The watch was collecting and analyzing health data.
So when her resting heart rate was suddenly 120 beats per minute,
the watch spotted something unusual because according to its data,
the rate was normally around 70.
Thanks to the data her smart watch gave her, the woman went to the hospital and
discovered she had a condition which could have led to life threatening
complications if she hadn't gotten medical help.
Now let's move on to the next type of problem: identifying themes.
We see a lot of examples of this in the user experience field.
User experience designers study and
work to improve the interactions people have with products they use every day.
Let's say a user experience designer wants to see what customers think about
the coffee maker his company manufactures.
This business collects anonymous survey data from users,
which can be used to answer this question.
But first to make sense of it all,
he will need to find themes that represent the most valuable data,
especially information he can use to make the user experience even better.
So the problem the user experience designer's company faces
is how to improve the user experience for its coffee makers.
The process here is kind of like finding categories for
keywords and phrases in customer service conversations.
But identifying themes goes even further by grouping each insight into
a broader theme.
Then the designer can pinpoint the themes that are most common.
In this case he learned users often couldn't tell if the coffee maker
was on or off.
He ended up optimizing the design with improved placement and lighting for
the on/off button, leading to a product improvement and happier users.
Now we come to the problem of discovering connections.
This example is from the transportation industry and
uses something called third party logistics.
Third party logistics partners help businesses ship products when
they don't have their own trucks, planes or ships.
A common problem these partners face is figuring out how to reduce wait time.
Wait time happens when a truck driver from the third party logistics provider
arrives to pick up a shipment but it's not ready.
So she has to wait.
That costs both companies time and money and
it stops trucks from getting back on the road to make more deliveries.
So how can they solve this?
Well, by sharing data the partner companies can view each other's timelines
and see what's causing shipments to run late.
Then they can figure out how to avoid those problems in the future.
So a problem for one business doesn't cause a negative impact for the other.
For example, if shipments are running late because one company only delivers Mondays,
Wednesdays and Fridays, and the other company only delivers Tuesdays and
Thursdays, then the companies can choose to deliver on the same day to reduce
wait time for customers.
All right, we've come to our final problem type, finding patterns.
Oil and gas companies are constantly working to keep their machines running
properly.
So the problem is, how to stop machines from breaking down.
One way data analysts can do this is by looking at patterns
in the company's historical data.
For example, they could investigate how and when a particular machine
broke down in the past and then generate insights into what led to the breakage.
In this case, the company saw a pattern indicating that machines began breaking
down at faster rates when maintenance wasn't kept up in 15-day cycles.
They can then keep track of current conditions and
intervene if any of these issues happen again.
Pretty cool, right?
I'm always amazed to hear about how data helps real people and
businesses make meaningful change.
I hope you are too.
See you soon.
Now that we've talked about six basic problem types, it's time to start solving them. To do
that, data analysts start by asking the right questions. In this video, we're going to learn
how to ask effective questions that lead to key insights you can use to solve all kinds of
problems. As a data analyst, I ask questions constantly. It's a huge part of the job. If
someone requests that I work on a project, I ask questions to make sure we're on the same page
about the plan and the goals. And when I do get a result, I question it. Is the data showing me
something superficial? Is there a conflict somewhere that needs to be resolved? The more
questions you ask, the more you'll learn about your data and the more powerful your
insights will be at the end of the day. Some questions are more effective than others. Let's
say you're having lunch with a friend and they say, "These are the best sandwiches ever,
aren't they?" Well, that question doesn't really give you the opportunity to share your own
opinion, especially if you happen to disagree and didn't enjoy the sandwich very much. This
is called a leading question because it's leading you to answer in a certain way. Or maybe
you're working on a project and you decide to interview a family member. Say you ask your
uncle, "Did you enjoy growing up in Malaysia?" He may reply, "Yes." But you haven't learned
much about his experiences there. Your question was closed-ended. That means it can be
answered with a yes or no. These kinds of questions rarely lead to valuable insights. Now
what if someone asks you, do you prefer chocolate or vanilla? Well, what are they specifically
talking about? Ice cream, pudding, coffee flavoring or something else? What if you like
chocolate ice cream but vanilla in your coffee? What if you don't like either flavor? That's the
problem with this question. It's too vague and lacks context. Knowing the difference between
effective and ineffective questions is essential for your future career as a data analyst. After
all, the data analyst process starts with the ask phase. So it's important that we ask the right
questions. Effective questions follow the SMART methodology. That means they're specific,
measurable, action-oriented, relevant and time-bound. Let's break that down. Specific
questions are simple, significant and focused on a single topic or a few closely related ideas.
This helps us collect information that's relevant to what we're investigating. If a question is
too general, try to narrow it down by focusing on just one element. For example, instead of
asking a closed-ended question, like, are kids getting enough physical activity these days?
Ask what percentage of kids achieve the recommended 60 minutes of physical activity at
least five days a week? That question is much more specific and can give you more useful
information. Now, let's talk about measurable questions. Measurable questions can be
quantified and assessed. An example of an unmeasurable question would be, why did a
recent video go viral? Instead, you could ask how many times was our video shared on social
channels the first week it was posted? That question is measurable because it lets us count the
shares and arrive at a concrete number. Okay, now we've come to action-oriented questions.
Action-oriented questions encourage change. You might remember that problem solving is
about seeing the current state and figuring out how to transform it into the ideal future
state. Well, action-oriented questions help you get there. So rather than asking, how can we
get customers to recycle our product packaging? You could ask, what design features will
make our packaging easier to recycle? This brings you answers you can act on. All right,
let's move on to relevant questions. Relevant questions matter, are important and have
significance to the problem you're trying to solve. Let's say you're working on a problem
related to a threatened species of frog. And you asked, why does it matter that Pine Barrens
tree frogs started disappearing? This is an irrelevant question because the answer won't help
us find a way to prevent these frogs from going extinct. A more relevant question would be,
what environmental factors changed in Durham, North Carolina between 1983 and 2004
that could cause Pine Barrens tree frogs to disappear from the Sandhills Regions? This
question would give us answers we can use to help solve our problem. That's also a great
example for our final point, time-bound questions. Time-bound questions specify the time to
be studied. The time period we want to study is 1983 to 2004. This limits the range of
possibilities and enables the data analyst to focus on relevant data. Okay, now that you have
a general understanding of SMART questions, there's something else that's very important
to keep in mind when crafting questions, fairness. We've touched on fairness before, but as a
quick reminder, fairness means ensuring that your questions don't create or reinforce bias.
To talk about this, let's go back to our sandwich example. There we had an unfair question
because it was phrased to lead you toward a certain answer. This made it difficult to answer
honestly if you disagreed about the sandwich quality. Another common example of an
unfair question is one that makes assumptions. For instance, let's say a satisfaction survey
is given to people who visit a science museum. If the survey asks, what do you love most about
our exhibits? This assumes that the customer loves the exhibits which may or may not be true.
Fairness also means crafting questions that make sense to everyone. It's important for
questions to be clear and have a straightforward wording that anyone can easily
understand. Unfair questions also can make your job as a data analyst more difficult. They
lead to unreliable feedback and missed opportunities to gain some truly valuable insights.
You've learned a lot about how to craft effective questions, like how to use the SMART
framework while creating your questions and how to ensure that your questions are fair and
objective. Moving forward, you'll explore different types of data and learn how each is used to
guide business decisions. You'll also learn more about visualizations and how metrics or
measures can help create success. It's going to be great!
More about SMART questions
Companies in lots of industries today are dealing with rapid change and rising uncertainty.
Even well-established businesses are under pressure to keep up with what is new and
figure out what is next. To do that, they need to ask questions. Asking the right questions
can help spark the innovative ideas that so many businesses are hungry for these days.
The same goes for data analytics. No matter how much information you have or how
advanced your tools are, your data won’t tell you much if you don’t start with the right
questions. Think of it like a detective with tons of evidence who doesn’t ask a key suspect
about it. Coming up, you will learn more about how to ask highly effective questions, along
with certain practices you want to avoid.
Highly effective questions are SMART questions: specific, measurable, action-oriented, relevant, and time-bound.
Examples of SMART questions
Here's an example that breaks down the thought process of turning a problem question
into one or more SMART questions using the SMART method: What features do people
look for when buying a new car?
Specific: Does the question focus on a particular car feature?
Measurable: Does the question include a feature rating system?
Action-oriented: Does the question influence creation of different or new
feature packages?
Relevant: Does the question identify which features make or break a potential
car purchase?
Time-bound: Does the question validate data on the most popular features
from the last three years?
Questions should be open-ended. This is the best way to get responses that will help you
accurately qualify or disqualify potential solutions to your specific problem. So, based on
the thought process, possible SMART questions might be:
On a scale of 1-10 (with 10 being the most important) how important is your
car having four-wheel drive?
What are the top five features you would like to see in a car package?
What features, if included with four-wheel drive, would make you more
inclined to buy the car?
How much more would you pay for a car with four-wheel drive?
Has four-wheel drive become more or less popular in the last three years?
Things to avoid when asking questions
Leading questions: questions that only have a particular response
Example: This product is too expensive, isn’t it?
This is a leading question because it suggests an answer as part of the question. A better
question might be, “What is your opinion of this product?” There are tons of answers to that
question, and they could include information about usability, features, accessories, color,
reliability, and popularity, on top of price. Now, if your problem is actually focused on
pricing, you could ask a question like “What price (or price range) would make you
consider purchasing this product?” This question would provide a lot of different
measurable responses.

Closed-ended questions: questions that ask for a one-word or brief response only
Example: Were you satisfied with the customer trial?
This is a closed-ended question because it doesn’t encourage people to expand on their
answer. It is really easy for them to give one-word responses that aren’t very informative. A
better question might be, “What did you learn about customer experience from the trial?”
This encourages people to provide more detail besides “It went well.”
Vague questions: questions that aren’t specific or don’t provide context
Example: Does the tool work for you?
This question is too vague because there is no context. Is it about comparing the new tool
to the one it replaces? You just don’t know. A better inquiry might be, “When it comes to
data entry, is the new tool faster, slower, or about the same as the old tool? If faster, how
much time is saved? If slower, how much time is lost?” These questions give context (data
entry) and help frame responses that are measurable (time).
Hi,
I'm Evan. I'm a learning portfolio manager here at Google, and
I have one of the coolest jobs in the world
where I get to look at all the different technologies that affect big data
and then work them into training courses like this one for students to take.
I wish I had a course like this when I was first coming out of college or
high school.
Honestly, a data analyst course that's geared the way this one is,
if you've already taken some of the videos,
really prepares you to do anything you want. It will
open all of those doors that you want for
any of those roles inside of the data curriculum.
Well, what are some of those roles?
There are so many different career paths for someone who's interested in data.
Generally, if you're like me,
you'll come in through the door as a data analyst maybe working with spreadsheets,
maybe working with small, medium, and large databases,
but all you have to remember is 3 different core roles.
Now there are many specialties within each of these different
careers, but these three are: the data analyst,
who is generally someone who works with SQL, spreadsheets,
and databases, and might work on a business intelligence team
creating those dashboards.
Now where does all that data come from?
Generally, a data analyst will work with a data engineer to turn
that raw data into actionable pipelines.
So you have data analysts, data engineers, and then lastly,
you might have data scientists who basically say the data engineers have
built these beautiful pipelines.
Sometimes the analysts do that too. The analysts have provided us with clean and
actionable data. Then the data
scientists then worked actually to turn it into really cool machine learning
models or statistical inferences that are just
well beyond anything you could have ever imagined.
We'll share a lot of resources and links for ways that you can get excited for
each of these different roles.
And the best part is, if you're like me
when I went into school, I didn't know what I wanted to do and
you don't have to know at the outset which path you want to go down.
Try 'em all.
See what you really, really like.
It's very personal. Becoming a data analyst is so exciting.
Why? Because it's not just like a means to
an end.
It's just taking a career path where so many bright people have gone before and
have made the tools and technologies that much easier for you and me today.
For example, when I was starting to learn SQL or the structured
query language that you're going to be learning as part of this course,
I was doing it on my local laptop and each of the queries would take like
20, 30 minutes to run and it was very hard for
me to keep track of different SQL statements that I was writing or
share them with somebody else. That was about 10 or 15 years ago.
Now, through all the different companies and
all the different tools that are making data analysis tools and
technologies easier for you, you're going to have a blast creating these insights
with a lot less of the overhead that I had when I first started out.
So I'm really excited to hear what you think and
what your experience is going to be.
We've talked a lot about what data
is and how it plays into decision-making.
What do we know already?
Well, we know that data is a collection of facts.
We also know that data analysis reveals
important patterns and insights about that data.
Finally, we know that data analysis
can help us make more informed decisions.
Now, we'll look at how data plays into
the decision-making process and take
a quick look at the differences between
data-driven and data-inspired decisions.
Let's look at a real-life example.
Think about the last time you
searched "restaurants near me" and
sorted the results by rating to
help you decide which one looks best.
That was a decision you made using data.
Businesses and other organizations use data
to make better decisions all the time.
There's two ways they can do this,
with data-driven or data-inspired decision-making.
We'll talk more about
data-inspired decision-making later on,
but here's a quick definition for now.
Data-inspired decision-making
explores different data sources
to find out what they have in common.
Here at Google, we use data every single day,
in very surprising ways too.
For example, we use data to help cut back on
the amount of energy spent cooling our data centers.
After analyzing years of
data collected with artificial intelligence,
we were able to make decisions
that help reduce the energy we
use to cool our data centers by over 40 percent.
Google's People Operations team
also uses data to improve how
we hire new Googlers
and how we get them started on the right foot.
We wanted to make sure we weren't
passing over any talented applicants and
that we made their transition into
their new roles as smooth as possible.
After analyzing data on applications, interviews,
and new hire orientation processes,
we started using an algorithm.
An algorithm is a process or set of
rules to be followed for a specific task.
With this algorithm, we reviewed applicants that didn't
pass the initial screening process
to find great candidates.
Data also helped us determine
the ideal number of interviews that lead to
the best possible hiring decisions.
We've created new onboarding agendas to
help new employees get started at their new jobs.
Data is everywhere.
Today, we create so much data that scientists estimate
90 percent of the world's data
has been created in just the last few years.
Think of the potential here.
The more data we have,
the bigger the problems we can solve and
the more powerful our solutions can be.
But responsibly gathering data
is only part of the process.
We also have to turn data into
knowledge that helps us make better solutions.
I'm going to let fellow Googler,
Ed, talk more about that.
Just having tons of data isn't enough.
We have to do something meaningful with it.
Data in itself provides little value.
To quote Jack Dorsey,
the founder of Twitter and Square,
"Every single action that we do in
this world is triggering off some amount of data,
and most of that data is meaningless until someone adds
some interpretation of it
or someone adds a narrative around it."
Data is straightforward, facts collected together,
values that describe something.
Individual data points become more
useful when they're collected and structured,
but they're still somewhat meaningless by themselves.
We need to interpret data to turn it into information.
Look at Michael Phelps' time in
a 200-meter individual medley swimming race,
one minute, 54 seconds.
Doesn't tell us much. When we
compare it to his competitor's times in the race,
however, we can see that Michael came
in first place and won the gold medal.
Our analysis took data, in this case,
a list of Michael's races and times and turned it into
information by comparing it with other data.
Context is important.
We needed to know that this race was an Olympic final and
not some other random race to
determine that this was a gold medal finish.
But this still isn't knowledge.
When we consume information, understand it,
and apply it, that's when data is most useful.
In other words, Michael Phelps is a fast swimmer.
It's pretty cool how we can turn data into
knowledge that helps us in all kinds of ways,
whether it's finding the perfect restaurant or
making environmentally friendly changes.
But keep in mind,
there are limitations to data analytics.
Sometimes we don't have
access to all of the data we need,
or data is measured differently across programs,
which can make it difficult to find concrete examples.
We'll cover these more in detail later on,
but it's important that you start
thinking about them now.
Now that you know how data drives decision-making,
you know how key your role as
a data analyst is to the business.
Data is a powerful tool for decision-making,
and you can help provide businesses with the information
they need to solve problems and make new decisions,
but before that, you will
need to learn a little more about
the kinds of data you'll be
working with and how to deal with it.
Data trials and triumphs
This reading focuses on why accurate interpretation of data is key to data-driven decisions.
You have been learning why data is such a powerful business tool and how data analysts
help their companies make data-driven decisions for great results. As a quick reminder, the
goal of all data analysts is to use data to draw accurate conclusions and make good
recommendations. That all starts with having complete, correct, and relevant data.
But keep in mind, it is possible to have solid data and still make the wrong choices. It is up
to data analysts to interpret the data accurately. When data is interpreted incorrectly, it
can lead to huge losses. Consider the examples below.
Coke launch failure
In 1985, New Coke was launched, replacing the classic Coke formula. The company had
done taste tests with 200,000 people and found that test subjects preferred the taste of
New Coke over Pepsi, which had become a tough competitor. Based on this data alone,
classic Coke was taken off the market and replaced with New Coke. This was seen as the
solution to take back the market share that had been lost to Pepsi.
But as it turns out, New Coke was a massive flop and the company ended up losing tens of
millions of dollars. How could this have happened with data that seemed correct? It is
because the data wasn’t complete, which made it inaccurate. The data didn't consider how
customers would feel about New Coke replacing classic Coke. The company’s decision to
retire classic Coke was a data-driven decision based on incomplete data.
Mars orbiter loss
In 1999, NASA lost the $125 million Mars Climate Orbiter, even though it had good data.
The spacecraft burned to pieces because of poor collaboration and communication. The
Orbiter’s navigation team was using the SI or metric system (newtons) for their force
calculations, but the engineers who built the spacecraft used the English Engineering
Units system (pounds) for force calculations.
No one realized a problem even existed until the Orbiter burst into flames in the Martian
atmosphere. Later, a NASA review board investigating the root cause of the problem
figured out that the issue was isolated to the software that controlled the thrusters. One
program calculated the thrusters’ force in pounds; another program looking at the data
assumed it was in newtons. The software controllers were making data-driven decisions to
adjust the thrust based on 100% accurate data, but these decisions were wrong because of
inaccurate assumptions when interpreting it. A conversion of the data from one system of
measurement to the other could have prevented the loss.
When data is used strategically, businesses can transform and grow their revenue.
Consider the examples below.
Crate and Barrel
At Crate and Barrel, online sales jumped more than 40% during stay-at-home orders to
combat the global pandemic. Currently, online sales make up more than 65% of their
overall business. They are using data insights to accelerate their digital transformation and
bring the best of online and offline experiences together for customers.
BigQuery enables Crate and Barrel to "draw on ten times [as many] information sources
(compared to a few years ago) which are then analyzed and transformed into actionable
insights that can be used to influence the customer’s next interaction. And this, in turn,
drives revenue."
Read more about Crate and Barrel's data strategy in How one retailer’s data strategy
powers seamless customer experiences.
PepsiCo
Since the days of the New Coke launch, things have changed dramatically for beverage and
other consumer packaged goods (CPG) companies.
PepsiCo "hired analytical talent and established cross-functional workflows around an
infrastructure designed to put consumers’ needs first. Then [they] set up the right
processes to make critical decisions based on data and technology use cases. Finally, [they]
invested in the right technology stack and platforms so that data could flow into a central
cloud-based hub. This is critical. When data comes together, [they] develop a holistic
understanding of the consumer and their journeys."
Read about how PepsiCo is delivering a more personal and valuable experience to
customers using data in How one of the world’s biggest marketers ripped up its playbook
and learned to anticipate intent.
Key skills for triumphant results
As a data analyst, your own skills and knowledge will be the most important part of any
analysis project. It is important for you to keep a data-driven mindset, ask lots of questions,
experiment with many different possibilities, and use both logic and creativity along the
way. You will then be prepared to interpret your data with the highest levels of care and
accuracy. Note that there is a difference between making a decision with incomplete data
and making a decision with a small amount of data. You learned that making a decision
with incomplete data is dangerous. But sometimes accurate data from a small test can help
you make a good decision. Stay tuned. You will learn about how much data to collect later
in the program.
Hi again.
When it comes to decision-making, data is key.
But we've also learned that
there are a lot of different kinds
of questions that data might help us answer,
and these different questions
make different kinds of data.
There are two kinds of data that we'll talk about in
this video, quantitative and qualitative.
Quantitative data is all about
the specific and objective measures of numerical facts.
This can often be the what,
how many, and how often about a problem.
In other words, things you can measure,
like how many commuters
take the train to work every week.
As a financial analyst,
I work with a lot of quantitative data.
I love the certainty and accuracy of numbers.
On the other hand,
qualitative data describes
subjective or explanatory measures of
qualities and characteristics or
things that can't be measured with numerical data,
like your hair color.
Qualitative data is great
for helping us answer why questions.
For example, why people might like
a certain celebrity or snack food more than others.
With quantitative data, we can see numbers
visualized as charts or graphs.
Qualitative data can then give us
a more high-level understanding of
why the numbers are the way they are.
This is important because it helps
us add context to a problem.
As a data analyst,
you'll be using both
quantitative and qualitative analysis,
depending on your business task.
Reviews are a great example of this.
Think about a time you used reviews to decide
whether you wanted to buy something or go somewhere.
These reviews might have told you
how many people dislike that thing and why.
Businesses read these reviews too,
but they use the data in different ways.
Let's look at an example of a business using data from
customer reviews to see
qualitative and quantitative data in action.
Now, say a local ice cream shop has started using
their online reviews to engage with
their customers and build their brand.
These reviews give the ice cream shop
insights into their customers' experiences,
which they can use to inform their decision-making.
The owner notices that their rating has been going down.
He sees that lately his shop
has been receiving more negative reviews.
He wants to know why,
so he starts asking questions.
First are measurable questions.
How many negative reviews are there?
What's the average rating?
How many of these reviews use the same keywords?
These questions generate quantitative data,
numerical results that help
confirm their customers aren't satisfied.
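A minimal SQL sketch of how those measurable questions might be answered, assuming a hypothetical reviews table with rating and review_text columns (and treating ratings of 2 or below as negative, purely for the sake of the example):

-- Count all reviews, the average rating, the negative reviews,
-- and how many negative reviews repeat the keyword "frustrated".
SELECT
  COUNT(*) AS total_reviews,
  AVG(rating) AS average_rating,
  SUM(CASE WHEN rating <= 2 THEN 1 ELSE 0 END) AS negative_reviews,
  SUM(CASE WHEN rating <= 2 AND LOWER(review_text) LIKE '%frustrated%'
      THEN 1 ELSE 0 END) AS negative_reviews_mentioning_frustrated
FROM reviews;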
This data might lead them to ask different questions.
Why are customers unsatisfied?
How can we improve their experience?
These are questions that lead to qualitative data.
After looking through the reviews,
the ice cream shop owner sees a pattern,
17 of the negative reviews use
the word "frustrated." That's quantitative data.
Now we can start collecting qualitative data
by asking why this word is being repeated.
He finds that customers are
frustrated because the shop is
running out of popular flavors before the end of the day.
Knowing this, the ice cream shop can change its
weekly order to make sure it has
enough of what the customers want.
With both quantitative and qualitative data,
the ice cream shop owner was able to figure out
his customers were unhappy and understand why.
Having both types of data made it possible for
him to make the right changes and improve his business.
Now that you know the difference between
quantitative and qualitative data,
you know how to get different types of
data by asking different questions.
It's your job as a data detective to know
which questions to ask to find the right solution.
Then you can start thinking
about cool and creative ways to
help stakeholders better understand the data.
For example, interactive dashboards,
which we'll learn about soon.
Your analysis of the historical data shows that the 7:30 PM showtime was the most popular
and had the greatest attendance, followed by the 7:15 PM and 9:00 PM showtimes. You
may suggest replacing the current 8:00 PM showtime that has lower attendance with an
8:30 PM showtime. But you need more data to back up your hunch that people would be
more likely to attend the later show.
Evening movie-goers are the largest source of revenue for the theater. Therefore, you also
decide to include a question in your online survey to gain more insight.
Qualitative data for all three trends plus ticket pricing
Since you know that the theater is planning to raise ticket prices for evening showtimes in
a few months, you will also include a question in the survey to get an idea of customers’
price sensitivity.
Your final online survey might include these questions for qualitative data:
1. What went into your decision to see a movie in our theater today? (movie
attendance)
2. What do you think about the quality and value of your purchases at the
concession stand? (concession stand profitability)
3. Which showtime do you prefer, 8:00 PM or 8:30 PM, and why do you prefer
that time? (evening movie-goer preferences)
4. Under what circumstances would you choose a matinee over a nighttime
showing? (ticket price increase)
Summing it up
Data analysts will generally use both types of data in their work. Usually, qualitative data
can help analysts better understand their quantitative data by providing a reason or more
thorough explanation. In other words, quantitative data generally gives you the what, and
qualitative data generally gives you the why. By using both quantitative and qualitative
data, you can learn when people like to go to the movies and why they chose the theater.
Maybe they really like the reclining chairs, so your manager can purchase more recliners.
Maybe the theater is the only one that serves root beer. Maybe a later show time gives them
more time to drive to the theater from where popular restaurants are located. Maybe they
go to matinees because they have kids and want to save money. You wouldn’t have
discovered this information by analyzing only the quantitative data for attendance, profit,
and showtimes.
In the last video, we learned how you can visualize your data using reports and
dashboards to show off your findings in interesting ways.
In one of our examples,
the company wanted to see the sales revenue of each salesperson.
That specific measurement of data is done using metrics.
Now, I want to tell you a little bit more about the difference between data and
metrics.
And how metrics can be used to turn data into useful information.
A metric is a single, quantifiable type of data that can be used for measurement.
Think of it this way.
Data starts as a collection of raw facts, until we organize
them into individual metrics that represent a single type of data.
Metrics can also be combined into formulas that you can plug
your numerical data into.
In our earlier sales revenue example, all that data doesn't mean much
unless we use a specific metric to organize it.
So let's use revenue by individual salesperson as our metric.
Now we can see whose sales brought in the highest revenue.
Metrics usually involve simple math.
Revenue, for example, is the number of sales multiplied by the sales price.
Choosing the right metric is key.
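As a rough sketch, that metric could be calculated with a query like this, assuming a hypothetical sales table with salesperson, quantity_sold, and sale_price columns:

-- Revenue by individual salesperson: units sold multiplied by sale price, summed per person.
SELECT
  salesperson,
  SUM(quantity_sold * sale_price) AS revenue
FROM sales
GROUP BY salesperson
ORDER BY revenue DESC;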
Data contains a lot of raw details about the problem we're exploring.
But we need the right metrics to get the answers we're looking for.
Different industries will use all kinds of metrics to measure things in a data set.
Let's look at some more ways businesses in different industries use metrics.
So you can see how you might apply metrics to your collected data.
Ever heard of ROI?
Companies use this metric all the time.
ROI, or Return on Investment, is essentially a formula designed using
metrics that let a business know how well an investment is doing.
The ROI is made up of two metrics,
the net profit over a period of time and the cost of investment.
By comparing these two metrics, profit and cost of investment, the company
can analyze the data they have to see how well their investment is doing.
This can then help them decide how to invest in the future and
which investments to prioritize.
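As a simple sketch of that formula (net profit divided by the cost of investment), assuming a hypothetical investments table:

SELECT
  investment_name,
  net_profit,
  cost_of_investment,
  -- ROI expressed as a percentage of the amount invested
  ROUND(net_profit * 100.0 / cost_of_investment, 1) AS roi_percent
FROM investments;

For example, a $20,000 net profit on a $100,000 investment works out to 20,000 / 100,000 = 0.20, or a 20 percent return.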
We see metrics used in marketing too.
For example, metrics can be used to help calculate customer retention rates,
or a company's ability to keep its customers over time.
Customer retention rates can help the company compare the number of customers at
the beginning and the end of a period to see their retention rates.
This way the company knows how successful their marketing strategies are
and if they need to research new approaches to bring back more repeat
customers.
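One common way to express this (a sketch, not the only possible formula) is to subtract the new customers gained during the period from the customers at the end of the period, then divide by the customers at the start. Assuming a hypothetical customer_counts table:

SELECT
  period_name,
  customers_at_start,
  customers_at_end,
  new_customers,
  -- Share of the starting customers who were still customers at the end
  ROUND((customers_at_end - new_customers) * 100.0 / customers_at_start, 1)
    AS retention_rate_percent
FROM customer_counts;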
Different industries use all kinds of different metrics.
But there's one thing they all have in common:
they're all trying to meet a specific goal by measuring data.
This is called a metric goal: a measurable goal set by a company and evaluated using metrics.
And just like there are a lot of possible metrics,
there are lots of possible goals too.
Maybe an organization wants to meet a certain number of monthly sales,
or maybe a certain percentage of repeat customers.
By using metrics to focus on individual aspects of your data,
you can start to see the story your data is telling.
Metric goals and formulas are great ways to measure and understand data.
But they're not the only ways.
We'll talk more about how to interpret and understand data throughout this course.
So far, you've learned a lot about how to think like a data analyst.
We've explored a few different ways of thinking.
And now, I want to take that one step further by using a mathematical approach
to problem-solving.
Mathematical thinking is a powerful skill you can use to help you solve problems and
see new solutions.
So, let's take some time to talk about what mathematical thinking is, and
how you can start using it.
Using a mathematical approach doesn't mean you have to suddenly become a math whiz.
It means looking at a problem and logically breaking it down step-by-step,
so you can see the relationship of patterns in your data, and
use that to analyze your problem.
This kind of thinking can also help you figure out the best tools for analysis
because it lets us see the different aspects of a problem and
choose the best logical approach.
There are a lot of factors to consider when choosing the most helpful tool for
your analysis.
One way you could decide which tool to use is by the size of your dataset.
When working with data, you'll find that there's big and small data.
Small data can be really small.
These kinds of data tend to be made up of datasets concerned with specific
metrics over a short, well defined period of time.
Like how much water you drink in a day.
Small data can be useful for making day-to-day decisions,
like deciding to drink more water.
But it doesn't have a huge impact on bigger frameworks like business
operations.
You might use spreadsheets to organize and
analyze smaller datasets when you first start out.
Big data on the other hand has larger,
less specific datasets covering a longer period of time.
They usually have to be broken down to be analyzed.
Big data is useful for looking at large-scale questions and problems, and
it helps companies make big decisions.
When you're working with data on this larger scale, you might switch to SQL.
Let's look at an example of how a data analyst working in a hospital might use
mathematical thinking to solve a problem with the right tools.
The hospital might find that they're having a problem with over- or
underuse of their beds.
Based on that, the hospital could make bed optimization a goal.
They want to make sure that beds are available to patients who need them, but
not waste hospital resources like space or money on maintaining empty beds.
Using mathematical thinking, you can break this problem down into a step-by-step
process to help you find patterns in their data.
There's a lot of variables in this scenario.
But for now, let's keep it simple and focus on just a few key ones.
There are metrics that are related to this problem that might show us patterns in
the data:
for example, maybe the number of beds open and
the number of beds used over a period of time.
There's actually already a formula for this.
It's called the bed occupancy rate, and
it's calculated using the total number of inpatient days, and
the total number of available beds over a given period of time.
What we want to do now is take our key variables and see how their relationship
to each other might show us patterns that can help the hospital make a decision.
To do that, we have to choose the tool that makes sense for this task.
Hospitals generate a lot of patient data over a long period of time.
So logically, a tool that's capable of handling big datasets is a must.
SQL is a great choice.
In this case, you discover that the hospital always has unused beds.
Knowing that, they can choose to get rid of some beds, which saves them space and
money that they can use to buy and store protective equipment.
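As a minimal sketch of the bed occupancy rate described earlier (total inpatient days divided by available beds multiplied by the days in the period, times 100), assuming a hypothetical monthly_bed_usage table:

SELECT
  report_month,
  total_inpatient_days,
  available_beds,
  days_in_month,
  -- Occupancy as a percentage of the total bed-days available in the month
  ROUND(total_inpatient_days * 100.0 / (available_beds * days_in_month), 1)
    AS bed_occupancy_rate_percent
FROM monthly_bed_usage;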
By considering all of the individual parts of this problem logically,
mathematical thinking helped us see new perspectives that led us to a solution.
Well, that's it for now.
Great job.
You've covered a lot of material already.
You've learned about how empowering data can be in decision-making,
the difference between quantitative and qualitative analysis,
using reports and dashboards for data visualization,
metrics, and using a mathematical approach to problem-solving.
Coming up next, we'll be tackling spreadsheet basics.
You'll get to put what you've learned into action and
learn a new tool to help you along the data analysis process.
See you soon.
Big and small data
As a data analyst, you will work with data both big and small. Both kinds of data are
valuable, but they play very different roles.
Whether you work with big or small data, you can use it to help stakeholders improve
business processes, answer questions, create new products, and much more. But there are
certain challenges and benefits that come with big data. The following table explores the
differences between big and small data.
Small data | Big data
Describes a dataset made up of specific metrics over a short, well-defined time period | Describes large, less-specific datasets that cover a long period of time
Usually organized and analyzed in spreadsheets | Usually kept in a database and queried
Likely to be used by small and midsize businesses | Likely to be used by large organizations
Simple to collect, store, manage, sort, and visually represent | Takes a lot of effort to collect, store, manage, sort, and visually represent
Usually already a manageable size for analysis | Usually needs to be broken into smaller pieces to be organized and analyzed effectively for decision-making
Challenges and benefits
Here are some challenges you might face when working with big data:
- A lot of organizations deal with data overload and way too much unimportant or irrelevant information.
- Important data can be hidden deep down among all of the unimportant data, which makes it harder to find and use. This can lead to slower, less efficient decision-making.
- The data you need isn't always easily accessible.
- Current technology tools and solutions still struggle to provide measurable and reportable data. This can lead to unfair algorithmic bias.
- There are gaps in many big data business solutions.
Now for the good news! Here are some benefits that come with big data:
- When large amounts of data can be stored and analyzed, it can help companies identify more efficient ways of doing business and save a lot of time and money.
- Big data helps organizations spot trends in customer buying patterns and satisfaction levels, which can help them create new products and solutions that will make customers happy.
- By analyzing big data, businesses get a much better understanding of current market conditions, which can help them stay ahead of the competition.
- As in our earlier social media example, big data helps companies keep track of their online presence, especially feedback (both good and bad) from customers. This gives them the information they need to improve and protect their brand.
The three (or four) V words for big data
When thinking about the benefits and challenges of big data, it helps to think about the
three Vs: volume, variety, and velocity. Volume describes the amount of data. Variety
describes the different kinds of data. Velocity describes how fast the data can be processed.
Some data analysts also consider a fourth V: veracity. Veracity refers to the quality and
reliability of the data. These are all important considerations related to processing huge,
complex data sets.
Volume | Variety | Velocity | Veracity
The amount of data | The different kinds of data | How fast the data can be processed | The quality and reliability of the data
Hi, again. I'm glad you're back.
In this part of the program,
we'll revisit the spreadsheet.
Spreadsheets are a powerful and versatile tool,
which is why they're a big part of
pretty much everything we do as data analysts.
There's a good chance a
spreadsheet will be the first tool
you reach for when trying
to answer data-driven questions.
After you've defined what you need to do with the data,
you'll turn to spreadsheets to help
build evidence that you can then visualize,
and use to support your findings.
Spreadsheets are often
the unsung heroes of the data world.
They don't always get the appreciation they deserve,
but as a data detective,
you'll definitely want them in
your evidence collection kit.
I know spreadsheets have saved
the day for me more than once.
I've added data for purchase orders into a sheet,
set up formulas in one tab,
and had the same formulas do
the work for me in other tabs.
This frees up time for me to work
on other things during the day.
I couldn't imagine not using spreadsheets.
Math is a core part of every data analyst's job,
but not every analyst enjoys it.
Luckily, spreadsheets can make
calculations more enjoyable,
and by that, I mean easier. Let's see how.
Spreadsheets can do
both basic and complex calculations automatically.
Not only does this help you work more efficiently,
but it also lets you see
the results and understand how you got them.
Here's a quick look at some of
the functions that you'll
use when performing calculations.
Many functions can be used as
part of a math formula as well.
Functions and formulas also have other uses,
and we'll take a look at those too.
We'll take things one step further with
exercises that use real data from databases.
This is your chance to reorganize a spreadsheet,
do some actual data analysis,
and have some fun with data.
You have been learning a lot about spreadsheets and all kinds of time-saving calculations
and organizational features they offer. One of the most valuable spreadsheet features is a
formula. As a quick reminder, a formula is a set of instructions that does a specific
calculation using the data in a spreadsheet. Formulas make it easy for data analysts to do
powerful calculations automatically, which helps them analyze data more effectively. Below
is a quick-reference guide to help you get the most out of formulas.
Formulas
The basics
- When you write a formula in math, it generally ends with an equal sign (2 + 3 = ?). But spreadsheet formulas always start with one instead (=A2+A3). The equal sign tells the spreadsheet that what follows is part of a formula, not just a word or number in a cell.
- After you type the equal sign, most spreadsheet applications will display an autocomplete menu that lists valid formulas, names, and text strings. This is a great way to create and edit formulas while avoiding typing and syntax errors.
- A fun way to learn new formulas is to type an equal sign and a single letter of the alphabet. Choose one of the options that pops up and you will learn what that formula does.
Mathematical operators
The mathematical operators used in spreadsheet formulas include:
- Subtraction – minus sign ( - )
- Addition – plus sign ( + )
- Division – forward slash ( / )
- Multiplication – asterisk ( * )
Auto-filling
The lower-right corner of each cell has a fill handle. It is a small green square in Microsoft Excel and a small blue square in Google Sheets.
- Click the fill handle for a cell and drag it down a column to auto-fill other cells in the column with the same value or formula in that cell.
- Click the fill handle for a cell and drag it across a row to auto-fill other cells in the row with the same value or formula in that cell.
- If you want to create a numbered sequence in a column or row: 1) fill in the first two numbers of the sequence in two adjacent cells, 2) select to highlight the cells, and 3) drag the fill handle to the last cell to complete the sequence. For example, to insert 1 through 100 in each row of column A, enter 1 in cell A1 and 2 in cell A2. Then, select to highlight both cells, click the fill handle in cell A2, and drag it down to cell A100. This auto-fills the numbers sequentially so you don't have to type them in each cell.
Absolute referencing
- Absolute referencing is marked by a dollar sign ($). For example, =$A$10 has absolute referencing for both the column and the row value.
- Relative references (the default, e.g. =A10) change whenever the formula is copied and pasted. They are relative to where the referenced cell is located. For example, if you copied =A10 to the cell to the right, it would become =B10. With absolute referencing, =$A$10 copied to the cell to the right would remain =$A$10. But if you copied $A10 to the cell below, it would change to $A11 because the row value isn't an absolute reference.
- Absolute references will not change when you copy and paste the formula into a different cell. The cell being referenced is always the same.
- To easily switch between absolute and relative referencing in the formula bar, highlight the reference you want to change and press the F4 key. For example, to change the absolute reference $A$10 in your formula to the relative reference A10, highlight $A$10 in the formula bar and then press the F4 key.
Data range
- When you click into your formula, the colored ranges let you see which cells are being used in your spreadsheet. There are different colors for each unique range in your formula.
- In a lot of spreadsheet applications, you can press the F2 (or Enter) key to highlight the range of data in the spreadsheet that is referenced in a formula. Click the cell with the formula, and then press the F2 (or Enter) key to highlight the data in your spreadsheet.
Combining with functions
- COUNTIF() is both a formula and a function: the function runs based on criteria set by the formula. In this case, COUNT is the formula, and it will be executed IF the conditions you create are true. For example, you could use =COUNTIF(A1:A16, "7") to count only the cells that contain the number 7. Combining formulas and functions allows you to do more work with a single command.
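If the same values were stored in a database table instead of a spreadsheet range, the equivalent of =COUNTIF(A1:A16, "7") would be a COUNT with a WHERE condition. This is only a sketch; the survey_responses table and score column are hypothetical.

  -- Counts rows where the value is 7, similar to =COUNTIF(A1:A16, "7")
  SELECT COUNT(*) AS sevens
  FROM survey_responses
  WHERE score = 7;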
Quick reference: Functions in spreadsheets
As a quick refresher, a function is a preset command that automatically performs a specific
process or task using the data in a spreadsheet. Functions give data analysts the ability to
do calculations, which can be anything from simple arithmetic to complex equations. Use
this reading to help you keep track of some of the most useful options.
Functions
The basics
- Just like formulas, start all of your functions with an equal sign; for example, =SUM. The equal sign tells the spreadsheet that what follows is part of a function, not just a word or number in a cell.
- After you type the equal sign, most spreadsheet applications will display an autocomplete menu that lists valid functions, names, and text strings. This is a great way to create and edit functions while avoiding typing and syntax errors.
- A fun way to learn new functions is by simply typing an equal sign and a single letter of the alphabet. Choose one of the options that pops up and learn what that function does.
Difference between formulas and functions
- A formula is a set of instructions used to perform a calculation using the data in a spreadsheet.
- A function is a preset command that automatically performs a specific process or task using the data in a spreadsheet.
Popular functions
A lot of people don’t realize that keyboard shortcuts like cut, save, and find are actually
functions. These functions are built into an application and are amazing time-savers. Using
shortcuts lets you do more with less effort. They can make you more efficient and
productive because you are not constantly reaching for the mouse and navigating menus.
The following table shows some of the most popular shortcuts, for Chromebook, PC, and
Mac:
Command | Chromebook | PC | Mac
Create new workbook | Control+N | Control+N | Command+N
Open workbook | Control+O | Control+O | Command+O
Save workbook | Control+S | Control+S | Command+S
Close workbook | Control+W | Control+W | Command+W
Undo | Control+Z | Control+Z | Command+Z
Redo | Control+Y | Control+Y | Command+Y
Copy | Control+C | Control+C | Command+C
Cut | Control+X | Control+X | Command+X
Paste | Control+V | Control+V | Command+V
Paste values only | Control+Shift+V | Control+Shift+V | Command+Shift+V
Find | Control+Shift+F | Control+F | Command+F
Find and replace | Control+H | Control+H | Command+Shift+H
Insert link | Control+K | Control+K | Command+K
Bold | Control+B | Control+B | Command+B
Italicize | Control+I | Control+I | Command+I
Underline | Control+U | Control+U | Command+U
Zoom in | Control+Plus (+) | Control+Plus (+) | Option+Command+Plus (+)
Zoom out | Control+Minus (-) | Control+Minus (-) | Option+Command+Minus (-)
Select column | Control+Spacebar | Control+Spacebar | Command+Spacebar
Select row | Shift+Spacebar | Shift+Spacebar | Shift+Spacebar
Select all cells | Control+A | Control+A | Command+A
Edit the current cell | Enter | F2 | F2
Comment on a cell | Ctrl+Alt+M | Alt+I+M | Option+Command+M
Insert column to the left | Ctrl+Alt+= (with existing column selected) | Alt+Shift+I, then C | Command+Option+= (with existing column selected)
Insert column to the right | Alt+I, then O | Alt+Shift+I, then O | Ctrl+Option+I, then O
Insert row above | Ctrl+Alt+= (with existing row selected) | Alt+Shift+I, then R | Command+Option+= (with existing row selected)
Insert row below | Alt+I, then R, then B | Alt+Shift+I, then B | Ctrl+Option+I, then B
Auto-filling
The lower-right corner of each cell has a fill handle. It is a small green square in Microsoft Excel and a small blue square in Google Sheets.
- Click the fill handle for a cell and drag it down a column to auto-fill other cells in the column with the same formula or function used in that cell.
- Click the fill handle for a cell and drag it across a row to auto-fill other cells in the row with the same formula or function used in that cell.
Relative, absolute, and mixed references
- Relative references (cells referenced without a dollar sign, like A2) will change when you copy and paste the function into a different cell. With relative references, the location of the cell that contains the function determines the cells used by the function.
- Absolute references (cells fully referenced with a dollar sign, like $A$2) will not change when you copy and paste the function into a different cell. With absolute references, the cells referenced always remain the same.
- Mixed references (cells partially referenced with a dollar sign, like $A2 or A$2) will change when you copy and paste the function into a different cell. With mixed references, the location of the cell that contains the function determines the cells used by the function, but only the row or the column is relative (not both).
- In spreadsheets, you can press the F4 key to toggle between relative, absolute, and mixed references in a function. Click the cell containing the function, highlight the referenced cells in the formula bar, and then press F4 to toggle between and select relative, absolute, or mixed referencing.
Data ranges
- When you click a cell that contains a function, colored data ranges in the formula bar indicate which cells are being used in the spreadsheet. There are different colors for each unique range in a function.
- Colored data ranges help prevent you from getting lost in complex functions.
- In spreadsheets, you can press the F2 key to highlight the range of data used by a function. Click the cell containing the function, highlight the range of data used by the function in the formula bar, and then press F2. The spreadsheet will go to and highlight the cells specified by the range.
Data ranges evaluated for a condition
COUNTIF is an example of a function that returns a value based on a condition that the data
range is evaluated for. The function counts the number of cells that meet the criteria. For
example, in an expense spreadsheet, use COUNTIF to count the number of cells that contain
a reimbursement for "airfare."
For more information, refer to:
- Microsoft Support's page for COUNTIF
- Google Help Center's documentation for COUNTIF, where you can copy a sheet with COUNTIF examples (click "Use Template" if you click the COUNTIF link provided on this page)
Conclusion
There are a lot more functions that can help you make the most of your data. This is just the
start. You can keep learning how to use functions to help you solve complex problems
efficiently and accurately throughout your entire career.
Activity overview
You have been learning about the role of a data analyst and how to manage, analyze, and
visualize data. Now, you will consider a valuable tool to help you practice structured
thinking and avoid mistakes: a scope-of-work (SOW).
In this activity, you’ll get practical experience developing an SOW document with the help
of a handy template. You will then complete an example SOW for an imaginary project of
your choosing and learn how analysts outline the work they are going to perform. By the
time you complete this activity, you will be familiar with an essential, industry-standard
tool, and gain comfort asking the right questions to develop an SOW.
Before you get started, take a minute to think about the main ideas, goals, and target
audiences of SOW documents.
Scope of work: What you need to know
As a data analyst, it’s hard to overstate the importance of an SOW document. A well-defined
SOW keeps you, your team, and everyone involved with a project on the same page. It
ensures that all contributors, sponsors, and stakeholders share the same understanding of
the relevant details.
Why do you need an SOW?
The point of data analysis projects is to complete business tasks that are useful to the
stakeholders. Creating an SOW helps to make sure that everyone involved, from analysts
and engineers to managers and stakeholders, shares the understanding of what those
business goals are, and the plan for accomplishing them.
Clarifying requirements and setting expectations are two of the most important parts of a
project. Recall the first phase of the Data Analysis Process—asking questions.
As you ask more and more questions to clarify requirements, goals, data sources,
stakeholders, and any other relevant info, an SOW helps you formalize it all by recording all
the answers and details. In this context, the word “ask” means two things. Preparing to
write an SOW is about asking questions to learn the necessary information about the
project, but it’s also about clarifying and defining what you’re being asked to accomplish,
and what the limits or boundaries of the “ask” are. After all, if you can’t make a distinction
between the business questions you are and aren’t responsible for answering, then it’s
hard to know what success means!
What is a good SOW?
There’s no standard format for an SOW. They may differ significantly from one
organization to another, or from project to project. However, they all have a few
foundational pieces of content in common.
Deliverables: What work is being done, and what things are being created as a result of this
project? When the project is complete, what are you expected to deliver to the
stakeholders? Be specific here. Will you collect data for this project? How much, or for how
long?
Avoid vague statements. For example, “fixing traffic problems” doesn’t specify the scope.
This could mean anything from filling in a few potholes to building a new overpass. Be
specific! Use numbers and aim for hard, measurable goals and objectives. For example:
“Identify top 10 issues with traffic patterns within the city limits, and identify the top 3
solutions that are most cost-effective for reducing traffic congestion.”
Milestones: This is closely related to your timeline. What are the major milestones for
progress in your project? How do you know when a given part of the project is considered
complete?
Milestones can be identified by you, by stakeholders, or by other team members such as the
Project Manager. Smaller examples might include incremental steps in a larger project like
“Collect and process 50% of required data (100 survey responses)”, but may also be larger
examples like ”complete initial data analysis report” or “deliver completed dashboard
visualizations and analysis reports to stakeholders”.
Timeline: Your timeline will be closely tied to the milestones you create for your project.
The timeline is a way of mapping expectations for how long each step of the process should
take. The timeline should be specific enough to help all involved decide if a project is on
schedule. When will the deliverables be completed? How long do you expect the project
will take to complete? If all goes as planned, how long do you expect each component of the
project will take? When can we expect to reach each milestone?
Reports: Good SOWs also set boundaries for how and when you’ll give status updates to
stakeholders. How will you communicate progress with stakeholders and sponsors, and
how often? Will progress be reported weekly? Monthly? When milestones are completed?
What information will status reports contain?
At a minimum, any SOW should answer all the relevant questions in the above areas. Note
that these areas may differ depending on the project. But at their core, the SOW document
should always serve the same purpose by containing information that is specific, relevant,
and accurate. If something changes in the project, your SOW should reflect those changes.
What is in and out of scope?
SOWs should also contain information specific to what is and isn’t considered part of the
project. The scope of your project is everything that you are expected to complete or
accomplish, defined to a level of detail that doesn’t leave any ambiguity or confusion about
whether a given task or item is part of the project or not.
Notice how the previous example about studying traffic congestion defined its scope as the
area within the city limits. This doesn’t leave any room for confusion — stakeholders need
only to refer to a map to tell if a stretch of road or intersection is part of the project or not.
Defining requirements can be trickier than it sounds, so it’s important to be as specific as
possible in these documents, and to use quantitative statements whenever possible.
For example, assume that you’re assigned to a project that involves studying the
environmental effects of climate change on the coastline of a city: How do you define what
parts of the coastline you are responsible for studying, and which parts you are not?
In this case, it would be important to define the area you’re expected to study using GPS
locations, or landmarks. Using specific, quantifiable statements will help ensure that
everyone has a clear understanding of what’s expected.
Completing your own SOW
Now that you know the basics, you can practice creating your own mock SOW for a project
of your choice. To get started, first access the scope-of-work template.
What you will need
To use the template for this course item, click the link below and select “Use Template.”
Link to template: Data Analysis Project Scope-Of-Work (SOW) Template
OR
If you don’t have a Google account, you can download the template directly from the
attachment below.
Scope-Of-Work Template
DOCX File
Download file
Fill the template in for an imaginary project
Spend a few minutes thinking about a plausible data analysis project.
Come up with a problem domain, and then make up the relevant details to help you fill out
the template.
Take some time to fill out the template. Treat this exercise as if you were writing your first
SOW in your new career as a data analyst. Try to be thorough, specific, and concise!
The specifics here aren’t important. The goal is to get comfortable identifying and
formalizing requirements and using those requirements in a professional manner by
creating SOWs.
Compare your work to a strong example
Once you’ve filled out your template, consider the strong example below and compare it to
yours.
Link to the strong example: Data Analysis Project Scope-of-Work (SOW) Strong Example
OR
You can download the template directly from the attachment below.
Scope-Of-Work Exemplar.pdf
PDF File
Open file
Confirmation and reflection
When you created a complete and thorough mock SOW, which foundational pieces of
content did you include? Select all that apply.
The importance of context
Context is the condition in which something exists or happens. Context is important in data
analytics because it helps you sift through huge amounts of disorganized data and turn it
into something meaningful. The fact is, data has little value if it is not paired with context.
Image of a hand putting the final puzzle piece in a 4-piece puzzle
Understanding the context behind the data can help us make it more meaningful at every
stage of the data analysis process. For example, you might be able to make a few guesses
about what you're looking at in the following table, but you couldn't be certain without
more context.
2010 | 28000
2005 | 18000
2000 | 23000
1995 | 10000
On the other hand, if the first column was labeled to represent the years when a survey was
conducted, and the second column showed the number of people who responded to that
survey, then the table would start to make a lot more sense. Take this a step further, and
you might notice that the survey is conducted every 5 years. This added context helps you
understand why there are five-year gaps in the table.
Years (Collected every 5 years) | Respondents
2010 | 28000
2005 | 18000
2000 | 23000
1995 | 10000
Context can turn raw data into meaningful information. It is very important for data
analysts to contextualize their data. This means giving the data perspective by defining it.
To do this, you need to identify:
Who: The person or organization that created, collected, and/or funded the data collection
What: The things in the world that data could have an impact on
Where: The origin of the data
When: The time when the data was created or collected
Why: The motivation behind the creation or collection
How: The method used to create or collect it
This is an image of an unlabeled graph with 3 dashed lines (red, blue, and yellow) with a
star on the yellow line
Understanding and including the context is important during each step of your analysis
process, so it is a good idea to get comfortable with it early in your career. For example,
when you collect data, you’ll also want to ask questions about the context to make sure that
you understand the business and business process. During organization, the context is
important for your naming conventions, how you choose to show relationships between
variables, and what you choose to keep or leave out. And finally, when you present, it is
important to include contextual information so that your stakeholders understand your
analysis.
It's normal for conflict to come up in your work life. A lot of what you've learned so far, like
managing expectations and communicating effectively, can help you avoid conflict, but
sometimes you'll run into conflict anyway. If that happens, there are ways to resolve it and
move forward. In this video, we will talk about how conflict could happen and the best ways
you can practice conflict resolution. A conflict can pop up for a variety of reasons. Maybe a
stakeholder misunderstood the possible outcomes for your project; maybe you and your team
member have very different work styles; or maybe an important deadline is approaching and
people are on edge. Mismatched expectations and miscommunications are some of the most
common reasons conflicts happen. Maybe you weren't clear on who was supposed to clean a
dataset and nobody cleaned it, delaying a project. Or maybe a teammate sent out an email
with all of your insights included, but didn't mention it was your work. While it can be easy
to take conflict personally, it's important to try and be objective and stay focused on the
team's goals. Believe it or not, tense moments can actually be opportunities to re-evaluate a
project and maybe even improve things. So when a problem comes up, there are a few ways
you can flip the situation to be more productive and collaborative. One of the best ways you
can shift a situation from problematic to productive is to just re-frame the problem. Instead of
focusing on what went wrong or who to blame, change the question you're starting with. Try
asking, how can I help you reach your goal? This creates an opportunity for you and your
team members to work together to find a solution instead of feeling frustrated by the
problem. Discussion is key to conflict resolution. If you find yourself in the middle of a
conflict, try to communicate, start a conversation or ask things like, are there other
important things I should be considering? This gives your team members or stakeholders a
chance to fully lay out their concerns. But if you find yourself feeling emotional, give
yourself some time to cool off so you can go into the conversation with a clearer head. If I need
to write an email during a tense moment, I'll actually save it to drafts and come back to it
the next day to reread it before sending to make sure that I'm being level-headed. If you find
you don't understand what your team member or stakeholder is asking you to do, try to
understand the context of their request. Ask them what their end goal is, what story they're
trying to tell with the data or what the big picture is. By turning moments of potential
conflict into opportunities to collaborate and move forward, you can resolve tension and get
your project back on track. Instead of saying, "There's no way I can do that in this time
frame," try to re-frame it by saying, "I would be happy to do that, but I'll just take this
amount of time, let's take a step back so I can better understand what you'd like to do with
the data and we can work together to find the best path forward." With that, we've reached the
end of this section. Great job. Learning how to work with new team members can be a big
challenge in starting a new role or a new project but with the skills you've picked up in these
videos, you'll be able to start on the right foot with any new team you join. So far, you've
learned about balancing the needs and expectations of your team members and stakeholders.
You've also covered how to make sense of your team's roles and focus on the project objective,
the importance of clear communication and communication expectations in a workplace, and
how to balance the limitations of data with stakeholder asks. Finally, we covered how to have
effective team meetings and how to resolve conflicts by thinking collaboratively with your
team members. Hopefully now you understand how important communication is to the
success of a data analyst. These communication skills might feel a little different from some
of the other skills you've been learning in this program, but they're also an important part of
your data analyst toolkit and your success as a professional data analyst. Just like all of the
other skills you're learning right now, your communication skills will grow with practice
and experience.
Limitations of data
Data is powerful, but it has its limitations. Has someone’s personal opinion found its way
into the numbers? Is your data telling the whole story? Part of being a great data analyst is
knowing the limits of data and planning for them. This reading explores how you can do
that.
If you have incomplete or nonexistent data, you might realize during an analysis that you
don't have enough data to reach a conclusion. Or, you might even be solving a different
problem altogether! For example, suppose you are looking for employees who earned a
particular certificate but discover that certification records go back only two years at your
company. You can still use the data, but you will need to make the limits of your analysis
clear. You might be able to find an alternate source of the data by contacting the company
that led the training. But to be safe, you should be up front about the incomplete dataset
until that data becomes available.
If you're collecting data from other teams and using existing spreadsheets, it is good to
keep in mind that people use different business rules. So one team might define and
measure things in a completely different way than another. For example, if a metric is the
total number of trainees in a certificate program, you could have one team that counts
every person who registered for the training, and another team that counts only the people
who completed the program. In cases like these, establishing how to measure things early
on standardizes the data across the board for greater reliability and accuracy. This will
make sure comparisons between teams are meaningful and insightful.
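To see how easily the same metric can diverge, here is a rough SQL sketch of both counting rules side by side. The trainees table and its columns are hypothetical; COUNT(completed_date) only counts rows where a completion date was actually recorded.

  -- One row per registration in a hypothetical trainees table
  SELECT
    COUNT(*) AS registered_trainees,            -- everyone who registered
    COUNT(completed_date) AS completed_trainees -- only those who finished the program
  FROM trainees
  WHERE program = 'Certificate Program';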
Dirty data refers to data that contains errors. Dirty data can lead to productivity loss,
unnecessary spending, and unwise decision-making. A good data cleaning effort can help
you avoid this. As a quick reminder, data cleaning is the process of fixing or removing
incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
When you find and fix the errors - while tracking the changes you made - you can avoid a
data disaster. You will learn how to clean data later in the training.
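As a small preview of what that cleaning can look like in SQL, the sketch below uses TRIM to remove extra spaces and DISTINCT to drop repeated entries. The customers table and its columns are hypothetical.

  -- Trim stray spaces and remove duplicate rows from a hypothetical customers table
  SELECT DISTINCT
    TRIM(customer_name) AS customer_name,
    TRIM(email) AS email
  FROM customers;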
Avinash Kaushik, a Digital Marketing Evangelist for Google, has lots of great tips for data
analysts in his blog: Occam's Razor. Below are some of the best practices he recommends
for good data storytelling:
- Compare the same types of data: Data can get mixed up when you chart it for visualization. Be sure to compare the same types of data and double check that any segments in your chart definitely display different metrics.
- Visualize with care: A 0.01% drop in a score can look huge if you zoom in close enough. To make sure your audience sees the full story clearly, it is a good idea to set your Y-axis to 0.
- Leave out needless graphs: If a table can show your story at a glance, stick with the table instead of a pie chart or a graph. Your busy audience will appreciate the clarity.
- Test for statistical significance: Sometimes two datasets will look different, but you will need a way to test whether the difference is real and important. So remember to run statistical tests to see how much confidence you can place in that difference.
- Pay attention to sample size: Gather lots of data. If a sample size is small, a few unusual responses can skew the results. If you find that you have too little data, be careful about using it to form judgments. Look for opportunities to collect more data, then chart those trends over longer periods.
In any organization, a big part of a data analyst’s role is making sound judgments. When
you know the limitations of your data, you can make judgment calls that help people make
better decisions supported by the data. Data is an extremely powerful tool for decision-making, but if it is incomplete, misaligned, or hasn't been cleaned, then it can be misleading.
Take the necessary steps to make sure that your data is complete and
consistent. Clean the data before you begin your analysis to save yourself and possibly
others a great amount of time and effort.
Data modeling levels and techniques
This reading introduces you to data modeling and different types of data models. Data
models help keep data consistent and enable people to map out how data is organized. A
basic understanding makes it easier for analysts and other stakeholders to make sense of
their data and use it in the right ways.
Important note: As a junior data analyst, you won't be asked to design a data model. But
you might come across existing data models your organization already has in place.
What is data modeling?
Data modeling is the process of creating diagrams that visually represent how data is
organized and structured. These visual representations are called data models. You can
think of data modeling as a blueprint of a house. At any point, there might be electricians,
carpenters, and plumbers using that blueprint. Each one of these builders has a different
relationship to the blueprint, but they all need it to understand the overall structure of the
house. Data models are similar; different users might have different data needs, but the
data model gives them an understanding of the structure as a whole.
Levels of data modeling
Each level of data modeling has a different level of detail.
1. Conceptual data modeling gives a high-level view of the data structure, such
as how data interacts across an organization. For example, a conceptual data
model may be used to define the business requirements for a new database. A
conceptual data model doesn't contain technical details.
2. Logical data modeling focuses on the technical details of a database such as
relationships, attributes, and entities. For example, a logical data model defines
how individual records are uniquely identified in a database. But it doesn't spell
out actual names of database tables. That's the job of a physical data model.
3. Physical data modeling depicts how a database operates. A physical data
model defines all entities and attributes used; for example, it includes table
names, column names, and data types for the database.
More information can be found in this comparison of data models.
Data-modeling techniques
There are a lot of approaches when it comes to developing data models, but two common
methods are the Entity Relationship Diagram (ERD) and the Unified Modeling
Language (UML) diagram. ERDs are a visual way to understand the relationship between
entities in the data model. UML diagrams are very detailed diagrams that describe the
structure of a system by showing the system's entities, attributes, operations, and their
relationships. As a junior data analyst, you will need to understand that there are different
data modeling techniques, but in practice, you will probably be using your organization’s
existing technique.
You can read more about ERD, UML, and data dictionaries in this data modeling techniques
article.
Data analysis and data modeling
Data modeling can help you explore the high-level details of your data and how it is related
across the organization’s information systems. Data modeling sometimes requires data
analysis to understand how the data is put together; that way, you know how to map the
data. And finally, data models make it easier for everyone in your organization to
understand and collaborate with you on your data. This is important for you and everyone
on your team!
By now you've learned a lot about data.
From generated data, to collected data, to data formats,
it's good to know as much as you can
about the data you'll use for analysis.
In this video, we'll talk about another way
you can describe data: the data type.
A data type is a specific kind of data attribute
that tells what kind of value the data is.
In other words, a data type
tells you what kind of data you're working with.
Data types can be different
depending on the query language you're using.
For example, SQL allows for
different data types depending
on which database you're using.
For now though, let's focus on
the data types that you'll use in spreadsheets.
To help us out, we'll use
a spreadsheet that's already filled with data.
We'll call it "Worldwide
Interests in Sweets through Google Searches."
Now a data type in
a spreadsheet can be one of three things:
a number, a text
or string, or a Boolean.
You might find spreadsheet programs that classify them
a bit differently or include other types,
but these value types cover just about
any data you'll find in spreadsheets.
We'll look at all of these in just a bit.
Looking at columns B, D,
and F, we find number data types.
Each number represents the search interest
for the terms "cupcakes,"
"ice cream," and "candy" for a specific week.
The closer a number is to 100,
the more popular that search term was during that week.
One hundred represents peak popularity.
Keep in mind that in this case,
100 is a relative value,
not the actual number of searches.
It represents the maximum number
of searches during a certain time.
Think of it like a percentage on a test.
All other searches are then also valued out of 100.
You might notice this in other data sets as well.
Gold star for 100!
If you needed to, you could change the numbers into
percents or other formats, like currency.
These are all examples of number data types.
In column H, the data shows
the most popular treat for each week,
based on the search data.
So as we'll find in cell H4 for
the week beginning July 28th, 2019,
the most popular treat was ice cream.
This is an example of a text data type,
or a string data type,
which is a sequence of characters and
punctuation that contains textual information.
In this example, that information
would be the treats and people's names.
These can also include numbers, like
phone numbers or numbers in street addresses.
But these numbers wouldn't be used for calculations.
In this case they're treated like text, not numbers.
In columns C, E,
and G, it seems like we've got some text.
But the text here isn't a text or string data type.
Instead, it's a Boolean data type.
A Boolean data type is
a data type with only two possible values:
true or false.
Columns C, E, and G show
Boolean data for whether the
search interest for each week
is at least 50 out of 100.
Here's how it works. To get this data,
we've created a formula that calculates
whether the search interest data in columns B,
D, and F is 50 or greater.
In cell B4, the search interest is 14.
In cell C4, we find the word false
because, for this week of data,
the search interest is less than 50.
For each cell in columns C, E,
and G, the only two possible values are true or false.
We could change the formula so
other words appear in these cells instead,
but it's still Boolean data.
You'll get a chance to read more
about the Boolean data type soon.
Let's talk about a common issue that
people encounter in spreadsheets:
mistaking data types with cell values.
For example, in cell B57,
we can create a formula to calculate data in other cells.
This will give us the average of the search interests
in cupcakes across all weeks in the dataset,
which is about 15.
The formula works because we
calculated using a number data type.
But if we tried it with a text or string data type,
like the data in column C, we'd get an error.
Error values usually happen if a mistake is
made in entering the values in the cells.
The more you know your data types and which ones to use,
the fewer errors you'll run into.
There you have it, a data type for everyone.
We're not done yet. Coming up,
we'll go deeper into the relationship between data types,
fields, and values. See you soon.
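One note before moving on: as the video mentions, query languages such as SQL have data types too. The sketch below shows how a table for the sweets search data might declare a number, a text, and a Boolean column. The table and column names are invented for illustration, and exact type names vary from one database to another.

  -- Exact type names differ by database; this is a generic example
  CREATE TABLE search_interest (
    week_start DATE,             -- the week the searches were collected
    cupcake_interest INTEGER,    -- number data type (search interest from 0 to 100)
    top_treat VARCHAR(50),       -- text or string data type
    cupcake_over_50 BOOLEAN      -- Boolean data type (TRUE or FALSE)
  );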
Understanding Boolean logic
In this reading, you will explore the basics of Boolean logic and learn how to use multiple
conditions in a Boolean statement. These conditions are created with Boolean operators,
including AND, OR, and NOT. These operators are similar to mathematical operators and
can be used to create logical statements that filter your results. Data analysts use Boolean
statements to do a wide range of data analysis tasks, such as creating queries for searches
and checking for conditions when writing programming code.
Boolean logic example
Imagine you are shopping for shoes, and are considering certain preferences:
- You will buy the shoes only if they are pink and grey
- You will buy the shoes if they are entirely pink or entirely grey, or if they are pink and grey
- You will buy the shoes if they are grey, but not if they have any pink
Below are Venn diagrams that illustrate these preferences. AND is the center of the Venn
diagram, where two conditions overlap. OR includes either condition. NOT includes only
the part of the Venn diagram that doesn't contain the exception.
The AND operator
Your condition is “If the color of the shoe has any combination of grey and pink, you will
buy them.” The Boolean statement would break down the logic of that statement to filter
your results by both colors. It would say “IF (Color=”Grey”) AND (Color=”Pink”) then buy
them.” The AND operator lets you stack multiple conditions.
Below is a simple truth table that outlines the Boolean logic at work in this statement. In
the Color is Grey column, there are two pairs of shoes that meet the color condition. And in
the Color is Pink column, there are two pairs that meet that condition. But in the If Grey
AND Pink column, there is only one pair of shoes that meets both conditions. So, according
to the Boolean logic of the statement, there is only one pair marked true. In other words,
there is one pair of shoes that you can buy.
Color is Grey | Color is Pink | If Grey AND Pink, then Buy | Boolean Logic
Grey/True | Pink/True | True/Buy | True AND True = True
Grey/True | Black/False | False/Don't buy | True AND False = False
Red/False | Pink/True | False/Don't buy | False AND True = False
Red/False | Green/False | False/Don't buy | False AND False = False
The OR operator
The OR operator lets you move forward if either one of your two conditions is met. Your
condition is “If the shoes are grey or pink, you will buy them.” The Boolean statement
would be “IF (Color=”Grey”) OR (Color=”Pink”) then buy them.” Notice that any shoe that
meets either the Color is Grey or the Color is Pink condition is marked as true by the
Boolean logic. According to the truth table below, there are three pairs of shoes that you
can buy.
Color is Grey | Color is Pink | If Grey OR Pink, then Buy | Boolean Logic
Red/False | Black/False | False/Don't buy | False OR False = False
Black/False | Pink/True | True/Buy | False OR True = True
Grey/True | Green/False | True/Buy | True OR False = True
Grey/True | Pink/True | True/Buy | True OR True = True
The NOT operator
Finally, the NOT operator lets you filter by subtracting specific conditions from the results.
Your condition is "You will buy any grey shoe except for those with any traces of pink in
them." Your Boolean statement would be “IF (Color="Grey") AND (Color=NOT “Pink”)
then buy them.” Now, all of the grey shoes that aren't pink are marked true by the Boolean
logic for the NOT Pink condition. The pink shoes are marked false by the Boolean logic for
the NOT Pink condition. Only one pair of shoes is excluded in the truth table below.
Color is Grey | Color is Pink | Boolean Logic for NOT Pink | If Grey AND (NOT Pink), then Buy
Grey/True | Red/False | Not False = True | True/Buy
Grey/True | Black/False | Not False = True | True/Buy
Grey/True | Green/False | Not False = True | True/Buy
Grey/True | Pink/True | Not True = False | False/Don't buy
The power of multiple conditions
For data analysts, the real power of Boolean logic comes from being able to combine
multiple conditions in a single statement. For example, if you wanted to filter for shoes that
were grey or pink, and waterproof, you could construct a Boolean statement such as: “IF
((Color = “Grey”) OR (Color = “Pink”)) AND (Waterproof=“True”).” Notice that you can
use parentheses to group your conditions together.
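In a database, that same logic becomes a WHERE clause. The shoes table and its columns below are hypothetical, and depending on the database a Boolean column may be compared to TRUE or to 1, but the AND, OR, NOT, and parentheses behave exactly as described above.

  -- Grey or pink shoes that are also waterproof
  SELECT shoe_name, color, waterproof
  FROM shoes
  WHERE (color = 'Grey' OR color = 'Pink')
    AND waterproof = TRUE;

  -- Grey shoes only, excluding pink (the NOT example)
  SELECT shoe_name, color
  FROM shoes
  WHERE color = 'Grey'
    AND NOT (color = 'Pink');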
Whether you are doing a search for new shoes or applying this logic to your database
queries, Boolean logic lets you create multiple conditions to filter your results. And now
that you know a little more about how Boolean logic is used, you can start using it!
Additional Reading/Resources
- Learn about who pioneered Boolean logic in this historical article: Origins of Boolean Algebra in the Logic of Classes.
- Find more information about using AND, OR, and NOT from these tips for searching with Boolean operators.
In this reading, you will explore how data is transformed and the differences between wide
and long data. Data transformation is the process of changing the data’s format, structure,
or values. As a data analyst, there is a good chance you will need to transform data at some
point to make it easier for you to analyze it.
Data transformation usually involves:
- Adding, copying, or replicating data
- Deleting fields or records
- Standardizing the names of variables
- Renaming, moving, or combining columns in a database
- Joining one set of data with another
- Saving a file in a different format, such as saving a spreadsheet as a comma-separated values (CSV) file
Why transform data?
Goals for data transformation might be:
- Data organization: better organized data is easier to use
- Data compatibility: different applications or systems can then use the same data
- Data migration: data with matching formats can be moved from one system to another
- Data merging: data with the same organization can be merged together
- Data enhancement: data can be displayed with more detailed fields
- Data comparison: apples-to-apples comparisons of the data can then be made
Data transformation example: data merging
Mario is a plumber who owns a plumbing company. After years in the business, he buys
another plumbing company. Mario wants to merge the customer information from his
newly acquired company with his own, but the other company uses a different database.
So, Mario needs to make the data compatible. To do this, he has to transform the format of
the acquired company’s data. Then, he must remove duplicate rows for customers they had
in common. When the data is compatible and together, Mario’s plumbing company will
have a complete and merged customer database.
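A rough SQL sketch of Mario's merge, once the acquired company's records have been loaded into a table with matching columns, might look like this. Both table names are hypothetical, and UNION (without ALL) removes rows that are exact duplicates.

  -- Combine both customer lists and drop exact duplicate rows
  SELECT customer_name, phone, address FROM mario_customers
  UNION
  SELECT customer_name, phone, address FROM acquired_customers;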
Data transformation example: data organization (long to wide)
To make it easier to create charts, you may also need to transform long data to wide data.
Consider the following example of transforming stock prices (collected as long data) to
wide data.
Long data is data where each row contains a single data point for a particular item. In
the long data example below, individual stock prices (data points) have been collected for
Apple (AAPL), Amazon (AMZN), and Google (GOOGL) (particular items) on the given dates.
Long data example: Stock prices
Wide data is data where each row contains multiple data points for the particular items
identified in the columns.
Wide data example: Stock prices
With data transformed to wide data, you can create a chart comparing how each company's
stock changed over the same period of time.
You might notice that all the data included in the long format is also in the wide format. But
wide data is easier to read and understand. That is why data analysts typically transform
long data to wide data more often than they transform wide data to long data. The
following table summarizes when each format is preferred:
Wide data is preferred when | Long data is preferred when
Creating tables and charts with a few variables about each subject | Storing a lot of variables about each subject, for example, many years of interest rates for each bank
Comparing straightforward line graphs | Performing advanced statistical analysis or graphing
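If the long stock data lived in a database, one common way to reshape it into wide data is conditional aggregation: one row per date, one column per stock symbol. This is only a sketch; the stock_prices table and its columns (price_date, symbol, closing_price) are hypothetical.

  -- Pivot long data (one row per date per symbol) into wide data (one row per date)
  SELECT
    price_date,
    MAX(CASE WHEN symbol = 'AAPL' THEN closing_price END) AS aapl_price,
    MAX(CASE WHEN symbol = 'AMZN' THEN closing_price END) AS amzn_price,
    MAX(CASE WHEN symbol = 'GOOGL' THEN closing_price END) AS googl_price
  FROM stock_prices
  GROUP BY price_date
  ORDER BY price_date;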
Activity overview
By now, you’ve learned a lot about different data types and data structures. In this activity,
you will work with datasets from Kaggle, an online community of people passionate about
data. To start this activity, you’ll create a Kaggle account, set up a profile, and explore
Kaggle notebooks.
Every data analyst has a data community that they rely on for help, support, and
inspiration. Kaggle can help you build your own data community.
Kaggle has millions of users in all stages of their data career, from beginners to data
scientists with decades of experience. The Kaggle community brings people together to
develop their data analysis skills, share datasets and interactive notebooks, and collaborate
on solving real-life data problems.
Check out this brief introductory video to learn more about Kaggle.
By the time you complete this activity, you will be able to use many of Kaggle’s key features.
This will enable you to create notebooks and browse data, which is important for
completing and sharing data projects in your career as a data analyst.
Create a Kaggle account
To get started, follow these steps to create a Kaggle account.
Note: Kaggle frequently updates its user interface. The latest changes may not be reflected
in the screenshots, but the principles in this activity remain the same. Adapting to changes
in software updates is an essential skill for data analysts, and we encourage you to practice
troubleshooting. You can also reach out to your community of learners on the discussion
forum for help.
1. Go to kaggle.com
2. Click on the Register button at the top-right of the Kaggle homepage. You can register
with your Google credentials or your email address.
screenshot of kaggle homepage. The register button is highlighted
3. Once you’re registered and logged in to Kaggle, click on the Account icon at the top-right
of your screen. In the menu that opens, click the Your Profile button.
4. On your profile page, click on the Edit Profile button. Enter any information you’d like to
share with the Kaggle community. Your profile will be public, so only enter the information
you’re comfortable sharing.
5. If you want some inspiration, check out the profile of Kaggle’s Community Advocate,
Jesse Mostipak!
Explore Kaggle notebooks
Now that you’ve created an account and set up your profile, you can check out some
notebooks on Kaggle. Kagglers use notebooks to share datasets and data analyses.
Step 1: Go to the Code home page
First, go to the Navigation bar on the left side of your screen. Then, click on the Code icon.
This takes you to the Code home page.
Step 2: Review Kaggler contributions
On the Code home page, you’ll notice links to notebooks created by other Kagglers.
To begin, feel free to scroll through the list and click on notebooks that interest you. As you
explore, you may come across unfamiliar terms and new information: That’s fine! Kagglers
come from diverse backgrounds and focus on different areas of data analysis, data science,
machine learning, and deep learning.
Step 3: Narrow your search
Once you’re familiar with the Code home page, you can narrow your search results by
typing a word in the search bar or by using the filter feature.
For example, type Beginner in the search bar to show notebooks tagged as beginner-friendly. Or, click on the Filter icon, the triangle shape on the right side of the search bar.
You can filter results by tags, programming language, output, and other options. Filter to
Datasets to show notebooks that use one of the tens of thousands of public datasets
available on Kaggle.
Step 4: Review suggested notebooks
If you’re looking for specific suggestions, check out the following notebooks:
gganimate by Meg Risdal
Getting staRted in R by Rachael Tatman
Writing Hamilton Lyrics with TensorFlow/R by Ana Sofia Uzsoy
Dive into dplyr (tutorial #1) by Jesse Mostipak
Spend some time checking out a couple of notebooks to get an idea of the work that
Kagglers share online—and that you’ll be able to create by the time you’ve finished this
course!
Edit a notebook
Now, take a look at a specific notebook: Dive into dplyr (tutorial #1) by Jesse Mostipak.
Follow these steps to learn how to edit notebooks:
1. Click on the link to open up the notebook. It contains the dataset you’ll work with later
on.
2. Click on the Copy and Edit button at the top-right to make a copy of the notebook in your
account. Now, the notebook appears in Edit mode. Edit mode lets you make changes to the
notebook if you want.
Screenshot of a notebook viewer page. Introductory text has been copied and pasted
This notebook is private. If you want to share your work, you can choose to make it public.
When you copy and edit another Kaggler’s work, always make meaningful changes to the
notebook before publishing it. That way, you’re not misrepresenting someone else’s work
as your own.
3. Take a moment to explore the Edit mode of the notebook.
Some of this may seem unfamiliar—and that’s just fine. By the end of this course, you’ll
know how to create a notebook like this from scratch!
What is data anonymization?
You have been learning about the importance of privacy in data analytics. Now, it is time to
talk about data anonymization and what types of data should be anonymized. Personally
identifiable information, or PII, is information that can be used by itself or with other
data to track down a person's identity.
Data anonymization is the process of protecting people's private or sensitive data by
eliminating that kind of information. Typically, data anonymization involves blanking,
hashing, or masking personal information, often by using fixed-length codes to represent
data columns, or hiding data with altered values.
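As a rough illustration only, a masked view of customer data might be produced with a query like the one below. The customers table and its columns are hypothetical, and in practice anonymization is usually handled with dedicated tools and policies rather than an ad hoc query.

  -- Replace personally identifiable columns with placeholders before sharing
  SELECT
    customer_id,
    'REDACTED' AS customer_name,
    'REDACTED' AS email,
    NULL AS phone_number,
    signup_date,
    total_purchases
  FROM customers;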
Your role in data anonymization
Organizations have a responsibility to protect their data and the personal information that
data might contain. As a data analyst, you might be expected to understand what data
needs to be anonymized, but you generally wouldn't be responsible for the data
anonymization itself. A rare exception might be if you work with a copy of the data for
testing or development purposes. In this case, you could be required to anonymize the data
before you work with it.
What types of data should be anonymized?
Healthcare and financial data are two of the most sensitive types of data. These industries
rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why
data in these two industries usually goes through de-identification, which is a process
used to wipe data clean of all personally identifying information.
Data anonymization is used in just about every industry. That is why it is so important for
data analysts to understand the basics. Here is a list of data that is often anonymized:
- Telephone numbers
- Names
- License plates and license numbers
- Social security numbers
- IP addresses
- Medical records
- Email addresses
- Photographs
- Account numbers
For some people, it just makes sense that this type of data should be anonymized. For
others, we have to be very specific about what needs to be anonymized. Imagine a world
where we all had access to each other’s addresses, account numbers, and other identifiable
information. That would invade a lot of people’s privacy and make the world less safe. Data
anonymization is one of the ways we can keep data private and secure!
The open-data debate
Just like data privacy, open data is a widely debated topic in today’s world. Data analysts
think a lot about open data, and as a future data analyst, you need to understand the basics
to be successful in your new role.
What is open data?
In data analytics, open data is part of data ethics, which has to do with using data
ethically. Openness refers to free access, usage, and sharing of data. But for data to be
considered open, it has to:
Be available and accessible to the public as a complete dataset
Be provided under terms that allow it to be reused and redistributed
Allow universal participation so that anyone can use, reuse, and redistribute the data
Data can only be considered open when it meets all three of these standards.
The open data debate: What data should be publicly available?
One of the biggest benefits of open data is that credible databases can be used more widely.
Basically, this means that all of that good data can be leveraged, shared, and combined with
other data. This could have a huge impact on scientific collaboration, research advances,
analytical capacity, and decision-making. But it is important to think about the individuals
being represented by the public, open data, too.
Third-party data is collected by an entity that doesn’t have a direct relationship with the
data. You might remember learning about this type of data earlier. For example, third
parties might collect information about visitors to a certain website. Doing this lets these
third parties create audience profiles, which helps them better understand user behavior
and target them with more effective advertising.
Personal identifiable information (PII) is data that is reasonably likely to identify a
person and make information known about them. It is important to keep this data safe. PII
can include a person’s address, credit card information, social security number, medical
records, and more.
Everyone wants to keep personal information about themselves private. Because third-party data is readily available, it is important to balance the openness of data with the
privacy of individuals.
Balancing security and analytics
The battle between security and data analytics
Data security means protecting data from unauthorized access or corruption by putting
safety measures in place. Usually the purpose of data security is to keep unauthorized users
from accessing or viewing sensitive data. Data analysts have to find a way to balance data
security with their actual analysis needs. This can be tricky-- we want to keep our data safe
and secure, but we also want to use it as soon as possible so that we can make meaningful
and timely observations.
In order to do this, companies need to find ways to balance their data security measures
with their data access needs.
Luckily, there are a few security measures that can help companies do just that. The two we
will talk about here are encryption and tokenization.
Encryption uses an algorithm and a secret key to alter data and make it unusable by users
and applications that don’t have the key. Because the key can reverse the encryption,
anyone who holds the key can still use the data in its original form.
Tokenization replaces the data elements you want to protect with randomly generated
data referred to as a “token.” The original data is stored in a separate location and mapped
to the tokens. To access the complete original data, the user or application needs to have
permission to use the tokenized data and the token mapping. This means that even if the
tokenized data is hacked, the original data is still safe and secure in a separate location.
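Here is a minimal Python sketch of the idea behind tokenization. The card number is made up, and in a real system the token vault would live in a separately secured service rather than in the same program.

import secrets

# Hypothetical "vault" mapping tokens back to the original values.
# In practice this mapping lives in a separately secured system.
token_vault: dict[str, str] = {}

def tokenize(value: str) -> str:
    """Replace a sensitive value with a randomly generated token."""
    token = secrets.token_hex(8)
    token_vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value; requires access to the token vault."""
    return token_vault[token]

card_number = "4111-1111-1111-1111"   # made-up example value
token = tokenize(card_number)
print(token)                # safe to store or analyze
print(detokenize(token))    # original value, only with vault access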
Encryption and tokenization are just some of the data security options out there. There are
a lot of others, like using authentication devices for AI technology.
As a junior data analyst, you probably won’t be responsible for building out these systems.
A lot of companies have entire teams dedicated to data security or hire third party
companies that specialize in data security to create these systems. But it is important to
know that all companies have a responsibility to keep their data secure, and to understand
some of the potential systems your future employer might use.
Signing up
Signing up with LinkedIn is simple. Just follow these steps:
1. Browse to linkedin.com
2. Click Join now or Join with resume.
If you clicked Join now:
1. Enter your email address and a password and click Agree & Join (or click Join
with Google to link to a Google account).
2. Enter your first and last name and click Continue.
3. Enter your country/region, your postal code, and your location within that area (this
helps LinkedIn find job opportunities near you).
4. Enter your most recent job title, or select I’m a student.
5. If you entered your most recent job title, select your employment type and
enter the name of your most recent company.
6. If you selected self-employed or freelance, LinkedIn will ask for your industry.
7. Click confirm your email address. You will receive an email from LinkedIn.
8. To confirm your email address, click Agree & Confirm in your email.
9. LinkedIn will then ask if you are looking for a job. Click the answer that applies.
If you select Yes, LinkedIn will help you start looking for job opportunities.
If you clicked Join with resume:
1. Click Upload your resume and select the file to upload.
2. Follow any of the steps under Join Now that are relevant.
The Join with resume option saves you some time because it auto-fills most of the
information from your resume. And just like that, your initial profile is now ready!
Including basic information in your profile
It is a good idea to take your time filling out every section of your profile. This helps
recruiters find your profile and helps people you connect with get to know you better. Start
with your photo. Here are some tips to help you choose a great picture for your new profile:
Choose an image that looks like you: You want to make sure that your profile is the best representation of you, and that includes your photo. You want a potential connection or potential employer to be able to recognize you from your profile picture if you were to meet.
Use your industry as an example: If you are having trouble deciding what is appropriate for your profile image, look at other profiles in the same industry or from companies you are interested in to get a better sense of what you should be doing.
Choose a high-resolution image: The better the resolution, the better impression it makes, so make sure the image you choose isn’t blurry. The ideal image size for a LinkedIn profile picture is 400 x 400 pixels. Use a photo where your face takes up at least 60% of the space in the frame.
Remember to smile: Your profile picture is a snapshot of who you are as a person, so it is okay to be serious in your photo. But smiling helps put potential connections and potential employers at ease.
Adding connections
Connections are a great way to keep up to date with your previous coworkers, colleagues,
classmates, or even companies you want to work with. The world is a big place with a lot of
people. So here are some tips to help get you started.
1. Connect to people you know personally.
2. Add a personal touch to your invitation message. Instead of just letting them
know you would like to connect, let them know why.
3. Make sure your profile picture is current so people can recognize you.
4. Add value. Provide them with a resource, a website link, or even some content
they might find interesting in your invitation to connect.
Finding leaders and influencers
LinkedIn is a great place to find great people and great ideas. From technology to
marketing, and everything in between, there are all kinds of influencers and thought
leaders active on LinkedIn. If you have ever wanted to know the thoughts of some of the
most influential and respected minds in a certain field, LinkedIn is a great place to start.
Following your favorite people takes only a few minutes. You can search for people or
companies individually, or you can use these lists as starting points.
Top influencers on LinkedIn
LinkedIn Top Voices 2020: Data Science & AI
Looking for a new position
On LinkedIn, letting recruiters and potential employers know that you are in the market for
a new job is simple. Just follow these steps:
1. Click the Me icon at the top of your LinkedIn homepage.
2. Click View profile.
3. Click the Add profile section drop-down and under Intro, select Looking for a
new job.
Make sure to select the appropriate filters for the new positions you might be looking for
and update your profile to better fit the role that you are applying for.
Keeping your profile up to date
Add to your profile to keep it complete, current, and interesting. For example, remember to
add the Google Data Analytics Certificate to your profile after you complete the program!
Building connections on LinkedIn
Using LinkedIn to connect
A connection is someone you know and trust on a personal or professional basis. Your
connections are who make up your network. And when it comes to your network, it is
important to remember quality over quantity. So don’t focus on how many connections you
have. Instead, make sure that everyone you connect with adds value to your network, and
vice versa.
Inviting those you know versus making cold requests
Adding connections on LinkedIn is easy. You invite people to join your network, and they
accept your invitation. When you send an invitation, you can attach a personal note.
Personal notes are highly recommended.
A great way to increase the number of your connections is to invite classmates, friends,
teachers, or even members of a club or organization you are in. LinkedIn also gives
suggestions for connections based on your profile information. Here's an example
(template) that you can use to connect with a former co-worker:
The message: Hi <fill in name here>, Please accept my invitation to connect. It has been a while
since we were at <fill in company name here> and I look forward to catching up with you. I’m
looking for job opportunities and would love to hear about what you’re doing and who is hiring
in your organization. Best regards, <fill in your name here>
Cold requests on LinkedIn are invitations to connect with people you don’t know
personally or professionally. When you start to build your network, it is best to connect
with people you already know. But cold requests might be the only way to connect with
people who work at companies you are interested in. You can learn a lot about a company’s
culture and job openings from current employees. As a best practice, send cold requests
rarely and only when there is no other way to connect.
Asking for recommendations (references)
Recommendations on LinkedIn are a great way to have others vouch for you. Ask people to
comment on your past performance, how you handled a challenging project, or your
strengths as a data analyst. You can choose to accept, reject, show, or hide
recommendations in your profile.
Here are some tips for asking for a recommendation:
Reach out to a variety of people for a 360-degree view: supervisors, coworkers, direct reports, partners, and clients
Personalize the recommendation request with a custom message
Suggest strengths and capabilities they can highlight as part of your request
Be willing to write a recommendation in return
Read the recommendation carefully before you accept it into your profile
Sometimes the hardest part of getting a recommendation is creating the right request
message. Here's an example (template) that you can use to ask for a recommendation:
Hi <fill in name here>, How are you? I hope you are well. I’m preparing for a new job search
and would appreciate it if you could write a recommendation that highlights my <insert your
specific skill here>. Our experience working on <insert project here> is a great example and I
would be happy to provide other examples if you need them. Please let me know if I can write a
recommendation for you. I would be very glad to return the favor. Thanks in advance for your
support! <fill in your name here>
Ask a few connections to recommend you and highlight why you should be hired.
Recommendations help prospective employers get a better idea of who you are and the
quality of your work.
Summing it up
When you write thoughtful posts and respond to others genuinely, people in and even
outside your network will be open and ready to help you during your job search.
More about data integrity and compliance
This reading illustrates the importance of data integrity using an example of a global
company’s data. Definitions of terms that are relevant to data integrity will be provided at
the end.
Scenario: calendar dates for a global company
Calendar dates are represented in a lot of different short forms. Depending on where you
live, a different format might be used.
In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
In other countries, the national standard is YYYY-MM-DD, so October 12, 2020 becomes 2020-10-12.
In the United States, MM/DD/YY is the accepted format, so October 12, 2020 is going to be 10/12/20.
Now, think about what would happen if you were working as a data analyst for a global
company and didn’t check date formats. Well, your data integrity would probably be
questionable. Any analysis of the data would be inaccurate. Imagine ordering extra
inventory for December when it was actually needed in October!
A good analysis depends on the integrity of the data, and data integrity usually depends on
using a common format. So it is important to double-check how dates are formatted to
make sure what you think is December 10, 2020 isn’t really October 12, 2020, and vice
versa.
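As a small illustration, here is a rough Python sketch of standardizing dates from different regional formats into one ISO format. The source labels and sample values are hypothetical; the key point is that each source’s format has to be known and applied explicitly rather than guessed.

from datetime import datetime

# Hypothetical raw dates tagged with the regional format each source uses.
raw_dates = [
    ("12/10/20", "%d/%m/%y"),    # DD/MM/YY source: October 12, 2020
    ("10/12/20", "%m/%d/%y"),    # MM/DD/YY source: October 12, 2020
    ("2020-10-12", "%Y-%m-%d"),  # ISO source
]

# Parse each value with its source's format, then output one common format.
standardized = [datetime.strptime(value, fmt).date().isoformat() for value, fmt in raw_dates]
print(standardized)  # ['2020-10-12', '2020-10-12', '2020-10-12']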
Here are some other things to watch out for:
Data replication compromising data integrity: Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren’t verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates.
Data transfer compromising data integrity: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity.
Data manipulation compromising data integrity: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.
Conclusion
Fortunately, with a standard date format and compliance by all people and systems that
work with the data, data integrity can be maintained. But no matter where your data comes
from, always be sure to check that it is valid, complete, and clean before you begin any
analysis.
Reference: Data constraints and examples
As you progress in your data journey, you'll come across many types of data constraints (or
criteria that determine validity). The table below offers definitions and examples of data
constraint terms you might come across.
Data type: Values must be of a certain type: date, number, percentage, Boolean, etc. Example: If the data type is a date, a single number like 30 would fail the constraint and be invalid.
Data range: Values must fall between predefined maximum and minimum values. Example: If the data range is 10-20, a value of 30 would fail the constraint and be invalid.
Mandatory: Values can’t be left blank or empty. Example: If age is mandatory, that value must be filled in.
Unique: Values can’t have a duplicate. Example: Two people can’t have the same mobile phone number within the same service area.
Regular expression (regex) patterns: Values must match a prescribed pattern. Example: A phone number must match ###-###-#### (no other characters allowed).
Cross-field validation: Certain conditions for multiple fields must be satisfied. Example: Values are percentages and values from multiple fields must add up to 100%.
Primary-key: (Databases only) Value must be unique per column. Example: A database table can’t have two rows with the same primary key value. A primary key is an identifier that references a column in which each value is unique. More information about primary and foreign keys is provided later.
Set-membership: (Databases only) Values for a column must come from a set of discrete values. Example: Value for a column must be set to Yes, No, or Not Applicable.
Foreign-key: (Databases only) Values for a column must be unique values coming from a column in another table. Example: In a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate table.
Accuracy: The degree to which the data conforms to the actual entity being measured or described. Example: If values for zip codes are validated by street location, the accuracy of the data goes up.
Completeness: The degree to which the data contains all desired components or measures. Example: If data for personal profiles required hair and eye color and both are collected, the data is complete.
Consistency: The degree to which the data is repeatable from different points of entry or collection. Example: If a customer has the same address in the sales and repair databases, the data is consistent.
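To illustrate how a few of these constraints might be checked programmatically, here is a small Python sketch. The record, field names, accepted values, and rules are all hypothetical.

import re

# Hypothetical record and rules used only to illustrate the constraints above.
record = {"age": 34, "score_pct": 105, "phone": "555-867-5309", "state": "CA"}

errors = []

# Mandatory: age can't be blank or empty.
if record.get("age") in (None, ""):
    errors.append("age is mandatory")

# Data range: score_pct must fall between 0 and 100.
if not 0 <= record["score_pct"] <= 100:
    errors.append("score_pct outside the 0-100 range")

# Regular expression pattern: phone must match ###-###-####.
if not re.fullmatch(r"\d{3}-\d{3}-\d{4}", record["phone"]):
    errors.append("phone does not match ###-###-####")

# Set-membership: state must come from a set of discrete values.
if record["state"] not in {"CA", "NY", "TX"}:
    errors.append("state not in the accepted set")

print(errors)  # ['score_pct outside the 0-100 range']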
Hey there, it's good to remember to check for data integrity. It's also important to check
that the data you use aligns with the business objective. This adds another layer to the
maintenance of data integrity because the data you're using might have limitations that
you'll need to deal with. The process of matching data to business objectives can actually be
pretty straightforward. Here's a quick example. Let's say you're an analyst for a business
that produces and sells auto parts.
If you need to address a question about the revenue generated by the sale of a certain part,
then you'd pull up the revenue table from the data set.
If the question is about customer reviews, then you'd pull up the reviews table to analyze
the average ratings. But before digging into any analysis, you need to consider a few
limitations that might affect it. If the data hasn't been cleaned properly, then you won't be
able to use it yet. You would need to wait until a thorough cleaning has been done. Now,
let's say you're trying to find how much an average customer spends. You notice the same
customer's data showing up in more than one row. This is called duplicate data. To fix this,
you might need to change the format of the data, or you might need to change the way you
calculate the average. Otherwise, it will seem like the data is for two different people, and
you'll be stuck with misleading calculations. You might also realize there's not enough data
to complete an accurate analysis. Maybe you only have a couple of months' worth of sales
data. There's a slim chance you could wait for more data, but it's more likely that you'll have
to change your process or find alternate sources of data while still meeting your objective. I
like to think of a data set like a picture. Take this picture. What are we looking at?
Unless you're an expert traveler or know the area, it may be hard to pick out from just
these two images.
Visually, it's very clear when we aren't seeing the whole picture. When you get the
complete picture, you realize... you're in London!
With incomplete data, it's hard to see the whole picture to get a real sense of what is going
on. We sometimes trust data because if it comes to us in rows and columns, it seems like
everything we need is there if we just query it. But that's just not true. I remember a time
when I found out I didn't have enough data and had to find a solution.
I was working for an online retail company and was asked to figure out how to shorten
customer purchase to delivery time. Faster delivery times usually lead to happier
customers. When I checked the data set, I found very limited tracking information. We were
missing some pretty key details. So the data engineers and I created new processes to track
additional information, like the number of stops in a journey. Using this data, we reduced
the time it took from purchase to delivery and saw an improvement in customer
satisfaction. That felt pretty great! Learning how to deal with data issues while staying
focused on your objective will help set you up for success in your career as a data analyst.
And your path to success continues. Next step, you'll learn more about aligning data to
objectives. Keep it up!
Well-aligned objectives and data
You can gain powerful insights and make accurate conclusions when data is well-aligned to
business objectives. As a data analyst, alignment is something you will need to judge. Good
alignment means that the data is relevant and can help you solve a business problem or
determine a course of action to achieve a given business objective.
In this reading, you will review the business objectives associated with three scenarios. You
will explore how clean data and well-aligned business objectives can help you come up
with accurate conclusions. On top of that, you will learn how new variables discovered
during data analysis can cause you to set up data constraints so you can keep the data
aligned to a business objective.
Clean data + alignment to business objective = accurate
conclusions
Business objective
Account managers at Impress Me, an online content subscription service, want to know
how soon users view content after their subscriptions are activated.
To start off, the data analyst verifies that the data exported to spreadsheets is clean and
confirms that the data needed (when users access content) is available. Knowing this, the
analyst decides there is good alignment of the data to the business objective. All that is
missing is figuring out exactly how long it takes each user to view content after their
subscription has been activated.
Here are the data processing steps the analyst takes for a user from an account called V&L
Consulting. (These steps would be repeated for each subscribing account, and for each user
associated with that account.)
Step 1
Data-processing step: Look up the activation date for V&L Consulting
Source of data: Account spreadsheet
Result: October 21, 2019
Step 2
Data-processing step: Look up the name of a user belonging to the V&L Consulting account
Source of data: Account spreadsheet
Result: Maria Ballantyne
Step 3
Data-processing step: Find the first content access date for Maria B.
Source of data: Content usage spreadsheet
Result: October 31, 2019
Step 4
Data-processing step: Calculate the time between activation and first content usage for Maria B.
Source of data: New spreadsheet
Result: 10 days
Pro tip 1
In the above process, the analyst could use VLOOKUP to look up the data in Steps 1, 2, and
3 to populate the values in the spreadsheet in Step 4. VLOOKUP is a spreadsheet function
that searches for a certain value in a column to return a related piece of information. Using
VLOOKUP can save a lot of time; without it, you have to look up dates and names manually.
Refer to the VLOOKUP page in the Google Help Center for how to use the function in Google
Sheets.
Pro tip 2
In Step 4 of the above process, the analyst could use the DATEDIF function to automatically
calculate the difference between the dates in column C and column D. The function can
calculate the number of days between two dates.
Refer to the Microsoft Support DATEDIF page for how to use the function in Excel. The
DAYS360 function does the same thing in accounting spreadsheets that use a 360-day year
(twelve 30-day months).
Refer to the DATEDIF page in the Google Help Center for how to use the function in Google
Sheets.
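If you prefer scripting to spreadsheet functions, the same lookup-and-subtract logic could be sketched in Python with pandas. The two small tables below are hypothetical stand-ins for the account and content usage spreadsheets described above.

import pandas as pd

# Hypothetical stand-ins for the account and content usage spreadsheets.
accounts = pd.DataFrame({
    "account": ["V&L Consulting"],
    "user": ["Maria Ballantyne"],
    "activation_date": pd.to_datetime(["2019-10-21"]),
})
usage = pd.DataFrame({
    "user": ["Maria Ballantyne"],
    "first_access_date": pd.to_datetime(["2019-10-31"]),
})

# Equivalent of VLOOKUP: join the two tables on the user name.
merged = accounts.merge(usage, on="user")

# Equivalent of DATEDIF: days between activation and first content access.
merged["days_to_first_access"] = (
    merged["first_access_date"] - merged["activation_date"]
).dt.days

print(merged[["user", "days_to_first_access"]])  # Maria Ballantyne: 10 days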
Alignment to business objective + additional data cleaning =
accurate conclusions
Business objective
Cloud Gate, a software company, recently hosted a series of public webinars as free product
introductions. The data analyst and webinar program manager want to identify companies
that had five or more people attend these sessions. They want to give this list of companies
to sales managers who can follow up for potential sales.
The webinar attendance data includes the fields and data shown below.
Name: <First name> <Last name> (required information attendees had to provide)
Email Address: xxxxx@company.com (required information attendees had to provide)
Company: <Company name> (optional information attendees could provide)
Data cleaning
The webinar attendance data seems to align with the business objective. But the data
analyst and program manager decide that some data cleaning is needed before the analysis.
They think data cleaning is required because:
The company name wasn’t a mandatory field. If the company name is blank, it might be found from the email address. For example, if the email address is username@google.com, the company field could be filled in with Google for the data analysis. This data cleaning step assumes that people with company-assigned email addresses attended a webinar for business purposes.
Attendees could enter any name. Since attendance across a series of webinars is being looked at, they need to validate names against unique email addresses. For example, if Joe Cox attended two webinars but signed in as Joe Cox for one and Joseph Cox for the other, he would be counted as two different people. To prevent this, they need to check his unique email address to determine that he was the same person. After the validation, Joseph Cox could be changed to Joe Cox to match the other instance.
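A rough Python sketch of these two cleaning steps might look like the following. The attendee records are made up, and the rule for inferring a company from an email domain relies on the same assumption noted above about company-assigned addresses.

# Hypothetical webinar attendance records (Company was optional, names vary).
attendees = [
    {"name": "Joe Cox", "email": "joe.cox@acme.com", "company": ""},
    {"name": "Joseph Cox", "email": "joe.cox@acme.com", "company": "Acme"},
]

cleaned = {}
for row in attendees:
    email = row["email"].lower()
    # Fill a blank company from the email address domain (assumes a work address).
    company = row["company"] or email.split("@")[1].split(".")[0].title()
    # Use the unique email address as the key so one person isn't counted twice.
    if email not in cleaned:
        cleaned[email] = {"name": row["name"], "email": email, "company": company}

print(list(cleaned.values()))
# [{'name': 'Joe Cox', 'email': 'joe.cox@acme.com', 'company': 'Acme'}]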
Alignment to business objective + newly discovered variables
+ constraints = accurate conclusions
Business objective
An after-school tutoring company, A+ Education, wants to know if there is a minimum
number of tutoring hours needed before students have at least a 10% improvement in their
assessment scores.
The data analyst thinks there is good alignment between the data available and the
business objective because:
Students log in and out of a system for each tutoring session, and the number of hours is tracked
Assessment scores are regularly recorded
Data constraints for new variables
After looking at the data, the data analyst discovers that there are other variables to
consider. Some students had consistent weekly sessions while other students had
scheduled sessions more randomly even though their total number of tutoring hours was
the same. The data doesn’t align as well with the original business objective as first
thought, so the analyst adds a data constraint to focus only on the students with consistent
weekly sessions. This modification helps to get a more accurate picture about the
enrollment time needed to achieve a 10% improvement in assessment scores.
Key takeaways
Hopefully these examples give you a sense of what to look for to know if your data aligns
with your business objective.
When there is clean data and good alignment, you can get accurate insights and make conclusions the data supports.
If there is good alignment but the data needs to be cleaned, clean the data before you perform your analysis.
If the data only partially aligns with an objective, think about how you could modify the objective, or use data constraints to make sure that the subset of data better aligns with the business objective.
What to do when you find an issue with your
data
When you are getting ready for data analysis, you might realize you don’t have the data you
need or you don’t have enough of it. In some cases, you can use what is known as proxy
data in place of the real data. Think of it like substituting oil for butter in a recipe when you
don’t have butter. In other cases, there is no reasonable substitute and your only option is
to collect more data.
Consider the following data issues and suggestions on how to work around them.
Data issue 1: no data
Possible solution: Gather the data on a small scale to perform a preliminary analysis, and then request additional time to complete the analysis after you have collected more data. Example: If you are surveying employees about what they think of a new performance and bonus plan, use a sample for a preliminary survey. Then, ask for another 3 weeks to collect the data from all employees.
Possible solution: If there isn’t time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround. Example: If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic.
Data issue 2: too little data
Possible solution: Do the analysis using proxy data along with actual data. Example: If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors.
Possible solution: Adjust your analysis to align with the data you already have. Example: If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only.
Data issue 3: wrong data, including data with errors*
Possible solution: If you have the wrong data because requirements were misunderstood, communicate the requirements again. Example: If you need the data for female voters and received the data for male voters, restate your needs.
Possible solution: Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. Example: If your data is in a spreadsheet and there is a conditional statement or Boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.
Possible solution: If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias. Example: If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with the bad translation and go ahead with the analysis of the other data.
* Important note: sometimes data with errors can be a warning sign that the data isn’t
reliable. Use your best judgment.
Use the following decision tree as a reminder of how to deal with data errors or not
enough data:
Calculating sample size
Before you dig deeper into sample size, familiarize yourself with these terms and
definitions:
Population: The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.
Sample: A subset of your population. Just like a food sample, it is called a sample because it is only a taste. If your company is too large to survey every individual, you can survey a representative sample of your population.
Margin of error: Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.
Confidence level: How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. The confidence level is chosen before you start your study because it affects how big your margin of error is at the end of your study.
Confidence interval: The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
Statistical significance: The determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance.
Things to remember when determining the size of your
sample
When figuring out a sample size, here are things to keep in mind:
Don’t use a sample size less than 30. It has been statistically proven that 30 is the smallest sample size where an average result of a sample starts to represent the average result of a population.
The confidence level most commonly used is 95%, but 90% can work in some cases.
Increase the sample size to meet the specific needs of your project:
For a higher confidence level, use a larger sample size
To decrease the margin of error, use a larger sample size
For greater statistical significance, use a larger sample size
Note: Sample size calculators use statistical formulas to determine a sample size. More about these is coming up in the course! Stay tuned.
Why a minimum sample of 30?
This recommendation is based on the Central Limit Theorem (CLT) in the field of
probability and statistics. As sample size increases, the results more closely resemble the
normal (bell-shaped) distribution from a large number of samples. A sample of 30 is the
smallest sample size for which the CLT is still valid. Researchers who rely on regression
analysis – statistical methods to determine the relationships between controlled and
dependent variables – also prefer a minimum sample of 30.
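For the curious, one common formula behind such calculators is Cochran's formula with a finite population correction. The Python sketch below shows that approach; it covers only a few common confidence levels and is an approximation, not a replacement for the calculators referenced later.

import math

# z-scores for common confidence levels (standard normal distribution).
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def sample_size(population: int, confidence: float = 0.95,
                margin_of_error: float = 0.05, proportion: float = 0.5) -> int:
    """Estimate the minimum sample size using Cochran's formula,
    then apply a finite population correction."""
    z = Z_SCORES[confidence]
    # Cochran's formula for a very large population.
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(population=200_000, confidence=0.95, margin_of_error=0.05))
# 384, which matches what typical online sample size calculators report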
Still curious? Without getting too much into the math, check out these articles:
Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit Theorem and briefly describes how it can apply to an analysis of a stock index.
Sample Size Formula: This article by Statistics Solutions provides a little more detail about why some researchers use 30 as a minimum sample size.
Sample sizes vary by business problem
Sample size will vary based on the type of business problem you are trying to solve.
For example, if you live in a city with a population of 200,000 and get 180,000 people to
respond to a survey, that is a large sample size. But without actually doing that, what would
an acceptable, smaller sample size look like?
Would 200 be alright if the people surveyed represented every district in the city?
Answer: It depends on the stakes.
A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library.
A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library.
You could probably accept a larger margin of error surveying how residents feel about the
new library versus surveying residents about how they would vote to fund it. For that
reason, you would most likely use a larger sample size for the voter survey.
Larger sample sizes have a higher cost
You also have to weigh the cost against the benefits of more accurate results with a larger
sample size. Someone who is trying to understand consumer preferences for a new line of
products wouldn’t need as large a sample size as someone who is trying to understand the
effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger
sample size. But for consumer preferences, a smaller sample size at a lower cost could
provide good enough results.
Knowing the basics is helpful
Knowing the basics will help you make the right choices when it comes to sample size. You
can always raise concerns if you come across a sample size that is too small. A sample size
calculator is also a great tool for this. Sample size calculators let you enter a desired
confidence level and margin of error for a given population size. They then calculate the
sample size needed to statistically achieve those results.
Refer to the Determine the Best Sample Size video for a demonstration of a sample size
calculator, or refer to the Sample Size Calculator reading for additional information.
What to do when there is no data
Earlier, you learned how you can still do an analysis using proxy data if you have no data.
You might have some questions about proxy data, so this reading will give you a few more
examples of the types of datasets that can serve as alternate data sources.
Proxy data examples
Sometimes the data to support a business objective isn’t readily available. This is when
proxy data is useful. Take a look at the following scenarios and where proxy data comes in
for each example:
Business scenario: A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now.
How proxy data can be used: The analyst proxies the number of clicks to the car specifications on the dealership’s website as an estimate of potential sales at the dealership.
Business scenario: A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years.
How proxy data can be used: The analyst proxies the sales data for a similar product made out of tofu that has been on the market for several years.
Business scenario: The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren’t publicly available yet.
How proxy data can be used: The analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier.
Open (public) datasets
If you are part of a large organization, you might have access to lots of sources of data. But
if you are looking for something specific or a little outside your line of business, you can
also make use of open or public datasets. (You can refer to this Towards Data Science
article for a brief explanation of the difference between open and public data.)
Here's an example. A nasal version of a vaccine was recently made available. A clinic wants
to know what to expect for contraindications, but just started collecting first-party data
from its patients. A contraindication is a condition that may cause a patient not to take a
vaccine due to the harm it would cause them if taken. To estimate the number of possible
contraindications, a data analyst proxies an open dataset from a trial of the injection
version of the vaccine. The analyst selects a subset of the data with patient profiles most
closely matching the makeup of the patients at the clinic.
There are plenty of ways to share and collaborate on data within a community. Kaggle
(kaggle.com), which we previously introduced, has datasets in a variety of formats,
including the most basic type, Comma Separated Values (CSV) files.
CSV, JSON, SQLite, and BigQuery datasets
CSV: Check out this Credit card customers dataset, which has information from 10,000 customers including age, salary, marital status, credit card limit, credit card category, etc. (CC0: Public Domain, Sakshi Goyal).
JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain, Mitchell J).
SQLite: Check out this SQLite dataset for 24 years worth of U.S. wildfire data (CC0: Public Domain, Rachael Tatman).
BigQuery: Check out this Google Analytics 360 sample dataset from the Google Merchandise Store (CC0: Public Domain, Google BigQuery).
Refer to the Kaggle documentation for datasets for more information and search for and
explore datasets on your own at kaggle.com/datasets.
As with all other kinds of datasets, be on the lookout for duplicate data and ‘Null’ in open
datasets. Null most often means that a data field was unassigned (left empty), but
sometimes Null can be interpreted as the value, 0. It is important to understand how Null
was used before you start analyzing a dataset with Null data.
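As a quick illustration of checking for nulls and duplicates before analysis, here is a small Python sketch using pandas. The inline CSV is made up and simply stands in for an open dataset you might download.

import io
import pandas as pd

# A tiny made-up CSV standing in for an open dataset you might download.
csv_text = """customer_id,age,credit_limit
1,45,12000
2,,8000
1,45,12000
3,31,
"""

df = pd.read_csv(io.StringIO(csv_text))

# How many values were left unassigned (read in as null) in each column?
print(df.isna().sum())

# How many rows are exact duplicates?
print(df.duplicated().sum())  # 1 (customer_id 1 appears twice)

# Decide deliberately how to treat nulls; dropping rows and filling with 0 are
# very different choices, so check how Null was used before picking one.
df_dropped = df.dropna()
df_zero_filled = df.fillna(0)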
Sample size calculator
In this reading, you will learn the basics of sample size calculators, how to use them, and how
to understand the results. A sample size calculator tells you how many people you need to
interview (or things you need to test) to get results that represent the target population. Let’s
review some terms you will come across when using a sample size calculator:
Confidence level: The probability that your sample size accurately reflects the greater
population.
Margin of error: The maximum amount that the sample results are expected to differ from
those of the actual population.
Population: This is the total number you hope to pull your sample from.
Sample: A part of a population that is representative of the population.
Estimated response rate: If you are running a survey of individuals, this is the percentage of
people you expect will complete your survey out of those who received the survey.
How to use a sample size calculator
In order to use a sample size calculator, you need to have the population size, confidence level,
and the acceptable margin of error already decided so you can input them into the tool. If this
information is ready to go, check out these sample size calculators below:
Sample size calculator by surveymonkey.com
Sample size calculator by raosoft.com
What to do with the results
After you have plugged your information into one of these calculators, it will give you a
recommended sample size. Keep in mind, the calculated sample size is the minimum number
to achieve what you input for confidence level and margin of error. If you are working with a
survey, you will also need to think about the estimated response rate to figure out how many
surveys you will need to send out. For example, if you need a sample size of 100 individuals
and your estimated response rate is 10%, you will need to send your survey to 1,000
individuals to get the 100 responses you need for your analysis.
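As a quick sketch of that arithmetic, using the numbers from the example above:

import math

needed_responses = 100           # sample size from the calculator
estimated_response_rate = 0.10   # 10% of recipients are expected to respond

surveys_to_send = math.ceil(needed_responses / estimated_response_rate)
print(surveys_to_send)  # 1000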
Now that you have the basics, try some calculations using the sample size calculators and
refer back to this reading if you need a refresher on the definitions.
What is dirty data?
Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to the
problem you are trying to solve. This reading summarizes:
Types of dirty data you may encounter
What may have caused the data to become dirty
How dirty data is harmful to businesses
Types of dirty data
Duplicate data
Description: Any data record that shows up more than once
Possible causes: Manual data entry, batch data imports, or data migration
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval
Outdated data
Description: Any data that is old and should be replaced with newer and more accurate information
Possible causes: People changing roles or companies, or software and systems becoming obsolete
Potential harm to businesses: Inaccurate insights, decision-making, and analytics
Incomplete data
Description: Any data that is missing important fields
Possible causes: Improper data collection or incorrect data entry
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services
Incorrect/inaccurate data
Description: Any data that is complete but inaccurate
Possible causes: Human error introduced during data input, fake information, or mock data
Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss
Inconsistent data
Description: Any data that uses different formats to represent the same thing
Possible causes: Data stored incorrectly or errors inserted during data transfer
Potential harm to businesses: Contradictory data points leading to confusion, or inability to classify or segment customers
Business impact of dirty data
For further reading on the business impact of dirty data, enter the term “dirty data” into
your preferred browser’s search bar to bring up numerous articles on the topic. Here are a few
impacts cited for certain industries from a previous search:
Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
Marketing and sales: 8 out of 10 companies have said that dirty data hinders sales
campaigns (source).
Healthcare: Duplicate records can be 10% and even up to 20% of a hospital’s electronic health
records (source).
Common data-cleaning pitfalls
In this reading, you will learn the importance of data cleaning and how to identify common
mistakes. Some of the errors you might come across while cleaning your data could
include:
Common mistakes to avoid
Not checking for spelling errors: Misspellings can be as simple as typing or
input errors. Most of the time the wrong spelling or common grammatical
errors can be detected, but it gets harder with things like names or addresses.
For example, if you are working with a spreadsheet table of customer data, you
might come across a customer named “John” whose name has been input
incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably
won’t flag this, so if you don’t double-check for spelling errors and catch this,
your analysis will have mistakes in it.
Forgetting to document errors: Documenting your errors can be a big time
saver, as it helps you avoid those errors in the future by showing you how you
resolved them. For example, you might find an error in a formula in your
spreadsheet. You discover that some of the dates in one of your columns
haven’t been formatted correctly. If you make a note of this fix, you can
reference it the next time your formula is broken, and get a head start on
troubleshooting. Documenting your errors also helps you keep track of changes
in your work, so that you can backtrack if a fix didn’t work.
Not checking for misfielded values: A misfielded value happens when the
values are entered into the wrong field. These values might still be formatted
correctly, which makes them harder to catch if you aren’t careful. For example,
you might have a dataset with columns for cities and countries. These are the
same type of data, so they are easy to mix up. But if you were trying to find all of
the instances of Spain in the country column, and Spain had mistakenly been
entered into the city column, you would miss key data points. Making sure your
data has been entered correctly is key to accurate, complete analysis.
Overlooking missing values: Missing values in your dataset can create errors
and give you inaccurate conclusions. For example, if you were trying to get the
total number of sales from the last three months, but a week of transactions
were missing, your calculations would be inaccurate. As a best practice, try to
keep your data as clean as possible by maintaining completeness and
consistency. A few checks like these are sketched in code after this list.
Only looking at a subset of the data: It is important to think about all of the
relevant data when you are cleaning. This helps make sure you understand the
whole story the data is telling, and that you are paying attention to all possible
errors. For example, if you are working with data about bird migration patterns
from different sources, but you only clean one source, you might not realize
that some of the data is being repeated. This will cause problems in your
analysis later on. If you want to avoid common errors like duplicates, each field
of your data requires equal attention.
Losing track of business objectives: When you are cleaning data, you might
make new and interesting discoveries about your dataset-- but you don’t want
those discoveries to distract you from the task at hand. For example, if you
were working with weather data to find the average number of rainy days in
your city, you might notice some interesting patterns about snowfall, too. That
is really interesting, but it isn’t related to the question you are trying to answer
right now. Being curious is great! But try not to let it distract you from the task
at hand.
Not fixing the source of the error: Fixing the error itself is important. But if
that error is actually part of a bigger problem, you need to find the source of the
issue. Otherwise, you will have to keep fixing that same error over and over
again. For example, imagine you have a team spreadsheet that tracks
everyone’s progress. The table keeps breaking because different people are
entering different values. You can keep fixing all of these problems one by one,
or you can set up your table to streamline data entry so everyone is on the
same page. Addressing the source of the errors in your data will save you a lot
of time in the long run.
Not analyzing the system prior to data cleaning: If we want to clean our data
and avoid future errors, we need to understand the root cause of the dirty
data. Imagine you are an auto mechanic. You would find the cause of the
problem before you started fixing the car, right? The same goes for data. First,
you figure out where the errors come from. Maybe it is from a data entry error,
not setting up a spell check, lack of formats, or from duplicates. Then, once you
understand where bad data comes from, you can control it and keep your data
clean.
Not backing up your data prior to data cleaning: It is always good to be
proactive and create your data backup before you start your data clean-up. If
your program crashes, or if your changes cause a problem in your dataset, you
can always go back to the saved version and restore it. The simple procedure of
backing up your data can save you hours of work-- and most importantly, a
headache.
Not accounting for data cleaning in your deadlines/process: All good things
take time, and that includes data cleaning. It is important to keep that in mind
when going through your process and looking at your deadlines. When you set
aside time for data cleaning, it helps you get a more accurate estimate for ETAs
for stakeholders, and can help you know when to request an adjusted ETA.
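Several of the pitfalls above, such as misspelled near-duplicates, misfielded values, and missing values, can be surfaced with quick programmatic checks. Here is a minimal Python sketch using made-up customer data; the column names and values are hypothetical, and a real check would be tailored to your own dataset.

import pandas as pd

# Hypothetical customer data illustrating a few of the pitfalls above.
df = pd.DataFrame({
    "name":    ["John Smith", "Jon Smith", "Ana Lopez", None],
    "city":    ["Madrid", "Madrid", "Spain", "Lisbon"],      # "Spain" is misfielded
    "country": ["Spain", "Spain", None, "Portugal"],
})

# Overlooking missing values: count blanks per column before analyzing.
print(df.isna().sum())

# Not checking for misfielded values: country names should not appear in the city column.
known_countries = {"Spain", "Portugal"}
print(df[df["city"].isin(known_countries)])

# Not checking for spelling errors: near-duplicate names are worth reviewing by hand.
print(df["name"].dropna().sort_values().tolist())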
Activity overview
You’ve learned about cleaning data and its importance in meeting good data science standards. In
this activity, you’ll do some data cleaning with spreadsheets, then transpose the data.
By the time you complete this activity, you will be able to perform some basic cleaning methods in
spreadsheets. This will enable you to clean and transpose data, which is important for making data
more specific and accurate in your career as a data analyst.
What you will need
To get started, first access the data spreadsheet.
To use the spreadsheet for this course item, click the link below and select “Use Template.”
Link to data spreadsheet: Cleaning with spreadsheets
OR
If you don’t have a Google account, you can download the template directly from the attachment
below.
Data Spreadsheet for Cleaning with SpreadsheetsXLSX File
Download file
Select and remove blank cells
The first technique we’ll use is to select and eliminate rows containing blank cells by using filters. To
eliminate rows with blank cells:
1. Highlight all cells in the spreadsheet. You can highlight Columns A-H by clicking on the header of
Column A, holding Shift, and clicking on the header of Column H.
2. Click on the Data tab and pick the Create a filter option. In Microsoft Excel, this is called Filter.
Excel:
3. Every column now shows a green triangle in the first row next to the column title. Click the green
triangle in Column B to access a new menu.
4. On that new menu, click Filter by condition and open the dropdown menu to select Is empty.
Click OK.
In Excel, click the dropdown, then Filter... then make sure only (Blanks) is checked. Click OK.
Excel:
You can then review a list of all the rows with blank cells in that column.
5. Select all these cells and delete the rows except the row of column headers.
6. Return to the Filter by condition and return it to None. In Excel, click Clear Filter from
‘Column’.
 Note: You will now notice that any row that had an empty cell in Column A will be removed
(including the extra empty rows after the data).
7. Repeat this for Columns B-H.
All the rows that had blank cells are now removed from the spreadsheet.
Transpose the data
The second technique you will practice will help you convert the data from the current long format
(more rows than columns) to the wide format (more columns than rows). This action is called
transposing. To transpose your data:
1. Highlight and copy the data that you want to transpose including the column labels. You can do
this by highlighting Columns A-H. In Excel, highlight only the relevant cells (A1-H45) instead of the
headers.
2. Right-click on cell I1. This is where you want the transposed data to start.
3. Hover over Paste Special from the right-click menu. Select the Transposed option. In Excel,
select the Transpose icon under the paste options.
Excel:
You should now find the data transformed into the new wide format. At this point, you should remove
the original long data from the spreadsheet.
4. Delete the previous long data. The easiest way to do this is to click on Column A, so the entire
column is highlighted. Then, hold down the Shift key and click on Column H. You should find these
columns highlighted. Right-click on the highlighted area and select Delete Columns A - H.
Your screen should now appear like this:
Get rid of extra spaces in cells with string data
Now that you have transposed the data, eliminate the extra spaces in the values of the cells.
1. Highlight the data in the spreadsheet.
2. Click on the Data tab, then hover over Data cleanup and select Trim whitespace.
In Excel, you can use the TRIM function to get rid of white spaces. In any space beneath your data
(such as cell A10), type =TRIM(A1). Then, drag the bottom-right corner of the cell across and down
to fill the range and return the data without the white spaces.
Now all the extra spaces in the cells have been removed.
Change Text Lower/Uppercase/Proper Case
Next, you’ll process string data. The easiest way to clean up string data will depend on the
spreadsheet program you are using. If you are using Excel, you’ll use a simple formula. If you are
using Google Sheets, you can use an Add-On to do this with a few clicks. Follow the steps in the
relevant section below.
Microsoft Excel
If you are using Microsoft Excel, this documentation explains how to use a formula to change the
case of a text string. Follow these instructions to clean the string text and then move on to the
confirmation and reflection section of this activity.
Google Sheets
If you’re completing this exercise using Google Sheets, you’ll need to install an add-on that will give
you the functionality needed to easily clean string data and change cases.
Google Sheets Add-on Instructions:
1. Click on the Add-Ons option at the top of Google Sheets.
2. Click on Get add-ons.
3. Search for ChangeCase. It should appear like this:
4. Click on Install to install the add-on. It may ask you to login or verify the installation permissions.
Once you have installed the add-on successfully, you can access it by clicking on the Add-ons
menu again.
Now, you can change the case of text data that shows up. To change the text in Column C to all
uppercase:
1. Click on Column C. Be sure to deselect the column header, unless you want to change the case
of that as well (which you don't).
2. Click on the Add-Ons tab and select ChangeCase. Select the option All uppercase. Notice the
other options that you could have chosen if needed.
Delete all formatting
If you want to clear the formatting for any or all cells, you can find the command in the Format tab.
To clear formatting:
1. Select the data for which you want to delete the formatting. In this case, highlight all the data in
the spreadsheet by clicking and dragging over Rows 1-8.
2. Click the Format tab and select the Clear Formatting option.
In Excel, go to the Home tab, then hover over Clear and select Clear Formats.
You will notice that all the cells have had their formatting removed.
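If you later repeat this kind of cleanup outside a spreadsheet, the same steps can be sketched in Python with pandas. The small table below is a made-up stand-in for the activity's data, and the operations mirror the spreadsheet steps rather than reproduce them exactly.

import pandas as pd

# A small made-up table standing in for the activity's spreadsheet data.
df = pd.DataFrame({
    "name":  [" avery ", "Blake", None, "casey "],
    "state": ["ca", " ny", "tx", None],
})

# Select and remove rows containing blank cells (like filtering for "Is empty").
df = df.dropna()

# Get rid of extra spaces in cells with string data (like Trim whitespace).
df = df.apply(lambda col: col.str.strip())

# Change text case (like the ChangeCase add-on's "All uppercase").
df["state"] = df["state"].str.upper()

# Transpose the data from long format to wide format (like Paste special > Transposed).
wide = df.transpose()
print(wide)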
Workflow automation
In this reading, you will learn about workflow automation and how it can help you work
faster and more efficiently. Basically, workflow automation is the process of automating
parts of your work. That could mean creating an event trigger that sends a notification
when a system is updated. Or it could mean automating parts of the data cleaning process.
As you can probably imagine, automating different parts of your work can save you tons of
time, increase productivity, and give you more bandwidth to focus on other important
aspects of the job.
What can be automated?
Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some parts
of the job that can’t be automated. Let's take a look at some things we can automate and
some things that we can’t.
Task: Communicating with your team and stakeholders
Can it be automated? No
Why? Communication is key to understanding the needs of your team and stakeholders as you complete the tasks you are working on. There is no replacement for person-to-person communications.
Task: Presenting your findings
Can it be automated? No
Why? Presenting your data is a big part of your job as a data analyst. Making data accessible and understandable to stakeholders and creating data visualizations can’t be automated for the same reasons that communications can’t be automated.
Task: Preparing and cleaning data
Can it be automated? Partially
Why? Some tasks in data preparation and cleaning can be automated by setting up specific processes, like using a programming script to automatically detect missing values.
Task: Data exploration
Can it be automated? Partially
Why? Sometimes the best way to understand data is to see it. Fortunately, there are many tools available that can help automate the process of visualizing data. These tools can speed up the process of visualizing and understanding the data, but the exploration itself still needs to be done by a data analyst.
Task: Modeling the data
Can it be automated? Yes
Why? Data modeling is a difficult process that involves lots of different factors; fortunately, there are tools that can completely automate the different stages.
More about automating data cleaning
One of the most important ways you can streamline your data cleaning is to clean data
where it lives. This will benefit your whole team, and it also means you don’t have to repeat
the process over and over. For example, you could create a programming script that
counts the number of words in each spreadsheet file stored in a specific folder. Using
tools that work where your data is stored means you don't have to repeat your
cleaning steps, saving you and your team time and energy.
More resources
There are a lot of tools out there that can help automate your processes, and those tools are
improving all the time. Here are a few articles or blogs you can check out if you want to
learn more about workflow automation and the different tools out there for you to use:
Towards Data Science’s Automating Scientific Data Analysis
MIT News’ Automating Big-Data Analysis
TechnologyAdvice’s 10 of the Best Options for Workflow Automation Software
As a data analyst, automation can save you a lot of time and energy, and free you up to
focus more on other parts of your project. The more analysis you do, the more ways you
will find to make your processes simpler and more streamlined.



Learning Log: Develop your approach to
cleaning data
Overview
By this point, you have started working with real data. And you may have noticed that data
is often messy-- you can expect raw, primary data to be imperfect. In this learning log, you
will develop an approach to cleaning data by creating a cleaning checklist, considering your
preferred methods for data cleaning, and deciding on a data cleaning motto. By the time
you complete this entry, you will have a stronger understanding of how to approach the
data cleaning process methodically. This will help you save time cleaning data in the future
and ensure that your data is clean and usable.
Fill out the Data Cleaning Approach Table
The problem with data cleaning is that it usually requires a lot of time, energy, and
attention from a junior data analyst. One of the best ways to lessen the negative impacts of
data cleaning is to have a plan of action or a specific approach to cleaning the data.
In order to help you develop your own approach, you’ll use the instructions from this
learning log to fill out a Data Cleaning Approach Table in your learning log template. The
table will appear like this in the template:
Once you have completed your Data Cleaning Approach Table, you will spend some time
reflecting on the data cleaning process and your own approach.
Access your learning log
To use the learning log for this course item, click the link below and select “Use Template.”
Link to learning log template: Develop your approach to data cleaning
OR
If you don’t have a Google account, you can download the template directly from the
attachment below.
Learning Log Template: Develop your approach to cleaning data (DOCX file)
Download file
Step 1: Create your checklist
You can start developing your personal approach to cleaning data by creating a standard
checklist to use before your data cleaning process. Think of this checklist as your default
"what to search for" list.
With a good checklist, you can efficiently and, hopefully, swiftly identify all the problem
spots without getting sidetracked. You can also use the checklist to identify the scale and
scope of the dataset itself.
Some things you might include in your checklist:
Size of the data set
Number of categories or labels
Missing data
Unformatted data
The different data types
You can use your own experiences so far to help you decide what else you want to include
in your checklist!
Step 2: List your preferred cleaning methods
After you have compiled your personal checklist, you can create a list of activities you like
to perform when cleaning data. This list is a collection of procedures that you will
implement when you encounter specific issues present in the data related to your checklist
or every time you clean a new dataset.
For example, suppose you have a dataset with missing data. How would you handle it?
Moreover, if the dataset is very large, what would you do to check for missing data?
Outlining some of your preferred methods for cleaning data can help save you time and
energy.
Step 3: Choose a data cleaning motto
Now that you have a personal checklist and your preferred data cleaning methods, you can
create a data cleaning motto to help guide and explain your process. The motto is a short
one or two sentence summary of your philosophy towards cleaning data. For example, here
are a few data cleaning mottos from other data analysts:
1. "Not all data is the same, so don't treat it all the same."
2. "Be prepared for things to not go as planned. Have a backup plan.”
3. "Avoid applying complicated solutions to simple problems."
The data you encounter as an analyst won’t always conform to your checklist or activities
list regardless of how comprehensive they are. Data cleaning can be an involved and
complicated process, but surprisingly most data has similar problems. A solid personal
motto and explanation can make the more common data cleaning tasks easy to understand
and complete.
Reflection
Now that you have completed your Data Cleaning Approach Table, take a moment to reflect
on the decisions you made about your data cleaning approach. Write 1-2 sentences (20-40
words) answering each of the following questions:
What items did you add to your data cleaning checklist? Why did you decide these were important to check for?
How have your own experiences with data cleaning affected your preferred cleaning methods? Can you think of an example where you needed to perform one of these cleaning tasks?
How did you decide on your data cleaning motto?
Using SQL as a junior data analyst
In this reading, you will learn more about how to decide when to use SQL, or Structured
Query Language. As a data analyst, you will be tasked with handling a lot of data, and SQL is
one of the tools that can help make your work a lot easier. SQL is the primary way data
analysts extract data from databases. As a data analyst, you will work with databases all the
time, which is why SQL is such a key skill. Let’s follow along as a junior data analyst uses
SQL to solve a business task.
The business task and context
The junior data analyst in this example works for a social media company. A new business
model was implemented on February 15, 2020, and the company wants to understand how
its user growth compares to the previous year. Specifically, the data analyst was asked to
find out how many users have joined since February 15, 2020.
An image of a person holding a laptop containing different data and an image of a multicolored outline of 3 people
Spreadsheet functions and formulas or SQL queries?
Before they can address this question, this data analyst needs to choose what tool to use.
First, they have to think about where the data lives. If it is stored in a database, then SQL is
the best tool for the job. But if it is stored in a spreadsheet, then they will have to perform
their analysis in that spreadsheet. In that scenario, they could create a pivot table of the
data and then apply specific formulas and filters until they had the number of users who
joined after February 15. It isn't a complicated process, but it would involve a lot of steps.
In this case, the data is stored in a database, so they will have to work with SQL. And this
data analyst knows they could get the same results with a single SQL query:
Screenshot of a single SQL query
SELECT
COUNT(DISTINCT user_id) AS count_of_unique_users
FROM
table
WHERE
join_date >= '2020-02-15'
Spreadsheets and SQL both have their advantages and disadvantages:
Features of Spreadsheets:
Smaller data sets
Enter data manually
Create graphs and visualizations in the same program
Built-in spell check and other useful functions
Best when working solo on a project
Features of SQL Databases:
Larger datasets
Access tables across a database
Prepare data for further analysis in another software
Fast and powerful functionality
Great for collaborative work and tracking queries run by all users
When it comes down to it, where the data lives will decide which tool you use. If you are
working with data that is already in a spreadsheet, that is most likely where you will
perform your analysis. And if you are working with data stored in a database, SQL will be
the best tool for you to use for your analysis. You will learn more about SQL coming up, so
that you will be ready to tackle any business problem with the best tool possible.
Optional: Upload the store transactions
dataset to BigQuery
In the next video, the instructor uses a specific dataset. The instructions in this reading are
provided for you to upload the same dataset in your BigQuery console so you can follow
along.
You must have a BigQuery account to follow along. If you have hopped around courses,
Using BigQuery in the Prepare Data for Exploration course covers how to set up a
BigQuery account.
Prepare for the next video
First, download the CSV file from the attachment below.
Lauren's Furniture Store Transaction Table (CSV file)
Download file
Next, complete the steps below in your BigQuery console to upload the Store Transaction dataset.
Note: These steps will be different from what you performed before. In previous instances,
you selected the Auto detect check box to allow BigQuery to auto-detect the schema. This
time, you will choose to create the schema by editing it as text. This method can be used
when BigQuery doesn't automatically set the desired type for a particular field. In this case,
you will specify STRING instead of FLOAT as the type for the purchase_price field.
Step 1: Open your BigQuery console and click on the project you want to upload the data
to. If you already created a customer_data dataset for your project, jump to step 5;
otherwise, continue with step 2.
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your
project name and select Create dataset.
Step 3: Enter customer_data for the Dataset ID.
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.
Step 5: In the Explorer, click to expand your project, and then click the customer_data
dataset.
Step 6: Click the Actions icon (three vertical dots) next to customer_data and select Open.
Step 7: Click the blue + icon at the top right to open the Create table window.
Step 8: Under Source, for the Create table from selection, choose where the data will be
coming from.
Select Upload.
Click Browse to select the Store Transaction Table CSV file you downloaded.
Choose CSV from the file format drop-down.
Step 9: For Table name, enter customer_purchase if you plan to follow along with the
video.
Step 10: For Schema, click the toggle switch for Edit as text. This opens up a box for the
text.
Step 11: Copy and paste the following text into the box. Be sure to include the opening and
closing brackets. They are required.
[ { "description": "date", "mode": "NULLABLE", "name": "date", "type": "DATETIME" }, {
"description": "transaction id", "mode": "NULLABLE", "name": "transaction_id", "type":
"INTEGER" }, { "description": "customer id", "mode": "NULLABLE", "name": "customer_id",
"type": "INTEGER" }, { "description": "product name", "mode": "NULLABLE", "name":
"product", "type": "STRING" }, { "description": "product_code", "mode": "NULLABLE",
"name": "product_code", "type": "STRING" }, { "description": "product color", "mode":
"NULLABLE", "name": "product_color", "type": "STRING" }, { "description": "product price",
"mode": "NULLABLE", "name": "product_price", "type": "FLOAT" }, { "description":
"quantity purchased", "mode": "NULLABLE", "name": "purchase_size", "type": "INTEGER" },
{ "description": "purchase price", "mode": "NULLABLE", "name": "purchase_price", "type":
"STRING" }, { "description": "revenue", "mode": "NULLABLE", "name": "revenue", "type":
"FLOAT" } ]
Step 12: Scroll down and expand the Advanced options section.
Step 13: For the Header rows to skip field, enter 1.
Step 14: Click Create table (blue button). You will now see the customer_purchase table
under your customer_data dataset in your project.
Step 15: Click the customer_purchase table and in the Schema tab, confirm that the
schema matches the schema shown below.
Step 16: Click the Preview tab and confirm that your data matches the data shown below.
Congratulations, you are now ready to follow along with the video!
How to get to the BigQuery console
In your browser, go to console.cloud.google.com/bigquery.
Note: Going to console.cloud.google.com in your browser takes you to the main dashboard
for the Google Cloud Platform. To navigate to BigQuery from the dashboard, do the
following:
Click the Navigation menu icon (Hamburger icon) in the banner.
Scroll down to the BIG DATA section.
Click BigQuery and select SQL workspace.
Watch the How to use BigQuery video for an introduction to each part of the BigQuery SQL
workspace.
(Optional) Explore a BigQuery public dataset
You will be exploring a public dataset in an upcoming activity, so you can perform these
steps later if you prefer.
Refer to these step-by-step instructions.
(Optional) Upload a CSV file to BigQuery
These steps are provided so you can work with a dataset on your own at this time. You will
upload CSV files to BigQuery later in the program.
Refer to these step-by-step instructions.
Getting started with other databases (if not using BigQuery)
It is easier to follow along with the course activities if you use BigQuery, but if you are
connecting to and practicing SQL queries on other database platforms instead of BigQuery,
here are similar getting started resources:
Getting started with MySQL: This is a guide to setting up and using MySQL.
Getting started with Microsoft SQL Server: This is a tutorial to get started using SQL Server.
Getting started with PostgreSQL: This is a tutorial to get started using PostgreSQL.
Getting started with SQLite: This is a quick start guide for using SQLite.
It's so great to have you back.
Now that we know some basic SQL queries and spent some time working in a database,
let's apply that knowledge to something else we've been talking about:
preparing and cleaning data.
You already know that cleaning and
completing your data before you analyze it is an important step.
So in this video, I'll show you some ways SQL can help you do just that,
including how to remove duplicates,
as well as four functions to help you clean string variables.
Earlier, we covered how to remove duplicates in spreadsheets using
the Remove duplicates tool.
In SQL,
we can do the same thing by including DISTINCT in our SELECT statement.
For example, let's say the company we work for
has a special promotion for customers in Ohio.
We want to get the customer IDs of customers who live in Ohio.
But some customer information has been entered multiple times.
We can get these customer IDs by
writing SELECT customer_id FROM
customer_data.customer_address.
This query will give us duplicates if they exist in the table.
If customer ID 9080 shows up three times in our table,
our results will have three of that customer ID.
But we don't want that. We want a list of all unique customer IDs.
To do that, we add DISTINCT to our SELECT statement by writing,
SELECT DISTINCT customer_id FROM customer_data.customer_address.
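Written out as a full query, that looks like this:
SELECT
DISTINCT customer_id
FROM
customer_data.customer_address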
Sometimes your string variables will also have inconsistencies, like extra characters or misspellings. In those cases, you'll need to clean them before you can analyze them.
So here are some functions you can use in SQL to handle string variables.
You might recognize some of these functions from when we talked about
spreadsheets.
Now it's time to see them work in a new way.
Pull up the data set we shared right before this video.
And you can follow along step-by-step with me during the rest of this video.
The first function I want to show you is LENGTH, which we've encountered before.
If we already know the length our string variables are supposed to be,
we can use LENGTH to double-check that our string variables are consistent.
For some databases, this function is written as LEN, but it does the same thing.
Let's say we're working with the customer_address table from our
earlier example.
We can make sure that all country codes have the same length by using
LENGTH on each of these strings.
So to write our SQL query, let's first start with SELECT and FROM.
We know our data comes from the customer_address table
within the customer_data data set.
So we add customer_data.customer_address after the FROM clause.
Then under SELECT, we'll write LENGTH, and
then the column we want to check, country.
To remind ourselves what this is,
we can label this column in our results as letters_in_country.
So we add AS letters_in_country,
after LENGTH(country).
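Putting those pieces together, the query described here looks like this:
SELECT
LENGTH(country) AS letters_in_country
FROM
customer_data.customer_address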
The result we get is a list of the number of letters in each country listed for
each of our customers.
It seems like almost all of them are 2s,
which means the country field contains only two letters.
But we notice one that has 3. That's not good.
We want our data to be consistent.
So let's check out which countries were incorrectly listed in our table.
We can do that by putting the LENGTH(country) function that
we created into the WHERE clause.
That's because we're telling SQL to filter the data to show only
customers whose country contains more than two letters.
So now we'll write SELECT country
FROM customer_data.customer_address
WHERE LENGTH(country) greater than 2.
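Written out, that query looks like this:
SELECT
country
FROM
customer_data.customer_address
WHERE
LENGTH(country) > 2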
When we run this query, we now get the two countries where the number of letters is
greater than the 2 we expect to find.
The incorrectly listed countries show up as USA instead of US.
If we created this table, then we could update our table so
that this entry shows up as US instead of USA.
But in this case, we didn't create this table, so we shouldn't update it.
We still need to fix this problem so we can pull a list of all the customers
in the US, including the two that have USA instead of US.
The good news is that we can account for
this error in our results by using the substring function in our SQL query.
To write our SQL query, let's start by writing
the basic structure, SELECT, FROM, WHERE.
We know our data is coming from the customer_address table
from the customer_data data set.
So we type in customer_data.customer_address,
after FROM.
Next, we tell SQL what data we want it to give us.
We want all the customers in the US by their IDs.
So we type in customer_id after SELECT.
Finally, we want SQL to filter out only American customers.
So we use the substring function after the WHERE clause.
We're going to use the substring function to pull the first two letters of each
country so that all of them are consistent and only contain two letters.
To use the substring function,
we first need to tell SQL the column where we found this error, country.
Then we specify which letter to start with.
We want SQL to pull the first two letters, so
we're starting with the first letter, so we type in 1.
Then we need to tell SQL how many letters, including this first letter, to pull.
Since we want the first two letters,
we need SQL to pull two total letters, so we type in 2.
This will give us the first two letters of each country.
We want US only, so we'll set this function equal to 'US'.
When we run this query, we get a list of all customer IDs of customers
whose country is the US, including the customers that had USA instead of US.
Going through our results, it seems like we have a couple duplicates where
the customer ID is shown multiple times.
Remember how we get rid of duplicates?
We add DISTINCT before customer_id.
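Putting it all together, the query might look like this (a sketch assuming the substring function is written as SUBSTR, as it is in BigQuery):
SELECT
DISTINCT customer_id
FROM
customer_data.customer_address
WHERE
SUBSTR(country, 1, 2) = 'US'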
So now when we run this query,
we have our final list of customer IDs of the customers who live in the US.
Finally, let's check out the TRIM function, which you've come across before.
This is really useful if you find entries with extra spaces and
need to eliminate those extra spaces for consistency.
For example, let's check out the state column in our customer_address table.
Just like we did for the country column,
we want to make sure the state column has a consistent number of letters.
So let's use the LENGTH function again to learn if we have any state that has more
than two letters, which is what we would expect to find in our data table.
We start writing our SQL query by typing the basic
SQL structure of SELECT, FROM, WHERE.
We're working with the customer_address table in the customer_data data set.
So we type in customer_data.customer_address
after FROM.
Next, we tell SQL what we want it to pull.
We want it to give us any state that has more than two letters,
so we type in state, after SELECT.
Finally, we want SQL to filter for states that have more than two letters.
This condition is written in the WHERE clause.
So we type in LENGTH(state), and
that it must be greater than 2 because we want
the states that have more than two letters.
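That query looks like this:
SELECT
state
FROM
customer_data.customer_address
WHERE
LENGTH(state) > 2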
We want to figure out what the incorrectly listed states look like, if we have any.
When we run this query, we get one result.
We have one state that has more than two letters.
But hold on, how can this state that seems like it has two letters,
O and H for Ohio, have more than two letters?
We know that there are more than two characters because we used
the LENGTH(state) > 2 statement in the WHERE clause when filtering out results.
So that means the extra characters that SQL is counting must then be a space.
There must be a space after the H.
This is where we would use the TRIM function.
The TRIM function removes leading and trailing spaces.
So let's write a SQL query that accounts for this error.
Let's say we want a list of all customer IDs of the customers who live in "OH" for Ohio.
We start with the basic SQL structure: FROM, SELECT, WHERE.
We know the data comes from the customer_address
table in the customer_data data set, so
we type in customer_data.customer_address after FROM.
Next, we tell SQL what data we want.
We want SQL to give us the customer IDs of customers who live in Ohio,
so we type in customer_id after SELECT.
Since we know we have some duplicate customer entries,
we'll go ahead and type in DISTINCT before customer_id
to remove any duplicate customer IDs from appearing in our results.
Finally, we want SQL to give us the customer IDs of
the customers who live in Ohio.
We're asking SQL to filter the data, so this belongs in the WHERE clause.
Here's where we'll use the TRIM function.
To use the TRIM function, we tell SQL the column we want to
remove spaces from, which is state in our case.
And we want only Ohio customers, so we type in = 'OH'.
That's it. We have all customer IDs of the customers who live in Ohio,
including that customer with the extra space after the H.
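Here's a sketch of that full query, using the same dataset and table names:
SELECT
DISTINCT customer_id
FROM
customer_data.customer_address
WHERE
TRIM(state) = 'OH'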
Making sure that your string variables are complete and consistent will save
you a lot of time later by avoiding errors or miscalculations.
That's why we clean data in the first place.
Hopefully functions like length, substring, and trim will give you
the tools you need to start working with string variables in your own data sets.
Next up, we'll check out some other ways you can work with strings
and more advanced cleaning functions.
Then you'll be ready to start working in SQL on your own.
See you soon.
In this video,
we'll discuss how to begin the process of verifying your data-cleaning efforts.
Verification is a critical part of any analysis project.
Without it you have no way of knowing that your insights can be relied on for
data-driven decision-making.
Think of verification as a stamp of approval.
To refresh your memory, verification is a process to confirm that a data-cleaning
effort was well-executed and the resulting data is accurate and reliable.
It also involves manually cleaning data to compare your expectations with what's
actually present.
The first step in the verification process is going back to your
original unclean data set and comparing it to what you have now.
Review the dirty data and try to identify any common problems.
For example, maybe you had a lot of nulls.
In that case, you check your clean data to ensure no nulls are present.
To do that, you could search through the data manually or
use tools like conditional formatting or filters.
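If your data lives in a database instead of a spreadsheet, you could run a similar check with SQL. Here's a minimal sketch, assuming you want to count remaining NULL values in the country column of the customer_address table from the earlier examples:
SELECT
COUNT(*) AS remaining_nulls
FROM
customer_data.customer_address
WHERE
country IS NULL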
Or maybe there was a common misspelling like someone keying in the name of
a product incorrectly over and over again.
In that case, you'd run a FIND in your clean data to make sure no instances of
the misspelled word occur.
Another key part of verification involves taking a big-picture view of your project.
This is an opportunity to confirm you're actually focusing on the business problem
that you need to solve and the overall project goals
and to make sure that your data is actually capable of solving that problem
and achieving those goals.
It's important to take the time to reset and focus on the big picture
because projects can sometimes evolve or
transform over time without us even realizing it.
Maybe an e-commerce company decides to survey 1000 customers to
get information that would be used to improve a product.
But as responses begin coming in, the analysts notice a lot of comments about
how unhappy customers are with the e-commerce website platform altogether.
So the analysts start to focus on that.
While the customer buying experience is of course important for
any e-commerce business, it wasn't the original objective of the project.
The analysts in this case need to take a moment to pause, refocus,
and get back to solving the original problem.
Taking a big picture view of your project involves doing three things.
First, consider the business problem you're trying to solve with the data.
If you've lost sight of the problem,
you have no way of knowing what data belongs in your analysis.
Taking a problem-first approach to analytics is essential at all stages of
any project.
You need to be certain that your data will actually make it possible to solve your
business problem.
Second, you need to consider the goal of the project.
It's not enough just to know that your company wants to analyze customer feedback
about a product.
What you really need to know is that the goal of getting this feedback is to make
improvements to that product.
On top of that, you also need to know whether the data you've collected and
cleaned will actually help your company achieve that goal.
And third, you need to consider whether your data is capable of solving
the problem and meeting the project objectives.
That means thinking about where the data came from and
testing your data collection and cleaning processes.
Sometimes data analysts can be too familiar with their own data,
which makes it easier to miss something or make assumptions.
Asking a teammate to review your data from a fresh perspective and
getting feedback from others is very valuable in this stage.
This is also the time to notice if anything sticks out to
you as suspicious or potentially problematic in your data.
Again, step back, take a big picture view, and
ask yourself, do the numbers make sense?
Let's go back to our e-commerce company example.
Imagine an analyst is reviewing the cleaned up data
from the customer satisfaction survey.
The survey was originally sent to 1,000 customers, but what if
the analyst discovers that there are more than a thousand responses in the data?
This could mean that one customer figured out a way to take the survey more
than once.
Or it could also mean that something went wrong in the data cleaning process, and
a field was duplicated.
Either way, this is a signal that it's time to go back to the data-cleaning
process and correct the problem.
Verifying your data ensures that the insights you gain from analysis can be
trusted.
It's an essential part of data-cleaning that helps companies avoid big mistakes.
This is another place where data analysts can save the day.
Coming up,
we'll go through the next steps in the data-cleaning process. See you there.
Hey there. In this video,
we'll continue building on the verification process.
As a quick reminder,
the goal is to ensure that
our data-cleaning work was done
properly and the results can be counted on.
You want your data to be verified so you know
it's 100 percent ready to go.
It's like car companies running tons of tests to
make sure a car is safe before it hits the road.
You learned that the first step in
verification is returning to
your original, unclean dataset
and comparing it to what you have now.
This is an opportunity to search for common problems.
After that, you clean up
the problems manually. For example,
by eliminating extra spaces
or removing an unwanted quotation mark.
But there's also some great tools for
fixing common errors automatically,
such as TRIM and remove duplicates.
Earlier, you learned that TRIM
is a function that removes leading,
trailing, and repeated spaces in data.
Remove duplicates is a tool that automatically searches
for and eliminates duplicate entries from a spreadsheet.
Now, sometimes you'll have an error that
shows up repeatedly, and it can't be
resolved with a quick manual edit or
a tool that fixes the problem automatically.
In these cases, it's helpful to create a pivot table.
A pivot table is
a data summarization tool
that is used in data processing.
Pivot tables sort, reorganize, group,
count, total or average data stored in a database.
We'll practice that now using
the spreadsheet from a party supply store.
Let's say this company was
interested in learning which of
its four suppliers is most cost-effective.
An analyst pulled this data on
the products the business sells,
how many were purchased,
which supplier provides them,
the cost of the products, and the ultimate revenue.
The data has been cleaned.
But during verification, we noticed that one of
the suppliers' names was keyed in incorrectly.
We could just correct the word as "plus,"
but this might not solve the problem
because we don't know if this was
a one-time occurrence or if
the problem's repeated throughout the spreadsheet.
There are two ways to answer that question.
The first is using Find and replace.
Find and replace is a tool that looks for
a specified search term in
a spreadsheet and allows
you to replace it with something else.
We'll choose Edit. Then Find and replace.
We're trying to find P-L-O-S,
the misspelling of "plus" in the supplier's name.
In some cases you might not want to replace the data.
You just want to find something. No problem.
Just type the search term,
leave the rest of the options as
default and click "Done."
But right now we do want to replace it with
P-L-U-S. We'll type that in here.
Then click "Replace all" and "Done."
There we go. Our misspelling has been corrected.
That was of course the goal.
But for now let's undo our Find and
replace so we can
practice another way to determine if
errors are repeated throughout a dataset,
like with the pivot table.
We'll begin by selecting the data we want to use.
Choose column C. Select "Data." Then "Pivot Table."
Choose "New Sheet" and "Create."
We know this company has four suppliers.
If we count the suppliers and
the number doesn't equal four,
we know there's a problem.
First, add a row for suppliers.
Next, we'll add a value for
our suppliers and summarize by COUNTA.
COUNTA counts the total number of values within a specified range.
Here we're counting the number of times
a supplier's name appears in
column C. Note that there's also a function called COUNT,
which only counts the numerical values
within a specified range.
If we used it here,
the result would be zero,
which isn't what we have in mind.
But in other spreadsheet applications,
COUNT would give us the information
we want for our current example.
As you continue learning more
about formulas and functions,
you'll discover more interesting options.
If you want to keep learning,
search online for spreadsheet formulas and functions.
There's a lot of great information out there.
Our pivot table has counted the number of misspellings,
and it clearly shows that the error occurs just once.
Otherwise our four suppliers
are accurately accounted for in our data.
Now we can correct the spelling and
verify that the rest of the supplier data is clean.
This is also useful practice when querying a database.
If you're working in SQL,
you can address misspellings using a CASE statement.
The CASE statement goes through
one or more conditions and
returns a value as soon as a condition is met.
Let's discuss how this works in real life
using our customer_name table.
Check out how our customer,
Tony Magnolia, shows up as Tony and Tnoy.
Tony's name was misspelled.
Let's say we want a list of our customer IDs and
the customer's first names so we can write
personalized notes thanking
each customer for their purchase.
We don't want Tony's note to be
addressed incorrectly to "Tnoy."
Here's where we can use the CASE statement.
We'll start our query with the basic SQL structure.
SELECT, FROM, and WHERE.
We know that data comes from
the customer_name table
in the customer_data dataset,
so we can add customer underscore data
dot customer underscore name after FROM.
Next, we tell SQL what data to pull in the SELECT clause.
We want customer_id and first_name.
We can go ahead and add customer
underscore ID after SELECT.
But for our customer's first names,
we know that Tony was misspelled,
so we'll correct that using CASE. We'll
add CASE and then
WHEN and type first underscore name equal "Tnoy."
Next we'll use the THEN command and type "Tony,"
followed by the ELSE command.
Here we will type first underscore name,
followed by End As
and then we'll type cleaned underscore name.
Finally, we're not filtering our data,
so we can eliminate the WHERE clause.
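Put together, the query described here looks something like this:
SELECT
customer_id,
CASE
WHEN first_name = 'Tnoy' THEN 'Tony'
ELSE first_name
END AS cleaned_name
FROM
customer_data.customer_name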
As I mentioned, a CASE statement
can cover multiple cases.
If we wanted to search for a few more misspelled names,
our statement would look similar to the original,
with some additional names like this.
There you go. Now that you've learned how you can use
spreadsheets and SQL to fix errors automatically,
we'll explore how to keep track of our changes next.
Data-cleaning verification: A checklist
This reading will give you a checklist of common problems you can refer to when doing
your data cleaning verification, no matter what tool you are using. When it comes to
verification, there is no one-size-fits-all approach or a single checklist that can be
universally applied to all projects. Each project has its own organization and data
requirements that lead to a unique list of things to run through for verification.
Keep in mind, as you receive more data or a better understanding of the project goal(s),
you might want to revisit some or all of these steps.
Correct the most common problems
Make sure you identified the most common problems and corrected them, including:
Sources of errors: Did you use the right tools and functions to find the source
of the errors in your dataset?
Null data: Did you search for NULLs using conditional formatting and filters?
Misspelled words: Did you locate all misspellings?
Mistyped numbers: Did you double-check that your numeric data has been
entered correctly?
Extra spaces and characters: Did you remove any extra spaces or characters
using the TRIM function?
Duplicates: Did you remove duplicates in spreadsheets using the Remove
Duplicates function or DISTINCT in SQL?
Mismatched data types: Did you check that numeric, date, and string data are
typecast correctly?
Messy (inconsistent) strings: Did you make sure that all of your strings are
consistent and meaningful?
Messy (inconsistent) date formats: Did you format the dates consistently
throughout your dataset?
Misleading variable labels (columns): Did you name your columns
meaningfully?
Truncated data: Did you check for truncated or missing data that needs correction?
Business logic: Did you check that the data makes sense given your knowledge of the business?
Review the goal of your project
Once you have finished these data cleaning tasks, it is a good idea to review the goal of your
project and confirm that your data is still aligned with that goal. This is a continuous
process that you will do throughout your project-- but here are three steps you can keep in
mind while thinking about this:
Confirm the business problem
Confirm the goal of the project
Verify that data can solve the problem and is aligned to the goal
Embrace changelogs
What do engineers, writers, and data analysts have in common? Change.
Engineers use engineering change orders (ECOs) to keep track of new product design
details and proposed changes to existing products. Writers use document revision
histories to keep track of changes to document flow and edits. And data analysts use
changelogs to keep track of data transformation and cleaning. Here are some examples of
these:
Automated version control takes you most of the way
Most software applications have a kind of history tracking built in. For example, in Google
sheets, you can check the version history of an entire sheet or an individual cell and go back
to an earlier version. In Microsoft Excel, you can use a feature called Track Changes. And
in BigQuery, you can view the history to check what has changed.
Here’s how it works:
Google Sheets
1. Right-click the cell and select Show edit history.
2. Click the left arrow < or right arrow > to move back and forward in the history as needed.
Microsoft Excel
1. If Track Changes has been enabled for the spreadsheet, click Review.
2. Under Track Changes, click the Accept/Reject Changes option to accept or reject any change made.
BigQuery
Bring up a previous version (without reverting to it) and figure out what changed by comparing it to the current version.
Changelogs take you down the last mile
A changelog can build on your automated version history by giving you an even more
detailed record of your work. This is where data analysts record all the changes they make
to the data. Here is another way of looking at it. Version histories record what was done in
a data change for a project, but don't tell us why. Changelogs are super useful for helping us
understand the reasons changes have been made. Changelogs have no set format and you
can even make your entries in a blank document. But if you are using a shared changelog, it
is best to agree with other data analysts on the format of all your log entries.
Typically, a changelog records this type of information:
Data, file, formula, query, or any other component that changed
Description of what changed
Date of the change
Person who made the change
Person who approved the change
Version number
Reason for the change
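For example, a single changelog entry might look something like this (a hypothetical entry; the component, dates, and details are made up for illustration):
Changed: Revenue formula in the quarterly sales spreadsheet
Description: Updated the formula to exclude refunded orders
Date: 2021-08-12
Changed by: Junior data analyst
Approved by: Senior data analyst
Version: 1.3
Reason: Refunds were being double-counted in total revenue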
Let’s say you made a change to a formula in a spreadsheet because you observed it in
another report and you wanted your data to match and be consistent. If you found out later
that the report was actually using the wrong formula, an automated version history would
help you undo the change. But if you also recorded the reason for the change in a
changelog, you could go back to the creators of the report and let them know about the
incorrect formula. If the change happened a while ago, you might not remember who to
follow up with. Fortunately, your changelog would have that information ready for you! By
following up, you would ensure data integrity outside your project. You would also be
showing personal integrity as someone who can be trusted with data. That is the power of a
changelog!
Finally, a changelog is important for when lots of changes to a spreadsheet or query have
been made. Imagine an analyst made four changes and the change they want to revert is
change #2. Instead of clicking the undo feature three times to undo change #2 (and losing
changes #3 and #4), the analyst can undo just change #2 and keep all the other changes.
Now, our example was for just 4 changes, but try to think about how important that
changelog would be if there were hundreds of changes to keep track of.
What also happens IRL (in real life)
A junior analyst probably only needs to know the above with one exception. If an analyst is
making changes to an existing SQL query that is shared across the company, the company
most likely uses what is called a version control system. An example might be a query
that pulls daily revenue to build a dashboard for senior management.
Here is how a version control system affects a change to a query:
1. A company has official versions of important queries in their version control
system.
2. An analyst makes sure the most up-to-date version of the query is the one they
will change. This is called syncing.
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a code
review and can be informally or formally done. An informal review could be as
simple as asking a senior analyst to take a look at the change.
5. After a reviewer approves the change, the analyst submits the updated version
of the query to a repository in the company's version control system. This is
called a code commit. A best practice is to document exactly what the change
was and why it was made in a comments area. Going back to our example of a
query that pulls daily revenue, a comment might be: Updated revenue to
include revenue coming from the new product, Calypso.
6. After the change is submitted, everyone else in the company will be able to
access and use this new query when they sync to the most up-to-date queries
stored in the version control system.
7. If the query has a problem or business needs change, the analyst can undo the
change to the query using the version control system. The analyst can look at a
chronological list of all changes made to the query and who made each change.
Then, after finding their own change, the analyst can revert to the previous
version.
8. The query is back to what it was before the analyst made the change. And
everyone at the company sees this reverted, original query, too.
Advanced functions for speedy data cleaning
In this reading, you will learn about some advanced functions that can help you speed up
the data cleaning process in spreadsheets. Below is a table summarizing three functions
and what they do:
IMPORTRANGE
Syntax: =IMPORTRANGE(spreadsheet_url, range_string)
Menu option: Paste Link (copy the data first)
Primary use: Imports (pastes) data from one sheet to another and keeps it automatically updated.

QUERY
Syntax: =QUERY(Sheet and Range, "Select *")
Menu option: Data > From Other Sources > From Microsoft Query
Primary use: Enables pseudo-SQL (SQL-like) statements or a wizard to import the data.

FILTER
Syntax: =FILTER(range, condition1, [condition2, ...])
Menu option: Filter (conditions per column)
Primary use: Displays only the data that meets the specified conditions.
Keeping data clean and in sync with a source
The IMPORTRANGE function in Google Sheets and the Paste Link feature (a Paste Special
option in Microsoft Excel) both allow you to insert data from one sheet to another. Using
these on a large amount of data is more efficient than manual copying and pasting. They
also reduce the chance of errors being introduced by copying and pasting the wrong data.
They are also helpful for data cleaning because you can “cherry pick” the data you want to
analyze and leave behind the data that isn’t relevant to your project. Basically, it is like
canceling noise from your data so you can focus on what is most important to solve your
problem. This functionality is also useful for day-to-day data monitoring; with it, you can
build a tracking spreadsheet to share the relevant data with others. The data is synced with
the data source so when the data is updated in the source file, the tracked data is also
refreshed.
If you are using IMPORTRANGE in Google Sheets, data can be pulled from another
spreadsheet, but you must allow access to the spreadsheet the first time it pulls the data.
The URL shown below is for syntax purposes only. Don't enter it in your own
spreadsheet. Replace it with a URL to a spreadsheet you have created so you can control
access to it by clicking the Allow access button.
Refer to the Google support page for IMPORTRANGE for the sample usage and syntax.
Example of using IMPORTRANGE
An analyst monitoring a fundraiser needs to track and ensure that matching funds are
distributed. They use IMPORTRANGE to pull all the matching transactions into a
spreadsheet containing all of the individual donations. This enables them to determine
which donations eligible for matching funds still need to be processed. Because the total
number of matching transactions increases daily, they simply need to change the range
used by the function to import the most up-to-date data.
On Tuesday, they use the following to import the donor names and matched amounts:
=IMPORTRANGE(“https://docs.google.com/spreadsheets/d/1cOsHnBDzm9tBb8Hk_aLYfq
3-o5FZ6DguPYRJ57992_Y”, “Matched Funds!A1:B4001”)
On Wednesday, another 500 transactions were processed. They increase the range used by
500 to easily include the latest transactions when importing the data to the individual
donor spreadsheet:
=IMPORTRANGE(“https://docs.google.com/spreadsheets/d/1cOsHnBDzm9tBb8Hk_aLYfq
3-o5FZ6DguPYRJ57992_Y”, “Matched Funds!A1:B4501”)
Note: The above examples are for illustrative purposes only. Don't copy and paste
them into your spreadsheet. To try it out yourself, you will need to substitute your
own URL (and sheet name if you have multiple tabs) along with the range of cells in
the spreadsheet that you have populated with data.
Pulling data from other data sources
The QUERY function is also useful when you want to pull data from another spreadsheet.
The QUERY function's SQL-like ability can extract specific data within a spreadsheet. For a
large amount of data, using the QUERY function is faster than filtering data manually. This
is especially true when repeated filtering is required. For example, you could generate a list
of all customers who bought your company’s products in a particular month using manual
filtering. But if you also want to figure out customer growth month over month, you have to
copy the filtered data to a new spreadsheet, filter the data for sales during the following
month, and then copy those results for the analysis. With the QUERY function, you can get
all the data for both months without a need to change your original dataset or copy results.
The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet by name and
the range of data that you want to query from, and then use the SQL SELECT command to
select the specific columns. You can also add specific criteria after the SELECT statement by
including a WHERE statement. But remember, all of the SQL code you use has to be placed
between the quotes!
Google Sheets runs the Google Visualization API Query Language across the data. Excel
spreadsheets use a query wizard to guide you through the steps to connect to a data source
and select the tables. In either case, you are able to be sure that the data imported is
verified and clean based on the criteria in the query.
Examples of using QUERY
Check out the Google support page for the QUERY function with sample usage, syntax, and
examples you can download in a Google sheet.
Link to make a copy of the sheet: QUERY examples
Real life solution
Analysts can use SQL to pull a specific dataset into a spreadsheet. They can then use the
QUERY function to create multiple tabs (views) of that dataset. For example, one tab could
contain all the sales data for a particular month and another tab could contain all the sales
data from a specific region. This solution illustrates how SQL and spreadsheets are used
well together.
Filtering data to get what you want
The FILTER function is fully internal to a spreadsheet and doesn’t require the use of a
query language. The FILTER function lets you view only the rows (or columns) in the
source data that meet your specified conditions. It makes it possible to pre-filter data
before you analyze it.
The FILTER function might run faster than the QUERY function. But keep in mind, the
QUERY function can be combined with other functions for more complex calculations. For
example, the QUERY function can be used with other functions like SUM and COUNT to
summarize data, but the FILTER function can't.
Example of using FILTER
Check out the Google support page for the FILTER function with sample usage, syntax, and
examples you can download in a Google sheet.
Link to make a copy of the sheet: FILTER examples
Keeping data organized with sorting and filters
You have learned about four phases of analysis:
Organize data
Format and adjust data
Get input from others
Transform data
The organization of datasets is really important for data analysts. Most of the datasets you
will use will be organized as tables. Tables are helpful because they let you manipulate your
data and categorize it. Having distinct categories and classifications lets you focus on, and
differentiate between, your data quickly and easily.
Data analysts also need to format and adjust data when performing an analysis. Sorting and
filtering are two ways you can keep things organized when you format and adjust data to
work with it. For example, a filter can help you find errors or outliers so you can fix or flag
them before your analysis. Outliers are data points that are very different from similarly
collected data and might not be reliable values. The benefit of filtering the data is that after
you fix errors or identify outliers, you can remove the filter and return the data to its
original organization.
In this reading, you will learn the difference between sorting and filtering. You will also be
introduced to how a particular form of sorting is done in a pivot table.
Sorting versus filtering
Left image of a pair of hands sorting letters and numbers. Right image is a hand holding a
filter sorting numbers and letters
Sorting is when you arrange data into a meaningful order to make it easier to understand,
analyze, and visualize. It ranks your data based on a specific metric you choose. You can
sort data in spreadsheets, SQL databases (when your dataset is too large for spreadsheets),
and tables in documents.
For example, if you need to rank things or create chronological lists, you can sort by
ascending or descending order. If you are interested in figuring out a group’s favorite
movies, you might sort by movie title to figure it out. Sorting will arrange the data in a
meaningful way and give you immediate insights. Sorting also helps you to group similar
data together by a classification. For movies, you could sort by genre -- like action, drama,
sci-fi, or romance.
Filtering is used when you are only interested in seeing data that meets specific criteria,
and hiding the rest. Filtering is really useful when you have lots of data. You can save time
by zeroing in on the data that is really important or the data that has bugs or errors. Most
spreadsheets and SQL databases allow you to filter your data in a variety of ways. Filtering
gives you the ability to find what you are looking for without too much effort.
For example, if you are only interested in finding out who watched movies in October, you
could use a filter on the dates so only the records for movies watched in October are
displayed. Then, you could check out the names of the people to figure out who watched
movies in October.
To recap, the easiest way to remember the difference between sorting and filtering is that
you can use sort to quickly order the data, and filter to display only the data that meets the
criteria that you have chosen. Use filtering when you need to reduce the amount of data
that is displayed.
It is important to point out that, after you filter data, you can sort the filtered data, too. If
you revisit the example of finding out who watched movies in October, after you have
filtered for the movies seen in October, you can then sort the names of the people who
watched those movies in alphabetical order.
Sorting in a pivot table
Items in the row and column areas of a pivot table are sorted in ascending order by any
custom list first. For example, if your list contains days of the week, the pivot table allows
weekday and month names to sort like this: Monday, Tuesday, Wednesday, etc. rather than
alphabetically like this: Friday, Monday, Saturday, etc.
If the items aren’t in a custom list, they will be sorted in ascending order by default. But, if
you sort in descending order, you are setting up a rule that controls how the field is sorted
even after new data fields are added.
Hey, great to see you again.
Earlier we talked about why you should organize your data, no matter what part of
the lifecycle it's in.
Just like any collection, it's easier to manage and care for a group of things
when there's structure around them.
Now we should keep in mind that organization isn't just about making
things look orderly.
It's also about making it easier to search and
locate the data you need in a quick and easy way.
As a data analyst, you'll find yourself rearranging and
sifting through databases pretty often.
Two of the most common ways of doing this are with sorting and filtering.
We've briefly discussed sorting and filtering before, and
it's important you know exactly what each one does.
Sorting is when you arrange data into a meaningful order to make it easier to
understand, analyze, and visualize.
Sorting ranks your data based on a specific metric that you can choose.
You can sort data in spreadsheets and databases that use SQL.
We'll get to all the cool functions you can use in both a little later on.
A common way to sort items when you're shopping on a website is from lowest to
highest price, but you can also sort by alphabetical order,
like books in a library.
Or you can sort from newest to oldest,
like the order of text messages in a phone.
Or nearest to furthest away, like when you're searching for restaurants online.
Another way to organize information is with a filter.
Filtering is showing only the data that meets a specific criteria while
hiding the rest.
Typically you can use filters when you want to narrow down the amount of data you
want to sift through.
Say you're searching for green sneakers online. To save time, you filter for
green shoes only.
Using a filter slims down larger data sets to smaller subsets
that are relevant to what you need.
Sorting and filtering are two actions you probably perform a lot online.
Whether you're sorting movie showtimes from earliest to latest, or
filtering your search results to just images, you're probably already familiar
with how helpful they can be for making sense of data.
Now let's take that knowledge and apply it.
When it comes to sifting through large, disorganized piles of data,
filters are your friend.
You might remember from a previous video that you can use filters in
spreadsheet programs, like Excel and Sheets,
to display only the data from rows that match the range or condition you've set.
You can also filter data in SQL using the WHERE clause. The WHERE
clause works similarly to filtering in a spreadsheet because it returns rows
based on a condition you name.
Let's learn how you can use a WHERE clause in a database.
We'll use BigQuery to access the database and run our query.
If you're joining us, open up your tool of choice for using SQL and
reference the earlier resource on how to access the dataset.
Otherwise, watch as the WHERE clause does its thing.
Here's the database.
You might recognize it from past videos. Basically, it's a long list of movies.
Each row includes an entry for the columns named Movie_Title,
Release_Date, Genre, Director, Cast_Members, Budget, and
Total_Revenue. It also includes a link to the film's Wikipedia page.
If you scroll down the list, the list goes on for a long time.
Of course, we won't need to go through everything to find the data we want.
That's the beauty of a filter!
In this case, we'll use the WHERE clause to filter the database and
narrow down the list to movies in the comedy genre.
To start, we'll use the SELECT command followed by an asterisk.
In SQL, an asterisk selects all of the data.
On a new line, we'll type FROM and
the name of the database: movie_data.movies.
To filter the movies by comedy, we're going to type WHERE,
then list the condition, which is Genre.
Genre is a column in the dataset, and we only want to select rows where
the cell in the Genre column exactly matches "Comedy."
Next we'll type the equals sign and write the specific genre we're filtering for,
which is comedy.
Since the data in the Genre column is a string format,
we have to use single or double quotations when writing it.
And keep in mind that capitalization matters here,
so we have to make sure that the letter casing matches the value in the Genre column exactly.
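Here's the full query we've built:
SELECT *
FROM movie_data.movies
WHERE Genre = 'Comedy'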
And now we can click Run to check out the results.
What we're left with is a shorter list of comedy movies. Pretty cool, right?
Here's something else you should know.
You can apply multiple filters to a database. You can even sort and
filter data at the same time for even more precise results.
As a data analyst, knowing how to sort and filter data will make you a superstar.
That's all for now. Coming up, we'll get down to the nitty-gritty of sorting
functions in spreadsheets. See you there!
Sorting and filtering in Sheets and Excel
In this reading, we will describe the sorting and filtering options in Google Sheets and
Microsoft Excel. Both offer basic sorting and filtering from set menu options. But, if you
need more advanced sorting and filtering capabilities, you can use their respective SORT
and FILTER functions.
Sorting and filtering in Sheets
Sorting in Google Sheets helps you quickly spot trends in numbers. One trend might be
gross revenue by sales region. In this case, you could sort the gross revenue column in
descending (Z to A) order to spot the top performing regions at the top, or sort the gross
revenue column in ascending (A-Z) order to spot the lowest performing regions at the top.
Although an alphabetical order is implied, these sorting options do sort numbers, as our
gross revenue example highlighted.
If you want to learn more about the set menu options for sorting and filtering, start with these resources:
Sort and filter data (Google Help Center): instructions to sort data in alphabetical or numerical order and create filter views
Sort data by selecting a range of data in a column: video of steps to achieve the task
Sort a range of data using sort criteria for multiple columns: technical tip video to sort data across multiple columns
In addition to the standard menu options, there is a SORT function for more advanced
sorting. Use this function to create a custom sort. You can sort the rows of a given range of
data by the values in one or more columns. And you get to set the sort criteria per column.
Refer to the SORT function page for the syntax.
And like the SORT function, you can use the FILTER function to filter by any matching
criteria you like. This creates a custom filter.
You might recall that you can filter data and then sort the filtered results. Using the FILTER
and SORT functions together in a range of cells can programmatically and automatically
achieve these results for you.
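For example, a minimal sketch (with placeholder ranges and column positions) might look like this:
=SORT(FILTER(A2:D100, C2:C100="Comedy"), 2, FALSE)
This filters the rows where column C equals "Comedy" and then sorts the filtered result by its second column in descending order.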
Sorting and filtering in Excel
You can also sort in ascending (A-Z) and descending (Z-A) order in Microsoft Excel. Excel
offers Smallest to Largest and Largest to Smallest sorting when you are working with
numbers.
Similar to the SORT function in Google Sheets, Excel includes custom sort capabilities that
are available from the menu. After you select the data range, click the Sort & Filter button
to select the criteria for sorting. You can even sort by the data in rows instead of by the data
in columns if you select Sort left to right under Options. (Sort top to bottom is the default
setting to sort the data in columns.)
If you want to learn more about sorting and filtering in Excel, start with these resources:
Sort data in a range or table (Microsoft Support): instructions and video to perform sorting in 11 different use cases
Excel training: sort and filter data (Microsoft Support): sorting and filtering videos with transcripts
Excel: sorting data: video of how to use the Sort & Filter and Data menu options for sorting
Excel also has SORT, SORTBY, and FILTER functions. Explore how you can use these
functions to automatically sort and filter your data in spreadsheets without having to select
any menu options at all.
Hello there! If you're hoping to learn
about sorting—in SQL this time—
you've definitely come to the right place.
So far, we've sorted spreadsheets
through the menu and with a written function.
Which brings us to the next part of our learning:
more sort functions, but this time in SQL.
Data analysts love playing
with the way data is presented.
Sorting is a useful way to rearrange data because it can
help you understand the data
you have in a different light.
As you've probably already noticed,
a lot of things you can do in
spreadsheets can also be done in SQL.
Sorting is one of those things.
We've talked about using SQL with large datasets before.
When a spreadsheet has too much data,
you can get error messages,
or it can cause your program to crash.
That's definitely something we want to avoid.
SQL shortens processes that would otherwise take
a very long time or be
impossible to complete in a spreadsheet.
Personally, I use SQL to
pull and combine different data tables.
It's much quicker than
a spreadsheet, and that usually comes in handy.
Here's something pretty helpful you can do with SQL.
You can use the ORDER BY clause
to sort results returned in a query.
Let's go back to our movie spreadsheet to
get a better idea of how this works.
Feel free to follow along in
a SQL tool of your choice as we go.
As a quick refresher,
we have a database of movies listed with
data like release date, director, and more.
We can sort this table in lots of
different ways using the ORDER BY function.
For this example, let's sort by release date.
First, we have the SELECT function and an asterisk.
Keep in mind that the asterisk
means all columns are selected.
Then we have FROM and the name of
the database and table we're in right now.
Now let's check out the next line.
It's empty, but that's where we'll
write our ORDER BY function.
The ORDER BY command is
usually the last clause in your query.
Back to the actual sorting!
We'll type ORDER BY with the space.
With this clause, you can choose to
order data by fields in a certain column.
Because we want to sort by release date,
we'll type Release_Date.
By default, the ORDER BY
clause sorts data in ascending order.
If you run the query as it is right now,
the movies will be sorted from
oldest to the most recent release dates.
Let's run the query and see what we've got.
You can also sort the release dates in
the reverse order from
the most recent dates to the oldest.
To do this, just specify
the descending order in the ORDER BY
command written as DESC,
D-E-S-C. Let's run this query.
As you'll notice, the most recently released films
are now at the top of the database.
In spreadsheets, you can combine sorts and
filters to display information differently.
You can do something similar in SQL too.
You might remember that while sorting
puts data in a specific order,
filters narrow down data,
so you only see data that fits the filter.
For example, let's say we want to filter movies by
genre so that we're only working with comedies.
But we still want release dates to be
sorted in descending order,
from most recent to oldest films.
We can do this with the WHERE clause.
Let's try that now.
First, we'll check that the ORDER BY
clause is always the last line.
That makes sure that all the results of
the query you're running are sorted by that clause.
Then, we'll add a new line for the WHERE clause
after FROM and before ORDER BY.
Here's what we've got so far.
From there, we want to type
the column we're filtering for.
In this case, we want to
filter the database for comedies.
After the WHERE clause,
we'll type the name of the column, which is Genre.
Now, we'll add an equal sign after Genre because we
only want to include genres that
match what we're filtering for.
In this case, we're filtering for comedy,
so we'll type Comedy between two apostrophes.
Now, if you check out the entire query as a whole,
you'll notice that we're selecting all columns,
and we know it's all columns
because that's what an asterisk means.
The FROM clause specifies
the name of the movie database we're using,
and the WHERE clause filters the data to include
entries whose genre is specified as comedy.
Then in the last line,
we have the ORDER BY clause,
which will sort the data we've chosen to filter
by release dates in descending order.
This means when we run the query,
we'll only have comedy movies listed
from newest releases to oldest releases.
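So the whole query, assuming we're still working in the same movie_data.movies table from before, would look something like this:
SELECT *
FROM movie_data.movies
WHERE Genre = 'Comedy'
ORDER BY Release_Date DESC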
Let's run it and figure out if that's the case.
Cool. Check out all those comedy movies
and the way those dates are sorted.
Now, let's take this query a step further.
We'll filter for two conditions at
once using the AND filter.
Working off the query we've been using,
we'll add a second condition in the WHERE clause.
We'll keep the sorting the same.
Let's say you wanted to filter by comedy movies and
movies that earned over 300 million in the box office.
In this case, after the AND function,
you'd add the revenue condition by typing Revenue.
From there, you'll specify that you only want to return
films with revenues over $300 million.
To do that, type the greater than sign
and then the complete number of
300 million without commas.
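Putting it all together, the query would look something like this (the narration calls the column Revenue, but double-check your own table's schema, since the column list earlier showed Total_Revenue):
SELECT *
FROM movie_data.movies
WHERE Genre = 'Comedy'
AND Revenue > 300000000
ORDER BY Release_Date DESC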
Now let's run the query.
Here, the data only shows comedy movies with revenues of
over $300 million, and it's
sorted in descending order by release date.
It looks really good.
You just filtered and sorted
a database like it's your job.
And with practice, one day it can be.
Just like that, you've finished
another step in your data analyst journey.
By now, you've really dug in and learned
about the analysis process with
a special emphasis on how organization
can change how you go through your data.
You also learned about both spreadsheets and SQL,
and how to sort and filter data
in both types of programs.
To help you get more comfortable
using spreadsheet and SQL features,
you'll be getting some materials you
can use as a resource.
Coming up, we'll check out how
an organizational mindset can take
your analytical skills even further.
We'll also cover converting, formatting,
and adjusting data to combine
information in a way that makes sense.
Learning those skills early on
can make your work as a data analyst much more
efficient and effective in the long run. See you soon.
Hands-On Activity: Analyze weather data in BigQuery
Activity overview
Previously, you learned how to use BigQuery to clean data and prepare it for analysis. Now you will
query a dataset and save the results into a new table. This is a useful skill when the original data
source changes continuously and you need to preserve a specific dataset for continued analysis. It’s
also valuable when you are dealing with a large dataset and know you’ll be doing more than one
analysis using the same subset of data.
In this scenario, you’re a data analyst at a local news station. You have been tasked with answering
questions for meteorologists about the weather. You will work with public data from the National
Oceanic and Atmospheric Administration (NOAA), which has data for the entire United States. This
is why you will need to save a subset of the data in a separate table.
By the time you complete this activity, you will be able to use SQL queries to create new tables when
dealing with complex datasets. This will greatly simplify your analysis in the future.
Access the public dataset
For this activity you will need the NOAA weather data from BigQuery’s public datasets.
1. Click on the + ADD DATA button in the Explorer menu pane and select Explore public
datasets. This will open a new menu where you can search public datasets that are already
available through Google Cloud. If you have already loaded the BigQuery public datasets into your
console, you can just search noaa_gsod in your Explorer menu and skip these steps.
2. Type noaa_gsod into the search bar. You’ll find the GSOD (Global Surface Summary of the Day) weather data.
3. Click the GSOD dataset to open it. This will provide you with more detailed information about the
dataset if you’re interested. Click VIEW DATASET to open this dataset in your console.
4. Search noaa_gsod in your Explorer menu pane to find the dataset. Click the dropdown menu
to explore the tables in this dataset. Scroll down to gsod2020 and open the table menu by
clicking the three vertical dots.
5. Check the table’s schema and preview it to get familiar with the data. Once you’re ready, you
can click COMPOSE NEW QUERY to start querying the dataset.
Querying the data
The meteorologists who you’re working with have asked you to get the temperature, wind speed,
and precipitation for stations La Guardia and JFK, for every day in 2020, in descending order by
date, and ascending order by Station ID. Use the following query to request this information:
SELECT
date,
stn,
-- Use the IF function to replace 9999.9 values, which the dataset description explains is the default value when temperature is missing, with NULLs instead.
IF(
temp=9999.9,
NULL,
temp) AS temperature,
-- Use the IF function to replace 999.9 values, which the dataset description explains is the default value when wind speed is missing, with NULLs instead.
IF(
wdsp="999.9",
NULL,
CAST(wdsp AS Float64)) AS wind_speed,
-- Use the IF function to replace 99.99 values, which the dataset description explains is the default value when precipitation is missing, with NULLs instead.
IF(
prcp=99.99,
0,
prcp) AS precipitation
FROM
`bigquery-public-data.noaa_gsod.gsod2020`
WHERE
stn="725030" -- La Guardia
OR stn="744860" -- JFK
ORDER BY
date DESC,
stn ASC
The meteorologists also asked you a couple questions while they were preparing for the nightly
news: They want the average temperature in June 2020 and the average wind_speed in December
2020.
Instead of rewriting similar, but slightly different, queries over and over again, there is an easier
approach: Save the results from the original query as a table for future queries.
Save a new table
In order to make this subset of data easier to query from, you can save the table from the weather
data into a new dataset.
1. From your Explorer pane, click the three vertical dots next to your project and select Create
dataset. You can name this dataset demos and leave the rest of the default options. Click CREATE
DATASET.
2. Open your new dataset and select COMPOSE NEW QUERY. Input the following query to get the temperature, wind speed, and precipitation for the La Guardia and JFK stations for every day in 2020, in descending order by date, and ascending order by Station ID:
SELECT
stn,
date,
-- Use the IF function to replace 9999.9 values, which the dataset description explains is the default value when temperature is missing, with NULLs instead.
IF(
temp=9999.9,
NULL,
temp) AS temperature,
-- Use the IF function to replace 999.9 values, which the dataset description explains is the default value when wind speed is missing, with NULLs instead.
IF(
wdsp="999.9",
NULL,
CAST(wdsp AS Float64)) AS wind_speed,
-- Use the IF function to replace 99.99 values, which the dataset description explains is the default value when precipitation is missing, with NULLs instead.
IF(
prcp=99.99,
0,
prcp) AS precipitation
FROM
`bigquery-public-data.noaa_gsod.gsod2020`
WHERE
stn="725030" -- La Guardia
OR stn="744860" -- JFK
ORDER BY
date DESC,
stn ASC
3. Before you run the query, select the MORE menu from the Query Editor and open the Query
Settings menu. In the Query Settings menu, select Set a destination table for query results. Set
the dataset option to demos and name the table nyc_weather.
4. Run the query from earlier; now it will save as a new table in your demos dataset.
5. Return to the Query settings menu by using the MORE dropdown menu. Reset the settings to
Save query results in a temporary table. This will prevent you from accidentally adding every
query as a table to your new dataset.
Query your new table
Now that you have the subset of this data saved in a new table, you can query it more easily. Use
the following query to find the average temperature from the meteorologists’ first question:
SELECT
AVG(temperature)
FROM
`airy-shuttle-315515.demos.nyc_weather` --remember to change the project name
to your project before running this query
WHERE
date BETWEEN '2020-06-01' AND '2020-06-30'
You can also use this syntax to find the average wind_speed or any other information from this
subset of data you’re interested in. Try constructing a few more queries to answer the
meteorologists’ questions!
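For example, a sketch of the December wind speed question follows the same pattern:
SELECT
AVG(wind_speed)
FROM
`airy-shuttle-315515.demos.nyc_weather` --remember to change the project name to your project before running this query
WHERE
date BETWEEN '2020-12-01' AND '2020-12-31'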
The ability to save your results into a new table is a helpful trick when you know you're only
interested in a subset of a larger complex dataset that you plan on querying multiple times, such as
the weather data for just La Guardia and JFK. This also helps minimize errors during your analysis.
Confirmation and reflection
What was the average temperature at JFK and La Guardia stations between June 1, 2020 and June
30, 2020?
1 / 1 point
72.883
92.099
87.671
74.909
Correct
The average was 72.883. To find out the average temperature during this time period, you
successfully created a new table using a query and ran another query against that table. Going
forward, you will be able to use this skill to create tables with specific subsets of your data to query.
This will help you draw insights from multiple data sources in the future.
Converting data in spreadsheets
In this reading, you will learn about converting data from one format to another. One of the ways to
help ensure that you have an accurate analysis of your data is by putting all of it in the correct
format. This is true even if you have already cleaned and processed your data. As a part of getting
your data ready for analysis, you will need to convert and format your data early on in the process.
A tornado sweeping everything up; an arrow indicating Data Conversion and second image of bar
graph, pie chart, line graph
As a data analyst, there are lots of scenarios when you might need to convert data in a spreadsheet:
String to date
How to convert text to date in Excel: Transforming a series of numbers into dates is a common
scenario you will encounter. This resource will help you learn how to use Excel functions to convert
text and numbers to dates, and how to turn text strings into dates without a formula.
Google Sheets: Change date format: If you are working with Google Sheets, this resource will
demonstrate how to convert your text strings to dates and how to apply the different date formats
available in Google Sheets.
String to numbers
How to convert text to number in Excel: Even though you will have values in your spreadsheet that
resemble numbers, they may not actually be numbers. This conversion is important because it will
allow your numbers to add up and be used in formulas without errors in Excel.
How to convert text to numbers in Google Sheets: This resource is useful if you are working in Google
Sheets; it will demonstrate how to convert text strings to numbers in Google Sheets. It also includes
multiple formulas you can apply to your own sheets, so you can find the method that works best for
you.
Combining columns
Convert text from two or more cells: Sometimes you may need to merge text from two or more cells.
This Microsoft Support page guides you through two distinct ways you can accomplish this task
without losing or altering your data. It also includes a step-by-step video tutorial to help guide you
through the process.
How to split or combine cells in Google Sheets: This guide will demonstrate how to split or combine
cells using Google Sheets specifically. If you are using Google Sheets, this is a useful resource to
reference if you need to combine cells. It includes an example using real data.
Number to percentage
Format numbers as percentages: Formatting numbers as percentages is a useful skill to have on any
project. This Microsoft Support page will provide several techniques and tips for how to display your
numbers as percentages.
TO_PERCENT: This Google Sheets support page demonstrates how to use the TO_PERCENT formula
to convert numbers to percentages. It also includes links to other formulas that can help you convert
strings.
Pro tip: Keep in mind that you may have lots of columns of data that require different formats.
Consistency is key, and best practice is to make sure an entire column has the same format.
Additional resources
If you find yourself needing to convert other types of data, you can find resources on Microsoft
Support for Excel or Google Docs Editor Help for Google Sheets.
Converting data is quick and easy, and the same functions can be used again and again. You can
also keep these links bookmarked for future use, so you will always have them ready in case any of
these issues arise. Now that you know how to convert data, you are on your way to becoming a
successful data analyst.
Transforming data in SQL
Data analysts usually need to convert data from one format to another to complete an
analysis. But what if you are using SQL rather than a spreadsheet? Just like spreadsheets,
SQL uses standard rules to convert one type of data to another. If you are wondering why
data transformation is an important skill to have as a data analyst, think of it like being a
driver who is able to change a flat tire. Being able to convert data to the right format speeds
you along in your analysis. You don’t have to wait for someone else to convert the data for
you.
In this reading, you will go over the conversions that can be done using the CAST function.
There are also more specialized functions like COERCION to work with big numbers, and
UNIX_DATE to work with dates. UNIX_DATE returns the number of days that have passed
since January 1, 1970 and is used to compare and work with dates across multiple time
zones. You will likely use CAST most often.
Common conversions
The following table summarizes some of the more common conversions made with the
CAST function. Refer to Conversion Rules in Standard SQL for a full list of functions and
associated rules.
Starting with: Numeric (number)
CAST function can convert to: Integer, Numeric (number), Big number, Floating integer, String
Starting with: String
CAST function can convert to: Boolean, Integer, Numeric (number), Big number, Floating integer, String, Bytes, Date, Date time, Time, Timestamp
Starting with: Date
CAST function can convert to: String, Date, Date time, Timestamp
The CAST function (syntax and examples)
CAST is an American National Standards Institute (ANSI) function used in lots of
programming languages, including BigQuery. This section provides the BigQuery syntax
and examples of converting the data types in the first column of the previous table. The
syntax for the CAST function is as follows:
CAST (expression AS typename)
Where expression is the data to be converted and typename is the data type to be
returned.
Converting a number to a string
The following CAST statement returns a string from a numeric identified by the variable
MyCount in the table called MyTable.
SELECT CAST (MyCount AS STRING) FROM MyTable
In the above SQL statement, the following occurs:
SELECT indicates that you will be selecting data from a table
CAST indicates that you will be converting the data you select to a different data type
AS comes before and identifies the data type which you are casting to
STRING indicates that you are converting the data to a string
FROM indicates which table you are selecting the data from
Converting a string to a number
The following CAST statement returns an integer from a string identified by the variable
MyVarcharCol in the table called MyTable. (An integer is any whole number.)
SELECT CAST(MyVarcharCol AS INT) FROM MyTable
In the above SQL statement, the following occurs:
SELECT indicates that you will be selecting data from a table
CAST indicates that you will be converting the data you select to a different data type
AS comes before and identifies the data type which you are casting to
INT indicates that you are converting the data to an integer
FROM indicates which table you are selecting the data from
Converting a date to a string
The following CAST statement returns a string from a date identified by the variable
MyDate in the table called MyTable.
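SELECT CAST(MyDate AS STRING) FROM MyTable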
In the above SQL statement, the following occurs:
SELECT indicates that you will be selecting data from a table
CAST indicates that you will be converting the data you select to a different data type
AS comes before and identifies the data type which you are casting to
STRING indicates that you are converting the data to a string
FROM indicates which table you are selecting the data from
Converting a date to a datetime
Datetime values use the format YYYY-MM-DD hh:mm:ss, so date and time are
retained together. The following CAST statement returns a datetime value from a date.
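SELECT CAST(MyDate AS DATETIME) FROM MyTable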
In the above SQL statement, the following occurs:
SELECT indicates that you will be selecting data from a table
CAST indicates that you will be converting the data you select to a different data type
AS comes before and identifies the data type which you are casting to
DATETIME indicates that you are converting the data to a datetime value
FROM indicates which table you are selecting the data from
The SAFE_CAST function
Using the CAST function in a query that fails returns an error in BigQuery. To avoid errors
in the event of a failed query, use the SAFE_CAST function instead. The SAFE_CAST
function returns a value of Null instead of an error when a query fails.
The syntax for SAFE_CAST is the same as for CAST. Simply substitute the function directly
in your queries. The following SAFE_CAST statement returns a string from a date.
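SELECT SAFE_CAST(MyDate AS STRING) FROM MyTable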
Optional: Prepare to use the bike sharing
dataset in BigQuery
The next video demonstrates how to use CONCAT in a SQL query to return data from two
columns in a single column.
If you would like to follow along with the instructor, you will need to log in to your
BigQuery account to use the open (public) dataset called new_york_citibike. If you need a
refresher, the reading Using BigQuery in the Prepare Data for Exploration course
explains how to set up a BigQuery account.
Prepare for the next video
Step 1: In the BigQuery Explorer, enter citibike in the search bar to locate the
new_york_citibike dataset under bigquery-public-data.
Step 2: Click the citibike_trips table, then click the Preview tab to view the data in the
table.
What to expect from the query
You will be using CONCAT to combine the data in the start_station_name column with the
data in the end_station_name column to create the route information in another column;
for example, the route from Station 509 to Station 442 in the first row of the table above
would be 9 Ave & W 22 St to W 27 St & 7 Ave, a combination of the start and end station
names.
Great to see you back.
In this video, we'll build on
what we've learned about CONCATENATE
and IMPORTRANGE by exploring a new SQL query: CONCAT.
You might remember that CONCATENATE is a function
that joins together two or more text strings.
As a quick reminder,
a text string is a group of characters within
a cell most often composed of letters.
You've seen how that works within a single spreadsheet.
But there's a similar function in
SQL that allows you to join
multiple text strings from multiple sources, CONCAT.
Let's use CONCAT to combine strings
from multiple tables to create new strings.
For this example, we'll use open data from Citi Bike,
which is a public bicycle sharing system in New York.
As you've learned earlier,
open data initiatives have created
a ton of data for analysts to use.
Openness or open data is free access,
usage, and sharing of data.
It's a great resource if you want to practice or
experiment with the data analysis tools
you've been learning here.
You have open access to
the New York city bike-sharing data,
which has information about
the use of shared bikes across the city.
Now we can use CONCAT to pull and
concatenate data from different columns stored here.
The first thing we need to do is
figure out which columns we need.
That way we can tell SQL where the strings we want are.
For example, the bike-sharing company
has two different kinds of customers;
one-time paying customers and subscribers.
Let's say we want to find out what routes are
most popular with different user types.
To do that, we need to create strings of
recognizable route names that we can count and sort.
We know that the information we need
is in the stations and trips table.
We'll start building our query from there.
First, we'll input SELECT user type to
let SQL know that we want the user type as a column.
Then we'll use CONCAT
to combine the names of the beginning
and ending stations for each trip in a new column.
This will create one column
based on the routes people take.
We also need to input a title for this new column.
We'll type in, AS route,
to name the route column using those beginning and
ending station names we combined with CONCAT.
This will make these route names
easy for us to read and understand.
After that, we want SQL to count the number of trips.
So we'll input COUNT to do that.
We can use an asterisk to tell it to count up
the number of rows in the data we're selecting.
In this case, each row represents a trip,
which is why we can just count all
of the rows we've selected.
We'll name this output as num_trips.
Now let's also get the average
trip duration for each route.
In this case, we don't need the exact average,
so we can use the ROUND function to round up.
We'll put that first and then in the parentheses
use average to get the average trip duration.
We'll also want this data to be in
integer form for this calculation,
so we'll input CAST AS INT64.
BigQuery stores numbers in a 64-bit memory system,
which is why there's a 64 after integer in this case.
Next, we'll divide it by the number
of rows and tell it how
far we want it to round, two decimal places.
We'll name this output as duration.
We'll need to tell SQL where this information is stored.
We'll use FROM and the location we're pulling it from.
Since we're using COUNT and
AVERAGE functions in our select clause,
we have to use GROUP BY to group together summary rows.
Let's group by the start station,
the end station, and the user type for this query.
Finally, we'll use ORDER BY to
tell it how we want to organize this data.
For this, we want to figure out
the most common trips so we can input the number of
trips column and use DESC to put it in descending order.
Finally, we only want the top 10,
so let's add LIMIT 10.
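Here's a sketch of the full query as described; the exact column names, like usertype and tripduration, are assumptions based on the public citibike_trips schema, so check the table's schema in your console before running it:
SELECT
usertype,
CONCAT(start_station_name, " to ", end_station_name) AS route,
COUNT(*) AS num_trips,
-- Assumes tripduration is stored in seconds; dividing by 60 gives minutes.
ROUND(AVG(CAST(tripduration AS INT64) / 60), 2) AS duration
FROM
`bigquery-public-data.new_york_citibike.citibike_trips`
GROUP BY
start_station_name,
end_station_name,
usertype
ORDER BY
num_trips DESC
LIMIT 10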
Now thanks to CONCAT,
we can easily read
these route names and trace them back to real places.
We can see which kinds of
customers are taking which routes,
which can help the bike-sharing company
understand their user base
in different parts of the city and where
to keep more bikes for people to rent.
Being able to combine multiple pieces of data
can give you new ways to organize and analyze data.
There's a lot of different tools to help you do that.
Now you've seen CONCAT in action,
and later you will come across
another similar query, JOIN.
But up next, we'll talk more about
working with strings. See you soon.
Manipulating strings in SQL
Knowing how to convert and manipulate your data for an accurate analysis is an important
part of a data analyst’s job. In this reading, you will learn about different SQL functions and
their usage, especially regarding string combinations.
A string is a set of characters that helps to declare the texts in programming languages
such as SQL. SQL string functions are used to obtain various information about the
characters, or in this case, manipulate them. One such function, CONCAT, is commonly
used. Review the table below to learn more about the CONCAT function and its variations.
Function: CONCAT
Usage: A function that adds strings together to create new text strings that can be used as unique keys
Example: CONCAT (‘Google’, ‘.com’);
Function: CONCAT_WS
Usage: A function that adds two or more strings together with a separator
Example: CONCAT_WS (‘ . ’, ‘www’, ‘google’, ‘com’); the separator (the period) gets input before and after ‘google’ in this SQL function
Function: CONCAT with +
Usage: Adds two or more strings together using the + operator
Example: ‘Google’ + ‘.com’
CONCAT at work
When adding two strings together, such as ‘Data’ and ‘analysis’, it will be input like this:
SELECT CONCAT (‘Data’, ‘analysis’);
The result will be:
Dataanalysis
Sometimes, depending on the strings, you will need to add a space character, so your function should actually be:
SELECT CONCAT (‘Data’, ‘ ’, ‘analysis’);
And the result will be:
Data analysis
The same rule applies when combining three strings together. For example:
SELECT CONCAT (‘Data’, ‘ ’, ‘analysis’, ‘ ’, ‘is’, ‘ ’, ‘awesome!’);
And the result will be:
Data analysis is awesome!
Practice makes perfect
W3 Schools is an excellent resource for interactive SQL learning, and the following links will guide you through transforming your data using SQL:
SQL functions: This is a comprehensive list of functions to get you started. Click on each function, where you will learn about the definition, usage, examples, and even be able to create and run your own query for practice. Try it out for yourself!
SQL Keywords: This is a helpful SQL keywords reference to bookmark as you increase your knowledge of SQL. This is a list of reserved words that you will use as your need to perform different operations in the database grows.
While this reading went through the basics of each of these functions, there is still more to learn, and you can even combine your own strings:
1. Practice using CONCAT
2. Practice using CONCAT_WS
3. Practice using CONCAT with +
Pro tip: The functions presented in the resources above may be applied in slightly different
ways depending on the database that you are using (e.g., MySQL versus SQL Server). But,
the general description provided for each function will prepare you to customize how you
use these functions as needed.
Hi there. Data analysts
spend a lot of time problem-solving,
and that means there's going to be
times when you get stuck,
but the trick is knowing what to do when that happens.
In this video, we'll talk about
the importance of knowing how to get help,
whether that means asking someone else for
help or searching the internet for answers.
Asking other people about
a problem you're having can help
you find new solutions that move a project forward.
It's always a good idea to
reach out to your peers and mentors,
especially if they're working with you on that project.
Your team members have valuable
knowledge and insight that can
help you find the solution you need to get unstuck.
Sometimes we spend a lot of
time spinning our wheels saying,
"I can do this myself," but we can be way more
productive if we engage with other people,
find new resources to lean on and try
to get as many voices as we can involved.
For example, let's say you're working with
the bike trip time data from the previous videos.
Maybe you're trying to find the average time
between bike rides in a given month.
Calculating the difference between
bike rides before midnight is easy,
but you can run into a problem if
the elapsed time crosses into the next day.
If someone went on a bike ride at 11:00 PM,
but the next ride wasn't until 06:00 AM,
your formula would return a negative number
because the end time is less than the start time.
You know that you can add one minus
the start time if two bike rides
start and end on different days,
but that formula won't work on
times that happened in the same day,
and it's pretty inefficient to scroll through
every bike ride to pinpoint these special cases.
You need to find a way to build a conditional formula,
but you aren't sure how.
You decide to check in with other analysts
working on your team to see if they have any ideas.
You could send them a quick email,
or stop by their desk,
to find out if they have a minute
to talk it over with you.
Turns out they had a similar problem
on a previous project,
and they're able to show you a conditional formula that
you could use to speed up your calculations.
Great! They suggest using an IF formula like this.
This basically says that,
"if the end time is smaller than the start time,
replace the standard end time minus start
time formula with one minus the start time plus the end time."
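As a rough sketch, if the start time were in cell B2 and the end time in cell C2 (hypothetical references), the formula might look like this:
=IF(C2<B2, 1-B2+C2, C2-B2)
That way, rides that cross midnight still return a positive elapsed time.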
Now it's also possible that your team members
don't have an answer; that's okay too.
There's definitely someone else with
the same problem asking the same questions online.
Knowing how to find solutions online is
an incredibly valuable problem-solving tool
for data analysis.
There's also all kinds of forums where
spreadsheet users can ask questions,
and you never know what you can turn
up with just a basic search.
For example, let's say you look
at "calculate number of hours between
times" spreadsheets and find
a helpful walk-through for
a more complicated formula using MOD.
This flips the negative values into
positive ones, solving your calculation problem.
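As a rough sketch, using the same hypothetical cells, a MOD version might look like this:
=MOD(C2-B2, 1)
MOD with a divisor of 1 wraps any negative difference back into a positive fraction of a day, which you can multiply by 24 to get hours.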
Whether you're asking someone you know
or searching the internet for answers,
reaching out for help can give you
some really interesting solutions and
new ways to solve problems for future analysis.
Coming up, we'll learn even more about searching
for solutions online. See you soon.
Advanced spreadsheet tips and tricks
Like a lot of the things you’re learning in this program, spreadsheets will get easier the
more you practice. This reading provides you with a list of resources that may help advance
your knowledge and experience with spreadsheet functions and functionality. The goal is
to provide you with access to a variety of advanced tips and tricks that will help make you
more efficient and effective when working with spreadsheets to perform data analysis.
Review the description of each resource below, click the links to learn more, and save or
bookmark any links that are useful to you. You can immediately start practicing anything
that you learn to increase the chances of your understanding and to build your familiarity
with spreadsheets. This reading provides a range of resources, so feel free to explore the
ones that are applicable to you and skip the ones that aren’t.
Google Sheets
Keyboard shortcuts for Google Sheets: This is a great resource for quickly learning a range of keyboard shortcuts that can make regular tasks quicker and easier, like navigating your spreadsheet or accessing formulas and functions. This list contains shortcuts for the desktop and mobile versions of Google Sheets so that you can apply them to your work no matter what device you are using.
List of Google Sheets functions: This is a comprehensive list of the Google Sheets functions and syntax. Each function is listed with a link to learn more.
20 Google Sheets Formulas You Must Know: This blog article summarizes and describes 20 of the most useful Google Sheets formulas.
18 Google Sheets Formula Tips and Techniques: These are tips for using Google Sheets shortcuts when working with formulas.
Excel
Keyboard shortcuts in Excel: Earlier in this list, you were provided with a resource for keyboard shortcuts in Google Sheets. Similarly, this resource provides a list of keyboard shortcuts in Excel that will make performing regular spreadsheet tasks more efficient. This includes keyboard shortcuts for both desktop and mobile versions of Excel, so you can apply them no matter what platform you are working on.
222 Excel shortcuts: A compilation of shortcuts that includes links to more detailed explanations about how to use them. This is a great way to quickly reference keyboard shortcuts. The list has been organized by functionality, so you can go directly to the sections that are most useful to you.
List of spreadsheet functions: This is a comprehensive list of Excel spreadsheet functions with links to more detailed explanations. This is a useful resource to save so that you can reference it often; that way, you’ll have access to functions and examples that you can apply to your work.
List of spreadsheet formulas: Similar to the previous resource, this comprehensive list of Excel spreadsheet formulas with links to more detailed explanations can be saved and referenced any time you need to check out a formula for your analysis.
Essential Excel Skills for Analyzing Data: This blog post includes more advanced functionalities of some spreadsheet tools that you have previously learned about, like pivot tables and conditional formatting. These skills have been identified as particularly useful for data analysis. Each section includes a how-to video that will take you through the process of using these functions step-by-step, so that you can apply them to your own analysis.
Advanced Spreadsheet Skills: Mark Jhon C. Oxillo’s presentation starts with a basic overview of spreadsheets but also includes advanced functions and exercises to help you apply formulas to actual data in Excel. This is a great way to review some basic concepts and practice the skills you have been learning so far.
There are lots of resources online about advanced spreadsheet tips and tricks. You'll
probably discover new resources and tools on your own, but this list is a great starting
point as you become more familiar with spreadsheets.
Question 1
Overview
Now that you are learning how to convert and format data for analysis, you can pause for a moment
and think about what you are learning. In this self-reflection, you will consider how you can seek help
while you learn, then respond to brief questions.
This self-reflection will help you develop insights into your own learning and prepare you to ask the
data analytics community on Stack Overflow about what you’re learning. As you answer questions—
and come up with questions of your own—you will consider concepts, practices, and principles to
help refine your understanding and reinforce your learning. You’ve done the hard work, so make
sure to get the most out of it: This reflection will help your knowledge stick!
Seeking help on Stack Overflow
Stack Overflow is an online platform where programmers ask code-related questions and peers are
available to suggest answers. You can ask questions about programming languages such as SQL
and R (which you will learn about in Course 7), data tools, and much more. Follow the steps below
to get started on Stack Overflow.
Sign up for an account
To sign up for Stack Overflow:
1. Click on the Sign up button in the upper right corner
2. Follow the on-screen prompts to enter your desired login information.
3. Click the Sign up button.
Explore Stack Overflow
From the home page, click the dropdown in the upper left corner and click Questions.
The Questions page provides different categories of questions for you to choose. Some examples
include the “Newest” and “Active” categories. Read some of the questions under the different
categories.
Tags will help you find questions. On the left pane, click on Tags.
On the Tags page, type in a tag name and then press Enter or Return. Next, you can click on a tag
to view questions that have that particular tag.
Use the Search bar at the top of the web page to search for keywords and questions. If you would
like to view only questions that have a certain tag, include the tag name in brackets with your search.
For example, if you want to only find questions that have the tag “SQL,” then type [SQL] in the
search field, along with your keywords or question. See the example below.
To learn more about searching, read these instructions about how to search. For a quick guide on
syntax structures, check out this list of search types and search syntax.
Write your own question
When asking a question on Stack Overflow, keep it specific. Don’t use Stack Overflow to ask
questions with opinion-based answers.
For example, “Which SQL function can I use to add two numbers together?” is an appropriate
question. “Which SQL function is your favorite?” is not.
It is a best practice to search the Stack Overflow website for your question in case someone has
already asked it. This reduces redundant questions on the site and saves you the time it would take
to wait for an answer.
Write clear and concise questions in complete sentences. Then people are more likely to understand
what you ask and give you helpful answers.
To begin asking a question, click the blue Ask Question button on this page.
The form for asking a question has three sections: Title, Body, and Tags.
Title: This is where you ask your question.
Body: Summarize your problem and include expected and actual results. Include any error codes. If you think that inserting code into the Body section will help, press Ctrl+K (Windows) or Cmd+K (Mac OS) on your keyboard. Then type your code.
Tags: Tags include specific keywords, like program names. They help other people find your question. You can add up to five tags. Check out this list of existing tags for examples of what tags to use.
Note: Stack Overflow is a public forum. Do not post any confidential company information or code
that could impact the company you work for or yourself. When in doubt, first ask your manager
whether you may post your question and code excerpt on Stack Overflow.
Weekly challenge 2
Latest Submission Grade 37.5%
1.
Question 1
An analyst working for a British school system just downloaded a dataset that was created in the
United States. The numerical data is correct but it is formatted as U.S. dollars, and the analyst needs
it to be in British pounds. What spreadsheet tool can help them select the right format?
0 / 1 point
Format as Pounds
Format as Currency
EXCHANGE
CURRENCY
Incorrect
Review the video on formatting data for a refresher.
2.
Question 2
You are using a spreadsheet to organize a list of upcoming home repairs. Column A contains the list
of repairs, and column B notes the priority of each item on the list: High Priority or Low Priority. What
spreadsheet tool can you use to create a drop-down list of priorities for each cell in column B?
0 / 1 point
Pop-up menus
Conditional formatting
Data validation
Find
Incorrect
Review the video on spreadsheet features for formatting data for a refresher.
3.
Question 3
A data analyst in human resources uses a spreadsheet to keep track of employees’ work
anniversaries. They add color to any employee who has worked for the company for more than 10
years. Which spreadsheet tool changes how cells appear when values equal 10 or more?
0 / 1 point
Data validation
CONVERT
Conditional formatting
Add color
Incorrect
Review the video on spreadsheet features for formatting data for a refresher.
4.
Question 4
You are analyzing data about the capitals of different countries. In your SQL database, you have one
column with the names of the countries and another column with the names of the capitals. What
function can you use in your query to combine the countries and capitals into a new column?
0 / 1 point
CONCAT
COMBINE
GROUP
JOIN
Incorrect
Review the video on joining text strings for a refresher.
5.
Question 5
You are querying a database of ice cream flavors to determine which stores are selling the most
mint chip. For your project, you only need the first 80 records. What clause should you add to the
following SQL query?
1 / 1 point
LIMIT_80
LIMIT,80
LIMIT = 80
LIMIT 80
Correct
To return only the first 80 records, type LIMIT 80.
6.
Question 6
Fill in the blank: A data analyst is working with a spreadsheet that has very long text strings. They
use the LEN function to count the number of _____ in the text strings.
1 / 1 point
values
substrings
fields
characters
Correct
They use the LEN function to count the number of characters in a text string.
7.
Question 7
Spreadsheet cell E13 contains the text string “Database”. To return the substring “Data”, what is the
correct syntax?
1 / 1 point
=RIGHT(E13, 4)
=LEFT(E13, 4)
=RIGHT(4,E13)
=LEFT(4,E13)
Correct
The function =LEFT(E13, 4) will return “Data”. The LEFT function returns a set number of characters from the left side of a text string. In this case, it returns a four-character substring from the beginning of the string in E13, starting from the left.
8.
Question 8
When working with a spreadsheet, data analysts use the FIND function to locate specific characters
in a string. FIND is case-sensitive, so it’s necessary to input the substring exactly how it appears.
0 / 1 point
True
False
Incorrect
Review the video on strings in spreadsheets for a refresher.
VLOOKUP core concepts
Functions can be used to quickly find information and perform calculations using specific
values. In this reading, you will learn about the importance of one such function,
VLOOKUP, or Vertical Lookup, which searches for a certain value in a spreadsheet column
and returns a corresponding piece of information from the row in which the searched value
is found.
When do you need to use VLOOKUP?
Two common reasons to use VLOOKUP are:
Populating data in a spreadsheet
Merging data from one spreadsheet with data in another
VLOOKUP syntax
A VLOOKUP function is available in both Microsoft Excel and Google Sheets. You will be
introduced to the general syntax in Google Sheets. (You can refer to the resources at the
end of this reading for more information about VLOOKUP in Microsoft Excel.)
Here is the syntax.
=VLOOKUP(search_key, range, index, [is_sorted])

search_key
The value to search for. For example, 42, "Cats", or I24.

range
The range to consider for the search. The first column in the range is searched to locate data
matching the value specified by search_key.

index
The column index of the value to be returned, where the first column in range is numbered 1.
If index is not between 1 and the number of columns in range, #VALUE! is returned.

is_sorted
Indicates whether the column to be searched (the first column of the specified range) is sorted.
TRUE by default.
It’s recommended to set is_sorted to FALSE. If set to FALSE, an exact match is returned. If there
are multiple matching values, the content of the cell corresponding to the first value found is
returned, and #N/A is returned if no such value is found.
If is_sorted is TRUE or omitted, the nearest match (less than or equal to the search key) is
returned. If all values in the search column are greater than the search key, #N/A is returned.
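For example, a formula following this syntax might look like the following, where A2 holds the value
to look up, the first column of B2:D100 is searched for that value, and the third column of that range
supplies the value that gets returned (these cell references are just placeholders):

=VLOOKUP(A2, B2:D100, 3, FALSE)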
What if you get #N/A?
As you have just read, #N/A indicates that a matching value can't be returned as a result of
the VLOOKUP. The error doesn’t mean that anything is actually wrong with the data, but
people might have questions if they see the error in a report. You can use the IFNA function
to replace the #N/A error with something more descriptive, like “Does not exist.”
Here is the syntax.
=IFNA(value, value_if_na)

value
This is a required value. The function checks whether this value (for example, the result of a
VLOOKUP) is the #N/A error.

value_if_na
This is a required value. The function returns this value if the first argument evaluates to #N/A.
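As a quick sketch of how the two functions can work together (again using placeholder references),
you can wrap a VLOOKUP inside IFNA so that a failed lookup returns a friendlier message instead of
the #N/A error:

=IFNA(VLOOKUP(A2, B2:D100, 3, FALSE), "Does not exist")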
Helpful VLOOKUP reminders
TRUE means an approximate match, FALSE means an exact match on the
search key. If the data used for the search key is sorted, TRUE can be used.
You want the column that matches the search key in a VLOOKUP formula to be
on the left side of the data. VLOOKUP only looks at data to the right after a
match is found. In other words, the index for VLOOKUP indicates columns to
the right only. This may require you to move columns around before you use
VLOOKUP.
After you have populated data with the VLOOKUP formula, you may copy and
paste the data as values only to remove the formulas so you can manipulate the
data again.
VLOOKUP resources for Microsoft Excel
VLOOKUP may slightly differ in Microsoft Excel, but the overall concepts can still be
generally applied. Refer to the following resources if you are working with Excel.

How to use VLOOKUP in Excel: This tutorial includes a video to help you get a
general understanding of how the VLOOKUP function works in Excel, as well as
practical examples to look through.
VLOOKUP in Excel tutorial: Follow along in this video lesson and learn how to
write a VLOOKUP formula in Excel and master time-saving useful tips and
tricks.
23 things you should know about VLOOKUP in Excel: Explore this list of 23
VLOOKUP facts as well as challenges you might run into, and start to learn how
to master them.
How to use Excel's VLOOKUP function: This article shares a specific example
around how to apply VLOOKUP in your searches.
VLOOKUP in Excel vs Google Sheets: This guide offers a VLOOKUP comparison
of Excel and Google Sheets.
Optional: Upload the employee dataset to
BigQuery
The next video demonstrates how to use JOINS to merge and return data from two tables
based on a common attribute used in both tables.
If you would like to follow along with the instructor, you will need to log in to your
BigQuery account and upload the employee data provided as two CSV files. If you have
hopped around courses, Using BigQuery in the Prepare Data for Exploration course
covers how to set up a BigQuery account.
Prepare for the next video

First, download the CSV files from the attachments below:
Employees Table - Understanding JOINS (CSV file)
Departments Table - Understanding JOINS (CSV file)

Next, complete the following steps in your BigQuery console to upload the
employees and departments tables.
Step 1: Open your BigQuery console and click on the project you want to upload the data
to.
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your
project name and select Create dataset.
Step 3: Enter employee_data for the Dataset ID.
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.
Step 5: In the Explorer on the left, click to expand your project, and then click the
employee_data dataset you just created.
Step 6: Click the Actions icon (three vertical dots) next to employee_data and select Open.
Step 7: Click the blue + icon at the top right to open the Create table window.
Step 8: Under Source, for the Create table from selection, choose where the data will be
coming from.
Select Upload.
Click Browse to select the Employees Table CSV file you downloaded.
Choose CSV from the file format drop-down.
Step 9: For Table name, enter employees if you plan to follow along with the video.
Step 10: For Schema, click the Auto detect check box.
Step 11: Click Create table (blue button). You will now see the employees table under
your employee_data dataset in your project.
Step 12: Click the employee_data dataset again.
Step 13: Click the icon to open the Create table window again.
Step 14: Under Source, for the Create table from selection, choose where the data will be
coming from.
Select Upload.
Click Browse to select the Departments Table CSV file you downloaded.
Choose CSV from the file format drop-down.
Step 15: For Table name, enter departments if you plan to follow along with the video.
Step 16: For Schema, click the Auto detect check box.
Step 17: Click Create table (blue button). You will now see the departments table under
your employee_data dataset in your project.
Step 18: Click the employees table and click the Preview tab to verify that you have the
data shown below.
Step 19: Click the departments table and click the Preview tab to verify that you have the
data shown below.
If your data previews match, you are ready to follow along with the next video.
Hey, welcome back.
So far we've checked out a few different tools you
can use to aggregate data within spreadsheets.
In this video, we'll cover how to use
JOIN in SQL to aggregate data in databases.
First, I'll tell you
a little bit about what a JOIN actually is,
and then we'll explore some of
the most common JOINs in action. Let's get started.
JOIN is a SQL clause that's used to combine
rows from two or more tables based on a related column.
Basically, you can think of a JOIN as
a SQL version of VLOOKUP which we just covered.
There are four common JOINs data analysts use,
inner, left, right, and outer.
Here's a handy visualization of
what each JOIN actually does.
We'll use these to help us understand these functions.
JOINs help you combine
matching or related columns from different tables.
When we learned about relational databases,
we refer to these values as primary and foreign keys.
Primary keys reference columns in
which each value is unique to that table.
But that table can have
multiple foreign keys which are
primary keys in other tables.
For example, in a table about employees,
the employee ID is
a primary key and the office ID is a foreign key.
JOINs use these keys to
identify relationships and corresponding values.
An inner JOIN is a function that returns
records with matching values in both tables.
If we think about our tables as the circles of this Venn diagram,
then an inner JOIN would return the records
that exist where the tables are overlapping.
For the records to appear in the results table,
there have to be matching key values in both tables.
The records will only merge if there are
matches in both tables.
When we input JOIN into SQL,
it usually defaults to inner JOIN.
A lot of analysts will use JOIN
as shorthand instead of typing the whole query.
A LEFT JOIN is a function
that will return all the records
from the left table and
only the matching records from the right table.
Here's how you can figure out
which table is left or right.
In English and SQL we read from left to right.
The table mentioned first is
left and the table mentioned second is right.
You can also think of left as a table name to the left of
the JOIN statement and right as
a table name to the right of the JOIN statement.
In this diagram, you'll notice
that the entire left table is colored in,
and that's the overlap with
the right table which shows us that
the left table and the records it
shares with the right table are being selected.
Each row in the left table appears in
the results even if there are
no matches in the right table.
RIGHT JOIN does the opposite.
It will return all records from
the right table and only the
matching records from the left.
You can get the same results if you flip the order
of the tables and use a left JOIN.
For example, SELECT from table A,
LEFT JOIN table B is the same as SELECT from table B,
RIGHT JOIN table A.
Finally, there's OUTER JOIN.
OUTER join combines RIGHT and LEFT JOIN to
return all matching records in both tables.
This means it will return all records in both tables.
If there are records in one table without a match,
it'll create a record with no values for the other table.
Using JOINs can make working with
multiple data sources a lot easier and it can make
relationships between tables more
clear. Here's an example.
Let's say we're working with
employee data across multiple departments.
We have an employees table and
a departments table which both
have some columns like department ID.
We can use different JOIN clauses to help us
pull data from our tables and aggregate it.
Maybe we want to get a list of
employees with their department name,
excluding any employee without a department ID.
Because the department ID record is used in both tables,
we can use an INNER JOIN to return
a list with only those employees.
As a quick reminder,
analysts will sometimes just input JOIN for
an INNER JOIN but for this example, we'll write it out.
To build this query,
we'll start with SELECT and AS
to tell SQL how we want the columns titled.
Then we'll use FROM to
tell it where we're getting this data,
in this case the employees table.
Then we'll input INNER JOIN and
the other table we're using, which is departments.
We can specify which column and each table
will contain the matching JOIN key by writing
ON employees.department_id
equals departments.department_id.
Now, let's run it, and there.
Now we've got a list of employee names and
department IDs for the employees that have those IDs.
But we could use LEFT or RIGHT join to return
a list of all employee names
and their departments when available.
Let's try both really quickly.
This will start similar to the last query,
we'll put in SELECT AS and FROM again.
But this time we'll say LEFT JOIN
and use ON like we did with the last query.
When we execute the query,
we get back this new list with
the employee names and departments.
But you'll notice there's null values.
These are places where the right table which is
departments in this case
didn't have corresponding values.
Let's try RIGHT JOIN just to test it out.
This query will be almost the same.
Only difference is that we'll use
the RIGHT JOIN clause to return
all the rows from the right table,
whether they have matching values in
the table to the left of the JOIN statement or not.
In this case, the right table is departments.
Now, let's try out one last JOIN: OUTER.
OUTER JOIN will fetch all of
the employee names and departments.
Again, this query will
start a lot like the other ones we've done,
we'll use SELECT AS and FROM to
choose what data we want and how.
We'll grab this from the employees table,
and put FULL OUTER JOIN with
the departments table to get
all of the records from both.
We'll also use ON again here.
Now we can run this,
and we'll get all of the employee names
and departments from these tables.
There will be nulls in the
department.name column, the employee.name
column and role column because we've joined columns
that don't have matching values, and there.
Now you know how JOINs work.
JOINs are super useful when you need to work
with data from multiple related tables.
They give you a lot of flexibility with
how you combine and view that data.
If you ever have trouble remembering what INNER, RIGHT,
LEFT, or OUTER JOIN do,
just think back to our Venn diagram.
We'll keep learning about aggregating data in
SQL next time. See you soon.
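As a rough sketch of the INNER JOIN described in this video, assuming the employees and departments
tables live in the employee_data dataset and each has a name column:

SELECT
    employees.name AS employee_name,
    departments.name AS department_name
FROM
    employee_data.employees AS employees
INNER JOIN
    employee_data.departments AS departments
    ON employees.department_id = departments.department_id

Swapping INNER JOIN for LEFT JOIN, RIGHT JOIN, or FULL OUTER JOIN in the same query produces the
other result sets discussed in the video.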
Secret identities: The importance of aliases
In this reading, you will learn about using aliasing to simplify your SQL queries. Aliases are
used in SQL queries to create temporary names for a column or table. Aliases make
referencing tables and columns in your SQL queries much simpler when you have table or
column names that are too long or complex to make use of in queries. Imagine a table name
like special_projects_customer_negotiation_mileages. That would be difficult to retype
every time you use that table. With an alias, you can create a meaningful nickname that you
can use for your analysis. In this case “special_projects_customer_negotiation_mileages” can
be aliased to simply “mileage.” Instead of having to write out the long table name, you can
use a meaningful nickname that you decide.
Basic syntax for aliasing
Aliasing is the process of using aliases. In SQL queries, aliases are implemented by making
use of the AS command. The basic syntax for the AS command can be seen in the following
query for aliasing a table:
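For example, with placeholder names, aliasing a table looks like this:

FROM table_name AS alias_name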
Notice that AS is preceded by the table name and followed by the new nickname. It is a
similar approach to aliasing a column:
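Again with placeholder names:

SELECT column_name AS alias_name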
In both cases, you now have a new name that you can use to refer to the column or table
that was aliased.
Alternate syntax for aliases
If using AS results in an error when running a query because the SQL database you are
working with doesn't support it, you can leave it out. In the previous examples, the
alternate syntax for aliasing a table or column would be:


FROM table_name alias_name
SELECT column_name alias_name
The key takeaway is that queries can run with or without using AS for aliasing, but using AS
has the benefit of making queries more readable. It helps to make aliases stand out more
clearly.
Aliasing in action
Let’s check out an example of a SQL query that uses aliasing. Let’s say that you are working
with two tables: one of them has employee data and the other one has department data.
The FROM statement to alias those tables could be:
FROM work_day.employees AS employees
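The departments table can be aliased the same way; assuming it is named work_day.departments, that
part of the query could be:

FROM work_day.departments AS departments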
These aliases still let you know exactly what is in these tables, but now you don’t have to
manually input those long table names. Aliases can be really helpful for long, complicated
queries. It is easier to read and write your queries when you have aliases that tell you what
is included within your tables.
For more information
If you are interested in learning more about aliasing, here are some resources to help you
get started:



SQL Aliases: This tutorial on aliasing is a really useful resource to have when
you start practicing writing queries and aliasing tables on your own. It also
demonstrates how aliasing works with real tables.
SQL Alias: This detailed introduction to aliasing includes multiple examples.
This is another great resource to reference if you need more examples.
Using Column Aliasing: This is a guide that focuses on column aliasing
specifically. Generally, you will be aliasing entire tables, but if you find yourself
needing to alias just a column, this is a great resource to have bookmarked.
Using JOINs effectively
In this reading, you will review how JOINs are used and will be introduced to some
resources that you can use to learn more about them. A JOIN combines tables by using a
primary or foreign key to align the information coming from both tables in the combination
process. JOINs use these keys to identify relationships and corresponding values across
tables.
If you need a refresher on primary and foreign keys, refer to the glossary for this course, or
go back to Databases in data analytics.
The general JOIN syntax
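As a generic sketch with placeholder table and column names, a query with a JOIN looks something
like this:

SELECT
    table1.column_name,
    table2.column_name
FROM
    table1
JOIN
    table2
    ON table1.key_column = table2.key_column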
As you can see from the syntax, the JOIN statement is part of the FROM clause of the query.
JOIN in SQL indicates that you are going to combine data from two tables. ON in SQL
identifies how the tables are to be matched for the correct information to be combined
from both.
Type of JOINs
There are four general ways in which to conduct JOINs in SQL queries: INNER, LEFT,
RIGHT, and FULL OUTER.
The circles represent left and right tables, and where they are joined is highlighted in blue
Here is what these different JOIN queries do.
INNER JOIN
INNER is optional in this SQL query because it is the default as well as the most commonly
used JOIN operation. You may see this as JOIN only. INNER JOIN returns records if the data
lives in both tables. For example, if you use INNER JOIN for the 'customers' and 'orders'
tables and match the data using the customer_id key, you would combine the data for each
customer_id that exists in both tables. If a customer_id exists in the customers table but not
the orders table, data for that customer_id isn’t joined or returned by the query.
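Based on that description, the INNER JOIN could be written like this (the exact column names are
assumptions):

SELECT
    customers.customer_name,
    orders.product_id,
    orders.ship_date
FROM
    customers
INNER JOIN
    orders
    ON customers.customer_id = orders.customer_id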
The results from the query might look like the following, where customer_name is from the
customers table and product_id and ship_date are from the orders table:
customer_name            product_id    ship_date
Martin's Ice Cream       043998        202...
Beachside Treats         872012        202...
Mona's Natural Flavors   724956        202...
...etc.                  ...etc.       ...etc.
The data from both tables was joined together by matching the customer_id common to
both tables. Notice that customer_id doesn’t show up in the query results. It is simply used
to establish the relationship between the data in the two tables so the data can be joined
and returned.
LEFT JOIN
You may see this as LEFT OUTER JOIN, but most users prefer LEFT JOIN. Both are correct
syntax. LEFT JOIN returns all the records from the left table and only the matching records
from the right table. Use LEFT JOIN whenever you need the data from the entire first table
and values from the second table, if they exist. For example, in the query below, LEFT JOIN
will return customer_name with the corresponding sales_rep, if it is available. If there is a
customer who did not interact with a sales representative, that customer would still show
up in the query results but with a NULL value for sales_rep.
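A sketch of that query, assuming the sales representative data lives in a sales table that also has a
customer_id column:

SELECT
    customers.customer_name,
    sales.sales_rep
FROM
    customers
LEFT JOIN
    sales
    ON customers.customer_id = sales.customer_id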
The results from the query might look like the following where customer_name is from the
customers table and sales_rep is from the sales table. Again, the data from both tables was
joined together by matching the customer_id common to both tables even though
customer_id wasn't returned in the query results.
customer_name            sales_rep
Martin's Ice Cream       Luis Reyes
Beachside Treats         NULL
Mona's Natural Flavors   Geri Hall
...etc.                  ...etc.
RIGHT JOIN
You may see this as RIGHT OUTER JOIN or RIGHT JOIN. RIGHT JOIN returns all records
from the right table and the corresponding records from the left table. Practically speaking,
RIGHT JOIN is rarely used. Most people simply switch the tables and stick with LEFT JOIN.
But using the previous example for LEFT JOIN, the query using RIGHT JOIN would look like
the following:
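Under the same assumptions, the RIGHT JOIN version simply swaps the order of the tables:

SELECT
    customers.customer_name,
    sales.sales_rep
FROM
    sales
RIGHT JOIN
    customers
    ON sales.customer_id = customers.customer_id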
The query results are the same as the previous LEFT JOIN example.
customer_name            sales_rep
Martin's Ice Cream       Luis Reyes
Beachside Treats         NULL
Mona's Natural Flavors   Geri Hall
...etc.                  ...etc.
FULL OUTER JOIN
You may sometimes see this as FULL JOIN. FULL OUTER JOIN returns all records from the
specified tables. You can combine tables this way, but remember that this can potentially
be a large data pull as a result. FULL OUTER JOIN returns all records from both tables even
if data isn’t populated in one of the tables. For example, in the query below, you will get all
customers and their products’ shipping dates. Because you are using a FULL OUTER JOIN,
you may get customers returned without corresponding shipping dates or shipping dates
without corresponding customers. A NULL value is returned if corresponding data doesn’t
exist in either table.
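A sketch of that query, again assuming customers and orders tables that share a customer_id column:

SELECT
    customers.customer_name,
    orders.ship_date
FROM
    customers
FULL OUTER JOIN
    orders
    ON customers.customer_id = orders.customer_id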
The results from the query might look like the following.
customer_name            ship_date
Martin's Ice Cream       2021-02-23
Beachside Treats         2021-02-25
NULL                     2021-02-25
The Daily Scoop          NULL
Mountain Ice Cream       NULL
Mona's Natural Flavors   2021-02-28
...etc.                  ...etc.
For more information
JOINs are going to be useful for working with relational databases and SQL—and you will
have plenty of opportunities to practice them on your own. Here are a few other resources
that can give you more information about JOINs and how to use them:
SQL JOINs: This is a good basic explanation of JOINs with examples. If you need
a quick reminder of what the different JOINs do, this is a great resource to
bookmark and come back to later.
Database JOINs - Introduction to JOIN Types and Concepts: This is a really
thorough introduction to JOINs. Not only does this article explain what JOINs
are and how to use them, but it also explains the various scenarios in more
detail of when and why you would use the different JOINs. This is a great
resource if you are interested in learning more about the logic behind JOINing.
SQL JOIN Types Explained in Visuals: This resource has a visual
representation of the different JOINs. This is a really useful way to think about
JOINs if you are a visual learner, and it can be a really useful way to remember
the different JOINs.
SQL JOINs: Bringing Data Together One Join at a Time: Not only does this
resource have a detailed explanation of JOINs with examples, but it also

provides example data that you can use to follow along with their step-by-step
guide. This is a useful way to practice JOINs with some real data.
SQL JOIN: This is another resource that provides a clear explanation of JOINs
and uses examples to demonstrate how they work. The examples also combine
JOINs with aliasing. This is a great opportunity to see how JOINs can be
combined with other SQL concepts that you have been learning about in this
course.
Optional: Upload the warehouse dataset to
BigQuery
The next video demonstrates how to use COUNT and COUNT DISTINCT in SQL to count and
return the number of certain values in a dataset.
If you would like to follow along with the instructor, you will need to log in to your
BigQuery account and upload the warehouse data provided as two CSV files. If you have
hopped around courses, Using BigQuery in the Prepare Data for Exploration course
covers how to set up a BigQuery account.
Prepare for the next video

First, download the two CSV files from the attachments below:
Warehouse Orders - Warehouse (CSV file)
Warehouse Orders - Orders (CSV file)

Next, complete the following steps in your BigQuery console to upload the
Warehouse Orders dataset with the two Warehouse and Orders tables.
Step 1: Open your BigQuery console and click on the project you want to upload the data
to.
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next to your
project name and select Create dataset.
Step 3: In the upcoming video, the name "warehouse_orders" will be used for the dataset. If
you plan to follow along with the video, enter warehouse_orders for the Dataset ID.
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.
Step 5: In the Explorer on the left, click to expand your project, and then click the
warehouse_orders dataset you just created.
Step 6: Click the Actions icon (three vertical dots) next to warehouse_orders and
select Open.
Step 7: Click the blue + icon at the top right to open the Create table window.
Step 8: Under Source, for the Create table from selection, choose where the data will be
coming from.
Select Upload.
Click Browse to select the Warehouse Orders - Warehouse CSV file you
downloaded.
 Choose CSV from the file format drop-down.
Step 9: For Table name, enter Warehouse if you plan to follow along with the video.


Step 10: For Schema, click the Auto detect check box.
Step 11: Click Create table (blue button). You will now see the Warehouse table under
your warehouse_orders dataset in your project.
Step 12: Click the warehouse_orders dataset again.
Step 13: Click the icon to open the Create table window again.
Step 14: Under Source, for the Create table from selection, choose where the data will be
coming from.
Select Upload.
Click Browse to select the Warehouse Orders - Orders CSV file you
downloaded.
 Choose CSV from the file format drop-down.
Step 15: For Table name, enter Orders if you plan to follow along with the video.


Step 16: For Schema, click the Auto detect check box.
Step 17: Click Create table (blue button). You will now see the Orders table under your
warehouse_orders dataset in your project.
Step 18: Click the Warehouse table and click the Preview tab to verify that you have 10
rows of data.
Step 19: Click the Orders table and click the Preview tab to verify that you have the data
shown below.
If your data previews match, you are ready to follow along with the next video.
Hi, it's great to have you back. By now we've
discovered that spreadsheets and
SQL have a lot of tools in common.
Earlier in this program, we learned
about COUNT in spreadsheets.
Now it's time to look at similar tools in SQL:
COUNT and COUNT DISTINCT.
In this video, we'll talk about when you'd
use these queries and check out an example.
Let's get started. COUNT can be used to count
the total number of numerical values
within a specific range in spreadsheets.
COUNT in SQL does the same thing.
COUNT is a query that returns
the number of rows in a specified range,
but COUNT DISTINCT is a little different.
COUNT DISTINCT is a query that only
returns the distinct values in that range.
Basically, this means COUNT
DISTINCT doesn't count repeating values.
As a data analyst,
you'll use COUNT and COUNT
DISTINCT anytime you want to
answer questions about how many.
Like how many customers did this?
Or how many transactions were there this month?
Or how many dates are in this dataset?
And you'll use them throughout the data analysis process
at different stages.
For example, you might need them while you're cleaning
data to check how many rows are left in your dataset.
Or you might use COUNT and COUNT DISTINCT during
the actual analysis to answer a "how many" question.
You'll run into these kinds of questions a lot.
So COUNT and COUNT DISTINCT are really useful to know.
But let's check out an example to see
COUNT and COUNT DISTINCT in action.
For this example, we're working with
a company that manufactures socks.
We have two tables: Warehouse and Orders.
Let's take a quick look at
these tables before we start querying.
First, we'll check out the Warehouse table.
You can see the columns here:
warehouse ID, warehouse alias,
the maximum capacity, the total number of employees,
and the state the warehouse is located in.
We'll pull up the top 100 rows of the Orders table next.
We can use LIMIT here to limit the number of rows returned.
This is useful if you're working with large datasets,
especially if you just want to explore
a small sample of that dataset.
From this query, we're actually
going to start with a FROM
statement so that we can alias our tables.
Aliasing is when you temporarily name a table or
column in your query to make it easier to read and write.
Because these names are temporary,
they only last for the given query.
We can use our FROM statement to write in what
our tables' aliases are going to be to
save us some time in other parts of the query.
So we'll start with FROM and use aliasing to name
the Warehouse Orders table, just "orders."
Let's say we need both the warehouse details and
the order details because we want to
report on the distribution of orders by state.
We're going to JOIN
these two tables together since we want data from
both of them and alias
our warehouse table in the process.
In this case, we're using JOIN as shorthand for
INNER JOIN because we want
corresponding data from both tables.
And now that we have the aliases in place,
let's build out the SELECT statement
that comes before FROM.
Let's run that. And there.
Now we have data from both tables joined
together, and we know how to create these handy aliases.
Now, we want to count
how many states are in our ordered data.
To do that, we'll use COUNT and COUNT DISTINCT now.
We can try a simple COUNT query first.
We'll JOIN the Orders and
Warehouse tables in our FROM statement.
And in this case we'll start with
SELECT and COUNT the number of states.
Let's run this query and see what we get.
Wait, that's not quite right.
This query returned over 9,000 states
because we counted every single row
that included a state.
But we actually want to count the distinct states.
Let's try this again with COUNT DISTINCT.
This query is going to look similar to the last one,
but we'll use DISTINCT to cut out
the repeated instances we got the last time.
We'll use the query we just built,
but replace COUNT with COUNT
DISTINCT in our SELECT statement.
Let's try this query.
That's more like it.
According to these results,
we have three distinct states in our Orders data.
Let's check out what happens when we
group by the state column in the warehouse table,
which we'll call warehouse dot state.
We'll use JOIN and GROUP BY in our FROM statement.
Let's start there again.
Then GROUP BY warehouse state.
Now let's build out our SELECT statement on top of that.
We're still going to use COUNT DISTINCT. Let's run it.
Now we have three rows,
one of each state represented in the Orders data.
And our COUNT DISTINCT totals for the number of orders add up
to the count we ran earlier: 9,999.
You'll find yourself using COUNT and COUNT
DISTINCT during every stage of the data analysis process.
Understanding what these queries
are and how they are different is key.
Great job, and I'll see you again soon!
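As a rough sketch of the final query from this video, assuming the Orders table has order_id and
warehouse_id columns (the exact column names are assumptions):

SELECT
    warehouse.state AS state,
    COUNT(DISTINCT orders.order_id) AS number_of_orders
FROM
    warehouse_orders.orders AS orders
JOIN
    warehouse_orders.warehouse AS warehouse
    ON orders.warehouse_id = warehouse.warehouse_id
GROUP BY
    warehouse.state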
SQL functions and subqueries: A functional
friendship
In this reading, you will learn about SQL functions and how they are sometimes used with
subqueries. SQL functions are tools built into SQL to make it possible to perform
calculations. A subquery (also called an inner or nested query) is a query within another
query.
How do SQL functions, function?
SQL functions are what help make data aggregation possible. (As a reminder, data
aggregation is the process of gathering data from multiple sources in order to combine it
into a single, summarized collection.) So, how do SQL functions work? Going back to
W3Schools, let’s review some of these functions to get a better understanding of how to run
these queries:

SQL HAVING: This is an overview of the HAVING clause, including what it is
and a tutorial on how and when it works.
SQL CASE: Explore the usage of the CASE statement and examples of how it
works.
SQL IF: This is a tutorial of the IF function and offers examples that you can
practice with.
SQL COUNT: The COUNT function is just as important as all the rest, and this
tutorial offers multiple examples to review.
Subqueries - the cherry on top
Think of a query as a cake. A cake can have multiple layers contained within it and even
layers within those layers. Each of these layers are our subqueries, and when you put all of
the layers together, you get a cake (query). Usually, you will find subqueries nested in the
SELECT, FROM, and/or WHERE clauses. Because a subquery can appear in any of these clauses, there is
no single general syntax, but a basic subquery looks like this:
SELECT account_table.*
FROM (
    SELECT *
    FROM transaction.sf_model_feature_2014_01
    WHERE day_of_week = 'Friday'
) account_table
WHERE account_table.availability = 'YES'
You will find that, within the first SELECT clause is another SELECT clause. The second
SELECT clause marks the start of the subquery in this statement. There are many different
ways in which you can make use of subqueries, and resources referenced will provide
additional guidance as you learn. But first, let’s recap the subquery rules.
There are a few rules that subqueries must follow:
Subqueries must be enclosed within parentheses
A subquery can have only one column specified in the SELECT clause. But if you
want a subquery to compare multiple columns, those columns must be selected
in the main query.
Subqueries that return more than one row can only be used with multiple value
operators, such as the IN operator which allows you to specify multiple values
in a WHERE clause.
A subquery can’t be nested in a SET command. The SET command is used with
UPDATE to specify which columns (and values) are to be updated in a table.
Additional resources
The following resources offer more guidance into subqueries and their usage:
SQL subqueries: This detailed introduction includes the definition of a
subquery, its purpose in SQL, when and how to use it, and what the results will
be
 Writing subqueries in SQL: Explore the basics of subqueries in this interactive
tutorial, including examples and practice problems that you can work through
As you continue to learn more about using SQL, functions, and subqueries, you will realize
how much time you can truly save when memorizing these tips and tricks.

Welcome back!
One of the first calculations most kids learn how to do is counting.
Soon after, they learn adding, and that doesn't go away.
No matter what age we are, we're always counting or adding something,
whether it's change at the grocery store or measurements in a recipe.
Data analysts do a lot of counting and adding too.
And with the amount of data you'll come across as a data analyst,
you'll be grateful to have functions that can do the counting and adding for you.
So let's learn how these functions COUNTIF and
SUMIF can help you do calculations for your analysis more easily and accurately.
We'll start with the COUNTIF function.
You might remember COUNTIF from some of the earlier videos about data cleaning.
COUNTIF returns the number of cells that match a specified value.
Earlier, we showed how COUNTIF can be used to find and count errors in a data set.
Here we'll only be counting.
Just a reminder though, while we won't be actively searching for
errors in this video, you'll still want to watch out for
any data that doesn't look right when doing your own analysis.
As a data analyst, you'll look for and fix errors every step of the way.
For this example,
we'll look at a sample of data from an online kitchen supplies retailer.
Our stakeholders have asked us to answer a few questions about the data to understand
more about customer transactions, including the revenue they're bringing in.
We've added the questions we need to answer to the spreadsheet.
We'll set up a simple summary table,
which is a table used to summarize statistical information about data.
We'll use the questions to create the attributes for our table columns:
count, revenue total, and average revenue per transaction.
Each of our questions asks about transactions with one item or transactions
with more than one item, so those will be the observations for our rows.
We'll make Quantity the heading for our observations.
We'll also add borders to make the summary table nice and clear.
The first question asks, How many transactions include exactly one item?
To answer this, we'll add a formula using the COUNTIF function in cell G11.
We'll begin with an equal sign, COUNTIF, and an open parenthesis.
Column B has data about quantity.
So we'll select cells B3 through B50, followed by a comma.
Next, we need to tell the formula the value that we're looking for
in the cells we've selected.
We want to tell the data to count the number of transactions if they equal 1.
In this case, between quotation marks, we'll type an equal sign and
the number 1 because that's the exact value we need to count.
When we add a closed parenthesis and press enter, we get the total count for
transactions with only one item, which is 25.
We can follow the same steps to count values greater than one.
Play video starting at :3:40 and follow transcript3:40
But this time, because we only want values greater than 1,
we'll type a greater-than sign in our formula instead of an equal sign.
Getting this information helps us compare the data about quantity.
Okay, now we need to find out how much total revenue each transaction type
brought in.
Since the data isn't organized by quantity,
we'll use the SUMIF function to help us add the revenue for
transactions with one item and with more than one item separately.
SUMIF is a function that adds numeric data based on one condition.
Building a formula with SUMIF is a bit different than one with COUNTIF.
They both start the same way with an equal sign and the function, but
a SUMIF formula contains the range of cells to be evaluated by your criteria,
and the criteria.
In other words,
SUMIF has a list of cells to check based on the criteria you set in the formula.
Then the range where we want to add the numbers is placed in the formula if that
range is different from the range being evaluated.
There's commas between each of these parts.
Adding a space after each comma is optional.
So let's try this.
In cell H11, we'll type our formula.
The range to be evaluated is in column B, so we'll select those cells.
The condition we want the data to meet is for the values in
the column to be equal to one.
So we'll type a comma and then inside quotes an equal sign and the number one.
Then we'll select the range to be added based on whether the data from our first
range is equal to one.
This range is in column C, which lists the revenue for each transaction.
So every amount of revenue earned from a transaction with only one item will be
added together.
And there's our total.
Since this is revenue, we'll change the format of the number to currency, so
it shows up as dollars and cents.
So the transactions with exactly one item earned $1,555.00 in revenue.
Let's see how much the transactions with more than one item earned.
Okay, let's check out the results.
Just like with our COUNTIF examples, the second SUMIF formula will be the same
as the first, except for the condition, which will make it greater than one.
When we run the formula,
we discover that the revenue total is much higher, $4,735.00.
This makes sense,
since the revenue is coming from transactions with more than one item.
Good news.
To complete our objective, we'll do two more quick calculations.
First, we'll find the average revenue per transaction by dividing each total by
its count.
This will show our stakeholders how much of a difference there is
in revenue per transaction between one item and multiple item transactions.
This information could be useful for lots of reasons.
For example, figuring out whether to add a discount on purchases with more than one
item to encourage customers to buy more.
We'll put these calculations in the last column of our summary table.
You might remember that we use a slash in a formula as the operator for
division calculations.
The average revenue for transactions with one item is $62.20.
And the average revenue for transactions with more than one item is $205.87.
And that's it for our analysis.
Our summary table now gives the stakeholders and
team members a snapshot of the analysis that's easy to understand.
Our COUNTIF and SUMIF functions played a big role here.
Using these functions to complete calculations,
especially in large datasets, can help speed up your analysis.
They can also make counting and adding a little more interesting.
Nothing wrong with that.
And coming up,
we'll explore more functions to make your calculations run smoothly.
Bye for now.
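To recap the formulas used in this video as a rough sketch (the quantity data is in B3:B50, as selected
in the video, and the revenue data is in column C; the exact revenue range is an assumption):

=COUNTIF(B3:B50, "=1") counts transactions with exactly one item.
=COUNTIF(B3:B50, ">1") counts transactions with more than one item.
=SUMIF(B3:B50, "=1", C3:C50) adds the revenue from one-item transactions.
=SUMIF(B3:B50, ">1", C3:C50) adds the revenue from multiple-item transactions.

The average revenue per transaction is then each revenue total divided by its count, for example =H11/G11.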
Functions with multiple conditions
In this reading, you will learn more about conditional functions and how to construct
functions with multiple conditions. Recall that conditional functions and formulas perform
calculations according to specific conditions. Previously, you learned how to use functions
like SUMIF and COUNTIF that have one condition. You can use the SUMIFS and COUNTIFS
functions if you have two or more conditions. You will learn their basic syntax in Google
Sheets, and check out an example.
Refer to the resources at the end of this reading for information about similar functions in
Microsoft Excel.
SUMIF to SUMIFS
The basic syntax of a SUMIF function is: =SUMIF(range, criterion, sum_range)
The first range is where the function will search for the condition that you have set. The
criterion is the condition you are applying and the sum_range is the range of cells that will
be included in the calculation.
For example, you might have a table with a list of expenses, their cost, and the date they
occurred.
The expenses are in column A, prices in column B, and dates in column C, with headers in row 1:

Expense   Price    Date
Fuel      $48.00   12/14/2020
Food      $12.34   12/14/2020
Taxi      $21.57   12/14/2020
Coffee    $2.50    12/15/2020
Fuel      $36.00   12/15/2020
Taxi      $15.88   12/15/2020
Coffee    $4.15    12/15/2020
Food      $6.75    12/15/2020
You could use SUMIF to calculate the total price of fuel in this table, like this:
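One way to write that formula, using the cell ranges from the table above, is:

=SUMIF(A1:A9, "Fuel", B1:B9)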
But, you could also build in multiple conditions by using the SUMIFS function. SUMIF and
SUMIFS are very similar, but SUMIFS can include multiple conditions.
The basic syntax is: =SUMIFS(sum_range, criteria_range1, criterion1, [criteria_range2,
criterion2, ...])
The square brackets indicate that the additional criteria ranges and criteria are optional. The ellipsis
at the end of the statement lets you know that you can repeat these parameters as many times as
needed. For example, if you wanted to calculate the sum of the fuel costs for one date in this
table, you could create a SUMIFS statement with multiple conditions, like this:
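Following the syntax above, that statement could be:

=SUMIFS(B1:B9, A1:A9, "Fuel", C1:C9, "12/15/2020")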
This formula gives you the total cost of every fuel expense from the date listed in the
conditions. In this example, C1:C9 is our second criterion_range and the date 12/15/2020
is the second condition. As long as you follow the basic syntax, you can add up to 127
conditions to a SUMIFS statement!
COUNTIF to COUNTIFS
Just like the SUMIFS function, COUNTIFS allows you to create a COUNTIF function with
multiple conditions.
The basic syntax for COUNTIF is: =COUNTIF(range, criterion)
Just like SUMIF, you set the range and then the condition that needs to be met. For example,
if you wanted to count the number of times Food came up in the Expenses column, you
could use a COUNTIF function like this:
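For instance, using the Expense column range from the table above:

=COUNTIF(A1:A9, "Food")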
COUNTIFS has the same basic syntax as SUMIFS: =COUNTIFS(criteria_range1, criterion1,
[criteria_range2, criterion2, ...])
The criteria_range and criterion are in the same order, and you can add more conditions to
the end of the function. So, if you wanted to find the number of times Coffee appeared in the
Expenses column on 12/15/2020, you could use COUNTIFS to apply those conditions, like
this:
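One way to write it:

=COUNTIFS(A1:A9, "Coffee", C1:C9, "12/15/2020")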
This formula follows the basic syntax to create conditions for “Coffee” and the specific date.
Now we can find every instance where both of these conditions are true.
For more information
SUMIFS and COUNTIFS are just two examples of functions with multiple conditions. They
help demonstrate how multiple conditions can be built into the basic syntax of a function.
But, there are other functions with multiple conditions that you can use in your data
analysis. There are a lot of resources available online to help you get started with these
other functions:
How to use the Excel IFS function: This resource includes an explanation and
example of the IFS function in Excel. This is a great reference if you are
interested in learning more about IFS. The example is a useful way to
understand this function and how it can be used.
 VLOOKUP in Excel with multiple criteria: Similar to the previous resource,
this resource goes into more detail about how to use VLOOKUP with multiple
criteria. Being able to apply VLOOKUP with multiple criteria will be a useful
skill, so check out this resource for more guidance on how you can start using it
on your own spreadsheet data.
 INDEX and MATCH in Excel with multiple criteria: This resource explains
how to use the INDEX and MATCH functions with multiple criteria. It also
includes an example which helps demonstrate how these functions work with
multiple criteria and actual data.
 Using IF with AND, OR, and NOT functions in Excel: This resource combines
IF with AND, OR, and NOT functions to create more complex functions. By
combining these functions, you can perform your tasks more efficiently and
cover more criteria at once.
Welcome back. In the last video,
we created a pivot table of movie data and
revenue calculations to help our manager
think through new movie ideas.
We used our pivot table to make
some initial observations about annual revenue.
We also discovered that the average revenue for 2015
was lower than other years even
though more movies were released that year.
We hypothesized that this was because more movies
that earn less than $10 million
in revenue were released in 2015.
To test this theory,
we created a copy of our original pivot table.
Now we are going to apply filters in
calculated fields to explore the data more.
Let's get started.
You all remember that the filter option
lets us view only the values we need.
We'll select a cell in
our copied pivot table and add
a filter to the box office revenue column.
The filter will then be applied to the entire table.
When we open the status menu,
we can choose to filter the data to show specific values.
But in our case, we want to filter by condition so we can
figure out how many movies
in each year earn less than $10 million.
The condition we'll use in
our filter is less than and our value will
be $10 million which is why
we renamed these columns earlier.
We'll type our number in a dollar and cents format
so the condition matches the data in our pivot table.
This might not be necessary,
but it prevents potential errors from happening.
Now we know that 20 movies released in
2015 made less than $10 million.
This seems like a high number
compared to the other years.
But keep in mind,
there were more movies from our data
set released in 2015.
Before we move on,
let's use a calculated field to verify our average
because it was copied from
another pivot table before we filtered it.
That way we can check that it's correct.
We'll create a customized column called
a calculated field using our values menu.
A calculated field is
a new field within a pivot table that carries
out certain calculations based
on the values of other fields.
You can do this in Excel too using
field settings and the create formula menu.
For the formula in our calculated field,
we'll use the sum function and divide the sum of
the box office revenue data from
our original table by the count of the same data.
Because we applied our filter
to this pivot table earlier,
this formula will only return
the average revenue of movies under $10 million.
That worked. We were able to check
the accuracy of some of our data before analyzing it.
Always a good thing.
But it's still difficult to tell how much of an impact
these lower earning movies had on the average revenue.
Let's run a quick formula to find the percentage of
movies for each year that earned less than $10 million.
This will make it easier to compare from year to year.
Instead of a calculated field,
we'll add this as a formula in a new column,
that way we can pull data from both of our pivot tables.
We'll put a header for our table in
cell G10 and name it percent of total movies.
Then we'll add our formula to
the next cell in the column.
Divide the number of movies in
the copy table by the number
of movies in the original table.
Then we'll use the fill handle in the cell with
a formula and drag it to apply
the formula to the rest of the years.
Finally, we'll format these numbers as percentages.
Now our analysis shows that
16 percent of the movies released in
2015 earned less than $10 million of revenue.
The other years are all close to 10 percent.
This is one possible explanation for why
the average revenue is comparatively low in 2015.
In real life, we'd most likely need to take
our analysis even further depending on our goals.
But for now, we're all set.
You've learned how you can use
pivot tables to perform data calculations.
It will take practice,
but pivot tables are worth it
because they do more than calculate.
They organize and filter data too.
Together we've covered functions,
formulas, and pivot tables.
All great tools to use in analysis.
With practice and experience,
it will feel like you've used them forever.
Just take your time getting to know how they work.
Keep exploring these videos and the readings. Great work.
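As a rough sketch of the two calculations from this video (the field name is an assumption that depends
on how your own pivot table is set up), the calculated field for average revenue divides the summed
revenue by its count:

=SUM('Box Office Revenue') / COUNT('Box Office Revenue')

The percent-of-total column then divides each year's count of movies earning less than $10 million
(from the copied, filtered pivot table) by that year's total movie count (from the original pivot
table), formatted as a percentage.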
Elements of a pivot table
Previously, you learned that a pivot table is a tool used to sort, reorganize, group, count,
total, or average data in spreadsheets. In this reading, you will learn more about the parts
of a pivot table and how data analysts use them to summarize data and answer questions
about their data.
Pivot tables make it possible to view data in multiple ways in order to identify insights and
trends. They can help you quickly make sense of larger data sets by comparing metrics,
performing calculations, and generating reports. They’re also useful for answering specific
questions about your data.
A pivot table has four basic parts: rows, columns, values, and filters.
The rows of a pivot table organize and group data you select horizontally. For example, in
the Working with pivot tables video, the Release Date values were used to create rows that
grouped the data by year.
The columns organize and display values from your data vertically. Similar to rows,
columns can be pulled directly from the data set or created using values. Values are used
to calculate and count data. This is where you input the variables you want to measure.
This is also how you create calculated fields in your pivot table. As a refresher, a calculated
field is a new field within a pivot table that carries out certain calculations based on the
values of other fields.
In the previous movie data example, the Values editor created columns for the pivot table,
including the SUM of Box Office Revenue, the AVERAGE of Box Office Revenue, and the
COUNT of Box Office Revenue columns.
Finally, the filters section of a pivot table enables you to apply filters based on specific
criteria — just like filters in regular spreadsheets! For example, a filter was added to the
movie data pivot table so that it only included movies that generated less than $10 million
in revenue.
Being able to use all four parts of the pivot table editor will allow you to compare different
metrics from your data and execute calculations, which will help you gain valuable insights.
Using pivot tables for analysis
Pivot tables can be a useful tool for answering specific questions about a dataset so you can
quickly share answers with stakeholders. For example, a data analyst working at a
department store was asked to determine the total sales for each department and the
number of products they each sold. They were also interested in knowing exactly which
department generated the most revenue.
Instead of making changes to the original spreadsheet data, they used a pivot table to
answer these questions and easily compare the sales revenue and number of products sold
by each department.
They used the department as the rows for this pivot table to group and organize the rest of
the sales data. Then, they input two Values as columns: the SUM of sales and a count of the
products sold. They also sorted the data by the SUM of sales column in order to determine
which department generated the most revenue.
Now they know that the Toys department generated the most revenue!
Pivot tables are an effective tool for data analysts working with spreadsheets because they
highlight key insights from the spreadsheet data without having to make changes to the
spreadsheet. Coming up, you will create your own pivot table to analyze data and identify
trends that will be highly valuable to stakeholders.
In this reading, you will learn how to create and use pivot tables for data analysis. You will
also get some resources about pivot tables that you can save for your own reference when
you start creating pivot tables yourself. Pivot tables are a spreadsheet tool that let you
view data in multiple ways to find insights and trends.
Pivot tables allow you to make sense of large data sets by giving you tools to easily
compare metrics, quickly perform calculations, and generate readable reports. You can
create a pivot table to help you answer specific questions about your data. For example, if
you were analyzing sales data, you could use pivot tables to answer questions like, “Which
month had the most sales?” and “What products generated the most revenue this year?”
When you need answers to questions about your data, pivot tables can help you cut
through the clutter and focus on only the data you need.
Create your pivot table
Before you can analyze data with pivot tables, you will need to create a pivot table with
your data. The following includes the steps for creating a pivot table in Google Sheets, but
most spreadsheet programs will have similar tools.
First, you will open the Data menu from the toolbar; there will be an option for Pivot table.
A pop-up menu will appear with the option to insert the pivot table into a New sheet or an Existing
sheet, along with a Create button.
Generally, you will want to create a new sheet for your pivot table to keep your raw data
and your analysis separate. You can also store all of your calculations in one place for easy
reference. Once you have created your pivot table, there will be a pivot table editor that you
can access to the right of your data.
This is where you will be able to customize your pivot table, including what variables you
want to include for your analysis.
Using your pivot table for analysis
You can perform a wide range of analysis tasks with your pivot tables to quickly draw
meaningful insights from your data, including performing calculations, sorting, and filtering
your data. Below is a list of online resources that will help you learn about performing basic
calculations in pivot tables as well as resources for learning about sorting and filtering data
in your pivot tables.
Perform calculations
Microsoft Excel
Calculate values in a pivot table: Microsoft Support’s introduction to calculations in Excel pivot
tables. This is a useful starting point if you are learning how to perform calculations with pivot
tables specifically in Excel.
Pivot table calculated field example: This resource includes a detailed example of a pivot table being
used for calculations. This step-by-step process demonstrates how calculated fields work, and provides
you with some idea of how they can be used for analysis.
Pivot table calculated fields: step-by-step tutorial: This tutorial for creating your own calculated
fields in pivot tables is a really useful resource to save and bookmark for when you start to apply
calculated fields to your own spreadsheets.

Google Sheets
Create and use pivot tables: This guide covers pivot tables in Google Sheets, including creating
calculated fields. It is a useful resource to save and reference as a quick reminder about calculated
fields.
All about calculated field in pivot tables: This is a comprehensive guide to calculated fields. If you
are working with Sheets and are interested in learning more about pivot tables, this is a good place
to start.
Pivot tables in Google Sheets: This resource covers the basics of pivot tables and calculations in
Sheets, and uses examples and how-to videos to demonstrate these concepts.
Sort your data
Microsoft Excel
Google Sheets
Sort data in a pivot table or PivotChart: This is a Microsoft
Support how-to guide to sorting data in pivot tables. This is a
useful reference if you are working with Excel and are
interested in checking out how filtering will appear in Excel
specifically.
Customize a pivot table: This guid
focuses on sorting pivot tables in Go
quick reference if you are working o
need a step-by-step guide.
Pivot tables- Sorting data: This tutorial for sorting data in
pivot tables includes an example with real data that
demonstrates how sorting in Excel pivot tables works. This
example is a great way to experience the entire process from
start to finish.
How to sort pivot table columns:
data to demonstrate how the sortin
pivot tables will work. This is a grea
slightly more detailed guide with sc
environment.
How to sort a pivot table by value: This source uses an
example to explain sorting by value in pivot tables. It includes
a video, which is a useful guide if you need a demonstration of
the process.
Pivot table ascending and descen
beginner’s guide is a great way to br
tables if you are interested in a quic
Filter your data
Microsoft Excel
Google Sheets
Filter data in a pivot table: This resource from the
Microsoft Support page provides an explanation of filtering
data in pivot tables in Excel. If you are working in Excel
spreadsheets, this is a great resource to have bookmarked for
quick reference.
Customize a pivot table: This is the
filtering pivot table data. This is a use
working with pivot tables in Google S
resource to review the process.
Microsoft Excel
Google Sheets
How to filter Excel pivot table data: This how-to guide for
filtering data in pivot tables demonstrates the filtering
process in an Excel spreadsheet with data and includes tips
and reminders for when you start using these tools on your
own.
Filter multiple values in pivot table
about how to filter for multiple value
This resource expands some of the fu
already learned and sets you up to cr
Google Sheets.
Format your data
Microsoft Excel
Google Sheets
Design the layout and format of a PivotTable: This Microsoft Support
Create and edit pivot t
article describes how to change the format of the PivotTable by applying a
article provides informa
predefined style, banded rows, and conditional formatting.
table to change its style,
Pivot tables are a powerful tool that you can use to quickly perform calculations and gain
meaningful insights into your data directly from the spreadsheet file you are working in! By
using pivot table tools to calculate, sort, and filter your data, you can immediately make
high-level observations about your data that you can share with stakeholders in reports.
But, like most tools we have covered in this course, the best way to learn is to practice. This
was just a small taste of what you can do with pivot tables, but the more you work with
pivot tables, the more you will discover.
Optional: Upload the avocado dataset to BigQuery
Using public datasets is a great way to practice working with SQL. Later in the course, you are going
to use historical data on avocado prices to perform calculations in BigQuery. This is a step-by-step
guide to help you load this data into your own BigQuery console so that you can follow along with
the upcoming video.
If you have hopped around courses, Using BigQuery in the Prepare Data for Exploration course covers
how to set up a BigQuery account.
Step 1: Download the CSV file from Kaggle
Avocado prices: The publicly available avocado dataset from Kaggle you are going to use (made
available by Justin Kiggins under an Open Data Commons license).
You can download this data onto your own device and then upload it to BigQuery. There are also other
public datasets on Kaggle that you can download and use. You can follow these steps to load them
into your console and practice on your own!
Screenshot of Kaggle dataset page. There is a header titled Avocado Prices
You will find some more information about the avocado dataset, including the context, content, and
original source on this page. For now, you can simply download the file.
Step 2: Open your BigQuery console and create a new dataset
Open BigQuery. After you have downloaded the dataset from Kaggle, you can upload it to your
BigQuery console.
In the Explorer on the left side of your console, click the project where you want to add a dataset - note
that your project will not be named the same as the one in the example ("oval-flow-286322"). Don't
choose "bigquery-public-data" as your project because that's a public project that you can't change.
Click the Actions icon (three vertical dots) next to your project and select Create dataset.
Here, you will name the dataset; in this case, enter avocado_data. Then, click Create dataset (blue
button) at the bottom to create your new dataset. This will add data in the Explorer on the left of your
console.
Screenshot of Create Dataset menu
Step 3: Open the new dataset and create a new table
Navigate to the dataset in your console by clicking to expand your project and selecting the correct
dataset listed. In this case, it will be avocado_data.
Screenshot of the Create table menu
Click the Actions icon (three vertical dots) next to your dataset and select Open. Then click the + icon
to create a table.
Next, do the following:
Under Source, for the Create table from selection, select Upload.
Click Browse to select the CSV file you just downloaded to your computer from Kaggle. The file
format should automatically change from Avro to CSV when you select the file.
For Table Name, enter avocado_prices for the table.
For Schema, click the Auto detect check box. Then, click Create table (blue button).
Screenshot of the second Create table menu. There are options to upload data, name a project, name the table, and more.
In the Explorer, the avocado data will appear in the table under the dataset you created. Now you are
ready to follow along with the video and learn more about performing calculations with queries!
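Before moving on, you might run a quick query to confirm the upload worked. A minimal check, assuming the dataset and table names used above (avocado_data and avocado_prices):

  -- Preview a few rows to confirm the data loaded correctly.
  SELECT *
  FROM avocado_data.avocado_prices
  LIMIT 10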
Further reading
Introduction to loading data: This step-by-step guide is a useful resource that you can bookmark and
save for later. You can refer to it the next time you need to load data into BigQuery.
Hi, again.
Earlier, we covered data validation, a spreadsheet function that adds
drop-down lists to cells.
Using data validation lets you control what can and
can't be entered into your worksheet.
One of its uses is protecting structured data and formulas in your spreadsheets.
But as useful as it is,
the data validation function is just one part of a larger data validation process.
This process involves checking and rechecking the quality of your data so
that it is complete, accurate, secure, and consistent.
While the data validation process is a form of data cleaning,
you should use it throughout your analysis.
If this all sounds familiar to you, that's good.
Ensuring you have good data is super important.
And in my opinion, it's kind of fun because you
can pair your knowledge of the business with your technical skills.
This will help you understand your data, check that it's clean, and
make sure you're aligning with your business objectives.
In other words, it's what you do to make sure your data makes sense.
Keep in mind, you'll build your business knowledge with time and experience.
And here's a pro tip.
Asking as many questions as possible whenever you need to will make this much
easier.
Okay, let's say we're analyzing some data for a furniture retailer.
We want to check that the values in the purchase price column are always equal
to the number of items sold times the product price.
So we'll add a formula in a new column to recalculate the purchase prices using
a multiplication formula.
Now, comparing the totals, there's at least one value
that doesn't match the value in the purchase price column.
We need to find an answer to help us move forward with our analysis.
By doing some research and asking questions, we find that there's a discount
of 30% when customers buy five or more of certain items.
If we hadn't run this check, we could have missed this completely.
You've learned that as an analyst, calculations are a big part of your job.
So it's important that whenever you do calculations,
you always check to make sure you've done them in the right way.
Sometimes you'll run data validation checks that are common-sense checks.
For example, let's say you're working on an analysis to
figure out the effectiveness of in-store promotions for
a business that's only open on weekdays.
You check to make sure that there's no sales data for Saturdays and Sundays.
If your data does show sales on weekends,
it might not be a problem with the data itself.
It might not even be a problem at all.
There might be a good reason.
Maybe your business hosts special events on Saturdays and Sundays.
Then you would have sales for those weekends.
You still might want to leave out the weekend sales in your analysis if your
objective is only to look at the weekdays.
But doing this data validation might save you from miscalculations and
other errors in your analysis.
You should always do data validation no matter what analysis tool you're using.
In an earlier video, we used SQL to analyze some data about avocados.
One of the queries was a check to make sure the data showing the total number
of bags was the sum of small, large, and extra-large bags.
By running this query,
we were able to determine that the total number column was accurate.
We compared our two columns briefly in that video.
But to be absolutely sure that there are no issues with the data
values in those columns, we could have also run another query.
In this query, we would select all using the asterisk, and FROM
the avocado prices data set.
In our WHERE clause, we'd also type out where our calculated total does not
equal the total bags column.
If no values are returned,
we can be sure that the values in the Total Bags column are accurate.
And that led us to continue our analysis.
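As a rough sketch, that validation query could look something like this in BigQuery. The dataset, table, and column names here (avocado_data.avocado_prices, Small_Bags, Large_Bags, XLarge_Bags, and Total_Bags) assume the avocado table created earlier; yours may differ slightly:

  -- Returns any rows where the bag counts do not add up to the stated total.
  SELECT *
  FROM avocado_data.avocado_prices
  WHERE (Small_Bags + Large_Bags + XLarge_Bags) != Total_Bags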
But when we tried to find what percent of the total number of bags was small,
we ran into a small problem.
We received an error message about dividing by zero.
We fixed that error by adjusting our query.
If we had linked that query to a presentation that went to our stakeholders,
it would show them the divide by zero error instead of the figures we wanted.
By building in these types of checks as part of your data validation process,
you can avoid errors in your analysis and
complete your business objectives to make everyone happy.
And trust me. It's a great feeling when you do.
And another great feeling is knowing that you've made it through another video and
learned something new.
And we have more where that came from coming soon.
See you.
Types of data validation
This reading describes the purpose, examples, and limitations of six types of data
validation. The first five are validation types associated with the data (type, range,
constraint, consistency, and structure) and the sixth type focuses on the validation of
application code used to accept data from user input.
As a junior data analyst, you might not perform all of these validations. But you could ask if
and how the data was validated before you begin working with a dataset. Data validation
helps to ensure the integrity of data. It also gives you confidence that the data you are using
is clean. The following list outlines six types of data validation and the purpose of each, and
includes examples and limitations.



Data type
Purpose: Check that the data matches the data type defined for a field.
Example: Data values for school grades 1-12 must be a numeric data type.
Limitations: The data value 13 would pass the data type validation but would be an unacceptable value. For this case, data range validation is also needed.
Data range
Purpose: Check that the data falls within an acceptable range of values defined for the field.
Example: Data values for school grades should be values between 1 and 12.
Limitations: The data value 11.5 would be in the data range and would also pass as a numeric data type. But, it would be unacceptable because there aren't half grades. For this case, data constraint validation is also needed.
Data constraint
Purpose: Check that the data meets certain conditions or criteria for a field. This includes the type of data entered as well as other attributes of the field, such as number of characters.
Example: Content constraint: Data values for school grades 1-12 must be whole numbers.
Limitations: The data value 13 is a whole number and would pass the content constraint validation. But, it would be unacceptable since 13 isn't a recognized school grade. For this case, data range validation is also needed.
Data consistency
Purpose: Check that the data makes sense in the context of other related data.
Example: Data values for product shipping dates can't be earlier than product production dates.
Limitations: Data might be consistent but still incorrect or inaccurate. A shipping date could be later than a production date and still be wrong.
Data structure
Purpose: Check that the data follows or conforms to a set structure.
Example: Web pages must follow a prescribed structure to be displayed properly.
Limitations: A data structure might be correct with the data still incorrect or inaccurate. Content on a web page could be displayed properly and still contain the wrong information.
Code validation
Purpose: Check that the application code systematically performs any of the previously mentioned validations during user data input.
Example: Common problems discovered during code validation include: more than one data type allowed, data range checking not done, or ending of text strings not well defined.
Limitations: Code validation might not validate all possible variations with data input.
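Several of these checks can also be expressed as queries. As a minimal sketch of a data range check, assuming a hypothetical table named student_records with a numeric grade column, any rows returned by the following query would fail the validation:

  -- Rows returned here violate the 1-12 grade range.
  SELECT *
  FROM student_records
  WHERE grade < 1 OR grade > 12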
Hello again.
Now, if you're like me, you always have sticky notes available nearby to write
a reminder or figure out a quick math problem.
Sticky notes are useful and important, but they're also disposable since you usually
only need them for a short time before you recycle them.
Data analysts have their own version of sticky notes when they're working in SQL.
They're called temporary tables and we're here to find out what they're all about.
A temporary table is a database table that is created and
exists temporarily on a database server.
Temp tables as we call them store subsets of data from standard data tables for
a certain period of time.
Then they're automatically deleted when you end your SQL database session.
Since temp tables aren't stored permanently, they're useful when you only
need a table for a short time to complete analysis tasks, like calculations.
For example, you might have a lot of tables you're performing calculations on
at the same time.
If you have a query that needs to join seven or eight of them,
you could join the two or three tables having the fewest number of rows and
store their output in a temp table.
You could then join this temp table to one of the other bigger tables.
Another example is when you have lots of different databases you're running
queries
on. You can run these initial queries in each separate database, and
then use a temp table to collect the results of all of these queries.
The final report query would then run on the temporary table.
You might not be able to make use of this reporting structure without
temporary tables.
They're also useful if you've got a large number of records in a table and
you need to work with a small subset of those records repeatedly to complete
some calculations or other analysis.
So instead of filtering the data over and over to return the subset,
you can filter the data once and store it in a temporary table.
Then you can run your queries using a temporary table you've created.
Imagine that you've been asked to analyze data about the bike sharing system we
looked at earlier. You only need to analyze the data for bike trips that were
60 minutes or longer, but you have several questions to answer about that specific data.
Using a temporary table will let you run several queries about this data without
having to keep filtering it.
There's different ways to create temporary tables in SQL,
depending on the relational database management system you're using.
We'll explore some of these options soon.
For this scenario we'll use BigQuery. We'll apply a WITH clause to our query.
The WITH clause is a type of temporary table that you can query from
multiple times.
The WITH clause approximates a temporary table.
Basically, this means it creates something that does the same thing as a temporary
table. Even if it doesn't add a table to the database you're working in for
others to see, you can still see your results and
anyone who needs to review your work can see the code that led to your results.
Let's get this query started.
We'll start this query with the WITH command.
We'll then name our temp table trips, underscore, over,
underscore, 1, underscore, hr.
Then we'll type the AS command and an open parenthesis.
On a new line, we'll use the SELECT- FROM-WHERE structure for our subquery.
We'll type SELECT followed by an asterisk.
You might remember the asterisk means you're selecting all the columns in
the table.
Now we'll type the FROM command and
name the database that we're pulling from bigquery, dash, public, dash,
data, dot, new, underscore, york, dot, citibike, underscore, trips.
Next, we'll add a WHERE clause with the condition that the length of the bike
trips we need in our temp table are greater than or equal to 60 minutes.
In the query it goes like this:
trip duration, space, greater than sign, equal sign, space, 60.
Finally, we'll add a close parenthesis on a new line to end our subquery.
And that sets up our temporary table.
Now we can run queries that'll only return results for
trips that lasted 60 minutes or longer.
Let's try one.
Since we're working in our version of a temp table,
we don't need to open a new query.
Instead, we'll label our queries before we add our code to describe what we're
doing.
For this query, we'll type two hashtags.
This tells the server that this is a description and not part of the code.
Next, we'll add the query description.
Count how many trips are 60 plus minutes long.
And then we'll add our query: SELECT, then on a new line COUNT with an asterisk
in parentheses, then AS followed by cnt to name the column with our COUNT.
Next we'll add FROM and the name we're using for
our version of a temporary table: trips over one hour.
When we run our query, the results show the total number of bike trips
from the dataset that lasted 60 minutes or longer.
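Putting the steps from this walkthrough together, the full statement looks roughly like the sketch below. It assumes the trip duration column is named tripduration, as it is in the public Citi Bike table; adjust the name if your copy of the data differs:

  WITH trips_over_1_hr AS (
    SELECT *
    FROM `bigquery-public-data.new_york.citibike_trips`
    WHERE tripduration >= 60
  )
  ## Count how many trips are 60+ minutes long
  SELECT COUNT(*) AS cnt
  FROM trips_over_1_hr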
We can keep running queries on this temp table over and
over as long as we're looking to analyze bike trips that were 60 minutes and over.
And if you need to end your session and start a new runtime later,
most servers store the code used in temp tables.
You'll just need to recreate the table by running the code.
When you use temporary tables, you make your own work more efficient. Naming
and
using temp tables can help you deal with a lot of data in a more streamlined way,
so you don't get lost repeating query after query with the same code that you
could just include in a temp table.
And here's another bonus to using temp tables:
they can help your fellow team members too.
With temp tables your code is usually less complicated and easier to read and
understand which your team will appreciate!
Once you start to explore temporary tables on your own,
you might not be able to stop.
Don't say I didn't warn you.
Coming up, we'll explore even more things you can do with temp tables.
See you soon.
Working with temporary tables
Temporary tables are exactly what they sound like—temporary tables in a SQL database
that aren’t stored permanently. In this reading, you will learn the methods to create
temporary tables using SQL commands. You will also learn a few best practices to follow
when working with temporary tables.
A quick refresher on what you have already learned about
temporary tables




They are automatically deleted from the database when you end your SQL
session.
They can be used as a holding area for storing values if you are making a series
of calculations. This is sometimes referred to as pre-processing of the data.
They can collect the results of multiple, separate queries. This is sometimes
referred to as data staging. Staging is useful if you need to perform a query on
the collected data or merge the collected data.
They can store a filtered subset of the database. You don’t need to select and
filter the data each time you work with it. In addition, using fewer SQL
commands helps to keep your data clean.
It is important to point out that each database has its own unique set of commands to
create and manage temporary tables. We have been working with BigQuery, so we will
focus on the commands that work well in that environment. The rest of this reading will go
over the ways to create temporary tables, primarily in BigQuery.
Temporary table creation in BigQuery
Temporary tables can be created using different clauses. In BigQuery, the WITH clause can
be used to create a temporary table. The general syntax for this method is as follows:
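A minimal sketch of that syntax, using placeholder names (new_table_data, source_table, and condition) that you would replace with your own table and filter, followed by whatever query you want to run on the filtered data:

  WITH new_table_data AS (
    SELECT *
    FROM source_table    -- placeholder: the existing table you are filtering
    WHERE condition      -- placeholder: the filter that defines your subset
  )
  SELECT *
  FROM new_table_data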
Breaking down this query a bit, notice the following:
The statement begins with the WITH clause followed by the name of the new temporary table you want to create.
The AS clause appears after the name of the new table. This clause instructs the database to put all of the data identified in the next part of the statement into the new table.
The opening parenthesis after the AS clause creates the subquery that filters the data from an existing table. The subquery is a regular SELECT statement along with a WHERE clause to specify the data to be filtered.
The closing parenthesis ends the subquery created by the AS clause.
When the database executes this query, it will first complete the subquery and assign the values that result from that subquery to “new_table_data,” which is the temporary table. You can then run multiple queries on this filtered data without having to filter the data every time.

Temporary table creation in other databases (not supported in
BigQuery)
The following method isn’t supported in BigQuery, but most other versions of SQL
databases support it, including SQL Server and MySQL. Using SELECT and INTO, you can
create a temporary table based on conditions defined by a WHERE clause to locate the
information you need for the temporary table. The general syntax for this method is as
follows:
SELECT *
INTO AfricaSales
FROM GlobalSales
WHERE Region = 'Africa'
This SELECT statement uses the standard clauses like FROM and WHERE, but the INTO
clause tells the database to store the data that is being requested in a new temporary table
named, in this case, “AfricaSales.”
User-managed temporary table creation
So far, we have explored ways of creating temporary tables that the database is responsible
for managing. But, you can also create temporary tables that you can manage as a user. As
an analyst, you might decide to create a temporary table for your analysis that you can
manage yourself. You would use the CREATE TABLE statement to create this kind of
temporary table. After you have finished working with the table, you would then delete or
drop it from the database at the end of your session.
Note: BigQuery uses CREATE TEMP TABLE instead of CREATE TABLE, but the general
syntax is the same.
CREATE TABLE table_name (
  column1 datatype,
  column2 datatype,
  column3 datatype,
  ....
)
After you have completed working with your temporary table, you can remove the table
from the database using the DROP TABLE clause. The general syntax is as follows:
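  DROP TABLE table_name  -- replace table_name with the name of your temporary table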
Best practices when working with temporary tables


Global vs. local temporary tables: Global temporary tables are made
available to all database users and are deleted when all connections that use
them have closed. Local temporary tables are made available only to the user
whose query or connection established the temporary table. You will most
likely be working with local temporary tables. If you have created a local
temporary table and are the only person using it, you can drop the temporary
table after you are done using it.
Dropping temporary tables after use: Dropping a temporary table is a little
different from deleting a temporary table. Dropping a temporary table not only
removes the information contained in the rows of the table, but removes the
table variable definitions (columns) themselves. Deleting a temporary table
removes the rows of the table but leaves the table definition and columns ready
to be used again. Although local temporary tables are dropped after you end
your SQL session, it may not happen immediately. If a lot of processing is
happening in the database, dropping your temporary tables after using them is
a good practice to keep the database running smoothly.
For more information
BigQuery Documentation for Temporary Tables: Documentation has the syntax to create temporary tables in BigQuery.
How to use temporary tables via WITH in Google BigQuery: Article describes how to use WITH.
Introduction to Temporary Tables in SQL Server: Article describes how to use SELECT INTO and CREATE TABLE.
SQL Server Temporary Tables: Article describes temporary table creation and removal.
Choosing Between Table Variables and Temporary Tables: Article describes the differences between passing variables in SQL statements vs. using temporary tables.
Welcome back, future data analyst.
As a budding analyst, you'll be exposed to a lot of data.
People learn and absorb data in so many different ways, and
one of the most effective ways that this can happen is through visualization.
Data visualization is the graphic representation and presentation of data.
In reality, it's just putting information into an image to make it easier for
other people to understand.
If you've ever looked at any kind of map, whether it's paper or online,
then you know exactly how helpful visuals could be.
Data visualizations are definitely having a moment right now.
Online we are surrounded by images that show information in all kinds of ways,
but the history of data visualization goes back way further than the Web.
Visualizing data began long ago with maps,
which are the visual representation of geographic data.
This map of the known world is from 1502. Map makers continued to improve
their visualizations as new lands were charted.
New data was collected about those locations,
and new methods for visualizing the data were created.
Scientists and mathematicians began to truly embrace the idea of arranging data
visually in the 1700s and 1800s.
This bar graph is from 1821 and
it doesn't look too different from bar graphs that we see today.
But since the beginning of the digital age of data analytics in the 1990s,
the scope and
reach of visualizations have grown along with the data they graphically represent.
As we keep learning how to more efficiently communicate with visuals,
the quality of our insights continues to grow too.
Today we can quantify human behavior through data, and
we've learned to use computers to collect, analyze and visualize that data.
As an analyst in today's world,
you'll probably split your time with data visuals in two ways:
looking at visuals in order to understand and draw conclusions about data or
creating visuals from raw data to tell a story.
Either way, it's always good to keep in mind that data visualizations will be your
key to success.
This is especially true once you reach the point where you're ready to present
the results of your data analysis to an audience.
Getting people to understand your vision and thought process can feel challenging.
But a well-made data visualization has the power to change people's minds.
Plus, it can help someone who doesn't have the same technical background or
experience as you form their own opinions.
So here's a quick rule for creating a visualization.
Your audience should know exactly what they're looking at within the first five
seconds of seeing it.
Basically, this means the visual should be clear and easy to follow.
In the five seconds after that,
your audience should understand the conclusion your visualization is making.
Even if they aren't totally familiar with the research you've been doing.
They might not agree with your conclusion, and that's okay.
You can always use their feedback to adjust your visualization and
go back to the data to do further analysis.
So now let's talk about what we have to do to create a visualization that's
understandable, effective and, most importantly, convincing.
Let's start from the beginning.
Data visualizations are a helpful tool for
fitting a lot of information into a small space.
To do this, you first need to structure and organize your thoughts.
Think about your objectives and
the conclusions you've reached after sorting through data.
Then think about the patterns you've noticed in the data, the things that
surprised you and, of course, how all of this fits together into your analysis.
Identifying the key elements of your findings helps set the stage for
how you should organize your presentation.
Check out this data visualization made by David McCandless,
a well-known data journalist.
This graphic includes four key elements:
the information or data, the story, the goal and the visual form.
It's arranged in a four-part Venn diagram,
which tells us that all four elements are needed for a successful visualization.
So far, you've learned a lot about the data used in visualizations.
That's important because it's a key building block for your visualization.
The story or concept adds meaning to the data and makes it interesting.
We'll talk more about the importance of data storytelling later, but for now,
just remember that the story and
the data combined provide an outline of what you're trying to show.
The goal or function makes the data both useful and usable, and
the visual form creates both beauty and structure.
With just two elements, you can create a rough sketch of a visual.
This could work if you're at an early stage, but won't give you a complete
visualization because you'd be missing other key elements.
Even using three elements gets you closer, but you're not quite finished.
For example, if you combine information, goal, and
visual form without any story, your visual will probably look fine,
but it won't be interesting. On their own,
each element has value, but visualizations only become truly powerful and effective
when you combine all four elements in a way that makes sense. And
when you think about all of these elements together,
you can create something meaningful for your audience.
At Google I make sure to develop visualizations
to tell stories about data that include all four of these elements, and
I can tell you that each element is key to a visualization's success.
That's why it's so important for
you as the analyst to pay close attention to each element as we move forward.
Other people might not know or
understand the exact steps you took to come to the conclusions you've made, but
that shouldn't stop them from understanding your reasoning.
Basically, an effective data visualization should lead viewers to reach the same
conclusion you did, but much more quickly.
Because of the age we live in,
we're constantly being shown different ways to view and absorb information.
This means that you've already seen lots of visuals you can reference as you design
your own visualizations.
You have the power to tell convincing stories that could change opinions and
shift mindsets.
That's pretty cool. But you also have the responsibility to pay attention to
the perspectives of others as you create these stories.
So it's important to always keep that in mind.
Coming up, we'll start drawing connections between data and
images to create a strong foundation for your visual masterpieces.
I can't wait to get started.

A data visualization, sometimes referred to as a “data viz,” allows analysts to properly
interpret data. A good way to think of data visualization is that it can be the difference
between utter confusion and really grasping an issue. Creating effective data visualizations
is a complex task; there is a lot of advice out there, and it can be difficult to grasp it all. In
this reading, you are going to learn some tips and tricks for creating effective data
visualizations. First, you'll review two frameworks that are useful for thinking about how
you can organize the information in your visualization. Second, you'll explore pre-attentive
attributes and how they can be used to affect the way people think about your
visualizations. From there, you'll do a quick review of the design principles that you should
keep in mind when creating your visualization. You will end the reading by reviewing some
practices that you can use to avoid creating misleading or inaccurate visualizations.
Frameworks for organizing your thoughts about visualization
Frameworks can help you organize your thoughts about data visualization and give you a
useful checklist to reference. Here are two frameworks that may be useful for you as you
create your own data viz:
1) The McCandless Method
You learned about the David McCandless method in the first lesson on effective data
visualizations, but as a refresher, the McCandless Method lists four elements of good data
visualization:
1. Information: the data you are working with
2. Story: a clear and compelling narrative or concept
3. Goal: a specific objective or function for the visual
4. Visual form: an effective use of metaphor or visual expression
Note: One useful way of approaching this framework is to notice the parts of the graphic
where there is incomplete overlap between all four elements. For example, visual form
without a goal, story, or data could be a sketch or even art. Data plus visual form without a
goal or function is eye candy. Data with a goal but no story or visual form is boring. All four
elements need to be at work to create an effective visual.
2) Kaiser Fung’s Junk Charts Trifecta Checkup
This approach is a useful set of questions that can help consumers of data visualization
critique what they are consuming and determine how effective it is. The Checkup has three
questions:
1. What is the practical question?
2. What does the data say?
3. What does the visual say?
Note: This checklist helps you think about your data viz from the perspective of your
audience and decide if your visual is communicating your data effectively to them or not. In
addition to these frameworks, there are some other building blocks that can help you
construct your data visualizations.
Pre-attentive attributes: marks and channels
Creating effective visuals means leveraging what we know about how the brain works, and
then using specific visual elements to communicate the information effectively. Pre-attentive attributes are the elements of a data visualization that people recognize
automatically without conscious effort. The essential, basic building blocks that make
visuals immediately understandable are called marks and channels.
Marks
Marks are basic visual objects like points, lines, and shapes. Every mark can be broken
down into four qualities:
1. Position - Where a specific mark is in space in relation to a scale or to other
marks
2. Size - How big, small, long, or tall a mark is
3. Shape - Whether a specific object is given a shape that communicates something about it
4. Color - What color the mark is
Channels
Channels are visual aspects or variables that represent characteristics of the data.
Channels are basically marks that have been used to visualize data. Channels will vary in
terms of how effective they are at communicating data based on three elements:
1. Accuracy - Are the channels helpful in accurately estimating the values being
represented?
For example, color is very accurate when communicating categorical differences, like
apples and oranges. But it is much less effective when distinguishing quantitative data like
5 from 5.5.
2. Popout - How easy is it to distinguish certain values from others?
There are many ways of drawing attention to specific parts of a visual, and many of them
leverage pre-attentive attributes like line length, size, line width, shape, enclosure, hue, and
intensity.
3. Grouping - How good is a channel at communicating groups that exist in the data?
Consider the proximity, similarity, enclosure, connectedness, and continuity of the channel.
But, remember: the more you emphasize different things, the less that emphasis counts.
The more you emphasize one single thing, the more that counts.
Design principles
Once you understand the pre-attentive attributes of data visualization, you can go on to
design principles for creating effective visuals. These design principles are important to
your work as a data analyst because they help you make sure that you are creating
visualizations that communicate your data effectively to your audience. By keeping these
rules in mind, you can plan and evaluate your data visualizations to decide if they are
working for you and your goals. And, if they aren’t, you can adjust them!
Choose the right visual: One of the first things you have to decide is which visual will be the most effective for your audience. Sometimes, a simple table is the best visualization. Other times, you need a more complex visualization to illustrate your point.
Optimize the data-ink ratio: The data-ink ratio entails focusing on the part of the visual that is essential to understanding the point of the chart. Try to minimize non-data ink like boxes around legends or shadows to optimize the data-ink ratio.
Use orientation effectively: Make sure the written components of the visual, like the labels on a bar chart, are easy to read. You can change the orientation of your visual to make it easier to read and understand.
Color: There are a lot of important considerations when thinking about using color in your visuals. These include using color consciously and meaningfully, staying consistent throughout your visuals, being considerate of what colors mean to different people, and using inclusive color scales that make sense for everyone viewing them.
Numbers of things: Think about how many elements you include in any visual. If your visualization uses lines, try to plot five or fewer. If that isn't possible, use color or hue to emphasize important lines. Also, when using visuals like pie charts, try to keep the number of segments to less than seven since too many elements can be distracting.
Avoiding misleading or deceptive charts
As you are considering what kind of visualization to create and how to design it, you will
want to be sure that you are not creating misleading or deceptive charts. As you have been
learning, data analysis provides people with insights and knowledge they can use to make
decisions. So, it is important that the visualizations you create are communicating your
data accurately and truthfully. Here are some common errors to avoid so that your
visualizations aren’t accidentally misleading:
Cutting off the y-axis: Changing the scale on the y-axis can make differences between data seem more dramatic, even if the difference is actually quite small.
Misleading use of a dual y-axis: Using a dual y-axis without clearly labeling it in your data visualization can create misleading charts.
Artificially limiting the scope of the data: If you only consider the part of the data that confirms your analysis, your visualizations will be misleading because they don't take all of the data into account.
Problematic choices in how data is binned or grouped: It is important to make sure that the way you are grouping data is not misrepresenting your data and disguising important trends and insights.
Using part-to-whole visuals when the totals do not sum up appropriately: If you are using a part-to-whole visual like a pie chart to explain your data, the individual parts should add up to equal 100%. If they don't, your data visualization will be misleading.
Hiding trends in cumulative charts: Creating a cumulative chart can disguise more insightful trends by making the visualization too large to track any changes over time.
Artificially smoothing trends: Adding smooth trend lines between points in a scatter plot can help make the data easier to read, but replacing the points with just the line can make the data appear more connected over time than it actually was.
Finally, keep in mind that data visualization is an art form, and it takes time to develop
these skills. Over your career as a data analyst, you will not only learn how to design good
data visualizations, but you will also learn how to evaluate good data visualizations. Use
these tips to think critically about data visualization—both as a creator and as an audience
member.
Further reading
The beauty of data visualization: In this video, David McCandless explains the need for design to not just be beautiful, but for it to be meaningful as well. Data visualization must be able to balance function and form for it to be relevant to your audience.
'The McCandless Method' of data presentation: At first glance, this blog appears to be written by a David McCandless fan, and it is. However, it contains very useful information and provides an in-depth look at the 5-step process that McCandless uses to present his data.
Information is beautiful: Founded by McCandless himself, this site serves as a hub of sample visualizations that make use of the McCandless method. Explore data from the news, science, the economy, and so much more and learn how to make visual decisions based on facts from all kinds of sources.
Beautiful daily news: In this McCandless collection, explore uplifting trends and statistics that are beautifully visualized for your creative enjoyment. A new chart is released every day so be sure to visit often to absorb the amazing things happening all over the world.
The Wall Street Journal Guide to Information Graphics: The Dos and Don'ts of Presenting Data, Facts, and Figures: This is a comprehensive guide to data visualization, including chapters on basic data visualization principles and how to create useful data visualizations even when you find yourself in a tricky situation. This is a useful book to add to your data visualization library, and you can reference it over and over again.
Hello again. Earlier we talked about why
data visualizations are so
important to both analysts and stakeholders.
Now we'll discuss the connections you can make
between data and images in your visualizations.
Visual communication of data is important to
those using the data to help make decisions.
To better understand the connection
between data and images,
let's talk about some examples of
data visualizations and how
they can communicate data effectively.
You may come across lots of these in your daily life.
We'll explore them a little bit more here.
A good place to start is a bar graph.
Bar graphs use size contrast to
compare two or more values.
The horizontal line of a bar graph, usually placed at the bottom, is called the x-axis.
In bar graphs with vertical bars, the x-axis is used to represent categories,
time periods, or other variables.
The vertical line of a bar graph, usually placed to the left, is called the y-axis.
The y-axis usually has a scale of values for the variables.
In this example, the time of day is compared to
someone's level of motivation
throughout the whole workday.
Bar graphs are a great way to clarify trends.
Here, it's clear this person's motivation is low at
the beginning of the day and gets higher
and higher by the end of the workday.
This type of visualization makes it
very easy to identify patterns.
Another example is a line graph.
Line graphs are a type of visualization that can help
your audience understand shifts or changes in your data.
They're usually used to track
changes through a period of time,
but they can be paired with other factors too.
In this line graph,
we're using two lines to compare
the popularity of cats and dogs over a period of time.
With two different line colors,
we can immediately tell that
dogs are more popular than cats.
We'll talk more about using colors and patterns to make
visualizations more accessible to audiences later too.
Even as a line moves up and down,
there's a general trend upwards and the line
for dogs always stays higher than the line for cats.
Now let's check out another visualization
you'll probably recognize.
Say hello to the pie chart.
Pie charts show how much each part
of something makes up the whole.
This pie chart shows us
all the activities that make up someone's day.
Half of it's spent working,
which is shown by the amount of
space that the blue section takes up.
From a quick scan,
you can easily tell
which activities make up a good chunk of
the day in this pie chart
and which ones take up less time.
Earlier, we learned how maps
help organize data geographically.
The great thing about maps is they can hold a lot of
location-based information and they're
easy for your audience to interpret.
This example shows survey data
about people's happiness in Europe.
The borderlines are well-defined and the colors
added make it even easier to tell the countries apart.
Understanding the data represented here,
which we'll come back to again later,
can happen pretty quickly.
So data visualization is an excellent tool for making
the connection between an image
and the information it represents,
but it can sometimes be misleading.
One way visualizations can be
manipulated is with scaling and proportions.
Think of a pie chart. Pie charts show
proportions and percentages between categories.
Each part of the circle, or pie, should reflect its percentage of the whole,
which is equal to 100 percent.
So if you want to visualize your sales analysis to show
the percentage of your company sales
that come from online transactions,
you could use a pie chart.
The size of each slice would be
the percentage of total sales that it represents.
So if your online sales accounted for 60 percent,
the slice would be 60 percent of the whole pie.
Now here's a misleading pie chart.
It's supposed to show opinions about pizza toppings,
but each slice or
segment represents more than one option.
They all add up to well over 100 percent.
There are lots of ingredients listed below
the image that are not even included in the visual data.
All of the segments are the same size,
even though they're supposed to
be showing different values.
If a visualization looks
confusing then it probably is confusing.
Let's explore another example where
the size of the graphic components comes into play.
This time with a bar chart.
In a truncated bar chart like this one,
the values on the y-axis don't start at zero.
The data points start at 9,100 and increase at intervals of 100.
This makes it seem like the data, let's say it's the number of clicks per day on
different website links, is fairly wide-ranging.
In this view, website E
seems to clearly receive way more clicks than website D,
which receives more clicks than website C and so on.
While the graph is clear and
the elements are easy to understand,
the way the data is presented is misleading.
Let's try to fix this by changing the graph's y-axis,
so that it starts at zero instead.
Now, the differences between the websites' clicks per day don't look nearly as drastic.
By making the y-axis start at zero,
we're changing the visual proportions to be
more accurate and more honest.
Some platforms always start their y-axis at zero,
but other programs like
spreadsheets might not fix the y-axis.
So it's important to keep this in
mind when creating visualizations.
By following the conventions of data analysis,
you'll be able to avoid misleading visualizations.
You always want your visualization to be
clear and easy to understand,
but never at the expense of
communicating ideas that are true to the data.
So we've talked about
some effective data-driven visualizations
like bar graphs,
line graphs, and pie charts,
and when to use them.
On top of that, we've discussed some things to avoid in
your visualizations to keep them from being misleading.
Coming up, we'll check out how to make
those visualizations reach
your target audience. See you then.

Hey, there. You're back and ready to

learn how to create powerful data visualizations.

Coming up, we'll explore how to take

our findings and turn them into compelling visuals.

Earlier, we discussed the relationship

between data and images.

Now we'll build on that to explore

what visualizations can reveal to

your audience and how to make

your graphics as effective as possible.

One of your biggest considerations when creating

a data visualization is where

you'd like your audience to focus.

Showing too much can be

distracting and leave your audience confused.

In some cases, restricting data can be a good thing.

On the other hand, showing too little can

make your visualization unclear and less meaningful.

As a general rule,

as long as it's not misleading,

you should visually represent only the data that

your audience needs in order to understand your findings.

Now let's talk about what you

can show with visualizations.

Change over time is a big one.

If your analysis involves how
the data has changed over a certain period,
which could be days, weeks, months, or years,
you can set your visualization to show
only the time period relevant to your objective.

This visualization shows the search interests in

news story topics like

environment and science and social issues.

The viz is set up to show how

the search entries change day to day.

The bubbles represent the most popular topic

on each day in a given part of the US.

As new stories come up,

the data changes to reflect the topic of those stories.

If we wanted the data for weekly or monthly news cycles,

we change the interactive feature

to show changes by week or month.

Another situation is when you need to

show how your data is distributed.

A histogram resembles a bar graph,

but it's a chart that shows how often

data values fall into certain ranges.

This histogram shows a lot of data and how it's

distributed on a narrow range

from a negative one to a positive one.

Each bin or bucket,

as the bar is called,

contains a certain number of values that fall

into one small part of the range.

If you don't need to show that much data,

other histograms would be more effective,

like this one about the length of dinosaurs.

Here the bins or buckets of data values are segmented.

You can show each value that falls

into each part of the range.
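As a rough sketch of how bins work, here's a hypothetical example in Python's matplotlib (the lengths are made up for illustration only): each bin counts how many values fall into one part of the range.

import matplotlib.pyplot as plt

# Hypothetical dinosaur body lengths in meters
lengths = [1.2, 2.5, 3.1, 4.8, 5.0, 5.5, 6.2, 7.9, 9.4, 12.0, 14.5, 22.0]

plt.hist(lengths, bins=5, edgecolor="black")  # 5 bins, or buckets, across the range
plt.xlabel("Length (meters)")
plt.ylabel("Number of dinosaurs")
plt.title("Distribution of dinosaur lengths")
plt.show()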

If your data needs to be ranked,
like when ordering the number of
responses to survey questions,
you should first think about what you
want to highlight in your visualization.

Bar charts with horizontal bars

effectively show data that are ranked,

with bars arranged in ascending or descending order.

A bar chart should always be ranked by value,

unless there's a natural order to

the data like age or time, for example.

This simple bar chart shows metals

like gold and platinum ranked by density.

An audience would be able to clearly see the ranking and

quickly determine which metals had the highest density,

even if this database included a lot more metals.
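A quick sketch of a ranked horizontal bar chart, assuming approximate densities and Python's matplotlib (the values are illustrative, not the course's data):

import matplotlib.pyplot as plt

# Approximate densities in g/cm^3, for illustration only
metals = {"Aluminum": 2.7, "Iron": 7.9, "Silver": 10.5, "Gold": 19.3, "Platinum": 21.5}

# Sort by value so the bars appear in ranked order
ranked = sorted(metals.items(), key=lambda kv: kv[1])
names = [name for name, _ in ranked]
densities = [density for _, density in ranked]

plt.barh(names, densities)
plt.xlabel("Density (g/cm^3)")
plt.title("Metals ranked by density")
plt.show()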

Correlation charts can show relationships among data,

but they should be used with

caution because they might lead

viewers to think that the data shows causation.

Causation or a cause-effect relationship

occurs when an action directly leads to an outcome.

Correlation and causation are often mixed up because

humans like to find patterns even when they don't exist.

If two variables look

like they're associated in some way,

we might assume that one is dependent on the other.

That implies causation, even

if the variables are completely independent.

If we put that data into a visualization,

then it would be misleading.

But correlation charts that do

show causation can be effective.

For example, this correlation chart has one line

of data showing the average traffic for

Google searches on Tuesdays in Brazil.

The other line is for a specific date
of search traffic, June 15th.

The data is automatically correlated because

both lines are representing the same basic information.

But the chart also shows one big difference.

When a football match or soccer match for

Americans began on June 15th,

the search traffic showed a significant drop.

This implies causation.

Football is a very popular and

important sport for Brazilians,

and the data in this chart verifies that.

We've now talked about time series charts,

histograms, ranked bar charts, and correlation charts.

Each of these charts can

visualize a different type of analysis.

Your business objective and audience will

help figure out which of

these common visualizations to choose.

Or you may want to check

some other kinds of visualizations out there.

There is also a glossary of visualizations

that you'll be able to reference later.

That wraps up our lesson on creating visualizations.

Coming up next, we'll add some more layers to

your planning and execution of visuals. Hang on tight.

Correlation and causation
In this reading, you will examine correlation and causation in more detail. Let’s review the
definitions of these terms:


Correlation in statistics is the measure of the degree to which two variables
move in relationship to each other. An example of correlation is the idea that
“As the temperature goes up, ice cream sales also go up.” It is important to
remember that correlation doesn’t mean that one event causes another. But, it
does indicate that they have a pattern with or a relationship to each other. If
one variable goes up and the other variable also goes up, it is a positive
correlation. If one variable goes up and the other variable goes down, it is a
negative or inverse correlation. If one variable goes up and the other variable
stays about the same, there is no correlation.
Causation refers to the idea that an event leads to a specific outcome. For
example, when lightning strikes, we hear the thunder (sound wave) caused by
the air heating and cooling from the lightning strike. Lightning causes thunder.
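As an illustrative sketch (hypothetical numbers, using Python's numpy; this is not part of the course materials), a correlation coefficient close to +1 only tells you that two variables move together; it says nothing about whether one causes the other.

import numpy as np

# Hypothetical daily observations
temperature = np.array([18, 21, 24, 27, 30, 33])            # degrees Celsius
ice_cream_sales = np.array([120, 150, 210, 260, 310, 400])  # units sold

# Pearson correlation coefficient: +1 is a perfect positive correlation,
# -1 a perfect negative (inverse) correlation, and 0 no linear correlation
r = np.corrcoef(temperature, ice_cream_sales)[0, 1]
print(f"Correlation coefficient: {r:.2f}")

# A high r here does not prove that heat causes ice cream purchases;
# a sale, a holiday, or some other factor could be driving both.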
Why is differentiating between correlation and causation
important?
When you make conclusions from data analysis, you need to make sure that you don’t
assume a causal relationship between elements of your data when there is only a
correlation. When your data shows that outdoor temperature and ice cream consumption
both go up at the same time, it might be tempting to conclude that hot weather causes
people to eat ice cream. But, a closer examination of the data would reveal that every
change in temperature doesn’t lead to a change in ice cream purchases. In addition, there
might have been a sale on ice cream at the same time that the data was collected, which
might not have been considered in your analysis.
Knowing the difference between correlation and causation is important when you make
conclusions from your data since the stakes could be high. The next two examples illustrate
the high stakes to health and human services.
Cause of disease
For example, pellagra is a disease with symptoms of dizziness, sores, vomiting, and
diarrhea. In the early 1900s, people thought that the disease was caused by unsanitary
living conditions. Most people who got pellagra also lived in unsanitary environments. But,
a closer examination of the data showed that pellagra was the result of a lack of niacin
(Vitamin B3). Unsanitary conditions were related to pellagra because most people who
couldn’t afford to purchase niacin-rich foods also couldn’t afford to live in more sanitary
conditions. But, dirty living conditions turned out to be a correlation only.
Distribution of aid
Here is another example. Suppose you are working for a government agency that provides
food stamps. You noticed from the agency’s Google Analytics that people who qualify for
food stamps are browsing the official website, but they are leaving the site without signing
up for benefits. You think that the people visiting the site are leaving because they aren’t
finding the information they need to sign up for food stamps. Google Analytics can help you
find clues (correlations), like the same people coming back many times or how quickly
people leave the page. One of those correlations might lead you to the actual cause, but you
will need to collect additional data, like in a survey, to know exactly why people coming to
the site aren’t signing up for food stamps. Only then can you figure out how to increase the
sign-up rate.
Key takeaways
In your data analysis, remember to:



Critically analyze any correlations that you find
Examine the data’s context to determine if a causation makes sense (and can be
supported by all of the data)
Understand the limitations of the tools that you use for analysis
Further information
You can explore the following article and training for more information about correlation
and causation:
Correlation is not causation: This article describes the impact to a business
when correlation and causation are confused.
 Correlation and causation (Khan Academy lesson): This lesson describes
correlation and causation along with a working example. Follow the examples
of the analysis and notice if there is a positive correlation between frostbite and
sledding accidents.
Hey, great to see you again.
So far we've shown that there's lots of choices you'll
make as a data analyst when creating visualizations.
Each of your choices should help make sure that
your visuals are meaningful and effective.
Another choice you'll need to make is whether you want
your visualizations to be static or dynamic.
Static visualizations do not
change over time unless they're edited.
They can be useful when you want to control
your data and your data story.
Any visualization printed on
paper is automatically static.
Charts and graphs created in
spreadsheets are often static too.
For example, the owner of this spreadsheet might have
to change the data in
order for the visualization to update.
Now, dynamic visualizations are
interactive or change over time.
The interactive nature of these graphics means that
users have some control over what they see.
This can be helpful if
stakeholders want to adjust what they're able to view.
Let's check out a visualization about
happiness that we've created in Tableau.
Tableau is a business intelligence and analytics platform
that helps people see,
understand, and make decisions with data.
Visualizations in Tableau are automatically interactive.
We'll go into the dashboard to
see how the happiness score has
changed from 2015 to 2017.
We can check this out in
our 12th slide, yearly happiness changes.
On the left are
the country level changes in happiness score.
The countries are sorted from
largest increase to largest decrease.
On the right, there's a map
with overall happiness scores.
The color scale moves from blue for
the countries with the highest happiness score,
to red for those with the lowest.
If you look below the map,
you'll notice a year to view slider where people can
choose which year's happiness scores
to display on the map.
It's currently set for 2016,
but if someone wants to know the scores for 2015 or 2017,
they can adjust the slider.
They could then make note of how
the color-coding and score labels
change from year to year.
Other dynamic visualizations
upload new data automatically.
These bar graphs continually
update data by the minute and second.
Other data visuals can do the same by day, week or month.
If you need to,
you can show trends in real-time.
Having an interactive visualization can be
useful for both you and the audience you share it with.
But it's good to remember that
the more power you give the user,
the less control you have over
the story you want the data to tell.
It's something to keep in mind as you learn
how to create your own visualizations.
You want to find the right balance
between interactivity and control.
Something else to consider is
the choice between using a static or dynamic visualization.
This will usually depend on the data you're visualizing,
the audience you're presenting to,
and how you're giving your presentation.
Now that we've made some decisions about
what kind of data vis we want to create,
we can start thinking about the design,
which is exactly where we're going to start
talking about next time. See you there.
The wonderful world of visualizations
As a data analyst, you will often be tasked with relaying information and data that your
audience might not readily understand. Presenting your data visually is an effective way to
communicate complex information and engage your stakeholders. One question to ask
yourself is: “what is the best way to tell the story within my data?” This reading includes
several options for you to choose from (although there are many more).
Line chart
A line chart is used to track changes over short and long periods of time. When smaller
changes exist, line charts are better to use than bar graphs. Line charts can also be used to
compare changes over the same period of time for more than one group.
Let’s say you want to present the graduation frequency for a particular high school
between the years 2008-2012. You would input your data in a table like this:
Year    Graduation rate
2008    87
2009    89
2010    92
2011    92
2012    96
From this table, you are able to present your data in a line chart like this:
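If you wanted to sketch that line chart in code instead of a spreadsheet, a hypothetical example in Python's matplotlib (using the values from the table above) could look like this:

import matplotlib.pyplot as plt

years = [2008, 2009, 2010, 2011, 2012]
graduation_rate = [87, 89, 92, 92, 96]  # values from the table above

plt.plot(years, graduation_rate, marker="o")
plt.xticks(years)
plt.xlabel("Year")
plt.ylabel("Graduation rate (%)")
plt.title("Graduation rate, 2008-2012")
plt.show()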
Maybe your data is more specific than above. For example, let’s say you are tasked with
presenting the difference of graduation rates between male and female students. Then your
chart would resemble something like this:
Column chart
Column charts use size to contrast and compare two or more values, using height or
length to represent the specific values.
Below is example data concerning sales of vehicles over the course of 5 months:
Month        Vehicles sold
August       2,800
September    3,700
October      3,750
November     4,300
December     4,600
Visually, it would resemble something like this:
What would this column chart entail if we wanted to add the sales data for a competing car
brand?
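One possible answer, sketched in Python's matplotlib (the competitor's numbers below are invented purely for illustration), is to draw grouped columns so the two brands sit side by side for each month:

import numpy as np
import matplotlib.pyplot as plt

months = ["Aug", "Sep", "Oct", "Nov", "Dec"]
brand_a = [2800, 3700, 3750, 4300, 4600]  # values from the table above
brand_b = [2500, 3100, 3900, 4000, 4800]  # hypothetical competitor data

x = np.arange(len(months))
width = 0.4  # width of each column

plt.bar(x - width / 2, brand_a, width, label="Brand A")
plt.bar(x + width / 2, brand_b, width, label="Brand B (hypothetical)")
plt.xticks(x, months)
plt.ylabel("Vehicles sold")
plt.title("Vehicle sales by month")
plt.legend()
plt.show()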
Heatmap
Similar to bar charts, heatmaps also use color to compare categories in a data set. They are
mainly used to show relationships between two variables and use a system of color-coding
to represent different values. The following heatmap plots temperature changes for each
city during the hottest and coldest months of the year.
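Here's a minimal heatmap sketch, assuming made-up temperatures and Python's matplotlib (not the course's data); color encodes the value at each city and month combination.

import numpy as np
import matplotlib.pyplot as plt

cities = ["City A", "City B", "City C"]
months = ["January", "July"]
# Hypothetical average temperatures in degrees C: rows are cities, columns are months
temps = np.array([[2, 28],
                  [-5, 22],
                  [10, 35]])

fig, ax = plt.subplots()
im = ax.imshow(temps, cmap="coolwarm")
ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)
ax.set_yticks(range(len(cities)))
ax.set_yticklabels(cities)
fig.colorbar(im, label="Temperature (degrees C)")
ax.set_title("Coldest vs. hottest month by city")
plt.show()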
Pie chart
The pie chart is a circular graph that is divided into segments representing proportions
corresponding to the quantity it represents, especially when dealing with parts of a whole.
For example, let’s say you are determining favorite movie categories among avid movie
watchers. You have gathered the following data:
Movie category    Preference
Comedy            41%
Drama             11%
Sci-fi            3%
Romance           17%
Action            28%
Visually, it would resemble something like this:
(Pie chart: Comedy 41%, Action 28%, Romance 17%, Drama 11%, Sci-fi 3%)
Scatter plot
Scatter plots show relationships between different variables. Scatter plots are typically
used for two variables for a set of data, although additional variables can be displayed.
For example, you might want to show data of the relationship between temperature
changes and ice cream sales. It would resemble something like this:
As you may notice, the higher the temperature got, the more demand there was for ice
cream – so the scatter plot is great for showing the relationship between the two variables.
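A hedged sketch of that scatter plot, with hypothetical numbers in Python's matplotlib:

import matplotlib.pyplot as plt

# Hypothetical observations
temperature = [16, 18, 21, 24, 26, 29, 31, 34]             # degrees C
ice_cream_sales = [95, 110, 160, 200, 240, 300, 330, 410]  # units sold

plt.scatter(temperature, ice_cream_sales)
plt.xlabel("Temperature (degrees C)")
plt.ylabel("Ice cream sales (units)")
plt.title("Temperature vs. ice cream sales")
plt.show()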
Distribution graph
A distribution graph displays the spread of various outcomes in a dataset.
Let’s apply this to real data. To account for its supplies, a brand new coffee shop owner
wants to measure how many cups of coffee their customers consume, and they want to
know if that information is dependent on the days and times of the week. That distribution
graph would resemble something like this:
From this distribution graph, you may notice that the amount of coffee sales steadily
increases from the beginning of the week, reaching the highest point mid-week, and then
decreases towards the end of the week.
If outcomes are categorized on the x-axis by distinct numeric values (or ranges of numeric
values), the distribution becomes a histogram. If data is collected from a customer
rewards program, they could categorize how many customers consume between one and
ten cups of coffee per week. The histogram would have ten columns representing the
number of cups, and the height of the columns would indicate the number of customers
drinking that many cups of coffee per week.
Reviewing each of these visual examples, where do you notice that they fit in relation to
your type of data? One way to answer this is by evaluating patterns in data. Meaningful
patterns can take many forms, such as:




Change: This is a trend or instance of observations that become different over
time. A great way to measure change in data is through a line or column chart.
Clustering: A collection of data points with similar or different values. This is
best represented through a distribution graph.
Relativity: These are observations considered in relation or in proportion to
something else. You have probably seen examples of relativity data in a pie
chart.
Ranking: This is a position in a scale of achievement or status. Data that
requires ranking is best represented by a column chart.

Correlation: This shows a mutual relationship or connection between two or
more things. A scatter plot is an excellent way to represent this type of data
pattern.
Studying your data
Data analysts are tasked with collecting and interpreting data as well as displaying data in a
meaningful and digestible way. Determining how to visualize your data will require
studying your data’s patterns and converting it using visual cues. Feel free to practice your
own charts and data in spreadsheets. Simply input your data in the spreadsheet, highlight
it, then insert any chart type and view how your data can be visualized based on what you
choose.
Data grows on decision trees
With so many visualization options out there for you to choose from, how do you decide
what is the best way to represent your data?
A decision tree is a decision-making tool that allows you, the data analyst, to make
decisions based on key questions that you can ask yourself. Each question in the
visualization decision tree will help you make a decision about critical features for your
visualization. Below is an example of a basic decision tree to guide you towards making a
data-driven decision about which visualization is the best way to tell your story. Please
note that there are many different types of decision trees that vary in complexity, and can
provide more in-depth decisions.
- Does your data have only one numeric variable? Histogram or density plot
- Are there multiple data sets? Line chart or pie chart
- Are you measuring changes over time? Bar chart
- Do relationships between the data need to be shown? Scatter plot or heatmap
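One way to internalize the decision tree is to write its logic out as a small helper function. This is only an illustrative sketch of the questions above; the function name and structure are invented, not part of the course.

def suggest_chart(one_numeric_variable=False, multiple_datasets=False,
                  change_over_time=False, show_relationships=False):
    # Walk the basic decision tree above and return a chart suggestion
    if one_numeric_variable:
        return "Histogram or density plot"
    if multiple_datasets:
        return "Line chart or pie chart"
    if change_over_time:
        return "Bar chart"
    if show_relationships:
        return "Scatter plot or heatmap"
    return "Revisit your data and objective"

# Example: quarterly sales figures for several product lines
print(suggest_chart(multiple_datasets=True))  # -> Line chart or pie chart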
Begin with your story
Start off by evaluating the type of data you have and go through a series of questions to
determine the best visual source:

Does your data have only one numeric variable? If you have data that has
one, continuous, numerical variable, then a histogram or density plot is the
best method of plotting it. Depending on your type of data,
a bar chart can even be appropriate in this case. For example, if you have data
pertaining to the height of a group of students, you will want to use a histogram
to visualize how many students there are in each height range:

Are there multiple datasets? For cases dealing with more than one set of data,
consider a line or pie chart for accurate representation of your data. A line chart
will connect multiple data sets over a single, continuous line, showing how
numbers have changed over time. A pie chart is good for dividing a whole into
multiple categories or parts. An example of this is when you are measuring
quarterly sales figures of your company. Below are examples of this data
plotted on both a line and pie chart.

Are you measuring changes over time? A line chart is usually adequate for
plotting trends over time. However, when the changes are larger, a bar chart is
the better option. If, for example, you are measuring the number of visitors to
NYC over the past 6 months, the data would look like this:

Do relationships between the data need to be shown? When you have two
variables for one set of data, it is important to point out how one affects the
other. Variables that pair well together are best plotted on a scatter
plot. However, if there are too many data points, the relationship between
variables can be obscured so a heat map can be a better representation in that
case. If you are measuring the population of people across all 50 states in the
United States, your data points would consist of millions so you would use a
heat map. If you are simply trying to show the relationship between the
number of hours spent studying and its effects on grades, your data would look
like this:
Additional resources
The decision tree example used in this reading is one of many. There are multiple decision
trees out there with varying levels of details that you can use to help guide your visual
decisions. If you want more in-depth insight into more visual options, explore the following
resources:


From data to visualization: This is an excellent analysis of a larger decision tree.
With this comprehensive selection, you can search based on the kind of data
you have or click on each graphic example for a definition and proper usage.
Selecting the best chart: This two-part YouTube video can help take the
guesswork out of data chart selection. Depending on the type of data you are
aiming to illustrate, you will be guided through when to use, when to avoid, and
several examples of best practices. Part 2 of this video provides even more
examples of different charts, ensuring that there is a chart for every type of data
out there.
Data is beautiful
At this point, you might be asking yourself: What makes a good visualization? Is it the data
you use? Or maybe it is the story that it tells? In this reading, you are going to learn more
about what makes data visualizations successful by exploring David McCandless’ elements
of successful data visualization and evaluating three examples based on those elements.
Data visualization can change our perspective and allow us to notice data in new, beautiful
ways. A picture is worth a thousand words—that’s true in data too! You will have the
option to save all of the data visualization examples that are used throughout this reading;
these are great examples of successful data visualization that you can use for future
inspiration.
In the Venn diagram referenced below, where just two or three ovals overlap, there are different types of incomplete data
visualization. At the center, where all four overlap, are the words “successful
visualization.” This visualization stresses the idea that all four elements are necessary to
create a successful data visualization.
You can also access a PDF version of this visualization and save it for your own reference
by opening the attached file: WEB_What-Makes-a-Good-Infoviz.pdf (PDF).
Four elements of successful visualizations
The Venn diagram by David McCandless identifies four elements of successful
visualizations:
Information (data): The information or data that you are trying to convey is a
key building block for your data visualization. Without information or data, you
cannot communicate your findings successfully.
 Story (concept): Story allows you to share your data in meaningful and
interesting ways. Without a story, your visualization is informative, but not
really inspiring.
 Goal (function): The goal of your data visualization makes the data useful and
usable. This is what you are trying to achieve with your visualization. Without a
goal, your visualization might still be informative, but can’t generate actionable
insights.
 Visual form (metaphor): The visual form element is what gives your data
visualization structure and makes it beautiful. Without visual form, your data is
not visualized yet.
All four of these elements are important on their own, but a successful data visualization
balances all four. For example, if your data visualization has only two elements, like the
information and story, you have a rough outline. This can be really useful in your early
planning stages, but is not polished or informative enough to share. Even three elements
are not quite enough— you need to consider all four to create a successful data
visualization.

In the next part of this reading, you will use these elements to examine two data
visualization examples and evaluate why they are successful.
Example 1: Visualization of dog breed comparison
The Best in Show visualization uses two axes, popularity and data score, to place different dog breeds on a four-square chart. The squares are labelled “Inexplicably Overrated,” “The Rightly Ignored,”
“Hot Dogs!,” and “Overlooked Treasures.” Different dog breeds, visualized with plotted
points shaped like dogs, are distributed on the chart based on their popularity and their
data score.
Save this data visualization as a PDF: IIB-LICENSED_Best-in-Show.pdf.
View the data
The Best in Show visualization uses data about different dog breeds from the American
Kennel Club. The data has been compiled in a spreadsheet. Click the link below and select
"Use Template" to view the data.
Link to the template: KIB - Best in Show
Or, if you don't have a Google account, download the file: KIB - Best in Show (public) (XLSX).
Examine the four elements
This visualization compares the popularity of different dog breeds to a more objective data
score. Consider how it uses the elements of successful data visualization:




Information (data): If you view the data, you can explore the metrics being
illustrated in the visualization.
Story (concept): The visualization shows which dogs are overrated, which are
rightly ignored, and those that are really hot dogs! And, the visualization
reveals some overlooked treasures you may not have known about previously.
Goal (function): The visualization is interested in exploring the relationship
between popularity and the objective data scores for different dog breeds. By
comparing these data points, you can learn more about how different dog
breeds are perceived.
Visual form (metaphor): In addition to the actual four-square structure of this
visualization, other visual cues are used to communicate information about the
dataset. The most obvious is that the data points are represented as dog
symbols. Further, the size of a dog symbol and the direction the dog symbol
faces communicate other details about the data.
Example 2: Visualization of rising sea levels
This visualization demonstrates how much sea levels are projected to rise over the course of 8,000 years.
On the y-axis, it lists both the number of years and the sea level in meters. From right to
left, starting with the lowest sea level, the chart includes silhouettes of different cities
around the world to demonstrate how long it would take for most of the world to be
underwater. It also includes inset maps of the continents and how they would appear at
different times as sea levels continue to rise.
Save this data visualization as a PDF: IIB-LICENSED_Sea-Levels.pdf.
Examine the four elements
This When Sea Levels Attack visualization illustrates how much sea levels are projected to
rise over the course of 8,000 years. The silhouettes of different cities with different sea
levels, rising from right to left, helps to drive home how much of the world will be affected
as sea levels continue to rise. Here is how this data visualization stacks up using the four
elements of successful visualization:




Information (data): This visualization uses climate data on rising sea levels
from a variety of sources, including NASA and the Intergovernmental Panel on
Climate Change. In addition to that data, it also uses recorded sea levels from
around the world to help illustrate how much rising sea levels will affect the
world.
Story (concept): The visualization tells a very clear story: Over the course of
8,000 years, much of the world as we know it will be underwater.
Goal (function): The goal of this project is to demonstrate how soon rising sea
levels are going to affect us on a global scale. Using both data and the visual
form, this visualization makes rising sea levels feel more real to the audience.
Visual form (metaphor): The city silhouettes in this visualization are a
beautiful way to drive home the point of the visualization. It gives the audience
a metaphor for how rising sea levels will affect the world around them in a way
that showing just the raw numbers can’t do. And for a more global perspective,
the visualization also uses inset maps.
Key takeaways
Notice how each of these visualizations balance all four elements of successful
visualization. They clearly incorporate data, use storytelling to make that data meaningful,
focus on a specific goal, and structure the data with visual forms to make it beautiful and
communicative. The more you practice thinking about these elements, the more you will be
able to include them in your own data visualizations.
Design thinking for visualization
improvement
Design thinking for data visualization involves five phases:
1. Empathize: Thinking about the emotions and needs of the target audience for
the data visualization
2. Define: Figuring out exactly what your audience needs from the data
3. Ideate: Generating ideas for data visualization
4. Prototype: Putting visualizations together for testing and feedback
5. Test: Showing prototype visualizations to people before stakeholders see them
As interactive dashboards become more popular for data visualization, new importance
has been placed on efficiency and user-friendliness. In this reading, you will learn how
design thinking can improve an interactive dashboard. As a junior analyst, you wouldn’t be
expected to create an interactive dashboard on your own, but you can use design thinking
to suggest ways that developers can improve data visualizations and dashboards.
An example: online banking dashboard
Suppose you are an analyst at a bank that has just released a new dashboard in their online
banking application. This section describes how you might explore this dashboard like a
new user would, consider a user’s needs, and come up with ideas to improve data
visualization in the dashboard. The dashboard in the banking application has the following
data visualization elements:



Monthly spending is shown as a donut chart that reflects different categories
like utilities, housing, transportation, education, and groceries.
When customers set a budget for a category, the donut chart shows filled and
unfilled portions in the same view.
Customers can also set an overall spending limit, and the dashboard will
automatically assign the budgeted amounts (unfilled areas of the donut chart)
to each category based on past spending trends.
Empathize
First, empathize by putting yourself in the shoes of a customer who has a checking account
with the bank.

Do the colors and labels make sense in the visualization?


How easy is it to set or change a budget?
When you click on a spending category in the donut chart, are the transactions
in the category displayed?
What is the main purpose of the data visualization? If you answered that it was to help
customers stay within budget or to save money, you are right! Saving money was a top
customer need for the dashboard.
Define
Now, imagine that you are helping dashboard designers define other things that customers
might want to achieve besides saving money.
What other data visualizations might be needed?


Track income (in addition to spending)
Track other spending that doesn’t neatly fit into the set categories (this is
sometimes called discretionary spending)
 Pay off debt
Can you think of anything else?
Ideate
Next, ideate additional features for the dashboard and share them with the software
development team.


What new data visualizations would help customers?
Would you recommend bar charts or line charts in addition to the standard
donut chart?
 Would you recommend allowing users to create their own (custom) categories?
Can you think of anything else?
Prototype
Finally, developers can prototype the next version of the dashboard with new and
improved data visualizations.
Test
Developers can close the cycle by having you (and others) test the prototype before it is
sent to stakeholders for review and approval.
Key takeaways
This design thinking example showed how important it is to:



Understand the needs of users
Generate new ideas for data visualizations
Make incremental improvements to data visualizations over time
You can refer to the following articles for more information about design thinking:
Three Critical Aspects of Design Thinking for Big Data Solutions
Data and Design Thinking: Why Use Data in the Design Process?
Hello again. We've learned data visualizations are
designed to help an audience process information
quickly and memorably.
You might remember the 5-second rule we covered earlier.
Within the first five seconds of
seeing a data visualization,
your audience should understand
exactly what you're trying to convey.
Five seconds might seem like a flash,
but adding in descriptive wording can really help
your audience interpret and
understand the data in the right way.
Your audience will be less likely to
have questions about what you're
sharing if you add headlines, subtitles, and labels.
One of the easiest ways to highlight key data
in your data viz is through headlines.
A headline is a line of words printed in large letters at
the top of the visualization to
communicate what data is being presented.
It's the attention-grabber that makes
your audience want to read more.
Take charts, for example.
A chart without a headline is
like a report without a title.
You want to make it easy to
understand what your chart's about.
Be sure to use clear, concise language,
explaining all information as plainly as possible.
Try to avoid using abbreviations or acronyms,
even if you think they're common knowledge.
The typography and placement
of the headline is important too.
It's best to keep it simple.
Make it bold or a few sizes larger than
the rest of the text and
place it directly above the chart,
aligned to the left.
Then, explain your data viz even further with a subtitle.
A subtitle supports the headline
by adding more context and description.
Use a font style that matches the rest of
the chart's elements and place
the subtitle directly underneath the headline.
Now, let's talk about labels.
Earlier, we mentioned Dona Wong,
a visual journalist who's well known for sharing
guidelines on making data viz more effective.
She makes a very strong case for using labels
directly on the data instead of relying on legends.
This is because lots of charts
use different visual properties like
colors or shapes to represent different values of data.
A legend or key
identifies the meaning of various elements in
a data visualization and can be used as
an alternative to labeling data directly.
Direct labeling like this keeps your audience's attention
fixed on your graphic and
helps them identify data quickly.
Legends, on the other hand, force the audience to do more work,
because a legend is
positioned away from the chart's data.
The truth is, the more support we provide our audience,
the less work they have to do trying to
understand what the data is trying to say,
and the faster our story will make an impact.
Now that we've covered how to make
a data viz as effective as possible,
next up, we'll figure out how to make it
accessible to all. See you in a bit.
Pro tips for highlighting key information
Headlines, subtitles, labels, and annotations help you turn your data visualizations into
more meaningful displays. After all, you want to invite your audience into your
presentation and keep them engaged. When you present a visualization, they should be
able to process and understand the information you are trying to share in the first five
seconds. This reading will teach you what you can do to engage your audience
immediately.
If you already know what headlines, subtitles, labels and annotations do, go to the
guidelines and style checks at the end of this reading. If you don’t, these next sections are
for you.
Headlines that pop
A headline is a line of words printed in large letters at the top of a visualization to
communicate what data is being presented. It is the attention grabber that makes your
audience want to read more. Here are some examples:

Which Generation Controls the Senate?: This headline immediately generates
curiosity. Refer to the subreddit post in the dataisbeautiful community,
r/dataisbeautiful, on January 21, 2021.
 Top 10 coffee producers: This headline immediately informs how many coffee
producers are ranked. Read the full article: bbc.com/news/business-43742686.
Check out the chart below. Can you identify what type of data is being represented?
Without a headline, it can be hard to figure out what data is being presented. A graph like
the one below could be anything from average rents in the tri-city area, to sales of
competing products, or daily absences at the local elementary, middle, and high schools.
Turns out, this illustration is showing average rents in the tri-city area. So, let’s add a
headline to make that clear to the audience. Adding the headline, “Average Rents in the
Tri-City Area” above the line chart instantly informs the audience what it is comparing.
Subtitles that clarify
A subtitle supports the headline by adding more context and description. Adding a subtitle
will help the audience better understand the details associated with your chart. Typically,
the text for subtitles has a smaller font size than the headline.
In the average rents chart, it is unclear from the headline “Average Rents in the Tri-City
Area” which cities are being described. There are tri-cities near San Diego, California
(Oceanside, Vista, and Carlsbad), tri-cities in the San Francisco Bay Area (Fremont, Newark,
and Union City), tri-cities in North Carolina (Raleigh, Durham, and Chapel Hill), and tri-cities in the United Arab Emirates (Dubai, Ajman, and Sharjah).
We are actually reporting the data for the tri-city area near San Diego. So adding
“Oceanside, Vista, and Carlsbad” becomes the subtitle in this case. This subtitle enables
the audience to quickly identify which cities the data reflects.
Labels that identify
A label in a visualization identifies data in relation to other data. Most commonly, labels in
a chart identify what the x-axis and y-axis show. Always make sure you label your axes. We
can add “Months (January - June 2020)” for the x-axis and “Average Monthly Rents ($)”
for the y-axis in the average rents chart.
Data can also be labeled directly in a chart instead of through a chart legend. This makes it
easier for the audience to understand data points without having to look up symbols or
interpret the color coding in a legend.
We can add direct labels in the average rents chart. The audience can then identify the data
for Oceanside in yellow, the data for Carlsbad in green, and the data for Vista in blue.
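Here's a rough sketch of how this chart's headline, subtitle, axis labels, and direct labels could be put together in code, assuming invented rent figures and Python's matplotlib (the reading describes the chart conceptually; the numbers below are illustrative only):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
# Hypothetical average monthly rents in dollars
rents = {"Oceanside": [1850, 1870, 1900, 1920, 1950, 1980],
         "Vista":     [1700, 1710, 1730, 1760, 1780, 1800],
         "Carlsbad":  [2100, 2120, 2160, 2200, 2230, 2270]}
colors = {"Oceanside": "gold", "Vista": "tab:blue", "Carlsbad": "green"}

fig, ax = plt.subplots()
for city, values in rents.items():
    ax.plot(months, values, color=colors[city])
    # Direct label at the end of each line instead of a separate legend
    ax.annotate(city, xy=(len(months) - 1, values[-1]),
                xytext=(5, 0), textcoords="offset points", va="center")

# Headline (bold, larger, left-aligned) with a supporting subtitle below it
fig.suptitle("Average Rents in the Tri-City Area", x=0.01, ha="left",
             fontsize=14, fontweight="bold")
ax.set_title("Oceanside, Vista, and Carlsbad", loc="left", fontsize=10)
ax.set_xlabel("Months (January - June 2020)")
ax.set_ylabel("Average Monthly Rents ($)")
plt.show()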
Annotations that focus
An annotation briefly explains data or helps focus the audience on a particular aspect of
the data in a visualization.
Suppose in the average rents chart that we want the audience to pay attention to the rents
at their highs. Annotating the data points representing the highest average rents will help
people focus on those values for each city.
Guidelines and pro tips
Refer to the following table for recommended guidelines and style checks for headlines,
subtitles, labels, and annotations in your data visualizations. Think of these guidelines as
guardrails. Sometimes data visualizations can become too crowded or busy. When this
happens, the audience can get confused or distracted by elements that aren’t really
necessary. The guidelines will help keep your data visualizations simple, and the style
checks will help make your data visualizations more elegant.
Headlines
- Guidelines: Content: briefly describe the data. Length: usually the width of the data frame. Position: above the data.
- Style checks: Use brief language. Don't use all caps. Don't use acronyms. Don't use humor or sarcasm.
Subtitles
- Guidelines: Content: clarify context for the data. Length: same as or shorter than the headline. Position: directly below the headline.
- Style checks: Use a smaller font size than the headline. Don't use undefined words. Don't use acronyms or abbreviations.
Labels
- Guidelines: Content: replace the need for legends. Length: usually fewer than 30 characters. Position: next to the data, or below or beside the axes.
- Style checks: Use a few words only. Use callouts to point to the data. Don't use all caps, bold, or italics.
Annotations
- Guidelines: Content: draw attention to certain data. Length: varies, limited by open space. Position: immediately next to the data annotated.
- Style checks: Don't use all caps, bold, or italicized text. Don't distract viewers from the data.
You want to be informative without getting too detailed. To meaningfully communicate the
results of your data analysis, use the right visualization components with the right style. In
other words, let simplicity and elegance work together to help your audience process the
data you are sharing in five seconds or less.
Accessible visualizations
Hey, great to have you back, let's dive back in.
Over 1 billion people in the world have a disability.
That's more than the populations of the United States, Canada,
France, Italy, Japan, Mexico, and Brazil combined.
Before you design a data viz, it's important to keep that fact in mind.
Not everyone has the same abilities, and
people take in information in lots of different ways.
You might have a viewer who's deaf or hard of hearing and relies on captions, or
someone who's color blind might look to specific labeling for more description.
We've covered a lot of ways to make a data visualization beautiful and informative.
And now it's time to take that knowledge and make it accessible to everyone,
including those with disabilities.
Accessibility can be defined in a number of different ways.
Right from the start,
there's a few ways you can incorporate accessibility in your data visualization.
You'll just have to think a little differently.
It helps to label data directly instead of relying exclusively on legends,
which require color interpretation and more effort from the viewer to understand.
This can also just make it a faster read for those with or without disabilities.
Check out this data viz, the colors make it challenging to read and
the legend is confusing.
Now, if we just remove the legend and add in data labels, bam,
you've got a clearer presentation.
Another way to make your visualizations more accessible is to provide text
alternatives, so that it can be changed into other forms people need,
such as large print, braille, or speech.
Alternative text provides a textual alternative to non-text content.
It allows the content and function of the image to be accessible to those with
visual or certain cognitive disabilities.
Here's an example that shows additional text describing the chart.
And speaking of text, you can make data from charts and
diagrams available in a text-based format through an export to Sheets or Excel.
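For instance, here's a minimal sketch of exporting the numbers behind a chart to a text-based file, assuming hypothetical data and Python's pandas (any spreadsheet export achieves the same thing):

import pandas as pd

# Hypothetical data behind a chart
data = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "sales": [1200, 1350, 1100],
})

# Export the underlying values so they can be read by screen readers,
# opened in Sheets or Excel, or converted to large print or braille
data.to_csv("chart_data.csv", index=False)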
You can also make it easier for
people to see and hear content by separating foreground from background.
Using bright colors
that contrast against the background can help those with poor visibility,
whether permanent or temporary, clearly see the information conveyed.
Another option is to avoid relying solely on color to convey information,
and instead distinguish values with different textures and shapes.
Another general rule is to avoid overcomplicating data visualizations.
Overly complicated data visualizations turn off most audiences
because they can't figure out where and what to focus on.
That's why breaking down data into simple visualizations is key.
A common mistake is including too much information in a single piece, or
including long chunks of text, or too much information in graphs and charts.
This can defeat the whole purpose of your visualization,
making it impossible to understand at first glance.
Ultimately, designing with an accessibility mindset means thinking about
your audience ahead of time.
Focusing on simple, easy to understand visuals, and most importantly, creating
alternative ways for your audience to access and interact with your data.
And when you pay attention to these details, we can find solutions that make
data visualizations more effective for everyone.
So now you've completed your first exploration of data visualization.
You've discovered the importance of creating data viz that cater to your
audience while keeping focus on the objective.
You learned different ways to brainstorm and plan your visualizations, and
how to choose the best charts to meet that objective.
And you also learned how to incorporate elements of science and
even philosophy into your visualizations.
Coming up we'll check out how to take all of these learnings and
apply them in Tableau.
You'll get to see how this data visualization tool makes your data viz
work more efficient and effective.
See you soon.
Designing a chart in 60 minutes
By now, you understand the principles of design and how to think like a designer. Among
the many options of data visualization is creating a chart, which is a graphical
representation of data.
Choosing to represent your data via a chart is usually the simplest and most efficient method.
Let’s go through the entire process of creating any type of chart in 60 minutes. The goal
here is to develop a prototype or mock up of your chart that you can quickly present to an
audience. This will also enable you to have a sense of whether or not the chart is
communicating the information that you want.
Prep: 5 minutes. Talk and listen: 15 minutes. Sketch and design: 20 minutes. Prototype and improve: 20 minutes.
Follow this high level 60-minute chart to guide your thinking whenever you begin working
on a data visualization.
Prep (5 min): Create the mental and physical space necessary for an environment of
comprehensive thinking. This means allowing yourself room to brainstorm how you want
your data to appear while considering the amount and type of data that you have.
Talk and listen (15 min): Identify the object of your work by getting to the “ask behind
the ask” and establishing expectations. Ask questions and really concentrate on feedback
from stakeholders regarding your projects to help you hone how to lay out your data.
Sketch and design (20 min): Draft your approach to the problem. Define the timing and
output of your work to get a clear and concise idea of what you are crafting.
Prototype and improve (20 min): Generate a visual solution and gauge its effectiveness
at accurately communicating your data. Take your time and repeat the process until a final
visual is produced. It is alright if you go through several visuals until you find the perfect
fit.
Key takeaway
This is a great overview you can use when you need to create a visualization in a short
amount of time. As you become more experienced in data visualization, you will find
yourself creating your own process. You will get a more detailed description of different
visualization options in the next reading, including line charts, bar charts, scatter plots, and
more. No matter what you choose, always remember to take the time to prep, identify your
objective, take in feedback, design, and create.
Welcome back.
Mastering online tools like Tableau will make it easier for your audience to
understand difficult concepts or identify new patterns in your data.
Need to help a news outlet showcase changing real estate prices in regional
markets?
Check.
Want to help a nonprofit use their data in better ways to streamline operations?
Check.
Need to explore what video games sales look like over the past few decades?
Double check many different kinds of companies are using Tableau right now to
do all of these things and more.
This means there's a good chance you'll end up using it at some point.
in your career.
But I'm getting ahead of myself.
First, let's talk about what Tableau actually is.
You might remember learning that Tableau is a business intelligence and
analytics platform that you can use online to help people see,
understand, and make decisions with data.
But it's not all business all the time.
Take this data viz, for example,
created by Tableau enthusiast Steve Thomas to record Bigfoot sightings across the US.
It's available on Tableau
Public, which we'll be using together in our activities in this course.
Tableau can help you make and easily share interactive dashboards,
maps, and graphs with your data.
Without any coding, you can connect to data in lots of formats like Excel,
CSV, and Google Sheets.
You might also find yourself working with a company that uses another option,
like Looker or Google Data Studio, for example. Like Tableau, Looker and
Google Data Studio help you take raw data and bring it to life visually,
but each does this in different ways.
For example, while Tableau's offered in a variety of formats like browser and
desktop, Looker and Google Data Studio are completely browser-based.
But here's the great news.
Once you learn the fundamentals of Tableau,
you'll find they easily transfer to other visualization tools.
Ready to get started using it? Then, without further ado, meet Tableau up next.
Visualizations in spreadsheets and Tableau
This reading summarizes the seven primary chart types: column, line, pie, horizontal bar,
area, scatter, and combo. Then, it describes how visualizations in spreadsheets compare to
those in Tableau.
Primary chart types in spreadsheets
In spreadsheets, charts are graphical representations of data from one or more sheets.
Although there are many variations to choose from, we will focus on the most broadly
applicable charts to give you a sense of what is possible in a spreadsheet. As you review
these examples, keep in mind that these are meant to give you an overview of
visualizations rather than a detailed tutorial. Another reading in this program will describe
the applicable steps and process to create a chart more specifically. When you are in an
application, you can always select Help from the menu bar for more information.
To create a chart in Google Sheets, select the data cells, click Insert from the
main menu, and then select Chart. You can set up and customize the chart in
the dialog box on the right.
 To create a chart in Microsoft Excel, select the data cells, click Insert from the
main menu, and then select the chart type. Tip: You can optionally click
Recommended Charts to view Excel’s recommendations for the data you
selected and then select the chart you like from those shown.
These are the primary chart types available:


Column (vertical bar): a column chart allows you to display and compare
multiple categories of data by their values.

Line: a line chart showcases trends in your data over a period of time. The last
line chart example is a combo chart which can include a line chart. Refer to the
description for the combo chart type.

Pie: a pie chart is an easy way to visualize what proportion of the whole each
data point represents.

Horizontal bar: a bar chart functions similarly to a column chart, but is flipped
horizontally.

Area: area charts allow you to track changes in value across multiple categories
of data.

Scatter: scatter plots are typically used to display trends in numeric data.

Combo: combo charts use multiple visual markers like columns and lines to
showcase different aspects of the data in one visualization. The example below
is a combo chart that has a column and line chart together.
You can find more information about other charts here:


Types of charts and graphs in Google Sheets: a Google Help Center page with a
list of chart examples you can download.
Excel Charts: a tutorial outlining all of the different chart types in Excel,
including some subcategories.
How visualizations differ in Tableau
As you have also learned, Tableau is an analytics platform that helps data analysts display
and understand data. Most if not all of the charts that you can create in spreadsheets are
available in Tableau. But, Tableau offers some distinct charts that aren’t available in
spreadsheets. These are handy guides to help you select chart types in Tableau:

Which chart or graph is right for you? This presentation covers 13 of the most
popular charts in Tableau.
 The Ultimate Cheat Sheet on Tableau Charts. This blog describes 24 chart
variations in Tableau and guidelines for use.
The following are visualizations that are more specialized in Tableau with links to
examples or the steps to create them:


Highlight tables appear like tables with conditional formatting. Review the
steps to build a highlight table.
Heat maps show intensity or concentrations in the data. Review the steps to
build a heat map.








Density maps show concentrations (like a population density map). Refer to
instructions to create a heat map for density.
Gantt charts show the duration of events or activities on a timeline. Review the
steps to build a Gantt chart.
Symbol maps display a mark over a given longitude and latitude. Learn more
from this example of a symbol map.
Filled maps are maps with areas colored based on a measurement or
dimension. Explore an example of a filled map.
Circle views show comparative strength in data. Learn more from this example
of a circle view.
Box plots, also known as box-and-whisker charts, show the distribution of
values along a chart axis. Refer to the steps to build a box plot.
Bullet graphs compare a primary measure with another and can be used
instead of dial gauge charts. Review the steps to build a bullet graph.
Packed bubble charts display data in clustered circles. Review the steps to
build a packed bubble chart.
Key takeaway
This reading described the chart types you can create in spreadsheets and introduced
visualizations that are more unique to Tableau.
Misleading visualizations
You can create data visualizations in Tableau using a wide variety of charts, colors, and styles. And
you have tremendous freedom in the tool to decide how these visualizations will look and how they will
present your data.
Below is an example of a visualization created in Tableau:
The example is a heatmap listing different supplies and order dates (2010, 2011, 2012, and 2013); the cells are colored in
yellow, green, and red.
Study the visualization and think about these questions:
Red normally indicates danger or a warning. Why do you think cells are highlighted in red?
Green normally indicates a positive or “go” status. Is it clear why certain cells are highlighted in
green?
The purpose of the color coding isn’t clear without a legend, but can you guess what might have been
the intent?
Post your theory of what the colors mean. In the same post, share in 3-5 sentences (150-200 words)
how this table could be misleading and how you would improve it to avoid confusion. Then, visit the
discussion forum to browse what other learners shared and engage in at least two discussions about
the visualization.
Participation is optional
Stephen Few, an innovator, author, a teacher,
and data visualization expert,
once said, "Numbers have an important story to tell.
They rely on you to
give them a clear and convincing voice."
Facts and figures are
very important in the business world,
but they rarely make a lasting impression.
To create strong communications that make
people think and convince them to take action,
you need data storytelling.
Data storytelling is communicating
the meaning of a data set with
visuals and a narrative that are
customized for each particular audience.
A narrative is another word for a story. In this video,
you'll learn about data storytelling steps.
These are: engage your audience,
create compelling visuals, and
tell the story in an interesting way.
Here's an example from the music streaming industry.
Some companies send their customers
a year in review email.
It highlights the songs
the users have listened to most and
sometimes congratulates them for being
a top fan of a particular artist.
This is a much more exciting way to share
data than just a printout of the customer's activity.
It also reminds the listener about
how much time they spend enjoying the service,
a great way to build customer loyalty.
Here's another example:
some ride-sharing companies are
using data storytelling to show
their customers how many miles they've
traveled and how that equals spending less money on gas,
reducing carbon emissions and saving
time they might otherwise have spent fighting traffic.
It makes it really easy for the rider to clearly see
the value of the service in a simple and fun visual.
Data stories like these keep the customer
engaged and make them feel like
their choices matter because the companies are
taking the time to create something just for them,
and importantly, the stories are interesting.
Knowing how to reach people in this way
is an essential part of data storytelling.
Images can draw us in at a subconscious level.
This is the concept of engaging
people through data visualizations.
So far you've been learning about the importance of
focusing on your audience. Coming up,
you'll keep building on that knowledge.
You'll discover that there are
three data storytelling steps,
and the first is knowing how to engage your audience.
Engagement is capturing and
holding someone's interest and attention.
When your audience is engaged,
you're much more likely to connect with them and
convince them to see the same story you see.
Every data story should
start with audience engagement;
all successful storytellers consider
who's listening first. For instance,
when a kindergarten teacher
is choosing books for their class,
they'll pick ones that are
appropriate for five-year-olds.
If they were to choose high school level novels,
the complex subject matter would probably
confuse the kids and they'd get bored and tune out.
The second step is to create compelling visuals.
In other words, you want to show the story of your data,
not just tell it. Visuals should take your audience on
a journey of how the data changed over
time or highlight the meaning behind the numbers.
Here's an example: let's say a cosmetic company
keeps track of stores that
buy its product and how much they buy.
You could communicate the data to
others in a spreadsheet like this,
or you could create a colorful
visual such as this pie chart,
which makes it easy to see which stores are
most and least profitable as business partners.
That's a much clearer and
more visually interesting approach.
Now, the third and final step is to
tell the story in an interesting narrative.
A narrative has a beginning,
a middle, and an end.
It should connect the data you've
collected to the project objective
and clearly explain
important insights from your analysis.
To do this, it's important that
your data storytelling is organized and concise.
Soon you'll learn how to do that using slides for
discussion during a meeting and a formal presentation.
We'll discuss how the content,
visuals and tone of your message
changes depending on the way you're communicating it.
And speaking of business communications,
one of the many ways that companies
use visualization to tell data stories,
is with word clouds.
Word clouds are a pretty simple visualization of data.
The words are presented in different sizes
based on how often they appear in your data set.
It's a great way to get
someone's attention and to unlock stories from
big blocks of text where
each word alone could never be seen.
Word clouds can be used in all sorts of ways.
On social media, they can show you
which topics show up in posts most often,
or you can use them in blogs to
highlight the ideas that interest readers the most.
This word cloud was created using
text from the syllabus of this course.
It tells a pretty engaging story where data analytics,
analysis, SQL and spreadsheets are,
unsurprisingly, some of the lead characters.
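The sizing logic behind a word cloud is simple enough to sketch in a few lines of code. The course builds word clouds with visualization tools rather than by hand, but the following minimal Python sketch (the sample text and the font-size range are made up for illustration) shows the idea: count how often each word appears, then scale each count to a display size so the most frequent words are drawn largest.

from collections import Counter
import re

# Hypothetical block of text standing in for posts, comments, or a syllabus.
text = """Data analytics uses spreadsheets and SQL. Data analysis tells a story,
and spreadsheets, SQL, and data visualizations help share that story."""

# Count how often each word appears (lowercased, punctuation stripped).
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

# Scale each word's count to a font size between 10 and 60 points,
# so the most frequent word is drawn largest in the cloud.
min_size, max_size = 10, 60
top_count = counts.most_common(1)[0][1]
sizes = {word: min_size + (max_size - min_size) * count / top_count
         for word, count in counts.items()}

for word, size in sorted(sizes.items(), key=lambda item: -item[1])[:5]:
    print(f"{word}: {counts[word]} occurrences -> font size {size:.0f}")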
Let's continue turning the pages
of your data analytics story.
There's lots of action and adventure to come.
Effective data stories
In data analytics, data storytelling is communicating the meaning of a dataset with visuals
and a narrative that is customized for a particular audience. In data journalism, journalists
engage their audience of readers by combining visualizations, narrative, and context into
data-driven articles. It turns out that data analysts and data journalists have a lot in
common! As a junior data analyst, you might learn a few things about effective storytelling
from data journalism. Read further to explore the role and work of a data journalist in
telling a good story.
Note: This reading refers to an article published in The New Yorker. Non-subscribers may
access several free articles each month. If you already reached your monthly limit on free
articles, bookmark the article and come back to this reading later.
Take a tour of a data-driven article
Ben Wellington, a contributing writer for The New Yorker and a professor at the Pratt
Institute, used New York City’s open data portal to track down noise complaints from
logged service requests. He analyzed the data to gain a more quantitative understanding of
where the noise was coming from and which neighborhoods were the noisiest. Then, he
presented his findings in the Mapping New York's Noisiest Neighborhoods article.
First, click the link above to skim the article and familiarize yourself with the data
visualizations. Then, join the bus tour of the data! You will be directed to three
visualizations (tour stops) to observe how each visualization helped strengthen the overall
storytelling in the article.
Tour stop 1: setting context
Earlier in the training, you learned how context is important to understand data. Context is
the condition in which something exists or happens. Based on the categorization of noise
complaints, the data journalist set the context in the article by defining what people
considered to be noise.
In the article, review the combo table and bar chart that categorizes the noise
complaints. Evaluate the visualization:
How does the visualization help set the context? The combo table and bar
chart is effective in summarizing the noise categories as percentages of the
logged complaints. This helps set the context by answering the question, “what
is noise?” Notice that the data journalist created a combo table and bar chart
instead of a pie chart. With 11 noise categories, a list with a bar chart showing
relative proportions is an elegant representation. A pie chart with 11 slices
would have been harder to read.
How does the visualization help clarify the data? If you add the percentages
in the categories in the combo table and bar chart, the total is ninety-eight
percent. There is a difference of two percent that can’t be accounted for in the
visualization. So, rather than clarifying the data, the visualization actually
causes a little confusion. One lesson is to always make sure that your
percentages add up correctly. Sometimes rounding decimal places up or down
causes percentages to be off so they don’t add up to 100% (a short sketch after
this list shows how that can happen).
Do you notice a data visualization best practice? You learned that a
companion table in Tableau shows data in a different way in case some in your
audience prefer tables. It appears that the data journalist had the same idea by
using a combo table and bar chart. Note: As a refresher, a companion table in
Tableau is displayed right next to a visualization. A companion table displays
the same data as the visualization, but in a table format. You may replay the
Getting Creative video which includes an example of a companion table.
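Here is that sketch of the rounding problem. It is only an illustration with made-up complaint counts, not the actual data behind the article, but it shows how rounding each category to a whole percent can leave the total short of 100%.

# Hypothetical complaint counts for a few noise categories.
complaints = {
    "Loud music or party": 424,
    "Construction": 274,
    "Barking dog": 154,
    "Car or truck horn": 84,
    "Other": 64,
}
total = sum(complaints.values())  # 1,000 complaints in this made-up example

# Round each category's share to a whole percent, as a chart label might.
rounded = {category: round(100 * count / total)
           for category, count in complaints.items()}

print(rounded)                                                # 42, 27, 15, 8, 6
print("Sum of rounded percentages:", sum(rounded.values()))   # 98, not 100

Each individual label looks reasonable on its own, but the rounded shares only sum to 98%, which is exactly the kind of small gap a careful reader will notice.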
Tour stop 2: analyzing variables
After setting the context by identifying the noise categories, the data journalist describes
his analysis of the noise data. One interesting analysis is the distribution of noise
complaints versus the time of day.
In the article, review the stacked area chart for the distribution of noise complaints by hour
of the day. Evaluate the visualization:
How does the visualization perform against the five-second rule? Recall
that the five-second rule states that you should understand what is being
conveyed within the first five seconds of seeing a chart. We are guessing that
this visualization performs quite well! The area charts for loud music and
barking dogs help the audience understand that more of these types of noise
complaints were made during late night and early morning hours (between
10:00 PM and 2:00 AM). Notice also that the color coding in the legend aligns
with the colors in the chart. A chart legend normally has the largest category at
the top, but the data journalist chose to order the legend so the largest
category, “Loud music or party,” appears at the bottom instead. How much time
do you think this alignment saved readers?
How does the visualization help clarify the data? Unlike the visualization
from the previous tour stop, this visualization does a better job of clearly
showing that all percentages add up to 100%.
Do you notice a data visualization best practice? As a best practice, both the
x-axis and y-axis should be labeled. But, the data journalist chose to include %
or A.M. and P.M. with each tick on an axis. As a result, labeling the x-axis “Time
of Day” and the y-axis “Percentage of Noise Complaints” isn’t required. This
demonstrates that a little creativity with labeling can help you achieve a cleaner
chart.
Tour stop 3: drawing conclusions
After describing how the data was analyzed, the data journalist shares which
neighborhoods are the noisiest using a variety of visualizations: combo table and bar chart,
density map, and neighborhood map.
In the article, review the neighborhood map for how close a noisy neighborhood is to a
quiet neighborhood. Evaluate the visualization:
How does the visualization help make a point? The data journalist observed
that one of the noisiest neighborhoods was right next to one of the quietest
neighborhoods. The neighborhood map is effective in emphasizing this
observation as a dark blue area versus a white area.
How does the visualization help clarify the data? The visualization classifies
the data by neighborhood and allows the audience to follow along when the
journalist focuses specifically on the Williamsburg, East Williamsburg, and
North Side/South Side neighborhoods.
Do you notice a data visualization best practice? Each neighborhood is
directly labeled so a legend isn’t necessary.
End of the tour: being inspired
We hope you enjoyed your tour of a data journalist’s work! May this inspire your data
storytelling to be as engaging as possible. For additional information about effective data
storytelling, read these articles:
What is Data Storytelling?
The Art of Storytelling in Analytics and Data Science | How to Create Data
Stories?
Use Data and Analytics to Tell a Story
Tell a Meaningful Story With Data
Welcome back. When you
want to communicate something to others,
a great story can help you reach people's hearts and
minds and make them more open to what you have to say.
In other words, stories make people care.
As you learned before,
the first of the three data
storytelling steps teach us
that for a story to be successful,
you need to focus on who's listening.
Data analysts do this by making
sure that they're engaging their audience.
That's what we'll explore together now.
First, you need to know your audience.
Think back to the example of telling
someone a joke they've heard many times
before and expecting them to laugh at
the punchline. Not likely.
To get the response you're seeking,
you've got to understand your audience's point of view.
That means thinking about how
your data project might affect them.
It helps to ask yourself a few questions.
What role does this audience play?
What is their stake in the project?
What do they hope to get from
the data insights I deliver?
Let's say you're analyzing
readership data from customers to help
a magazine publisher decide if they should
switch from quarterly to monthly issues.
If your stakeholder audience
includes people from the printing company,
they're going to care because the change means
they have to order paper and ink more frequently.
They also might need to assign
more staff members to the project.
Or if your stakeholders
include the magazine authors and editors,
you'll want to keep in mind that
your recommendations might change the way they work.
For instance, they might need to write and
edit stories at a faster pace than they're used to.
Once you've considered the answers to those questions,
it's time to choose your primary message.
Every single part of your story
flows from this one key point,
so it's got to be clear and direct.
With that in mind,
let's think about the key message for
the data project about our pretend magazine.
Maybe the readership data from customers shows that
print magazine subscriptions have
been going down recently.
You discover in survey data that this is
mainly because readers feel the information is outdated,
so this finding suggests
that readers would probably appreciate
a publication cycle that gets
the information into their hands
more often. But that's not all.
Your reader survey data also shows that
readers prefer shorter articles with quick takeaways.
The data is generating a lot of possible decision points.
The volume and variety of information
in front of you may feel challenging.
To get the key message,
you'll need to take a few steps back and
pinpoint only the most useful pieces.
Not every piece of data is
relevant to the questions you're trying to answer.
A big part of being a data analyst is knowing
how to eliminate the less important details.
One way to do this is with something called spotlighting.
Spotlighting is scanning through the data to
quickly identify the most important insights.
There are many ways to spotlight,
but lots of data analysts like
to use sticky notes on a whiteboard,
like how archaeologists make sense
of the artifacts they discover in a dig.
To do this, you write
each insight from your analysis on a piece of paper,
spread them out, and display them
on a whiteboard. Then you examine it.
It's important not to get bogged
down in every tiny detail.
Instead, look for broad universal ideas and messages.
Try to find ideas or concepts
that keep popping up again and again
or numbers and words that are repeated often.
Maybe you're finding things that look like
they're connecting or forming patterns.
Highlight these items or group
them together on your whiteboard.
Next, explore your discoveries.
Find the meaning behind the numbers.
The idea is to identify
which insights are most likely to help
solve your business problem
or give you the answers you've been seeking.
This is how spotlighting can lead
you to your key message.
Remember to keep your key message clear and concise,
as an overly-long message like this one shown on
screen has less chance
of conveying the most important conclusion.
Here's a clear, concise message that's
likely to engage your audience
because it's short and to the point.
Of course, no matter how much time and
effort you put into studying your audience,
you can't predict exactly
how they'll react to your recommendations.
But if you follow the steps we're discussing,
you'll be much more likely to have good results.
In an upcoming video,
you'll learn how to deal with situations
that don't go quite according to plan.
That's okay. It happens to all of us.
Have you ever been driving a car when one of
the warning lights on the dashboard suddenly comes on?
Maybe the gas gauge starts
blinking because you're getting low on fuel.
It's handy when you have that alert right in front of you,
clearly showing you that you
need to pay attention to your gas level.
Can you imagine if cars didn't have dashboards?
We'd never know if we were about to run out of gas.
We'd have no idea if our tire pressure was
low or if it was time for an oil change.
Without dashboards,
if our cars started acting differently,
we'd have to pull out the user manual,
sift through all that information inside,
and try to figure out the problem ourselves.
Car dashboards make it easy for
drivers to understand and respond to
any issues with their vehicles because they're
constantly tracking and analyzing the car status.
But as you've been learning,
dashboards aren't just for cars.
Companies also use them to share information,
get people engaged with business plans and goals,
and uncover potential problems.
Just like a car's dashboard,
data analytics dashboards take tons of
information and bring it to life in
a clear, visually-interesting way.
This is extremely important
when telling a story with data,
which is why it's a big part of number two
in our three data storytelling steps.
You've learned that a dashboard is
a tool that organizes information from
multiple data sets into
one central location for tracking,
analysis, and simple visualization
through tables, charts, and graphs.
Dashboards do this by constantly
monitoring live incoming data.
As we've been discussing,
you can make dashboards that are
specifically designed to speak to your stakeholders.
You can think about who will be looking at
the data and what they need from it
and how often they'll use it.
Then you can make a dashboard with
the perfect information just for them.
This is helpful because people can get
confused and distracted when
they're presented with too much data.
A dashboard keeps things neat
and tidy and easy to understand.
When designing a dashboard,
it's best to start simple with
just the most important data points,
and if later on you discover something's missing,
you can always go back and tweak
your dashboard or create a new one.
An important part of dashboard design
is the placement or layout of your charts,
graphs, and other visuals.
These elements need to be cohesive,
which means they're balanced and make
good use of the space on the dashboard.
After you decide what information
should be on your dashboard,
you might need to resize and reorganize
it so it works better for your users.
One option in Tableau is choosing
between a vertical or horizontal layout.
A vertical layout adjusts the height.
A horizontal layout resizes the width
of the views and objects it contains.
Also, as you can see here,
evenly distributing the items within your layout
helps create a clear and organized data visual.
You can select either tiled or floating layouts.
Tiled items are part of a single-layer grid that
automatically resizes based on
the overall dashboard size.
Floating items can be layered over other objects.
In this example, the map and
scatter plots are tiled—they don't overlap.
This really helps make clear what the data is all about,
which is valuable because the majority of
people in the world are visual learners—they
process information based on what they see.
That's why sharing your dashboards with
stakeholders is such a valuable practice.
Now there's something important
to keep in mind about that.
Sharing dashboards with others
likely means that you'll lose control of the narrative;
in other words, you won't be there to tell
the story of your data and share your key messages.
Dashboards put storytelling power
in the hands of the viewer.
That means they'll craft
their own narrative and draw their own conclusions,
but don't let that scare you away from
being collaborative and open.
Just understand the risks
that come with sharing your dashboards.
After all, sharing information and
resources means that you'll have more people
working on the solution to
a big problem or coming up with that next big idea.
This leads to more connections, which can result
in really exciting new practices and innovations.
Live and static insights
Previously, you learned about data storytelling and interpreting your dataset through a narrative. In
this reading, you will explore the difference between live and static insights to make your data even
clearer.
An image of a man driving. His car’s dashboard is made up of a bar chart, a pie chart, a line graph, and
a heat map.
Live versus static
Identifying whether data is live or static depends on certain factors:
How old is the data?
How long until the insights are stale or no longer valid to make decisions?
Does this data or analysis need updating on a regular basis to remain valuable?
Static data involves providing screenshots or snapshots in presentations or building dashboards
using snapshots of data. There are pros and cons to static data.
PROS
Can tightly control a point-in-time narrative of the data and insight
Allows for complex analysis to be explained in-depth to a larger audience
CONS
Insight immediately begins to lose value and continues to do so the longer the data remains in a
static state
Snapshots can't keep up with the pace of data change
Live data means that you can build dashboards, reports, and views connected to automatically
updated data.
PROS
Dashboards can be built to be more dynamic and scalable
Gives the most up-to-date data to the people who need it at the time when they need it
Allows for up-to-date curated views into data with the ability to build a scalable “single source of
truth” for various use cases
Allows for immediate action to be taken on data that changes frequently
Alleviates time/resources spent on processes for every analysis
CONS
Can take engineering resources to keep pipelines live and scalable, which may be outside the scope of
some companies' data resource allocation
Without the ability to interpret data, you can lose control of the narrative, which can cause data chaos
(i.e. teams coming to conflicting conclusions based on the same data)
Can potentially cause a lack of trust if the data isn’t handled properly
Key takeaways
Analysts need to familiarize themselves with the business and the data so they can recommend when a
static analysis needs to be updated or refreshed. This insight will also help you make
the case for what sorts of analyses, visualizations, and additional data are recommended for the types
of decisions that the business needs to make.
Keep this customer survey spreadsheet on hand as it will be useful for the next video.
So far, we've focused a lot on understanding our audience.
Whether you're trying to engage people with data storytelling or
creating dashboards designed for a certain person or group,
understanding your audience is key.
As you've learned, you can make dashboards that are tailored to meet different
stakeholder requirements.
To do this, it's important to think about who will be looking at the data and
what they need from it.
In this video, we'll continue exploring how to create
compelling visuals to tell an interesting and persuasive data story.
One great tool for doing this is a filter. You've learned about filters in
spreadsheets and queries, but as a refresher, filtering means showing only
the data that meets specific criteria while hiding the rest.
Filtering works the same way with dashboards—you
can apply different filters for different users based on their needs.
Tableau lets you limit the data you see based on the criteria you specify.
Maybe you want to filter the data in the data set to show only the last six months, or
maybe you want to see information from one particular customer.
You can even limit the number of rows or columns in a view.
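Filtering is the same idea no matter which tool applies it. The course does this interactively in Tableau, but as a rough illustration only, here is a small Python sketch using the pandas library (the table and column names are hypothetical) that keeps just the last six months of data or just one customer, mirroring the two examples above.

import pandas as pd

# Hypothetical order data; the columns and values are made up for illustration.
orders = pd.DataFrame({
    "customer": ["Acme", "Acme", "Birch Co", "Birch Co", "Cedar LLC"],
    "order_date": pd.to_datetime(
        ["2023-01-15", "2023-08-03", "2023-09-20", "2023-11-11", "2023-12-02"]),
    "amount": [250, 300, 125, 480, 90],
})

# Filter 1: keep only the last six months, relative to the newest order.
cutoff = orders["order_date"].max() - pd.DateOffset(months=6)
recent = orders[orders["order_date"] >= cutoff]

# Filter 2: keep only one particular customer.
one_customer = orders[orders["customer"] == "Birch Co"]

print(recent)
print(one_customer)

Everything that doesn't meet the condition is simply hidden from the result, which is all a dashboard filter is doing behind the scenes.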
To explore these options, let's return to our world happiness example.
Say your stakeholders were interested in only a few
of the topics that affect overall happiness.
Filtering for just gross domestic product, family, generosity,
freedom, trust, and health, and then creating individual scatter plots for
each would make this possible.
You can also use filters to highlight or hide individual data points.
For instance, if you have a scatter plot with outliers,
you may want to explore what your plot would look like without them.
However, note that this is just an example to show you how filters work;
it's not okay to drop a data point just because it's an outlier.
Outliers could be important observations, sometimes even the most interesting ones,
so be sure to put on your data detective hat and
investigate that outlier before deciding to remove it from your dashboard.
Here's how to do it.
To filter data points from the view, we can choose a single data point or
click and drag in the view to select several points.
Let's choose just one. Then on the tooltip that appears,
we'll select "exclude" to hide it, or
we could have chosen to do it the other way by keeping only selected data points.
Here's another example.
If your data is in a table, you can filter entire rows or columns from your view.
To do this, we'll select the rows we want in the view. Then,
on the tooltip that appears, we'll choose to keep only those countries.
Again, we could have also selected the data points we wanted to exclude and
picked that option instead.
Or if you like, we can even prefilter a Tableau dashboard.
This means that your stakeholders don't have to filter the data themselves.
Basically, by doing the filtering for them, you can save them time and
effort and direct them to the important data you want them to focus on.
Personally, I think the best thing about filters is they let
you zero in on what's important.
Sometimes I'm working with a huge data set, and I want to concentrate only on
a specific area, so I'll add a filter to limit the data displayed on my dashboard.
This cuts the clutter and gives me a simple, clear visual.
I use filters a lot when working with data about advertising campaign performance.
Filters help me isolate specific tactics, such as search or YouTube ads, to see
which ones are working best and which ones could be improved.
By limiting and customizing the information I'm looking at,
it's much easier for me to see the story behind the numbers.
And as I'm sure you've noticed, I love a good data story.
As a data analyst, you'll often be relying on spreadsheets to
create quick visualizations of your data to tell your story.
Let's practice building a chart in a spreadsheet. To follow along,
use the spreadsheet link in the previous reading,
also included in the video.
We'll be using Google Sheets, so this might look a little different in other
spreadsheet platforms, like Excel. We'll begin by filtering just the data on how
many customers purchase basic plus or premium software packages.
To start, select the column for the software package and insert a chart.
The spreadsheet suggests what it thinks is the best type of chart for
our data, but we can choose any type of chart we'd like.
Spreadsheet charts also let you assign different styles,
axis titles, a legend, and many other options.
Feel free to explore the different functionality later on.
We'll also cover this more in a reading.
There are lots of different options to choose from.
Let's say we also have data on which countries our customers are from and
their overall satisfaction score for the software they purchased.
First, highlight columns A and B, then click on "insert" and
then "chart" again under "chart type."
You want to select the first map option.
Voila!
Now we have a map that summarizes customer survey scores by country.
We can also customize this chart by clicking "customize" in the top right
corner.
Let's say we wanted to change our colors from red and green to a gradient so
it's more accessible.
We can do that by clicking "geo" and then change the min color to the lightest
shade of blue, the mid color to the middle shade of blue, and the max color
to the darkest shade of blue to show the spectrum of scores from low to high.
Now we have a map chart that shows where respondents are most satisfied with their
software in dark blue and least satisfied with their software in light blue.
And this will be easier for
anyone in our audience with color vision deficiencies to understand.
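If you ever want to sanity-check a chart's numbers outside the spreadsheet, the two summaries in this walkthrough, how many customers bought each software package and the satisfaction score by country, can also be computed with a few lines of Python using pandas. This is only an illustrative sketch with made-up survey rows and package names, not the actual activity data.

import pandas as pd

# Hypothetical survey responses standing in for the customer survey spreadsheet.
survey = pd.DataFrame({
    "country": ["United States", "United States", "Canada", "Mexico", "Canada"],
    "software_package": ["basic", "premium", "premium", "basic", "premium"],
    "satisfaction": [8, 9, 7, 6, 9],
})

# How many customers purchased each software package (the first chart).
package_counts = survey["software_package"].value_counts()

# Average satisfaction score by country (what the map chart colors by).
scores_by_country = survey.groupby("country")["satisfaction"].mean()

print(package_counts)
print(scores_by_country)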
Tableau and spreadsheets are common tools for creating data visualizations.
By using their built-in functionalities like filters and charts,
you can zero in on what information is most important and
create compelling visuals for your audience.
And now that we've explored some ways to create visuals,
it's time to start preparing our data narrative. Coming up,
we're going to talk more about telling stories with data and
organizing presentations.
I'll see you soon.
Hands-On Activity: Build a dashboard in Tableau
Activity overview
The video you just watched showed you how to create a dashboard in Tableau. Now, you can use
the template, dataset, and instructions in this activity to create the visualization yourself. Feel free to
refer back to the previous video if you get stuck.
In previous activities, you linked data sources and created data visualizations. Now, you’ll use what
you learned about the process of data visualization to add data to a dashboard.
By the time you complete this activity, you will be able to create and use a dashboard to present
data in an accessible and interactive way. This will enable you to communicate your work and
display dynamic data in professional settings.
Note: You will need the Tableau Public Desktop app to import the Dashboards Starter Template in
this activity. For more information on downloading the Tableau Public app, see the Reading:
Optional: Using Tableau Desktop. If you are unable to download the app to your device, use the two
visualizations you created in the last Tableau activities as Sheet 1 and Sheet 2 of this activity.
What you will need
A starter template with a few existing data sources and visualizations and a data set have been
provided. Click the link to the folder containing the starter template and data set.
If you are logged into your Google Account:
Click and drag to highlight both the template and the data set. Then, right-click on the selected files
and click Download.
If you are not logged into your Google Account:
To download both items, click the DOWNLOAD ALL button in the top right corner of the page. You
do not need a Google account to download the files.
Download the starter template and data set: Starter template and data set
Open the template and load the data
In a business context, data visualizations are most useful when they are presented in a dashboard-style format to stakeholders. Dashboards put all the pertinent information in the same place, making
it easier to understand the important takeaways. Many dashboards are also constantly updating to
reflect new data, and some are even interactive. No matter what style of dashboard you choose,
they can help you deliver the work you’ve done when creating visualizations.
Now it's time to begin the activity. After you download the Dashboards Starter Template, find the
file in your storage and open it in Tableau Public Desktop.
Upon opening the Tableau project template, your screen should look like this:
An image with multiple multicolored lines, each representing different data.
The Dashboards Starter Template workbook allows you to explore and manipulate the visualizations
found in two sheets: Sheet 1 and Sheet 2. However, the Tableau workbook does not contain the
actual dataset. Next, you will load the dataset.
To load the actual dataset:
1. Click the Data Source tab in the bottom left-hand corner of the window. This will open the
Datasources folder Tableau Public has created on your computer by default.
2. Navigate to the location on your computer where you downloaded the World Bank CO2 dataset
and open it.
3. Locate the My Tableau Repository folder on your computer. This is usually placed in the
Documents folder of your local files. If you cannot find the folder, use the search bar in your
computer’s file explorer.
4. Double-click the folder My Tableau Repository, then double-click the folder Datasources.
5. Drag your datasets for Tableau from where you downloaded them into the Datasources folder.
This will help you keep track of your datasets for various projects and stay organized.
Note: As a best practice, you should always move your datasets for Tableau into the Datasources
folder.
Create a dashboard
The example project contains the World Bank CO2 dataset, with two separate visualizations. Click
Sheet 1. This visualization shows the average CO2 per capita of each country. Now, click Sheet 2.
This visualization is a line chart of the CO2 production of each global region over time.
You will use these visualizations to create a dashboard. Click the Add Dashboard button, which is
the middle button on the bottom row with a symbol that appears like a spreadsheet with a plus sign.
This will open a new dashboard. Your screen should appear like this:
Now, you just need to add some visualizations to your dashboard.
Add visualizations
To add visualizations, drag the appropriate sheets onto the dashboard in the layout that you prefer.
In this case, you’ll add the map visualization from Sheet 1 on top of the line graph from Sheet 2.
1. Start by finding Sheet 1 in the Sheets section on the left side of the screen. Click and drag Sheet
1 onto the area that says Drop sheets here. Your screen should appear like this:
2. Click and drag Sheet 2 onto the visualization. You’ll notice that the visualization adjusts to show
the layout depending on where you drag the sheet. Place Sheet 2 so that it takes up the bottom half.
Clean the dashboard
The dashboard currently contains three legends, but only two of them are needed. The topmost
legend of grayscale values represents the CO2 Per Capita by size.
CO2 per capita is represented by size and color. As such, Tableau creates two legends. To simplify
the visualization, your best choice is to delete the topmost legend that corresponds to size.
The relationship between small and large emissions can be interpreted by the relative sizes of the
circles. However, the color representing the number of emissions per capita is not interpretable
without the legend.
1. Delete the topmost legend. To do this, click it and then click the X attached to it to remove it from
the dashboard.
Now that it’s been removed, you’ll set the remaining legends to float.
2. Click on a legend.
3. Click the arrow pointing downwards for More Options. From there, select Floating.
4. Drag the legend onto the top-right corner of the map visualization.
5. Repeat steps 2-4 and float the remaining legend onto the top-right corner of the bottom graph.
Once you’ve done it, your dashboard should appear like this:
You’ve now created a basic dashboard. Tableau contains tons of other functionality that allows for
dashboards that update in real-time or interactive dashboards and visualizations.
Businesses everywhere know the power of using data to solve problems and
achieve goals.
But all the data in the world won't get you anywhere if your stakeholders can't
understand it or if they can't stay focused on what you're telling them.
So you want to create presentations that are logically organized,
interesting, and communicate your key messages clearly.
An effective presentation supports your narrative by making it
more interesting than words alone.
It starts with how you want to organize your data insights.
The narrative you share with your stakeholders needs characters,
a setting, a plot, a big reveal, and an "aha moment," just like any other story.
The characters are the people affected by your story.
This could be your stakeholders, customers, clients, and others.
When adding information about your characters to your story,
you have a great opportunity to include a personal account and bring more human
context to the facts that the data has revealed—think about why they care.
Next up is a setting, which describes what's going on,
how often it's happening, what tasks are involved, and other background
information about the data project that describes the current situation.
The plot,
sometimes called the conflict, is what creates tension in the current situation.
This could be a challenge from a competitor, an inefficient process that
needs to be fixed, or a new opportunity that the company just can't pass up.
This complication of the current situation should reveal the problem your analysis
is solving and compel the characters to act.
The big reveal, or
resolution, is how the data has shown that you can solve the problem
the characters are facing by becoming more competitive, improving a process,
inventing a new system, or whatever the ultimate goal of your data project may be.
Finally, your "aha moment" is when you share your recommendations and
explain why you think they'll help your company be successful.
When I'm working on a presentation, this is where I like to start, too.
Using these basic elements to outline your presentation could be a great place
to start, and they can help you organize your findings into a clear story.
And once you've decided on these five key parts of your story,
it's time to think about how to pair your narrative with interesting visuals
because as you're learning, an interesting and
persuasive data story needs interesting and persuasive visuals. Coming up,
you'll learn even more about how to be an expert data storyteller.
[MUSIC] Hi, my name is Sundas, and
I'm an analytical lead at Google. My role is turning data
into powerful stories that influence business decisions.
I have an untraditional background: there is a six-year gap between
my high school and my college career.
So for me, when I was trying to start all over again,
I started at a community college. That was my first exposure to online learning,
and it was perfect because I was managing kids at home.
So at Google, we talk a lot about imposter syndrome, and
I personally relate to it quite a bit. Being the first female in my family
to graduate from university and also being an immigrant,
a lot of times I'm surrounded by people who do not look like me.
For example,
there was one time where I was presenting to senior leaders in my org, and
I was so nervous presenting to them. I was like, I'm going to totally blow this up,
and they're going to figure out that I'm just a total fraud and a fake.
One of the things that I changed is that, even though I was the only female on my
team, I started networking and expanding my network.
And I met a lot of women who are from the country where I am from. They were also
immigrants, they also struggled with English, and they also looked like me, and
they were doing very well in their careers; they were being successful.
So when I looked at them, I was like, okay, if they can do it, then so can I.
That for me was a very big confidence boost that helped me get over that
imposter syndrome feeling.
But I struggle with it day to day. I'm struggling with it right now, standing
in front of you: do I even deserve to be talking about my journey and my skills?
So it's completely normal. There are a few things that I like to do.
One is that I like to give myself a pep talk; a pep talk definitely works.
Just saying you're totally worth it, you deserve it,
does wonders for me personally.
The second thing I like to do is keep a log of my successes and failures.
So when I am at a down point, when I'm feeling down or
feeling I do not belong here, I look at all the things that I have
achieved in that log, and that kind of helps me.
That's a good reminder of the hard work that I put in to get here:
I did not get here because of luck,
I got here because I worked hard and I earned it.
My family is actually really proud of me. After seeing me go to school and
graduate with two kids, my younger brother
actually went to school with two kids as well, and he graduated;
he finished his master's program. My sister-in-law, who also had two kids
she was managing, saw that I could do it and
had somebody to look up to,
and so she went back to school and finished her degree as well.
So I think just being the first in my family was really hard,
because I didn't have anybody to look up to.
But now I am that person that people in my family can look up to, specifically girls,
and they can pursue whatever they put their minds to.
Hi again.
Earlier in this program, you learned how to keep your audience in mind when
communicating your data findings. By making sure that you're thinking about who your
audience is and what they need to know,
you'll be able to tell your story more effectively.
In this video, we'll learn how to use a strategic framework to help your
audience understand the most important takeaways from your presentation.
To make your data findings accessible to your audience,
you'll need a framework to guide your presentation.
This helps to create logical connections that tie back to the business tasks and
metrics. As a quick reminder,
the business task is the question or problem
your data analysis answers.
The framework you choose gives your audience context to better understand
your data.
On top of that, it helps keep you focused on the most important information during
your presentation.
The framework for
your presentation starts with your understanding of the business task.
Raw data doesn't mean much to most people, but
if you present your data in the context of the business task,
your audience will have a much easier time connecting with it.
This makes your presentation more informative and
helps you empower your audience with knowledge.
That's why understanding the business task early on is key.
Here's an example.
Let's say we're working with a grocery store chain.
They've asked us to identify trends in online searches for
avocados to help them make seasonal stocking decisions.
During our presentation, we want to make sure that we continue focusing on
this task and framing our information with it.
Let's check out this example slide presentation.
We can begin our presentation by framing it with the business task here.
In this second slide, I've added goals for the discussion.
It starts with "share an overview of historical online avocado searches."
Under that, a more detailed explanation:
"We'll cover how avocado searches have grown year over year and
what that means for your business."
Then we'll "examine seasonal trends in
online avocado searches using historical data."
This is important because "understanding seasonal trends can
help forecast stocking needs and inform planning." And
finally, "discuss any potential areas for further exploration."
This is where we'll address next steps in the presentation.
This clearly outlines the presentation so our audience knows what to expect.
It also lets them know how the information we share is going to be connected to
the business task. You might remember,
we talked about telling a story with data before.
You can think of this like outlining the narrative.
We can do the same thing with our data
viz examples.
If we're showing this visual graph of annual searches for
avocados, we might want to frame it by saying this graph shows the months with
the most online searches for avocados last year,
so we can expect that this interest in avocados will fall on the same
months this year. That can even be used in our speaker notes for the slide.
This is a great place to add important points you want to remember during
the presentation ahead of time.
These notes aren't visible to your audience in presentation mode, so
they're great reminders you can refer to as you present.
Plus, you could even share your presentation with speaker notes ahead of
time to make the content more accessible for your audience.
Using this data, the grocery store can anticipate demand and
make a plan to stock enough avocados to match their customers' interests.
That's just one way we can use the business task to frame our data and
make it easier to understand.
You also want to make sure you're outlining and
connecting with your business metrics by showcasing what business metrics you use.
You can help your audience understand the impact your findings will have.
Think about the metrics we use for our avocado presentation.
We track the number of online searches for avocados from different months over
several years to anticipate trends and demand.
By explaining this in our presentation, it's easy for
our audience to understand how we used our data.
These data points alone—the
dates or number of searches—aren't useful for our audience,
but when we explain how they're combined as metrics, the data we're sharing makes so
much more sense. Here's another potential data viz that we want to use.
We can frame it for our audience by including some of our metrics.
There's an explanation of what time period this data covers:
"Our data shows Google search queries from 2004 to 2018." Where we
gathered this data from: "Search queries are limited to the United States only."
And a quick explanation of how the trends are being measured:
"Google trends scores are normalized at 100."
So now that our audience understands the metrics we use to organize this data,
they'll be able to understand the graph more clearly.
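The slide's note that scores are "normalized at 100" roughly means each value is expressed relative to the busiest point in the series, which gets a score of 100. Google doesn't publish its exact method, so treat this as a rough Python sketch with made-up weekly counts, not real Google Trends data.

# Hypothetical weekly search counts (not real Google Trends data).
weekly_searches = [1200, 1800, 4500, 9000, 3600, 2700]

# Scale the series so the busiest week scores 100 and every other week
# is expressed as a share of that peak.
peak = max(weekly_searches)
trend_scores = [round(100 * count / peak) for count in weekly_searches]

print(trend_scores)  # [13, 20, 50, 100, 40, 30]

A score of 50, for example, means that week had about half as many searches as the busiest week in the period.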
Using a strategic framework to guide your presentation
can help your audience understand your findings,
which is what the sharing phase of the data analysis process is all about.
Coming up, we'll learn even more about how to weave data into your presentations.
Hey, great to have you back.
So we know how to use our business tasks and
metrics to frame our data findings during a presentation.
Now let's talk about how you work data into your presentations to help your
audience better understand and interpret your findings.
First, it's helpful for
your audience to understand what data was available during data collection.
You can also tell them if any new relevant data has come up, or
if you discovered that you need different data.
For our analysis, we used data about online searches for
avocados over several years.
The data we collected includes all searches with the word "avocado," so
it includes a lot of different kinds of searches.
This helps our audience understand what data they're actually looking at and
what questions they can expect it to answer.
With the data we collected on searches containing the word avocado,
we can answer questions about the general interest in avocados.
But if we wanted to know more about something specific, like guacamole,
we'd probably need to collect different data to better understand that part of our
search data.
Next, you'll want to establish the initial hypothesis.
Your initial hypothesis is a theory you're trying to prove or disprove with data.
In this example, our business task was to compile average monthly prices.
Our hypothesis is that this will show clear trends that can help
the grocery store chain plan for avocado demand in the coming year.
You want to establish your hypothesis early in the presentation.
That way, when you present your data,
your audience has the right context to put it in.
Next, you'll want to explain the solution to your business tasks using examples and
visualizations.
A good example is the graph we used last time that clearly visualized the search
trend score for the word avocado from year to year.
Raw data could take time to sink in, but a good example or visualization can make it
much easier for your audience to understand you during a presentation.
Keep in mind, presenting your visualizations effectively is just as
important as the content, if not more.
And that's where the McCandless Method we learned about earlier can help.
So let's talk through the steps of this method and
then apply them to our own data visualizations.
The McCandless Method moves from the general to the specific,
like it's building a pyramid.
You start with the most basic information: introduce the graphic you're
presenting by name.
This directs your audience's attention.
Let's open the slide deck we were working on earlier.
We've got the framework we explored last time and our two data viz examples.
According to the McCandless Method, we want to introduce our graphic by name.
The name of this graph, "yearly avocado search trends," is clearly written here.
When we present it, we'll be sure to share that title with our audience so
they know where to focus and what the graphic is all about.
Next, you'll want to answer the obvious questions your audience might have before
they're asked.
Start with the high-level information and
work your way into the lowest level of detail that's useful to your audience.
This way, your audience won't get distracted trying to understand something
that could have easily been answered when the graphic was introduced.
We added in the information about when, where, and
how this data was gathered to frame this data viz.
But it also answers the first question many stakeholders will ask,
"Where is this data from, and what does it cover?"
So going back to the second graph in our presentation, let's think about
some obvious questions our audience might have when they see this graph at first.
This data viz is really interesting, but it can be hard to understand at a glance,
so our audience might have questions about how to read it.
Knowing that, we can add an explanation to our speaker notes to answer these
questions as soon as this graph is introduced.
"This shows time running in a circle with winter months on top and summer on bottom.
The farther elements are away from the center,
the more queries happened around that time for 'avocado.'"
Now some of the answers to these questions are built into our presentation.
Once you've answered any potential questions your audience might have,
you'll want to state the insight your data viz provides.
It's important to get everyone on the same page before you move into the supporting
details.
We can write in some key takeaways to this slide to help our audience understand
the most important insights from the graphic.
Here we let the audience know that this data shows us a consistent seasonal
trend year over year.
We can also see that there's low online interest in avocados from October through
December.
This is an important insight that we definitely want to share.
Even though avocados are a seasonal summer fruit,
searches peak in January and February.
For a lot of people in the United States, watching the Super Bowl and
eating chips with guacamole is popular this time of year.
Now our audience knows what takeaways we want them to have before moving on.
The fourth step in the McCandless Method is calling out
data to support that insight.
This is your chance to really wow your audience,
so give as many examples as you can.
With our avocado graphs, it might be worth pointing to specific examples.
In our monthly trends graph, we can point to specific weeks recorded here.
"During the week of November 25th, 2018, the search score was around 49, but
the week of February 4th the search score was 90.
This shows the rise and fall of online search interest, with the help of some of
the very cool data in our graphs."
Finally, it's time to tell your audience why it matters.
This is the "so what" moment.
Why is this insight interesting or important to them?
This is a good time to present the possible business impact of the solution
and clear action stakeholders can take.
You might remember that we outlined this in our framework at the beginning of our
presentation.
So let's explain what this data helps our grocery store stakeholder do.
First, they can account for
lower interest in avocados between the months of October and December.
They can also prepare for
the Super Bowl surge in avocado interest in late January/early February.
And they'll be able to consider how to optimize stocking practices
during summer and spring.
There's a little more detail under each of these points, but
this is a basic breakdown of the impact.
And that's how we use the McCandless Method to introduce
data visualizations during our presentations.
I have one more piece of advice.
Take a second to self-check and ask yourself, "Does this data point or
chart support the point I want people to walk away with?"
It's a good reminder to think about your audience every time you add data to
a presentation.
So now you know how to present data using a framework, and
weave data into your presentation for your audience. And
you got to learn the McCandless Method for data presentation.
Coming up, we'll learn some best practices for actually creating presentations.
See you soon.
Step-by-step critique of a presentation
This reading provides an orientation of two upcoming videos:
Connor: Messy example of a data presentation
Connor: Good example of a data presentation
To get the most out of these videos, you should watch them together (back to back). In the
first video, Connor introduces a presentation that is confusing and hard to follow. In the
second video, he returns to talk about what can be done to improve it and help the
audience better understand the data and conclusions being shared.


Messy data presentation
In the first video, watch and listen carefully for the specific reasons the “messy”
presentation falls short. Here is a preview:
No story or logical flow
No titles
Too much text
Inconsistent format (no theme)
No recommendation or conclusion at the end
Messy presentation: people don’t know where to focus their attention
The main problem with the messy presentation is the lack of a logical flow. Notice also how
the data visualizations are hard to understand and appear without any introduction or
explanation. The audience has no sense of what they are looking at and why. When people
in the audience have to figure out what the data means without any help, they can end up
being lost, confused, and unclear about any actions they need to take.
Good data presentation
In the second video, numerous best practices are applied to create a better presentation on
the same topic. This “good” presentation is so much easier to understand than the messy
one! Here is a preview:
Title and date the presentation was last updated
Flow or table of contents
Transition slides
Visual introduction to the data (also used as a repeated theme)
Animated bullet points
Annotations on top of visuals
Logic and progression
Limitations to the data (caveats) - what the data can’t tell you
Tip: As you watch this video, take notes about what Connor suggests to create a good
presentation. You can keep these notes in your journal. When you create your own
presentations, refer back to your notes. This will help you to develop your own thinking
about the quality of presentations.
Good presentation: people are logically guided through the data
The good presentation logically guides the audience through the data – from the objectives
at the beginning all the way to the conclusions at the end. Notice how the data
visualizations are introduced using a common theme and are thoughtfully placed before
each conclusion. A good presentation gives people in the audience the facts and data, helps
them understand what the data means, and provides takeaways about how they can use
their understanding to make a change or do some good.
Up next
Get started with the messy vs. good presentation comparison by viewing the first video:
Connor: Messy example of a data presentation.
Hey there. So far we've learned about using a framework
to guide your audience through
your presentation and how to weave data in.
Now I want to talk about why
these presentation skills are so important
and give you some simple tips
you can use during your own presentations.
As a data analyst,
you have two key responsibilities:
analyze data and present your findings effectively.
Analyzing data seems pretty obvious.
It's in the title "data analyst," after all.
But data analysis is all about turning
raw information into knowledge.
If you can't actually
communicate what you've learned during your analysis,
then that knowledge can't help anyone.
There's plenty of ways data analysts communicate:
emails, memos, dashboards, and of course, presentations.
Effective presentations start with
the things we've already talked about,
like creating effective visualizations
and organizing your slides,
but how you deliver those things can make
a big difference in
how well your audience understands them.
You want to make sure they leave
your presentation empowered by
the knowledge and ready to make
decisions based on your analysis.
That's why strong presentation skills
are so important as a data analyst.
If the idea of giving a presentation makes you nervous,
don't worry—a lot of people feel that way.
Here's a secret: it gets easier the more you practice.
Now let's look at some tips and tricks
you can use when giving your presentations.
We'll go over some more advanced ones later,
but let's start with the basics for now.
It's natural to feel
your adrenaline levels rise before giving a presentation.
That's just because you're excited to be there.
To help keep that excitement in check,
try taking deep, controlled breaths
to calm your body down.
As a bonus, this will also
help you channel all that excitement into
a presentation style that
shows your passion for the work you've done.
You might remember we talked earlier about using
the McCandless Method to present data visualizations.
Well, it's also a good rule of
thumb for presentations in general.
Start with the broader ideas,
the obvious questions your audience might have,
and what they need to understand to
put your findings in context.
Then you can get more specific about
your analysis and the insights you've uncovered.
Let's go back to our avocado example and
imagine how we'd start that presentation.
After we introduce ourselves
and the title of our presentation,
we have a slide with our goals for the discussion.
We start with the most general goals
and then get more specific.
We might say our goal for today is to first provide
you all with the state of the world
on online avocado searches.
Then we'll examine the opportunities and risks of
seasonal trends in online avocado searches.
We'll move into actionable next steps that can
help you start taking advantage of these opportunities,
as well as help to mitigate the risks.
Finally, we'd love to make the third part a
discussion with you about
what you think of these next steps.
What you'll want to notice here is how our presentation
focuses on the general interest in avocados online
before getting into specifics about
what that means for our stakeholders.
We also learned about the five-second rule.
As a quick refresher,
whenever you introduce a data visualization,
you should use the five-second rule
and ask two questions.
First, wait five seconds after showing
a data visualization to let your audience process it,
then ask if they understand it.
If not, take time to explain it,
then give your audience another five seconds to let that
sink in before telling them
the conclusion you want them to understand.
Try not to rush through data visualizations.
This will be the first time some of the people
in your audience are encountering your data,
and it's worth making time in
your presentations for them.
Here's our first data viz in the avocado presentation.
When we get to this slide,
we want to introduce
our yearly avocado search trends graph
and explain the basic background we've included here.
After we wait five seconds,
we can ask, "Are there any questions about this graph?"
Let's say one of our stakeholders asks,
"Could you explain Google search trends?"
Great. After explaining that,
we wait another five seconds,
then we can tell them our conclusion:
Searches for avocados have been increasing every year.
You'll learn more about these concepts later on,
but these are some great tips for starting out.
Finally, when it comes to presenting
data, preparation is key.
For some people, that means doing dress rehearsals.
For others, it means writing out
a script and repeating it in their head.
Others find visualizing themselves
giving the presentation helps.
Try to find a method that works for you.
The most important thing to remember
is that the more prepared you are,
the better you'll perform when the lights
are on and it's your turn to present.
Coming up, we'll cover more best practices for
presentations and also look at
some examples. Looking forward to it.
Guide: Sharing data findings in presentations
Use this guide to help make your presentation stand out as you tell your data story. Follow
the recommended tips and slide sequence in this guide for a presentation that will truly
impress your audience.
You can also download this guide as a PDF, so you can reference it in the future:
Sharing your data findings in presentations _ Tips and Tricks.pdf
Telling your data story (tips and tricks to present your data
and results)
Use the following tips and sample layout to build your own presentation.
Tip 1: Know your flow
Just like in any good story, a data story must have a good plot (theme and flow), good
dialogue (talking points), and a great ending or big reveal (results and conclusions). One
flow could be an overview of what was analyzed followed by resulting trends and potential
areas for further exploration.
In order to develop the right flow for your presentation, keep your audience in mind. Ask
yourself these two questions to help you define the overall flow and build out your
presentation.
Who is my audience?
If your intended audience is executives, board members, directors, or other C-level (C-suite) executives, your storytelling should be kept at a high level. This audience will want to hear about your story but might not have time to hear the entire story. Executives tend to focus on endings that encourage improving, correcting, or inventing things. Keep your presentation brief and spend most of your time on your results and recommendations. Refer to an upcoming topic in this reading—Tip 3: End with your recommendations.
If your intended audience is stakeholders and managers, they might have more
time to learn about how you performed your analysis and they might ask more
data-specific questions. Be prepared with talking points about the aspects of
your analysis that led you to your final results and conclusions.
If your intended audience is other analysts and individual contributors, you will
have the most freedom—and perhaps the most time—to go more deeply into
the data, processes, and results.
What is the purpose of my presentation?
If the goal of your presentation is to request or recommend something at the
end, like a sales pitch, you can have each slide work toward the
recommendations at the end.
If the goal of your presentation is to focus on the results of your analysis, each
slide can help mark the path to the results. Be sure to include plenty of
breadcrumbs (views of the data analysis steps) to demonstrate the path you
took with the data.
If the goal of your presentation is to provide a report on the data analysis, your
slides should clearly summarize your data and key findings. In this case, it is
alright to let the data be the star or speak for itself.
Tip 2: Prepare talking points and limit text on slides
As you create each slide in your presentation, prepare talking points (also called speaker
notes) on what you will say.
Don’t forget that you will be talking at the same time that your audience is reading your
slides. If your slides start becoming more like documents, you should rethink what you will
say so that you can remove some text from the slides. Make it easy for your audience to
skim the slides while still paying attention to what you are saying. In general, follow
the five-second rule. Your audience should not be spending more than five seconds reading
any block of text on a slide.
Knowing exactly what you will say when explaining each slide throughout your
presentation also creates a natural flow to your story. Talking points help you avoid
awkward pauses between topics. Slides that summarize data can also be repetitive (and
boring). If you prepare a variety of interesting talking points about the data, you can keep
your audience alert and paying attention to the data and its analysis.
Tip 3: End with your recommendations
When climbing a mountain, getting to the top is the goal. Making recommendations at the
end of your presentation is like getting to the mountaintop.
Use one slide for your recommendations at the end. Be clear and concise.
If you are recommending that something be done, provide next steps and
describe what you would consider a successful outcome.
Tip 4: Allow enough time for the presentation and questions
Assume that everyone in your audience is busy. Keep your presentation on topic and as
short as possible by:
Being aware of your timing. This applies to the total number of slides and the
time you spend on each slide.
Presenting your data efficiently. Make sure that every slide tells a unique and
important part of your data story. If a slide isn’t that unique, you might think
about combining the information on that slide with another slide.
Saving enough time for questions at the end or allowing enough time to answer
questions throughout your presentation.
Putting it all together: Your slide deck layout
In this section, we will describe how to put everything together in a sample slide deck
layout.
First slide: Agenda
Provide a high-level bulleted list of the topics you will cover and the amount of time you
will spend on each. Every company’s norms are different, but in general, most
presentations run from 30 minutes to an hour at most. Here is an example of a 30-minute
agenda:
Introductions (4 minutes)
Project overview and goals (5 minutes)
Data and analysis (10 minutes)
Recommendations (3 minutes)
Actionable steps (3 minutes)
Questions (5 minutes)
Second slide: Purpose
Not everyone will be familiar with your project or know why it is important. They didn’t
spend the last couple of weeks thinking about the analysis and results of your project like
you did. This slide summarizes the purpose of the project and why it is important to the
business.
Here is an example of a purpose statement:
Service center consolidation is an important cost savings initiative. The aim of this project
was to determine the impact of service center consolidation on customer response times.
Third slide: Data/analysis
First, it really is possible to tell your data story in a single slide if you summarize the key
things about your data and analysis. You may have supporting slides with additional data
or information in an appendix at the end of the presentation.
But, if you choose to tell your story using more than one slide, keep the following in mind:
Slides typically have a logical order (beginning, middle, and end) to fully build the story.
Each slide should logically introduce the slide that follows it. Visual cues from the slides or verbal cues from your talking points should let the audience know when you will go on to the next slide.
Remember not to use too much text on the slides. When in doubt, refer back to the second tip on preparing talking points and limiting the text on slides.
The high-level information that people read from the slides shouldn’t be the same as the information you provide in your talking points. There should be a nice balance between the two to tell a good story. You don’t want to simply read or say the words on the slides.
For extra visuals on the slides, use animations. For example, you can:
Fade in one bullet point at a time as you discuss each on a slide.
Only display the visual that is relevant to what you are talking about (fade out non-relevant visuals).
Use arrows or callouts to point to a specific area of a visual that you are using.
Fourth slide: Recommendations
If you have been telling your story well in the previous slides, the recommendations will be
obvious to your audience. This is when you might get a lot of questions about how your
data supports your recommendations. Be ready to communicate how your data backs up
your conclusion or recommendations in different ways. Having multiple ways to state the
same thing also helps if someone is having difficulty with one particular explanation.
Fifth slide: Call to action
Sometimes the call to action can be combined with the recommendations slide. If there are
multiple actions or activities recommended, a separate slide is best.
Recall our example of a purpose statement: Service center consolidation is an important
cost savings initiative. The aim of this project was to determine the impact of service center
consolidation on customer response times.
Suppose the data analysis showed that service center consolidation negatively impacted
customer response times. A call to action might be to examine if processes need to change
to bring customer response times back to what they were before the consolidation.
Wrapping it up: Getting feedback
After you present to your audience, think about how you told your data story and how you
can get feedback for improvement. Consider asking your manager or another data analyst
for candid thoughts about your storytelling and presentation overall. Feedback is great to
help you improve. When you have to write a brand new data story (or a sequel to the one
you already told), you will be ready to impress your audience even more!
Hey, good to see you again.
By now you've learned some ways to organize and
incorporate data into your presentations.
You've also covered why
effective presentation skills are
so important as a data analyst.
Now you're ready to start presenting like a pro.
Coming up, I'll share some pro tips and
best practices with you. Let's get started.
We've talked about how important
your audience is throughout
this program, and it's
especially important for presentations.
It's also important to remember that
not everyone can experience
your presentations the same way.
Sharing your presentation via email
and putting some forethought into how accessible
your data viz is before your presentation can
help ensure your work is accessible and understandable.
But during the actual presentation,
it can be tempting to focus on
what's most interesting and exciting to
us and not on what the audience actually needs to hear.
Sometimes, even the best audiences
can lose focus and get distracted,
but here are a few things you can do during
your final presentation to help you
stay focused on your audience and keep them engaged.
First, try to keep in mind that your audience
won't always get the steps
you took to reach a conclusion.
Your work makes sense to you because you did it—this
is called the curse of knowledge.
Basically, it means that because you know something,
it can be hard to imagine your audience not knowing it.
It's important to remember that
your audience doesn't have the same context you
do, so focus on
what information they need to
reach the same conclusion you did.
Earlier, we covered some useful things you
can add to your presentations to help with this.
First, answer basic questions
about where the data came from and what it covers:
How is it collected?
Does it focus on a specific time or place?
You can also include
your guiding hypothesis and
the goals that drove your analysis.
Adding any assumptions or methods you
used to reach your conclusions can also be useful.
For example, in our avocado presentation,
we grouped months by season and looked at overall trends.
And finally, explain your conclusion
and how you reached it.
Your audience also has a lot on their mind already.
They might be thinking about
their own work projects
or what they want to have for lunch.
They aren't trying to be rude, and it
doesn't mean they aren't interested;
they're just busy people with a lot going on.
Try to keep your presentation focused and
to the point to keep their minds from wandering.
Try not to tell stories that take
your audience down an unrelated line of
thinking, and try not to go into
too much detail about things
that don't concern your audience.
You might have found a really exciting new SQL database,
but unless your presentation is about databases,
you can probably leave that out.
Your audience can also be easily
distracted by information in your presentation.
For example, the more you include in a chart,
the more your audience will need to explore it.
Try to avoid including information in your presentations that
you don't think will be productive to
discussions with your audience.
Share just the right amount of content to keep
your audience focused and ready to take action.
It's also good to note that how
you present information is just as
important as what you present, and I
have some best practices for delivering presentations.
First, pay attention to how you speak.
Keep your sentences short.
Don't use long words where short words will work.
Build in intentional pauses to give
your audience time to think about what you've just said.
Try to keep the pitch of
your sentences level so that
your statements aren't confused for questions.
Also, try to be mindful of any nervous habits you have.
Maybe you talk faster,
tap your toes, or touch your hair when you're nervous.
That's totally normal—everyone has nervous habits—but
these habits can be distracting for your audience.
When you're presenting, try to stay
still and move with purpose.
Practice good posture and make
positive eye contact with the people in your audience.
Finally, remember that you can
practice and improve these skills with
every presentation. Accept and
seek out feedback from people you trust.
Feedback is a gift and an opportunity to grow.
With that, you've completed another module.
The presentation skills you've learned here,
like using frameworks,
weaving data into your presentation,
and best practices you can apply during
your actual presentations, are going to help you
communicate your findings with audiences effectively.
Hello. So let's talk about
how you can be sure you're prepared for a Q & A.
For starters, knowing the questions
ahead of time can make a big difference.
You don't have to be a mind reader,
but there are a few things you
can do to prepare that'll help.
For this example, we'll go back to the presentation
we created about health and happiness around the world.
We put together these slides,
clean them up a bit,
and now we're getting ready for the actual presentation.
Let's go over some ways we can
anticipate possible questions before
our Q & A to give us more time to think about the answers.
Understanding your stakeholder's expectations will
help you predict the questions they might ask.
As we previously discussed,
it's important to set
stakeholder expectations early in the project.
Keep their expectations in mind while you're
planning presentations and Q & A sessions.
Make sure you have
a clear understanding of the objective and what
the stakeholders wanted when
they asked you to take on this project.
For this project, our stakeholders were interested in
what factors contributed to
a happier life around the world.
Our objective was to identify if there were geographic,
demographic, and/or economic
factors that contributed to a happier life.
Knowing that, we can start thinking about
the potential questions about
that objective they might have.
At the end of the day,
if you misunderstood your stakeholders' expectations
or the project objectives,
you won't be able to correctly
anticipate or answer their questions.
Think about these things early and
often when planning for a Q & A.
Once you feel confident that you fully understand
your stakeholders' expectations and the project goals,
you can start identifying possible questions.
A great way to identify audience questions
is to do a test run of your presentation.
I like to call this the "colleague test."
Show your presentation or
your data viz to a colleague who has
no previous knowledge of
your work, and see what questions they ask you.
They might have the same questions
your real audience does.
We talked about feedback as a gift,
so don't be afraid to seek it out and
ask colleagues for their opinions.
Let's say we ran through
our presentation with a colleague,
we showed them our data visualizations,
then asked them what questions they had.
They tell us they weren't sure how we were
measuring health and happiness
with our data in this slide.
That's a great question.
We can absolutely work
that information into our presentation.
Sometimes the questions asked during
our colleague tests help us revise our presentation.
Other times, they help us anticipate
questions that might come up during the presentation,
even if we didn't originally want to build
that information into the presentation itself.
It helps to be prepared to go into detail about
your process, but only if someone asks.
Either way, their feedback can
help take your presentation to the next level.
Next, it's helpful to start with zero assumptions.
Don't assume that your audience is
already familiar with jargon,
acronyms, past events, or
other necessary background information.
Try to explain these things in
the presentation, and be
ready to explain them further if asked.
When we showed our presentation to our colleague,
we accidentally assumed that
they already knew how health and
happiness were measured and
left that out of our original presentation.
Now, let's look at our second data viz.
This graph is showing
the relationship between health, wealth,
and happiness, but includes GDP to measure the economy.
We don't want to assume that
our audience knows what that means,
so during the presentation,
we'll want to include a definition of GDP.
In our speaker notes,
we've added gross domestic product:
total monetary or market
value of all the finished goods and
services produced within a country's borders
in a specific period of time.
We'll fully explain what GDP
means as soon as this graphic comes up;
that way, no one in
our audience is confused by that acronym.
It helps to work with your team to
anticipate questions and draft responses.
Together, you'll be able to include
their perspectives and coordinate answers
so that everyone on your team is prepared and
ready to share their unique insights with stakeholders.
The team working on the world happiness project with
you probably has a lot of great insights about the data,
like how it was gathered
or what it might be missing.
Touch base with them so you
don't miss out on their perspective.
Finally, be prepared to consider and
describe to your stakeholders
any limitations in your data.
You can do this by critically analyzing
the patterns you've discovered
in your data for integrity.
For example, could the correlations
found be explained as coincidence?
On top of that, use
your understanding of the strengths
and weaknesses of the tools
you use in your analysis to
pinpoint any limitations they may have introduced.
While you probably don't have
the power to predict the future,
you can come pretty close to predicting
stakeholder and audience questions
by doing a few key things.
Remember to focus on
stakeholder expectations and project goals,
identify possible questions with your team,
review your presentation with zero assumptions,
and consider the limitations of your data.
Sometimes, though, your audience might raise
objections to the data
before and after your presentation.
Coming up, we'll talk
about the kind of objections they might
have and how you can respond. See you next time.
Welcome back. In this video, we'll talk about how you can handle objections about
the data you're presenting.
Stakeholders might raise objections during or after your presentation.
Usually, these objections are about the data, your analysis, or your findings.
We'll start by discussing what questions these objections are asking and
then talk about how to respond.
Objections about the data could mean a few different things.
Sometimes, stakeholders might be asking where you got the data and
what systems it came from, or they might want to know what transformations
happened to it before you worked with it, or how fresh and accurate your data is.
You can include all this information in the beginning of your presentation
to set up the data context. You can add a more detailed breakdown in your
appendix in case there are more questions.
When we talked about cleaning data,
you learned that keeping a detailed log of data transformations is useful.
That log can help you answer the questions we're talking about here, and
if you keep it in your presentation's appendix, it'll be easy to reference if
any of your stakeholders want more detail during a Q & A.
Now, your audience might also have questions or
objections about your analysis.
They might want to know if your analysis is reproducible, so
it helps to keep a change log documenting the steps you took. This way, someone else
could follow along and reproduce your process.
You can even create a slide in the appendix section of your presentation
explaining these steps, if you think it will be necessary.
And it can be useful to keep a clean version of your script if you're working
with a programming language like SQL or R, which we'll learn all about later.
Also, be prepared to answer questions like,
"Who did you get feedback from during this process?"
This is especially important when your analysis reveals insights that
are the opposite of your audience's
gut feelings about the data.
Making sure to include lots of perspectives throughout your analysis
process will help you back up your findings during your presentation.
Finally, you might be faced with objections to the findings themselves.
A lot of the time these will be questions like, "Do these findings exist
in previous time periods, or did you control for the differences in your data?"
Your audience wants to be sure that your final results accounted for
any possible inconsistencies and that they're accurate and useful.
Now that you know some of the possible kinds of objections your audience
might raise,
let's talk about how you can think about responding. First,
it can be useful to communicate any assumptions about the data,
your analysis, or your findings that might help answer their questions.
For example, did your team clean and format your data before analysis?
Telling your audience that can clear up any doubts they might have.
Second, explain why your analysis might be different than expected.
Walk your audience through the variables that change the outcomes to help them
understand how you got there.
And third, some objections have merit,
especially if they bring up something you hadn't thought of before.
If that's true, you can acknowledge that those objections are valid and
take steps to investigate further.
Following up with more details afterwards is great, too.
And now you know some of the basic objections you might run into.
Understanding that your audience might have questions about your data,
your analysis, or your findings can help you prepare responses ahead of time, and
walking your audience through any assumptions about the data or
unexpected results is a great approach to responding.
Coming up, we'll go over even more best practices for
responding to questions during a Q & A.
Bye for now.
Hello again. Earlier we talked about some ways that
you can respond to objections
during or after your presentations.
In this video, I want to share
some more Q & A best practices.
Let's go back to
our world happiness presentation example.
Imagine we finished preparing for a Q & A,
and it's time to actually
answer some of our audience's questions.
Let's go over some ways that we can be
sure that we're answering questions effectively.
We'll start with a really simple one:
listen to the whole question.
I know this sounds like a given,
but it can be really tempting to start
thinking about your answer before
the person you're talking to has
even finished asking their question.
On slide 11 of our presentation,
we outline our conclusions.
After explaining these conclusions,
one of our stakeholders asks,
"How was happiness measured for this project?"
It's important to listen to
the whole question and wait
to respond until they're done talking.
Take a moment to repeat the question.
Repeating the question is
helpful for a few different reasons.
For one, it helps you make
sure that you're understanding the question.
Second, it gives the person
asking it a chance to correct you if you're not.
Anyone who couldn't hear the question
will still know what's being asked.
Plus, it gives you a moment to get your thoughts together.
After listening to the question and
repeating it to make sure you understand,
you can explain that
participants in different countries were
given a survey that asked them to rate
their happiness, and just like that,
your audience has a better understanding of
the project because you took
the time to listen carefully.
Now that they know about the survey,
they're interested in knowing more.
At this point, we can go into
more detail about that data.
We have a slide built in here called the appendix.
This is a great place to keep
extra information that might not be necessary
for our presentation but could
be useful for answering questions afterwards.
This is also a great place for us to
have more detailed information
about the survey data so we can reference it more easily.
As always, make sure you understand
the context questions are being asked in.
Think about who your audience is and
what kinds of concerns or backgrounds they might have.
Remember the project goals
and your stakeholders' interests in them,
and try to keep your answers
relevant to that specific context,
just like you made sure your presentation
itself was relevant to your stakeholders.
We have this slide with data about
life expectancy as a metric for health.
If you're presenting to a group of
stakeholders who are in the healthcare industry,
they're probably going to be more interested in
the medical data and the relationship
between overall health and happiness.
Knowing this, you can tailor your answers to focus on
their interests so that
the presentation is relevant and useful to them.
When answering, try to involve the whole audience.
You aren't just having a one-on-one conversation
with the person that's asked the question;
you're presenting to a group of
people who might also have
the same question or need to know what that answer is.
It's important to not
accidentally exclude other audience members.
You can also include other voices.
If there's someone in your audience or
team that might have insight,
ask them for their thoughts.
Keep your responses short and to the point.
Start with a headline response that
gives your stakeholders the basic answer.
Then if they have more questions,
you can go into more detail.
This can be difficult as a data analyst.
You have all the background information and
want to share your hard work, but
you don't want to lose your audience
with a long and potentially confusing answer.
Stay focused on the question itself.
This is why listening to
the whole question is so important.
It keeps the focus on that specific question.
Answer the question as directly as
possible using the fewest words you can.
From there, you can expand on your answer or
add color, context, and detail as needed.
For example, when one of our stakeholders asked
how the data measuring happiness was gathered,
we started by telling them
that a survey was used to measure
an individual's happiness, and only when they were
interested in hearing more about
the survey did we go into more detail.
To recap, when you're
answering questions during a presentation Q & A,
remember to listen to the whole question,
repeat the question if necessary,
understand the context, involve your whole audience,
and keep your responses short.
Remember, you don't have to
answer every question on the spot.
If it is a tough question that will
require additional analysis or research,
it's fine to let your audience
know that you'll get back to them;
just remember to follow up in a timely manner.
These tips will make it easier to answer
questions and make you seem prepared and professional.
Now that you're presentation-ready,
it's time to wrap up.
We covered a lot about how to
consider questions before a Q & A,
how to handle different kinds of objections,
and some best practices you
can use in your next presentation.
That's it for now. See you in the next video.
The R-versus-Python debate
People often wonder which programming language they should learn first. You might be
wondering about this, too. This certificate teaches the open-source programming language,
R. R is a great starting point for foundational data analysis, and it has helpful packages that
beginners can apply to projects. Python isn’t covered in the curriculum, but we encourage
you to explore Python after completing the certificate. If you are curious about other
programming languages, make every effort to continue learning.
Any language a beginner starts to learn will have some advantages and challenges. Let’s put
this into context by looking at R and Python. The following table is a high-level overview
based on a sampling of articles and opinions of those in the field. You can review the
information without necessarily picking a side in the R vs. Python debate. In fact, if you
check out RStudio’s blog article in the Additional resources section, it’s actually more about
working together than winning a debate.
Common features
R: Open-source; data stored in data frames; formulas and functions readily available; community for code development and support
Python: Open-source; data stored in data frames; formulas and functions readily available; community for code development and support
Unique advantages
R: Data manipulation, data visualization, and statistics packages; a "scalpel" approach to data: find packages to do what you want with the data
Python: Easy syntax for machine learning; integrates with cloud platforms like Google Cloud and Azure
Unique challenges
R: Inconsistent naming conventions make it harder for beginners to select the right functions; methods for handling variables may be a little complex for beginners to understand
Python: Many more decisions for beginners to make about data input/output, structure, and variables; a "Swiss army knife" approach to data: do what you want with the data
Additional resources
For more information on comparing R and Python, refer to these resources:
R versus Python, a comprehensive guide for data professionals: This article is
written by a data professional with extensive experience using both languages
and provides a detailed comparison.
R versus Python, an objective comparison: This article provides a comparison
of the languages using examples of code use.
R versus Python: What’s the best language for data science?: This blog article
provides RStudio’s perspective on the R vs. Python debate.
Key takeaways
Certain aspects make some programming languages easier to learn than others. But, that
doesn’t make the harder languages impossible for beginners to learn. On the flip side, a
programming language’s popularity doesn’t always make it the best language for beginners
either.
R has been used by professionals who have a statistical or research-oriented approach to
solving problems; among them are scientists, statisticians, and engineers. Python has been
used by professionals looking for solutions in the data itself, those who must heavily mine
data for answers; among them are data scientists, machine learning specialists, and
software developers.
As you grow as a data analytics professional, you may need to learn additional
programming languages. The skills and competencies you learn from your first
programming experience are a good foundation. That's why this course focuses on the
basics of R. You can develop the right perspective, that programming languages play an
important part in the data analysis process no matter what job title you have.
The good news is that many of the concepts and coding principles that you will learn from
using R in this course are transferable to other programming languages. You will also learn
how to write R code in an Integrated Development Environment (IDE) called RStudio.
RStudio allows you to manage projects that use R or Python, or even a combination of the
two. Refer to RStudio: A Single Home for R & Python for more information. So, after you
have worked with R and RStudio, learning Python or another programming language in the
future will be more intuitive.
For a better idea of popular programming languages by job role, refer to Ways to learn
about programming. The programming languages most commonly used by data analysts,
web designers, mobile and web application developers, and game developers are listed,
along with links to resources to help you start learning more about those languages.
From spreadsheets to SQL to R
Although the programming language R might be new to you, it actually has a lot of
similarities to the other tools you have explored in this program. In this reading, you will
compare spreadsheet programs, SQL, and R to have a better sense of how to use each
moving forward.
Spreadsheets, SQL, and R: a comparison
As a data analyst, there is a good chance you will work with SQL, R, and spreadsheets at
some point in your career. Each tool has its own strengths and weaknesses, but they all
make the data analysis process smoother and more efficient. There are two main things
that all three have in common:
They all use filters: For example, you can easily filter a dataset using any of these tools. In R, you can use the filter function, which performs the same task as a basic SELECT-FROM-WHERE SQL query (see the sketch after this list). In a spreadsheet, you can create a filter using the menu options.
They all use functions: In spreadsheets, you use functions in formulas, and in
SQL, you include them in queries. In R, you will use functions in the code that is
part of your analysis.
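To make the filter comparison concrete, here is a minimal sketch. It assumes the dplyr package is installed and uses a small, made-up data frame named sales; the equivalent SQL query is shown as a comment.

# A minimal sketch, assuming the dplyr package and a made-up data frame named sales
library(dplyr)

sales <- data.frame(
  region = c("North", "South", "North"),
  amount = c(100, 250, 175)
)

# R: keep only the rows where region is "North"
north_sales <- filter(sales, region == "North")

# The equivalent SQL query would be:
# SELECT * FROM sales WHERE region = 'North';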
The table below presents key questions to explore a few more ways that these tools
compare to each other. You can use this as a general guide as you begin to navigate R.
What is it?
Spreadsheets: A program that uses rows and columns to organize data and allows for analysis and manipulation through formulas, functions, and built-in features
SQL: A database programming language used to communicate with databases to conduct an analysis of data
R: A general-purpose programming language used for statistical analysis, visualization, and data analysis

What is a primary advantage?
Spreadsheets: Includes a variety of visualization tools and features
SQL: Allows users to manipulate and reorganize data as needed to aid analysis
R: Provides an accessible language to organize, modify, and clean data, and create insightful visualizations

Which datasets does it work best with?
Spreadsheets: Smaller datasets
SQL: Larger datasets
R: Larger datasets

What is the source of the data?
Spreadsheets: Entered manually or imported from an external source
SQL: Accessed from an external database
R: Loaded with R, imported from your computer, or loaded from other sources

Where is the data from my analysis usually stored?
Spreadsheets: In a spreadsheet file on your computer
SQL: Inside tables in the accessed database
R: In an R file on your computer

Do I use formulas and functions?
Spreadsheets: Yes
SQL: Yes
R: Yes

Can I create visualizations?
Spreadsheets: Yes
SQL: Yes, by using an additional tool like a database management system (DBMS) or a business intelligence (BI) tool
R: Yes
When to use RStudio
As a data analyst, you will have plenty of tools to work with in each phase of your analysis.
Sometimes, you will be able to meet your objectives by working in a spreadsheet program
or using SQL with a database. In this reading, you will go through some examples of when
working in R and RStudio might be your better option instead.
Why RStudio?
One of your core tasks as an analyst will be converting raw data into insights that are
accurate, useful, and interesting. That can be tricky to do when the raw data is complex. R
and RStudio are designed to handle large data sets, which spreadsheets might not be able
to handle as well. RStudio also makes it easy to reproduce your work on different datasets.
When you input your code, it's simple to just load a new dataset and run your scripts again.
You can also create more detailed visualizations using RStudio.
When RStudio truly shines
When the data is spread across multiple categories or groups, it can be challenging to
manage your analysis, visualize trends, and build graphics. And the more groups of data
that you need to work with, the harder those tasks become. That’s where RStudio comes in.
For example, imagine you are analyzing sales data for every city across an entire country.
That is a lot of data from a lot of different groups–in this case, each city has its own group of
data.
Here are a few ways RStudio could help in this situation:
Using RStudio makes it easy to take a specific analysis step and perform it for each group using basic code. In this example, you could calculate the yearly average sales data for every city (see the sketch after this list).
RStudio also allows for flexible data visualization. You can visualize differences
across the cities effectively using plotting features like facets–which you’ll learn
more about later on.
You can also use RStudio to automatically create an output of summary stats—
or even your visualized plots—for each group.
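Here is a minimal sketch of what that grouped analysis might look like. It assumes the dplyr and ggplot2 packages and a made-up data frame named city_sales with city, year, and sales columns; it is an illustration, not code from this course.

library(dplyr)
library(ggplot2)

# Made-up sales data for two cities across two years
city_sales <- data.frame(
  city  = rep(c("Austin", "Denver"), each = 4),
  year  = rep(c(2020, 2020, 2021, 2021), times = 2),
  sales = c(120, 135, 150, 160, 90, 95, 110, 105)
)

# Calculate the yearly average sales for every city in one step
yearly_avg <- city_sales %>%
  group_by(city, year) %>%
  summarize(avg_sales = mean(sales), .groups = "drop")

# Visualize the differences across cities using facets
ggplot(yearly_avg, aes(x = year, y = avg_sales)) +
  geom_col() +
  facet_wrap(~city)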
As you learn more about R and RStudio moving forward in this program, you’ll get a better
understanding of when RStudio should be your data analysis tool of choice.
For more information
The Advantages of RStudio: This web page explains some of the reasons why
RStudio is many analysts’ preferred choice for interfacing with R. You’ll learn
about the advantages of using RStudio for data analysis, from ease of use to
accessibility of graphics and more.
Data analysis and R programming: This online introduction to data analysis
and R programming is a good starting point for R and RStudio users. It also
includes a list of detailed explanations about the advantages of using R and
RStudio. You’ll also find a helpful guide for getting set up with RStudio.

Hey there.

Anytime you're learning a new skill from cooking to driving to dancing,

you should always start with the fundamentals.

Programming with R is no different.

To build this foundation, you'll get familiar with the basic concepts of R,

including functions, comments, variables, data types, vectors, and pipes.

Some of these terms might sound familiar.

For example, we've come across functions in spreadsheets and SQL.

As a quick refresher, functions are a body of

reusable code used to perform specific tasks in R.

Functions begin with function names like print or paste, and

are usually followed by one or more arguments in parentheses.

An argument is information that a function in R needs in order to run.

Here's a simple function in action.

Feel free to join in and try it yourself in RStudio using your cloud account.

Check out the reading for more details on how to get started.

You can pause the video anytime you need to.

We'll open RStudio Cloud to get started.

We'll start our function in the console with the function name print.

This function will return whatever values we include in the parentheses.

We'll type an open parenthesis followed by a quotation mark.

Both the close parenthesis and

end quote automatically pop up because RStudio recognizes this syntax.

Now we just have to add the text string.

We'll type Coding in R.

Then we'll press enter.

Success! The code returns the words "Coding in R."

If you want to find out more about the print function or any function, all you

have to do is type a question mark, the function name, and a set of parentheses.

This returns a page in the Help window,

which helps you learn more about the functions you're working with.

Keep in mind that functions are case-sensitive,

so typing Print with a Capital P brings back an error message.
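If you would like to follow along in RStudio, this is roughly what the console session described above looks like, with the returned output shown as comments:

# Print a text string to the console
print("Coding in R")
# [1] "Coding in R"

# Open the Help page for the print function
?print()

# Function names are case-sensitive, so this line would return an error:
# Print("Coding in R")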

Functions are great, but it can be pretty time-consuming to type out lots of values.

To save time, we can use variables to represent the values.

This lets us call out the values any time we need to with just the variable.

Earlier, we learned about variables in SQL.

A variable is a representation of a value in R that can be stored for

use later during programming.

Variables can also be called objects.

As a data analyst, you'll find variables are very useful when programming.

For example, if you want to filter a dataset,

just assign a variable to the function you used to filter the data.

That way, all you have to do is use that variable to filter the data later.

When naming a variable in R, you can use a short phrase.

A variable name should start with a letter and

can also contain numbers and underscores.

So the variable 5penguin wouldn't work well because it starts with a number.

Also just like functions, variable names are case-sensitive.

Using all lower case letters is good practice whenever possible.

Now, before we get to coding a variable, let's add a comment.

Comments are helpful when you want to describe or

explain what's going on in your code.

Use them as much as possible so that you and

everyone else reading your code can understand the reasoning behind it.

Comments should be used to make an R script more readable.

A comment shouldn't be treated as code, so we'll put a # in front of it.

Then we'll add our comment.

Here's an example of a variable.

Now let's go ahead with our example.

It makes sense to use a variable name that connects to what the variable is

representing.

So we'll type the variable name first_variable.

Then after the variable name, we'll type a < sign, followed by a -.

This is the assignment operator.

It assigns the value to the variable.

It looks like an arrow, which makes sense,

since it's pointing from the value to the variable.

There are other assignment operators that work too, but

it's always good to stick with just one type in your code.

Next, we'll add the value that our variable will represent.

We'll use the text, "This is my variable."

If we type the variable and hit Run,

it will return the value that the variable represents.

This is a very basic way of using a variable.

You'll learn more ways of using variables in your code soon.

For now, let's assign a variable to a different data type, numeric.

We'll name this second_variable, and type our assignment operator.

We'll give it the numeric value 12.5.

The Environment pane in the upper-right part of our workspace now shows

both of our variables and their values.
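
To recap the variable examples from this part of the video, here's a small sketch; the comment and the variable names follow the conventions described above (lowercase letters, starting with a letter).

# Assign a character value to a variable, then print it
first_variable <- "This is my variable"
first_variable
#> [1] "This is my variable"

# Assign a numeric value to a second variable
second_variable <- 12.5
second_variable
#> [1] 12.5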

There are other data types in R like logical, date, and date time.

R has a few options for dealing with these data types. We'll explore them later.

With functions, comments, variables, and data types,

you've got a good foundation for working with R.

We'll revisit these throughout this program, and

show you how they're used in different ways during analysis.

Let's finish up with two more fundamental concepts, vectors and pipes.

Simply put, a vector is a group of data elements of the same

type stored in a sequence in R.

You can make a vector using the combine function.

In R this function is just the letter c followed

by the values you want in your vector inside parentheses.

All right, let's create a vector.

Imagine this vector is for measurement data that we need to analyze.

We'll start our code with the variable vec_1, which we'll assign to the vector.

Then we'll type c and the open parenthesis.

Then we'll type our list of numbers separated by commas.

We'll then close our parentheses and press enter.

This time when we type our variable and press enter, it returns our vector.

We can use this vector anywhere in our analysis with only

its variable name vec_1.

The values in the vector will automatically be applied to our analysis.
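
The specific numbers typed in the video aren't listed in this transcript, so the values below are placeholders, but the pattern for creating and printing a vector is the same.

# Store a set of measurements in a vector (placeholder values)
vec_1 <- c(12.5, 48.5, 71.0, 101.5)
vec_1
#> [1]  12.5  48.5  71.0 101.5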

That brings us to the last of our fundamentals, pipes.

A pipe is a tool in R for expressing a sequence of multiple operations.

A pipe is represented by a percent sign, followed by a greater-than sign, and another percent sign: %>%.

It's used to pass the output of one function into another function.

Pipes can make your code easier to read and understand.

For example, this pipe filters and sorts the data.

Later, we'll learn how each part of the pipe works.
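
The filtering-and-sorting pipe shown on screen isn't reproduced in this transcript, but a sketch like the one below illustrates the idea, assuming the dplyr package and R's built-in ToothGrowth dataset; filter() and arrange() are covered in more detail later.

library(dplyr)

# Keep only the rows where dose is 0.5, then sort the result by tooth length,
# and assign the output to a variable so it can be reused later
filtered_toothgrowth <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)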

So there they are, the super six fundamentals: functions,

comments, variables, data types, vectors, and pipes.

They all work together as a foundation for using R.

It's a lot to take in, so

feel free to watch any of these videos again if you need a refresher.

When you're ready, there's so much more to know about R and RStudio.

So let's get to it.
Vectors and lists in R
In programming, a data structure is a format for organizing and storing data. Data
structures are important to understand because you will work with them frequently when
you use R for data analysis. The most common data structures in the R programming
language include:
Vectors
Data frames
Matrices
Arrays
Think of a data structure like a house that contains your data.
This reading will focus on vectors. Later on, you’ll learn more about data frames, matrices,
and arrays.
There are two types of vectors: atomic vectors and lists. Coming up, you’ll learn about the
basic properties of atomic vectors and lists, and how to use R code to create them.
Atomic vectors
First, we will go through the different types of atomic vectors. Then, you will learn how to
use R code to create, identify, and name the vectors.
Earlier, you learned that a vector is a group of data elements of the same type, stored in a
sequence in R. You cannot have a vector that contains both logicals and numerics.
There are six primary types of atomic vectors: logical, integer, double, character (which
contains strings), complex, and raw. The last two, complex and raw, aren't as common in
data analysis, so we will focus on the first four. Together, integer and double vectors are
known as numeric vectors because they both contain numbers. This table summarizes the
four primary types:
Logical: True/False
Integer: Positive and negative whole values
Double: Decimal values
Character: String/character values
This diagram illustrates the hierarchy of relationships among these four main types of
vectors:
In the diagram, logical and character vectors point directly to atomic, while integer and
double point to numeric; numeric in turn points to atomic, and atomic points to vector at
the top of the hierarchy.
Creating vectors
One way to create a vector is by using the c() function (called the “combine” function). The
c() function in R combines multiple values into a vector. In R, this function is just the letter
“c” followed by the values you want in your vector inside the parentheses, separated by a
comma: c(x, y, z, …).
For example, you can use the c() function to store numeric data in a vector.
c(2.5, 48.5, 101.5)
To create a vector of integers using the c() function, you must place the letter "L" directly
after each number.
c(1L, 5L, 15L)
You can also create a vector containing characters or logicals.
c("Sara", "Lisa", "Anna")
c(TRUE, FALSE, TRUE)
Determining the properties of vectors
Every vector you create will have two key properties: type and length.
You can determine what type of vector you are working with by using the typeof()
function. Place the code for the vector inside the parentheses of the function. When you run
the function, R will tell you the type. For example:
typeof(c("a", "b"))
#> [1] "character"
Notice that the output of the typeof function in this example is “character”. Similarly, if
you use the typeof function on a vector with integer values, then the output will include
“integer” instead:
typeof(c(1L, 3L))
#> [1] "integer"
You can determine the length of an existing vector, meaning the number of elements it
contains, by using the length() function. In this example, we use an assignment operator to
assign the vector to the variable x. Then, we apply the length() function to the variable.
When we run the function, R tells us the length is 3.
x <- c(33.5, 57.75, 120.05)
length(x)
#> [1] 3
You can also check if a vector is a specific type by using an is function: is.logical(),
is.double(), is.integer(), is.character(). In this example, R returns a value of TRUE
because the vector contains integers.
x <- c(2L, 5L, 11L)
is.integer(x)
#> [1] TRUE
In this example, R returns a value of FALSE because the vector does not contain characters,
rather it contains logicals.
y <- c(TRUE, TRUE, FALSE)
is.character(y)
#> [1] FALSE
Naming vectors
All types of vectors can be named. Names are useful for writing readable code and
describing objects in R. You can name the elements of a vector with the names() function.
As an example, let’s assign the variable x to a new vector with three elements.
x <- c(1, 3, 5)
You can use the names() function to assign a different name to each element of the vector.
names(x) <- c("a", "b", "c")
Now, when you run the code, R shows that the first element of the vector is named a, the
second b, and the third c.
x
#> a b c
#> 1 3 5
Remember that an atomic vector can only contain elements of the same type. If you want to
store elements of different types in the same data structure, you can use a list.
Creating lists
Lists are different from atomic vectors because their elements can be of any type, like
dates, data frames, vectors, matrices, and more. Lists can even contain other lists.
You can create a list with the list() function. Similar to the c() function, the list() function is
just list followed by the values you want in your list inside parentheses: list(x, y, z, …). In
this example, we create a list that contains four different kinds of elements: character
("a"), integer (1L), double (1.5), and logical (TRUE).
list("a", 1L, 1.5, TRUE)
Like we already mentioned, lists can contain other lists. If you want, you can even store a
list inside a list inside a list, and so on.
list(list(list(1, 3, 5)))
Determining the structure of lists
If you want to find out what types of elements a list contains, you can use the str() function.
To do so, place the code for the list inside the parentheses of the function. When you run
the function, R will display the data structure of the list by describing its elements and their
types.
Let’s apply the str() function to our first example of a list.
str(list("a", 1L, 1.5, TRUE))
We run the function, then R tells us that the list contains four elements, and that the
elements consist of four different types: character (chr), integer (int), number (num), and
logical (logi).
#> List of 4
#>  $ : chr "a"
#>  $ : int 1
#>  $ : num 1.5
#>  $ : logi TRUE
Let’s use the str() function to discover the structure of our second example. First, let’s
assign the list to the variable z to make it easier to input in the str() function.
z <- list(list(list(1, 3, 5)))
Let’s run the function.
str(z)
#> List of 1
#>  $ :List of 1
#>   ..$ :List of 3
#>   .. ..$ : num 1
#>   .. ..$ : num 3
#>   .. ..$ : num 5
The indentation of the $ symbols reflects the nested structure of this list. Here, there are
three levels (so there is a list within a list within a list).
Naming lists
Lists, like vectors, can be named. You can name the elements of a list when you first create
it with the list() function:
list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)
#> $Chicago
#> [1] 1
#>
#> $`New York`
#> [1] 2
#>
#> $`Los Angeles`
#> [1] 3
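Although this reading stops at naming, one practical benefit of named elements is that you can refer to them by name later. As a small illustration, assigning the list above to a hypothetical variable called cities lets you extract elements like this:
cities <- list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)

# Extract an element by its name
cities$Chicago
#> [1] 1
cities[["New York"]]
#> [1] 2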
Additional resource
To learn more about vectors and lists, check out R for Data Science, Chapter 20: Vectors. R
for Data Science is a classic resource for learning how to use R for data science and data
analysis. It covers everything from cleaning to visualizing to communicating your data. If
you want to get more details about the topic of vectors and lists, this chapter is a great
place to start.