Introduction to Career Skills in Data Analytics

Defining data analysis and roles in data analysis
One of the challenges we face as we decide to pursue data as a career choice is the
fact that there are many different paths and specializations. Let's define some of
those roles, and then discuss the common skills that are shared among all. The most
universal role is that of the data worker. This person consumes data regularly, works
with data often, performs some data manipulation, and presents that data as part
of their everyday work. Let's take Sally as an example. She works in a business
unit, not necessarily the IT department, and each week she prepares a report for
her manager. She prepares the data for the reports, and her reports are the
same as last week's; the only difference is new data. You see, most data workers have
limited access to all the different systems through the backend. They likely receive
data from people who have access to the databases. Data workers like Sally may even
export the data out of a system into CSV or Excel files, and the process of their data
work begins. A data analyst goes further. Generally, they have a little more
access to data, model the data, and stay connected to the data, so they can simply
refresh the reports and begin the analysis and presentation of the data. The
data analyst will handle a lot of ad hoc requests, especially if they're
efficient. They will likely have more than just Excel to work with and are likely
considered a guru or a wizard in their department. The data worker and the data
analyst are what I consider the most common roles. Most people are
some form of data worker, and they dive into data analysis more than they even
know. The common skills of all data professionals are gathering data, manipulating
it to meet requirements, and then reporting the outcomes in some way. Data
engineers have the special skill of being able to design and build data sets, whereas
data workers and data analysts work with what is already built and model the
data as needed. You will find a lot of people in crossover roles where sometimes they
play data engineers, and sometimes they act as a data analyst. One could argue on
the top of the hierarchy of data roles are the data architect, and the data scientist. A
data architect is a creator of architecture, no different than an architect that designs
a building. The data architect designs data systems. The importance of the
architecture really can't be understated as all roles including the data scientist needs
this architecture. What most see as the literal top of the hierarchy is the data
scientist. And I believe this is likely due to the fact that most companies have their
data architecture in place, and now it's time to take that data, and put it to use. This
is where the data scientist comes into play. A data scientist will likely have all the
common skills of the data analyst, the data engineer and the data architect. They'll
also have deeper skills in coding, statistics and math. It's okay to not know where
you'll end on your journey, but I think it's important to start. You can begin either as
a data worker or recognizing you already are one. You can increase your skills as a
data analyst, and then as you grow deeper in your experience with data, you'll
discover where you want to be. In all roles, you'll gain a deeper understanding of
data, and it's okay to find a place and stay there.
Developing data fluency
Is your organization data literate, data fluent, or none of the above? Let's break these
terms down. Data literate means that you can read data, converse about it, and
understand it. Let me give you an example, your bank account. It's all about your
finances, right? And if you really look at it, it's just all your data through
transactions. Can you read your balance? Can you tell me when something is there
that shouldn't be? And if it is, can you call the bank and explain it? That means you're
literate about your banking data. Now to the meaning of fluent. Fluent means you
can create something with it that shows skills outside of just being able to read it
and use it. We know people who speak other languages. They are either literate or
fluent. Someone who is literate can, again, pick up on the common things with the
language and speak in simple sentences, but a person who is fluent can carry on
conversations and author stories in that language. Just like these terms apply to
language, they apply to data. Let's go back to our banking example. If you are fluent
with data, you can then turn your banking data from last year into insights that will
allow you to build a budgeting system and a finance tracking system. To really build
your data skills, you must begin to think about how that data skill applies to your
everyday life. If you are data fluent, and at work someone hands you
information, and it's the first time you've ever seen it, you will have an approach that
lets you learn that data, and you will have questions that seem natural for you to
ask. Approach is everything, and building an approach or identifying it can start
today, right now even. Start thinking about every time you have a new data set in
front of you, what do you do? That's your approach. If your approach is to stare at it
and wonder, well, that tells you where to begin. Now there are degrees of data
literacy and data fluency that are appropriate in the workplace. And I would argue
that everyone should be data literate: able to read, speak, listen to, and
understand the data, or at least the data that applies to them. That could be time sheets
or even paychecks. An organization that only has a small percentage of data fluent
people means they do not have enough people to do the exploring and building
that might just be the tool that takes their company to the next level. Becoming data
literate and then transitioning to data fluent can be a game changer in your
career. You can go from reading and basic understanding to producing insight and
data tools for your organization or maybe for yourself.
Understanding how data governance impacts the data analyst
Have you ever asked permission to gain access to data and been denied? Have you
ever asked for permissions and just been given global admin when all you needed
was read permissions? If so, then you've been a part of the data governance of the
organization, or the lack of it. Data governance is a framework that incorporates
strategies to create solid quality data, enable accountability, and provide
transparency to the data in the organization. Data governance has processes,
procedures, and people at various levels of the organization. It's meant to control
every aspect of the data in the organization. Data governance can support quality of
data, accountability, trust, and compliance. There is some form of data governance in
every organization at every size and level. If you work in a regulated industry, then
data governance will likely be more mature than other industries. I have worked in
almost every size industry, regulated or not regulated, and I'm either at the mercy of
the data governance or protecting myself from the lack of it. Here are some common
components of data governance that directly impact the data analyst. Access to
information, how you can access it. There is typically a chain of command and the
data analyst is rarely meant to be at the top of it. If you need access to information,
there is someone, like your manager, from whom you will request permission to gain
access to it. Once they hear the request, they will typically instruct you to
contact the next person responsible, or they will contact them on your behalf. I once
requested access to the back end of a system from my manager. He then sent the
request to the technology department who then between the two parties agreed I
could have it. Little did I know it would go to a third person to implement it and
notify me. The third person was the person in the cube to my left. I ate lunch with
him every day. It was very controlled, and I did not understand it at the time, but now
I have an appreciation of it. As a data analyst, we seek the source of truth, the golden
record, and data governance is a part of providing that. We want to make sure
there's an identifiable truth and that we can trust what we're working with. When we
do not have at least two or three of these components to work with, we'll deal with
challenges. For example, you may have been given more access than you need, and
it might leave you wondering which data set you could really trust. Master data
management is also a key component of the data governance framework. Making
sure that the data we all need is complete, accurate, and meets the business
rules. This is one area where organizations that do not have a strong data
governance plan or strategy will have suffering data analysts. You may find yourself
always correcting something as simple as product names that have been entered
incorrectly but are literally the same product. You might be constantly correcting
customer address information. I'm always telling organizations that regardless of
regulations, they have a data governance plan in place, whether they documented it
or not. As a data analyst, determining the data governance plan at your organization
will help you to know who to talk to, when to talk to them, and how to adequately
follow the process of all things that relate to the life cycle of data at the organization.
Understanding the importance of data quality
As a little girl, I got sick. I mean really sick. My mother immediately took me to the
doctor, and they did an x-ray because I had a headache so bad, I'd been sick for two
days. The x-ray showed nothing, but the physical signs of the illness, and the blood
work were enough for the doctor to send me to the ER. A day or two later, they did
another x-ray, and when they did, they discovered why I was so sick. You see, I had
a bacterial infection that was unfortunately on its way to my brain. I was hospitalized
for 11 days, and then given the right types of treatments to prevent it getting
worse, and treatments to help get it better. What does this have to do with data
quality? Well, the first x-ray showed nothing, and what they actually discovered is
that their machine was broken. Would I have gotten better faster if that first x-ray
showed them what the second x-ray did? We'll never know. We can't go back in
time. Quality data is data that can be trusted to produce accurate insights so
decisions can be made. In my situation, had they waited even longer to do the
second x-ray or even sent me home, I would not be here today. Not all data decisions
are life or death, but they can have terrible consequences for businesses if data
quality is not an everyday part of the culture. It is important for us to all remember as
data professionals that people are using data to make decisions, and bad data can
mean bad decisions with profound consequences. There are data quality
dimensions that you can be aware of as a data analyst. This isn't a complete list of
everything you will find for data quality, but here are the four major hallmarks of
quality data, complete, consistent, valid and accurate. Completeness of data. Do we
have all the data that's needed? Is any of it missing? Is it all usable? Consistency. Is
this data in other systems, and is the information consistent across all of them? In
other words, does the same record in the production system match what we sent to the
invoicing system? Validity. Does the data meet the requirements of what we are
attempting to do with it? And is it in the right format in which we need to do
it? Accuracy. Is it accurate? This is a big one. Is this information accurate? And in my
case, it was not. I think it's important that we know quality can be measured, and we
can determine if it's complete, consistent, valid and accurate. And if it's not
100%, well, we need to know that. Again, some data means life or death. So data
quality at the highest rate is important.
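The four dimensions above can be turned into simple, measurable checks. Here is a minimal sketch; the record, field names, and rules are hypothetical, and a real accuracy check would also need a trusted source to compare against:

```python
from datetime import date

# Hypothetical record from an invoicing export; all field names and
# values here are made up purely to illustrate the quality dimensions.
record = {"customer": "Acme Co", "amount": 250.0, "invoice_date": "2024-03-15"}
production_copy = {"customer": "Acme Co", "amount": 250.0, "invoice_date": "2024-03-15"}

# Completeness: do we have all the data that's needed, or is any missing?
required_fields = ("customer", "amount", "invoice_date")
complete = all(record.get(f) not in (None, "") for f in required_fields)

# Consistency: does the same record match across systems?
consistent = record == production_copy

# Validity: is the data in the format the task requires?
try:
    date.fromisoformat(record["invoice_date"])  # must parse as a date
    valid = record["amount"] >= 0               # amounts can't be negative
except ValueError:
    valid = False

print(complete, consistent, valid)  # True True True
```

Accuracy, the fourth dimension, is the one a script alone can't confirm; it requires validating against a source of truth.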
What is BI and the value to business?
Have you ever heard the phrase, "Our company makes data-driven decisions"? Well, of
course they do. And they are making data-driven decisions all the time. It could just
be bad decisions because the data is bad. Data-driven decisions happen all the
time. "We need more money in the account, so someone get those salespeople
motivated." That is a data-driven decision and action. The problem is this is a single
data point at a crisis moment. I tend to ask people if they want to be data
informed instead. Data and business intelligence let you have both information and the
ability to make intelligent business decisions. For example, with the correct data and
an understanding of the process and the business goals defined with a solid set of
KPIs, or key performance indicators, a business can see a downward trend in sales before
it becomes a problem. This allows the business an opportunity to course correct to
attempt to prevent a crisis moment. For business intelligence to be practical, it
requires you to store the data that's important to the business and all its
processes. You can't just focus on one number like our earlier example. Just knowing
one number and that you have to hit it, means that you understand the
goal. However, it's not all the information you need. All the other data that impacts
that goal needs to be analyzed against the business rules. Fortunately, we have business
intelligence tools, but they are only tools to build business intelligence with. The tools do
not provide it by themselves, just as a hammer requires nails and someone to use
it to build something. Businesses need to define the metrics that help them track the
overall health of the organization. Again, these metrics are KPIs. Let me make it
practical. And I'll use health as an example. If you know your heart rate as an adult is
supposed to be anywhere between 60 to 100 beats per minute and you watch it
every day and suddenly you see it spike and stay elevated, it would indicate that
something is happening to make your heart beat faster. With a bit more
information like tracking what you eat or drink, you can analyze this data and you
notice that it's elevated when you drink a certain type of drink and it stays elevated
for a couple of hours and then goes back down. To make an adjustment, you can
stop drinking that drink or reduce the amount of it. Whatever the adjustment is, you
make it, you then analyze your heart rate to determine the adjustment to see if it
made a difference. When you apply this concept to the overall health of
business then you can easily determine what the heart rate of the business is. And
what are the items that impact it. This allows you to start to define the metrics that
help you monitor the health of the business and provide business intelligence.
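The heart-rate KPI idea above can be sketched in a few lines. The 60-100 bpm range comes from the example; the daily readings and the simple threshold alert are illustrative assumptions:

```python
# KPI monitoring sketch using the resting-heart-rate example:
# flag any daily reading outside the normal 60-100 bpm range.
NORMAL_RANGE = (60, 100)
daily_bpm = [72, 70, 74, 71, 118, 121, 73]  # sample readings (made up)

# Collect (day, reading) pairs that fall outside the healthy range.
alerts = [(day, bpm) for day, bpm in enumerate(daily_bpm, start=1)
          if not (NORMAL_RANGE[0] <= bpm <= NORMAL_RANGE[1])]

print(alerts)  # [(5, 118), (6, 121)] -- two days worth investigating
```

A business KPI like daily sales works the same way: define the healthy range, watch the trend, and investigate when readings stay outside it.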
How are business analytics and BI different?
I started running about four years ago and it made me realize that business
intelligence, business analytics, and even data analytics are really three individual
things with lots of overlap. Let me break it down. I was preparing for my first half
marathon. I needed the data to tell me how fast I was running so I could improve my
speed. That speed for my mile was my business intelligence. For this example, it
represents a single number. Business analysis focuses on all the numbers that would
allow me to get faster over time by again, analyzing the data and creating more of
it. For me, every time I made a run, I was tracking that information. Data analysis is
where we capture and analyze the actual data. We can analyze the historical
data while we keep capturing new data as it grows every day. To use all these concepts
together, I was comparing every run to the last run. However, initially I just started
capturing a run to establish a baseline and I just added more runs and more
miles. The business intelligence was telling me how fast on average I was running a
mile or the timing of certain miles, like, my 5K speed versus just my single mile. I had
a goal that I wanted to attain. So, using business analysis skills, I applied this to my
running. I would also use other values, like where I ran, what time I was running, to
determine my future outcomes. For example, I discovered I needed to change my
shoes. I changed my shoes, I ran a little bit faster and I hurt a lot less. I also
discovered that if I picked more familiar routes, that I might run a little bit faster than
on a route that was new. I was using these pieces of information to adjust my
routine so I could see a faster speed over time. So, business intelligence tells
us where we are on any given day for any process that we use data to study. My
example is running. But it could easily be applied to business metrics, like, sales or
production. And business analytics helps us to see the trends and predict future
outcomes which are critical to businesses. We need both business intelligence and
business analytics and we use data analysis to determine where we are and how to
reach our end goals or the desired outcomes. Think of it this way. Business
intelligence can tell you how you're performing today and business and data
analysis can tell you how you can potentially perform in the future.
How data can provide intelligence to the organization
Call me strange, but I have a relationship with data. To me, data is living. I do realize
it's an inanimate object, but I do think it's an intelligent object. Data cannot only
provide information like this month's sales, but it can also communicate with
software to provide automation. My first experience with data at this level was with
a regulated industry. They were required to provide information at different time
points, and then they were required to report on all the things they did to meet those
times. Mail came into the office, people went to retrieve their mail for the contracts
they supported three times a day. They also printed information and scanned it as
needed by walking to and from the printer. Let's just look at the data around the
printer, and the scanner process alone. Look at the data as a person walks, there is a
distance and that distance takes time. And what if they stop and talk? That's more
time. Think about the amount of time at the printer, and what if two people walk up
at the same time? Now there is a time the person is there, and the time the other
person is waiting. When they walk back to the cube, they begin the real work. When
a person has everything ready to go, they walk back to get the mail and deliver the
mail that's ready to go out. This occurred multiple times a day, by different
people, all the time. Okay, now let us multiply that by 10 people doing the same job
all day long and then let's multiply that by 260 business days. Now business
intelligence says that today that's X number of hours spent in transit, business
analysis says if we put a printer at their desk, we save X. This is just one way data can
help support improvements to the process. We often think about data in the form of
a field, or columns on a spreadsheet. Just with one key date, we can create other
dates to trigger other events. Technology will allow us to create information
automatically, the very same information that a human would have to figure out, and
then we can have the human verify the information, saving time and being more
accurate. The most effective data analysts develop skills and a relationship with
data. It's important to start learning to see it as a living thing that can help us refine
current processes like walking to a printer and automate processes like creating
more data.
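The printer-transit arithmetic above can be sketched as a quick back-of-the-envelope calculation. The 10 people, three daily trips, and 260 business days come from the narration; the minutes per trip is a hypothetical assumption:

```python
# Back-of-the-envelope cost of walking to the printer/mail room.
minutes_per_trip = 4    # walk there, wait, walk back (assumed)
trips_per_day = 3       # mail runs per person, per the narration
people = 10
business_days = 260

hours_per_year = minutes_per_trip * trips_per_day * people * business_days / 60
print(f"{hours_per_year:,.0f} hours per year spent in transit")  # 520 hours
```

That 520-hour figure is the "business intelligence says today that's X hours" number; business analysis then asks what a printer at each desk would save.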
Understanding the value of data-driven decision-making
If I told you that we have an opportunity to purchase a product, and that it was going
to make a million dollars in the first year, you would get excited, right? That one
number sounds amazing to people. Some people would immediately look at that
number and begin to act; that is a data-driven decision. In our scenario, our company
thought the product at cost was a steal of a deal, and they bought it all with the
hopes that a million in revenue would produce a large profit. The only problem is
that million dollar number is only the top line and in no way reflects the impact to
the bottom line. When people get a single number in their mind, they can miss the
other more important numbers, and it can have damaging consequences. Let's take
our million-dollar project and break it down. The company bought the product at
cost, and the company will sell the product at a list price, and the difference between
that cost and the list price is the margin. All the numbers here, cost, list and margin,
matter but of the three, the margin matters the most. It's important to remember
that a million in revenue does not equal a million in profit. Our profit is made by the
margin. Some people see that margin and get excited, but if you stop there, you are
in trouble. You must account for the items that eat the margin because that eats the
profit. When you use data to inform your decision making, you must use the top-down and bottom-up approaches together. If you're an experienced person with these
types of scenarios, you may have already figured out where this is headed. What do
we need to do to produce the distribution of this product? We'll keep it
simple. When companies sell products, someone has to sell them and there's a cost
to that. And even if it's an online sales model, there are people involved in
maintaining the information to make that happen. Let's just say for every $1 of the
product, it costs 10 cents of that dollar to pay for the sales process. Then there are
other costs: cost to store the product, cost to package the materials, cost to deliver
the product. There's cost in infrastructure. Cost to automate sales processes. Payroll
costs for people to maintain systems, answer phones, and ensure that delivery is
met. If you really dig in, you realize quickly that if everything that is required to make
the million in revenue eats up the entire margin, or you sold it at the wrong price, or
you get hit with unexpected costs like increase in delivery, increase in storage or
changes to tax, you are sunk. And if you can't sell it and you can't hold onto it, that
million dollars no longer looks like a gold mine. So being data
informed can ensure you're profitable by revealing that the million dollars is really not a million
dollars after all, but potentially a total loss, which is maybe why it was a steal of a deal in the first
place.
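The cost, list, and margin arithmetic from this scenario can be sketched as a quick calculation. The 10-cents-per-dollar sales cost comes from the example; the purchase cost and the storage and delivery figures are hypothetical placeholders:

```python
# Top line vs. bottom line for the million-dollar example.
revenue = 1_000_000          # the exciting "top line" number
cost_of_goods = 700_000      # what the company paid (hypothetical)
margin = revenue - cost_of_goods  # profit before the costs that eat it

# Costs that "eat the margin"
sales_cost = revenue * 0.10  # 10 cents of every dollar pays for sales
storage_cost = 120_000       # hypothetical warehousing cost
delivery_cost = 90_000       # hypothetical delivery cost

profit = margin - sales_cost - storage_cost - delivery_cost
print(f"Margin: ${margin:,.0f}")  # Margin: $300,000
print(f"Profit: ${profit:,.0f}")  # Profit: $-10,000 -- a loss
```

With these illustrative numbers, a million in revenue leaves a $300,000 margin that the operating costs fully consume, which is exactly the trap the narration describes.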
Questioning techniques to collect the right data
Have you ever heard of analysis paralysis? It's where overthinking a problem
stops you from moving forward. It's a real thing for some people. It's likely due to
stress and anxiety related to making the wrong decision or not knowing exactly what
to do next. Building an approach and thinking through standard questions and
critical thinking with active listening should help you. Technical skills or hard skills are
one thing for the analyst, but the soft skills matter just as much. And if you're stuck,
no hard skill matters. To be fair, more exposure to real problems, and to solving
them with data solutions, will help you build your approach, but you can slowly start
building your questioning now. There are some common questions you might
ask for every data-related project and the questions might be more specific based
on actual problems and the data that you have at hand. Our scenario is that we have
five of our top products. They're being purchased all the time, but the company is
losing money. First, you need to understand that there is data in everything, in it and
around it. This will help you start to consider the questions. Our task as the analyst
is to try to determine why, if the sales are moving, we are losing money. There are some
basic questions that you should ask about each of the five products. Have these
products ever been profitable? If they were profitable in the past, at what point in
time? What is different about this point in time versus that point in time? Did the
wholesale cost change? Did the list price change? Did the cost of storing or
delivering the product change? Any of these answers will lead you further into data
analysis. When we start with these basic questions and begin to answer them, then
it will lead to more questions. As an example, let's say that in our initial
questioning, we determine that neither the wholesale cost nor the list price has changed
in the last three years. The cost to deliver has not changed enough to drive an
impact. The cost of storing the products has been steadily increasing. The next round
of questions begins. Is it only these five products that are impacted by the steady
increase in storage cost? And what we discover is that it's not just impacting these five
products but all the products. The company just started to realize it in these five
products. What can we do to reduce the storage costs? What type of increase can
we justify on the products without overpricing the product? Both these questions
lead to very different datasets within the organization and then each round of
questions and answers leads to more questions. The goal here is to remember you
must start asking questions and then remember they rarely stop. They just drive
further investigation. The greatest part of the question process is that the end result
is discovery and recommendations that are made to improve outcomes.
Discovering and interpreting existing data
Have you really thought about how much data is around a person? There's more
than you may think. There's data like date of birth, names, race, and ethnicity. There's
work data like employee ID, job title, hire date, or department. These data points are
the items we think about when we work with data related to people, right? Some of
this data is a single fixed value, like birthday; it doesn't change. Then
there are other items like job title, which might change when you get a new
promotion at work. There's also real-time data always occurring, like heart rate, blood
sugar, blood pressure, and even temperature. There's also geographical data, like
location. Imagine social data as well: what brands we follow, what brands we
purchase, how often we have food delivered versus going out to eat. Data is always
happening. The challenge we face as data analysts is there's a lot of potential data
and not all of it is actually available to us. We also find a lot of the same data is
redundant and in some cases can even be incomplete or inaccurate. All of us are
seeking the single source of truth from the data that we work with. We actually want
it to be accurate when we report on the data. Let me give you some
examples. Companies have several different software packages that are used to
handle different types of information. And they're often disconnected. There's
people management software for HR type information, which is employee data. We
have our marketing and sales management data. That's maybe in a couple of
different systems and it handles not only staff information in regards to sales, but
also customer information. There is also software that kicks in when a customer
goes from being in conversations with our sales team to purchasing from the
company. That data flows from purchasing to the warehouse. There's also data that
flows to the accounting team to handle transactions that support reporting like
profit and loss. What this means is that data flows through the organization at
different times. Systems are often disconnected so finding which systems have the
most accurate information is one of the first challenges. The only way to really know
is to begin the investigation and question along the way. We sometimes hit
roadblocks due to permissions and the sensitivity of data. For example, the data you
might need to confirm your values is stored in the accounting software and only the
accounting team has access to that data. Just because you can't directly access
it doesn't mean you're done. You can provide them the values and those teams will
work to help you validate. In reality, whether systems are connected or not, they
should hold the same record of information. If your sales team reports that there's a
hundred thousand dollars set to invoice this month, then the accounting software
should reflect a hundred thousand dollars worth of invoices. When they don't
balance out, you have to figure out where the breakdown has occurred. As a data
analyst, you need to be thoughtful of the type of data you might find. And then you
have to find the data you do have access to and develop strategies to validate your
reports. Just remember data shows up in everything but it's our job to bring it
together accurately.
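The balance check described above, where the sales system and the accounting system should hold the same record of information, can be sketched as a simple reconciliation. The $100,000 figure comes from the example; the individual invoice amounts are illustrative:

```python
# Reconciliation sketch: does the sales pipeline total match the
# invoices recorded in the accounting system?
sales_pipeline_total = 100_000                   # reported by the sales team
accounting_invoices = [42_500, 31_000, 26_500]   # illustrative amounts

accounting_total = sum(accounting_invoices)
difference = sales_pipeline_total - accounting_total

if difference == 0:
    print("Systems balance.")
else:
    print(f"Breakdown of ${difference:,} to investigate.")
```

When the difference is not zero, that gap is exactly the "where has the breakdown occurred" question the analyst must chase down.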
Data sources and structures
We hear about data all the time, right? But what does that really mean? Let's start
with the basics. Data has a value: your birthday, for example, is a value like November
20th of some year, so it would be recorded as 11/20 of that year. Data has a type, like
birthday. It's a date data type. And data has a field name, like DOB, for Date of
Birth. When we put these fields together, like First Name, Last Name and Date of
Birth, we're creating a record. People use records and spreadsheets all the time, but
they don't really think of the sheet as a table, but it actually is. It's just a table called
Sheet One. And when fields are combined in a database, they're stored in
tables. They still have names, values, and data types. And when we fill in this
information for a person, we're creating a record. Tables are a great way to capture
multiple types of data in a structured way. This way of storing data is way more
flexible than the spreadsheet environment. There are also other types of
systems that collect and store data for the analysts to use for their reporting
requirements. This varies of course by company, but you can expect to find
spreadsheets, databases or even data warehouses. Data warehouses really are data
systems that have the refined tables from our production systems, like the
purchasing system, for example. A customer-dedicated software system might have
a database with hundreds of tables and details, but only certain tables and fields are
needed for reporting. These fields get cleaned up by data warehousing
professionals and brought into the warehouse for storage and safekeeping. It is a
valuable source of nicely structured data that has been vetted for the analysts to
begin their reporting projects. Structured data that fits neatly into tables and feeds
a beautifully designed warehouse is amazing, but not all data is structured. This is
where systems like data lakes help organizations capture and store data
before it's actually refined for reporting needs. Data warehouses, data lakes,
and even data lake houses are very interesting. And if you're into designing
databases or designing data solutions, you may find you want to explore these skills
further. Data analysts will tap into these systems for the data. They don't necessarily
create them. As a data analyst, you will find yourself working with various systems
and file types. At the start of your career, you can expect a lot of spreadsheets and
CSV files as you work your way up to working with data stored in larger data
systems. And don't worry, no matter the level, most data professionals love a good
spreadsheet when it's used for analysis and not for storing data.
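The fields-make-a-record idea from the start of this section can be sketched in code. This is a minimal illustration; the names and birth date are invented, mirroring the narration's First Name, Last Name, and DOB fields:

```python
from dataclasses import dataclass
from datetime import date

# A record is just a set of named, typed fields filled in for one entity,
# like one row in a table. Field names mirror the narration.
@dataclass
class PersonRecord:
    first_name: str   # text field
    last_name: str    # text field
    dob: date         # the DOB field, a date data type

# Filling in the fields creates a record.
row = PersonRecord(first_name="Sally", last_name="Smith", dob=date(1990, 11, 20))
print(row.dob.strftime("%m/%d/%Y"))  # 11/20/1990
```

A database table is simply many such records stored together, each column carrying the same field name and data type for every row.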
Describing data best practices
Do you have an approach to data? Have you ever really thought about
it? I know after years of working with data for projects or ad hoc reporting, that I've
built a pretty defined approach to every data set that I work with. There are just some
things that I do with every data set. The process may be a little bit different based on
the software that I'm working with, but in this example, I'm using Microsoft
Excel. This transactions file has actually been exported from a software that we use
to analyze our transactions. Normally, when we're working on an ad hoc report or a
project, we have an expectation of what we're going to deliver. But to show you that
this approach will work with any data set, I don't have an end goal in mind. I just
want to learn about this data set. If I take a little time upfront to learn more about
this data set, I'll be better off when I start trying to meet the end goal of the
project. Excel will sort, filter and perform data commands on what it sees as the data
set. And that's a key point, what Excel sees as the data set. So the very first thing I
want to do is confirm that the data that I'm working with in the transactions list, is
entirely recognized by Excel as a data set, meaning that there are no breaks in the
data. I do this by using one of my most favorite shortcuts. It will select all the data
that Excel sees in the range. To do this shortcut, I just simply do Ctrl+A. That's not
enough though, because this is a lot of data. It looks like it picked it all up. But if I
zoom out, I notice pretty quickly that I have a broken data set. You see all of column
Z is empty. So that means Excel will only sort and filter everything to the left. In order
to fix this data set, I can right-click column Z and delete it. Okay, let's do that shortcut
again. I'll do Ctrl+A, and now I have a fully intact data set that Excel will
recognize. This makes it easier for me to sort, filter, and do all sorts of data
commands. Okay, let me do Ctrl+Home to go up to A1. Before I go any further, one
of the very first things I'll do in working with the data set, is I'll make a copy of it. So
I'm going to take my mouse and put it on the bottom of the transactions list here on
the transaction sheet tab. I'll hold my Ctrl key, and then I'll drag and drop it one step
to the right. Now, it's important, I'm going to let go of my mouse first, and then let
go of Ctrl. That makes a copy. Okay, I'll rename it to working. Copy. That way, if I
mess up, I can always go back to the original transactions list. Okay. Let's take a
deeper look at this data. When I see fields named ID, like transaction ID, this is
database language for key fields. Okay, let's see how many of those we have. So I'm
going to hit Select All, which selects the entire sheet, and double-click in between
the A and the B column headers, and this auto-fits all of the columns. So I'm looking at
transaction ID. I have product IDs. I have reference order ID. So these are key
fields and it automatically makes me wonder, are there duplicates in this data set? So
let me highlight the transaction ID because that's what I really need to be unique. So
I highlight transaction ID, and I want to spot the duplicates before I deal with them if
they exist or not. I'll go to Conditional Formatting. I'll choose Highlight Cells Rules, and
I'll choose Duplicate Values. I'll go ahead and make them light red fill, and click OK. As I
look at the data, I immediately see some duplicated data. That means
that I have duplicates in this data set. So if I were to total it up or count the records, I
would get an inflated amount of information. Okay, so I need to address these
duplicates. Let me do Ctrl+Home to go back up to A1. It's easy to deal with
duplicates when you know what fields to choose. What makes this a duplicate
transaction, is the fact that the transaction ID is duplicated. I see them all highlighted
in red. It's a little bit more obvious now that we know that duplicates exist, but in a
sea of data, it can be hard to find them. Okay, let's go remove the duplicates. Now
this command will actually remove them, but that's okay, I have my copy here. I'll go
to data. I'll choose Remove Duplicates. I'll choose Unselect All for this example. And
I'll choose transaction ID. I'll go ahead and click okay. It tells me that it found a ton
of duplicates, and that it's only going to leave me 1,228 records that are
unique. Perfect. I'll go ahead and click okay. Now I have a data set with integrity, no
blank rows, no blank columns. I know that I don't have duplicates because I've
removed them, and I have a working copy so that I can continue to explore this
data. This is in no way, a comprehensive list of approaches. These are just techniques
that when you start working with Excel data, you might want to do them on every
data set.
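The same flag-then-remove routine translates directly to code. Here's a minimal pandas sketch of the approach, using made-up transaction IDs and amounts (the column names and values are assumptions for illustration, not the course file):

```python
import pandas as pd

# Hypothetical transactions data; column names and values are illustrative.
df = pd.DataFrame({
    "TransactionID": [101, 102, 102, 103, 104, 104],
    "Amount": [250.0, 99.5, 99.5, 410.0, 75.0, 75.0],
})

# Flag every row whose key field repeats (like conditional formatting
# with Highlight Cells Rules > Duplicate Values).
dupes = df[df.duplicated(subset="TransactionID", keep=False)]
print(f"{len(dupes)} rows share a TransactionID")

# Remove duplicates on the key field, keeping the first occurrence
# (like Data > Remove Duplicates with only Transaction ID checked).
clean = df.drop_duplicates(subset="TransactionID", keep="first")
print(f"{len(clean)} unique transactions remain")
```

Just like in Excel, flagging first lets you inspect the duplicates before the removal step actually deletes rows, and working on a copy of the DataFrame preserves the original.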
Assessing and adapting the data for transformation
[Instructor] Have you ever heard of data profiling? It's where we create a high-level
profile of the characteristics of the data that we're working with. We should apply
this approach to every data set. The greatest thing about profiling data is that when
we use this approach, we get to learn about the data we're working with at a high
level. Profiling helps to inform us on some pretty valuable items. It tells us how much
data we have in this set. It can also tell us what the totals, counts, or averages of any
number may be. This helps us validate our numbers later. It can also inform us about
the data cleaning we will need to complete when we get ready to transform our
data. I have some sales order data here, and I want to profile this data to help me
get started working towards a report on sales orders. I'll first start by profiling the
amount of data. I want to take a look at the record counts. How many records do I
have in this data set? To do this, I can click on column A, and use the auto calculate
feature on the bottom right-hand side of my screen. Now I have all of the auto
calculate functions turned on. To do that, I just right-click the auto calculate area and
then I can select each one of the options that I need. Okay, great. So when I look at
this, I can see there's a count and a numerical count. So count will count everything I
have highlighted and the numerical count will only count the numbers. So if I look
at this record set, I actually have 3,500 records that represent the sales orders. We
can also use sum and average. Let's take a look at how much money is actually
represented in this record set based on total due. I'll highlight column L. And this
tells me I have approximately $33,700,000 worth of money represented in the total
due column. It also tells me that my average is $9,633. Let's look at the average of
the subtotal. This is the money before tax and freight. So the average subtotal in
this data set is $8,581. And the total is around 30 million. This tells me that if I see
numbers like 60 million or 66 million, I have a problem in my data. So knowing how
much it would total is important for validation later. Data profiling is so easy to
do, but this is just the starting point of what you'll learn to profile your
data. Remember, it will also help us inform our data cleaning. Take a look at columns,
B, C, and D with me. These are order dates, but they look like zeros. If I click on B2, I
can see that there is actually a date included. I just can't see it based on the
formatting. Also for the purposes of my reporting, I don't need those
timestamps. They're all set to midnight anyway. So this informs me that on my data
cleaning process, I'll need to address the dates. There are additional profiling
options that we will uncover as we explore deeper into our data and with other
tools, but anyone with a data set and Excel can use these options to profile their
data.
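Outside of Excel's auto calculate bar, the same profile comes from a few aggregate calls. A small pandas sketch, using a stand-in for the sales order data (the figures here are invented, not the $33.7 million set in the lesson):

```python
import pandas as pd

# Small stand-in for the sales order data; values are illustrative only.
orders = pd.DataFrame({
    "SalesOrderID": [1, 2, 3, 4],
    "SubTotal":  [8000.0, 9000.0, 8500.0, 8800.0],   # money before tax/freight
    "TotalDue":  [9600.0, 10700.0, 10100.0, 10400.0],
})

# How much data do we have? (Excel: COUNT on a highlighted column)
print("records:", len(orders))

# Totals and averages to validate our numbers later
# (Excel: Sum and Average in the auto calculate area).
print("total due sum:", orders["TotalDue"].sum())
print("total due avg:", orders["TotalDue"].mean())
print("subtotal avg:", orders["SubTotal"].mean())
```

Capturing these baseline totals up front gives you the same sanity check described above: if a later report shows roughly double the expected total, you know something duplicated along the way.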
Understanding the rules of the data
We hear about business requirements in the world of business all the time. They
control what we are doing on any given project. Part of meeting the business
requirements is the business rules. It is important when working with any data that
you understand the rules around the data that you're working with. These rules can
inform you when to expect data, what you can do with data that meets certain
criteria, and also explain what needs to happen in the transformation of data. Let's
work through some examples of business rules and how they can impact our
data. Let's get started with just understanding what we mean by rules. Business rules
can be as simple as a definition: is a contact a salesperson, a customer, or a
prospect? It could be as simple as a business rule that defines that a prospect becomes a
customer once they actually place an order. These rules also control the flow of
data. So if in our system, we have a sales order record, that means that the order has
occurred. It means that that prospect and the potential sale made it to a certain stage
of the process. Then the business can use this to easily distinguish a potential sale
from an actual sale. This is an example of a simple business rule, and this rule can
also be used to then convert a prospect to a customer using data. Some rules can be
a bit more specific and have a technical requirement. We have some sales order
data. This sales order data is going to be prepared to go into a new system that
provides additional reporting about our sales orders. This information will go to our
production team. So the business requirement is that we need to prepare the data
to go into the new system. Now we have the data that we want to transfer to another
system for reporting purposes. It has a specific template, and we must use this
data from our system to match that data specification of where it's going. We've
been provided this technical requirements document for our data. Let's take a quick
read through that. First of all, it tells us that the sales order ID must be converted to
a text data type, but it must not contain any letters. All of the date fields should not
include time stamps. We also have to have a main account GL number. And that main
account GL number holds a four-digit code for accounting and the last two digits to
specify the category. Also, we see that territory ID and comment fields need to be
removed. And the final step is to save our data in a CSV or comma-separated value
file so that we can import it into the new reporting system. So now that we have our
technical requirements, let's take a look at the data. Okay, so the business role in our
technical spec said that sales order ID and sales order number need to be text data
types. So I can look at sales order number and see pretty quickly it's a number data
type. I know that because it's right-aligned in the field. I can see the sales order
number is already a text data type. It's aligned left, but it doesn't meet the
requirements because it contains two letters, S and O for sales order. I'll take a look
at my dates. I can clearly see they include time stamps, so part of my technical
requirement will be to clean this data to meet the rules, which would be only dates
and no timestamps. Our specification also said we had to have a main account GL
number, and this is a four-digit code for accounting and the last two digits specify
the category. But when I look at the data, I don't see a main account GL
number. However, because I know the business rules of the account number for
these records, I know that that main account GL number could actually be created
from the account number. I also see we do have columns that they said to not
include, which would be the territory ID and the comments. When working with any
new data project, you want to make sure you consider the rules of the organization
in regard to their definitions for data. You also need to account for the flow of
data and any specific technical requirements.
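The rule checks described in this spec can be expressed as a few lines of validation code. This is a hedged sketch: the field names and sample values below are assumptions for illustration, not the actual course files.

```python
import re

# Illustrative sample values; names and formats are assumed, not from the
# real data set.
sales_order_number = "SO43659"
order_date = "2024-05-01 00:00:00"
account_number = "4010-15-03"   # assumed: GL code - account - category

# Rule: sales order number must be text containing no letters,
# so strip the "SO" prefix (and any other letters).
cleaned_number = re.sub(r"[A-Za-z]", "", sales_order_number)
assert cleaned_number.isdigit()

# Rule: date fields must not include timestamps.
date_only = order_date.split(" ")[0]

# Rule: the main account GL number can be derived from the account
# number; the last segment specifies the category.
gl_number, acct, category = account_number.split("-")
print(cleaned_number, date_only, gl_number, category)
```

Encoding each business rule as an explicit check like this makes it obvious which records fail the spec before you try to import them into the new system.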
Tips on preparing the data in Excel
[Instructor] In the life of every data analyst, you reach a point where it's time to
prepare the data. This is the part where we clean and transform our data to meet the
requirements, and if you haven't heard, we do this a lot. You've profiled your
data, you've reviewed all the business rules, and now it's time to dig in and actually
get started. I want to work with my sales order data to prepare it for a template to
import it into a new system. I typically start with a new blank workbook and I'll use
Power Query to connect to my data, and then I'll do my data transformations
there. I'll go to my data tab, I'll choose get data, and I'll choose from file. There are
several connections here, but because my data is an export that's stored in an Excel
workbook, I can choose from file and from workbook. Okay, I'll navigate to my data
for template, I'll double click it, and this is what is establishing the
connection between my Excel file and Power Query. I'll choose sales orders, and then
I have two options. I can go ahead and load the data to this spreadsheet, or I can
choose transform. Because I know I have transformations to make, I'll go ahead and
choose transform data. Okay, so I'm connected to my data, and I can see my sales
order query. I see my query settings and my applied steps. First off, I want to show
you that it promoted my headers. Now, what that actually means is it took the first
row of information that it saw from my spreadsheet and made that my column
headers. And then it changed the type. What that means is that it looked at the
second row of information, which was actually the first row of values, and tried to
determine what the data types would be based on the values that it sees. Okay, for
example, sales order ID. It has numbers. So it automatically translated that as a
number. Order date has a date and a time, so it automatically made that date and
time. Okay, as part of my requirements, I know that I have to change sales order ID to
be text. So I'll hit the one, two, three, and change it to a text data type. It's asking me
do I want to replace the current step or add a new step? I don't want to change how
it read every single data type, so I'll go ahead and add a new step, and then on the
right hand side, you see my applied steps has a new step where I changed the sales
order ID to text. We also know from our technical requirements that we have a
sales order number. It is also supposed to be text, but it cannot contain any
letters. So I need to remove the S and the O from the front of the data. So what I'll
do is I'll highlight that whole column and I can right click and choose replace values. I
can also just select a single field and choose replace values. I can highlight the whole
column and choose replace values up top. Okay, so I'll choose replace values. No
matter what step I choose, the outcome will be the same. So I want to find all of the
SOs in this column, and I want to replace them with nothing because I want just the
number. I'll take a look at the advanced options. It's asking me do I want to match
the entire cell contents or replace using special characters? Neither of these
apply. Okay. I'll choose okay, and then immediately, I see sales order number. It's still
text, which is appropriate for my requirements, but it no longer contains the S and
the O. On the right hand side in my applied steps, I see replaced value. And if I
needed to change anything, I could hit the little gear shape, and that takes me right
back into my steps. Okay, I'll choose cancel there. Because Power Query keeps all of
our steps, it's similar to what people do with recording macros or coding VBA for
data cleaning. Except we're not having to code or record. We're just actually
performing the actions and it's keeping up with it. Let me show you what I mean. So,
let me click on navigation. Notice that the first row contains my column headers. So
now when I choose the next step, it shows me that it promoted those headers. Then
it changed all the data types based on the data that it sees. And then I started my
first step which was changing the sales order ID, and notice I still see the SO until I
choose replaced value. That means if my data changes, I can update my data
source, and it will reapply all the same steps. Okay. Let's go ahead and change the
data types for dates. I don't need the timestamp, and also notice they're all set to
midnight anyway. I'll go ahead and hit the dropdown and choose date, choose date
again, and then date again. So now I have my dates in order. Perfect. Let's go ahead
and work with parsing text. So, first of all, we have an account number. This account
number actually really needs to be referred to as the main account GL. So I'll go
ahead and double click account number and change it to main account GL. Now
each piece of this main account GL actually represents another field of data that I
need. So, I need to actually parse this text. I need to split it apart and I'll use what's
called a delimiter to do that. Notice there's a dash in between each section. The first
thing I'll do is duplicate this field, and that will throw it all the way to the right. That
way I can keep the main account GL and then also create the three new fields. I'll
right click, I'll choose split column, and I'll choose delimiter. Notice there's several
options here. I'll choose delimiter. My delimiter is a dash, although I have multiple
options here. Okay, so custom dash, and I do want to split it at each occurrence of
the delimiter. All right, I'll go ahead and click okay. Let me scroll over. And I have my
three new fields built from the main account GL. Let's go ahead and name these. The
first one should be labeled GL number. This field will be called account
number, 'cause that's what it represents. And then this last number here is called
category. Okay, perfect. Now, if you look up top, you see what's called M for
mashup. This is the language that's keeping all of my steps. If you want to see all of
those steps, you can go to the advanced editor, and this is its recording of everything
we're completing. Okay. I'll go ahead and close that advanced editor. I also need to
remove columns. Now, I can actually right click any column and choose remove. I
can keep the columns that I want and then right click and tell it to remove all other
columns, or I can go to choose columns up top and then just deselect the ones I do
not need. So I do not need territory ID or comments for my final file. I'll go ahead
and do okay. Now that all my transformations are made, I can go ahead and close
and load this data to my sheet. It tells me that I have 3,500 rows loaded. This is
perfect. Okay, great, tells me where my data sources are and the time of my last
refresh. These are basic steps that anyone can perform to clean up columns, convert
data types, and break text apart.
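Those same Power Query steps, converting types, replacing values, splitting on a delimiter, removing columns, and saving to CSV, can be sketched in pandas. The column names and sample rows below are assumptions standing in for the Excel export:

```python
import pandas as pd

# Stand-in rows; the real file comes from an Excel export.
df = pd.DataFrame({
    "SalesOrderID": [43659, 43660],
    "SalesOrderNumber": ["SO43659", "SO43660"],
    "OrderDate": pd.to_datetime(["2024-05-01 00:00:00",
                                 "2024-05-02 00:00:00"]),
    "AccountNumber": ["4030-98-01", "4030-98-02"],
    "TerritoryID": [5, 5],
    "Comment": ["", ""],
})

df["SalesOrderID"] = df["SalesOrderID"].astype(str)   # number -> text type
df["SalesOrderNumber"] = df["SalesOrderNumber"].str.replace(
    "SO", "", regex=False)                            # replace values: drop SO
df["OrderDate"] = df["OrderDate"].dt.date             # strip the timestamps
df = df.rename(columns={"AccountNumber": "MainAccountGL"})

# Split the GL into three new fields at each dash delimiter.
parts = df["MainAccountGL"].str.split("-", expand=True)
df[["GLNumber", "AccountNumber", "Category"]] = parts

# Remove the columns the spec said to exclude, then save as CSV.
df = df.drop(columns=["TerritoryID", "Comment"])
df.to_csv("sales_orders_clean.csv", index=False)
```

As with Power Query's applied steps, keeping the transformations in a script means that when the source data changes, you just rerun it and every step reapplies.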
Transforming data in Excel with Power Query
[Instructor] We've been tasked to look at how long it takes for our supplier
transactions to go from the transaction date, to the finalization date. We want to see
if there's any suppliers that may take a little bit longer for any given
transaction. What we really hope to find, is that most of our transactions are under
three days. We have all the data we need, but we don't have all the calculations we
need to perform the analysis. So let's get started with a few transformations, and
building out the calculations we need. Okay, I'll go to queries and connections, and
I'll choose to edit my suppliers query. There are a few transformations that normally
would've required me to create functions in Excel, but because I'm using Power
Query, I can perform them without writing a single formula. Let me start
by showing you supplier name. For our purposes, we need all the supplier names to
be in uppercase. I can easily transform this column to uppercase. Also, we need the
transaction date, but we also need the transaction year. I'll right-click transaction
date, I'll duplicate that column, and then I'll transform this to just show the year. I can
do that by right-clicking, transform, I can choose year, and then choose year. Okay,
I'll go ahead and name that, transaction year. And then I'll just go ahead, and move
it over by my transaction date. I have two amounts here. I have the amount excluding
tax, and the actual tax amount. What I really need is the total amount. So I'm going
to create my first formula. I'll go to add column, I'll choose custom column, I'll name
it total amount. And then using my available columns on the right hand side, I'll
scroll, I'll double-click amount, excluding tax, I'll add the plus sign, I'll double-click
tax amount. It tells me that I have no syntax errors, and I can click OK. I'll go ahead
and adjust this to be a currency data type. I only need the total amount, so I'll go
ahead and right-click amount excluding tax, and choose remove. And then I can also
remove my tax amount. Now we want to look at the number of days that have
elapsed between the transaction date, and the finalization date. Let's go add another
column. I'll go to custom column, I'll name this days, I'll choose transaction
date, minus, finalization date, and click OK. Using this method will return the number
of days, but because the transaction date was before the finalization date, it's
showing as a negative number. It also doesn't really look like a number. It looks like
a timestamp. What I'll do is go ahead and change it to a whole number. And what
I'm really looking for is the absolute value. So again, I'll right-click, transform, and
choose absolute value. Now I have all of the information I need, except I don't have
the field that tells me if it's over or under three days. I'll use a conditional column. I'll
tell it to look at the days, and then provide me text, that says over three days, or
under. I'll go to conditional column, I'll name this over under, I'll choose days. And if
it's greater than, or equal to, three days, I want it to say three days or more. For
anything that's two days or less, I want it to say two days or less. This is a logical
function that looks at the days, and then gives me a value if it's true, or a value if it's
false. If I were doing this in Excel, it's similar to an IF function. I'll go ahead and click
OK. Now I'm prepared to start my analysis. I'll go to home, I'll choose close and
load. Now I see I have all of my extra columns that I've added, and my supplier name
is automatically capitalized. This is fantastic. I'm ready to start looking at my supplier
transactions, to determine if they're over or under three days. Now that our data is
prepared, we can answer a few common questions on the production days. We'll
start by inserting a pivot. I'll do insert, pivot table. It's going to use my supplier's
range, and it'll be on a new worksheet. Perfect. I'll drag my over and under to
rows, I'll go ahead and drag my supplier transaction ID to values. And because it's a
number, it will automatically sum it. I'll go ahead and change that to a count. Click
OK. Just looking at the numbers, I can tell that most of the transactions have been
three days or more. Let me do one more quick analysis step. I can right-click, choose
Show Values As, and tell it to show me the percentage of the grand total. This high-level
detail tells us that 69% of our transactions really are taking three days or more to
produce; only about 31% are two days or less. Okay. We need to do
some more analysis. Transforming data can mean a lot of different small
techniques applied to the data as you work to get to your analysis.
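To make the chain of transformations concrete, here is the same uppercase, year, total amount, elapsed days, and conditional-column logic sketched in pandas, with invented supplier rows standing in for the course data:

```python
import pandas as pd

# Illustrative supplier transactions; names and values are assumptions.
tx = pd.DataFrame({
    "SupplierName": ["Contoso Ltd", "Fabrikam Inc"],
    "TransactionDate": pd.to_datetime(["2024-03-01", "2024-03-10"]),
    "FinalizationDate": pd.to_datetime(["2024-03-05", "2024-03-11"]),
    "AmountExcludingTax": [100.0, 200.0],
    "TaxAmount": [8.0, 16.0],
})

tx["SupplierName"] = tx["SupplierName"].str.upper()      # UPPERCASE transform
tx["TransactionYear"] = tx["TransactionDate"].dt.year    # duplicated date -> year
tx["TotalAmount"] = tx["AmountExcludingTax"] + tx["TaxAmount"]

# Days elapsed; take the absolute value so direction doesn't matter.
tx["Days"] = (tx["FinalizationDate"] - tx["TransactionDate"]).dt.days.abs()

# Conditional column, like Power Query's over/under three days.
tx["OverUnder"] = tx["Days"].apply(
    lambda d: "three days or more" if d >= 3 else "two days or less")

# Pivot-style summary: each bucket as a percentage of the grand total.
summary = tx["OverUnder"].value_counts(normalize=True) * 100
print(summary)
```

The `value_counts` summary at the end plays the role of the pivot table with the count shown as percentage of grand total.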
Transforming data in SQL
[Instructor] If you are a data analyst, at some point you will encounter or hear SQL
or SEQUEL. Let's start with the basics. SQL stands for Structured Query
Language. Structured Query Language is vast. It's not unlike any other
language. SQL, often pronounced "sequel," is a computer language that works with data and the
relationships between them. Microsoft SQL Server is a relational
database management system developed by Microsoft with the primary function of
storing and retrieving data, although it does so much more. It was developed
over 30 years ago and it does a lot of different things with data. And it's important
to understand you don't need to know them all. As a data analyst, you do need to
know some basic queries. A basic query allows you to select data from the
database. There are two required statements for a SELECT. You must know what you
want and where you want it from. This is the SELECT and the FROM statement. The
SELECT will list all the fields from the table, and the FROM actually lists the table
name. If I want to filter data, then I'll use the WHERE statement. And if I want to sort
data, I can use the ORDER BY statement. WHERE and ORDER BY are not required to
be in the statement. However, when they are used together, they are required to be
in the right order. You have to filter the data before you sort it. Let me show you how
to run some basic SEQUEL statements. I'm using SQL Server Express and I'm
working with Microsoft SQL Server Management Studio on the Wide World Importer
Sample Database. I'm going to run the supplier transactions. I'll right click, I'll select
top 1000 rows. This generates a basic SQL statement. It selects all of the fields from
Wide World Importers. Okay, it's also a filter for the top 1000 records. I'll go ahead
and remove that statement and execute it again. If you look on the bottom right
hand side, this tells me I have 2438 supplier transactions. To add more meaning to
this data, I actually need to add another table. And this brings us to working with
joins. When you have data in multiple tables, you leverage joins to control what
data shows in the results. Okay, I'm going to highlight my select statement. I'll right
click it and go to the design query and editor. Even though I can code all these
statements, it is easier to work inside a GUI, a graphical user interface, especially if
you're at the beginning. Okay. So I'm going to size my table here so I can see
everything. All right, I'll right click and go add a table. And I want to add the
suppliers. Because these tables have an established relationship in the database
design, they're automatically joined. They're joined by the supplier ID being in both
tables. I can also see that it's a key shape with a one to many, meaning I have one
supplier listed and they may be attached to many transactions. When I hover over
the diamond shape, it shows me the inner join, but I can also see that in the
statement here. Okay, perfect. Now let me add the supplier name. Now it will
automatically throw the supplier name at the last part of the select statement. But if
I want to put it at the beginning, I can just drag it up. Okay, I'll click OK, and then I'll
execute my statement. An inner join works by looking at both tables to find a
match. And what that means with these two tables is that if I have a supplier
name and that supplier has a transaction record, they will show in the results. This is
showing me 2438 records where I have a supplier and a transaction. This is
perfect. The only issue I have with this data is if I wanted to report on suppliers that
we have in our system, regardless of their transactions, I have to adjust the join
type. All right, I'll highlight my statement, I'll go to the design view. The diamond
shape is where I can control the joins. I'll right click this and tell it to show me all rows
from suppliers. And this will create an outer join. Whether it's left or right is determined
by how the database sees the tables. So I've told it to show me all
suppliers, regardless of their transactions. I'll click OK. And if you'll notice in the
statement, it's a right outer join. It sees the supplier table on the right of the
data. Okay, I'll go ahead and execute. I now see that I have more records. I have 2,444
records. This means I do have suppliers listed in our data set that do not have
transaction records. Let's scroll to the bottom and see what that looks like. Starting
with Nod Publishers, I see the first set of suppliers that do not have transaction
records. They're easy to spot because the transaction fields all say null in each
row. That's because there are no supplier transactions for these final suppliers. If I
want to see if there are supplier transactions that do not have a supplier, then I can
adjust the join type. I can tell it to give me a left outer join. Now I could go to the
design view and adjust this, or I can just type left outer join here in my
statement. And then I can execute. Notice there are no transactions without a
supplier. That's because there's a relationship between
these tables that will not allow you to put a transaction in without a valid supplier. But
again, you could easily have suppliers that do not have transactions yet. Because join
types do impact the data we have in our results set, you always need to critically
think through what you're trying to achieve with your data and know that you might
need to adjust the join type.
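You can see the inner-versus-outer difference without a SQL Server install. This sketch uses Python's built-in sqlite3 with tiny stand-in tables mimicking the Wide World Importers suppliers and transactions (the rows are invented for illustration):

```python
import sqlite3

# Tiny in-memory stand-in for the two Wide World Importers tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Suppliers (
        SupplierID INTEGER PRIMARY KEY, SupplierName TEXT);
    CREATE TABLE SupplierTransactions (
        TransactionID INTEGER PRIMARY KEY,
        SupplierID INTEGER REFERENCES Suppliers(SupplierID),
        Amount REAL);
    INSERT INTO Suppliers VALUES
        (1, 'Contoso'), (2, 'Fabrikam'), (3, 'Nod Publishers');
    INSERT INTO SupplierTransactions VALUES
        (10, 1, 500.0), (11, 1, 250.0), (12, 2, 900.0);
""")

# INNER JOIN: only suppliers that have a matching transaction.
inner = con.execute("""
    SELECT s.SupplierName, t.Amount
    FROM Suppliers s
    INNER JOIN SupplierTransactions t ON s.SupplierID = t.SupplierID
""").fetchall()

# LEFT OUTER JOIN: all suppliers, with NULL where no transaction exists.
outer = con.execute("""
    SELECT s.SupplierName, t.Amount
    FROM Suppliers s
    LEFT OUTER JOIN SupplierTransactions t ON s.SupplierID = t.SupplierID
""").fetchall()

print(len(inner), len(outer))
```

The outer join returns one extra row, the supplier with no transactions, whose transaction fields come back as NULL, exactly the pattern described above.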
Transforming data in Power BI
[Instructor] Power BI offers two core functions for the data analyst: Transforming the
data, as well as presenting the data. We want to analyze the sales of our products to
note our top 10 products. We will eventually visualize this data for an executive
meeting or a future dashboard. The opening screen of Power BI desktop is a blank
page ready for visualization. This happens after you've connected to your data. If you
notice in my Field pane on the far right, I've connected to the tables I need to analyze
the top products. It's the Order Details and the Products. We'll do the
transformations on Product and Order tables in Power Query. I'll get to show you the
Group By function, so I can total the orders. As well as the Query function, where I
merge the orders and the products together using the Merge Queries
option. Because I've connected to my data, I can just go to the Transform option and
begin my data cleanup. I'll start with Products. The only information I need for the
top product analysis is the product ID and the product name. So I'll click Product
ID. Hold my Control key and choose Product Name. I'll right-click and Remove Other
Columns. This just leaves me with the two that I need. Because this information might
present better with the product name being all uppercase, I'll go ahead and right-click and transform it to be uppercase. I'm now ready to move on to Order
Details. Order Details gives me the order ID, the product ID, the unit price of that
product, how much was ordered, and the discount. One of the very first things I need
to create is the function that gives me the total amount after the applied
discount. Okay, I'll choose Add Column. I'll do a Custom Column. I'll do Total Order
Amount. In this statement, I'm going to create the subtotal, calculate the
discount, and then deduct them from each other. Again, mathematically, you could
do this multiple ways. Okay. I'll click OK. And now I have my quantity times my unit
price minus the discount amount. I want this to be a fixed decimal number. Fantastic,
I now have my Total Order Amount. Now I'll use the Merge function to create the
query that merges the Products table to the Order Details table. I'll start by clicking
on Products. I'll go to my Home tab. And I have the option for Merge Queries. We
have two options here: Merge Queries or Merge Queries as New. Merge queries, if I
select it, will allow me to merge data directly into Products. Merge Queries as New
will give me a third object to work with. For the example in what I'm creating today
for analysis, I just need to do a Merge Query. I can merge that data directly into
Products. Now I have the Merge screen and I have Products. And I want to merge it
with Order Details. And the common field between the two is the Product ID. So I
want to make sure that I've highlighted those. The join types, just like any other data
set, are here in the Merge. If you look at the bottom of the screen, it says Join
Kind. And if you notice, there are 75 records that match 77 rows from the first
table. That means I actually have two products with no records. Meaning, they
haven't been ordered. And that's okay. We're looking at the top products, so
obviously, all of them wouldn't have been ordered. I'll go ahead and hit that drop
down. I have Left Outer, which would show me what it's showing now, all products,
whether or not they have orders. Right Outer, which would show me all order details regardless
of the match to the product. A Full Outer, meaning, if I have products and order
details that don't have records matching, it would show all rows from both. An Inner
join, which is what I need here, shows me just products with orders. You also have Left
Anti and Right Anti. This would show you just the null values. So if I were to choose
Left Anti, it would only list the two products that didn't have order details. For my
top analysis, I need Inner. I'll go ahead and click OK. And now I can expand my
table. I'll hit my Expand here. I don't need to use the original column name as a
prefix, but that's a preference. I don't need all of the columns. I really just need the
Total Order Amount. I can go ahead and click OK. And now I see the Product Name
and the Total Order Amount. Now I'm ready to group them up. This will allow me to
use the Group By function, and total by each product. Okay, I'll go to my Transform
tab. I'll go to Group By. Okay, I want to group by the product name. And I want to
get a total... by summing up... the actual total order amount. This will take each
individual line item and total it up by product. Giving me the total orders. I'll go
ahead and click OK. Now I see each product name and how much was
ordered. Okay. When I go back into my visualization, I really only need to see
Products. So I'll go ahead and tell Order Details not to load. I'm not using it in any
visualizations, so it's okay for me to continue here. All right, I'll go ahead and go to
Home. And I'm ready to apply this data set... to my visualization page. So now on my
Fields list, I see my Products. I'll visualize this in a table, so I'll choose Table. And then
I'll drag my Product Name... and my Total to the Values. Okay, I'll go ahead and size
this out so I can see it. Now, right now, this represents every single product. And
we're trying to get to the top 10. I'll go to the Product Name on the filters. I'll tell it
to do an advanced Top N filter. Where top is 10. I'm going to base that on the
total, so I'll drag that Total to the By value. And then I'll apply that filter. After
applying that filter, I see the top 10 products. Let's go ahead and sort it. Just by
clicking that Total header. These techniques and joins show you exactly how
powerful Power BI and data can be when you establish cleaning routines for basic
presentations of data.
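The merge, group, and Top N steps walked through above can be sketched outside of Power Query as well. Here is a minimal pure-Python analogue; the table and field names mirror the Northwind-style data used in the demo, but the sample values are assumptions for illustration:

```python
# Sketch of the Power Query steps: compute a discounted total,
# inner-join Order Details to Products, group by product, take Top N.
products = [
    {"ProductID": 1, "ProductName": "chai"},
    {"ProductID": 2, "ProductName": "chang"},
    {"ProductID": 3, "ProductName": "syrup"},   # never ordered -> dropped by inner join
]
order_details = [
    {"ProductID": 1, "UnitPrice": 18.0, "Quantity": 10, "Discount": 0.1},
    {"ProductID": 1, "UnitPrice": 18.0, "Quantity": 5,  "Discount": 0.0},
    {"ProductID": 2, "UnitPrice": 19.0, "Quantity": 2,  "Discount": 0.0},
]

# Total Order Amount: the subtotal minus the discount amount
for row in order_details:
    subtotal = row["UnitPrice"] * row["Quantity"]
    row["TotalOrderAmount"] = subtotal - subtotal * row["Discount"]

# Inner join on ProductID, then group by product name (uppercased, as in the demo)
names = {p["ProductID"]: p["ProductName"].upper() for p in products}
totals = {}
for row in order_details:
    if row["ProductID"] in names:          # the inner join keeps matching rows only
        name = names[row["ProductID"]]
        totals[name] = totals.get(name, 0.0) + row["TotalOrderAmount"]

# Top N (here N = 2), sorted descending by total
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)
```

Note how a product with no order details simply never appears in the grouped result, just like the two unmatched products in the inner join above.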
Common cleaning and transformation
When building your cleaning and transformation toolbox, there's some common
cleaning and transformation items you will use. Others will be more specific to the
needs of the data you work with. Let's start with general cleaning. Spaces are invisible
to the eye, but in fact, they're characters. And when a field has extra spaces, you will
want to clean those by removing them. There are leading spaces which are
spaces that are at the front of the field. There are trailing spaces which are at the end
of the field. When we want to remove either leading or trailing spaces, then we can
use functions like trim or clean. The act of breaking out text is referred to as parsing
text. And we can do this with any type of delimiter and every program handles this
a little bit differently, but the outcome is the same. A space can also serve as a
delimiter, since the spaces between words are characters too. Imagine first name and last
name. In the case we want to have both last and first in their own individual columns
for sorting, as an example, we will use the space to break those columns. This is not
the only time we parse text using delimiters. You might break apart text fields based
on things like a dash or even a comma. We use things like text-to-columns, split by
delimiter and functions like left, right and mid to work with parsing text. We don't
only break apart text. There's also times when we need to combine text fields
together. This is commonly known as concatenate or concat. We also replace text
with valid text. For example, if someone enters an abbreviation of a state in the
United States, but we want the full state spelled out, we might replace that text with
the valid response. It could be a misspelling that we're correcting. There are several
methods for replacing invalid data with valid data. We also change the case of
text. Example would be maybe we need everything to be in uppercase or
lowercase or even corrected to proper case. There are functions to do each of these
commands, and again, they might differ between programs, but the outcome will be
the same. These are very simple commands to perform in any data program. You
may find that you'll also remove duplicates from a dataset and this can be done with
commands like remove duplicates or using distinct keywords in query statements. We
also transform data types to be appropriate for what we need to do with the
data. You may have date fields that are stored as text, but to work with date-related
functions, you need to convert it to an actual date data type. The same goes for
numbers. If you need to work with a mathematical function, then the value of the
field must be a number data type. These are just a few of the basic commands that
we use for cleaning and transformation of data and some of the first ones to
understand and master.
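These common cleaning steps differ a bit between programs, as mentioned above, but a short sketch in Python's built-in string tools shows the same outcomes. The sample values are invented for illustration:

```python
# Common cleaning steps: trim, parse on a delimiter, concatenate, change case,
# remove duplicates, and convert a text field to a real data type.
raw = "  Smith, John  "

trimmed = raw.strip()                                  # remove leading/trailing spaces (like TRIM)
last, first = [p.strip() for p in trimmed.split(",")]  # parse on a comma delimiter
full = " ".join([first, last])                         # concatenate back together
print(full.upper(), full.lower(), full.title())        # change case

# Remove duplicates while keeping order (like Remove Duplicates / DISTINCT),
# normalizing case first so "tx" and "TX" count as the same value
states = ["TX", "tx", "CA", "TX"]
unique = list(dict.fromkeys(s.upper() for s in states))

# Convert a date stored as text into an actual date data type
from datetime import date
d = date.fromisoformat("2024-03-01")
print(unique, d.year)
```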
Using built-in functions
[Instructor] There are a lot of people who don't enter into the field of data because
they're intimidated by math. It's important to recognize that one of the powers of
these tools is that they perform all types of math, from basic to complex mathematical
computations for us. We don't have to manually create every function we need. The
tool provides us a lot of calculations. For example, in this Power BI dashboard, let's
take a look at the fields, look at Quantity and UnitPrice and even Discount. Do you
see how they have the sigma shape? It's because it recognizes them as numbers, and
this means it will automatically aggregate them for us and summarize them. Let me
show you what I mean. I'll go ahead and add a table, and I'll go ahead and expand
orders. I want to bring in the order ID. I also want to take a look at the product
name. Now let me expand this. I want to be able to see as I add fields in. I'll go ahead
and bring in the quantity, and then I'll bring in the unit price. Do you see how it
automatically totals the quantity and the unit price? This doesn't make sense to
me. The unit price is just the price and the quantity, well, that's the quantity that was
ordered for that order ID. So what I'll do is I'll right-click on the quantity and tell it to
not summarize. I'll also go to unit price and tell it not to summarize. I think I would
prefer to see unit price over quantity. So what I'll do is I'll just drag that order and
change them, perfect. I do want to see the subtotal, and one thing I'm finding here
is that I don't have it. I'll build that in my model. I'll go to Transform Data. I want to
add it to the order details. I'll go to Add Column, and I'll choose Custom. Okay, I'll
go ahead and call it SubTotal, and here I'm using the function builder. I'm going
to go ahead and say UnitPrice, by double-clicking, multiplied by Quantity. It tells me I
have no syntax errors, which is great. I'll click OK, and now I have my new
subtotal. Notice that my default is A, B, C and 1, 2, 3, alphanumeric. I'm going to go
ahead and change that to a fixed decimal number, all right? I'll go to Home, Close &
Apply, and then I'll bring my subtotal into my table. Now, in this case, I do want this
number to total. This makes perfect sense to do that. This is my amount before I
apply a discount. Okay, let's take a look at something else Power BI does for us and,
in this case, it keeps us from having to write functions. Notice the order date. It's
actually got a date icon, and when I hit the little expand, it has a date hierarchy. That's
because Power BI assumes that I will probably want to work with year, quarter,
month or day. Let me drag my date hierarchy into my model, and I'll put it up by the
order ID. Notice, automatically, I get the four individual fields. There are times I do
want this and times I don't. In this case, I just want to see the order date. So what I'll
do is I'll actually right-click the order date hierarchy and tell it just to show me the
order date. If you work as a data analyst, you probably work with pivots and
matrix. Remember, that's rows, columns, and summary values. Here, I'm going to add
a matrix, and I'm going to look at values based on the shipping country. I'll go ahead
and add Ship Country to the rows, and I'll grab my subtotal and add it to my
values. This lets me see every single country, and it automatically summarizes its
subtotal. Now, if I wanted it to be an average, I could right-click and choose
Average. If I wanted to show the max of any particular subtotal in a country, I could
choose Max. Again, I'm not completing this math. I'm just choosing the right
options. I'll go ahead and choose Sum. Another powerful feature of Power BI is the
ability to use quick measures. I'll go ahead and click on Quick measures. These are
actually measures that are written in DAX. They're freely available for me to use. I can
go ahead and hit the dropdown, and I can see options like Aggregate per
category, giving me average, variance, max or min. I have different filter
scenarios, different time intelligence scenarios, like year-to-date totals, year-over-year change. I also see totals, like running total. Let's do that. I'll choose the running
total. I want to work with my subtotal, and I want that running total to be based on
the different country. So I'll go to my orders. I'll choose my ship country. I'll go ahead
and leave it as ascending, and I can hover over each one of these options to learn
more about it. All right, I'll go ahead and click OK. Now I see the DAX behind this
particular calculation, and on the right-hand side, do you notice how I have my
subtotal running total? And it has a little calculator shape. Let me go change this to
read RT_SubTotal. Okay, and then I want to actually go put this into my matrix, which
I'll do by just dragging it underneath my subtotal there. So the running total works by
adding each value to the total so far. I started with approximately 8,000, so my
starting running total is approximately 8,000, and when I go to the next value,
Austria at about 134,000, it adds that 8,000 to the 134,000 and gives me the 142,000. That's the
running total. One of the really great things is that I can actually add more
variation to this and change my running total. For example, I want to see a running
total across the years. So I can actually drag Year into my columns, and then my
running total starts over again for each year. These are just simple examples of
some of the power of the built-in functionality in Power BI. Just remember, with
great power comes great responsibility. That really sounds like the beginning of a
superhero movie. I tell people all the time, anyone can make numbers show up, but
that does not make them correct. If I could offer you a piece of advice, really
think through what you're trying to accomplish with the numbers, consider what
functions you might need and then also read about what's available. The more
experience you have, the easier the research will be for you but don't worry, you'll
always be studying something new.
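The running-total quick measure described above is generated in DAX, but the arithmetic itself is simple to sketch. Here is a minimal Python analogue using the same approximate figures as the demo (the country list and values are illustrative assumptions):

```python
# Sketch of a running total like the Quick Measure: each row's total
# is added to the total accumulated so far, in ascending order.
from itertools import accumulate

subtotals_by_country = [
    ("Argentina", 8000.0),     # starting value, approximately 8,000
    ("Austria",   134000.0),   # next value; 8,000 + 134,000 = 142,000
    ("Belgium",   35000.0),
]
values = [v for _, v in subtotals_by_country]
running = list(accumulate(values))   # cumulative sums: 8000, 142000, 177000
for (country, _), rt in zip(subtotals_by_country, running):
    print(country, rt)
```

Restarting the running total per year, as in the demo, would amount to grouping the rows by year first and accumulating within each group.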
Relational databases
Have you ever really thought about how systems store data? I bet if you're a new
analyst, you have not gotten that deep into the idea of how data is stored, but you
just know that it is stored. Relational databases have been around for a while and
you will hear people talk about SQL databases or SQL scripts or statements. A data
analyst doesn't have to be fluent to be effective. You can do a lot with just being
somewhat literate with SQL. This is a key area that you can further study if you're
interested. RDBMS stands for Relational Database Management Systems and server
technology, like Microsoft SQL Server, can store these databases. There are
others. Even something as simple as an Access database has relationships and
relational data. We need to go back one step and discuss structured data. When you
work with a spreadsheet that has column headings and data values, then you're
actually working with structured data. This data has a field name, we see it in the
column headings, and then it has a data type and a value. When we build relational
databases, we build structured data sets that are stored in the form of tables. These
tables then become connected through a relationship between key fields. These key
fields are unique identifiers that help control the data that can and cannot go into a
table. When structured data is defined and then stored into tables and then the
tables are related, this creates a relational database. These relational databases are
used to hold information and we as data analysts use this structured and stored data
to build reports, visuals, and analyze data. One thing that is important to note is that
you as the data analyst must understand the structure that is used to store the
data does not always make it easy for reporting. Why? The rules for effective
storage are different from the rules used to combine data for reporting. They are two
very distinct roles and functions, even if they work with the same data. As an analyst,
you do not have to know how to design large-scale data systems, but you will want
to understand some database design techniques so that it makes understanding
someone else's design easier.
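To make the idea of related tables and key fields concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names are invented for illustration, not taken from any particular system:

```python
# Two related tables with a key field, using the sqlite3 module
# from the Python standard library (an RDBMS in miniature).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
con.execute("""CREATE TABLE order_details (
    order_id INTEGER REFERENCES orders(order_id),  -- key field relating the tables
    product  TEXT,
    quantity INTEGER)""")
con.execute("INSERT INTO orders VALUES (1, 'Sally')")
con.executemany("INSERT INTO order_details VALUES (?, ?, ?)",
                [(1, 'chai', 10), (1, 'chang', 2)])

# A join across the relationship: one order, many order details
rows = con.execute("""SELECT o.customer, d.product, d.quantity
                      FROM orders o
                      JOIN order_details d ON o.order_id = d.order_id
                      ORDER BY d.product""").fetchall()
print(rows)
```

The `ORDER BY` is there only so the output order is predictable; the join itself is what expresses the relationship between the key fields.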
Modeling data for Power BI
We will work with different data in different data sets or tables to do analysis and
visualization. When we have multiple tables that we're working with, we'll want to
model our data to get the most out of it. When you have an entity relationship
diagram where the tables and relationships are shown in a model, you're actually
seeing the model of the data. Now I've already connected data to my Power BI
Desktop and on the right hand side, you see my fields list. I have different tables of
information required for my reporting. It appears that I'm ready to go but I need to
go one step further. These data sets are meant to be joined together. There are
several ways to join and model data in Power Query for Power BI. When we perform
merge queries, for example, we're actually establishing a join but we can also go to
the modeling section and model this data from the very beginning and this allows
the data to communicate through the joins, meaning if I reference an order, it knows
what product and what order details are related to that order. In looking at the
diagram, we see that there are some joins already established. Power BI as a
convenience tries to join the data automatically; this is called auto detect. You
should always confirm that the
relationships that it establishes for you are correct. Remember, it's easy to model
data when you know what data is related to each other. Let's look at the orders table
and the order details. These are joined together by the order ID. Also notice we have
a '1' and a '*' or star symbol. This shows us the cardinality of this relationship, it's a
one to many, meaning we have one order and many order details, not unlike when
you place an order and buy multiple things, you have one order record and then the
different line items and quantities for the products that you purchased. Let's look at
the products information and the order details. These are joined by the product
ID and again, it's a one to many relationship. There are other relationships when we
refer to cardinality, there's one to many, many to one, one to one and many to
many. One to one means that there is only one record tied to one record between
the two tables. One to many and many to one, like our examples here mean that we
have one record in one table that's tied to many records in another table. I do have
a join that needs to exist but doesn't. Take a look at the employees, you see how
there's no line to any other table, this means that the model doesn't know how the
employees relate. I'll use the employee ID and drag it to employee ID, this establishes
my relationship. I can go ahead and look at the properties of this relationship. I'll
right click the line and go to properties. This shows me the orders table, which is the
many side and the employees table, which is the one side and I see the cardinality is
many to one. I'll go ahead and click OK. To manage all the relationships, I can go to
manage relationships up top and work with each one of them. Okay, let's see the
model at work. I'll go to report and I'll begin to build a basic visual. I'll start by just
adding a table. I'll go ahead and bring in the company name from customers. I'll
bring in the last name from employees. Okay, I'll collapse those so I can see. From
order details, I'll actually go ahead and bring in the order ID. I'll bring in the order
date hierarchy. I just want to actually show the order date so I'll right click that and
just show the order date and then I'll bring in the product. I actually want to put the
product in between the order ID and the order date and then I'll also bring in from
order details, the unit price and the quantity and then I'll bring in my total after
discount. Because I've modeled my data, I know that I have the correct
company listed with the correct last name of the salesperson with the appropriate
order ID and the order details for each one of their orders. Because we've modeled
this data together, we can now explore the data using all the features that help us
visually without having to create various merge queries to accomplish the joins.
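The cardinality terms above follow directly from whether the key values repeat on each side of the relationship. A small hypothetical helper makes that rule explicit (the sample key lists are invented to mirror the orders and order details example):

```python
# Determining cardinality from the key fields: a side is the "one" side
# exactly when its key values are unique.
def cardinality(left_keys, right_keys):
    left_unique = len(left_keys) == len(set(left_keys))
    right_unique = len(right_keys) == len(set(right_keys))
    if left_unique and right_unique:
        return "one-to-one"
    if left_unique:
        return "one-to-many"
    if right_unique:
        return "many-to-one"
    return "many-to-many"

orders        = [1, 2, 3]          # OrderID: unique, one row per order
order_details = [1, 1, 2, 3, 3]   # OrderID repeats, one row per line item
print(cardinality(orders, order_details))
```

This is also a handy sanity check before trusting an auto-detected relationship: if the side you expected to be the "one" side has duplicate keys, the model is not what you think it is.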
Master data management
Have you ever been working with data and noticed that customer addresses all
reference the region in different ways? In some countries, we have states,
provinces, or districts. And when they're used in the data and entered by different
people, they may reference the full state, province, or district name or they may list
the abbreviation. Data like our customer and their address information would be
considered master data. We want everyone in the organization who works with this
data to have the same consistent list of information. When an organization takes the
time to design rules around the master data, this will also inform all the data analysts
of what types of transformations apply. Using tools like Power Query, either in Excel
or Power BI, we can easily make these corrections and save these steps so that as
new data comes into our reports, it will conform to the standards. Master data is not
just address information, though. It could be project names or product names. If we
call a project something different, then it makes it difficult for the data analyst to
report on this information with ease. There are tools that exist to support large
scale organizations with master data management. But I would argue no matter the
size of your organization, if you do not have a plan in place, the analyst will be
dealing with it all the time. So as master data management aims to keep a clean,
complete, and accurate list of master data for the organization, if you don't have
master data management, then you will need to develop a plan to keep a
nice, consistent list of data when you report. Let's take products, as an example. Two
companies have merged. They sell the exact same products, but in both
companies, they're not called the same name. As a data analyst, you can use a
table that holds every possible name and the correct name so that when you report,
you can leverage joins to give yourself a master table of information. When a new
name pops up, you'll have to address it in your master table, but it's better to have
that table than to not have it. Your data set being clean and complete is one of the
most important parts of any project. Just remember that all of your data skills can
apply to many types of data scenarios, not just the analysis or the presentation part
of the job.
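The merged-companies products example above amounts to a lookup against a mapping table. A minimal sketch, with invented product names standing in for the two companies' naming differences:

```python
# A mapping table from every known alias to the one master product name.
alias_to_master = {
    "Choc Bar":        "Chocolate Bar",   # company A's name
    "Chocolate Barre": "Chocolate Bar",   # company B's name
    "Chocolate Bar":   "Chocolate Bar",   # already the master name
}

incoming = ["Choc Bar", "Chocolate Barre", "Widget X"]
cleaned, unmatched = [], []
for name in incoming:
    if name in alias_to_master:
        cleaned.append(alias_to_master[name])
    else:
        unmatched.append(name)   # new names surface here for the analyst to address
print(cleaned, unmatched)
```

Collecting the unmatched names, rather than silently dropping them, is what lets you grow the master table as new aliases pop up.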
Unstructured data
Did you know that there is way more unstructured data in the world than
structured? As a matter of fact, did you know we use structured data to produce even
more unstructured data? Data that neatly fits into tables or spreadsheets is
structured data, and unstructured data is literally everything else. When we post
videos, take pictures, create PDFs of bills for our clients, we are contributing to the
vast, constantly growing amount of unstructured data. The minute we had the ability to walk
around with a PC, video camera, still camera and social media outlets in our hands
like our mobile devices, the world of data exploded. Let's just take an image for
example. This is unstructured data. You must look at the image to understand what
the image is representing. Same thing for a video. You have to watch it, and it's an
immense amount of data. With that said, there's also semi-structured data, which is
a mix of both structured data, and unstructured data. Let's say you receive an
image of the cutest cat ever on the beach via a text from your best friend. When you
see this image, you see the cutest cat ever or at least someone's opinion that is the
cutest cat ever, and you see the beach. A data professional sees much, much
more. What I see when I look at the cutest cat ever picture is much more than a cat
on the beach. I see the time of day, the weather, the location, the type of cat, the color
of the cat, even the age of the cat. I also see the image type like PNG, and what's the
image size, as well as the dimensions, and what's the quality of the image. Don't
forget. We mentioned that we received this from someone, that's data. We received
it at a certain time, and that's data. Did I mention when the picture was taken, and
by who? I mean this list can keep going. Just think. It went from being the cutest cat
ever to a lot of data really fast. Now, imagine people posting their favorite images on
their social feeds. Multiple times per minute, and then others are sharing that
image or they like it, or they look at it and move on. That's also data. Unstructured
data requires our brain to review and provide context, and structured data fits neatly
into designs. And semi-structured is everything in-between structured and
unstructured. Depending on the organization you work with, and what they do as
their product or service will determine what tools, and software you need to work
with with their data. Just knowing that there are different types of data like
structured and unstructured can help you explore the roles in data. Data is not going
anywhere. It is only growing. And just think, there was a time when the name didn't
exist. It should keep us all motivated for what's coming next.
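The cat-photo example above is really semi-structured data: structured metadata wrapped around unstructured content. A small sketch, with entirely hypothetical field names and values:

```python
# Semi-structured data: structured fields (format, dimensions, sender, time)
# surrounding an unstructured payload only a human or model can interpret.
import json

photo = {
    "file": "cat_on_beach.png",          # hypothetical file name
    "format": "PNG",
    "dimensions": [4032, 3024],          # width, height in pixels
    "taken_at": "2024-06-01T17:42:00",
    "sender": "best_friend",
    "pixels": "<binary image data>",     # the unstructured part
}
record = json.dumps(photo)               # serialize for storage or transfer
parsed = json.loads(record)              # the structured fields query cleanly
print(parsed["format"], parsed["dimensions"][0])
```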
Visualization methods and best practices
I read a post lately about how the person designed this beautiful dashboard and no
one was using it. This left the data professional perplexed and frustrated. I get
that. But it immediately made me start thinking of why. Why, if it is so great, are the
users not using it? Well, have you ever heard of beauty is in the eye of the
beholder? It could be as simple as the data analyst designed something that only the
data analyst can use. Looks great, but no one else understands it. It could be that the
data analyst just loaded this beautiful dashboard, sent a link, and said, "Here's your
great, new dashboard." Best practice number one. For a moment, be the person
you're designing for. If you want to see what this feels like, imagine driving a 10 or
15 year old vehicle and then go sit in the newest car on the lot with the most
features. The dashboard will likely take you more than a minute to translate before
you drive it. Now, imagine that someone hands you the keys and says, "Take it for a
spin. It's amazing." What do you do? Depending on who you are, you might freeze, go
with it, or get out. In the same scenario, imagine that the car salesman came out and
explained the differences between your car dashboard and this new dashboard, or
at least hit the high points with you. What would you do then if they directed you
where to look to make you feel more comfortable for going for a drive? Always take
time to document and provide a little bit of training on your visuals. Be
consistent. Use the same color for the same item all the way through. If my brain
says, "This product is blue on this stacked bar," then every time I see a reference to
this product, it will be blue. And then when I see the same blue in a new visual and
believe it's the same item, only to realize it's a totally different product, I get stuck
on why it isn't the same. And the data is no longer showing me anything. Don't overcomplicate
to show off your fancy visualization skills. I do understand that people want to use advanced
visuals to show their skills. But the point of the dashboard has nothing to do with
your skills, but providing information. If you provide valuable insight through correct
visuals and layout, they will believe you to be a visual magician, and they will not
care that you presented it in simplified visuals. Be sure to title, label, and add tooltips
appropriately. People should be able to read a title for context, be able to easily read
the labels, and hover over to get additional insight, not just see the same thing the
visual already shows. Remember that a picture is worth a thousand words. And if we
could all make decisions by consuming thousands of lines of data, we wouldn't need
visuals. Not all data visualization is a chart or graph. Make appropriate use of cards
for high level totals and other aggregate functions. And remember, a table, matrix or
pivot is also a visual presentation of data, and some people prefer that matrix to a
chart. So it never hurts to give them both to meet the needs of the audience. Always
remember that your visuals will be used to provide information. So make sure that it
does it in a way that people can quickly understand and make decisions.
Creating reports to visualize your data over pages
[Instructor] Not all data is best consumed using a dashboard. Yes, dashboards
provide valuable capabilities, but some reports can be valuable in different
formats. When we have line-item reports, that type of display will produce several
pages, and a dashboard representation of that data may not be the most user
friendly. There are tools like Power BI Report
Builder that allow us to build what are called paginated reports. Paginated reports
allow you to connect to data, not unlike dashboards. In fact, before the popularity of
the dashboard most of the reports were paginated reports. Although some people
think they are a thing of the past I think it's important to remember what determines
the style of your report is the need for how that data is best visualized and how it's
going to be consumed. If it's going to be delivered via PDF, or even printed for a
meeting. In our role we've been asked to update an existing report that's currently a
line item report and is many, many pages long. This report would simply benefit from
some groups and summaries. Let's go to Report Builder and redesign this sales order
meeting records report. This report is connected to AdventureWorks 2019, which is
a popular sample database. And I have a data set here called sales records. And when I
expand that, I see all of the fields that are available to me. But if I want to look at the
underlying query I can right click and go to query. This lets me look at the different
fields that are being used in the actual report. It'll also let me take a look at the
relationships, which again, multiple tables means I need relationships. And it shows
me the join type, which is inner. Now there are some fields that I need that are not
in the data set. I can go right click and go to the data set properties. And this lets me
work with the individual fields. I can go in and add a calculation, which I've already
done here for order date. Let's go take a look at that function. This actually allows
me to format that order date value in a date format that's a short date, which is
perfect. I'll go ahead and click okay. Click okay again. This report does provide
valuable information, but again, that multiple lines is not effective for the
meeting. So we're going to replace it with a matrix. And even though we'll have
line items, we'll just have fewer, and it will become more meaningful for the
meeting. The matrix is just like a pivot in Excel. It has rows, columns, and summary
values. So we want to look at a simple subtotal for the sales people for each
product. So we'll go to insert, choose matrix, we'll click on insert matrix and then we
can click in the body of our report. We'll drag name for product name to the
rows. We'll put the last name of our salesperson in the columns and then we'll use
the total due for our summary field. In Report Builder we build in the design view, but
to see the data we have to run the report. I'll choose run. Now this report actually
shows the high level subtotal for each salesperson across the top, and then also
shows each breakdown of each product down the left hand side. We can go to the
very last page and see that we went from 3,000 plus pages to 6 pages. Fantastic. Let's
go back to the design view. We can make a few adjustments. I want to make our
product column a little bit wider. Let's preview it again. Definitely getting a little bit
better. We'll go to our page set up. Because it's wide we'll make it landscape. We
definitely want to adjust our margins to be smaller. And then when we're ready, we
can export our report into various formats. But we can also publish these
reports. Paginated reports can provide valuable reporting when your data expands
over many pages. And remember it can easily be published, PDF'd, or printed.
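The matrix built in Report Builder above is the same rows/columns/summary-values pivot idea, which can be sketched in a few lines of Python. The products, salesperson names, and amounts here are invented for illustration:

```python
# A matrix/pivot sketch: products as rows, salespeople as columns,
# and the summed total due as the summary value in each cell.
sales = [
    ("Road Bike", "Blankenship", 1200.0),
    ("Road Bike", "Sergio",       800.0),
    ("Helmet",    "Blankenship",   60.0),
    ("Road Bike", "Blankenship",  300.0),
]

matrix = {}
for product, person, total_due in sales:
    cell = matrix.setdefault(product, {})
    cell[person] = cell.get(person, 0.0) + total_due   # sum per row/column cell
print(matrix["Road Bike"]["Blankenship"])
```

Collapsing thousands of line items into one cell per product and salesperson is exactly why the paginated report shrank from 3,000-plus pages to six.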
Creating a dashboard for reporting
[Instructor] Dashboards can provide valuable insight into different data
scenarios. And the scenarios are created by us, the users. Dashboards can be
built where they show key performance indicators. And for us, it's the sales
performance in the countries that we're interested in. Here, the size of the
dot represents the amount of sales in the country. So if I click on North America, I
will see on the left-hand side all the products. And because these headings are
sortable, I can interact and sort the total after discount, bringing the highest
products ordered to the top. But how does that look in different countries? So for
example, if I choose Sweden, will it also be wines? As soon as I click Sweden, I see
that the bratwurst takes the lead. I click off my map, and it brings all of my sales back
to the top where I see wines and bratwurst and even peanut butter cups are the top
of my sales. I also noticed that I have some formatting issues here, so I think I'll go
ahead and address those now. If I go to order details, I can choose total after
discount. I'll go ahead and make that two decimal places. Perfect. Okay. So we want
to create a dashboard for our sales managers. They need to be able to work through
several scenarios of the data. One of the key things they ask for are the order
details. I'll go ahead and add a page here. And we'll name it sales orders. I'll start by
adding a table and then adding various fields to it. So I'll start with adding the
company name of the customer. I also want to bring in from the orders table the
order ID. Go ahead and size it out just a little bit, so I can see it populate. I also want
to bring in product name. I want to bring in the quantity. Because that's a number
field, it automatically tries to sum it up. I'm going to right-click it and tell it not to
summarize. I want to bring in the unit price. I also do not want to summarize it. And
then I want to bring in that total. Okay, great. I have all the basic information my
sales managers will need. One of the other requests that they had was to be able to
see how the sales people handle different customers. Are they just working with one
customer? Or are they working with a lot of different customers? We can visualize
this data using a stacked bar. So I'll click the stacked bar. I'll bring in the last name
from employees. I'll go to customers, make that my legend. And since we're looking
at their sales volume, I'll go back and grab that total there. Perfect. Now I can clearly
see that there's a nice spread of how we help our customers here. We have several
sales people, and they serve several of the same customers. We can see the total of
all sales at the bottom of the table. But it would be nice if we could see that across
the top. I'll go ahead and decrease the size of my stacked bar. And I'll introduce a
card into the mix. I'll bring it up here. I'm not going to size too much. And I'll bring
in that total after discount. That information really stands out across the top. Really
easy to see it. Okay. So let's see how this interacts. Right now, we see all sales, all
order details. There's lots of them. We see our scroll there. But what if I just want to
focus on Sergio? I can click Sergio, and now I'm seeing only his records. What about
Blankenship? Choose Blankenship. I can see the total for Blankenship, and I can see
all of Blankenship's sales. So this is just one way these visuals interact with each
other to filter other information. Okay, I'll go ahead and click in the corner there and
remove that. There are times though that we want to see other filters. Like, what
about the year? What about the country? What about the customer? Again, we can
see that multiple customers are served by our sales people. So I'll go ahead and
create some slicers here. Set that sort back to company. I'll choose my slicer. And go
ahead and sort of size it. I'll save that intricate sizing for last. The very first type of
filter I want to create onto the dashboard will be the order date. I'll grab my order
date hierarchy and drag that into the field list. Now, I don't really need quarter or
day, just month and year is fine. And I really don't want to take up the screen
space, so I'll go ahead and make this a dropdown. And that will be the same for all
of my slicers so that I can be consistent. Okay. I'll go ahead and add another
slicer. This slicer will be for last name. Put that into my fields list there. And again,
adjust it to a dropdown. If I want to focus on a particular group of customers, then I
can actually go add that slicer. I'll put in that company name. Again, that's a healthy
list, so I'll make it a dropdown. I'll go ahead and size it just a little bit. And if I want
to create a country dropdown list, I can do that as well. Create my last slicer
here. Place it where I think it's going to go. Go ahead and do some basic sizing here
to fit it all in. Don't want them to overlap. All right, perfect. Now I need to use the
ship country. I don't want to use the customer's country, but where they're shipping
the information, so I'll drag that ship country to field. All right. And then again, to be
consistent, I'll make it a dropdown list. Okay. Now size my card over. All right. Let's
watch our dashboard interact. So I want to see 2022 sales. And I only want to look at
Brazil. Okay. Let's look at Germany. Excellent. I'm seeing valuable insight. Imagine
that we've gotten all this data collected, and we need to use it, maybe we need to
email it to these particular customers. Let me show you a very valuable feature. Now
that I have these filters set, I can actually export this data. This actually creates a
spreadsheet of my filtered scenario. This provides a ton of value for the
user. Exporting the data out in this way provides valuable access to information. The
sales managers merely need to create the scenarios. Then they can work with this
data or even copy and paste it to use it into emails to share information with
customers when they're not going to ever have access to our
dashboards. Dashboards are amazing. And when designed effectively, they can
provide a lot of value. Remember, effective being the keyword. When we take data
from reading it line by line on multiple pages to being interactive, we're giving
people the ability to question the data, create different scenarios, and act on the
insight.
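Outside of Power BI, the filter-then-export workflow walked through above is only a few lines of code. Here is a minimal Python sketch, with invented field names, that applies slicer-style criteria and writes the filtered scenario to CSV the way the export feature does:

```python
import csv
import io

# Hypothetical order records standing in for the dashboard's data model;
# the field names here are invented for illustration.
orders = [
    {"order_id": 1, "ship_country": "Brazil",  "year": 2022, "total": 120.50},
    {"order_id": 2, "ship_country": "Germany", "year": 2022, "total": 310.00},
    {"order_id": 3, "ship_country": "Brazil",  "year": 2021, "total": 75.25},
]

def filter_orders(rows, year=None, ship_country=None):
    """Apply slicer-style criteria; None means that slicer is cleared."""
    return [
        r for r in rows
        if (year is None or r["year"] == year)
        and (ship_country is None or r["ship_country"] == ship_country)
    ]

def export_scenario(rows, out):
    """Write the filtered scenario out as CSV, like the dashboard's export."""
    writer = csv.DictWriter(out, fieldnames=["order_id", "ship_country", "year", "total"])
    writer.writeheader()
    writer.writerows(rows)

# "2022 sales, Brazil only" -- the scenario built with the slicers.
scenario = filter_orders(orders, year=2022, ship_country="Brazil")
buffer = io.StringIO()
export_scenario(scenario, buffer)
```

The point is the same as in the dashboard: the user only declares the scenario; the filtering and the export are mechanical.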
Gathering requirements for visualizations
We have all heard the stories where the entrepreneur designs their life-changing
app on a napkin and then moves on to greatness. Well, guess what? You can apply
the same approach to your visuals. Maybe it's not a napkin, but I can tell you from
my experience, even starting with your customer and a napkin is better than
guessing at the visual representation of the data on your own. People never know
what they want in a dashboard or report until they can see what you see, and if that's
all in your head, well no one is a mind reader. The best way to express your ideas is
to create a mockup of the dashboard. Just lay out different objects, like a
table, matrix, or stacked chart, and add a few filters on the image; this will help everyone
get on the same page about the design. And if it's multiple pages with
navigation, then wireframing helps communicate the navigation of information
before you build it. Wireframing allows you to build out a skeleton of the pages, it
doesn't have to be designed with all the colors and final graphics, it's just a
sketch. The mockup might have a little more visual styling than the wireframe, but
even just a few minutes of investing time into these together will reduce tons of back
and forth on the design process. There are many ways to produce mockups and
wireframes, we can thank all the software developers and UX designers for these sets
of tools. If you are newer, it might be hard to envision the visuals needed because
you may still be trying to determine the right visual for the data. You can look for
inspiration through samples that you can find available in the software, like Power BI
has a whole set of dashboards you can play around with to get started. In addition
to getting on the same page about the look of the dashboard, we must consider
other requirements. Be sure you're documenting these in every meeting and then
following up with notes to all stakeholders afterwards. A few items to always address
are, what type of filters do we need on the data? That way you're not bringing in
more than what's needed. For example, a 100-year-old company doesn't need
100 years of data in the dashboard. I would call this a hard filter; that's because you
handle this type of filter at the data level. What type of filters are needed for the
consumer? Which is the user of the dashboard. What might they search and
filter? These are soft filters and they're meant to be interactive. Common filters might
be years and dates. If it's dedicated to products, it will likely have a product filter. And
if it's dedicated to customers, it will have some customer filters. Never fail to find out
who this dashboard actually is for, and also determine if they have the permissions
to the data and the correct licensing to use the dashboard. Visualization is as much
an art as it is a science, and these requirements are pretty standard to every type of
visualization project. And you'll discover there are many more, but if you start with
these, you'll be designing better dashboards right from the beginning.
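The hard-versus-soft filter distinction can be sketched in code: a hard filter is applied once when the data is loaded, while soft filters are applied per interaction on top of that result. The field names and cutoff below are invented for illustration:

```python
# Hypothetical raw records, including decades of history nobody asked for.
all_orders = [
    {"year": 1998, "country": "Brazil",  "total": 50.0},
    {"year": 2021, "country": "Brazil",  "total": 75.0},
    {"year": 2022, "country": "Germany", "total": 310.0},
]

def load_data(rows, min_year=2020):
    """Hard filter, handled at the data level: a 100-year-old company
    doesn't need 100 years of data in the dashboard."""
    return [r for r in rows if r["year"] >= min_year]

def soft_filter(rows, country=None):
    """Soft filter: interactive, chosen by the consumer of the dashboard."""
    return [r for r in rows if country is None or r["country"] == country]

dashboard_data = load_data(all_orders)        # applied once, at load time
view = soft_filter(dashboard_data, "Brazil")  # applied per interaction
```

Deciding which filters are hard and which are soft during requirements gathering keeps the data set small and the interactivity meaningful.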
Presenting data challenges effectively to others
There are some moments in meetings where I realize no matter how simple I make
what I say next, all of a sudden people are going to be staring through me, trying
to figure out what in the world I'm talking about. That's tough. It shouldn't be like
that in every meeting. And if it is that way for you every time you talk about
data, then you need to focus on communication skills. Again, a very important soft
skill for a data professional. I find that talking leadership through the process can
really help their understanding of what we're working with on a data project. Here's
an example. The data team has been tasked with studying a scenario that will have a
major impact on the organization, and it's imperative that we get this right. It's a high
stakes project. They've provided us all the access to the data, all the questions they
need answered, and we have our approach and we're ready to go. In the first several
passes of the project, we realize one of the key pieces of information that we need
for the study has not been collected consistently. And there appear to be major
gaps over time in the data we do have, and it really makes us question if we can
trust the data. What do you do facing this scenario? I can tell you very easily what
not to do. Do not wait to communicate the challenges and make sure you're
prepared to discuss them. Here are a few ways you can address this situation. Be sure
to let the right person on the team know what data appears to be missing. People
make mistakes. It could have been a bad file or even a missing file. Also communicate
about what you see in the data you do have. This gives you an opportunity to
confirm that they understand about the gaps you're finding in the data, this way
there's no big surprise. And by the way, they may have a very sound reason for those
gaps. You may just not know about it. This is part of the learning curve of any new
data set. There are other scenarios. The organization is hoping that the data team will
be able to show something very positive with the data, and you found the exact
opposite to be true. This is truly a challenging scenario and not a fun one to face. So
what do you do? When I find myself in this situation where there is a totally different
understanding of the data reality versus the actual reality, I start by confirming that
I'm not missing something. I double check everything. I confirm that I've not
introduced an error in any way. If I find that this is the truth of the data, then I turn
to the person in leadership and discuss my findings to get further insight into what I
may be missing and get guidance from them on the next steps for me to
take. Remember, we don't have access to all the data or even all the
knowledge. Turning to your leadership is the legitimate next step. If you discover no
errors, you have done all that you can, and the truth isn't going to be exactly
what they planned. Having some communication skills on how to deliver information
might be your next step. Remember, data is used to inform a business for
improvement and sometimes delivering the results can be hard. As a data
professional, just make sure you have thoroughly checked all your results, follow the
chain of command of information, and by all means, communicate with your team.
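The "gaps over time" problem described above is worth checking for programmatically before you raise it in a meeting. Here is a minimal sketch that finds missing months in a sequence of (year, month) observations; the data is invented:

```python
def missing_months(observed, start, end):
    """Return the (year, month) pairs in [start, end] with no observation."""
    have = set(observed)
    gaps = []
    year, month = start
    while (year, month) <= end:
        if (year, month) not in have:
            gaps.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps

# Collected data that silently skips March and April 2022.
data = [(2022, 1), (2022, 2), (2022, 5), (2022, 6)]
gaps = missing_months(data, start=(2022, 1), end=(2022, 6))
```

Walking into the conversation with the exact list of gaps, rather than a vague feeling that something is missing, makes the communication far easier.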
Finalizing dashboards
[Instructor] Visualization tools give us so many features, including some that we
really need to pay attention to, like automatically creating titles and built-in tool
tips. These features are so nice, but they don't always really make sense to the users
that are not involved in the back end of the data. Changing titles should be an overall
part of your process. And when you're ready to finalize your dashboard, it should be
one of the final things you check. You can change them at any time, but you certainly
want to make time for it. Let's look at our SalesManagerDashboard here. There's
definitely a few titles we can change to make things more meaningful. For example,
we have TotalAfterDiscount by Last Name and Company Name, and really what this
does is shows each salesperson and the total for each of their customers. Also,
there's a couple of other little things that are not too meaningful, like this company
name in the legend. It's really small, and there's a lot of different customers
here. Okay, so I'll choose that option. I'll go to my format Visual, and I can look at
the Y and the X axis. So first of all, I'll turn the Title off on the Y axis. And you'll
notice that the last name here on the left disappears. I'll go to the X axis and I'll turn
the Title off here and it will disappear from the bottom. And then I really don't think
I need a legend for this. There's other ways I can work with that information. So I'll
turn the Legend off. Okay, now I'll go to General and I'll go to Title. Right now, the
Title is turned on, but it's not really meaningful. So let's do Total By Salesperson For
Each Customer. And I'll go ahead and center align this. Perfect. I'm going to bring it
down just a little bit. And then I have my card up top. Let's go ahead and expand
that. If I make it just a little bit bigger, I can see that it has a TotalAfterDiscount. Okay,
that's called its category label. I've got that selected. I'll go to its format and I'll turn
off that Category label. Okay, I'll work with this callout value. First of all, it's really
big. So I'll go ahead and make it a size 30, make it a little bit smaller. But I want to
change the way it displays, like I want the whole number there. I can go ahead and
choose that. I'll go ahead and leave it for Auto 'cause these numbers get large when
I remove the filters. And I'll go to General and then I can turn on its Title and have to
supply that title. And we'll do Total Sales here. And again, we can make that just a
tad bit bigger, and then let's center it. Okay, perfect. Now there's no question that
that's the Total Sales and then underneath that is the Total By Salesperson For Each
Customer. Okay, also notice that we have different slicers across the top. This is the
perfect opportunity to provide some instructions. So I'll go ahead and click on Year
and Month, I'll go to the format, I'll go to my Slicer settings. I want to leave it as
Multi-select because I want people to be able to select multiple criteria, but I also
like the Select all option, so I'll turn that on. Can take a look at the header. And then
notice the title text. Here, I can change this to read Select Year and Month. Just the
word select tells people, hey, this is something I can select. I'll go to a Last
Name. Because I was on that area, it will automatically update. Okay, and I can do
Salesperson. Company name is fine, but I'll go ahead and put Select Company
Name. Now, I want to be consistent. So I'll go back to my Salesperson and tell it to
be Select Salesperson. Looks much better. And then I'll go to my ShipCountry, and
I'll change to Select Shipping Country. Now because we have two countries, the
country that the customer's in and the shipping country, it is probably important to
specify. Now, these changes are minimal, but they've already made a big
difference. Okay, let's go over to our table here. Let's go to General. Its Title is turned
off, so let's turn it on, and let's call those Sales Order Records. Fantastic.
Adding dashboard filters
[Narrator] One thing I've noticed is that we have these filters. Let me go ahead and
clear them. And I'll clear this country filter. The dashboard, when it opens, it's actually
going to look like this. And if people start to make changes we might want to give
them the ability to go back to this original view. We can do this by adding a
bookmark, I'll choose add bookmark. And I'm going to rename that as clear. And let
me show you how this works. So I'll go ahead and choose Sergio and it updates to
show me Sergio's sales, just perfect. And then if I choose the bookmark, it clears it
back. If I select 2021 and Control-select these salespeople, and then I
choose clear, I go back to the original state. This is really, really great. This could be
very handy for your end users, gives them the ability to clear all their filters and go
back to the original state, but they may not know how to navigate to
bookmarks. Let's go add a button onto our dashboard. I'll go to buttons. I'll go ahead
and add a blank button. I'll go ahead and move it over here to the right. Okay, I
want to change it to a pill shape. So it'll look more like a button. I'll go to my style
settings here and I'll turn on the text. Need the text to say clear filters and don't need
the icon. Let's go back to that text and make it centered. Let's go ahead and make it
black. We'll go to our style here and let's turn the fill of the button on and let's make
that sort of a darker gray color. Perfect. So I have my clear filters button created. Now
I need to apply my action. I'll tell it because I chose a bookmark to go to the clear
bookmark. Okay, let's go ahead and close our bookmark pane and format pane. What
I'll do now is I'll go ahead and select a few of my sales people. I'm holding my control
key. I'll go ahead and say, let's see for 2022. And then what I need to do is clear my
filters. I can just Control-click, and I go back to the original state and notice all my
filters are cleared.
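Conceptually, the clear bookmark just captures the dashboard's default filter state and restores it on demand. A minimal sketch of that idea, with invented state fields:

```python
import copy

# The dashboard's default, unfiltered state; the field names are invented.
DEFAULT_STATE = {"year": None, "salespeople": set(), "country": None}

class Dashboard:
    def __init__(self):
        self.filters = copy.deepcopy(DEFAULT_STATE)

    def select(self, **criteria):
        """A user clicking slicers and visuals updates the filter state."""
        self.filters.update(criteria)

    def clear(self):
        """The 'clear' bookmark: restore the original state on demand."""
        self.filters = copy.deepcopy(DEFAULT_STATE)

dash = Dashboard()
dash.select(year=2022, salespeople={"Sergio", "Blankenship"})
dash.clear()  # back to the view the dashboard opened with
```

The deep copies matter: the saved default must be independent of whatever the user mutates, which is exactly why the bookmark reliably returns you to the original view.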
Modifying dashboard tooltips
[Instructor] Let's hover over some of the information in our stacked bar. One thing I
want you to notice is that we have some pre-built tool tips, which is great. Gives us
a lot of information but it may not be all the information we'd like to have. So let's
go ahead and choose that stacked bar. Let's go back to our visualizations and take a
look at tool tips. By default, it'll bring in the tool tips based on what information has
been supplied to the visual. This is why we see last name, company name, and total
after discount. Okay, let's just go ahead and name that total after discount to total
amount. And let's change this last name here to salesperson. Whether you rename
last name is a matter of preference; I think we'll just do salesperson. Let me bring quantity to
the tool tips. Now I do want to see a total quantity. Okay, so I'll go ahead and make
sure that's set to sum, which is perfect. And then I want to count how many orders
they actually placed. And I want to do a distinct count of the order ID. And I'll name
this total order count. Now when I hover over, I can see the salesperson, the
company name, the total amount, the quantity of what was ordered and the total
order count. Okay, really don't need that quantity because again, that's related to
each individual line item so I'll go ahead and take that out. I'll go ahead and put this
total after discount in again. And I want to change that to an average. And then I'll
do average of order amounts. Perfect. This gives me a lot of information just by
simply changing a few things in the tool tips and naming things appropriately. One
last thing as you finalize your dashboard is sometimes people want to have a little
bit more background, different formats. I want to change it up to look a little bit
more than just solid white. When you're in Power BI, you can actually go in and
change a lot. For example, go to view. Let's go change this dashboard to a dark
background. There are several different options here for you to choose from. You can
just point and click until you find the one you like. You can also create your own
custom themes. Okay, let's do that black background; I like that dark
background. One final step is the mobile layout. I'm in the page view, and I'll go to mobile
layout. This is how this Power BI dashboard will look on this page when people visit
it. I'm going to go ahead and bring my card to the top. And again, it's a responsive
design. So even if these look big, they'll work themselves out. I'll go ahead and move
my stacked bar here. And then I'll bring in my sales orders. That way, if someone
consumes this dashboard, this is how it'll look in that mobile environment. Okay, very
last thing we want to do is bring in the filters. Want those to be at the top. And again,
I'll just keep sizing. Go ahead and put this one here. I'll do two slicers per section. So
now I have the mobile layout covered as well as the page view. Okay, let me go
ahead and go out of the mobile layout. Check and make sure everything is
labeled. Also check and make sure things are functional, so I'll go ahead and click on
Sergio, Jeffers. Perfect. And then I'll choose to clear my filters and hold my Control
key and clear my filters. Now I'm ready to save and publish my dashboard. There are
certainly more items that can be adjusted and tweaked with these dashboards, but
at a minimum, this is a great start.
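The three tooltip aggregates built above (a sum, a distinct count of order IDs, and an average) map directly onto a few lines of Python. This sketch uses invented order-detail lines; the renaming of fields mirrors the renaming done in the tooltip pane:

```python
from statistics import mean

# Hypothetical order-detail lines; order_id repeats across line items,
# which is why the distinct count matters.
lines = [
    {"order_id": 10, "salesperson": "Blankenship", "total_after_discount": 100.0},
    {"order_id": 10, "salesperson": "Blankenship", "total_after_discount": 50.0},
    {"order_id": 11, "salesperson": "Sergio",      "total_after_discount": 200.0},
]

def tooltip_stats(rows):
    """The three tooltip aggregates: sum, distinct order count, average."""
    totals = [r["total_after_discount"] for r in rows]
    return {
        "total_amount": sum(totals),                              # sum
        "total_order_count": len({r["order_id"] for r in rows}),  # distinct count
        "average_order_amount": mean(totals),                     # average
    }

stats = tooltip_stats(lines)
```

Note that a plain count of rows would report three orders here; the distinct count correctly reports two, which is the same reason the dashboard uses a distinct count of order ID.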
Data workers
If you use spreadsheets every day and you create valuable insights for people
through various presentations or reporting, you are a data worker. But you're not
likely called that by your job title. You may have a job title that represents a
department or the people you support, but you're not titled data worker. You just
are one. I would also consider you a data worker if you find yourself exporting data
out of systems, building some form of report or presentation weekly or monthly. You
may also receive data from someone in another department, like IT, who has access
to more data than you. You may frequently visit the company's data warehouse, or
data system, to gain information for your reporting purposes. Data workers also
work with functions and do some aggregate functions with the data. You may use
some logical functions like an if. If you're able to search for functions and find the ones
that are relevant to your data work, you are likely a data worker. I believe that there
are far more data workers than our organizations realize, and if you're in this role,
guess what? You're a great resource. And one of the first places an organization can
turn to, to upskill in data. If you're looking for areas of growth, then make sure
you're using tools in Excel like Power Query and other analysis techniques like
PivotTables and basic visualizations. If you have more than average skill with
these, you might be more than a data worker already. You can also build skills like
PowerPoint because this is another way we visualize data for meetings and
presentations. Documentation is a critical competency for any data role, so being a
wizard at Microsoft Word doesn't hurt. Remember, like every other tool, it's powerful
and often because we use it every day, we don't believe we need to explore
training. Trust me, you should. For the soft skills, you'll want to focus on effective
presentations and communication skills. Having these skills makes you more than
suitable for roles that require advanced skills in Excel and doing basic analysis.
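The logical and aggregate functions mentioned above have direct analogues outside the spreadsheet. A minimal sketch, with invented sales figures and an invented threshold:

```python
# Spreadsheet-style logic outside the spreadsheet; the threshold and the
# sales figures are invented for illustration.
sales = [1200, 450, 980, 310]

# The equivalent of IF(amount > 500, "large", "small") for each row.
labels = ["large" if amount > 500 else "small" for amount in sales]

# The aggregate functions a data worker reaches for in a weekly report.
report = {
    "sum": sum(sales),
    "average": sum(sales) / len(sales),
    "max": max(sales),
}
```

The skill being described is the same in either tool: knowing which function answers the question, then applying it to the data at hand.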
Data analysts
I have spent years trying to define data analysts to people. I've come up with several
ways to try to define this role and the skills. It's important to know that most do not
have a job title that contains the words data analyst, but if you have a data
department, then they are likely to be called data analysts. Not all organizations have
a dedicated data department. So you might be called an operations analyst or a
marketing analyst. Your title likely has analyst in it. There are also varying levels
of data analyst, and you can be a data analyst, and not know it. Or be performing the
skills of an analyst, and have no idea that you are. A data analyst will have a deeper
understanding of data systems and have more knowledge about database designs
than a data worker. A data analyst will find they have a little more access to see
tables, and views of the databases. They probably have some basic SQL querying
skills and may write SQL statements to gain access to data all the time. This varies by
organization, and access levels. A data analyst will have a better than average
understanding of the data governance plan because if you're a data analyst, you are
going to be working under the policies, and procedures that are established. Data
analysts that are a few years in are likely to understand more about what questions
to ask, and research in general. Data analysts understand how to clean data, and
transform it to meet the requirements of the project. Data analysts also know how
to create functions of varying types like conditional statements, logical
statements. Data analysts work with statistics, and most certainly at the beginning of
their career, basic stats and aggregate functions, and certainly have learned how to
connect data in a way that they can just refresh their data, and update their visuals
and reports. If you're looking for areas of growth, then you can go a little bit
deeper into statistics. It's a must. Note that I said a little deeper, not a full
statistician, which is another role entirely. You'll find the data sets you are developing
might be used for different statistical tests. So it is important to have a basic
knowledge. You can never have enough experience writing functions, and you
definitely want to be able to write if functions, aggregate functions and simple
lookups. You must understand joins, and how they impact data sets. And for the soft
skills, active listening, data storytelling, and critical thinking. If you're realizing that
you're a data analyst, then you might relate to being called a wizard at work.
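The point above about joins and how they impact data sets can be made concrete with a few lines of SQL. This sketch uses Python's built-in sqlite3 module with two invented tables; notice how an inner join drops the customer with no orders while a left join keeps it:

```python
import sqlite3

# An in-memory database with two invented tables for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Alfreds"), (2, "Berglund"), (3, "Chop-suey")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 120.0), (11, 1, 75.0), (12, 2, 300.0)])

# Inner join: only customers with at least one order survive.
inner = cur.execute(
    "SELECT c.name, o.total FROM customers c "
    "JOIN orders o ON o.customer_id = c.id").fetchall()

# Left join: every customer survives; missing orders come back as NULL.
left = cur.execute(
    "SELECT c.name, o.total FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id").fetchall()
```

The row counts differ (three versus four here), which is exactly the kind of impact on a data set that an analyst needs to anticipate before handing results to anyone.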
Data engineers
It is one thing to refine and add to a data set. It's an entirely different skill to be able
to build data sets. I personally believe what most people consider, as a data analyst
in their organization, may be performing data engineering tasks more than analysis
tasks. The crossover between analyst and engineering skills is
real. They share a lot of common foundational skills. A data engineer is someone
who fully understands how to look at the data sets, knows how to refine them into
smaller, more sensible sets for people to use. You may receive data from someone
who is engineering that data from a set of queries, and then providing it to you or
others. A data engineer also is likely to have more access to data, which is why they're
sending it to you in the first place. They also understand security and privacy of data
through the overall data governance strategy. Data engineers can transition to data
architect, which covers more systems, more server and more security strategies for
systems across all of the organization. If you want to grow further in this role, you
will certainly need to understand more about structured and unstructured data and
how to convert it to usable data sets. You'll want to understand the design
methodologies of relational database systems and you will need to understand how
to design databases. You'll also want the shared skills of communication, effective
presentations, critical thinking, and active listening. These skills will be used to learn
how to take hundreds of tables to define them into usable tables for other processes
using ETL or ELT, which is extract, transform, and load or extract, load, and
transform. This is how data goes from a production system to a data warehouse, as
an example. I believe there is a lot of opportunity for data analysts to pursue this role
as they grow deeper in their understanding of data and infrastructure.
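The ETL flow described above can be sketched in a few lines. This is a toy pipeline with invented field names, not a production pattern: extract raw CSV text, transform it (typecast and drop bad rows), and load it into an in-memory SQLite "warehouse":

```python
import csv
import io
import sqlite3

# Invented raw extract; one row has a value that won't typecast.
RAW = """order_id,total
10,120.50
11,not-a-number
12,300.00
"""

def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: typecast fields and drop rows that fail."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["order_id"]), float(row["total"])))
        except ValueError:
            continue  # a real pipeline would log the rejected row
    return clean

def load(rows):
    """Load: write the clean rows into the warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE warehouse_orders (order_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO warehouse_orders VALUES (?, ?)", rows)
    return conn

warehouse = load(transform(extract(RAW)))
```

ELT simply reorders the last two steps: load the raw rows first, then transform them inside the warehouse with SQL.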
Data scientists
People often pursue data with the hopes of becoming a data scientist. And I believe
it's important to know that not all data professionals grow into data scientists, nor
do we need all analysts or engineers to turn into data scientists. Data scientists will
likely have all the skills of the analyst engineer and they will have likely worked in
those roles. However, a data scientist will have a heavier requirement for skills in
coding, mathematics, and statistics. A data scientist will be instrumental in
developing tools and instruments that provide valuable insight to the organization,
but they can't do it alone without all the other roles. Or, well, maybe they can perform
the tasks, but when you don't have all the other roles, the data scientist must perform
them. Data scientists, or data science teams composed of all the disciplines, will
interpret large sets. They'll likely build machine learning models. They'll present
outcomes and make suggestions as a portion of what they do. They'll likely be
leaders in the data science team. They'll provide support and strategy to the overall
data governance plan. If you want to further your skills in this area, you should
consider gaining a better understanding of programmatic thinking. You'll want to
dive deeper into learning code and maybe start with something like Python. If you
have some stats experience, or not, you will definitely want to grow in this
area. Remember, one of the key differences between data scientists and all other
roles is heavier math, coding, and stats. It's also important to remember that for most
organizations, having a data scientist and not having all the other roles means that
that data scientist is having to perform all those roles before they get to the data
science. This is where having a team of multi-discipline people serving all the roles
might just be your next play.
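Starting with Python, as suggested above, can be as simple as fitting a line to data by hand, which is the smallest taste of the statistics-plus-code combination that defines the role. This sketch implements ordinary least squares for one variable; the data points are invented:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Perfectly linear toy data generated from y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Real machine learning work reaches for libraries rather than hand-rolled math, but understanding what the library computes is part of the heavier math and stats requirement mentioned above.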