

Data Cleaning: What It Is and What It Doesn't Do
Most people who work with data agree that your insights and analyses are only as good as the data you're using. In other words, if you put garbage data in, you'll get garbage analysis out. If you want to build a culture around quality data decision-making, data cleaning, also known as data cleansing or data scrubbing, is one of the most crucial practices for your organization to adopt.
What does data cleaning entail?
The practice of correcting or deleting incorrect, corrupted, improperly formatted,
duplicate, or incomplete data from a dataset is known as data cleaning. There are
numerous ways for data to be duplicated or mislabelled when merging multiple data
sources. Even if results appear correct, outcomes and algorithms are untrustworthy when the underlying data is erroneous. Because the methods will differ from dataset to dataset, there is no one-size-fits-all way to prescribe the exact steps of the data cleaning process. However, it's critical to create a template for your data cleaning procedure
so you can be sure you're doing it correctly every time.
What's the difference between data cleanup and data transformation?
Data cleansing (or data cleanup) is the process of removing data from your dataset
that does not belong there. Data transformation, often known as data wrangling or data munging, is the process of changing data from one format or structure to another, mapping it from one "raw" form into another for warehousing and analysis. This article focuses on data cleansing procedures.
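To make the distinction concrete, here is a minimal sketch in Python with pandas. The file and column names ("orders_raw.csv", "amount", "order_date") are hypothetical examples, not drawn from any particular dataset.

    # Hypothetical illustration: cleaning removes data that doesn't belong,
    # while transformation reshapes valid data into another structure.
    import pandas as pd

    df = pd.read_csv("orders_raw.csv")  # hypothetical source file

    # Data cleaning: drop rows that don't belong in the dataset.
    df = df.drop_duplicates()
    df = df[df["amount"] > 0]  # remove impossible negative or zero amounts

    # Data transformation: change format/structure without discarding valid data.
    df["order_date"] = pd.to_datetime(df["order_date"])
    monthly = df.groupby(df["order_date"].dt.to_period("M"))["amount"].sum()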
What methods do you use to clean up data?
While data cleaning processes differ depending on the sorts of data your firm stores,
you may utilize these fundamental steps to create a foundation for your company.
Step 1: Remove any observations that are duplicated or irrelevant
Remove any undesirable observations, such as duplicates or irrelevant observations,
from your dataset. Duplicate observations are most likely to occur during the data
collection process. Duplicate data can be created when you integrate data sets from
numerous sources, scrape data, or get data from clients or multiple departments.
De-duplication is one of the most important parts of this step. Irrelevant observations are those that do not relate to the specific problem you are trying to solve, and removing them keeps your analysis focused.
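A minimal de-duplication sketch in pandas might look like the following. The file and column names ("sales_leads.csv", "customer_id", "region") are hypothetical examples.

    import pandas as pd

    df = pd.read_csv("sales_leads.csv")  # hypothetical source file

    # Drop exact duplicate rows, keeping the first occurrence.
    df = df.drop_duplicates()

    # Drop duplicates by a key column when merged sources overlap.
    df = df.drop_duplicates(subset=["customer_id"], keep="first")

    # Remove observations irrelevant to the question at hand,
    # e.g. keep only the region under study.
    df = df[df["region"] == "EMEA"]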
Step 2: Correct structural flaws
When you measure or transfer data and find unusual naming conventions, typos, or
wrong capitalization, you have structural issues. Mislabeled categories or classes
can result from these inconsistencies. "N/A" and "Not Applicable," for example, may
both exist, but they should be examined as one category.
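Normalizing such structural inconsistencies in pandas could look roughly like this; the "status" column and its values are hypothetical examples.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"status": ["Active", "active ", "N/A", "Not Applicable", "ACTIVE"]})

    # Normalize whitespace and capitalization so "Active", "active ", and "ACTIVE" match.
    df["status"] = df["status"].str.strip().str.lower()

    # Collapse equivalent labels into a single category (here, a missing-value marker).
    df["status"] = df["status"].replace({"n/a": np.nan, "not applicable": np.nan})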
Step 3: Remove any undesirable outliers
There will frequently be one-off observations that do not appear to fit into the data
you are studying at first glance. If you have a good reason to delete an outlier, such as incorrect data entry, doing so will make the data you're working with perform better. The appearance of an outlier, on the other hand, can sometimes support a theory you're working on. It's important to remember that just because an outlier exists doesn't mean it's wrong; this step is needed to determine whether the value is legitimate. Consider deleting an outlier only if it is irrelevant to the analysis or clearly a mistake.
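A common rule of thumb for flagging outliers is the interquartile-range (IQR) rule, sketched below in pandas. The "amount" column is a hypothetical example, and whether to drop the flagged rows remains a judgment call.

    import pandas as pd

    df = pd.read_csv("orders.csv")  # hypothetical source file

    q1, q3 = df["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Inspect flagged outliers before deciding; drop only clear errors.
    outliers = df[(df["amount"] < lower) | (df["amount"] > upper)]
    df_clean = df[df["amount"].between(lower, upper)]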
Step 4: Deal with any data that is missing
Many algorithms will not accept missing values, so you can't ignore them. There are several options for dealing with missing data; none of them is ideal, but each is worth considering.
As a first option, you can drop observations with missing values, but this will cause you to lose information, so be aware of that before you do so.
As a second option, you can fill in missing values based on other observations; however, you risk compromising data integrity, because you're working with assumptions rather than actual observations.
As a third option, you may change the way the data is used so that null values are navigated more effectively.
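The three options above could be sketched in pandas roughly as follows; the source file and the "age" column are hypothetical examples.

    import pandas as pd

    df = pd.read_csv("customers.csv")  # hypothetical source file

    # Option 1: drop observations with missing values (loses information).
    dropped = df.dropna(subset=["age"])

    # Option 2: impute from other observations (introduces assumptions).
    imputed = df.assign(age=df["age"].fillna(df["age"].median()))

    # Option 3: keep the nulls and make them explicit for downstream logic,
    # e.g. by flagging missingness as its own feature.
    flagged = df.assign(age_missing=df["age"].isna())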
Step 5: Validate and perform quality assurance
As part of basic validation, you should be able to answer these questions at the end
of the data cleansing process:
1. Is the information logical?
2. Is the data formatted according to the field's rules?
3. Does it support or refute your working hypothesis, or provide any new
information?
4. Can you spot patterns in the data to aid in the development of your next
hypothesis?
5. If not, is that because of a data quality problem?
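A few of these checks can be expressed as lightweight assertions at the end of the pipeline. The sketch below assumes hypothetical columns ("customer_id", "email", "age") and rules; adapt them to your own field definitions.

    import pandas as pd

    df = pd.read_csv("customers_clean.csv")  # hypothetical cleaned file

    # Does the data follow the rules for its field? Required keys, formats, ranges.
    assert df["customer_id"].is_unique, "duplicate customer IDs remain"
    assert df["email"].str.contains("@", na=False).all(), "malformed email addresses"
    assert df["age"].between(0, 120).all(), "ages outside a plausible range"

    # Does the information make sense? Spot-check summary statistics.
    print(df.describe(include="all"))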
False conclusions based on inaccurate or "dirty" data can drive poor business planning and decision-making. They can also lead to a humiliating moment in a reporting meeting when you learn your data isn't up to par. It's critical to establish a data-quality culture in your organization before you get there. To do so, make a list of the tools you might employ to foster this culture, as well as a definition of what data quality means to you.