Data Analysis with Jumbune

Jumbune Data Analyzer
Agenda
• Enterprise Data Lake
• Data Analysis Challenges
• Data Analyzer
One Unified System: An Enterprise Data Lake
• Data is ETLed from all possible sources into the Enterprise Data Lake through:
  • Real-time ingestion
  • Micro-batch ingestion
  • Batch ingestion
• A unified hub makes analysis, management and access of data easier.
• An enterprise data lake enables ecosystem tools to collaboratively manage data.
• A place to store all data in its original fidelity, with the flexibility to run a variety of enterprise workloads.
Key elements of an Enterprise Data Lake
• Data Quality – data values as per business KPIs
• Data Profiling – statistical assessment of data
• Data Governance – management of data
• Data Lineage – defines the data lifecycle
• Data Security – protecting data from unauthorized users
Major challenges in Data Analysis
• Incremental imports may ingest bad data
• Analyzing anomalies in HDFS data
• Tracking data quality over time
• Tracing bad data out of billions of rows
• Displaying concise, meaningful results
Jumbune’s Data Analyzer
Gain better control over Data Analysis
• Gives a centralized dashboard for profiling data quality, to gain better control.
• Leverages Jumbune’s infrastructure for remote profiling capabilities.
• No data movement is required for performing data profiling.
• No specialized MapReduce or coding skills are required to validate data.
(Slide diagram: Profile, Quality, Analyse, Control – covering timelines, violations, anomalies and business rules.)
Offering Data Quality and Data Profiling to the Enterprise Data Lake
Data Quality Timeline
• Traces how data quality is conserved over time, even in massive data offloading environments.
• Real-time data quality monitoring, tracked against customizable KPIs (see the sketch below).

Data Profiling
• Statistical assessment of data values within a data set for consistency, uniqueness and logic.
• Gauges data profiles against business rules.
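
The sketch below is a minimal, hypothetical illustration – not Jumbune’s own API – of tracking one such KPI, the share of populated values per ingestion batch, against a customizable threshold on a timeline. The class, field values and threshold are assumptions for illustration only.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compute one quality KPI (non-null ratio) per ingestion
// batch and compare it against a user-defined threshold on a timeline.
public class QualityTimeline {

    // Fraction of non-null, non-empty values in one batch of field values.
    static double nonNullRatio(List<String> values) {
        long populated = values.stream().filter(v -> v != null && !v.isEmpty()).count();
        return values.isEmpty() ? 1.0 : (double) populated / values.size();
    }

    public static void main(String[] args) {
        double kpiThreshold = 0.95; // customizable business KPI: at least 95% populated

        Map<LocalDate, List<String>> batches = new LinkedHashMap<>();
        batches.put(LocalDate.of(2016, 1, 1), Arrays.asList("a", "b", "c", "d"));
        batches.put(LocalDate.of(2016, 1, 2), Arrays.asList("a", null, "", "d"));

        batches.forEach((day, values) -> {
            double ratio = nonNullRatio(values);
            System.out.printf("%s  non-null ratio = %.2f  %s%n",
                    day, ratio, ratio >= kpiThreshold ? "OK" : "VIOLATION");
        });
    }
}
```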
Data Analysis Component
(Diagram: the data analysis process – Jumbune performs records analysis on data in HDFS/NFS and produces data profiling and quality reports.)
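
As a rough sketch of this flow – assuming the standard Hadoop FileSystem client is on the classpath, and using a hypothetical cluster address and input path – records can be streamed straight from HDFS for analysis, so only the resulting report leaves the cluster:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: read records in place from HDFS; the data itself is not moved,
// only the analysis results are reported.
public class HdfsRecordReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/lake/records.csv")), StandardCharsets.UTF_8))) {
            long records = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    records++; // hand each record to the profiling / quality checks
                }
            }
            System.out.println("Records analyzed: " + records);
        }
    }
}
```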
Data Quality: Provides a Generic Way of Detecting Anomalies
• Validates inconsistencies in data in the form of:
  • Null checks
  • Data type checks
  • Regular expressions
• In-depth record-level data violation reports, which can be drilled down to line and field level (sketched after this list).
• Lets users generically specify data quality requirements according to their data lake.
• Makes seemingly impossible quality checks on a big data lake possible.
• Doesn’t require data to be moved out of Hadoop to detect anomalies.
• Currently, Jumbune supports HDFS and NFS as the data lake.
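
A minimal, hypothetical sketch of the three check types listed above – not Jumbune’s configuration format – in which each record is validated field by field and every violation is reported at line and field level:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of null, data type and regex checks applied per record,
// reporting violations down to line and field level.
public class DataQualityChecks {

    static final Pattern EMAIL = Pattern.compile("^[\\w.+-]+@[\\w-]+\\.[\\w.]+$");

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "1,alice,alice@example.com",
                "2,,bob@example",            // null check + regex violation
                "x,carol,carol@example.com"  // data type violation
        );

        int lineNo = 0;
        for (String record : records) {
            lineNo++;
            String[] fields = record.split(",", -1);

            // Null check on field 2 (name)
            if (fields[1].isEmpty()) {
                System.out.printf("line %d, field 2: null check failed%n", lineNo);
            }
            // Data type check on field 1 (numeric id)
            if (!fields[0].matches("\\d+")) {
                System.out.printf("line %d, field 1: data type check failed%n", lineNo);
            }
            // Regex check on field 3 (email)
            if (!EMAIL.matcher(fields[2]).matches()) {
                System.out.printf("line %d, field 3: regex check failed%n", lineNo);
            }
        }
    }
}
```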
Data Profiling: Provides lake insights
• Statistical analysis of data values present in the enterprise data lake (see the sketch below).
• Computes various profiles that help you become familiar with the data.
• Evaluates the structure of the data set in the enterprise data lake against a set of business rules.
• Helps determine whether existing data can be used for further analytics.
(Slide keywords: Integrate, Centralized, Remote, Generic.)
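
As a rough illustration – using hypothetical fields and values, not Jumbune’s own report format – the kind of per-field statistics a data profile gathers might look like this:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of per-field profile statistics: row counts, null counts,
// distinct values and value-length range.
public class FieldProfiler {
    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"1", "alice", "IN"},
                new String[]{"2", "bob", "US"},
                new String[]{"3", "", "IN"});

        int fieldCount = rows.get(0).length;
        for (int f = 0; f < fieldCount; f++) {
            long nulls = 0;
            int minLen = Integer.MAX_VALUE, maxLen = 0;
            Set<String> distinct = new HashSet<>();
            for (String[] row : rows) {
                String v = row[f];
                if (v == null || v.isEmpty()) { nulls++; continue; }
                distinct.add(v);
                minLen = Math.min(minLen, v.length());
                maxLen = Math.max(maxLen, v.length());
            }
            System.out.printf("field %d: rows=%d nulls=%d distinct=%d len=[%d..%d]%n",
                    f + 1, rows.size(), nulls, distinct.size(),
                    nulls == rows.size() ? 0 : minLen, maxLen);
        }
    }
}
```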
Let’s provision a clean Enterprise Data Lake
Website
• http://jumbune.org
Contribute
• http://github.com/impetus-opensource/jumbune
• http://jumbune.org/jira/JUM
Social
• Follow @jumbune Use #jumbune
• Jumbune Group: http://linkd.in/1mUmcYm
Forums
• Users: users-subscribe@collaborate.jumbune.org
• Dev: dev-subscribe@collaborate.jumbune.org
• Issues: issues-subscribe@collaborate.jumbune.org
Downloads
• http://jumbune.org
• https://bintray.com/jumbune/downloads/jumbune