
Data Analytics & Data Science Glossary: Essential Terms Defined

Data Analytics / Data Science - Glossary of Terms
1. Data Analysis: The process of inspecting, cleansing, transforming, and modeling data to
discover useful information, inform conclusions, and support decision-making.
2. Data Ecosystem: A complex system of technologies and processes used by organizations
to collect, store, process, and analyze data.
3. Enterprise Applications: Software solutions that provide business logic and tools to model
entire business processes for organizations to improve productivity and efficiency.
4. Types of Data: Includes structured, semi-structured, and unstructured data.
5. Data Source Types: Various origins of data such as RDBMS, NoSQL databases, flat files,
APIs, and data streams.
6. Data Repositories: Places where data is stored and managed like databases, data
warehouses, data marts, and data lakes.
7. Data Visualization: The graphic representation of data to enable stakeholders to
understand the significance of data by placing it in a visual context.
8. RDBMS (Relational Database Management System): A database management system
based on the relational model as introduced by E.F. Codd.
9. NoSQL: A variety of database technologies that are designed to accommodate a wide
variety of data models, including key-value, document, columnar, and graph formats.
10. ETL (Extract, Transform, Load): A type of data integration that involves extracting data from
outside sources, transforming it to fit operational needs, and loading it into the end target.
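A minimal ETL flow can be sketched in plain Python; the source rows, field names, and in-memory target below are hypothetical stand-ins for a real source system and warehouse table.

```python
# ETL sketch: extract rows from a (pretend) source, transform them,
# and load them into a target list standing in for a warehouse table.

raw_rows = [  # "extract": imagine these came from a CSV file or an API
    {"name": " Alice ", "amount": "100.50"},
    {"name": "BOB", "amount": "75"},
]

def transform(row):
    # "transform": trim whitespace, standardize case, parse amounts to float
    return {"name": row["name"].strip().title(),
            "amount": float(row["amount"])}

target = []                       # "load": the end target
for row in raw_rows:
    target.append(transform(row))

print(target)
```

Real pipelines add error handling, incremental loads, and scheduling on top of this extract/transform/load skeleton.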
11. Data Pipeline: The complete set of operations involved in the collection, processing, and
storage of data.
12. Big Data: Data sets that are so voluminous and complex that traditional data processing
software is inadequate to deal with them.
13. Data Analytics vs. Data Analysis: Data analytics is the broader discipline, covering the full
lifecycle of working with data, including the tools, infrastructure, and processes involved,
whereas data analysis refers specifically to examining a data set to understand its
components and draw conclusions from it.
14. Machine Learning: A type of artificial intelligence that enables software applications to
become more accurate in predicting outcomes without being explicitly programmed.
15. Business Intelligence (BI): Technologies, applications, and practices for the collection,
integration, analysis, and presentation of business information.
16. Cloud Computing: The on-demand availability of computer system resources, especially
data storage and computing power, without direct active management by the user.
17. API (Application Programming Interface): A set of rules and specifications that software
programs can follow to communicate with each other.
18. SQL (Structured Query Language): A programming language designed for managing data
held in a relational database management system.
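As an illustration, SQL can be exercised from Python's built-in sqlite3 module; the table, columns, and data here are hypothetical.

```python
import sqlite3

# In-memory SQLite database to illustrate basic SQL statements.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Alice", "Sales", 50000), ("Bob", "Sales", 55000),
                  ("Cara", "IT", 60000)])

# A query that filters and aggregates: average salary per department.
rows = conn.execute(
    "SELECT dept, AVG(salary) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)  # [('IT', 60000.0), ('Sales', 52500.0)]
```

The same SELECT/GROUP BY syntax carries over, with minor dialect differences, to other relational systems such as PostgreSQL and MySQL.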
19. Python: A high-level programming language used for general-purpose programming and
particularly popular in data science and analytics for its readability and breadth of
functionality.
20. Hadoop: An open-source framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models.
21. Data Cleaning: The process of detecting and correcting (or removing) corrupt or inaccurate
records from a dataset.
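Two common cleaning rules, removing records with missing values and standardizing inconsistent formatting, might look like this in Python; the records and rules are purely illustrative.

```python
# Data-cleaning sketch: drop records with missing values and
# normalize inconsistent text formatting.
records = [
    {"city": "london", "temp": 12.5},
    {"city": "Paris", "temp": None},    # missing value -> removed
    {"city": " PARIS", "temp": 14.0},   # inconsistent formatting -> corrected
]

cleaned = [
    {"city": r["city"].strip().title(), "temp": r["temp"]}
    for r in records
    if r["temp"] is not None            # detect-and-remove rule for nulls
]
print(cleaned)
```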
22. Data Transformation: The process of converting data from one format or structure into
another format or structure that is more appropriate for a variety of analytics uses.
23. Data Communication: The process of presenting, interpreting, and discussing data
findings, often using data visualization techniques.
24. Modern Data Ecosystem: Encompasses all the technologies and processes that handle
the collection, storage, management, and analysis of data.
25. Enterprise Data Analysis: The use of analytical methods and tools within an enterprise to
make business decisions based on data.
26. Descriptive Analysis: A type of data analysis that describes, shows, or summarizes the
features of a data set. It characterizes what has happened but does not support conclusions
beyond the data analyzed or predictions about future outcomes.
27. Diagnostic Analysis: Refers to the examination of data to understand cause and effect
relationships. This involves drilling down into data to understand the different elements of
your data set.
28. Predictive Analysis: The use of data, statistical algorithms, and machine learning
techniques to identify the likelihood of future outcomes based on historical data.
29. Prescriptive Analysis: This type of analysis seeks to determine the best solutions or
outcomes among various choices, given the known parameters.
30. Data Science: A multi-disciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and
unstructured data.
31. Data Mart: A subset of a data warehouse, designed to serve a particular purpose or a
specific user community, typically organized by department or business function.
32. Data Lake: A system or repository of data stored in its natural/raw format, usually object
blobs or files, which is scalable and allows the storage of data from multiple sources.
33. Structured Data: Data that adheres to a pre-defined model and is easy to organize and
search. Examples include RDBMS (SQL databases).
34. Semi-Structured Data: Data that does not reside in a relational database but that does
have some organizational properties that make it easier to analyze. Examples include XML
and JSON formats.
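A quick illustration with Python's standard json module shows the self-describing structure of a JSON document (the document itself is invented):

```python
import json

# JSON is semi-structured: there is no fixed schema, but tagged
# fields and nesting give the data organizational properties.
doc = '{"user": "dana", "tags": ["analytics", "sql"], "active": true}'
parsed = json.loads(doc)

print(parsed["user"])        # dana
print(len(parsed["tags"]))   # 2
```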
35. Unstructured Data: Information that either does not have a pre-defined data model or is
not organized in a pre-defined manner. Examples include emails, videos, and social media
postings.
36. Key-value Store: A type of non-relational database that uses a simple key/value method to
store data.
37. Columnar Storage: Data storage in a column-oriented format, allowing for faster retrieval
of data and improved disk I/O.
38. Graph Database: A database that uses graph structures for semantic queries with nodes,
edges, and properties to represent and store data.
39. OLTP (Online Transaction Processing): A class of systems that facilitate and manage
transaction-oriented applications, typically for data entry and retrieval transaction
processing.
40. OLAP (Online Analytical Processing): A category of software tools that provides analysis of
data stored in a database and is used in data mining.
41. ACID Compliance: A set of properties of database transactions intended to guarantee data
validity despite errors, power failures, and other mishaps.
42. Data Governance: The management of the availability, usability, integrity, and security of
the data employed in an enterprise, with the objective to ensure that data is consistent and
trustworthy and doesn't get misused.
43. Data Integration: The process of combining data from different sources into a single,
unified view. Integration begins with the ingestion process and includes steps such as
cleansing, ETL mapping, and transformation.
44. Data Warehousing: The electronic storage of a large amount of information by a business,
designed to facilitate management decision-making.
45. Dashboard: A visual display of the most important information needed to achieve one or
more objectives, consolidated and arranged on a single screen so the information can be
monitored at a glance.
46. Machine Learning Engineer: A role focused on creating data funnels and delivering
software solutions, which often involves machine learning algorithms to handle various
tasks related to data, predictions, and decision-making.
47. Business Analyst: A professional who analyzes a business or organization, documenting its
business, processes, or systems, assessing the business model or its integration with
technology.
48. Business Intelligence Analyst: A data-savvy professional who uses data to help
organizations make better business decisions by running queries, creating reports, and
visualizing data to convey the findings to stakeholders.
49. Data Architect: A practitioner of data architecture, a set of rules, policies, standards, and
models that govern and define the type of data collected, and how it is used, stored,
managed, and integrated within an organization and its database systems.
50. Query Language: A computer language used to make queries in databases and information
systems. Broadly, query languages can be classified according to whether they are
database query languages or information retrieval query languages.
51. Statistical Analysis: The science of collecting, exploring, and presenting large amounts of
data to discover underlying patterns and trends.
52. API Integration: The connection between two or more applications, via their APIs, that lets
those systems exchange data.
53. Web Scraping: A technique employed to extract large amounts of data from websites
whereby the data is extracted and saved to a local file in your computer or to a database in
table (spreadsheet) format.
54. IoT (Internet of Things): The interconnection via the internet of computing devices
embedded in everyday objects, enabling them to send and receive data.
55. Real-Time Processing: The processing of data immediately after it is collected, without
putting it into a batch queue.
56. Data Normalization: The process of organizing data in a database to reduce redundancy
and improve data integrity.
57. Predictive Modeling: The process of using known results to create, process, and validate a
model that can be used to forecast future outcomes.
58. Cloud Computing Platforms: Online platforms that offer computing services such as
servers, storage, databases, networking, software, apps, among others, over the cloud.
59. Version Control Systems: Systems that manage changes to a set of files, keeping track of
all modifications in a special kind of database. If a mistake is made, developers can turn
back the clock and compare earlier versions of the code to help fix the mistake while
minimizing disruption to all team members.
60. HDFS (Hadoop Distributed File System): A distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems but is
designed to be more fault-tolerant and to provide high throughput access to application
data.
61. Spark: Apache Spark is an open-source unified analytics engine for large-scale data
processing, with built-in modules for streaming, SQL, machine learning and graph
processing.
62. Scalability: The capability of a system, network, or process to handle a growing amount of
work, or its potential to be enlarged to accommodate that growth.
63. Schema: The structure of a database system, described in a formal language supported by
the database management system (DBMS). In a relational database, the schema defines
the tables, the fields in each table, and the relationships between fields and tables.
64. Aggregation: A computational process of combining multiple data entities into a single
summary or total. Often used in data science to compile information for statistical analysis
or visual summarization.
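A group-and-sum aggregation can be sketched with the standard library; the sales records below are invented for the example.

```python
from collections import defaultdict

# Aggregation sketch: combine row-level sales records into
# one summary total per region.
sales = [("north", 100), ("south", 80), ("north", 50)]

totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount

print(dict(totals))  # {'north': 150, 'south': 80}
```

This is the in-memory equivalent of SQL's `GROUP BY` with `SUM`.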
65. Data Filtering: The process of choosing a smaller part of your data and using that subset for
viewing or analysis. It is one of the steps toward data cleaning and analysis.
66. Batch Processing: The processing of a large volume of data all at once. In this type of data
processing, information is collected and stored, then processed at a predetermined time or
when a certain condition is met.
67. Data Enrichment: The process of enhancing, refining, and improving raw data by merging
with other pieces of relevant data. It is used to create data sets that enable data scientists
and analysts to pull meaningful insights.
68. Data Validation: The process of ensuring that a program operates on clean, correct and
useful data. It uses routines, often called "validation rules", "validation constraints", or
"check routines", that check for correctness, meaningfulness, and security of data that are
input to the system.
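A couple of such validation rules might be expressed as below; the field names and the rules themselves are illustrative, not a standard.

```python
# Validation sketch: each rule checks one property of an input record
# and reports a message when the rule is violated.
def validate(record):
    errors = []
    if record.get("email", "").count("@") != 1:
        errors.append("email must contain exactly one '@'")
    if not isinstance(record.get("age"), int) or record["age"] < 0:
        errors.append("age must be a non-negative integer")
    return errors

print(validate({"email": "a@b.com", "age": 30}))   # [] -> record is valid
print(validate({"email": "bad", "age": -1}))       # two error messages
```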
69. ACID Properties: Atomicity, Consistency, Isolation, Durability. A set of properties that
guarantee database transactions are processed reliably and ensure the integrity of data in
the database.
70. Transactional Data: Information that records the exchanges, transactions, and activities
that affect an organization, which are used to make business decisions and manage
activities.
71. Analytics Platform: Tools and applications used for processing and analyzing data stored
in a database. Platforms can vary widely in the complexity and features offered, supporting
tasks from simple data visualization to advanced predictive and prescriptive analytics.
72. Data Compliance: The process by which an organization ensures that it follows the
established standards and regulations defined by government bodies for managing data.
73. Metadata: Data that describes other data. It provides information about a certain item's
content. For example, an image may include metadata that describes how large the picture
is, the color depth, the image resolution, and when the image was created.
74. Data Stewardship: The management and oversight of an organization's data assets to
provide data governance to ensure data quality and the implementation of policies.
75. Data Hygiene: The practice of keeping data clean and accurate by continually updating it,
removing inaccuracies, and resolving inconsistencies.
76. Data Profiling: The process of examining the data available from an existing source and
collecting statistics or informative summaries about that data.
77. Data Mining: The practice of examining large databases in order to generate new
information. It involves methods at the intersection of machine learning, statistics, and
database systems.
78. Data Lifecycle Management (DLM): The process of managing the flow of data throughout
its lifecycle from creation and initial storage to the time when it becomes obsolete and is
deleted.
79. Data Provenance: Information that helps determine the derivation history of a data record.
It includes details about the processes and data sources that have contributed to the
creation of a data record.
80. Feature Engineering: The process of using domain knowledge to select, modify, or create
new features from raw data that make machine learning algorithms work.
81. Data Wrangling: The process of cleaning and unifying messy and complex data sets for
easy access and analysis.
82. Deep Learning: A subset of machine learning in artificial intelligence that has networks
capable of learning unsupervised from data that is unstructured or unlabeled.
83. Natural Language Processing (NLP): A field of artificial intelligence that gives machines
the ability to read, understand, and derive meaning from human languages.
84. Sentiment Analysis: The use of natural language processing, text analysis, and
computational linguistics to identify, extract, quantify, and study affective states and
subjective information.
85. Predictive Analytics: The use of data, statistical algorithms, and machine learning
techniques to identify the likelihood of future outcomes based on historical data.
86. Time Series Analysis: A statistical technique that deals with time series data, or trend
analysis, to forecast future events based on known past events.
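As a small illustration, a moving average, one of the simplest time series techniques, smooths a series to expose its trend; the series and window size below are arbitrary.

```python
# Moving average: replace each point with the mean of a sliding window,
# damping short-term noise so the underlying trend is easier to see.
series = [10, 12, 11, 15, 14, 18]

def moving_average(values, window):
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

smoothed = moving_average(series, 3)
print(smoothed)
```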
87. Dimensionality Reduction: The process of reducing the number of random variables under
consideration by obtaining a set of principal variables.
88. Anomaly Detection: The identification of rare items, events, or observations which raise
suspicions by differing significantly from the majority of the data.
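One simple approach is a z-score rule: flag points that lie far from the mean, measured in standard deviations. The data and the 2-sigma threshold below are arbitrary choices for illustration.

```python
import statistics

# z-score anomaly detection sketch: a point is anomalous if it sits
# more than 2 standard deviations from the sample mean.
data = [10, 11, 9, 10, 12, 10, 11, 50]

mean = statistics.mean(data)
stdev = statistics.stdev(data)
anomalies = [x for x in data if abs(x - mean) / stdev > 2]

print(anomalies)  # [50]
```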
89. Cluster Analysis: A technique used to group a set of objects in such a way that objects in
the same group are more similar to each other than to those in other groups.
90. Data Federation: The process of aggregating information from disparate sources to create a
single, unified view. This process often involves virtualization to retrieve and manipulate
data without requiring technical details about the data.
91. Multivariate Analysis: A set of statistical techniques used for analysis of data that contains
more than one variable. This is typically done to understand the relationships between
variables and to model the structure of the data.
92. Geospatial Analysis: The gathering, display, and manipulation of imagery, GPS, satellite
photography and historical data, described explicitly in terms of geographic coordinates or
implicitly, in terms of a street address, postal code, or forest stand identifier as they are
applied to geographic models.
93. Data Mart vs. Data Warehouse: Data marts are subsections of data warehouses that
provide data to specific groups within an organization, whereas a data warehouse is a
central repository for all organizational data.
94. Operational Data Store (ODS): A type of database often used as an interim logical area for
a data warehouse, where data is cleaned and transformed.
95. Data Governance Framework: A structure under which business and IT units operate to
ensure that data policies and standards are implemented consistently to ensure the
integrity of data.
96. ETL Testing: The process of validating, verifying, and qualifying data while preventing
duplicate records and data loss. This confirms that the data is loaded into the data
warehouse without errors and as expected.
97. Real-Time Analytics: The use of, or the capacity to use, all available enterprise data and
resources when needed. It involves continuous data processing, which provides immediate
outputs and insights.
98. Data Obfuscation: A form of data masking where data is purposely scrambled to prevent
unauthorized access to sensitive materials.
99. Data Lineage: The detailed data life history, including origins, movements, characteristics,
and quality changes over time.
100. Data Warehouse Automation: The process by which tools and processes automatically
manage and optimize tasks involved in the planning, design, construction, and operation of a
data warehouse.
101. Exploratory Data Analysis (EDA): An approach to analyzing data sets to summarize their
main characteristics, often with visual methods, before applying more formal statistical
techniques.
102. Model Validation: The process of evaluating how well your data analysis or predictive
model performs on new data. It helps ensure that the models are accurate and reliable.
103. Cross-Validation: A technique for assessing how the results of a statistical analysis will
generalize to an independent data set. Commonly used in settings where the goal is
prediction, and one wants to estimate how accurately a predictive model will perform in
practice.
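The fold-splitting step of k-fold cross-validation can be sketched as follows; this is a toy index splitter for illustration, not a production implementation.

```python
# k-fold cross-validation sketch: partition sample indices into k folds;
# each fold serves once as the held-out validation set while the
# remaining folds form the training set.
def k_fold_indices(n_samples, k):
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((sorted(train), sorted(val)))
    return splits

splits = k_fold_indices(6, 3)
print(splits[0])  # ([1, 2, 4, 5], [0, 3])
```

A model would be trained and scored once per split, and the k scores averaged to estimate generalization performance.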
104. Hyperparameter Tuning: The process of finding the optimal combination of parameters
that maximizes the performance of a model. This is crucial in many machine learning
algorithms that require the setting of parameters before the learning process begins.
105. Supervised Learning: A type of machine learning algorithm that is trained on labeled
data, or data that has an input-output pair. The algorithm learns a model that can be applied
to new data.
106. Unsupervised Learning: A type of machine learning algorithm used to draw inferences
from datasets consisting of input data without labeled responses.
107. Reinforcement Learning: A type of machine learning technique where an agent learns to
behave in an environment by performing certain actions and observing the rewards/results of
those actions.
108. Feature Selection: The process of selecting a subset of relevant features for use in model
construction. This reduces the complexity of a model, makes the model easier to interpret,
and can improve model accuracy.
109. Principal Component Analysis (PCA): A statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated variables into a set of
values of linearly uncorrelated variables called principal components.
110. k-Means Clustering: A method of vector quantization, originally from signal processing,
that is popular for cluster analysis in data science. k-means clustering aims to partition n
observations into k clusters in which each observation belongs to the cluster with the nearest
mean.
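The two alternating steps of k-means (assign each point to its nearest centroid, then recompute each centroid as its cluster's mean) can be illustrated on one-dimensional data; the points, starting centroids, and fixed iteration count are arbitrary choices for the sketch.

```python
# Tiny k-means on 1-D points with k = 2.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [0.0, 10.0]          # hypothetical starting guesses

for _ in range(10):              # fixed iteration count for brevity
    # Assignment step: each point joins the cluster of its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: each centroid moves to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(sorted(centroids))  # approximately [1.0, 8.0]
```

Production implementations add smarter initialization (e.g. k-means++) and a convergence check instead of a fixed loop.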
111. Decision Trees: A decision support tool that uses a tree-like model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility.
112. Random Forests: An ensemble learning method for classification, regression, and other
tasks that operates by constructing a multitude of decision trees at training time and
outputting the class that is the mode of the classes (classification) or the mean prediction
(regression) of the individual trees.
113. Neural Networks: A set of algorithms, modeled loosely after the human brain, that are
designed to recognize patterns. They interpret sensory data through a kind of machine
perception, labeling or clustering raw input.
114. Support Vector Machines (SVM): Supervised learning models with associated learning
algorithms that analyze data for classification and regression analysis.
115. Bias-Variance Tradeoff: The problem of simultaneously minimizing two sources of error
that prevent supervised learning algorithms from generalizing beyond their training set: bias,
due to erroneous assumptions in the learning algorithm, and variance, due to sensitivity to
small fluctuations in the training set.
116. Overfitting: A modeling error that occurs when a function fits a limited set of data points
too closely. It happens when a model learns the detail and noise in the training data to the
extent that it negatively impacts the model's performance on new data.
117. Underfitting: A modeling error that occurs when a statistical model or machine learning
algorithm cannot capture the underlying trend of the data. It often happens when there is too
little data to build an accurate model, or when a linear model is fit to non-linear data.
118. Regularization: Techniques used to reduce error by fitting a function appropriately on the
given training set and avoiding overfitting. These techniques typically involve adding a penalty
based on the magnitude of the model's coefficients (such as the sum of their squares) to the
optimization problem.
119. Gradient Descent: An optimization algorithm used to minimize a function by iteratively
moving in the direction of steepest descent, as defined by the negative of the gradient.
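A minimal illustration on a one-dimensional quadratic, where the gradient is known in closed form; the function, starting point, and learning rate are chosen purely for simplicity.

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Each step moves against the gradient, toward the minimum at x = 3.
x = 0.0
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (x - 3)
    x -= learning_rate * gradient

print(round(x, 4))  # 3.0
```

In machine learning, the same update rule is applied to model parameters, with the gradient computed from a loss function over the training data.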
120. Model Deployment: The method by which a data science model is integrated into existing
production environments to provide insights for decision-making processes or automate
actions.