Data Analytics / Data Science - Glossary of Terms

1. Data Analysis: The process of inspecting, cleansing, transforming, and modeling data to discover useful information, inform conclusions, and support decision-making.
2. Data Ecosystem: A complex system of technologies and processes used by organizations to collect, store, process, and analyze data.
3. Enterprise Applications: Software solutions that provide business logic and tools to model entire business processes for organizations to improve productivity and efficiency.
4. Types of Data: Includes structured, semi-structured, and unstructured data.
5. Data Source Types: Various origins of data, such as RDBMS, NoSQL databases, flat files, APIs, and data streams.
6. Data Repositories: Places where data is stored and managed, such as databases, data warehouses, data marts, and data lakes.
7. Data Visualization: The graphic representation of data to help stakeholders understand its significance by placing it in a visual context.
8. RDBMS (Relational Database Management System): A database management system based on the relational model introduced by E.F. Codd.
9. NoSQL: A variety of database technologies designed to accommodate a wide range of data models, including key-value, document, columnar, and graph formats.
10. ETL (Extract, Transform, Load): A type of data integration that involves extracting data from outside sources, transforming it to fit operational needs, and loading it into the end target.
11. Data Pipeline: The complete set of operations involved in the collection, processing, and storage of data.
12. Big Data: Data sets so voluminous and complex that traditional data processing software is inadequate to deal with them.
13. Data Analytics vs. Data Analysis: Data analytics refers to the systematic computational analysis of data or statistics, whereas data analysis is a broader term that involves breaking down a data set to understand its components and structure.
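The ETL process defined in entry 10 can be sketched in a few lines of Python. The records, field names, and transformation rules below are hypothetical illustrations, and a real pipeline would typically use dedicated tooling:

```python
import sqlite3

# Hypothetical raw records, e.g. as extracted from a CSV export or an API.
raw_rows = [
    {"name": " Alice ", "sales": "1200"},
    {"name": "Bob", "sales": "950"},
]

def transform(row):
    # Transform step: strip stray whitespace and cast numeric strings to integers.
    return (row["name"].strip(), int(row["sales"]))

# Load step: write the transformed rows into a target table (an in-memory DB here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [transform(r) for r in raw_rows])

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

The end target here is SQLite only for the sake of a self-contained example; in practice it would be a data warehouse or data mart.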
14. Machine Learning: A type of artificial intelligence that enables software applications to become more accurate in predicting outcomes without being explicitly programmed.
15. Business Intelligence (BI): Technologies, applications, and practices for the collection, integration, analysis, and presentation of business information.
16. Cloud Computing: The on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user.
17. API (Application Programming Interface): A set of rules and specifications that software programs can follow to communicate with each other.
18. SQL (Structured Query Language): A programming language designed for managing data held in a relational database management system.
19. Python: A high-level programming language used for general-purpose programming and particularly popular in data science and analytics for its readability and breadth of functionality.
20. Hadoop: An open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
21. Data Cleaning: The process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset.
22. Data Transformation: The process of converting data from one format or structure into another that is more appropriate for a variety of analytics uses.
23. Data Communication: The process of presenting, interpreting, and discussing data findings, often using data visualization techniques.
24. Modern Data Ecosystem: Encompasses all the technologies and processes that handle the collection, storage, management, and analysis of data.
25. Enterprise Data Analysis: The use of analytical methods and tools within an enterprise to make business decisions based on data.
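Data cleaning (entry 21) can be illustrated with a small Python sketch that drops duplicate records and imputes a missing value with the mean of the observed values; the records and the imputation rule are hypothetical choices for illustration only:

```python
records = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},    # duplicate record
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 29},
]

def clean(rows):
    seen, out = set(), []
    for r in rows:
        if r["id"] in seen:
            continue         # drop duplicate ids
        seen.add(r["id"])
        out.append(r)
    # Impute missing ages with the mean of the observed ages.
    ages = [r["age"] for r in out if r["age"] is not None]
    mean_age = sum(ages) / len(ages)
    for r in out:
        if r["age"] is None:
            r["age"] = mean_age
    return out

cleaned = clean(records)
```

Mean imputation is only one of several possible strategies; dropping the incomplete record or using a median would be equally valid depending on the analysis.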
26. Descriptive Analysis: A type of data analysis that describes, shows, or summarizes the features of a data set in a constructive way; it does not support conclusions beyond the data analyzed or predictions about future outcomes.
27. Diagnostic Analysis: Refers to the examination of data to understand cause and effect relationships. This involves drilling down into data to understand the different elements of your data set.
28. Predictive Analysis: The use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.
29. Prescriptive Analysis: This type of analysis seeks to determine the best solutions or outcomes among various choices, given the known parameters.
30. Data Science: A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
31. Data Mart: A subset of a data warehouse, designed to serve a particular purpose or a specific user community, typically organized by department or business function.
32. Data Lake: A system or repository of data stored in its natural/raw format, usually object blobs or files, which is scalable and allows the storage of data from multiple sources.
33. Structured Data: Data that adheres to a pre-defined model and is easy to organize and search. Examples include data stored in relational (SQL) databases.
34. Semi-Structured Data: Data that does not reside in a relational database but does have some organizational properties that make it easier to analyze. Examples include XML and JSON formats.
35. Unstructured Data: Information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Examples include emails, videos, and social media postings.
36. Key-value Store: A type of non-relational database that uses a simple key/value method to store data.
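The distinction between structured and semi-structured data (entries 33 and 34) is easy to see with JSON: the record below has nested structure and a variable-length list, which would not fit a fixed relational schema directly. The payload and field names are hypothetical:

```python
import json

# A hypothetical semi-structured record, as might arrive from an API.
payload = '{"user": "dana", "tags": ["analytics", "sql"], "profile": {"city": "Oslo"}}'

record = json.loads(payload)

# The organizational properties (keys, nesting) still make it easy to query.
city = record["profile"]["city"]
tag_count = len(record["tags"])
```

Flattening such a record into fixed columns is a typical step when moving semi-structured data into a structured repository.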
37. Columnar Storage: Data storage in a column-oriented format, allowing for faster retrieval of data and improved disk I/O.
38. Graph Database: A database that uses graph structures for semantic queries, with nodes, edges, and properties to represent and store data.
39. OLTP (Online Transaction Processing): A class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.
40. OLAP (Online Analytical Processing): A category of software tools that provides analysis of data stored in a database and is used in data mining.
41. ACID Compliance: A set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps.
42. Data Governance: The management of the availability, usability, integrity, and security of the data employed in an enterprise, with the objective of ensuring that data is consistent, trustworthy, and not misused.
43. Data Integration: The process of combining data from different sources into a single, unified view. Integration begins with the ingestion process and includes steps such as cleansing, ETL mapping, and transformation.
44. Data Warehousing: The electronic storage of a large amount of information by a business, designed to facilitate management decision-making.
45. Dashboard: A visual display of the most important information needed to achieve one or more objectives, consolidated and arranged on a single screen so the information can be monitored at a glance.
46. Machine Learning Engineer: A role focused on creating data funnels and delivering software solutions, which often involves machine learning algorithms to handle various tasks related to data, predictions, and decision-making.
47. Business Analyst: A professional who analyzes a business or organization, documenting its business, processes, or systems, and assessing the business model or its integration with technology.
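The graph-database model of entry 38 (nodes, edges, properties) can be mimicked with a toy adjacency structure in Python. This is purely illustrative, not a real database; the node ids, properties, and relation names are invented:

```python
# A toy graph with property-carrying nodes and labeled edges.
graph = {"nodes": {}, "edges": []}

def add_node(node_id, **properties):
    graph["nodes"][node_id] = properties

def add_edge(source, target, relation):
    graph["edges"].append((source, target, relation))

def neighbors(node_id, relation):
    # A minimal "semantic query": follow edges of one relation type.
    return [t for s, t, r in graph["edges"] if s == node_id and r == relation]

add_node("alice", role="analyst")
add_node("bob", role="engineer")
add_edge("alice", "bob", "works_with")
```

A production graph database would add indexing, traversal languages, and persistence on top of this basic shape.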
48. Business Intelligence Analyst: A data-savvy professional who uses data to help organizations make better business decisions by running queries, creating reports, and visualizing data to convey the findings to stakeholders.
49. Data Architect: A practitioner of data architecture, a set of rules, policies, standards, and models that govern and define the type of data collected and how it is used, stored, managed, and integrated within an organization and its database systems.
50. Query Language: A computer language used to make queries in databases and information systems. Broadly, query languages can be classified according to whether they are database query languages or information retrieval query languages.
51. Statistical Analysis: The science of collecting, exploring, and presenting large amounts of data to discover underlying patterns and trends.
52. API Integration: The connection between two or more applications, via their APIs, that lets those systems exchange data.
53. Web Scraping: A technique employed to extract large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in tabular (spreadsheet) format.
54. IoT (Internet of Things): The interconnection via the internet of computing devices embedded in everyday objects, enabling them to send and receive data.
55. Real-Time Processing: The processing of data immediately after it is collected, without putting it into a batch queue.
56. Data Normalization: The process of organizing data in a database to reduce redundancy and improve data integrity.
57. Predictive Modeling: The process of using known results to create, process, and validate a model that can be used to forecast future outcomes.
58. Cloud Computing Platforms: Online platforms that offer computing services such as servers, storage, databases, networking, software, and applications over the internet.
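Predictive modeling (entry 57) in its simplest form is fitting a line to known results and forecasting an unseen point. Here is a minimal ordinary-least-squares sketch on a tiny invented dataset; real modeling would use a library and a held-out validation set:

```python
# Fit y = slope * x + intercept by ordinary least squares.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]   # hypothetical known results, roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

# Forecast a future outcome at an unseen x value.
predicted = slope * 5.0 + intercept
```

The "validate" part of entry 57 corresponds to checking such a model against data it was not fitted on (see cross-validation, entry 103).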
59. Version Control Systems: Systems that manage changes to a set of files, keeping track of all modifications in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members.
60. HDFS (Hadoop Distributed File System): A distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems but is designed to be more fault-tolerant and to provide high-throughput access to application data.
61. Spark: Apache Spark is an open-source unified analytics engine for large-scale data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
62. Scalability: The capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth.
63. Schema: The structure of a database system, described in a formal language supported by the database management system (DBMS). In a relational database, the schema defines the tables, the fields in each table, and the relationships between fields and tables.
64. Aggregation: A computational process of combining multiple data entities into a single summary or total. Often used in data science to compile information for statistical analysis or visual summarization.
65. Data Filtering: The process of choosing a smaller part of your data and using that subset for viewing or analysis. It is one of the steps toward data cleaning and analysis.
66. Batch Processing: The processing of a large volume of data all at once. In this type of data processing, information is collected and stored for processing at a predetermined time or when a certain condition is met.
67. Data Enrichment: The process of enhancing, refining, and improving raw data by merging it with other pieces of relevant data. It is used to create data sets that enable data scientists and analysts to pull meaningful insights.
68. Data Validation: The process of ensuring that a program operates on clean, correct, and useful data. It uses routines, often called "validation rules", "validation constraints", or "check routines", that check for the correctness, meaningfulness, and security of data input to the system.
69. ACID Properties: Atomicity, Consistency, Isolation, Durability. A set of properties that guarantee database transactions are processed reliably and ensure the integrity of data in the database.
70. Transactional Data: Information that records the exchanges, transactions, and activities that affect an organization, which is used to make business decisions and manage activities.
71. Analytics Platform: Tools and applications used for processing and analyzing data stored in a database. Platforms can vary widely in the complexity and features offered, supporting tasks from simple data visualization to advanced predictive and prescriptive analytics.
72. Data Compliance: The process by which an organization ensures that it follows the established standards and regulations defined by government bodies for managing data.
73. Metadata: Data that describes other data. It provides information about a certain item's content. For example, an image may include metadata that describes how large the picture is, the color depth, the image resolution, and when the image was created.
74. Data Stewardship: The management and oversight of an organization's data assets to provide data governance, ensure data quality, and implement policies.
75. Data Hygiene: The practice of keeping data clean and accurate by continually updating it, removing inaccuracies, and resolving inconsistencies.
76. Data Profiling: The process of examining the data available from an existing source and collecting statistics or informative summaries about that data.
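Filtering (entry 65) and aggregation (entry 64) are often the first two operations applied to raw records. A minimal sketch with hypothetical transaction data:

```python
from collections import defaultdict

# Hypothetical transaction records to be filtered and summarized.
transactions = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 50},
]

# Filtering: choose the smaller subset of interest (entry 65).
large_orders = [t for t in transactions if t["amount"] >= 100]

# Aggregation: combine entities into per-group totals (entry 64).
totals = defaultdict(int)
for t in transactions:
    totals[t["region"]] += t["amount"]
```

The same pair of operations corresponds to `WHERE` and `GROUP BY ... SUM()` in SQL.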
77. Data Mining: The practice of examining large databases in order to generate new information. It involves methods at the intersection of machine learning, statistics, and database systems.
78. Data Lifecycle Management (DLM): The process of managing the flow of data throughout its lifecycle, from creation and initial storage to the time when it becomes obsolete and is deleted.
79. Data Provenance: Information that helps determine the derivation history of a data record. It includes details about the processes and data sources that have contributed to the creation of a data record.
80. Feature Engineering: The process of using domain knowledge to select, modify, or create new features from raw data that make machine learning algorithms work.
81. Data Wrangling: The process of cleaning and unifying messy and complex data sets for easy access and analysis.
82. Deep Learning: A subset of machine learning in artificial intelligence that has networks capable of learning unsupervised from data that is unstructured or unlabeled.
83. Natural Language Processing (NLP): A field of artificial intelligence that gives machines the ability to read, understand, and derive meaning from human languages.
84. Sentiment Analysis: The use of natural language processing, text analysis, and computational linguistics to identify, extract, quantify, and study affective states and subjective information.
85. Predictive Analytics: The use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.
86. Time Series Analysis: A statistical technique that deals with time series data, or trend analysis, to forecast future events based on known past events.
87. Dimensionality Reduction: The process of reducing the number of random variables under consideration by obtaining a set of principal variables.
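Feature engineering (entry 80) often means deriving model-ready inputs from raw fields using domain knowledge. A common example is expanding a raw timestamp into calendar features; the feature names below are hypothetical conventions, not a standard:

```python
from datetime import datetime

# A hypothetical raw timestamp string from a source system.
raw = "2023-07-14 09:30:00"
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Derived features a model can actually use.
features = {
    "day_of_week": ts.weekday(),       # 0 = Monday ... 6 = Sunday
    "hour": ts.hour,
    "is_weekend": ts.weekday() >= 5,
}
```

Which features help depends entirely on the problem; the point is that the raw string itself is rarely useful to an algorithm.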
88. Anomaly Detection: The identification of rare items, events, or observations which raise suspicions by differing significantly from the majority of the data.
89. Cluster Analysis: A technique used to group a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
90. Data Federation: The process of aggregating information from disparate sources to create a single, unified view. This process often involves virtualization to retrieve and manipulate data without requiring technical details about the data.
91. Multivariate Analysis: A set of statistical techniques used for the analysis of data that contains more than one variable. This is typically done to understand the relationships between variables and to model the structure of the data.
92. Geospatial Analysis: The gathering, display, and manipulation of imagery, GPS data, satellite photography, and historical data described explicitly in terms of geographic coordinates, or implicitly in terms of a street address, postal code, or forest stand identifier, as applied to geographic models.
93. Data Mart vs. Data Warehouse: Data marts are subsections of data warehouses that provide data to specific groups within an organization, whereas a data warehouse is a central repository for all organizational data.
94. Operational Data Store (ODS): A type of database often used as an interim staging area for a data warehouse, where data is cleaned and transformed.
95. Data Governance Framework: A structure under which business and IT units operate to ensure that data policies and standards are implemented consistently and the integrity of data is maintained.
96. ETL Testing: The process of validating, verifying, and qualifying data while preventing duplicate records and data loss. This confirms that the data is loaded into the data warehouse without errors and as expected.
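A simple instance of anomaly detection (entry 88) is the z-score rule: flag observations more than some number of standard deviations from the mean. The readings and the threshold of 2 are illustrative choices, not a recommendation:

```python
import statistics

# Hypothetical sensor readings with one obviously deviant value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag points whose z-score magnitude exceeds 2.
anomalies = [x for x in readings if abs(x - mean) / stdev > 2]
```

This rule assumes roughly normal data and a small fraction of outliers; robust methods (e.g. based on the median) are preferred when outliers distort the mean itself.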
97. Real-Time Analytics: The use of, or the capacity to use, all available enterprise data and resources when needed. It involves continuous data processing, which provides immediate outputs and insights.
98. Data Obfuscation: A form of data masking where data is purposely scrambled to prevent unauthorized access to sensitive materials.
99. Data Lineage: The detailed data life history, including origins, movements, characteristics, and quality changes over time.
100. Data Warehouse Automation: The process by which tools and processes automatically manage and optimize tasks involved in the planning, design, construction, and operation of a data warehouse.
101. Exploratory Data Analysis (EDA): An approach to analyzing data sets to summarize their main characteristics, often with visual methods, before applying more formal statistical techniques.
102. Model Validation: The process of evaluating how well your data analysis or predictive model performs on new data. It helps ensure that the models are accurate and reliable.
103. Cross-Validation: A technique for assessing how the results of a statistical analysis will generalize to an independent data set. Commonly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
104. Hyperparameter Tuning: The process of finding the optimal combination of parameters that maximizes the performance of a model. This is crucial in many machine learning algorithms that require the setting of parameters before the learning process begins.
105. Supervised Learning: A type of machine learning algorithm that is trained on labeled data, or data that has an input-output pair. The algorithm learns a model that can be applied to new data.
106. Unsupervised Learning: A type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
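The mechanics of k-fold cross-validation (entry 103) reduce to splitting the sample indices into k folds, each serving once as the held-out test set. A minimal sketch without shuffling (a real workflow would shuffle and typically use a library routine):

```python
def k_fold_indices(n_samples, k):
    """Return (train, test) index lists for each of k folds."""
    folds = []
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        end = start + fold_size if i < k - 1 else n_samples
        test = list(range(start, end))
        train = [j for j in range(n_samples) if j not in test]
        folds.append((train, test))
    return folds

splits = k_fold_indices(10, 5)
```

Averaging a model's score across the k test folds estimates how it will perform on genuinely independent data.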
107. Reinforcement Learning: A type of machine learning technique where an agent learns to behave in an environment by performing certain actions and observing the rewards/results of those actions.
108. Feature Selection: The process of selecting a subset of relevant features for use in model construction. This reduces the complexity of a model, makes the model easier to interpret, and can improve model accuracy.
109. Principal Component Analysis (PCA): A statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
110. k-Means Clustering: A method of vector quantization, originally from signal processing, that is popular for cluster analysis in data science. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
111. Decision Trees: A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
112. Random Forests: An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
113. Neural Networks: A set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling, or clustering of raw input.
114. Support Vector Machines (SVM): Supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
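The assign-then-update loop at the heart of k-means (entry 110) fits in a few lines for one-dimensional data. The points and initial centroids are invented for illustration; real use would rely on a library such as scikit-learn:

```python
def k_means_1d(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = k_means_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```

With these two well-separated groups the centroids settle near 1.0 and 9.0; in general k-means converges to a local optimum that depends on the initial centroids.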
115. Bias-Variance Tradeoff: The problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set: the bias, due to erroneous assumptions in the learning algorithm, and the variance, due to sensitivity to small fluctuations in the training set.
116. Overfitting: A modeling error that occurs when a function is too closely fit to a limited set of data points. It happens when a model learns the detail and noise in the training data to an extent that it negatively impacts the performance of the model on new data.
117. Underfitting: A modeling error that occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. It often happens when we have too little data to build an accurate model or when we try to fit a linear model to non-linear data.
118. Regularization: Techniques used to reduce error by fitting a function appropriately on the given training set and avoiding overfitting. These techniques typically involve adding some form of magnitude measurement (such as the squares of coefficients) to the optimization problem.
119. Gradient Descent: An optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
120. Model Deployment: The method by which a data science model is integrated into existing production environments to provide insights for decision-making processes or automate actions.
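Gradient descent (entry 119) can be demonstrated on a function whose minimum is known in closed form. For f(x) = (x - 3)^2 the gradient is f'(x) = 2(x - 3), so repeatedly stepping against it drives x toward 3; the starting point, learning rate, and step count below are arbitrary illustrative choices:

```python
def gradient_descent(start, learning_rate=0.1, steps=100):
    """Minimize f(x) = (x - 3)**2 by following the negative gradient."""
    x = start
    for _ in range(steps):
        gradient = 2 * (x - 3)          # f'(x)
        x -= learning_rate * gradient   # step in the direction of steepest descent
    return x

minimum = gradient_descent(start=0.0)
```

Too large a learning rate makes the iterates overshoot and diverge, while too small a rate converges needlessly slowly, which is exactly the kind of hyperparameter choice entry 104 describes.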