The world of data is full of buzzwords and ambiguous terminology. It’s no wonder that even professionals sometimes get confused. What’s the difference between a Data Lake and a Data Lakehouse? Is a Data Warehouse the same as a Data Platform?
These terms are sometimes used inconsistently or even misleadingly. Adding to the confusion, the field evolves so rapidly that universally accepted definitions are hard to come by.
This glossary is for anyone working with data or simply interested in using data to drive business development.
It is meant to be shared and reused, and we will update it regularly as new terms emerge; feel free to suggest missing terms. It may never be “complete,” as the field is always evolving.
A step-by-step set of instructions used to perform a task or solve a problem. In computing, algorithms are the basis of automation. In machine learning, we talk about adaptive or learning algorithms—those that improve their own performance based on data feedback.
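To make “step-by-step instructions” concrete, here is a classic algorithm, binary search, sketched in Python; the list and target value are arbitrary example inputs:

```python
def binary_search(items, target):
    """Return the index of target in a sorted list, or -1 if absent."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2      # halve the search range each step
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

print(binary_search([2, 5, 8, 12, 16, 23], 12))  # 3
```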
A broad field of computer science aimed at building systems that simulate human intelligence. Modern AI includes both “narrow AI” (task-specific, like language translation or image classification) and aspirations toward “general AI.” Applications increasingly include autonomous agents, reasoning systems, and creative generation.
The storage and processing of large, often unstructured data sets such as text, images, audio, and video. Technologies like Hadoop emerged to handle this scale. Today, big data solutions are often packaged by major cloud providers and enable storing massive volumes of data without predefined schemas, though querying that data can be more complex.
A structured “dictionary” of core business terms such as customer, product, or project, maintained by the organization. It supports initiatives like data governance and platform implementations. Conceptual modeling is often used to build it.
Tools and methods for analyzing, visualizing, and reporting business data. Tools like Power BI, Tableau, Looker, and ThoughtSpot enable self-service BI, allowing business users to analyze data without IT support. BI helps businesses make data-driven decisions.
A visual interface, similar to a car’s dashboard, that shows key metrics (KPIs) in real time. Typically built using BI tools, dashboards help monitor business performance at a glance.
The delivery of computing services (servers, storage, databases, software) over the internet. Providers like AWS, Azure, and Google Cloud enable rapid deployment, elastic scalability, and usage-based pricing. Foundational to modern data platforms.
A high-level approach to model key concepts and their relationships from a business perspective, independent of technical implementation. Supports communication between stakeholders and forms a basis for database design.
The analysis of business data, covering both descriptive analytics (summarizing past events) and predictive analytics (forecasting future outcomes using statistical models). When automated, this is often called Advanced Analytics. Use cases include churn prediction, machine failure forecasting, and early disease detection.
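A small sketch of the distinction in Python, using made-up monthly sales figures: a descriptive summary of the past, then a simple linear trend forecast (requires Python 3.10+ for `statistics.linear_regression`):

```python
import statistics

# Hypothetical monthly sales figures
sales = [120, 135, 128, 150, 162, 171]

# Descriptive analytics: summarize what already happened
print("average monthly sales:", round(statistics.mean(sales), 1))

# Predictive analytics: fit a linear trend and forecast the next month
months = list(range(1, len(sales) + 1))
slope, intercept = statistics.linear_regression(months, sales)
print("forecast for month 7:", round(slope * 7 + intercept, 1))
```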
The structural design of an organization’s data assets, systems, and flows. It defines how data is collected, stored, integrated, and used across technologies and business processes. Data architecture supports governance, analytics, security, and the transformation of raw data into actionable information. It guides both strategic initiatives and operational execution, and is used by data, solution, and enterprise architects alike.
A metadata-driven tool that helps organizations discover, understand, and govern their data assets. Think of it as a “library” of metadata for all systems, databases, and files. It’s key to navigating the growing complexity of enterprise data.
The process of identifying and correcting or removing inaccurate, incomplete, or duplicate data in datasets. It’s often said that data science is 80% data cleaning and 20% actual analysis due to the messy nature of real-world data.
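A minimal sketch of typical cleaning steps in plain Python, using fabricated records: trimming whitespace, normalizing case, dropping rows that miss a required field, and removing duplicates:

```python
# Hypothetical raw records: whitespace, casing issues, a duplicate, a missing email
raw = [
    {"name": " Ada Lovelace ", "email": "ADA@EXAMPLE.COM"},
    {"name": "ada lovelace", "email": "ada@example.com"},
    {"name": "Grace Hopper", "email": None},
]

cleaned, seen = [], set()
for record in raw:
    email = (record["email"] or "").strip().lower()
    if not email:
        continue          # drop records missing a required field
    if email in seen:
        continue          # drop duplicates, keyed on normalized email
    seen.add(email)
    cleaned.append({"name": record["name"].strip().title(), "email": email})

print(cleaned)  # a single clean record for Ada Lovelace
```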
A formal, version-controlled agreement between data producers and consumers that defines the structure, semantics, and quality expectations of a dataset. Data contracts help ensure stability and trust by specifying what data is delivered, in what format, how often, and with what guarantees (e.g. schema, freshness, SLAs). They support autonomous development, prevent breaking changes, and are especially useful in data mesh architectures where domain teams act as both producers and owners of data products.
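The sketch below illustrates the idea in plain Python; the contract fields and the `validate` helper are hypothetical, not a standard API:

```python
# A toy data contract: agreed schema plus a freshness guarantee (all names invented)
contract = {
    "schema": {"customer_id": int, "email": str},
    "max_staleness_hours": 24,
}

def validate(rows, contract):
    """Reject a delivery that would break the agreed schema."""
    for row in rows:
        for field, expected_type in contract["schema"].items():
            if not isinstance(row.get(field), expected_type):
                raise ValueError(f"contract violation in field {field!r}: {row}")

validate([{"customer_id": 42, "email": "a@example.com"}], contract)  # passes silently
```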
The framework of policies, roles, responsibilities, and processes for managing data throughout its lifecycle. It ensures compliance (e.g., GDPR), defines ownership, and supports high-quality data usage across the organization.
A file-based architecture for storing large volumes of raw and unstructured data. Data is ingested quickly and stored cheaply, allowing analysis at a later stage. Often used alongside data warehouses but not ideal for structured reporting. Favored by data scientists and AI/ML developers.
A hybrid architecture that combines the flexibility of data lakes with the structure of data warehouses. Popularized by Databricks, lakehouses aim to serve both AI/ML and BI use cases in a unified platform.
A method for documenting what data exists across an organization and how it is structured. It helps clarify key business concepts, supports communication between business and IT, and is essential for system migrations, BI projects, and digital transformation.
An emerging concept advocating decentralized data ownership and architecture. It emphasizes domain-driven design, data products, and APIs. Companies like Netflix and Zalando use Data Mesh to scale their data platforms.
The process of defining data structures, relationships, and rules for databases. Follows conceptual modeling and includes techniques like dimensional modeling (e.g. star schema) and Data Vault, especially for data warehouses.
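As a minimal illustration of dimensional modeling, here is a toy star schema (one fact table referencing two dimensions) created with Python’s built-in sqlite3 module; the table and column names are invented for the example:

```python
import sqlite3

# A toy star schema: one central fact table referencing two dimension tables
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, iso_date TEXT);
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        amount       REAL
    );
""")
```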
A comprehensive environment for data ingestion, storage, processing, and analysis—usually cloud-based. Common examples include Microsoft Azure, AWS, and Google Cloud Platform. These platforms offer native tools for AI/ML, app development, data warehousing, and BI. Strengths include flexibility and scalability; challenges include integration across tools.
Processes and practices to ensure data is accurate, complete, and fit for use. It includes validation, profiling, and monitoring. High data quality is essential for analytics, machine learning, and business operations.
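A tiny profiling sketch in plain Python, using fabricated values, that measures two common quality dimensions, completeness and validity:

```python
# Hypothetical ages, including a missing value and an implausible one
ages = [34, 29, None, 41, 29, 230]

total = len(ages)
missing = sum(1 for a in ages if a is None)
invalid = sum(1 for a in ages if a is not None and not 0 <= a <= 120)

print(f"completeness: {(total - missing) / total:.0%}")                    # 83%
print(f"validity: {(total - missing - invalid) / (total - missing):.0%}")  # 80%
```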
A discipline combining statistical modeling, hypothesis testing, and machine learning to discover patterns and generate insights from data. Data scientists formulate new questions and build models that drive innovation and competitive advantage.
A designated individual responsible for the quality and usability of data within a specific domain. Data stewards act as subject matter experts and contribute to governance efforts.
A technology that allows you to access and query data across multiple sources without physically moving it. Tools like Denodo or TIBCO create virtual views that present data in a unified way, accessible via SQL. Useful for quick integrations but may struggle with performance or data history tracking.
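The generator below sketches the core idea in plain Python: two in-memory stand-ins for source systems, and a unified “view” that maps their differing field names on demand instead of copying the data; all names are invented:

```python
# Two in-memory stand-ins for source systems with different field names
crm = [{"id": 1, "name": "Acme"}]
erp = [{"cust_no": 2, "cust_name": "Globex"}]

def customers():
    """A virtual, unified view: data stays in the sources until queried."""
    for row in crm:
        yield {"customer_id": row["id"], "name": row["name"]}
    for row in erp:
        yield {"customer_id": row["cust_no"], "name": row["cust_name"]}

print(list(customers()))  # unified result assembled on demand, nothing copied
```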
A structured collection of data, stored for easy access, retrieval, and sharing. Managed by a Database Management System (DBMS). Most enterprise databases are relational, but document and graph databases are increasingly used.
A subject-specific, refined data store optimized for reporting, often using star schema or flat table structures. It can be standalone or derived from a larger data warehouse. BI tools typically integrate well with data marts.
A subfield of machine learning using artificial neural networks to model complex patterns. It powers breakthroughs in image recognition, natural language processing, and autonomous systems. Deep learning models often outperform humans in narrow tasks.
A centralized data repository for integrating, storing, and analyzing data from multiple sources. Designed for querying and reporting, often using relational technologies. It supports historical data tracking and enables a “360-degree view” of business entities.
A company-wide version of a data warehouse, integrating data across business units. Designed for scalability and long-term strategic use. Requires rigorous planning, modeling, and governance.
A process for transferring data from source systems to a data warehouse. It involves extracting data, transforming it into a desired structure, and loading it into the destination. Common in enterprise data integration workflows.
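A compact extract-transform-load sketch in Python using only the standard library; the CSV content, the VAT-style business rule, and the table layout are all hypothetical:

```python
import csv, io, sqlite3

# Extract: read rows from a source system (a CSV export stands in here)
source = io.StringIO("id,amount\n1,100.0\n2,250.5\n")
rows = csv.DictReader(source)

# Transform: cast types and apply a business rule (a made-up 24% VAT markup)
transformed = [(int(r["id"]), round(float(r["amount"]) * 1.24, 2)) for r in rows]

# Load: write the result into the destination table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER, amount_gross REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)", transformed)
```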
A subset of artificial intelligence focused on creating new content such as text, images, audio, or code. GenAI systems learn patterns from large datasets and generate outputs that resemble human-created content. Prominent examples include tools for text generation, image synthesis, and music composition. GenAI enables applications like chatbots, virtual assistants, content automation, and personalized recommendations. Business use cases are rapidly expanding across industries.
The integration of physical devices with the internet, enabling them to generate and transmit data. Applications include smart homes, connected factories, and wearable health devices. IoT data can enrich business analytics and predictive models.
A type of deep learning model trained on massive text corpora to understand and generate human language. LLMs use architectures like transformers and are capable of tasks such as translation, summarization, question answering, and code generation. Examples include GPT, Claude, and LLaMA. LLMs are foundational to GenAI applications and are increasingly integrated into enterprise data tools to support search, documentation, and decision-making.
A field of AI where systems learn patterns from data without explicit programming. It uses statistical techniques to improve task performance over time. Deep learning is a subset of this category.
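As a minimal sketch of “learning patterns from data without explicit programming,” here is a one-nearest-neighbor classifier in plain Python; the training points and labels are fabricated:

```python
import math

# Tiny labeled dataset of (feature vector, label) pairs, entirely made up
training = [((1.0, 1.2), "small"), ((0.8, 0.9), "small"),
            ((4.0, 4.5), "large"), ((4.2, 3.9), "large")]

def predict(x):
    """1-nearest-neighbor: the rule comes from the data, not hand-written logic."""
    _, label = min(training, key=lambda pair: math.dist(x, pair[0]))
    return label

print(predict((0.9, 1.0)))  # small
print(predict((4.1, 4.2)))  # large
```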
Processes and tools to ensure the consistency and accuracy of key reference data like customers, products, or suppliers across systems. It aims to centralize and maintain trusted “golden records” for critical business entities.
Data about data—describing the structure, content, and context of data assets. Examples include file creation dates or definitions of business terms. Well-maintained metadata improves data discoverability, integration, and governance.
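A small example of technical metadata in Python: asking the operating system for facts about a file (its size and modification time) rather than reading its contents. The file itself is a throwaway created just for the demonstration:

```python
import datetime
import os
import tempfile

# Create a throwaway data file, then inspect its technical metadata
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as f:
    f.write(b"id,amount\n1,100\n")
    path = f.name

info = os.stat(path)                    # facts about the file, not its contents
print("size in bytes:", info.st_size)   # 16
print("last modified:", datetime.datetime.fromtimestamp(info.st_mtime))
os.remove(path)
```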
Non-relational databases optimized for flexibility, scalability, and speed. They use data models like documents, key-value pairs, or graphs. Examples include MongoDB and Neo4j. Popular for modern, high-performance applications.
Software developed in a transparent and collaborative way, where the source code is publicly available for use, modification, and distribution. Examples include MariaDB (from Finland) and Hadoop. This model drives much of today’s software innovation.
A traditional database structure where data is stored in structured tables and accessed via SQL. Based on E.F. Codd’s relational model, relational databases like Oracle, PostgreSQL, and MySQL are the backbone of enterprise systems.
A practice in software development, data quality, and testing that involves moving critical activities earlier in the lifecycle—“to the left” on the project timeline. In data work, it means addressing quality, validation, governance, and even stakeholder collaboration earlier in design and development, not just during testing or deployment. The goal is to catch issues sooner, reduce rework, and increase agility. “Shift Left” reflects a broader cultural shift toward proactive planning and cross-functional collaboration, especially in DevOps, DataOps, and agile delivery models.
A powerful language for managing and querying data in relational databases. Developed in the 1970s at IBM, SQL remains the standard for structured data and is supported by many modern platforms, including some NoSQL systems.
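A minimal, runnable example using Python’s built-in sqlite3 module; the table and rows are fabricated, but the SELECT statement is ordinary SQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
con.executemany("INSERT INTO customers (name, country) VALUES (?, ?)",
                [("Aino", "FI"), ("Bob", "US"), ("Ceren", "TR")])

# Declarative querying: describe the result you want, not how to fetch it
for (name,) in con.execute(
        "SELECT name FROM customers WHERE country = ? ORDER BY name", ("FI",)):
    print(name)  # Aino
```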