Data Glossary
Here you will find definitions of data-related terms.
This page is still a work in progress (WIP).
- A
- B
- D
- DAG
- Data
- Data Analysis
- Data Analyst
- Data Analytics
- Data Archiving
- Data Catalog
- Data Cleansing
- Data Compression
- Data Design
- Data Ecosystem
- Data Engineering
- Data Engineering Framework
- Data Governance
- Data Governance Council
- Data Ingestion
- Data Integration
- Data Lake
- Data Lakehouse
- Data Lineage
- Data Mart
- Data Mesh
- Data Modeling
- Data Partitioning
- Data Pipeline
- Data Quality
- Data Replication
- Data Retention Policy
- Data Scalability
- Data Science
- Data Security
- Data SerDes (Serialization/Deserialization)
- Data Silos
- Data Strategy
- Data Streaming
- Data Transformation
- Data Virtualization
- Data Visualization
- Data Warehouse
- Data Warehousing Solutions
- DataOps (Data Operations)
- Data-driven Decision-making
- Database
- Dataset
- Dimensional modeling
- E
- F
- M
- O
- S
A
Analytical Skills
Qualities and characteristics associated with using facts to solve problems.
Analytical Thinking
The process of identifying and defining a problem, then solving it by using data in an organized, step-by-step manner.
Anomaly Detection
Anomaly detection is a machine learning technique for discovering out-of-norm behaviour (in statistical terms, an "outlier") in a dataset.
In data analysis, anomaly detection is commonly understood as the recognition of uncommon items, events, or observations that deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behaviour.
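As a minimal illustration, one of the simplest approaches is a z-score check: flag any point that lies more than a few standard deviations from the mean. The readings and the threshold below are made-up assumptions, not a production method.

```python
import numpy as np

# A minimal z-score sketch: flag points far from the mean.
# The readings and the threshold of 3.0 are illustrative assumptions.
readings = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 9.9, 10.1,
                     10.0, 9.8, 10.2, 10.3, 9.9, 10.1, 9.8, 10.0, 10.2, 42.0])

z_scores = np.abs((readings - readings.mean()) / readings.std())
print(readings[z_scores > 3.0])  # [42.] -- the lone outlier
```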
B
Batch Processing
Batch processing is the method of processing data in predefined, discrete chunks (batches) at scheduled intervals, often employed for non-real-time data processing tasks.
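A minimal sketch of the idea in Python, assuming records arrive from some iterable source; the batch size of 4 is arbitrary:

```python
from itertools import islice

def batches(source, size):
    """Yield successive fixed-size batches from any iterable."""
    it = iter(source)
    while batch := list(islice(it, size)):
        yield batch

# Hypothetical records; in practice these might come from a file or a queue.
for batch in batches(range(1, 11), size=4):
    print("processing batch:", batch)
```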
Big Data
Big Data refers to large and complex datasets that exceed the capabilities of traditional data processing applications. It often involves high volume, velocity, and variety of data.
BigQuery
BigQuery is a fully managed, serverless data warehouse. Tables in BigQuery are organised into datasets.
"BigQuery is a serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data. Use built-in ML/AI and BI for insights at scale."
BigQuery provides two services in one:
- Storage plus analytics: a place to store petabytes of data. For reference, 1 petabyte is equivalent to 11,000 movies at 4K quality.
- A place to analyse data, with built-in features like machine learning, geospatial analysis, and business intelligence.
BigQuery is a fully managed, serverless solution: you don't need to provision resources or manage servers in the backend, and can instead focus on using SQL queries to answer your organisation's questions in the frontend.
BigQuery has a flexible pay-as-you-go pricing model. Pay for what your query processes, or use a flat-rate option.
BigQuery encrypts data at rest by default, protecting data stored on disk.
BigQuery has built-in machine learning features so you can write ML models directly in BigQuery using SQL.
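As a sketch of what querying BigQuery looks like from code, here is an example using the google-cloud-bigquery Python client. The project, dataset, and table names are hypothetical, and credentials are assumed to be configured in the environment.

```python
# Minimal sketch using the google-cloud-bigquery client library
# (pip install google-cloud-bigquery). All names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

query = """
    SELECT station_id, COUNT(*) AS trips
    FROM `my-project.my_dataset.trips`   -- hypothetical table
    GROUP BY station_id
    ORDER BY trips DESC
    LIMIT 10
"""

for row in client.query(query).result():  # .result() waits for the job
    print(row.station_id, row.trips)
```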
BigQuery ML
BigQuery ML lets you create and execute machine learning models using GoogleSQL queries. BigQuery ML democratises machine learning by letting SQL practitioners build models using existing SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data.
A model in BigQuery ML represents what a machine learning (ML) system has learned from training data.
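A minimal sketch of training a model with BigQuery ML's CREATE MODEL statement, submitted here through the Python client; every project, dataset, table, and column name is hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# CREATE MODEL is standard BigQuery ML GoogleSQL; training runs inside
# BigQuery, so no data leaves the warehouse. Names are illustrative.
create_model = """
    CREATE OR REPLACE MODEL `my-project.my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg',
             input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, churned
    FROM `my-project.my_dataset.customers`
"""
client.query(create_model).result()

# Once trained, the model can be queried the same way with ML.PREDICT:
predict = """
    SELECT *
    FROM ML.PREDICT(MODEL `my-project.my_dataset.churn_model`,
                    (SELECT tenure_months, monthly_spend
                     FROM `my-project.my_dataset.customers`))
"""
```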
D
DAG
A DAG (Directed Acyclic Graph) is a conceptual representation of a series of activities; in other words, a mathematical abstraction of a data pipeline.
A DAG (or a pipeline) defines a sequence of execution stages in any non-recurring algorithm.
The DAG acronym stands for:
Directed – In general, if multiple tasks exist, each must have at least one defined upstream (previous) or downstream (subsequent) task, or both. (DAGs can also contain parallel tasks with no dependencies between them.)
Acyclic – No task can create data that goes on to reference itself; that could cause an infinite loop. There are no cycles in a DAG.
Graph – In mathematics, a graph is a finite set of nodes (vertices) connected by edges. In data engineering, each node in a graph represents a task. All tasks are laid out clearly, with discrete processes occurring at set points and transparent relationships to other tasks.
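A minimal sketch of these properties using Python's standard-library graphlib: the dictionary maps each task to its upstream dependencies, and a topological sort yields a valid execution order (a cycle would raise CycleError). The task names are illustrative.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; each value is the set of upstream tasks it needs.
dag = {
    "extract_orders": set(),
    "extract_users":  set(),
    "join":           {"extract_orders", "extract_users"},
    "aggregate":      {"join"},
    "load_warehouse": {"aggregate"},
}

# A valid execution order respects every directed edge.
print(list(TopologicalSorter(dag).static_order()))
```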
Data
A collection of facts.
Data Analysis
The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making.
Data Analyst
Someone who collects, transforms, and organizes data in order to draw conclusions, make predictions, and drive informed decision-making.
Data Analytics
The science of data.
Data Archiving
Data Archiving is the practice of moving data that is no longer actively used to a separate storage or archive, typically for long-term retention and compliance purposes.
Data Catalog
A data catalog is a centralized inventory of data assets within an organization, providing metadata and information about available data sources and datasets.
Data Cleansing
Data cleansing, also known as data scrubbing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in data to improve its quality.
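A minimal pandas sketch on made-up records, covering a few common fixes: stray whitespace, wrong column types, duplicates, and missing values.

```python
import pandas as pd

# Made-up records with typical quality problems baked in.
df = pd.DataFrame({
    "name":  [" Alice", "Bob ", "Bob ", None],
    "age":   ["34", "29", "29", "41"],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})

df["name"] = df["name"].str.strip()   # remove stray whitespace
df["age"] = pd.to_numeric(df["age"])  # correct the column type
df = df.drop_duplicates()             # remove exact duplicate rows
df = df.dropna(subset=["name"])       # drop rows missing a name
print(df)
```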
Data Compression
Data compression is the technique of reducing the size of data to save storage space, minimize data transfer time, and improve overall system performance.
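A minimal sketch with Python's built-in gzip module, compressing a repetitive made-up payload and verifying the lossless round trip:

```python
import gzip

# Repetitive data compresses well; this payload is illustrative.
payload = b"event=click;page=home;" * 1000

compressed = gzip.compress(payload)
print(len(payload), "->", len(compressed), "bytes")

assert gzip.decompress(compressed) == payload  # lossless round trip
```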
Data Design
How information is organized. Data Design involves the process of creating and defining the structure, layout, and organization of data within a database, data warehouse, or data system. It encompasses decisions on data models, schemas, tables, and relationships, aiming to optimize data storage, access, and retrieval for efficient data processing and analysis.
Data Design is fundamental in ensuring that data is structured in a way that aligns with the organization's needs and supports various data processes effectively.
Data Ecosystem
The various elements that interact with one another in order to produce, manage, store, organize, analyze, and share data.
Data Engineering
Data Engineering is the process of designing, developing, and managing the infrastructure, architecture, and workflows necessary to acquire, store, process, and transform data into a usable format for analysis and consumption by data-driven applications.
Data Engineering Framework
A Data Engineering Framework is a structured approach or methodology that provides guidelines, best practices, and reusable components to design, implement, and manage data pipelines, ETL processes, and data solutions efficiently.
Data Governance
Data governance involves the establishment and enforcement of policies, processes, and standards for data management, ensuring data quality, security, and compliance.
Data Governance Council
A Data Governance Council is a cross-functional group within an organization responsible for defining data governance policies, resolving data-related issues, and driving data-related decisions.
Data Ingestion
Data ingestion is the process of collecting and importing data from various sources into a data system or data pipeline for further processing and analysis.
Data Integration
Data integration is the process of combining data from different sources and formats to provide a unified view for analysis and reporting.
Data Lake
A data lake is a large, centralized storage repository that holds raw, unprocessed, and structured/semi-structured data from various sources, providing a foundation for big data analytics.
Data Lakehouse
A data lakehouse is a modern data architecture that combines the flexibility and scalability of a data lake with the query and analytical capabilities of a data warehouse.
Data Lineage
Data lineage refers to the documentation and tracking of the origin, transformation, and movement of data throughout its lifecycle.
Data Mart
A data mart is a smaller, specialized data repository that focuses on specific business functions or departments, providing a subset of data from the data warehouse for quick and efficient analysis.
Data Mesh
A decentralised, domain-oriented approach to data platform architecture and organisational design, treating data as a product with individual teams responsible for their own domain-specific data products.
Data Modeling
Data modeling is the process of designing the structure, relationships, and constraints of a database or data warehouse to ensure efficient data management and query optimization.
Data Partitioning
Data partitioning is the practice of dividing large datasets into smaller, manageable segments or partitions, which can improve data processing efficiency and query performance.
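A minimal sketch of the idea: group made-up records by a partition key (here, a date) so a query for one day only touches one partition. Real systems apply the same principle at the storage layer.

```python
from collections import defaultdict

# Made-up records; "date" acts as the partition key.
records = [
    {"id": 1, "date": "2024-01-01", "amount": 10},
    {"id": 2, "date": "2024-01-01", "amount": 25},
    {"id": 3, "date": "2024-01-02", "amount": 40},
]

partitions = defaultdict(list)
for record in records:
    partitions[record["date"]].append(record)

# A query for one day now scans a single partition, not everything.
print(partitions["2024-01-02"])
```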
Data Pipeline
A data pipeline is a series of interconnected data processing steps that enable the automated and efficient flow of data from source to destination, typically used for ETL processes and data transformation.
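A minimal sketch of a pipeline as three composed steps, where each stage consumes the previous stage's output; the source rows and the destination file name are made-up assumptions.

```python
import json

def extract():
    # Stand-in for reading from a real source (API, database, file).
    return [{"name": "alice", "score": "10"}, {"name": "bob", "score": "7"}]

def transform(rows):
    # Normalise casing and fix types.
    return [{"name": r["name"].title(), "score": int(r["score"])} for r in rows]

def load(rows, path="scores.json"):  # hypothetical destination
    with open(path, "w") as f:
        json.dump(rows, f)

load(transform(extract()))  # the pipeline: source -> destination
```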
Data Quality
Data quality refers to the accuracy, consistency, completeness, and reliability of data, ensuring that data is fit for its intended purpose.
Data Replication
Data replication involves copying and synchronizing data from one database or system to another, often used for data redundancy, disaster recovery, and scalability.
Data Retention Policy
A Data Retention Policy outlines how long data should be retained and stored before it is archived or deleted, taking into account regulatory, legal, and business requirements.
Data Scalability
Data scalability refers to the ability of a data infrastructure or system to handle growing volumes of data and increased demands without sacrificing performance.
Data Science
A field of study that uses raw data to create new ways of modeling and understanding the unknown.
Data Security
Data security encompasses measures and protocols that protect data from unauthorized access, ensuring confidentiality, integrity, and availability.
Data SerDes (Serialization/Deserialization)
Data serialization is the process of converting data objects into a format suitable for storage or transmission, and deserialization is the reverse process of reconstructing objects from the serialized format.
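A minimal sketch using JSON, one common serialization format; the record is made up.

```python
import json

record = {"user_id": 42, "events": ["login", "search"], "active": True}

serialized = json.dumps(record)    # serialize: object -> JSON string
restored = json.loads(serialized)  # deserialize: JSON string -> object

assert restored == record
print(serialized)
```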
Data Silos
Data Silos refer to isolated storage or repositories of data that are not easily accessible or shareable with other parts of the organization.
In organisations, these silos can arise for various reasons, such as departmental barriers, incompatible systems, proprietary formats, or lack of integration among systems.
Data Strategy
The management of the people, processes, and tools used in data analysis.
Data Streaming
Streaming data refers to continuous and real-time data that is generated and processed as events happen, enabling immediate analysis and response to changing data.
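A minimal sketch contrasting with batch processing: events are handled one at a time as they arrive. The event source is simulated with a generator; a real system would read from a queue or log such as Kafka.

```python
import random
import time

def event_stream(n=5):
    """Simulated event source; stands in for a real queue or log."""
    for i in range(n):
        time.sleep(0.1)  # stand-in for waiting on a real source
        yield {"event_id": i, "value": random.random()}

for event in event_stream():
    # React immediately, e.g. update a running aggregate or fire an alert.
    print("processed", event)
```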
Data Transformation
Data transformation involves converting data from one format or structure to another to align it with the target data model or to make it suitable for analysis.
Data Virtualization
Data virtualization is an approach that allows data to be accessed and integrated from multiple sources in real-time, without the need for physically storing it in a centralized location.
Data Visualization
The graphical representation of data.
Data Warehouse
A data warehouse is a centralized repository that stores structured, historical data from multiple sources, facilitating data analysis, reporting, and business intelligence.
Data Warehousing Solutions
Data Warehousing Solutions refer to technologies, platforms, and tools that facilitate the creation, maintenance, and management of data warehouses for effective data analysis and reporting.
DataOps (Data Operations)
DataOps is a collaborative approach that brings together data engineers, data scientists, and other stakeholders to streamline and automate the end-to-end data lifecycle, including data integration, processing, and analytics.
Data-driven Decision-making
Using facts to guide business strategy.
Database
A collection of data stored in a computer system.
Dataset
A collection of data that can be manipulated or analyzed as one unit.
Dimensional modeling
Dimensional modeling is a data modeling technique where you break data up into "facts" and "dimensions" to organize and describe entities within your data warehouse.
Dimensional modeling is one of many data modeling techniques that are used by data practitioners to organize and present data for analytics. Other data modeling techniques include Data Vault (DV), Third Normal Form (3NF), and One Big Table (OBT) to name a few.
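A minimal star-schema sketch in SQLite: one fact table of measures referencing two dimension tables that describe the entities. All table and column names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        amount      REAL    -- the measure recorded by the fact
    );
""")

# Analytics queries join the fact table to its dimensions:
query = """
    SELECT c.name, d.day, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    JOIN dim_date d     ON d.date_id = f.date_id
    GROUP BY c.name, d.day
"""
```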
E
ETL (Extract, Transform, Load)
ETL is a data integration process that involves extracting data from various sources, transforming it to fit a predefined data model or structure, and loading it into a target database or data warehouse.
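A minimal end-to-end sketch: extract rows from CSV (inlined here so the example is self-contained), transform the types, and load them into SQLite. All names are illustrative.

```python
import csv
import io
import sqlite3

raw = "name,amount\nalice,10.5\nbob,7.25\n"

rows = list(csv.DictReader(io.StringIO(raw)))                   # extract
rows = [(r["name"].title(), float(r["amount"])) for r in rows]  # transform

con = sqlite3.connect(":memory:")                               # load
con.execute("CREATE TABLE payments (name TEXT, amount REAL)")
con.executemany("INSERT INTO payments VALUES (?, ?)", rows)
print(con.execute("SELECT * FROM payments").fetchall())
```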
F
Formula
A set of instructions used to perform a calculation using the data in a spreadsheet.
Function
A preset command that automatically performs a specified process or task using the data in a spreadsheet.
M
Master Data Management (MDM)
Master data management (MDM) is a technology-enabled discipline in which business and IT collaborate to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise's official shared master data assets. Master data comprises the consistent and uniform set of identifiers and extended attributes that describe the core entities of the enterprise, including customers, prospects, citizens, suppliers, sites, hierarchies, and chart of accounts.
Master data management (MDM) helps to create one single master reference source for all critical business data, leading to fewer errors and less redundancy in business processes.
Metadata
Metadata is data about data, providing information about the characteristics, structure, and context of datasets. It helps in understanding, managing, and utilizing data effectively.
O
One-hot encoding
One-hot encoding is a method of converting categorical data to numeric data to prepare it for model training.
BigQuery ML does some of this preprocessing for you, automatically one-hot encoding categorical variables into the numeric form that model training requires.
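A minimal pandas sketch on a made-up categorical column; get_dummies produces one indicator column per category.

```python
import pandas as pd

df = pd.DataFrame({"city": ["london", "paris", "london", "tokyo"]})

# One indicator column per category: city_london, city_paris, city_tokyo.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```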
S
Streaming Data
See Data Streaming.