CASE STUDY #002

CS-002 — Building an Analytics-Ready Retail Dataset with BigQuery

Transforming 1.06M+ Retail Transactions into a Cloud-Based Analytics & Reporting Foundation


Executive Summary

This project demonstrates how a large retail transaction dataset containing more than 1.06 million records was transformed into an analytics-ready dataset using Google BigQuery.

The objective was not simply to create dashboards, but to establish a reliable data foundation capable of supporting reporting, business intelligence, and decision-making. The implementation focused on data ingestion, ETL processing, data validation, KPI development, and analytics dataset preparation.

Using Google BigQuery, multiple source files were consolidated into a centralized cloud-based data warehouse where business metrics could be calculated, validated, and consumed by reporting tools such as Power BI and Looker Studio.

The result was a scalable retail analytics solution capable of transforming raw transactional data into actionable business insights.


Business Problem

Retail businesses often accumulate transaction data across multiple files, systems, and reporting periods. Before meaningful reporting can occur, the data must be consolidated, standardized, and validated.

The Online Retail II dataset contained more than one million retail transaction records spread across two separate reporting periods.

Key Challenges

The project was approached from a data engineering perspective where data quality, transformation logic, and business rules were prioritized before visualization.


Source Data

The project utilized the Online Retail II dataset obtained from the UCI Machine Learning Repository.

Dataset Characteristics

The source workbook contained two worksheets:

Both worksheets were exported to CSV format prior to ingestion into BigQuery.


Solution Architecture

The project followed a layered data engineering architecture designed to separate raw data from transformed business-ready datasets.

Excel Files
     ↓
CSV Files
     ↓
BigQuery Raw Tables
     ↓
Staging Layer
     ↓
Data Validation
     ↓
Analytics Dataset
     ↓
Power BI / Looker Studio
     ↓
Business Insights

This architecture improves maintainability, scalability, and reporting reliability.


Data Ingestion Process

A dedicated BigQuery dataset named:

detleng_retail

was created to host the complete retail analytics solution.

Two raw tables were created:

retail_raw_2009_2010
retail_raw_2010_2011

The original Excel worksheets were converted to CSV files and loaded into their respective BigQuery tables.

This raw layer preserved source records without applying business transformations, ensuring traceability and data integrity throughout the project lifecycle.


Raw Data Layer

The raw layer served as the system of record for all imported transaction data.

Key Objectives

Before transformation activities began, record counts were validated to confirm successful ingestion.

Approximate Source Volumes

2009–2010 Dataset: 525K+ Records
2010–2011 Dataset: 541K+ Records

Combined Dataset:
1,067,371 Records

ETL & Data Transformation

After ingestion, both source datasets were consolidated into a staging table using SQL.

CREATE OR REPLACE TABLE detleng_retail.retail_staging AS

SELECT *
FROM detleng_retail.retail_raw_2009_2010

UNION ALL

SELECT *
FROM detleng_retail.retail_raw_2010_2011;

Additional transformations were applied to create analytics-ready fields.

Revenue Calculation

Revenue = Quantity × Price

Date Standardization

Source date fields were parsed and converted into reporting-friendly formats.

Feature Engineering

The following analytical attributes were created:

These fields simplified downstream reporting and dashboard development.


Data Quality Validation

Data quality validation was performed before KPI development.

Record Count Validation

Total Rows: 1,067,371

Date Range Validation

Start Date: 2009-12-01
End Date: 2011-12-09

Null Customer Validation

243,007 Records

The presence of null customer identifiers reflects guest purchases and anonymous transactions commonly found in retail systems.

Negative Quantity Validation

22,950 Records

These records primarily represented returns, refunds, and order cancellations.

Negative Revenue Validation

19,498 Records

Negative revenue values were expected due to return-related transactions.

The validation process confirmed that the dataset accurately represented real-world retail activity.


Analytics Dataset Design

The staging table was transformed into an analytics-ready dataset capable of supporting executive reporting.

Final Dataset Components

This structure allowed reporting tools to consume clean business-ready data without requiring additional transformation logic.


KPI Development

Business KPIs were developed directly within BigQuery using SQL.

KPI Summary

KPI Value
Total Revenue €20.97M
Total Orders 53,628
Total Customers 5,942
Total Products 5,305
Total Transactions 1,067,371

These KPIs provided a high-level view of retail performance across the entire dataset.


Business Insights

Top Revenue Countries

Rank Country Revenue
1 United Kingdom €17.87M
2 EIRE €664K
3 Netherlands €554K
4 Germany €431K
5 France €357K

Key Insight

The United Kingdom generated more than 85% of total revenue, highlighting a strong concentration of business activity within the domestic market.


Top Revenue Products

Rank Product Revenue
1 REGENCY CAKESTAND 3 TIER €344K
2 Manual €341K
3 DOTCOM POSTAGE €323K
4 WHITE HANGING HEART T-LIGHT HOLDER €267K
5 PAPER CRAFT, LITTLE BIRDIE €168K

Key Insight

Home décor and gift-related products generated a significant share of total revenue, indicating strong customer demand within these product categories.


Revenue Seasonality

Highest-Performing Months

Month Revenue
November €2.98M
December €2.73M
October €2.32M

Key Insight

Quarter 4 represented the strongest sales period, demonstrating the impact of holiday and seasonal purchasing behavior.


Reporting & Dashboard Integration

Once the analytics-ready dataset was validated, it was connected to both Power BI and Looker Studio.

This architecture allowed reporting tools to consume pre-processed business data directly from BigQuery rather than performing transformations locally.

Benefits


Key Outcomes

Data Engineering Deliverables

Analytics Deliverables


Technologies Used


Conclusion

This project demonstrates how cloud-based data engineering practices can transform raw retail transaction data into a reliable analytics foundation.

By implementing a structured ETL workflow within BigQuery, more than 1.06 million transactions were consolidated, validated, and transformed into an analytics-ready dataset capable of supporting reporting, business intelligence, and executive decision-making.

The project reinforces a core principle of modern analytics:

Reliable dashboards begin with reliable data.

At DeTLeng, the focus is not only on building reports but on engineering the data that makes those reports trustworthy.



Step-by-Step Project Execution

The previous sections provided a high-level overview of the solution, architecture, implementation approach, and business outcomes.

The following section documents the complete execution process used to build the retail analytics platform in Google BigQuery.

Each step reflects the actual implementation workflow, including data ingestion, warehouse setup, ETL development, data validation, KPI creation, and analytics preparation.

This execution log serves as a practical reference for understanding how more than 1.06 million retail transactions were transformed into an analytics-ready dataset capable of supporting reporting, business intelligence, and decision-making.

The implementation followed a structured data engineering approach:

Source DataData Ingestion
     ↓
Raw Data Layer
     ↓
ETL Processing
     ↓
Data Validation
     ↓
Analytics Dataset
     ↓
KPI Development
     ↓
Business Insights
     ↓
Power BI / Looker Studio

The sections below document each phase of the project in the order it was executed.


Data Warehouse Setup and Data Ingestion

Following the initial review of the Online Retail II dataset, the next phase focused on establishing a cloud-based data warehouse using Google BigQuery.

The objective was to move beyond spreadsheet-based analysis and create a scalable environment capable of supporting data transformation, validation, reporting, and business intelligence.

Solution Architecture

The project followed a layered data engineering architecture designed to separate raw source data from transformed analytical datasets.

Excel Files
     ↓
BigQuery Raw Tables
     ↓
Staging Layer
     ↓
Analytics Layer
     ↓
SQL Reporting
     ↓
Looker Studio

This structure provides a clear separation between data ingestion, transformation, and reporting activities while supporting future scalability.

BigQuery Dataset Creation

A dedicated dataset was created within Google BigQuery to host all retail-related assets.

Dataset Name:
detleng_retail

This dataset became the central repository for raw tables, staging tables, validation processes, and analytics-ready datasets.

Source File Preparation

The original Online Retail II workbook contained two worksheets representing separate reporting periods.

Year 20092010
Year 20102011

To ensure compatibility with BigQuery, both worksheets were exported to CSV format.

retail_2009_2010.csv
retail_2010_2011.csv

The conversion process enabled structured ingestion into the cloud data warehouse environment.

Raw Data Layer

Two raw tables were created to preserve the original source data.

retail_raw_2009_2010
retail_raw_2010_2011

The raw layer was intentionally designed to retain source records without applying business transformations. This approach ensures traceability and allows future validation against the original files whenever required.

Handling Source Data Types

During the initial import process, a schema validation issue was encountered with the InvoiceDate field.

The source files stored dates using the following format:

13/12/2009 09:58

BigQuery attempted to interpret these values as native timestamps, which resulted in loading errors because the source format did not match BigQuery's default timestamp expectations.

To preserve data integrity and simplify the ETL workflow, the date field was initially imported as a string.

Invoice:STRING
StockCode:STRING
Description:STRING
Quantity:INTEGER
InvoiceDate:STRING
Price:FLOAT
CustomerID:STRING
Country:STRING

This approach allowed date parsing and standardization to be handled later within the transformation layer using SQL.

Data Loading and Validation

After both CSV files were successfully loaded into BigQuery, row-count validation was performed to verify ingestion completeness.

2009–2010 Dataset: 525K+ Records
2010–2011 Dataset: 541K+ Records

The validation confirmed that all source records were successfully imported into the raw layer.

Staging Layer Development

Once the raw tables had been validated, both datasets were consolidated into a single staging table.

This process replicated the append operation commonly performed in Power Query but executed directly within BigQuery using SQL.

CREATE OR REPLACE TABLE
detleng_retail.retail_staging AS

SELECT *
FROM detleng_retail.retail_raw_2009_2010

UNION ALL

SELECT *
FROM detleng_retail.retail_raw_2010_2011;

image

The successful execution of the query created a unified staging table containing all retail transactions from both reporting periods.

This staging layer became the foundation for data transformation, quality validation, KPI development, and reporting activities performed later in the project.

Outcome

At the completion of this phase, the project had successfully established:

With the data warehouse environment fully operational, the project was ready to move into data transformation, quality validation, and KPI development.


Initial Data Validation and Staging Verification

Following the successful creation of the staging table, the next phase focused on validating the consolidated dataset before proceeding with analytics development.

A staging layer should never be assumed to be correct simply because the SQL execution completed successfully. Record counts, data structures, and sample transactions must be reviewed to verify that the ETL process has preserved the source data accurately.

Record Count Validation

The first validation step was to confirm that both source datasets had been successfully merged into the staging table.

The following query was executed:

SELECT COUNT(*) AS TotalRows
FROM detleng_retail.retail_staging;

The result returned approximately 1.06 million records, confirming that the staging layer contained the complete transaction history from both reporting periods.

Total Records:
~1,067,371

This validation confirmed that the UNION ALL operation successfully consolidated the two source tables without data loss.

Sample Data Verification

After validating row counts, a sample of records was reviewed to verify data quality and field consistency.

SELECT *
FROM detleng_retail.retail_staging
LIMIT 10;

image

The sample records confirmed that transaction data had been loaded correctly and that all expected business fields were available within the staging layer.

Key fields included:

Business Data Observations

The sample records revealed several transaction types commonly found in real-world retail environments.

Examples included:

AMAZON FEE
POSTAGE
Manual Adjustments
Bad Debt Adjustments

In addition, negative quantity values were present within the dataset.

-1
-2

These records are important because they represent legitimate business activities such as:

Rather than treating these records as errors, they were preserved within the dataset to maintain an accurate representation of business operations.

ETL Validation Outcome

The validation process confirmed that:

Data Engineering Progress

At this stage of the project, the following components had been completed:

✅ BigQuery Environment Setup

✅ Dataset Creation

✅ CSV Data Ingestion

✅ Raw Data Layer

✅ Schema Configuration

✅ ETL Consolidation Process

✅ Staging Layer Development

✅ Record Count Validation

✅ Sample Data Verification

The project had now progressed beyond simple data loading and entered the data validation phase, providing a trusted foundation for analytics engineering, KPI development, and business intelligence reporting.


Data Quality Validation

After confirming that the staging table contained the complete transaction history, a series of data quality validation checks were performed to assess the reliability of the dataset before KPI development and analytics modeling.

The objective of this phase was not to remove records, but to understand the characteristics of the data and identify business scenarios that could influence reporting results.

Record Count Validation

The first validation step confirmed the total number of records available in the staging layer.

SELECT COUNT(*) AS TotalRows
FROM detleng_retail.retail_staging;

image

Result:

Total Rows: 1,067,371

This result confirmed that the complete transaction history from both source datasets had been successfully consolidated into the staging layer.

Date Range Validation

The next step was to verify the reporting period covered by the dataset.

SELECT
  MIN(InvoiceDate) AS StartDate,
  MAX(InvoiceDate) AS EndDate
FROM detleng_retail.retail_staging;

image

Result:

Start Date: 2009-12-01
End Date: 2011-12-09

The validation confirmed that the dataset covered approximately two years of retail activity, matching the expected reporting period of the Online Retail II dataset.

Customer Data Completeness

Customer identifiers were reviewed to determine the presence of missing customer records.

SELECT COUNT(*) AS NullCustomerID
FROM detleng_retail.retail_staging
WHERE CustomerID IS NULL;

image

Result:

Null Customer IDs: 243,007

Business Interpretation

This is a known characteristic of the Online Retail II dataset and does not necessarily indicate poor data quality.

These records may represent:

The records were retained because they still contribute to revenue and transaction analysis.

Negative Quantity Analysis

Retail datasets commonly contain return and cancellation transactions. To validate their presence, quantity values were analyzed.

SELECT COUNT(*) AS NegativeQuantity
FROM detleng_retail.retail_staging
WHERE Quantity < 0;

image

Result:

Negative Quantity Records: 22,950

Business Interpretation

Negative quantities typically represent:

These records were preserved because they reflect legitimate business activity and are essential for accurate revenue reporting.

Negative Revenue Analysis

Revenue validation was performed to identify transactions generating negative sales values.

SELECT COUNT(*) AS NegativeRevenue
FROM detleng_retail.retail_staging
WHERE Revenue < 0;

image

Result:

Negative Revenue Records: 19,498

Business Interpretation

Revenue was calculated using:

Revenue = Quantity × Price

When quantity values are negative, revenue values also become negative.

These records primarily represent:

The presence of negative revenue confirms that the dataset captures both sales activity and post-sale business events.

Validation Summary

The data quality assessment confirmed that the dataset accurately represents real-world retail operations.

Validation Check Result
Total Rows 1,067,371
Date Range 2009-12-01 to 2011-12-09
Null Customer IDs 243,007
Negative Quantity Records 22,950
Negative Revenue Records 19,498

Outcome

The validation process confirmed that the staging layer was complete, consistent, and suitable for analytics development.

Rather than treating every anomaly as an error, the project focused on understanding the business meaning behind each data pattern. This approach ensured that returns, refunds, anonymous purchases, and other real-world retail events remained available for analysis.

With data quality validation completed, the project moved to the next phase: developing an analytics-ready dataset and building business KPIs using SQL.


Analytics Layer Development

With data ingestion, transformation, and validation successfully completed, the project moved into the analytics layer.

The objective of this phase was to transform validated transaction data into business-ready KPIs that could support executive reporting, dashboard development, and decision-making.

Unlike traditional dashboard-first approaches, the KPI logic was developed directly within BigQuery, ensuring that business calculations remained centralized, reusable, and consistent across reporting platforms.

Core Business KPIs

Several foundational business metrics were created using SQL.

Total Revenue

SELECT
ROUND(SUM(Revenue),2) AS TotalRevenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0;

Total Orders

SELECT
COUNT(DISTINCT Invoice) AS TotalOrders
FROM detleng_retail.retail_staging;

Total Customers

SELECT
COUNT(DISTINCT CustomerID) AS TotalCustomers
FROM detleng_retail.retail_staging
WHERE CustomerID IS NOT NULL;

Total Products

SELECT
COUNT(DISTINCT StockCode) AS TotalProducts
FROM detleng_retail.retail_staging;

KPI Results

The resulting metrics provided an executive-level summary of the retail business.

KPI Result
Total Revenue €20,972,968.14
Total Orders 53,628
Total Customers 5,942
Total Products 5,305

These results closely aligned with the previously developed Power BI solution, confirming the consistency and reliability of the BigQuery implementation.

Analytics Data Mart Structure

To support future reporting requirements, KPI outputs and analytical datasets were organized within the BigQuery environment.

02

The final structure contained:

detleng_retail
│
├── retail_raw_2009_2010
├── retail_raw_2010_2011
├── retail_staging
│
├── Total Revenue
├── Total Orders
├── Total Customers
├── Total Products
│
├── Revenue by Country
├── Revenue by Month
└── Revenue by Year

This structure separated source data from business reporting datasets and provided a clean foundation for analytics consumption.

Revenue Analysis by Country

To understand geographical sales performance, revenue was aggregated at the country level.

SELECT
Country,
ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY Country
ORDER BY Revenue DESC;

This analysis identified the highest-performing markets and highlighted revenue concentration across countries.

Revenue Analysis by Year

Annual revenue trends were calculated to evaluate overall business growth across the reporting period.

SELECT
EXTRACT(YEAR FROM InvoiceDate) AS Year,
ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY Year
ORDER BY Year;

The resulting dataset provided a year-over-year revenue view suitable for executive reporting and trend analysis.

Revenue Analysis by Month

Monthly revenue aggregation was developed to identify seasonality and purchasing patterns.

SELECT
FORMAT_DATE('%B', DATE(InvoiceDate)) AS MonthName,
ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY MonthName
ORDER BY Revenue DESC;

image

The analysis revealed clear seasonal trends, with November, December, and October generating the highest revenue levels across the dataset.

Business Insights

The analytics layer produced several valuable business observations:

Outcome

The analytics layer successfully transformed raw transaction data into business-ready metrics that could be consumed by reporting platforms such as Power BI and Looker Studio.

At this stage, the project had completed:

✅ Data Warehouse Setup

✅ Raw Data Layer

✅ ETL Processing

✅ Data Quality Validation

✅ KPI Development

✅ Revenue Analysis

✅ Customer Analysis

✅ Product Analysis

✅ Geographic Analysis

✅ Time-Series Analysis

The project had now evolved from a simple retail dataset into a fully operational analytics-ready data platform built on Google BigQuery.


Business Insights

After developing the core KPIs and analytical datasets, the final phase focused on identifying meaningful business insights from the retail transaction data.

The objective was not only to calculate metrics but also to understand customer behavior, product performance, geographic trends, and seasonal sales patterns.

Top Revenue-Generating Countries

The following query was used to identify the highest-performing markets.

SELECT
    Country,
    ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY Country
ORDER BY Revenue DESC
LIMIT 10;

image

Top Countries by Revenue

Rank Country Revenue
1 United Kingdom €17.87M
2 EIRE €664K
3 Netherlands €554K
4 Germany €431K
5 France €357K

Geographic Insights

Several important patterns emerged from the country-level analysis:

These findings demonstrate the importance of geographic segmentation when evaluating business growth opportunities.


Top Revenue-Generating Products

To identify the most valuable products within the catalog, revenue was aggregated at the product level.

SELECT
    Description,
    ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY Description
ORDER BY Revenue DESC
LIMIT 10;

image

image

Top Products by Revenue

Rank Product Revenue
1 REGENCY CAKESTAND 3 TIER €344K
2 Manual €341K
3 DOTCOM POSTAGE €323K
4 WHITE HANGING HEART T-LIGHT HOLDER €267K
5 PAPER CRAFT, LITTLE BIRDIE €168K

Product Performance Insights

The product-level analysis highlighted several interesting trends:

These findings provide valuable input for merchandising, inventory planning, and product portfolio optimization.


Monthly revenue analysis revealed strong seasonal behavior across the reporting period.

Month Revenue
November €2.98M
December €2.73M
October €2.32M
September €1.98M
March €1.55M

Seasonal Insights

The monthly analysis identified clear purchasing patterns:

Understanding these seasonal patterns can support future forecasting, inventory management, and marketing campaign planning.


Business Value Delivered

This project transformed more than 1.06 million retail transactions into a structured, analytics-ready dataset using Google BigQuery.

The solution established a complete data engineering workflow covering:

Key business outcomes included:

Most importantly, the project demonstrated how properly engineered data foundations create more reliable reporting and decision-making environments.

This aligns directly with DeTLeng's core philosophy:

We don't just build dashboards. We engineer the data that makes dashboards trustworthy.




Ready for Any Reporting Platform

The final analytics-ready dataset was engineered to support Power BI, Looker Studio, Excel, Tableau, and any downstream reporting platform.

One trusted data foundation. Unlimited reporting possibilities.


👉 Get Started with DeTLeng 👈



DATA TRANSFORM ENGINEERING
👉 Free Consultation | DeTLeng 👈