Transforming 1.06M+ Retail Transactions into a Cloud-Based Analytics & Reporting Foundation
This project demonstrates how a large retail transaction dataset containing more than 1.06 million records was transformed into an analytics-ready dataset using Google BigQuery.
The objective was not simply to create dashboards, but to establish a reliable data foundation capable of supporting reporting, business intelligence, and decision-making. The implementation focused on data ingestion, ETL processing, data validation, KPI development, and analytics dataset preparation.
Using Google BigQuery, multiple source files were consolidated into a centralized cloud-based data warehouse where business metrics could be calculated, validated, and consumed by reporting tools such as Power BI and Looker Studio.
The result was a scalable retail analytics solution capable of transforming raw transactional data into actionable business insights.
Retail businesses often accumulate transaction data across multiple files, systems, and reporting periods. Before meaningful reporting can occur, the data must be consolidated, standardized, and validated.
The Online Retail II dataset contained more than one million retail transaction records spread across two separate reporting periods.
The project was approached from a data engineering perspective where data quality, transformation logic, and business rules were prioritized before visualization.
The project utilized the Online Retail II dataset obtained from the UCI Machine Learning Repository.
The source workbook contained two worksheets:
Both worksheets were exported to CSV format prior to ingestion into BigQuery.
The project followed a layered data engineering architecture designed to separate raw data from transformed business-ready datasets.
Excel Files
↓
CSV Files
↓
BigQuery Raw Tables
↓
Staging Layer
↓
Data Validation
↓
Analytics Dataset
↓
Power BI / Looker Studio
↓
Business Insights
This architecture improves maintainability, scalability, and reporting reliability.
A dedicated BigQuery dataset named:
detleng_retail
was created to host the complete retail analytics solution.
Two raw tables were created:
retail_raw_2009_2010
retail_raw_2010_2011
The original Excel worksheets were converted to CSV files and loaded into their respective BigQuery tables.
This raw layer preserved source records without applying business transformations, ensuring traceability and data integrity throughout the project lifecycle.
The raw layer served as the system of record for all imported transaction data.
Before transformation activities began, record counts were validated to confirm successful ingestion.
2009–2010 Dataset: 525K+ Records
2010–2011 Dataset: 541K+ Records
Combined Dataset:
1,067,371 Records
After ingestion, both source datasets were consolidated into a staging table using SQL.
CREATE OR REPLACE TABLE detleng_retail.retail_staging AS
SELECT *
FROM detleng_retail.retail_raw_2009_2010
UNION ALL
SELECT *
FROM detleng_retail.retail_raw_2010_2011;
Additional transformations were applied to create analytics-ready fields.
Revenue = Quantity × Price
Source date fields were parsed and converted into reporting-friendly formats.
The following analytical attributes were created:
These fields simplified downstream reporting and dashboard development.
Data quality validation was performed before KPI development.
Total Rows: 1,067,371
Start Date: 2009-12-01
End Date: 2011-12-09
243,007 Records
The presence of null customer identifiers reflects guest purchases and anonymous transactions commonly found in retail systems.
22,950 Records
These records primarily represented returns, refunds, and order cancellations.
19,498 Records
Negative revenue values were expected due to return-related transactions.
The validation process confirmed that the dataset accurately represented real-world retail activity.
The staging table was transformed into an analytics-ready dataset capable of supporting executive reporting.
This structure allowed reporting tools to consume clean business-ready data without requiring additional transformation logic.
Business KPIs were developed directly within BigQuery using SQL.
| KPI | Value |
|---|---|
| Total Revenue | €20.97M |
| Total Orders | 53,628 |
| Total Customers | 5,942 |
| Total Products | 5,305 |
| Total Transactions | 1,067,371 |
These KPIs provided a high-level view of retail performance across the entire dataset.
| Rank | Country | Revenue |
|---|---|---|
| 1 | United Kingdom | €17.87M |
| 2 | EIRE | €664K |
| 3 | Netherlands | €554K |
| 4 | Germany | €431K |
| 5 | France | €357K |
The United Kingdom generated more than 85% of total revenue, highlighting a strong concentration of business activity within the domestic market.
| Rank | Product | Revenue |
|---|---|---|
| 1 | REGENCY CAKESTAND 3 TIER | €344K |
| 2 | Manual | €341K |
| 3 | DOTCOM POSTAGE | €323K |
| 4 | WHITE HANGING HEART T-LIGHT HOLDER | €267K |
| 5 | PAPER CRAFT, LITTLE BIRDIE | €168K |
Home décor and gift-related products generated a significant share of total revenue, indicating strong customer demand within these product categories.
| Month | Revenue |
|---|---|
| November | €2.98M |
| December | €2.73M |
| October | €2.32M |
Quarter 4 represented the strongest sales period, demonstrating the impact of holiday and seasonal purchasing behavior.
Once the analytics-ready dataset was validated, it was connected to both Power BI and Looker Studio.
This architecture allowed reporting tools to consume pre-processed business data directly from BigQuery rather than performing transformations locally.
This project demonstrates how cloud-based data engineering practices can transform raw retail transaction data into a reliable analytics foundation.
By implementing a structured ETL workflow within BigQuery, more than 1.06 million transactions were consolidated, validated, and transformed into an analytics-ready dataset capable of supporting reporting, business intelligence, and executive decision-making.
The project reinforces a core principle of modern analytics:
Reliable dashboards begin with reliable data.
At DeTLeng, the focus is not only on building reports but on engineering the data that makes those reports trustworthy.
The previous sections provided a high-level overview of the solution, architecture, implementation approach, and business outcomes.
The following section documents the complete execution process used to build the retail analytics platform in Google BigQuery.
Each step reflects the actual implementation workflow, including data ingestion, warehouse setup, ETL development, data validation, KPI creation, and analytics preparation.
This execution log serves as a practical reference for understanding how more than 1.06 million retail transactions were transformed into an analytics-ready dataset capable of supporting reporting, business intelligence, and decision-making.
The implementation followed a structured data engineering approach:
Source Data
↓
Data Ingestion
↓
Raw Data Layer
↓
ETL Processing
↓
Data Validation
↓
Analytics Dataset
↓
KPI Development
↓
Business Insights
↓
Power BI / Looker Studio
The sections below document each phase of the project in the order it was executed.
Following the initial review of the Online Retail II dataset, the next phase focused on establishing a cloud-based data warehouse using Google BigQuery.
The objective was to move beyond spreadsheet-based analysis and create a scalable environment capable of supporting data transformation, validation, reporting, and business intelligence.
The project followed a layered data engineering architecture designed to separate raw source data from transformed analytical datasets.
Excel Files
↓
BigQuery Raw Tables
↓
Staging Layer
↓
Analytics Layer
↓
SQL Reporting
↓
Looker Studio
This structure provides a clear separation between data ingestion, transformation, and reporting activities while supporting future scalability.
A dedicated dataset was created within Google BigQuery to host all retail-related assets.
Dataset Name:
detleng_retail
This dataset became the central repository for raw tables, staging tables, validation processes, and analytics-ready datasets.
The original Online Retail II workbook contained two worksheets representing separate reporting periods.
Year 2009–2010
Year 2010–2011
To ensure compatibility with BigQuery, both worksheets were exported to CSV format.
retail_2009_2010.csv
retail_2010_2011.csv
The conversion process enabled structured ingestion into the cloud data warehouse environment.
Two raw tables were created to preserve the original source data.
retail_raw_2009_2010
retail_raw_2010_2011
The raw layer was intentionally designed to retain source records without applying business transformations. This approach ensures traceability and allows future validation against the original files whenever required.
During the initial import process, a schema validation issue was encountered with the InvoiceDate field.
The source files stored dates using the following format:
13/12/2009 09:58
BigQuery attempted to interpret these values as native timestamps, which resulted in loading errors because the source format did not match BigQuery's default timestamp expectations.
To preserve data integrity and simplify the ETL workflow, the date field was initially imported as a string.
Invoice:STRING
StockCode:STRING
Description:STRING
Quantity:INTEGER
InvoiceDate:STRING
Price:FLOAT
CustomerID:STRING
Country:STRING
This approach allowed date parsing and standardization to be handled later within the transformation layer using SQL.
After both CSV files were successfully loaded into BigQuery, row-count validation was performed to verify ingestion completeness.
2009–2010 Dataset: 525K+ Records
2010–2011 Dataset: 541K+ Records
The validation confirmed that all source records were successfully imported into the raw layer.
Once the raw tables had been validated, both datasets were consolidated into a single staging table.
This process replicated the append operation commonly performed in Power Query but executed directly within BigQuery using SQL.
CREATE OR REPLACE TABLE
detleng_retail.retail_staging AS
SELECT *
FROM detleng_retail.retail_raw_2009_2010
UNION ALL
SELECT *
FROM detleng_retail.retail_raw_2010_2011;
The successful execution of the query created a unified staging table containing all retail transactions from both reporting periods.
This staging layer became the foundation for data transformation, quality validation, KPI development, and reporting activities performed later in the project.
At the completion of this phase, the project had successfully established:
With the data warehouse environment fully operational, the project was ready to move into data transformation, quality validation, and KPI development.
Following the successful creation of the staging table, the next phase focused on validating the consolidated dataset before proceeding with analytics development.
A staging layer should never be assumed to be correct simply because the SQL execution completed successfully. Record counts, data structures, and sample transactions must be reviewed to verify that the ETL process has preserved the source data accurately.
The first validation step was to confirm that both source datasets had been successfully merged into the staging table.
The following query was executed:
SELECT COUNT(*) AS TotalRows
FROM detleng_retail.retail_staging;
The result returned approximately 1.06 million records, confirming that the staging layer contained the complete transaction history from both reporting periods.
Total Records:
~1,067,371
This validation confirmed that the UNION ALL operation successfully consolidated the two source tables without data loss.
After validating row counts, a sample of records was reviewed to verify data quality and field consistency.
SELECT *
FROM detleng_retail.retail_staging
LIMIT 10;
The sample records confirmed that transaction data had been loaded correctly and that all expected business fields were available within the staging layer.
Key fields included:
The sample records revealed several transaction types commonly found in real-world retail environments.
Examples included:
AMAZON FEE
POSTAGE
Manual Adjustments
Bad Debt Adjustments
In addition, negative quantity values were present within the dataset.
-1
-2
These records are important because they represent legitimate business activities such as:
Rather than treating these records as errors, they were preserved within the dataset to maintain an accurate representation of business operations.
The validation process confirmed that:
At this stage of the project, the following components had been completed:
✅ BigQuery Environment Setup
✅ Dataset Creation
✅ CSV Data Ingestion
✅ Raw Data Layer
✅ Schema Configuration
✅ ETL Consolidation Process
✅ Staging Layer Development
✅ Record Count Validation
✅ Sample Data Verification
The project had now progressed beyond simple data loading and entered the data validation phase, providing a trusted foundation for analytics engineering, KPI development, and business intelligence reporting.
After confirming that the staging table contained the complete transaction history, a series of data quality validation checks were performed to assess the reliability of the dataset before KPI development and analytics modeling.
The objective of this phase was not to remove records, but to understand the characteristics of the data and identify business scenarios that could influence reporting results.
The first validation step confirmed the total number of records available in the staging layer.
SELECT COUNT(*) AS TotalRows
FROM detleng_retail.retail_staging;
Result:
Total Rows: 1,067,371
This result confirmed that the complete transaction history from both source datasets had been successfully consolidated into the staging layer.
The next step was to verify the reporting period covered by the dataset.
SELECT
MIN(InvoiceDate) AS StartDate,
MAX(InvoiceDate) AS EndDate
FROM detleng_retail.retail_staging;
Result:
Start Date: 2009-12-01
End Date: 2011-12-09
The validation confirmed that the dataset covered approximately two years of retail activity, matching the expected reporting period of the Online Retail II dataset.
Customer identifiers were reviewed to determine the presence of missing customer records.
SELECT COUNT(*) AS NullCustomerID
FROM detleng_retail.retail_staging
WHERE CustomerID IS NULL;
Result:
Null Customer IDs: 243,007
This is a known characteristic of the Online Retail II dataset and does not necessarily indicate poor data quality.
These records may represent:
The records were retained because they still contribute to revenue and transaction analysis.
Retail datasets commonly contain return and cancellation transactions. To validate their presence, quantity values were analyzed.
SELECT COUNT(*) AS NegativeQuantity
FROM detleng_retail.retail_staging
WHERE Quantity < 0;
Result:
Negative Quantity Records: 22,950
Negative quantities typically represent:
These records were preserved because they reflect legitimate business activity and are essential for accurate revenue reporting.
Revenue validation was performed to identify transactions generating negative sales values.
SELECT COUNT(*) AS NegativeRevenue
FROM detleng_retail.retail_staging
WHERE Revenue < 0;
Result:
Negative Revenue Records: 19,498
Revenue was calculated using:
Revenue = Quantity × Price
When quantity values are negative, revenue values also become negative.
These records primarily represent:
The presence of negative revenue confirms that the dataset captures both sales activity and post-sale business events.
The data quality assessment confirmed that the dataset accurately represents real-world retail operations.
| Validation Check | Result |
|---|---|
| Total Rows | 1,067,371 |
| Date Range | 2009-12-01 to 2011-12-09 |
| Null Customer IDs | 243,007 |
| Negative Quantity Records | 22,950 |
| Negative Revenue Records | 19,498 |
The validation process confirmed that the staging layer was complete, consistent, and suitable for analytics development.
Rather than treating every anomaly as an error, the project focused on understanding the business meaning behind each data pattern. This approach ensured that returns, refunds, anonymous purchases, and other real-world retail events remained available for analysis.
With data quality validation completed, the project moved to the next phase: developing an analytics-ready dataset and building business KPIs using SQL.
With data ingestion, transformation, and validation successfully completed, the project moved into the analytics layer.
The objective of this phase was to transform validated transaction data into business-ready KPIs that could support executive reporting, dashboard development, and decision-making.
Unlike traditional dashboard-first approaches, the KPI logic was developed directly within BigQuery, ensuring that business calculations remained centralized, reusable, and consistent across reporting platforms.
Several foundational business metrics were created using SQL.
SELECT
ROUND(SUM(Revenue),2) AS TotalRevenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0;
SELECT
COUNT(DISTINCT Invoice) AS TotalOrders
FROM detleng_retail.retail_staging;
SELECT
COUNT(DISTINCT CustomerID) AS TotalCustomers
FROM detleng_retail.retail_staging
WHERE CustomerID IS NOT NULL;
SELECT
COUNT(DISTINCT StockCode) AS TotalProducts
FROM detleng_retail.retail_staging;
The resulting metrics provided an executive-level summary of the retail business.
| KPI | Result |
|---|---|
| Total Revenue | €20,972,968.14 |
| Total Orders | 53,628 |
| Total Customers | 5,942 |
| Total Products | 5,305 |
These results closely aligned with the previously developed Power BI solution, confirming the consistency and reliability of the BigQuery implementation.
To support future reporting requirements, KPI outputs and analytical datasets were organized within the BigQuery environment.
The final structure contained:
detleng_retail
│
├── retail_raw_2009_2010
├── retail_raw_2010_2011
├── retail_staging
│
├── Total Revenue
├── Total Orders
├── Total Customers
├── Total Products
│
├── Revenue by Country
├── Revenue by Month
└── Revenue by Year
This structure separated source data from business reporting datasets and provided a clean foundation for analytics consumption.
To understand geographical sales performance, revenue was aggregated at the country level.
SELECT
Country,
ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY Country
ORDER BY Revenue DESC;
This analysis identified the highest-performing markets and highlighted revenue concentration across countries.
Annual revenue trends were calculated to evaluate overall business growth across the reporting period.
SELECT
EXTRACT(YEAR FROM InvoiceDate) AS Year,
ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY Year
ORDER BY Year;
The resulting dataset provided a year-over-year revenue view suitable for executive reporting and trend analysis.
Monthly revenue aggregation was developed to identify seasonality and purchasing patterns.
SELECT
FORMAT_DATE('%B', DATE(InvoiceDate)) AS MonthName,
ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY MonthName
ORDER BY Revenue DESC;
The analysis revealed clear seasonal trends, with November, December, and October generating the highest revenue levels across the dataset.
The analytics layer produced several valuable business observations:
The analytics layer successfully transformed raw transaction data into business-ready metrics that could be consumed by reporting platforms such as Power BI and Looker Studio.
At this stage, the project had completed:
✅ Data Warehouse Setup
✅ Raw Data Layer
✅ ETL Processing
✅ Data Quality Validation
✅ KPI Development
✅ Revenue Analysis
✅ Customer Analysis
✅ Product Analysis
✅ Geographic Analysis
✅ Time-Series Analysis
The project had now evolved from a simple retail dataset into a fully operational analytics-ready data platform built on Google BigQuery.
After developing the core KPIs and analytical datasets, the final phase focused on identifying meaningful business insights from the retail transaction data.
The objective was not only to calculate metrics but also to understand customer behavior, product performance, geographic trends, and seasonal sales patterns.
The following query was used to identify the highest-performing markets.
SELECT
Country,
ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY Country
ORDER BY Revenue DESC
LIMIT 10;
| Rank | Country | Revenue |
|---|---|---|
| 1 | United Kingdom | €17.87M |
| 2 | EIRE | €664K |
| 3 | Netherlands | €554K |
| 4 | Germany | €431K |
| 5 | France | €357K |
Several important patterns emerged from the country-level analysis:
These findings demonstrate the importance of geographic segmentation when evaluating business growth opportunities.
To identify the most valuable products within the catalog, revenue was aggregated at the product level.
SELECT
Description,
ROUND(SUM(Revenue),2) AS Revenue
FROM detleng_retail.retail_staging
WHERE Revenue > 0
GROUP BY Description
ORDER BY Revenue DESC
LIMIT 10;
| Rank | Product | Revenue |
|---|---|---|
| 1 | REGENCY CAKESTAND 3 TIER | €344K |
| 2 | Manual | €341K |
| 3 | DOTCOM POSTAGE | €323K |
| 4 | WHITE HANGING HEART T-LIGHT HOLDER | €267K |
| 5 | PAPER CRAFT, LITTLE BIRDIE | €168K |
The product-level analysis highlighted several interesting trends:
These findings provide valuable input for merchandising, inventory planning, and product portfolio optimization.
Monthly revenue analysis revealed strong seasonal behavior across the reporting period.
| Month | Revenue |
|---|---|
| November | €2.98M |
| December | €2.73M |
| October | €2.32M |
| September | €1.98M |
| March | €1.55M |
The monthly analysis identified clear purchasing patterns:
Understanding these seasonal patterns can support future forecasting, inventory management, and marketing campaign planning.
This project transformed more than 1.06 million retail transactions into a structured, analytics-ready dataset using Google BigQuery.
The solution established a complete data engineering workflow covering:
Key business outcomes included:
Most importantly, the project demonstrated how properly engineered data foundations create more reliable reporting and decision-making environments.
This aligns directly with DeTLeng's core philosophy:
We don't just build dashboards. We engineer the data that makes dashboards trustworthy.
The final analytics-ready dataset was engineered to support Power BI, Looker Studio, Excel, Tableau, and any downstream reporting platform.
One trusted data foundation. Unlimited reporting possibilities.