Discogs Lakehouse

A local-first analytical lakehouse for Discogs data, built around run-based execution, orchestrated pipelines and immutable Parquet snapshots.


What this project is

Fully local lakehouse (no cloud services)
Discogs snapshot dumps processed as deterministic pipeline runs
Pipeline orchestration with explicit phases and failure handling
Typed Parquet datasets designed for safe joins and reproducible analytics

Architecture

Storage: external immutable Parquet files
Metadata: Hive Metastore (Postgres-backed)
Compute: Trino (stateless)
Orchestration: Digdag workflows governing pipeline execution

Pipelines are executed as ordered workflows with explicit dependencies and guarded promotion.

Pipeline design

Each execution generates a unique, immutable run directory
Run identifiers are computed once and propagated to all tasks
Ingest, build, test and promotion are isolated pipeline phases
Promotion is allowed only after all validations succeed
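
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual Digdag workflow: the names new_run_id and execute_run, the runs/ directory layout and the run identifier format are all assumptions made for the example.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path


def new_run_id() -> str:
    """Compute the run identifier once (hypothetical format: UTC stamp + random suffix)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{uuid.uuid4().hex[:8]}"


def execute_run(base_dir: Path, phases, promote) -> str:
    """Run the phases in order inside a fresh run directory, then promote.

    `phases` is a list of (name, callable) pairs; each callable receives the
    run directory and the shared run identifier. Any exception aborts the run
    before `promote` is reached, so a failed run is never published.
    """
    run_id = new_run_id()
    run_dir = base_dir / "runs" / run_id
    # exist_ok=False: an existing run directory is never reused or overwritten.
    run_dir.mkdir(parents=True, exist_ok=False)

    for _name, phase in phases:
        phase(run_dir, run_id)  # the same run_id is passed to every task

    # Reached only if every phase, including tests, returned without error.
    promote(run_dir, run_id)
    return run_id
```

Passing the identifier as an argument, rather than letting each task derive it, is what keeps ingest, build, test and promotion pointed at the same directory.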

Data philosophy

Canonical datasets preserve Discogs data faithfully
Identifiers are typed wherever possible
Ambiguous fields remain textual to avoid false precision
Interpretation is applied explicitly at the analytical level
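
As an illustration of these principles, the sketch below types an identifier strictly and leaves a date-like field as text until an analysis asks for a year. The field name released, the sample values and both function names are assumptions chosen for the example, not the project's schema.

```python
import re
from typing import Optional

_LEADING_YEAR = re.compile(r"^(\d{4})")


def parse_release_id(raw: str) -> int:
    """Identifiers are typed: a malformed value raises ValueError instead of passing through."""
    return int(raw)


def release_year(released: Optional[str]) -> Optional[int]:
    """Analytical interpretation of a textual date field (hypothetical `released`).

    The canonical value stays a string such as "1999-00-00" or "1999";
    this helper extracts a leading four-digit year only when one is present.
    """
    if not released:
        return None
    match = _LEADING_YEAR.match(released.strip())
    if match is None:
        return None
    year = int(match.group(1))
    return year if year > 0 else None
```

Storing the field as a date would force a month and day that the source never stated; keeping the text and deriving the year on demand avoids that false precision.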

Reproducibility & audit

No dataset is ever overwritten
Publishing is atomic via an active pointer switch
Failed runs never affect consumers
Each run produces permanent validation and sanity reports
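
One common way to get an atomic pointer switch on a local filesystem is to write the new target to a temporary file and rename it over the pointer. The sketch below assumes the pointer is a small text file holding the active run identifier; the project may implement the switch differently, and the names publish and active_run are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def publish(pointer: Path, run_id: str) -> None:
    """Repoint consumers to `run_id` by atomically replacing the pointer file."""
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem, which is what makes os.replace atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(dir=pointer.parent, prefix=".active-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(run_id + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, pointer)
    except BaseException:
        # Clean up the temporary file; the previous pointer is left untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def active_run(pointer: Path) -> str:
    """Return the run identifier that consumers should currently read."""
    return pointer.read_text(encoding="utf-8").strip()
```

Readers see either the old run identifier or the new one, never a partial write, and the run directories themselves are never modified by publishing.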

Repositories

Notes

Discogs data is subject to Discogs licensing. This project focuses on data engineering architecture, orchestration and reproducibility, and does not redistribute datasets.