Discogs Lakehouse
A local-first analytical lakehouse for Discogs data, built around run-based execution, orchestrated pipelines and immutable Parquet snapshots.
What this project is
- Fully local lakehouse (no cloud services)
- Discogs snapshot dumps processed as deterministic pipeline runs
- Pipeline orchestration with explicit phases and failure handling
- Typed Parquet datasets designed for safe joins and reproducible analytics
Architecture
- Storage: external immutable Parquet files
- Metadata: Hive Metastore (Postgres-backed)
- Compute: Trino (stateless)
- Orchestration: Digdag workflows governing pipeline execution
Pipelines are executed as ordered workflows with explicit dependencies and guarded promotion.
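As a concrete illustration of how these pieces could be wired together locally, here is a hypothetical docker-compose sketch. Service names, images, ports, and paths are illustrative assumptions, not this repository's actual configuration:

```yaml
# Hypothetical local stack: Postgres backs the Hive Metastore,
# Trino stays stateless, and Parquet files are mounted read-only.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: metastore
      POSTGRES_USER: hive
      POSTGRES_PASSWORD: hive
  metastore:
    image: apache/hive:4.0.0        # standalone Hive Metastore service
    depends_on: [postgres]
  trino:
    image: trinodb/trino:latest     # stateless; resolves table locations via the metastore
    ports: ["8080:8080"]
    depends_on: [metastore]
    volumes:
      - ./warehouse:/data/warehouse:ro   # immutable Parquet snapshots, read-only
```

Keeping compute stateless and storage read-only means any component can be destroyed and recreated without risking the data.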
Pipeline design
- Each execution generates a unique, immutable run directory
- Run identifiers are computed once and propagated to all tasks
- Ingest, build, test and promotion are isolated pipeline phases
- Promotion is allowed only after all validations succeed
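The run model above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's Digdag workflow: phase names and the `run_pipeline` helper are hypothetical, but the invariants match the list above (run id computed once, ordered phases, promotion gated on validation).

```python
import hashlib
import time

def make_run_id(snapshot_name: str, started_at: float) -> str:
    """Derive the unique, immutable run identifier exactly once, up front."""
    digest = hashlib.sha256(f"{snapshot_name}:{started_at}".encode()).hexdigest()
    return f"run_{digest[:12]}"

def run_pipeline(snapshot_name: str, checks=()) -> dict:
    # The run id is computed once and carried through every phase.
    state = {"run_id": make_run_id(snapshot_name, time.time()),
             "phases": [], "promoted": False}

    for phase in ("ingest", "build", "test"):   # isolated, ordered phases
        state["phases"].append(phase)

    # Promotion happens only if every validation check passes.
    if checks and all(check(state) for check in checks):
        state["phases"].append("promote")
        state["promoted"] = True
    return state

passing = run_pipeline("discogs_snapshot", checks=(lambda s: True,))
failing = run_pipeline("discogs_snapshot", checks=(lambda s: False,))
```

A failed validation simply means the `promote` phase never runs, so the run directory is left in place for inspection but is never published.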
Data philosophy
- Canonical datasets preserve Discogs data faithfully
- Identifiers are typed wherever possible
- Ambiguous fields remain textual to avoid false precision
- Interpretation is applied explicitly at the analytical layer
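To make the typing rule concrete, here is an illustrative sketch using hypothetical field names modeled on Discogs release records: unambiguous identifiers are cast to integers for safe joins, while fields such as the release date stay textual, since Discogs dates are frequently partial (e.g. `"1999"` or `"1999-03-00"`) and forcing them into a date type would invent precision.

```python
def to_canonical(raw: dict) -> dict:
    """Map a raw record to the canonical schema (field names are illustrative)."""
    return {
        "release_id": int(raw["id"]),   # typed: reliable join key
        "title": raw["title"],
        "released": raw["released"],    # left as text: may be a partial date
    }

row = to_canonical({"id": "1433", "title": "Example Album", "released": "1999"})
```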
Reproducibility & audit
- No dataset is ever overwritten
- Publishing is atomic via an `active` pointer switch
- Failed runs never affect consumers
- Each run produces permanent validation and sanity reports
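A minimal sketch of the atomic pointer switch, assuming the pointer is a small file naming the current run directory (the repo's actual mechanism may differ). `os.replace` is atomic on POSIX filesystems, so consumers see either the old run or the new one, never a half-written pointer, and no run directory is ever overwritten.

```python
import os
import tempfile

def publish(root: str, run_id: str) -> None:
    """Atomically point consumers at a fully validated run."""
    fd, tmp = tempfile.mkstemp(dir=root)  # stage on the same filesystem as the pointer
    with os.fdopen(fd, "w") as f:
        f.write(run_id)
    os.replace(tmp, os.path.join(root, "active"))  # atomic switch

def active_run(root: str) -> str:
    """Resolve which run consumers currently read."""
    with open(os.path.join(root, "active")) as f:
        return f.read()
```

A failed run simply never calls `publish`, so consumers keep reading the last promoted run.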
Repositories
Notes
Discogs data is subject to Discogs licensing. This project focuses on data engineering architecture, orchestration and reproducibility, and does not redistribute datasets.