paolo_olivieri
tools · audio · browser

Discogs Lakehouse

A local-first analytical lakehouse for Discogs data, built around run-based execution, orchestrated pipelines and immutable Parquet snapshots.


What this project is

  • Fully local lakehouse (no cloud services)
  • Discogs snapshot dumps processed as deterministic pipeline runs
  • Pipeline orchestration with explicit phases and failure handling
  • Typed Parquet datasets designed for safe joins and reproducible analytics

Architecture

  • Storage: external immutable Parquet files
  • Metadata: Hive Metastore (Postgres-backed)
  • Compute: Trino (stateless)
  • Orchestration: Digdag workflows governing pipeline execution

Pipelines are executed as ordered workflows with explicit dependencies and guarded promotion.

Pipeline design

  • Each execution generates a unique, immutable run directory
  • Run identifiers are computed once and propagated to all tasks
  • Ingest, build, test and promotion are isolated pipeline phases
  • Promotion is allowed only after all validations succeed
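
The points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's Digdag workflow: the `runs/` layout, the run identifier format, and the names `new_run_id` and `execute_run` are hypothetical, and each phase is modelled as a callable that returns `True` on success.

```python
from __future__ import annotations

import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

# Hypothetical phase order; promotion is decided by the caller afterwards.
PHASES = ("ingest", "build", "test")

Phase = Callable[[Path, str], bool]


def new_run_id() -> str:
    """Compute a run identifier once (assumed format: UTC timestamp + random suffix)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{uuid.uuid4().hex[:8]}"


def execute_run(base: Path, phases: dict[str, Phase]) -> tuple[str, bool]:
    """Run the phases in order; return (run_id, promotable)."""
    run_id = new_run_id()
    run_dir = base / "runs" / run_id
    # exist_ok=False: a run directory is never reused or overwritten.
    run_dir.mkdir(parents=True, exist_ok=False)
    for name in PHASES:
        # The same run_id is handed to every phase; none recomputes it.
        if not phases[name](run_dir, run_id):
            return run_id, False  # stop at the first failure, no promotion
    return run_id, True
```

A caller would promote the run only when the second element of the result is `True`; a failed run simply leaves its directory behind for inspection.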

Data philosophy

  • Canonical datasets preserve Discogs data faithfully
  • Identifiers are typed wherever possible
  • Ambiguous fields remain textual to avoid false precision
  • Interpretation is applied explicitly at the analytical level
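
As a small illustration of this split, the sketch below types an identifier while keeping a date-like field as raw text, and applies interpretation in a separate analytical helper. The field names `id` and `released`, the `CanonicalRelease` class, and the "first four digits are the year" rule are assumptions made for the example, not the project's actual schema.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CanonicalRelease:
    release_id: int  # identifier: typed, safe to join on
    released: str    # ambiguous date-like value: kept verbatim as text


def to_canonical(raw: dict) -> CanonicalRelease:
    """Canonical layer: cast the identifier, leave the ambiguous field untouched."""
    return CanonicalRelease(
        release_id=int(raw["id"]),
        released=raw.get("released") or "",
    )


_YEAR = re.compile(r"^(\d{4})")


def release_year(rec: CanonicalRelease) -> Optional[int]:
    """Analytical layer: an explicit, documented interpretation of the raw text."""
    m = _YEAR.match(rec.released.strip())
    return int(m.group(1)) if m else None
```

Because the canonical record never discards the original string, a different analysis can apply a stricter or looser rule later without re-ingesting anything.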

Reproducibility & audit

  • No dataset is ever overwritten
  • Publishing is atomic via an active pointer switch
  • Failed runs never affect consumers
  • Each run produces permanent validation and sanity reports
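
One common way to get an atomic pointer switch on a local filesystem is to write the new pointer to a temporary file and rename it over the old one. The sketch below shows that technique; the `ACTIVE` pointer file, the `runs/` layout, and the function names are hypothetical, and the project may implement its pointer differently (for example through metastore table locations).

```python
import os
from pathlib import Path
from typing import Optional


def publish(base: Path, run_id: str) -> None:
    """Point consumers at an existing run directory in a single atomic step."""
    run_dir = base / "runs" / run_id
    if not run_dir.is_dir():
        # Refuse to publish a run that does not exist; the pointer is untouched.
        raise FileNotFoundError(run_dir)
    pointer = base / "ACTIVE"
    tmp = base / f"ACTIVE.tmp.{os.getpid()}"
    tmp.write_text(run_id + "\n", encoding="utf-8")
    # os.replace is an atomic rename when source and target share a filesystem,
    # so readers see either the old run id or the new one, never a partial file.
    os.replace(tmp, pointer)


def active_run(base: Path) -> Optional[str]:
    """Return the currently published run id, or None if nothing is published."""
    pointer = base / "ACTIVE"
    if not pointer.exists():
        return None
    return pointer.read_text(encoding="utf-8").strip()
```

Run directories are only ever added, never modified, so rolling back under this scheme is just another `publish` call with an earlier run id.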

Repositories

Notes

Discogs data is subject to Discogs licensing. This project focuses on data engineering architecture, orchestration and reproducibility, and does not redistribute datasets.