Agentic Hybrid Retrieval for Ad Hoc Dataset Search: A Reference Architecture with LLM-Augmented Metadata

SAML Workshop, IEEE ICSA 2026 (Accepted)

Abstract

Ad hoc dataset search requires matching underspecified natural-language queries against sparse, heterogeneous metadata records. We present a reference architecture for agentic hybrid retrieval in which an LLM controller orchestrates BM25 lexical search and dense-embedding retrieval, and the two ranked lists are merged with reciprocal rank fusion (RRF). We also introduce an offline metadata augmentation step in which an LLM generates pseudo-queries and structured pseudo-descriptions for each dataset record.
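As an illustration of the fusion step named above, the following is a minimal sketch of reciprocal rank fusion over two ranked lists of dataset IDs. It is not the paper's implementation: the function name, the dataset IDs, and the smoothing constant k = 60 (a value commonly used for RRF) are assumptions made for this example, and the BM25 and dense rankings are hard-coded rather than produced by real retrievers.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings standing in for BM25 and dense-retriever output.
bm25_ranking = ["ds_air", "ds_traffic", "ds_weather"]
dense_ranking = ["ds_weather", "ds_air", "ds_energy"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only rank positions, it does not require BM25 scores and embedding similarities to be on a comparable scale. A record ranked well by both retrievers (here `ds_air`) ends up ahead of one ranked highly by only a single retriever.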
