Agentic Hybrid Retrieval for Ad Hoc Dataset Search: A Reference Architecture with LLM-Augmented Metadata
SAML Workshop, IEEE ICSA 2026 (Accepted), 2026
Abstract
Ad hoc dataset search requires matching underspecified natural-language queries against sparse, heterogeneous metadata records. We present a reference architecture for agentic hybrid retrieval that combines BM25 lexical search with dense-embedding retrieval via reciprocal rank fusion, orchestrated by an LLM controller. We introduce an offline metadata augmentation step in which an LLM generates pseudo-queries and structured pseudo-descriptions for each dataset record.
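As an illustration of the fusion step named above, the following is a minimal sketch of reciprocal rank fusion over two ranked lists. It is not the authors' implementation: the function name `reciprocal_rank_fusion`, the dataset identifiers, and the smoothing constant `k = 60` (a common default in the RRF literature) are assumptions made for the example, and the BM25 and dense retrievers are stood in for by hard-coded rankings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives the score sum(1 / (k + rank)) over every list
    in which it appears, with ranks starting at 1. The constant k = 60 is
    an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical (BM25) and a dense-embedding retriever.
bm25_ranking = ["ds_air_quality", "ds_traffic", "ds_weather"]
dense_ranking = ["ds_weather", "ds_air_quality", "ds_census"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because the score depends only on rank positions, the two retrievers' raw scores never need to be calibrated against each other, which is one reason this style of fusion suits combining lexical and dense results.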
