Annbatch Unlocks Terabyte-Scale Training of Biological Data in Anndata

Abstract

The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy, and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, enabling the use of increasingly large and diverse datasets without abandoning standard biological data formats.
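The abstract describes out-of-core training, in which the dataset stays on disk and only one mini-batch is resident in memory at a time. Annbatch's actual API is not shown here; the sketch below illustrates the underlying idea generically with a NumPy memmap. The function name `iter_minibatches` and the file layout are hypothetical, not part of annbatch.

```python
# Generic sketch of out-of-core mini-batch loading (NOT annbatch's API):
# the full matrix lives on disk; each iteration copies only one slice to RAM.
import os
import tempfile
import numpy as np

def iter_minibatches(path, n_obs, n_vars, batch_size, dtype=np.float32):
    """Yield mini-batches from a disk-backed array without loading it fully."""
    X = np.memmap(path, mode="r", dtype=dtype, shape=(n_obs, n_vars))
    for start in range(0, n_obs, batch_size):
        # np.asarray copies just this slice into memory; the rest stays on disk.
        yield np.asarray(X[start:start + batch_size])

# Write a small disk-backed dataset (10 "cells" x 4 "genes") and iterate.
path = os.path.join(tempfile.mkdtemp(), "X.bin")
data = np.arange(40, dtype=np.float32).reshape(10, 4)
out = np.memmap(path, mode="w+", dtype=np.float32, shape=(10, 4))
out[:] = data
out.flush()

batches = list(iter_minibatches(path, n_obs=10, n_vars=4, batch_size=4))
print([b.shape for b in batches])  # → [(4, 4), (4, 4), (2, 4)]
```

In practice, a loader like annbatch layers shuffling, sparse-format handling, and metadata access on top of this pattern while keeping the data in the community anndata format.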

Preprint

Apr. 2026

Authors

I. Gold • F. Fischer • L. Arnoldt • F. A. Wolf • F. J. Theis

Links

arXiv • GitHub

Research Area

C2 | Biology

BibTeX Key: GFA+26