FlashLib: Bringing Flash Magic to Classical Machine Learning Operators

Shuo Yang¹, Haocheng Xi¹, Yilong Zhao¹, Qiuyang Mang¹, Zhe Wang², Shanlin Sun², Kurt Keutzer¹, Joseph E. Gonzalez¹, Song Han³, Chenfeng Xu^4,*, Ion Stoica^1,*

¹UC Berkeley · ²UC Irvine · ³MIT · ⁴UT Austin · ^*Co-advising

Code: github.com/FlashML-org/flashlib

26× KMeans

19× KNN

208× TruncatedSVD

47× PCA

7× UMAP

40× HDBSCAN

147× t-SNE (exact)

49× MultinomialNB

Introducing FlashLib — a GPU library for classical machine learning operators on modern hardwares, rebuilt for today's ML workloads and emerging agentic AI systems. Here are a few headline results from the first release:

Significant wins over cuML on Hopper GPUs: up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB.
Flash informative API: Predict runtime, memory footprint, and overhead for any workload in ~5 µs on pure CPU, with no GPU profiling required.
Fast cold start, built to scale: FlashLib uses heuristic kernel selection to avoid long autotune loops, and already supports multi-GPU execution for large workloads.
Toward optimal hardware utilization: FlashLib drives kernels much closer to the limits of modern GPUs, with Flash-KMeans reaching up to 61% of peak FLOPs and Flash-KNN reaching up to 85.2% of peak HBM bandwidth on H200.

The next frontier of AI efficiency is not just faster LLM inference. It is faster intelligence assembly. For the past few years, MLsys work largely followed a model-centric view of intelligence. As LLMs became stronger through better reasoning, larger-scale test-time compute, and more capable inference, the systems community focused on making the transformer core faster: FlashAttention, FlashDecoding, KV-cache management, and LLM serving systems etc.

But the rise of agentic AI is changing the bottleneck. Modern intelligence is increasingly built around the base model through tools, harnesses, retrieval, verification, search, and orchestration. The LLM is no longer merely a standalone reasoner; it becomes a controller over a broader computational system. As a result, the performance bottleneck is no longer confined to transformer inference. It extends to the entire computational substrate surrounding the model. For example, in Agentic AI for Science, LLM agents may generate hypotheses or candidate solutions, but the surrounding loop often depends on search, clustering, nearest-neighbor retrieval, PCA, SVD, and other classical ML operators for verification and feedback. In multimodal generation and physical AI, models must increasingly process, compress, retrieve, and reorganize streaming features on the fly before they enter the model. These examples point to a broader shift: classical ML operators are becoming core primitives around the LLM model. We envision future agentic workflows where clustering, retrieval, dimensionality reduction, verification, and linear algebra are no longer offline utilities, but online primitives in the critical path of intelligence assembly. Figure 1 illustrates this shift.

Figure 1. The latency budget for classical ML operators (KMeans, k-NN, TruncatedSVD, PCA, HDBSCAN) has been falling steadily over the past decade, on a log scale. The same primitives that used to run offline at the minute-to-hour tier (user segmentation, topic modeling, batch feature reduction) are now being called inside online serving paths (RAG retrieval, semantic cache, KV-cache clustering, agent tool routing) where the budget is measured in milliseconds. As this trend continues, the systems community needs implementations of these operators that are fast, hardware-efficient, reliable, and numerically faithful enough to sit in the critical path. Hover (or tap) any point to see the specific work it represents.

However, the underlying implementations of these classical operators have not kept pace with this shift. Their core design assumptions still come from the pre-FlashAttention, pre-Hopper, pre-agent era, which creates a four-way mismatch. First, many operators carry natural implementations that are unfriendly to GPUs. Second, many libraries ship one static kernel implementation across all workloads and hardware tiers, leaving modern GPU hardware features unexploited. Third, many libraries are unaware of the user's precision needs: they expose no way to declare a precision budget, leaving users unable to ask for the fastest algorithm that meets their tolerance. Fourth, the performance is black box: costly to profile, hard to modify, and impossible to budget without first reading the codebase, which leaves both developers and LLM-based agents in the dark.

FlashLib is our attempt to close these gaps and accelerate this emerging substrate, making it fast enough to sit inside the loop of agentic AI. It transforms classical ML operators from slow, offline utilities into fast, online ML primitives. Moreover, FlashLib exposes flash-informative APIs that reveal the cost, tolerance, and execution behavior of these primitives to higher-level agentic pipelines, thus enabling better scheduling and orchestration. We would also like to point out that while FlashLib is motivated by the emerging needs of LLM-centric and agentic AI systems, we also recognize that classical ML algorithms remain widely used across today's machine learning stacks. Beyond generative AI, operators such as KMeans, KNN, PCA, SVD, t-SNE, and HDBSCAN etc are still core building blocks for recommendation systems, retrieval pipelines, scientific computing, anomaly detection, visualization, and preprocessing for downstream ML models. FlashLib provides a fast, easy-to-use, and adaptive software stack that covers these diverse applications with plug-and-play GPU acceleration.

We built FlashLib around four design principles. First, we reshape the algorithm to fit the hardware while achieving mathematical equivalence. Second, we build kernel variants per operator to fully exploit different workloads on different hardwares using modern hardware features. Third, we let users declare a precision budget and route to the fastest algorithm that meets it. Fourth, we keep the entire library transparent enough that users and LLM agents can easily read, compose, and modify the kernels.

01 / 04Reformulation

Mathematically Equivalent Reformulation: Rewriting Operators to Be GPU-Friendly

Many classical ML operators have natural implementations that are unfriendly to GPUs: they materialize large intermediates in HBM, introduce atomic contention, or run reductions along dimensions that don't tile well. FlashLib's first principle is to rewrite these into mathematically equivalent forms that are friendly to modern accelerators. KMeans assign is the clearest example: the natural implementation forms an N×K distance matrix in HBM and runs an argmin per row, but the streaming-fused version keeps the running local minima in registers and never materializes the matrix. The same pattern recurs across the library: KNN's fused top-K skips the ‖x‖² term in ‖x − y‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩, PCA's dual-Gram routing picks the smaller of X^⊤X (D×D) and X X^⊤ (N×N), avoiding the wasted O(max(N,D)³) eigh that cuML's fixed D×D path runs on wide data, MultinomialNB changes atomic scatter for segment-level reduction, and t-SNE's gradient never materializes the N×N Q matrix.

02 / 04Hardware-Aware Kernels

Hardware-Aware Implementation: Kernel Variants for Different Hardware and Workloads

To map these mathematical formulations directly to silicon, FlashLib builds multiple kernel variants that adapt to both the hardware and the workload. Flash-KNN illustrates this approach. First, at the backend layer, we ship a portable Triton implementation for both Ampere and Hopper. For Hopper, an opt-in CuteDSL FA3 backend additionally unlocks modern features like TMA fetching and warp-specialized pipelines. Second, at the kernel layer, the design adapts to the workload. For large queries, the kernel mirrors standard FlashAttention to maximize TensorCore utilization. For small queries against a huge corpus, it mirrors Flash-Decoding: a split-k layout distributes work across the corpus dimension to prevent SM idling. Third, at the heuristic layer, we choose hyperparameters like tile sizes and warp counts based on hardware characteristics such as cache size and register capacity. As a result, even a Q=1 query against a 100M-vector corpus holds the kernel at 85.2% of the H200's peak HBM bandwidth.

03 / 04Tolerance Routing

Tolerance-Driven Dispatch: Routing to the Fastest Algorithm within a Precision Budget

FlashLib also exposes the speed-accuracy tradeoff as a user choice. Classical scientific computing often demands high precision in FP32 or even FP64 — for solving PDEs, certifying numerical methods, or anywhere a small error cascades into a wrong answer. Many AI workloads have no such requirement: a clustering pass over embedding vectors, a top-K retrieval, or a regression on noisy data can absorb a small declared residual for a substantial speedup. FlashLib makes that distinction the user's to draw, through a single per-call argument, tol. At tol=None, reductions stay in exact precision and the call rides the kernel-fusion wins from above. At tol > 0, the dispatcher routes through a Pareto frontier of precision emulation (fused variants like 3xbf16 and Ozaki-II INT8) and algorithm substitution (Halko subspace iteration), picking whichever has the highest throughput within the declared residual.

# GEMM: same call, different tol -> different variant.
flashlib.gemm(A, B)               # exact fp32
flashlib.gemm(A, B, tol=1e-3)     # bf16
flashlib.gemm(A, B, tol=1e-5)     # 3xbf16 (cute-fused)
flashlib.gemm(A, B, tol=1e-7)     # ozaki2_cute(s=8): tighter AND faster
flashlib.gemm(A, B, tol=1e-12)    # ozaki2_int8(s=14): FP64-grade

# PCA: tol unlocks an algorithm substitution, not just precision.
flashlib.flash_pca(X, K=32)            # exact eigh on Gram / cov matrix
flashlib.flash_pca(X, K=32, tol=1e-4)  # Halko subspace: ~30x faster

04 / 04Cost-Predictable API

Agent-Native API: Transparent Source and Predictable Cost for Users and Agents

In an era when LLM-based agents increasingly read, call, and modify performance code, the cost of a library is not just its kernel throughput but how legible its cost model and source are. FlashLib is written in Triton and CuteDSL with no opaque binaries — every kernel from flash_kmeans(...) down to the tl.dot call is editable. And every primitive ships a GPU-free cost-prediction surface: flashlib.info.estimate(...) takes a shape and a tolerance and returns a recursive tree of runtime, FLOPs, HBM bytes, and bound regime in ~5 microseconds on pure CPU, never importing torch, triton, or cutlass. An LLM agent can compose a pipeline of ten primitives, walk that cost tree, and decide whether the budget fits before spending a single FLOP.

import flashlib.info as info   # pure stdlib -- no torch/triton/cutlass.

# Predict cost without touching the GPU -- ~5 microseconds on pure CPU.
est = info.estimate("pca", shape=(1_000_000, 512), params={"K": 32}, device="H200")

print(est.summary_line())
# pca   13.18 ms  bound=compute   42 TF  (83% peak)  res~1e-7  [roofline]

est.print_tree()              # walk the recursive call-stack tree
# pca           13.18 ms  2.18 GB  compute   42 TF  res~1e-7
# ├── cov_gemm  10.49 ms  2.05 GB  compute   50 TF
# ├── eigh       0.12 ms  0.00 GB  compute    3 TF
# └── transform  2.57 ms  2.18 GB  compute   13 TF

# Pareto-optimal GEMM variants for a 4Kx4Kx4K matmul on H200:
for v in info.pareto("gemm", shape=(4096, 4096, 4096), device="H200"):
    print(v)
# Variant('gemm_fp16'           : 0.2 ms  residual~8e-04)
# Variant('gemm_tf32'           : 0.4 ms  residual~3e-04)
# Variant('gemm_3xfp16'         : 0.5 ms  residual~2e-04)
# Variant('gemm_fp16_x3_kahan'  : 0.6 ms  residual~5e-07)
# Variant('gemm_ozaki2_cute'    : 0.8 ms  residual~2e-15)

Benchmarks

All results below are measured on a single NVIDIA H200 (SM90, 150 GB HBM3e) with CUDA 13.0, driver 580.126, PyTorch 2.11, Triton 3.6, against cuML 25.10. Every cell is the median over 5 iterations with the first call discarded for JIT amortization; inputs are GPU-resident on both sides; matched-algorithm rows (same algorithm, method, svd_solver) are paired with reduced-precision and algorithmic-shortcut rows so the comparison is fair at every shape.

1. Breadth: speedup over cuML across 13 primitives

The first benchmark is a broad sweep: 13 primitives × 194 (shape, dtype, hyperparameter) cells, all run against cuml 25.10 on the same H200. Every cell here is an apples-to-apples comparison: matched algorithm, matched precision, matched hyperparameters — FlashLib is forbidden from using any reduced-precision GEMM (no bf16/fp16/Ozaki) or algorithmic shortcut (no Halko, no FFT t-SNE, no NN-Descent KNN). Because of this, the bars below are strictly lower than the headline numbers at the top of the post: the hero stats are FlashLib's best ceiling speedups on each primitive (which, where applicable, do let the user trade precision or algorithmic exactness for throughput via the tol knob from Principle 03), whereas the broad sweep deliberately removes those degrees of freedom to isolate the pure kernel-engineering win. The bar chart below collapses each primitive's 8–34 cells into a single bar — Geomean shows the geometric-mean speedup across all of that primitive's cells, and Max shows the per-primitive ceiling on the most favourable cell. FlashLib is at least as fast as cuML on 193 / 194 cells; 126 cells cross 5×, and 11 cross 50×.

Speedup over cuML 25.10, per primitive

Geometric mean across 194 workload cells, exact-to-exact (no reduced precision, no algorithmic shortcut) · log-scale x-axis. Maximum win across the same 194 cells, exact-to-exact · log-scale x-axis.

Spotlight Click one of the five primitive buttons to pin it to the top, then switch between geomean and max to watch the ordering and bar lengths move smoothly.

2. Depth: precision × runtime trade-off inside one primitive (GEMM)

The second benchmark zooms into a single primitive, GEMM at 4096³ on H200, to show what tolerance routing looks like in practice. FlashLib ships ~10 GEMM variants in flashlib.linalg.gemm — bf16, fp16, tf32, fused multi-pass (3xbf16, fp16_x3_kahan), and the Ozaki-II INT8 family (ozaki2_cute, ozaki2_int8) — all behind the single tol-routed dispatcher from Principle 03. The scatter below plots each variant on RMS relative error (vs an FP64 reference) against per-call runtime. The dashed curve is the Pareto frontier: variants below it dominate the rest. The interesting point is ozaki2_cute(s=8): it sits below and to the left of fp32, meaning it is simultaneously tighter and ~2× faster than the native fp32 path on this shape.

GEMM precision vs runtime, 4096³ on H200

Each point is one variant. Lower is better on both axes. The dashed accent curve is the Pareto frontier — everything to its upper-right is dominated.

3. Agent loops: Does FlashLib accelerate the agent loop?

Our third benchmark returns to the core motivation of FlashLib: many agentic workflows are bottlenecked not only by LLM rollout, but also by auxiliary operators such as retrieval, clustering, vector search, and numerical routines. We therefore evaluate FlashLib in a meta setting: whether giving a coding agent access to FlashLib helps it ship a faster end-to-end system.

Concretely, we task Claude Code Opus 4.7 with building a GPU vector-search backend, based on the GPU port of KCORES/vector-db-bench, under a 1M-token budget. The goal is to maximize QPS subject to a strict recall constraint of at least 0.999 on SIFT-1M. We evaluate the resulting system across three workloads: offline batch search, online single-query search, and streaming search. We run the same task twice, changing only whether the agent has access to FlashLib.

QPS over time, w/ vs w/o flashlib in the prompt

Both runs are Claude Code Opus 4.7 against the same vector-search task at recall ≥ 0.999, capped at 1M tokens. The trajectory plays as if you were watching the agent live; squares mark a clean exit, crosses mark a budget timeout.

Offline batch (10k queries / call)

Bulk-vector queries with no latency constraint. The defining benchmark for fused matmul+top-k.

w/ flashlib default

Online single (one query / call)

Per-query launches, so kernel launch overhead dominates. Both variants saturate around the same ceiling.

w/ flashlib default

Streaming (small slabs, hard 0.999 recall)

Tie-breaks against the bench's cupy ground truth dominate. Hybrid fast/slow paths unlock the high-QPS region.

w/ flashlib default

Offline batch

310kw/ flashlib

50kdefault

6.2×

Online single

2.6kw/ flashlib

2.6kdefault

≈ 1×

Streaming

5.0kw/ flashlib

1.0kdefault

5.2×

On the two settings where flashlib unlocks a structural win (offline batch and streaming), the flashlib-prompted agent reaches 5–6× higher QPS by directly adopting flash_knn instead of laboriously rediscovering the cuVS path and slowly bolting on TF32. On online single, where per-query launch overhead dominates either way, both variants converge to roughly the same ceiling. Just as importantly, the flashlib-prompted agent finishes the task naturally at 904k tokens with budget to spare; the default agent runs out at 1.07M tokens with no further ideas on hand — it hit a 50k QPS plateau on the dominant offline-batch task and could not escape it within the budget.

Example: Flash-KMeans in a few lines

FlashLib ships on PyPI and installs with a single command. The smart dispatcher in flashlib.primitives.kmeans takes a GPU tensor of shape (N, D) or (B, N, D) and returns the cluster IDs and centroids; it picks between the Triton path, the Hopper-only CuteDSL FA3-style fused-assign path, and a pure-torch fallback automatically based on the shape.

$ pip install flashlib

import torch
from flashlib.primitives.kmeans import flash_kmeans

# 1M points, 128 dims on a single H200 -- runs in a few ms.
x = torch.randn(1_000_000, 128, device="cuda", dtype=torch.float32)
labels, centroids = flash_kmeans(x, n_clusters=1024, max_iters=20)
# labels:    (1, 1_000_000) int64   -- cluster assignment per point
# centroids: (1, 1024, 128) float32 -- final cluster centers

If you want to know what the call will cost before launching it — useful for an agent budgeting a pipeline of ten primitives — the same shape goes through the pure-stdlib cost API from Principle 04:

import flashlib.info as info

est = info.estimate("kmeans",
                    shape=(1_000_000, 128),
                    params={"K": 1024, "max_iters": 20})
print(est.summary_line())
# kmeans  ~7 ms  bound=compute  ~140 TF  (64% peak)  [calibrated]

Other primitives follow the same shape: flash_knn, flash_pca, flash_hdbscan, flash_tsne, flashlib.gemm, and the rest of the catalog are all one-call drop-ins with the same tol / backend knobs and the same flashlib.info.estimate(...) cost surface.

Limitations and Future Work

FlashLib's current release covers an important but incomplete slice of the classical-ML operator landscape — primarily the clustering, nearest-neighbour, dimensionality-reduction, and dense linear-algebra primitives most central to our own agentic-AI workloads. Extending the catalog is one of our top priorities; upcoming releases will deepen coverage of the existing families and add new ones (Gaussian mixtures, kernel methods, graph-based primitives, sparse inputs). A second limitation is hardware coverage: development and benchmarking have so far concentrated on H200. The dispatcher and kernels are written to be portable, but broader measurement and tuning across a wider range of hardware is still needed.

Acknowledgements

FlashLib was deeply inspired by several open-source efforts that pioneered the “flash” design playbook on which this work is built: FlashAttention, FlashDecoding, NVIDIA's cuVS and cuML, FlashLinearAttention, and FlashInfer. We are also grateful to Dacheng Li (UC Berkeley), Shiyi Cao (UC Berkeley), Melissa Pan (UC Berkeley), and Zihao Ye (University of Washington) for many helpful discussions, and to the flash-kmeans and Sparse VideoGen communities for valuable feedback on early prototypes.

Citation

If FlashLib is useful in your research or product, please cite the project as:

@misc{yang2026flashlib,
  title  = {FlashLib: Bringing Flash Magic to Classical Machine Learning Operators},
  author = {Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Mang, Qiuyang and
            Wang, Zhe and Sun, Shanlin and Keutzer, Kurt and Gonzalez, Joseph E. and
            Han, Song and Xu, Chenfeng and Stoica, Ion},
  year   = {2026},
  url    = {https://flashml-org.github.io/},
}