FlashLib: Bringing Flash Magic to Classical Machine Learning Operators

Shuo Yang1, Haocheng Xi1, Yilong Zhao1, Qiuyang Mang1, Zhe Wang2, Shanlin Sun2, Kurt Keutzer1, Joseph E. Gonzalez1, Song Han3, Chenfeng Xu4,*, Ion Stoica1,*

1UC Berkeley  ·  2UC Irvine  ·  3MIT  ·  4UT Austin  ·  *Co-advising

26× KMeans
19× KNN
208× TruncatedSVD
47× PCA
7× UMAP
40× HDBSCAN
147× t-SNE (exact)
49× MultinomialNB

Introducing FlashLib — a GPU library for classical machine learning operators on modern hardwares, rebuilt for today's ML workloads and emerging agentic AI systems. Here are a few headline results from the first release:

The next frontier of AI efficiency is not just faster LLM inference. It is faster intelligence assembly. For the past few years, MLsys work largely followed a model-centric view of intelligence. As LLMs became stronger through better reasoning, larger-scale test-time compute, and more capable inference, the systems community focused on making the transformer core faster: FlashAttention, FlashDecoding, KV-cache management, and LLM serving systems etc.

But the rise of agentic AI is changing the bottleneck. Modern intelligence is increasingly built around the base model through tools, harnesses, retrieval, verification, search, and orchestration. The LLM is no longer merely a standalone reasoner; it becomes a controller over a broader computational system. As a result, the performance bottleneck is no longer confined to transformer inference. It extends to the entire computational substrate surrounding the model. For example, in Agentic AI for Science, LLM agents may generate hypotheses or candidate solutions, but the surrounding loop often depends on search, clustering, nearest-neighbor retrieval, PCA, SVD, and other classical ML operators for verification and feedback. In multimodal generation and physical AI, models must increasingly process, compress, retrieve, and reorganize streaming features on the fly before they enter the model. These examples point to a broader shift: classical ML operators are becoming core primitives around the LLM model. We envision future agentic workflows where clustering, retrieval, dimensionality reduction, verification, and linear algebra are no longer offline utilities, but online primitives in the critical path of intelligence assembly. Figure 1 illustrates this shift.

Five classical ML operators migrating from batch latency tier into millisecond serving tier over a decade, with refined labels Same latency chart as before, with two label refinements: Video generation is now Streaming video generation, and PCA-based KV compression is shortened to PCA-based compression. K-means k-NN TruncatedSVD PCA HDBSCAN 1 ms 10 ms 100 ms 1 s 1 min 1 hr 2015 2018 2021 2024 2027 Year operator entered this latency tier User segmentation Feature reduction Topic modeling Doc clustering Item-item recsys Pipeline PCA Embedding compress Topic discovery RAG retrieval Semantic cache PCA-based compression SVD-based compression Streaming video generation KV cache clustering Agent tool routing
Figure 1. The latency budget for classical ML operators (KMeans, k-NN, TruncatedSVD, PCA, HDBSCAN) has been falling steadily over the past decade, on a log scale. The same primitives that used to run offline at the minute-to-hour tier (user segmentation, topic modeling, batch feature reduction) are now being called inside online serving paths (RAG retrieval, semantic cache, KV-cache clustering, agent tool routing) where the budget is measured in milliseconds. As this trend continues, the systems community needs implementations of these operators that are fast, hardware-efficient, reliable, and numerically faithful enough to sit in the critical path. Hover (or tap) any point to see the specific work it represents.

However, the underlying implementations of these classical operators have not kept pace with this shift. Their core design assumptions still come from the pre-FlashAttention, pre-Hopper, pre-agent era, which creates a four-way mismatch. First, many operators carry natural implementations that are unfriendly to GPUs. Second, many libraries ship one static kernel implementation across all workloads and hardware tiers, leaving modern GPU hardware features unexploited. Third, many libraries are unaware of the user's precision needs: they expose no way to declare a precision budget, leaving users unable to ask for the fastest algorithm that meets their tolerance. Fourth, the performance is black box: costly to profile, hard to modify, and impossible to budget without first reading the codebase, which leaves both developers and LLM-based agents in the dark.

FlashLib is our attempt to close these gaps and accelerate this emerging substrate, making it fast enough to sit inside the loop of agentic AI. It transforms classical ML operators from slow, offline utilities into fast, online ML primitives. Moreover, FlashLib exposes flash-informative APIs that reveal the cost, tolerance, and execution behavior of these primitives to higher-level agentic pipelines, thus enabling better scheduling and orchestration. We would also like to point out that while FlashLib is motivated by the emerging needs of LLM-centric and agentic AI systems, we also recognize that classical ML algorithms remain widely used across today's machine learning stacks. Beyond generative AI, operators such as KMeans, KNN, PCA, SVD, t-SNE, and HDBSCAN etc are still core building blocks for recommendation systems, retrieval pipelines, scientific computing, anomaly detection, visualization, and preprocessing for downstream ML models. FlashLib provides a fast, easy-to-use, and adaptive software stack that covers these diverse applications with plug-and-play GPU acceleration.

We built FlashLib around four design principles. First, we reshape the algorithm to fit the hardware while achieving mathematical equivalence. Second, we build kernel variants per operator to fully exploit different workloads on different hardwares using modern hardware features. Third, we let users declare a precision budget and route to the fastest algorithm that meets it. Fourth, we keep the entire library transparent enough that users and LLM agents can easily read, compose, and modify the kernels.

01 / 04Reformulation

Mathematically Equivalent Reformulation: Rewriting Operators to Be GPU-Friendly

Many classical ML operators have natural implementations that are unfriendly to GPUs: they materialize large intermediates in HBM, introduce atomic contention, or run reductions along dimensions that don't tile well. FlashLib's first principle is to rewrite these into mathematically equivalent forms that are friendly to modern accelerators. KMeans assign is the clearest example: the natural implementation forms an N×K distance matrix in HBM and runs an argmin per row, but the streaming-fused version keeps the running local minima in registers and never materializes the matrix. The same pattern recurs across the library: KNN's fused top-K skips the x2 term in x − y2 = ‖x2 + ‖y2 − 2⟨xy, PCA's dual-Gram routing picks the smaller of XX (D×D) and X X (N×N), avoiding the wasted O(max(N,D)3) eigh that cuML's fixed D×D path runs on wide data, MultinomialNB changes atomic scatter for segment-level reduction, and t-SNE's gradient never materializes the N×N Q matrix.

02 / 04Hardware-Aware Kernels

Hardware-Aware Implementation: Kernel Variants for Different Hardware and Workloads

To map these mathematical formulations directly to silicon, FlashLib builds multiple kernel variants that adapt to both the hardware and the workload. Flash-KNN illustrates this approach. First, at the backend layer, we ship a portable Triton implementation for both Ampere and Hopper. For Hopper, an opt-in CuteDSL FA3 backend additionally unlocks modern features like TMA fetching and warp-specialized pipelines. Second, at the kernel layer, the design adapts to the workload. For large queries, the kernel mirrors standard FlashAttention to maximize TensorCore utilization. For small queries against a huge corpus, it mirrors Flash-Decoding: a split-k layout distributes work across the corpus dimension to prevent SM idling. Third, at the heuristic layer, we choose hyperparameters like tile sizes and warp counts based on hardware characteristics such as cache size and register capacity. As a result, even a Q=1 query against a 100M-vector corpus holds the kernel at 85.2% of the H200's peak HBM bandwidth.

03 / 04Tolerance Routing

Tolerance-Driven Dispatch: Routing to the Fastest Algorithm within a Precision Budget

FlashLib also exposes the speed-accuracy tradeoff as a user choice. Classical scientific computing often demands high precision in FP32 or even FP64 — for solving PDEs, certifying numerical methods, or anywhere a small error cascades into a wrong answer. Many AI workloads have no such requirement: a clustering pass over embedding vectors, a top-K retrieval, or a regression on noisy data can absorb a small declared residual for a substantial speedup. FlashLib makes that distinction the user's to draw, through a single per-call argument, tol. At tol=None, reductions stay in exact precision and the call rides the kernel-fusion wins from above. At tol > 0, the dispatcher routes through a Pareto frontier of precision emulation (fused variants like 3xbf16 and Ozaki-II INT8) and algorithm substitution (Halko subspace iteration), picking whichever has the highest throughput within the declared residual.

# GEMM: same call, different tol -> different variant.
flashlib.gemm(A, B)               # exact fp32
flashlib.gemm(A, B, tol=1e-3)     # bf16
flashlib.gemm(A, B, tol=1e-5)     # 3xbf16 (cute-fused)
flashlib.gemm(A, B, tol=1e-7)     # ozaki2_cute(s=8): tighter AND faster
flashlib.gemm(A, B, tol=1e-12)    # ozaki2_int8(s=14): FP64-grade

# PCA: tol unlocks an algorithm substitution, not just precision.
flashlib.flash_pca(X, K=32)            # exact eigh on Gram / cov matrix
flashlib.flash_pca(X, K=32, tol=1e-4)  # Halko subspace: ~30x faster
04 / 04Cost-Predictable API

Agent-Native API: Transparent Source and Predictable Cost for Users and Agents

In an era when LLM-based agents increasingly read, call, and modify performance code, the cost of a library is not just its kernel throughput but how legible its cost model and source are. FlashLib is written in Triton and CuteDSL with no opaque binaries — every kernel from flash_kmeans(...) down to the tl.dot call is editable. And every primitive ships a GPU-free cost-prediction surface: flashlib.info.estimate(...) takes a shape and a tolerance and returns a recursive tree of runtime, FLOPs, HBM bytes, and bound regime in ~5 microseconds on pure CPU, never importing torch, triton, or cutlass. An LLM agent can compose a pipeline of ten primitives, walk that cost tree, and decide whether the budget fits before spending a single FLOP.

import flashlib.info as info   # pure stdlib -- no torch/triton/cutlass.

# Predict cost without touching the GPU -- ~5 microseconds on pure CPU.
est = info.estimate("pca", shape=(1_000_000, 512), params={"K": 32}, device="H200")

print(est.summary_line())
# pca   13.18 ms  bound=compute   42 TF  (83% peak)  res~1e-7  [roofline]

est.print_tree()              # walk the recursive call-stack tree
# pca           13.18 ms  2.18 GB  compute   42 TF  res~1e-7
# ├── cov_gemm  10.49 ms  2.05 GB  compute   50 TF
# ├── eigh       0.12 ms  0.00 GB  compute    3 TF
# └── transform  2.57 ms  2.18 GB  compute   13 TF

# Pareto-optimal GEMM variants for a 4Kx4Kx4K matmul on H200:
for v in info.pareto("gemm", shape=(4096, 4096, 4096), device="H200"):
    print(v)
# Variant('gemm_fp16'           : 0.2 ms  residual~8e-04)
# Variant('gemm_tf32'           : 0.4 ms  residual~3e-04)
# Variant('gemm_3xfp16'         : 0.5 ms  residual~2e-04)
# Variant('gemm_fp16_x3_kahan'  : 0.6 ms  residual~5e-07)
# Variant('gemm_ozaki2_cute'    : 0.8 ms  residual~2e-15)

Benchmarks

All results below are measured on a single NVIDIA H200 (SM90, 150 GB HBM3e) with CUDA 13.0, driver 580.126, PyTorch 2.11, Triton 3.6, against cuML 25.10. Every cell is the median over 5 iterations with the first call discarded for JIT amortization; inputs are GPU-resident on both sides; matched-algorithm rows (same algorithm, method, svd_solver) are paired with reduced-precision and algorithmic-shortcut rows so the comparison is fair at every shape.

1. Breadth: speedup over cuML across 13 primitives

The first benchmark is a broad sweep: 13 primitives × 194 (shape, dtype, hyperparameter) cells, all run against cuml 25.10 on the same H200. Every cell here is an apples-to-apples comparison: matched algorithm, matched precision, matched hyperparameters — FlashLib is forbidden from using any reduced-precision GEMM (no bf16/fp16/Ozaki) or algorithmic shortcut (no Halko, no FFT t-SNE, no NN-Descent KNN). Because of this, the bars below are strictly lower than the headline numbers at the top of the post: the hero stats are FlashLib's best ceiling speedups on each primitive (which, where applicable, do let the user trade precision or algorithmic exactness for throughput via the tol knob from Principle 03), whereas the broad sweep deliberately removes those degrees of freedom to isolate the pure kernel-engineering win. The bar chart below collapses each primitive's 8–34 cells into a single bar — Geomean shows the geometric-mean speedup across all of that primitive's cells, and Max shows the per-primitive ceiling on the most favourable cell. FlashLib is at least as fast as cuML on 193 / 194 cells; 126 cells cross 5×, and 11 cross 50×.

Speedup over cuML 25.10, per primitive
Geometric mean across 194 workload cells, exact-to-exact (no reduced precision, no algorithmic shortcut) · log-scale x-axis. Maximum win across the same 194 cells, exact-to-exact · log-scale x-axis.
Spotlight Click one of the five primitive buttons to pin it to the top, then switch between geomean and max to watch the ordering and bar lengths move smoothly.

2. Depth: precision × runtime trade-off inside one primitive (GEMM)

The second benchmark zooms into a single primitive, GEMM at 4096³ on H200, to show what tolerance routing looks like in practice. FlashLib ships ~10 GEMM variants in flashlib.linalg.gemm — bf16, fp16, tf32, fused multi-pass (3xbf16, fp16_x3_kahan), and the Ozaki-II INT8 family (ozaki2_cute, ozaki2_int8) — all behind the single tol-routed dispatcher from Principle 03. The scatter below plots each variant on RMS relative error (vs an FP64 reference) against per-call runtime. The dashed curve is the Pareto frontier: variants below it dominate the rest. The interesting point is ozaki2_cute(s=8): it sits below and to the left of fp32, meaning it is simultaneously tighter and ~2× faster than the native fp32 path on this shape.

GEMM precision vs runtime, 40963 on H200
Each point is one variant. Lower is better on both axes. The dashed accent curve is the Pareto frontier — everything to its upper-right is dominated.
10⁻² 10⁻³ 10⁻⁴ 10⁻⁵ 10⁻⁶ 10⁻⁷ 0.2 0.5 1.0 2.0 3.0 Runtime per GEMM at 4096³ (ms, log) RMS relative error vs FP64 reference (log) dominated region ↑ bf16 fp16 tf32 3xbf16 fp16_x3_kahan ozaki2_cute (s=8) tighter AND faster than fp32 fp32 native · dominated ~2× faster, ~3× tighter

3. Agent loops: Does FlashLib accelerate the agent loop?

Our third benchmark returns to the core motivation of FlashLib: many agentic workflows are bottlenecked not only by LLM rollout, but also by auxiliary operators such as retrieval, clustering, vector search, and numerical routines. We therefore evaluate FlashLib in a meta setting: whether giving a coding agent access to FlashLib helps it ship a faster end-to-end system.

Concretely, we task Claude Code Opus 4.7 with building a GPU vector-search backend, based on the GPU port of KCORES/vector-db-bench, under a 1M-token budget. The goal is to maximize QPS subject to a strict recall constraint of at least 0.999 on SIFT-1M. We evaluate the resulting system across three workloads: offline batch search, online single-query search, and streaming search. We run the same task twice, changing only whether the agent has access to FlashLib.

QPS over time, w/ vs w/o flashlib in the prompt
Both runs are Claude Code Opus 4.7 against the same vector-search task at recall ≥ 0.999, capped at 1M tokens. The trajectory plays as if you were watching the agent live; squares mark a clean exit, crosses mark a budget timeout.
Offline batch (10k queries / call)
Bulk-vector queries with no latency constraint. The defining benchmark for fused matmul+top-k.
w/ flashlib default
0 50k 100k 150k 200k 250k 300k QPS at recall ≥ 0.999 0s 1000s 2000s 3000s 4000s Elapsed time (Claude Opus 4.7, 1M-token budget) 37k QPS · Switched to cuVS brute_force RAFT-fused distance + top-k 48k QPS · Enabled TF32 matmul ~4× faster cuBLAS path 50k QPS · Budget exhausted at 1.07M tokens stuck at 50k QPS — no further ideas 162k QPS · Adopted flashlib.flash_knn Triton fused distance + top-k 310k QPS · Tiled batching, plateau at 310k QPS all 10k queries in one launch · exited under budget
Online single (one query / call)
Per-query launches, so kernel launch overhead dominates. Both variants saturate around the same ceiling.
w/ flashlib default
0 0.5k 1k 1.5k 2k 2.5k 3k QPS at recall ≥ 0.999 0s 1000s 2000s 3000s 4000s Elapsed time (Claude Opus 4.7, 1M-token budget) 0.5k QPS · Naive cuVS path for bsz=1 RAFT fused kernel wastes launch on 1 query 2.6k QPS · Dispatched torch path for bsz≤64 5× jump after eliminating cuVS overhead 1.9k QPS · Per-query flash_knn launch overhead dominates this regime 2.6k QPS · Trimmed per-call Python overhead torch fallback for bsz=1 not needed
Streaming (small slabs, hard 0.999 recall)
Tie-breaks against the bench's cupy ground truth dominate. Hybrid fast/slow paths unlock the high-QPS region.
w/ flashlib default
0 1k 2k 3k 4k 5k 6k QPS at recall ≥ 0.999 0s 1000s 2000s 3000s 4000s Elapsed time (Claude Opus 4.7, 1M-token budget) 0.9k QPS · cupy argpartition path matches bench GT byte-for-byte 1.0k QPS · Slow incremental tuning never tried a hybrid fast/slow split 1.0k QPS · First config that holds 0.999 recall flash_knn alone tie-breaks ≠ cupy GT 4.3k QPS · Hybrid path flash_knn fast + cupy tie-break for ties 5.0k QPS · Reduced cupy fallback rate ≥95% of queries skip the slow path
Offline batch
310kw/ flashlib
50kdefault
6.2×
Online single
2.6kw/ flashlib
2.6kdefault
≈ 1×
Streaming
5.0kw/ flashlib
1.0kdefault
5.2×

On the two settings where flashlib unlocks a structural win (offline batch and streaming), the flashlib-prompted agent reaches 5–6× higher QPS by directly adopting flash_knn instead of laboriously rediscovering the cuVS path and slowly bolting on TF32. On online single, where per-query launch overhead dominates either way, both variants converge to roughly the same ceiling. Just as importantly, the flashlib-prompted agent finishes the task naturally at 904k tokens with budget to spare; the default agent runs out at 1.07M tokens with no further ideas on hand — it hit a 50k QPS plateau on the dominant offline-batch task and could not escape it within the budget.

Example: Flash-KMeans in a few lines

FlashLib ships on PyPI and installs with a single command. The smart dispatcher in flashlib.primitives.kmeans takes a GPU tensor of shape (N, D) or (B, N, D) and returns the cluster IDs and centroids; it picks between the Triton path, the Hopper-only CuteDSL FA3-style fused-assign path, and a pure-torch fallback automatically based on the shape.

$ pip install flashlib
import torch
from flashlib.primitives.kmeans import flash_kmeans

# 1M points, 128 dims on a single H200 -- runs in a few ms.
x = torch.randn(1_000_000, 128, device="cuda", dtype=torch.float32)
labels, centroids = flash_kmeans(x, n_clusters=1024, max_iters=20)
# labels:    (1, 1_000_000) int64   -- cluster assignment per point
# centroids: (1, 1024, 128) float32 -- final cluster centers

If you want to know what the call will cost before launching it — useful for an agent budgeting a pipeline of ten primitives — the same shape goes through the pure-stdlib cost API from Principle 04:

import flashlib.info as info

est = info.estimate("kmeans",
                    shape=(1_000_000, 128),
                    params={"K": 1024, "max_iters": 20})
print(est.summary_line())
# kmeans  ~7 ms  bound=compute  ~140 TF  (64% peak)  [calibrated]

Other primitives follow the same shape: flash_knn, flash_pca, flash_hdbscan, flash_tsne, flashlib.gemm, and the rest of the catalog are all one-call drop-ins with the same tol / backend knobs and the same flashlib.info.estimate(...) cost surface.

Limitations and Future Work

FlashLib's current release covers an important but incomplete slice of the classical-ML operator landscape — primarily the clustering, nearest-neighbour, dimensionality-reduction, and dense linear-algebra primitives most central to our own agentic-AI workloads. Extending the catalog is one of our top priorities; upcoming releases will deepen coverage of the existing families and add new ones (Gaussian mixtures, kernel methods, graph-based primitives, sparse inputs). A second limitation is hardware coverage: development and benchmarking have so far concentrated on H200. The dispatcher and kernels are written to be portable, but broader measurement and tuning across a wider range of hardware is still needed.

Acknowledgements

FlashLib was deeply inspired by several open-source efforts that pioneered the “flash” design playbook on which this work is built: FlashAttention, FlashDecoding, NVIDIA's cuVS and cuML, FlashLinearAttention, and FlashInfer. We are also grateful to Dacheng Li (UC Berkeley), Shiyi Cao (UC Berkeley), Melissa Pan (UC Berkeley), and Zihao Ye (University of Washington) for many helpful discussions, and to the flash-kmeans and Sparse VideoGen communities for valuable feedback on early prototypes.

Citation

If FlashLib is useful in your research or product, please cite the project as:

@misc{yang2026flashlib,
  title  = {FlashLib: Bringing Flash Magic to Classical Machine Learning Operators},
  author = {Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Mang, Qiuyang and
            Wang, Zhe and Sun, Shanlin and Keutzer, Kurt and Gonzalez, Joseph E. and
            Han, Song and Xu, Chenfeng and Stoica, Ion},
  year   = {2026},
  url    = {https://flashml-org.github.io/},
}