Goal: Maximize QxK^T kernel throughput on H200
Logs: runs/3ce9ab3e-opt-gpu-matmul
Model: o4-mini
Tokens: ↑91.2K ↓74.8K = 166.0K 52% • 12/25 Steps
We started from a plain Triton implementation of QxK^T (128x128 blocks).
The profiler showed the kernel was memory‑bound. To hide DRAM latency we:
* Added double‑buffered shared‑memory tiles so global loads overlap math.
* Switched to 32x128x32 tiling to cut register pressure.
* Hoisted the K‑pointer update outside the loop.
Each change was kept only if it delivered a speed‑up of more than 5%; a minimal sketch of that acceptance gate follows.
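The sketch below illustrates the >5% acceptance gate under stated assumptions: the helper names (time_ms, accept_candidate) are illustrative and not part of any library; only triton.testing.do_bench is a standard Triton utility.

import triton.testing

def time_ms(fn, warmup=25, rep=100):
    # do_bench handles warm-up and returns a latency in milliseconds.
    return triton.testing.do_bench(fn, warmup=warmup, rep=rep)

def accept_candidate(best_ms, candidate_fn, min_speedup=1.05):
    # Keep the candidate kernel only if it is >5% faster than the current best.
    cand_ms = time_ms(candidate_fn)
    if best_ms / cand_ms > min_speedup:
        return cand_ms, True   # new best time, change accepted
    return best_ms, False      # keep the previous kernel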
🌳 baseline 1.00×
└─● attempt
  ├─● attempt
  │ └─● attempt
  ├─● tile32 0.45×
  ├─● reg_prune 0.62×
  └─● attempt
    ├─● dbuf 0.87×
    └─● attempt
      ├─● attempt
      ├─● prefetch 1.10×
      ├─● fusion 1.57× 🏆
      └─○ evaluating
import triton
import triton.language as tl


@triton.autotune(
    configs=[triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=4, num_stages=2)],
    key=["M", "N", "K_dim"],
)
@triton.jit
def qk_kernel_naive(Q_ptr, K_ptr, Out_ptr, M, N, K_dim,
                    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of Q @ K^T.
    pid = tl.program_id(axis=0)
    m = pid // tl.cdiv(N, BLOCK_N)
    n = pid % tl.cdiv(N, BLOCK_N)
    offs_m = m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K_dim, BLOCK_K):
        # Assumes M, N and K_dim are multiples of the block sizes (no masking).
        q = tl.load(Q_ptr + (offs_m[:, None] * K_dim + (k + offs_k)[None, :]))
        kT = tl.load(K_ptr + (offs_n[:, None] * K_dim + (k + offs_k)[None, :]))
        acc += tl.dot(q, tl.trans(kT))
    tl.store(Out_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
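A minimal sketch of how the naive kernel might be launched, assuming row-major fp16 Q of shape (M, K_dim) and K of shape (N, K_dim); the shapes and dtypes are illustrative, and the block sizes are supplied by the autotuner through META.

import torch
import triton

M, N, K_dim = 2048, 2048, 128
Q = torch.randn(M, K_dim, device="cuda", dtype=torch.float16)
K = torch.randn(N, K_dim, device="cuda", dtype=torch.float16)
Out = torch.empty(M, N, device="cuda", dtype=torch.float32)

# One program per (BLOCK_M, BLOCK_N) output tile; block sizes come from the autotuner.
grid = lambda META: (triton.cdiv(M, META["BLOCK_M"]) * triton.cdiv(N, META["BLOCK_N"]),)
qk_kernel_naive[grid](Q, K, Out, M, N, K_dim)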
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=4, num_stages=4),
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 32}, num_warps=4, num_stages=4),
    ],
    key=["M", "N", "K_dim"],
)
@triton.jit
def qk_kernel_opt(Q_ptr, K_ptr, Out_ptr, M, N, K_dim,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid = tl.program_id(axis=0)
    m = pid // tl.cdiv(N, BLOCK_N)
    n = pid % tl.cdiv(N, BLOCK_N)
    offs_m = m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Tile base pointers are hoisted out of the loop; only the k offset advances.
    Q_ptrs = Q_ptr + offs_m[:, None] * K_dim + offs_k[None, :]
    K_ptrs = K_ptr + offs_n[:, None] * K_dim + offs_k[None, :]
    for k in range(0, K_dim, BLOCK_K):
        q = tl.load(Q_ptrs + k)      # (BLOCK_M, BLOCK_K) tile of Q
        kblk = tl.load(K_ptrs + k)   # (BLOCK_N, BLOCK_K) tile of K
        acc += tl.dot(q, tl.trans(kblk))
    tl.store(Out_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
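The optimized kernel can be sanity-checked against a PyTorch reference. The snippet below is a sketch with assumed shapes and tolerances, not part of the recorded run.

import torch
import triton

M, N, K_dim = 1024, 1024, 128
Q = torch.randn(M, K_dim, device="cuda", dtype=torch.float16)
K = torch.randn(N, K_dim, device="cuda", dtype=torch.float16)
Out = torch.empty(M, N, device="cuda", dtype=torch.float32)

grid = lambda META: (triton.cdiv(M, META["BLOCK_M"]) * triton.cdiv(N, META["BLOCK_N"]),)
qk_kernel_opt[grid](Q, K, Out, M, N, K_dim)

# Compare against a float32 PyTorch reference for Q @ K^T.
ref = Q.float() @ K.float().T
assert torch.allclose(Out, ref, atol=1e-2, rtol=1e-2)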
>>> benchmarking qk_kernel_naive (step 14)
warm‑up................. ok
collecting 100 timing samples
[25/100] median 77.4 µs 4.34 TFLOPs
[50/100] median 75.9 µs 4.42 TFLOPs
[75/100] median 75.6 µs 4.44 TFLOPs
[100/100] median 75.3 µs 4.46 TFLOPs
device : NVIDIA A100‑80GB
batch : 4096 seq_len : 2048
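Medians and TFLOP/s figures like those above can be reproduced with a small harness. The sketch below uses triton.testing.do_bench and the usual 2·M·N·K flop count for one QxK^T product; the helper name and counting convention are assumptions, not taken from the run.

import triton.testing

def report_tflops(run_kernel, M, N, K_dim, n_mat=1):
    # do_bench returns latency in milliseconds (recent Triton also accepts return_mode="median").
    ms = triton.testing.do_bench(run_kernel, warmup=25, rep=100)
    flops = 2.0 * M * N * K_dim * n_mat           # multiply-adds for n_mat Q @ K^T products
    return ms * 1e3, flops / (ms * 1e-3) / 1e12   # (latency in µs, TFLOP/s)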
Upgrade to Iterative Optimization
Fire off hundreds of micro experiments, each proven by your evaluation script. Weco learns from every run and locks in only the best deltas, step after step.
Measure What Matters
Speed, memory, accuracy - you set the goal. Each run gets a score. No vibes, just feedback - that’s how our system accelerates your workflow.
Explore a Wider Search Space
Beyond compiler flags. Weco mutates tiling, loop order, parallelism, and more - surfacing transforms that even seasoned kernel hackers rarely try.
Uncover Hidden Speedups
Deep tree search digs up counter‑intuitive tweaks that feel like wizardry. Think expert‑level tuning on autopilot, running while you sleep.
See the Difference
Copilot editors are fast at code generation but struggle with research-heavy tasks. Weco goes deeper - running autonomous research to solve tough optimization problems.
Evaluation-Driven Optimization - AIDE, the Engine Inside Weco
Outperforming competitors with systematic iteration and optimization focused on measurable results
| Agent | Valid Submission (%) | Above Median (%) | Gold (%) | Any Medal (%) |
| --- | --- | --- | --- | --- |
| AIDE | 82.8 | 29.4 | 9.4 | 16.9 |
| MLAB | 44.3 | 1.9 | 0.8 | 0.8 |
| OpenHands | 52 | 7.1 | 2.7 | 4.4 |
Evaluation‑Driven, Metric‑First Engineering
AIDE iterates until the metric says "better." In OpenAI's MLE‑Bench it secured 4× more medals than the next best autonomous agent across 75 Kaggle competitions - proof that an explicit evaluation loop beats one‑shot code generation.
With AIDE you systematically trade a bit of compute for outsized code quality, no manual hyper‑tuning required.

Beyond Human Baselines
In METR's 6‑hour RE‑Bench challenge, AIDE consistently outperformed seasoned researchers, surfacing "surprising" solutions humans missed - validating our mission to automate experimentation itself.
Open, Evolving & Launching Soon
AIDE's core is open‑source - explore the repo or read the paper to dive deeper into our approach.
We're gearing up for our alpha launch on 30 Apr 2025 with a CLI tool and web dashboard. Want early access? Join the waitlist and help shape the future of autonomous R&D.
Academia and Industry Recognition
Weco's innovative approach featured in leading research papers and industry publications
Latest Articles
Stay updated with our latest news about AI, Machine Learning Engineering, and AIDE ML

April 4, 2024
AIDE: Human-Level Performance on Data Science Competitions
In the world of data science, Kaggle competitions have become a widely accepted standard...

January 22, 2024
The Future of Machine Learning Research
Research is a cornerstone in the quest to understand the world and tap into its economic values...