The AI Research Engineer

Turn your benchmarks
into breakthroughs

Trusted by labs advancing state‑of‑the‑art AI, Weco turns your evaluation pipeline into a self-improving engine: it runs hundreds of experiments, learns what works, surfaces the best version of your code, and compounds the gains run after run.

Read Docs
📊 Summary
Goal: Maximize QxK^T kernel throughput on H200
Logs: runs/3ce9ab3e-opt-gpu-matmul
Model: o4-mini
Tokens: 91.2K + 74.8K = 166.0K   |   52%   |   12/25 Steps
📝 Thinking...
We started from a plain Triton implementation of QxK^T (128x128 blocks).
The profiler showed the kernel was memory‑bound. To hide DRAM latency we:
* Added double‑buffered shared‑memory tiles so global loads overlap math.
* Switched to 32x128x32 tiling to cut register pressure.
* Hoisted the K‑pointer update outside the loop.
Each change was kept only if it delivered >5% speed‑up.
🔍 Exploring Solutions...
🌳 baseline  1.00×
└─● attempt
  ├─● attempt
  │ └─● attempt
  ├─● tile32  0.45×
  ├─● reg_prune  0.62×
  └─● attempt
    ├─● dbuf  0.87×
    └─● attempt
      ├─● attempt
      ├─● prefetch  1.10×
      ├─● fusion  1.57× 🏆
      └─○ evaluating
💡 Current Solution (Step 12)
import triton, triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=4, num_stages=2)],
    key=["M", "N", "K_dim"],
)
@triton.jit
def qk_kernel_naive(Q_ptr, K_ptr, Out_ptr, M, N, K_dim,
                    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid = tl.program_id(axis=0)
    m = pid // tl.cdiv(N, BLOCK_N)
    n = pid % tl.cdiv(N, BLOCK_N)
    offs_m = m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K_dim, BLOCK_K):
        # Tile addresses are recomputed from scratch on every iteration.
        q = tl.load(Q_ptr + (offs_m[:, None] * K_dim + (k + offs_k)[None, :]))
        kT = tl.load(K_ptr + (offs_n[:, None] * K_dim + (k + offs_k)[None, :]))
        acc += tl.dot(q, tl.trans(kT))
    tl.store(Out_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
🏆 Best Solution (1.57×)
import triton, triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32}, num_warps=4, num_stages=4),
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 32}, num_warps=4, num_stages=4),
    ],
    key=["M", "N", "K_dim"],
)
@triton.jit
def qk_kernel_opt(Q_ptr, K_ptr, Out_ptr, M, N, K_dim,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid = tl.program_id(axis=0)
    m = pid // tl.cdiv(N, BLOCK_N)
    n = pid % tl.cdiv(N, BLOCK_N)
    offs_m = m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Tile base pointers are hoisted out of the K loop; num_stages=4 lets Triton
    # pipeline (double-buffer) the global loads against the tl.dot.
    Q_ptrs = Q_ptr + offs_m[:, None] * K_dim + offs_k[None, :]
    K_ptrs = K_ptr + offs_n[:, None] * K_dim + offs_k[None, :]
    for k in range(0, K_dim, BLOCK_K):
        q = tl.load(Q_ptrs + k)
        kblk = tl.load(K_ptrs + k)
        acc += tl.dot(q, tl.trans(kblk))
    tl.store(Out_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
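
For context, a kernel like this would typically be launched from PyTorch with a grid sized from whichever block shape the autotuner picks. The wrapper below is a minimal sketch only; the tensor shapes, dtypes, and the wrapper name are assumptions, not part of the run shown here.

import torch, triton

def qk_matmul(Q: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    # Assumes contiguous Q of shape (M, K_dim) and K of shape (N, K_dim).
    M, K_dim = Q.shape
    N, _ = K.shape
    Out = torch.empty((M, N), device=Q.device, dtype=torch.float32)
    # The grid depends on the autotuned block sizes, so it is a lambda over META.
    grid = lambda META: (triton.cdiv(M, META["BLOCK_M"]) * triton.cdiv(N, META["BLOCK_N"]),)
    qk_kernel_opt[grid](Q, K, Out, M, N, K_dim)
    return Out
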
🖥 Evaluation Output
>>> benchmarking   qk_kernel_naive   (step 14)
warm‑up................. ok
collecting 100 timing samples
  [25/100] median  77.4 µs   4.34 TFLOPs
  [50/100] median  75.9 µs   4.42 TFLOPs
  [75/100] median  75.6 µs   4.44 TFLOPs
  [100/100] median 75.3 µs   4.46 TFLOPs

device   : NVIDIA A100‑80GB
batch    : 4096     seq_len : 2048
Trusted by teams at OpenAI, MIT, ETH Zurich, the United Nations, Accenture, and KPMG.

Upgrade to Iterative Optimization

Fire off hundreds of micro-experiments, each proven by your evaluation script. Weco learns from every run and locks in only the best deltas, step after step.
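
Conceptually, the loop looks something like the sketch below; the function names and the 5% threshold are illustrative placeholders, not Weco's actual API. Each candidate is scored by your evaluation script, and a change is kept only when it clearly beats the current best, echoing the ">5% speed-up" rule in the demo above.

def optimize(baseline_code, steps=25, min_gain=0.05):
    # Illustrative only: propose_change and run_eval_script stand in for
    # whatever your pipeline defines.
    best_code, best_score = baseline_code, run_eval_script(baseline_code)
    for _ in range(steps):
        candidate = propose_change(best_code)      # an LLM-generated edit
        score = run_eval_script(candidate)         # your metric, e.g. throughput
        if score > best_score * (1 + min_gain):    # lock in only proven deltas
            best_code, best_score = candidate, score
    return best_code, best_score
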

Measure What Matters

Speed, memory, accuracy - you set the goal. Each run gets a score. No vibes, just feedback - that’s how our system accelerates your workflow.
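
In practice the score is just a number your evaluation script prints. A throughput benchmark might look like this minimal sketch; the kernels module, the head dimension, and the metric are assumptions for illustration.

import time, torch
from kernels import qk_matmul  # hypothetical module wrapping the Triton kernel

Q = torch.randn(4096, 128, device="cuda", dtype=torch.float16)
K = torch.randn(2048, 128, device="cuda", dtype=torch.float16)

qk_matmul(Q, K)                      # warm-up / JIT compile
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    qk_matmul(Q, K)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / 100
print(f"score: {1.0 / elapsed:.2f} iters/sec")   # the number the optimizer maximizes
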

Explore a Wider Search Space

Beyond compiler flags. Weco mutates tiling, loop order, parallelism, and more - surfacing transforms that even seasoned kernel hackers rarely try.
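
As a rough illustration, such a search space could be described by a config like the one below; the field names and values are hypothetical, not Weco's schema.

# Hypothetical sketch of the transforms a search is allowed to try.
search_space = {
    "tiling":       [(128, 128, 64), (128, 64, 32), (64, 128, 32), (32, 128, 32)],
    "num_warps":    [2, 4, 8],
    "num_stages":   [2, 3, 4],       # software-pipelining / double-buffering depth
    "loop_order":   ["mn", "nm"],    # which output dimension the grid walks first
    "vector_width": [2, 4, 8],       # elements moved per load
}
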

Uncover Hidden Speedups

Deep tree search digs up counter‑intuitive tweaks that feel like wizardry. Think expert‑level tuning on autopilot, running while you sleep.
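
Under the hood this is a tree search over candidate programs. The sketch below shows the idea in simplified, greedy form and is not the exact algorithm: expand promising nodes, evaluate their children, and keep the best-scoring program seen anywhere in the tree.

import heapq

def tree_search(root_code, expand, evaluate, budget=100):
    # Max-heap keyed on score (negated, since heapq is a min-heap).
    frontier = [(-evaluate(root_code), root_code)]
    best_score, best_code = -frontier[0][0], root_code
    for _ in range(budget):
        if not frontier:
            break
        _, code = heapq.heappop(frontier)
        for child in expand(code):            # e.g. LLM-proposed mutations
            s = evaluate(child)               # your evaluation script's score
            heapq.heappush(frontier, (-s, child))
            if s > best_score:
                best_score, best_code = s, child
    return best_code, best_score
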

See the Difference

Copilot editors are fast at code generation but struggle with research-heavy tasks. Weco goes deeper - running autonomous research to solve tough optimization problems.


Evaluation-Driven Optimization - AIDE, the Engine Inside Weco

Outperforming competitors with systematic iteration and optimization focused on measurable results

Agent        Valid Submission (%)   Above Median (%)   Gold (%)   Any Medal (%)
AIDE         82.8                   29.4               9.4        16.9
MLAB         44.3                   1.9                0.8        0.8
OpenHands    52                     7.1                2.7        4.4

Evaluation‑Driven, Metric‑First Engineering

AIDE iterates until the metric says "better." In OpenAI's MLE‑Bench it secured 4× more medals than the next best autonomous agent across 75 Kaggle competitions - proof that an explicit evaluation loop beats one‑shot code generation.

With AIDE you systematically trade a bit of compute for outsized code quality, no manual hyper‑tuning required.

AIDE vs. human engineers on RE‑Bench

Beyond Human Baselines

In METR's 6‑hour RE‑Bench challenge, AIDE consistently out‑performed seasoned researchers, surfacing "surprising" solutions humans missed - validating our mission to automate experimentation itself.

Open, Evolving & Launching Soon

AIDE's core is open‑source - explore the repo or read the paper to dive deeper into our approach.
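
If you want to experiment with it yourself, the open-source package exposes a small Python entry point roughly along these lines; treat this as a sketch and defer to the repo for the exact, current interface, since names and arguments may have changed.

import aide

# Example task layout and goal text are illustrative.
exp = aide.Experiment(
    data_dir="example_tasks/bitcoin_price",   # directory containing the task data
    goal="Build a time series forecasting model for bitcoin close price.",
    eval="RMSLE",                              # the metric the agent optimizes
)

best_solution = exp.run(steps=10)
print(best_solution.code)                      # the best program found so far
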

We're gearing up for our alpha launch on 30 Apr 2025 with a CLI tool and web dashboard. Want early access? Join the waitlist and help shape the future of autonomous R&D.

Academia and Industry Recognition

Weco's innovative approach featured in leading research papers and industry publications

Latest Articles

Stay updated with our latest news about AI, Machine Learning Engineering, and AIDE ML

Frequently Asked Questions