We're running out of benchmarks to upper bound AI capabilities

15 points
19 hours ago
by gmays

Comments


nikisweeting

We can definitely make harder evals; the problem is that a good eval set is indistinguishable from good training data / market edge, so no one is incentivized to share their best eval sets publicly.

17 hours ago

WarmWash

Start front-loading the models with 5k, 10k, 50k, 100k tokens of messy, quasi-related context, and then run the benchmarks.

These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
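A minimal sketch of the idea above: pad each eval prompt with messy, quasi-related distractor text up to a token budget before appending the real task. Everything here (the `pad_with_distractors` helper, the sample distractor docs, the word-count token estimate) is illustrative, not any benchmark's actual harness.

```python
import random

def approx_tokens(text: str) -> int:
    # Crude token estimate via word count; real harnesses would use
    # the model's own tokenizer, but this is enough for a sketch.
    return len(text.split())

def pad_with_distractors(task: str, distractors: list[str],
                         budget: int, seed: int = 0) -> str:
    """Prepend quasi-related context until ~`budget` tokens, then the task."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    padding, used = [], 0
    while used < budget:
        doc = rng.choice(distractors)
        padding.append(doc)
        used += approx_tokens(doc)
    return "\n\n".join(padding) + "\n\n" + task

# Hypothetical distractor pool: plausible but irrelevant workplace noise.
docs = [
    "Ticket #4812: flaky test in CI, see earlier thread for repro steps.",
    "Meeting notes: discussed migrating the cache layer next quarter.",
    "Changelog: bumped the parser dependency to 2.3.1, no API changes.",
]
prompt = pad_with_distractors("Now answer: what is 17 * 23?", docs, budget=50)
```

Running the same benchmark at budgets of 5k, 10k, 50k, and 100k tokens would then show how accuracy degrades as the context fills with noise.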

18 hours ago

jballanc

We need benchmarks that can distinguish between continuous learning and long-context extrapolation.

17 hours ago

UltraSane

This is the least true thing ever. All LLMs are terrible at ARC-AGI-3. Every video game can be used as a benchmark. You could rank LLMs on how long they can keep a game of Dwarf Fortress running or how fast they can beat GTA5.

17 hours ago

ttoinou

We already have specialized AI to play video games

16 hours ago

UltraSane

We are talking about LLMs. A true AGI would be able to beat every video game.

16 hours ago

conception

Until Arc-Battletoads is passed I’m not buying it.

16 hours ago

UltraSane

More like ARC-SegaMasterSystem-ALF

14 hours ago
