Mistral Small 4
Comments
zacksiri
Reubend
Seems like it does quite well on that particular benchmark?
zacksiri
It's OK, but not the best. There are models that do better; I'd use it for some basic tasks, but not for actually complex tasks like query generation and retrieval.
kristianp
Interesting that they target around 120 billion parameters: just enough to fit onto a single H100 with a 4-bit quant, or onto a 128 GB unified-memory machine like Apple silicon, AMD's AI CPUs, or the GB Spark.
Copying GPT-OSS-120b?
Available to try at https://build.nvidia.com/mistralai/mistral-small-4-119b-2603
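The "fits on a single H100 with a 4-bit quant" claim is just back-of-the-envelope arithmetic on the weight footprint. A minimal sketch (the 4.5 bits/weight figure is an assumption meant to account for the per-block scales that real quant formats carry; actual sizes vary by format and runtime):

```python
# Rough memory-footprint estimate for a ~120B-parameter model.
# Illustrative only: real quant formats (GGUF k-quants, AWQ, etc.)
# differ slightly in effective bits per weight.

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """GiB needed for the weights alone (no KV cache, no activations)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for bits in (16, 8, 4.5):  # fp16, int8, ~4-bit quant incl. scales
    print(f"{bits:>4} bits/weight: {weight_gib(120, bits):6.1f} GiB")
```

At ~4.5 bits/weight the weights alone come to roughly 63 GiB, which leaves headroom on an 80 GiB H100 (or a 128 GB unified-memory box) for the KV cache and activations, while fp16 (~224 GiB) clearly does not fit.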
revolvingthrow
I really wish the benchmarks were even slightly trustworthy for AI models. ~120B is the largest size I can run locally. Naturally I grabbed the 122B Qwen3.5, which had great benchmarks, and... frankly, the model is garbage; worse than GLM 4.5 Air, IMO. But then, Qwen famously benchmaxxes.
And here we have another release. The benchmarks are just a tiny bit worse than Qwen3.5's (for far fewer tokens). Am I to take it that the model is worse? Or does Qwen's benchmaxxing mean that a slightly worse result from a non-Qwen model actually indicates a better model? I'd rather not spend hours testing things myself for every noteworthy release.
Ah well. Mistral has been fairly decent, so it's worth a look. Obviously they're behind the big 3, but in my experience their small models are probably the best you can get for several months after each release. I'm not sure how this works as a sales funnel for their paid models (same as with Chinese models, people likely just go for Google/OpenAI/Anthropic in that case), but I'm thankful for their existence.
2001zhaozhao
Which Haiku model are they comparing to? Is it 4.5? If so, it's absolutely wild that Qwen3.5 122B is shredding it in those graphs.
I tested the model in an agentic workflow. Here is the report:
https://upmaru.com/llm-tests/simple-tama-agentic-workflow-q1...