News

Anthropic's Claude 4 models show particular strength in coding and reasoning tasks, but lag behind in multimodality and ...
Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Nevertheless, benchmarks still play ...
This was because OpenAI claimed a 25% score on the benchmark, but when another organization tested the o3 model, it could answer only about 10% of FrontierMath problems. OpenAI is only one of the ...
OpenAI introduced o3 in December, stating that the model could solve approximately 25% of questions on FrontierMath, a difficult math problem set. Epoch AI, the research institute behind ...
The challenges faced by LM Arena are not isolated. Other benchmarks, such as FrontierMath and ARC-AGI, have also been criticized for similar shortcomings. These issues highlight systemic problems ...
Dubbed FrontierMath, the new AI benchmark tests large language models (LLMs) on their reasoning and mathematical problem-solving capabilities. The AI firm claims that existing math benchmarks ...
People realised that OpenAI funded development of the FrontierMath benchmark, which Epoch AI didn’t initially disclose. OpenAI insists it didn’t train on the benchmark.