News

Anthropic's Claude 4 models show particular strength in coding and reasoning tasks, but lag behind in multimodality and ...
Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Nevertheless, benchmarks still play ...
This was because OpenAI claimed a 25% score on the benchmark, but when another organization tested the o3 model, it could answer only about 10% of FrontierMath problems. OpenAI is only one of the ...
OpenAI introduced o3 in December, stating that the model could solve approximately 25% of questions on FrontierMath, a difficult math problem set. Epoch AI, the research institute behind ...
The challenges faced by LM Arena are not isolated. Other benchmarks, such as FrontierMath and ARC-AGI, have also been criticized for similar shortcomings. These issues highlight systemic problems ...
Dubbed FrontierMath, the new AI benchmark tests large language models (LLMs) on their reasoning and mathematical problem-solving capabilities. The AI firm claims that existing math benchmarks ...
People realised that OpenAI funded development of the FrontierMath benchmark, which Epoch AI didn’t initially disclose. OpenAI insists it didn’t train on the benchmark.