Don't trust the hype. Run head-to-head comparisons on your own prompts.
General benchmarks like MMLU or HumanEval score models on fixed public test sets, so they don't tell you how a model will perform on your specific data. The only way to know whether Claude 3.5 Sonnet is better than GPT-4o for your use case is to test them side by side.
AIWorkbench.dev features a Compare tool that lets you pit up to six models against each other simultaneously.
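To make the idea of a head-to-head test concrete, here is a minimal Python sketch of what such a comparison does under the hood. It is an illustration only, not how AIWorkbench.dev is implemented: `run_comparison`, `score_exact`, `model_a`, and `model_b` are hypothetical names, the two "models" are hard-coded stubs standing in for real API calls, and exact-match scoring is an assumption that only suits prompts with a single short correct answer.

```python
from typing import Callable, Dict, List, Tuple

# A "model" here is any function that maps a prompt to a text answer.
# In practice you would wrap a real API client in a function like this.
ModelFn = Callable[[str], str]


def score_exact(output: str, expected: str) -> bool:
    """Case-insensitive exact match, ignoring surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_comparison(
    models: Dict[str, ModelFn],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Send every prompt to every model and return accuracy per model."""
    if not cases:
        raise ValueError("need at least one (prompt, expected) case")
    correct = {name: 0 for name in models}
    for prompt, expected in cases:
        for name, ask in models.items():
            if score_exact(ask(prompt), expected):
                correct[name] += 1
    return {name: hits / len(cases) for name, hits in correct.items()}


# Hypothetical stub models so the sketch runs without network access.
def model_a(prompt: str) -> str:
    return "Paris" if "France" in prompt else "unknown"


def model_b(prompt: str) -> str:
    if "France" in prompt:
        return "paris"
    if "Germany" in prompt:
        return "Berlin"
    return "unknown"


# Your own prompts, each paired with the answer you expect.
CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Germany?", "Berlin"),
]

if __name__ == "__main__":
    print(run_comparison({"model_a": model_a, "model_b": model_b}, CASES))
    # prints {'model_a': 0.5, 'model_b': 1.0}
```

The point of the sketch is that every model sees exactly the same prompts and is judged by the same rule, so the only thing that varies is the model. For open-ended prompts you would likely replace `score_exact` with human review or a rubric.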