General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

This result does not surprise me at all. Here is part of the abstract:

Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ. These findings highlight the need for independent, real-world evaluation of AI tools before they enter clinical settings.

From Krithik Viswanath, et al. As a side note, this (and the more general version of the point) is one big reason why some fairly large number of Emergent Ventures proposals are rejected rather quickly.