Circling back because the OSWorld numbers have moved on since you wrote this. Sonnet 4.5's 61.4% was already a leap. Sonnet 4.6 recently hit 72.5%, and what's more interesting is who Anthropic brought in after that score.
They acquired Vercept, which brings in Ross Girshick (inventor of R-CNN), Kiana Ehsani, and Luca Weihs from Allen AI (https://reading.sh/inside-anthropics-quiet-bet-on-computer-vision-ba0cf1000756). All perception researchers. The virtuous cycle framing you used here is the right lens: applications drive infrastructure, and infrastructure enables applications. Adding arguably the best computer vision team in the field to that cycle is a serious accelerant.
The Devin CEO quote about 4.5 being the biggest jump is interesting in hindsight. I wonder what he'd say about 4.6.