完成 300 億美元融資後,Anthropic 交出了第一份 AI 答卷。就在剛剛,Claude Sonnet 4.6 正式發布,定位是「史上最強 Sonnet」。
編程、電腦操作、長上下文推理、智能體規劃,全面升級。價格沒變,還是每百萬 token 3 美元輸入/15 美元輸出,但性能直接逼近 Opus 級別。
在與 Opus 4.5 的對比測試里,用戶有 59% 的時間更偏好 Sonnet 4.6。理由也很實在:過度工程化更少、幻覺更少、多步驟任務執行更穩。
電腦操作能力是這次升級的重頭戲。
在 OSWorld 基準測試上,Sonnet 系列過去 16 個月持續進步,現在處理複雜電子表格、填寫多步驟網頁表單已接近人類水平。
這個能力戳中的是一個真實痛點:很多企業的老舊軟體沒有現代 API 接口,過去只能專門開發連接器,現在模型直接像人一樣看螢幕、點鼠標就行了,省掉了一大截工程成本。
順帶一提,Excel 中的 Claude 插件這次也同步升級,新增了 MCP 連接器支持,對金融從業者來說,這個更新很實用。
Sonnet 4.6 另一個亮點是支持 100 萬 token 超大上下文,足以在一次請求里塞進完整代碼庫、數十篇論文或一堆合同。
在 Vending-Bench Arena 這個模擬企業運營的評估里,Sonnet 4.6 摸索出一套有意思的策略:前期大舉投資產能,最後階段猛轉盈利導向,靠這個轉折時機甩開其他模型。支撐這套打法的,正是它的長期規劃能力。
對普通用戶來說,Free 和 Pro 方案的默認模型已經切換為 Sonnet 4.6,claude.ai 和 Claude Cowork 同步更新。
開發者方面,API 模型標識是 claude-sonnet-4-6,支持自適應思考、擴展思考,上下文壓縮功能可以在對話快撐爆上下文時自動總結舊內容,省 token 又省心。
而就在 Sonnet 4.6 發布的同期,馬斯克旗下 xAI 的 Grok 4.20 測試版也正式上線了 grok.com。
Grok 4.20 支持並行調度 4 個專業智能體——Grok、Harper、Benjamin、Lucas——協同執行任務。然而整體口碑兩極分化嚴重,且過往預期拔得太高,導致不少用戶期望落空,負評偏多。
後續馬斯克罕見連發多條推文滅火「救場」。他解釋稱,目前的 Grok 4.20 只是參數量 500B 的小型基礎模型,尚處公測階段。他還強調,Grok 4.20 的底層架構具備每周自我疊代的能力,遞歸式智能增長空間很大。
按他的說法,公測結束後,Grok 4.20 的智能和速度將比 Grok 4 提升約一個數量級。但這個承諾能否兌現,只能說拭目以待吧。
This is Claude Sonnet 4.6: our most capable Sonnet model yet. It’s a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. It also features a 1M token context window in beta.




The Grok 4.2 release candidate (public beta) is now available for use. You need to select it specifically. Critical feedback is appreciated. Unlike prior versions of Grok, 4.2 is able to learn rapidly, so there will be improvements every week with release notes.

first prompt to grok4.20 ... color me impressed! first attempt was good. second attempt had one error and third attempt was amazing. kind of reminds of the Plinyverse tbh
Grok 4.20 is on another level! First video coming in a few minutes...
Overall, between Grok 4.20 being mogged by any current gen Flash model, old Grok architectures we’ve seen, and Grok Imagine after Seedance 2… Elon really will need that mass driver on the Moon to compete
By the way, even tho Grok-4.20 doesn't feel great, it's not xAI employees' fault They've worked hard and tried their best, training models is hard But a small team can only get so far, we're not criticizing the work xAI employees have done, we're criticizing how xAI can't really compete against any lab because they lack the talent (the team is too small) and the data Instead of buying a trillion GPUs, Elon should focus on those two points first
The foundations of 4.2 are such that it is able to improve every week, so the recursive intelligence growth will be strong
Actually, I don’t think HLE is a great measure of usefulness. We’re moving away from these benchmarks in favor of making Grok maximally useful for actual engineering.
Grok 4.2 will be about an order of magnitude smarter and faster than Grok 4 when the public beta concludes next month. Still many bug fixes and improvements landing every day. The public beta gives us more critical feedback to address.






