OpenAI shipped Computer Use for Codex on macOS — and the Mac can be locked, screen off, while it works. Kick a task off from your phone, Codex temporarily unlocks the machine, covers every display with a privacy curtain, completes the work, and relocks. Local keyboard or pointer input forces an immediate relock. Verified shipped (today, public docs, plugin install + Screen Recording/Accessibility perms). Source: TheTip.ai breakdown.
This is the Always-On Reeve Phase 2 mechanic, shipped by a competitor first. The whole Phase 2 thesis — persistent Telegram listener, overnight autonomy, work happens while Roy sleeps — was waiting on a custom build. OpenAI just demonstrated the lock-screen-with-privacy-curtain pattern Anthropic has not. We don’t switch — Reeve is on PaperClip + claude_local for cost and architecture reasons — but the bar for what Anthropic ships next just moved.
Action this week: Set up a Codex Computer Use trial on the Mac mini M1 (8GB RAM, the Always-On Reeve hardware). Run one overnight task that requires GUI interaction, something MACA can’t do via CLI (for example, inspecting Meta Ads Manager visual reports). Capture what works and what doesn’t. This is reconnaissance for what Phase 2 needs to match, not a stack switch.
1 What to Know Today
Tier 1 — Codex Computer Use ships locked-Mac control from a phone (Always-On Reeve mechanic)
OpenAI’s Codex can now drive any macOS app visually, including while the Mac is locked with the screen off — temporary unlock, full-display privacy curtain, instant relock on local input. Verdict: verified shipped. Plugin install + system permissions live today. Two-layer permission model (OS-level + Codex per-app allowlist with Always Allow). Cannot operate terminal apps, cannot auth as admin, cannot self-drive. This is the closest any vendor has come to the Always-On Reeve Phase 2 spec — overnight autonomy with safe local handoff. Action: see PAY ATTENTION above. Link.
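The two-layer permission model can be pictured as a pair of checks that must both pass. This is a minimal sketch inferred from the description above, not Codex's implementation: `PermissionState`, `os_granted`, `app_allowlist` and `can_drive` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical model of the two-layer check described above.
# None of these names come from Codex; they are illustrative assumptions.
REQUIRED_OS_PERMISSIONS = {"screen_recording", "accessibility"}


@dataclass
class PermissionState:
    # Layer 1: permissions granted at the operating-system level.
    os_granted: set = field(default_factory=set)
    # Layer 2: apps the user has marked "Always Allow".
    app_allowlist: set = field(default_factory=set)


def can_drive(state: PermissionState, app: str) -> bool:
    """Return True only if both layers permit the agent to drive `app`."""
    if not REQUIRED_OS_PERMISSIONS <= state.os_granted:
        return False  # OS layer not fully granted, so nothing is allowed
    return app in state.app_allowlist
```

The point of the sketch: an app on the allowlist is still blocked if either OS permission is missing, which is what makes the layers independent.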
Tier 1 — a16z’s “Yellow Brick Road” essay validates the Ben + MACA + UBX legal-explorer thesis
a16z’s piece argues the labs own everything “on the Yellow Brick Road” (horizontal, capability-scales-with-model) but the “Rest of Oz” — vertical, multi-step, governed workflows with messy data and unwritten norms — is where startups win. Four defenses: across-customer + within-customer flywheels, multi-vendor model routing, cost optimisation by sub-task tier, and being the governance control plane. Verdict: high-credibility thesis from a16z portfolio practitioners (11x for sales, FurtherAI for insurance). Roy already forwarded this to Reeve with the note “we are on the right track with MACA and Ben.” Confirmed — this is the strategic frame. Action: save the three tests (Tools-and-Steps, System Test, Hedge Fund P&L Test) into ~/Reeve/research/ and apply them to MACA, Ben, and the UBX legal explorers in this week’s review. Link.
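Two of the four defenses, multi-vendor routing and cost optimisation by sub-task tier, reduce to a lookup plus a cost sum. A rough sketch follows; every model tier, vendor name, price and sub-task label is a made-up placeholder, not a figure from the essay.

```python
# Hypothetical tiers, vendors and prices (USD per 1M tokens), for illustration only.
MODELS = {
    "small": {"vendor": "vendor_a", "cost_per_mtok": 0.5},
    "mid": {"vendor": "vendor_b", "cost_per_mtok": 3.0},
    "frontier": {"vendor": "vendor_c", "cost_per_mtok": 15.0},
}

# Hypothetical mapping from sub-task type to the cheapest tier assumed adequate.
ROUTES = {
    "extract_fields": "small",
    "summarise": "mid",
    "legal_reasoning": "frontier",
}


def route(subtask: str) -> str:
    """Pick a tier for a sub-task; unknown sub-tasks fall back to the frontier tier."""
    return ROUTES.get(subtask, "frontier")


def estimated_cost(plan) -> float:
    """Estimate the cost of a plan given as (subtask, tokens) pairs."""
    return sum(
        MODELS[route(subtask)]["cost_per_mtok"] * tokens / 1_000_000
        for subtask, tokens in plan
    )
```

Falling back to the most capable tier for unknown sub-tasks trades cost for safety; a real router might escalate to a human instead.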
Tier 1 — Harvey’s Legal Agent Benchmark: frontier models max out at 7.1% all-pass (UBX legal explorer validation)
Harvey baselined frontier models on its Legal Agent Benchmark under an “all-pass” standard (every rubric criterion must pass). Results: Claude Opus 4.7 7.1% (lead), Sonnet 4.6 5.4%, Opus 4.6 4.2%, GPT-5.5 2.1%, Gemini 3.5 Flash 0.8%. Harvey’s conclusion: “legal work is far from saturated by frontier intelligence.” Verdict: verified — published by Harvey on their own data, methodology disclosed. This is the exact thesis behind the UBX South Bank Sale franchise + lease playbook architecture — wrap Anthropic’s knowledge-work-plugins/legal /review-contract with playbooks, don’t expect a frontier model to navigate franchise law cold. Action: keep this number ready for any solicitor or buyer’s-counsel conversation during UBX data room walkthroughs. It justifies the playbook layer. Link.
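To see why an all-pass standard produces such low numbers, compare it with a per-criterion pass rate on the same results. The sketch below uses made-up data and is not Harvey's scoring code; the function names are illustrative.

```python
def all_pass_rate(results):
    """Fraction of tasks where every rubric criterion passed.

    `results` is a list of tasks; each task is a list of booleans,
    one per rubric criterion.
    """
    if not results:
        return 0.0
    passed = sum(1 for criteria in results if all(criteria))
    return passed / len(results)


def criterion_pass_rate(results):
    """Fraction of individual criteria passed, pooled across all tasks."""
    flat = [c for criteria in results for c in criteria]
    return sum(flat) / len(flat) if flat else 0.0


# Made-up example: three tasks with four criteria each.
tasks = [
    [True, True, True, True],
    [True, True, True, False],
    [True, False, True, True],
]
```

In this toy data the model passes 10 of 12 criteria but only 1 of 3 tasks under all-pass, so a single missed criterion per task is enough to collapse the headline number.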
2 What You Already Know That Most People Don't
11x’s “guardrails are the product, not a safety feature” — Ben already operates this way
The a16z piece’s deepest line, from 11x’s CEO Prabhav Jain: “Guardrails aren’t just to prevent bad stuff from happening. That’s what your customers are paying you for.” He describes regulated-finance customers needing different guarantees than mid-market SaaS: guardrails that extend down into who the agent contacts, what data it touches, and what it logs. You built that in Ben months ago. Ben/XeroAgent has a 3-tier authority model, Telegram-gated approvals on anything substantive, a full audit trail in SQLite, learning-from-corrections that turns every human override into a future rule, and 90 tests guarding the boundary conditions. The 51-session build that felt like over-engineering at the time is exactly the 11x pattern. When a buyer’s counsel or a CIO asks you about agent governance, the answer is not a deck. It is “here is how Ben handles it: three-tier authority, every action logged, corrections feed forward.”
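For anyone asking what "three-tier authority, every action logged" looks like in practice, here is a generic sketch of the pattern. It is not Ben's code: the tier names, the `POLICY` mapping, the action names and the `audit` table schema are all assumptions made for illustration.

```python
import sqlite3
from enum import Enum


class Tier(Enum):
    AUTO = 1       # hypothetical: agent may act without asking
    APPROVAL = 2   # hypothetical: needs human sign-off, e.g. via a chat prompt
    FORBIDDEN = 3  # hypothetical: never executed by the agent


# Hypothetical action-to-tier mapping; unknown actions default to APPROVAL.
POLICY = {
    "read_invoice": Tier.AUTO,
    "post_journal": Tier.APPROVAL,
    "delete_ledger": Tier.FORBIDDEN,
}


def open_audit_log(path=":memory:"):
    """Open (or create) the SQLite audit table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit ("
        "id INTEGER PRIMARY KEY, action TEXT, tier TEXT, outcome TEXT)"
    )
    return conn


def decide(conn, action, approved_by_human=False):
    """Gate an action by tier and record the decision, whatever it is."""
    tier = POLICY.get(action, Tier.APPROVAL)
    if tier is Tier.AUTO:
        outcome = "executed"
    elif tier is Tier.APPROVAL:
        outcome = "executed" if approved_by_human else "pending_approval"
    else:
        outcome = "blocked"
    conn.execute(
        "INSERT INTO audit (action, tier, outcome) VALUES (?, ?, ?)",
        (action, tier.name, outcome),
    )
    conn.commit()
    return outcome
```

Two choices carry the governance story: unknown actions default to the approval tier instead of auto, and blocked or pending decisions are logged alongside executed ones.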
FurtherAI’s “the workflow IS the intelligence” maps to the UBX legal playbooks you’re writing
Aman Gour’s claim: in insurance, the intelligence does not live in the model — it lives in the workflow itself, in SOPs and undocumented carrier appetite and “which loss signals matter.” His company’s loop: every escalation becomes a signal, every exception is feedback, every correction shows where the runbook was incomplete. That’s exactly what franchise-playbook.md and lease-playbook.md are for UBX. Roy spent ~$20K on UBX franchise legal fees at acquisition because the Perth lawyer didn’t know UBX terms — that’s the “navigation cost vs substantive legal judgment” gap. The Australian Franchising Code positions, ACCC nuances, UBX-specific clauses, Queensland turnover-rent norms — none of that is in any training set. The playbooks ARE the carrier’s operating memory, applied to franchise-resale. You’re not building a legal agent. You’re encoding tribal knowledge into a workflow Anthropic’s /review-contract skill can execute against.
3 Worth a Deeper Look This Week
Anthropic’s “How We Contain Claude Across Products” engineering post (28 min)
Direct link. Anthropic’s own engineering team on how they isolate agent behaviour at the environment layer first, model layer second. Core line: “AI deployment can be risky, but placing a hard limit on the potential damage often shifts the balance in the right direction.” Why it matters for you: this is the same containment philosophy Ben implements (PaperClip sandboxing + 3-tier authority + MCP-scoped Xero access) and what Always-On Reeve Phase 2 needs to formalise. Specific angle: read it through the lens of GUARDRAILS.md — what containment patterns Reeve doesn’t yet have explicit guidance for. 30 min investment, directly improves Phase 2 design.
Felix Rieseberg on how the Claude Cowork lead engineer uses AI (8 min, Lenny’s)
Direct link. Rieseberg demonstrates building a live floor planner from a 2D house plan, mining email as a personal inventory database, and building live dashboards from connected apps. Specific angle for Roy: the live-dashboards-from-connected-apps pattern is exactly what ProjectDashboard wants to become (currently scaffolded but unbuilt), and the email-as-personal-inventory pattern is a precursor to what AI Edge could be on the personal-knowledge side. 30 min, two project hooks.
4 Conversation Capital
“Harvey just benchmarked frontier models on legal work under a real all-pass rubric — Claude Opus 4.7 leads at 7.1%, every other frontier model is below 6, GPT-5.5 is at 2 and Gemini 3.5 Flash is at under 1. Their conclusion is that legal work is nowhere near saturated by frontier intelligence. That’s why I’m not building UBX’s legal document explorer on top of a generic prompt — we’re wrapping Anthropic’s review-contract skill with our own franchise and lease playbooks. The model is fungible; the playbook is the moat.”
Use case: Aria Property Group (Michael Zaicek if CourseBuilds activates), any solicitor or buyer’s counsel question about the UBX data room methodology, any RT AI-pro conversation about why “just use GPT” doesn’t work on regulated workflows. Cites a verifiable benchmark, names a specific architectural decision, signals you read primary sources not headlines.
5 Something You Haven't Thought About
Anthropic is shipping an in-Claude “AI Fluency Scorecard” — 11 behavioural indicators measuring how well users interact with AI. (Source: TestingCatalog leak, surfaced via TLDR.) The wingman read: CourseBuilds just got a measurement instrument it didn’t have before. Every CourseBuilds pilot (Aria first, then onward) currently has a soft “did the team get better at this?” success criterion — vague, hard to defend, hard to price the renewal against. If Anthropic ships the scorecard publicly, CourseBuilds can wrap it as a pre/post measurement for pilots: 6-week pilot starts with baseline scores across the 11 indicators, ends with delta. That’s a sellable artefact for the Tier 2 embedded engagement renewal conversation.
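The pre/post measurement is simple arithmetic once scores exist. A minimal sketch, assuming each indicator yields a numeric score; the indicator names and values below are placeholders, since the actual 11 indicators and their scale have not been published.

```python
def fluency_delta(baseline, outcome):
    """Per-indicator change between baseline and outcome scores.

    Both arguments map indicator name -> score and must share the same keys.
    """
    if baseline.keys() != outcome.keys():
        raise ValueError("baseline and outcome must cover the same indicators")
    return {name: outcome[name] - baseline[name] for name in baseline}


# Placeholder indicator names and scores, for illustration only.
baseline = {"indicator_1": 2, "indicator_2": 3, "indicator_3": 1}
outcome = {"indicator_1": 4, "indicator_2": 3, "indicator_3": 2}

delta = fluency_delta(baseline, outcome)
improved = sorted(name for name, change in delta.items() if change > 0)
```

Rejecting mismatched indicator sets up front keeps a pilot report from silently comparing different instruments at week 0 and week 6.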
Act / queue / drop guidance: Queue. Don’t pivot CourseBuilds activation around something that’s still rumour. But the moment the scorecard ships publicly, the CourseBuilds activation pack adds a “Fluency Baseline + Fluency Outcome” deliverable. Watch for the launch over the next 4-6 weeks. First-mover bar: be the first Australian AI consulting service to wrap the Anthropic-native fluency metric into a delivery format.
6 Skip File
- [TLDR — “xAI warns staff to limit Cursor contact”]: Acquisition-hygiene story, no project relevance.
- [TLDR — “China expands travel curbs to top AI talent”]: Geopolitical, already partially covered yesterday via Bloomberg cite.
- [TLDR — “MAI-Image-2.5 hits #3 on Arena”]: Microsoft image model, no MACA/marketing impact yet; not a text-rendering breakthrough.
- [TLDR — “DeepSWE benchmark”]: New SWE benchmark, interesting but doesn’t change any tooling choice today.
- [TLDR — “NVIDIA CompileIQ auto-tuning”]: GPU kernel optimisation, no impact on Roy’s stack.
- [TLDR — “OpenRouter raises $113M at $1.3B”]: Multi-model router validation, but you don’t use OpenRouter — note the trend.
- [TLDR — “SpaceX two AI compute stories”]: Interesting financial narrative, no action.
- [TLDR — “Claude Mythos solves OpenAI Erdős with cute proof”]: Math + bragging rights, no operator angle.
- [TLDR — “Native Multimodal Models repo”]: Academic catalogue, no project hook.
- [Rundown — “Demis Hassabis on AGI 2030 ±1”]: Quotable but Roy already filed similar quotes; no new conversation capital beyond what we ran 5-23.
- [Rundown — “Jensen on AI-proof subjects / wabi-sabi”]: Conversation capital potential but second-tier vs the Harvey quote.
- [Rundown — “Stanford racial bias in AI hiring”]: Important but old data (2018-22), not actionable for Roy’s projects.
- [Rundown — “ElevenLabs Music v2”]: No project hook. Add to ArtWithZobo idea pile if needed later.
- [Rundown — “Xiaomi MiMo-V2.5 99% API cut”]: Pricing pressure noise, not a stack change.
- [Rundown — guides — “weekly marketing report in Claude Cowork”]: Pattern overlap with existing MACA reporting work; not novel for you.
- [Practicaly — “Six Claude apps thread”]: Motivational but you’re already shipping six apps yourself.
- [Practicaly — “Memdex.ai context portability”]: Interesting tool but ~/Reeve already does cross-session memory natively.
- [Practicaly — “Pose Chrome extension”]: Cute UX pattern, not a Roy use case.
- [Practicaly — “Claude + Apify Reddit content strategy”]: Workflow you could clone but lower priority than MACA copy work.
- [Practicaly — “Build a morning brief with Claude connectors”]: That’s what AI Edge already is.
- [Bagelbots — “$150 humanoid cleaning service in SF”]: Interesting trend, no action.
- [Bagelbots — “ClickUp 22% layoffs + 3,000 AI agents”]: Conversation capital potential but a near-dup of Snap/Salesforce stories already covered.
- [Bagelbots — “Human ghostwriter prompt”]: Prompt collection, the MACA ad-copy issue is structural not promptable.
- [Bagelbots — “American Airlines Starlink fleet”]: Already covered via Information digest a week ago.
- [Bagelbots — “Huawei sanctions-busting chip”]: No operator hook.
- [Bagelbots — “NASA Moon base”]: No.
- [Bagelbots — “Anti-tech extremism surveillance / Erin Brockovich data-center map”]: Vibe story, no action.
- [The Information — “OpenAI generated nearly $6B Q1, boosted by Codex”]: Background context; the Codex Computer Use launch above is the more actionable angle from the same trend.
- [The Information — “Anthropic flexes pricing power, customers eat the cost”]: Theme covered three times in past 10 days (AI cost crisis, usage-based pricing, Anthropic CFO power) — no new news.
- [The Information — “Anthropic in talks to buy developer tools startup”]: Watch for confirmation but no name yet, no action.
- [The Information — “Anthropic and OpenAI’s share of AI startup revenue rises to 89%”]: Same trend as above, repeat framing.
- [The Information — “OpenAI Broadcom $18B chip deal hits financing snag”]: Infra finance story, no operator hook.
- [The Information — “Cerebras IPO winners”]: Infra finance story, no operator hook.
- [The Information — “Anti-drone AI startup $2B valuation” + “Ex-OpenAI researcher’s six-week startup at $4B”]: Valuation noise.
- [Neil Patel — “Google revealed the new SEO playbook”]: Same AEO/freshness/structured-formatting message Patel has run three weeks running.
- [a16z — “Everything, Everywhere is Compliance” (Tuesday)]: Already surfaced in yesterday’s brief as Tier 1.
Brief Metadata
- Sources scanned: 9 (TLDR AI, Rundown AI, Practicaly AI, The Tip, a16z, Bagelbots, Neil Patel, The Information Finance digest, The Information main)
- Items extracted: 47
- Items surfaced: 9 (1 PAY ATTENTION, 3 Tier 1, 2 anxiety-flip, 2 deeper-look, 1 conversation capital, 1 first-mover)
- Items skipped: 36
- Read time: ~9 minutes