Week of May 31, 2026

AI agents are graduating from helpers to autonomous workers this week, literally controlling your computer while you're away, generating production-ready code through design tools, and executing complex task lists in the background. Meanwhile, the infrastructure is catching up: inference speed is becoming an intelligence lever, and evaluation frameworks are preventing expensive AI disasters before they hit production.

This Week's Top Videos

Windows Computer Use and mobile access for Codex

By OpenAI

OpenAI's Codex now lets you literally walk away from your desk while AI agents control your entire Windows computer and apps autonomously. You can monitor and start new tasks remotely via mobile app while the AI works in the background. This marks the shift from AI assistants to true AI workers that operate independently.

The Computer Use feature lets Codex take complete control of your Windows desktop and cursor to perform tasks autonomously across any installed application
Remote monitoring lets you check progress and start new AI tasks from the ChatGPT app on iOS or Android while away from your desk
Browser-specific tasks should use the Codex Chrome integration instead, since it can work across multiple tabs simultaneously in the background
You can @mention specific installed applications when setting up Computer Use tasks for more targeted automation
Setup requires enabling Computer Use in settings, scanning a QR code for mobile access, and keeping the computer connected to the internet

Read the full summary →

Inference, Diffusion, World Models, and More | YC Paper Club

By Y Combinator

Inference speed is becoming a capability lever, not just a cost optimization: more tokens per second translates into higher peak intelligence, because a reasoning model can spend more compute on a problem in the same amount of time. YC's first paper club featured speculative decoding techniques that dramatically accelerate model inference, with one algorithm running visibly faster than standard approaches. This matters now because RL is exceeding pre-training compute requirements and inference costs are dominating at scale.

Inference costs now dominate training costs when serving models to billions of users, requiring trillions of tokens at scale
RL compute requirements are starting to exceed pre-training needs, and RL is essentially just a wrapper around inference
Inference speed will become a capability metric within 1-3 years—tokens per second directly determines peak intelligence for reasoning models
Speculative decoding uses a small model as a proxy to dramatically speed up sampling from larger models without changing output quality
YC's Winter 16 batch produced 10-15 unicorns out of 140 companies, and it included the early OpenAI team, who asked founders what problems to research
Half of Bay Area AI talent is in Peninsula/South Bay (Google DeepMind, Tesla, xAI) versus city-based companies like Anthropic and OpenAI
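
The speculative decoding idea in the takeaways above can be sketched in a few lines. This is a minimal illustration of the greedy variant only, not the algorithm presented in the talk: a small draft model proposes several tokens, the large target model checks them, and every token the target agrees with is accepted in one verification pass. The toy `target_model` and `draft_model` functions are hypothetical stand-ins for real networks, and production systems typically use a probabilistic accept/reject rule over full token distributions rather than exact matching.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    """Hypothetical large model: an arbitrary deterministic rule over the prefix."""
    return (sum(prefix) * 7 + len(prefix)) % 10


def draft_model(prefix: List[int]) -> int:
    """Hypothetical small model: agrees with the target except at every 4th position."""
    guess = target_model(prefix)  # toy shortcut; a real draft model is a separate network
    return guess if len(prefix) % 4 else (guess + 1) % 10


def greedy_decode(model: Model, prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_tokens: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return the generated tokens and the number of target verification passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real system scores all
        #    positions in one batched forward pass, so we count this as one pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every draft token matched, so the target's next token comes for free.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens], target_passes
```

Calling `speculative_decode(target_model, draft_model, [1, 2, 3], 20)` yields the same tokens as `greedy_decode(target_model, [1, 2, 3], 20)`, because every emitted token is one the target would have chosen anyway. The saving is in how many target passes it takes to get there, which depends on how often the draft model guesses correctly.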

Read the full summary →

I Stopped Using PowerPoint After Building This Claude Code Skill (Full Tutorial + 3 Templates)

By Peter Yang

A product manager built a Claude skill that generates fully interactive HTML slide decks in minutes, eliminating the need for PowerPoint or Google Slides. The system includes 12 slide formats, AI-powered QA agents that screenshot and fix layout issues, and generates animated charts with hover interactions—turning hour-long deck creation into a few-minute process.

HTML slide decks can include fully interactive charts where users can hover to see data, making presentations far more engaging than static PowerPoint charts
Building a QA agent that screenshots every slide and checks for layout issues before delivery ensures consistent quality without manual review
The skill supports 12 common slide formats including two-column layouts, stack grids for stats, comparison tables, animated step processes, and technical code blocks
AI can perform web research during slide generation to pull in current facts and data, making decks more comprehensive than manually built ones
Deck creation that previously took an hour or more by hand now completes in minutes, with automated animations and professional styling
The system reads workflow templates from scale.md and styles.md files, then asks clarifying questions before generating the complete HTML deck

Read the full summary →

Meet Gemini Spark, your 24/7 personal AI agent✨

By Google

Google's Gemini Spark lets you brain-dump multiple complex tasks at speaking speed—calendar changes, personal notes, and deadline-organized documents—then executes them autonomously in background threads with approval gates. This 'throw tasks over your shoulder' approach could fundamentally change how founders manage operational overhead while scaling.

Voice-to-execution workflow allows brain-dumping multiple complex tasks at natural speaking speed, with AI parsing and breaking them into individual executable threads automatically
Background task processing means you can delegate work and completely context-switch, with the AI working autonomously while you focus on other priorities
Built-in approval gates prevent unwanted actions by asking for user input on sensitive tasks, balancing automation with control
Deep integration with Google Workspace allows automatic formatting, color coding, and structured document creation without manual setup
Multi-threaded task management automatically subdivides complex requests into organized, actionable items with deadline and priority categorization

Read the full summary →

New capabilities coming to Figma Make

By Figma

Figma Make now lets designers edit code directly through familiar Figma panels and deploy to production via normal pull requests. No more prompting the AI for tiny changes: select elements, change layouts, alter text, and annotate with voice feedback. This bridges the designer-developer gap by making code editable through design tools.

Figma Make automatically spins up dev servers from your codebase, letting you preview and edit your actual production app rather than prototypes
Designers can now make code changes through familiar Figma editing panels instead of prompting AI for tiny adjustments
New annotation system allows voice feedback, image uploads, and linking Figma frames directly on rendered screens for better AI context
All design edits and annotations get passed to AI models simultaneously, eliminating the need for iterative prompting
Engineers receive normal pull requests with unchanged workflows, while CI checks and review processes remain identical
MCP server integration allows engineering-level conflict resolution and CI check debugging within the design interface

Read the full summary →

The maturity phases of running evals — Phil Hetzel, Braintrust

By ai.engineer

Most teams are still vibes-checking their AI agents, even though the maturity framework in this 18-minute talk could prevent production disasters. Phil Hetzel outlines four evaluation phases, from human annotation with justification to advanced techniques, that separate successful AI products from expensive proofs of concept. Critical for teams pushing agents to production right now.

Start with vibes-checking but document everything—have humans give thumbs up/down AND justification to extract domain knowledge for eventual LLM-as-judge scaling
Build evals around specific failure modes, not exhaustively like unit tests—infinite edge cases will keep you testing forever instead of shipping
Eval results don't need 100% accuracy—directional trends from LLM-as-judge techniques are sufficient as long as you're improving
Four maturity phases: just getting started, measuring to manage, accounting for complexity, and advanced eval techniques—complexity increases with agent sophistication
Companies are prolific at AI proofs of concept but struggle to bring them to production; evals bridge this critical gap for real user deployment
Three eval primitives: task (agent under test), dataset (example inputs), and scoring functions (quality judgment)—foundation for systematic evaluation
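
The three primitives in the last takeaway map naturally onto a small harness. The sketch below is an assumption-laden illustration in plain Python, not Braintrust's API: `capital_agent`, `Example`, `exact_match`, `non_empty`, and `run_eval` are all hypothetical names, and a lookup table stands in for the agent under test.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """One dataset row: an input for the task and the answer we hope to get."""
    input: str
    expected: str


def capital_agent(question: str) -> str:
    """Hypothetical task (agent under test); a lookup table replaces a model call."""
    answers = {
        "Capital of France?": "Paris",
        "Capital of Japan?": "Tokyo",
        "Capital of Australia?": "Sydney",  # deliberately wrong, to show a failure
        "Capital of Canada?": "Ottawa",
    }
    return answers.get(question, "")


# Dataset: example inputs paired with expected outputs.
DATASET: List[Example] = [
    Example("Capital of France?", "Paris"),
    Example("Capital of Japan?", "Tokyo"),
    Example("Capital of Australia?", "Canberra"),
    Example("Capital of Canada?", "Ottawa"),
]

# A scoring function judges one output and returns a value between 0 and 1.
Scorer = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the answer matches, ignoring case and surrounding spaces."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def non_empty(output: str, expected: str) -> float:
    """Score 1.0 when the task produced any answer at all."""
    return 1.0 if output.strip() else 0.0


def run_eval(
    task: Callable[[str], str], dataset: List[Example], scorers: Dict[str, Scorer]
) -> Dict[str, float]:
    """Run the task over the dataset and return the mean score per scorer."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    totals = {name: 0.0 for name in scorers}
    for example in dataset:
        output = task(example.input)
        for name, scorer in scorers.items():
            totals[name] += scorer(output, example.expected)
    return {name: total / len(dataset) for name, total in totals.items()}
```

Running `run_eval(capital_agent, DATASET, {"exact_match": exact_match, "non_empty": non_empty})` returns one average per scorer. Swapping `exact_match` for an LLM-as-judge scorer would keep the same shape, which is why a directional trend across runs is enough to tell whether the agent is improving.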

Read the full summary →