Week of May 31, 2026
AI agents are graduating from helpers to autonomous workers this week: controlling your computer while you're away, generating production-ready code through design tools, and executing complex task lists in the background. Meanwhile, the infrastructure is catching up: inference speed is becoming an intelligence lever, and evaluation frameworks are preventing expensive AI disasters before they hit production.
This Week's Top Videos
Windows Computer Use and mobile access for Codex
By OpenAI
OpenAI's Codex now lets you literally walk away from your desk while AI agents control your entire Windows computer and apps autonomously. You can monitor and start new tasks remotely via mobile app while the AI works in the background. This marks the shift from AI assistants to true AI workers that operate independently.
- Computer Use feature allows Codex to take complete control of your Windows desktop and cursor to perform tasks across any installed application autonomously
- Remote monitoring capability lets you check progress and start new AI tasks from iOS/Android ChatGPT app while away from your desk
- Browser-specific tasks should use the Codex Chrome integration instead, since it can work across multiple tabs simultaneously in the background
- You can @mention specific installed applications when setting up Computer Use tasks for more targeted automation
- Setup requires enabling Computer Use in settings, scanning a QR code for mobile access, and keeping the computer connected to the internet
Inference, Diffusion, World Models, and More | YC Paper Club
By Y Combinator
Inference speed is becoming a capability lever, not just a cost optimization: more tokens per second lets a reasoning model do more thinking in the same wall-clock time, which raises its peak intelligence. YC's first paper club featured speculative decoding techniques that accelerate model inference, with one algorithm running visibly faster than standard approaches. This matters now because RL is starting to exceed pre-training compute requirements and inference costs dominate at scale.
- Inference costs now dominate training costs when serving models to billions of users, requiring trillions of tokens at scale
- RL compute requirements are starting to exceed pre-training needs, and RL is essentially just a wrapper around inference
- Inference speed will become a capability metric within 1-3 years—tokens per second directly determines peak intelligence for reasoning models
- Speculative decoding uses a small model as a proxy to dramatically speed up sampling from larger models without changing output quality
- YC's Winter 16 batch produced 10-15 unicorns out of 140 companies; it also included the early OpenAI team, who were asking founders what problems to research
- Half of Bay Area AI talent is in Peninsula/South Bay (Google DeepMind, Tesla, xAI) versus city-based companies like Anthropic and OpenAI
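To make the speculative decoding bullet above concrete, here is a minimal sketch of the standard accept/reject scheme, written against toy models rather than real networks. It is not the specific algorithm shown at the paper club. The names `speculative_step`, `target_model`, and `draft_model`, the three-token vocabulary, and the fixed probabilities are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence

Dist = List[float]                       # probabilities over a tiny vocabulary
Model = Callable[[Sequence[int]], Dist]  # context -> next-token distribution


def sample(dist: Dist, rng: random.Random) -> int:
    """Draw one token id from a (possibly unnormalized) categorical distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(target: Model, draft: Model, context: List[int],
                     k: int, rng: random.Random) -> List[int]:
    """Propose k draft tokens, then accept or correct them with the target model."""
    # 1. The cheap draft model proposes k tokens one after another.
    proposed, draft_dists, ctx = [], [], list(context)
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        proposed.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. The target model checks each proposal. A real system scores all k
    #    positions in one batched forward pass; a plain loop stands in here.
    accepted: List[int] = []
    ctx = list(context)
    for tok, q in zip(proposed, draft_dists):
        p = target(ctx)
        # Accept the draft token with probability min(1, p[tok] / q[tok]).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(sample(residual, rng))
        return accepted

    # 3. Every draft token was accepted: take one bonus token from the target.
    accepted.append(sample(target(ctx), rng))
    return accepted


# Toy stand-ins (hypothetical): both models ignore the context.
def target_model(ctx: Sequence[int]) -> Dist:
    return [0.6, 0.3, 0.1]


def draft_model(ctx: Sequence[int]) -> Dist:
    return [0.4, 0.4, 0.2]


if __name__ == "__main__":
    print(speculative_step(target_model, draft_model, [], k=3, rng=random.Random(0)))
```

Each call returns between 1 and k + 1 tokens. The accept/reject rule is what keeps output quality unchanged: the emitted tokens follow the target model's distribution, and the draft model only affects how many tokens are produced per target pass.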
I Stopped Using PowerPoint After Building This Claude Code Skill (Full Tutorial + 3 Templates)
By Peter Yang
A product manager built a Claude skill that generates fully interactive HTML slide decks in minutes, eliminating the need for PowerPoint or Google Slides. The system includes 12 slide formats, AI-powered QA agents that screenshot and fix layout issues, and generates animated charts with hover interactions—turning hour-long deck creation into a few-minute process.
- HTML slide decks can include fully interactive charts where users can hover to see data, making presentations far more engaging than static PowerPoint charts
- Building a QA agent that screenshots every slide and checks for layout issues before delivery ensures consistent quality without manual review
- The skill supports 12 common slide formats including two-column layouts, stack grids for stats, comparison tables, animated step processes, and technical code blocks
- AI can perform web research during slide generation to pull in current facts and data, making decks more comprehensive than manual creation
- Deck creation that previously took an hour or more by hand now completes in minutes, with automated animations and professional styling
- The system reads workflow templates from scale.md and styles.md files, then asks clarifying questions before generating the complete HTML deck
Meet Gemini Spark, your 24/7 personal AI agent✨
By Google
Google's Gemini Spark lets you brain-dump multiple complex tasks at speaking speed—calendar changes, personal notes, and deadline-organized documents—then executes them autonomously in background threads with approval gates. This 'throw tasks over your shoulder' approach could fundamentally change how founders manage operational overhead while scaling.
- Voice-to-execution workflow allows brain-dumping multiple complex tasks at natural speaking speed, with AI parsing and breaking them into individual executable threads automatically
- Background task processing means you can delegate work and completely context-switch, with the AI working autonomously while you focus on other priorities
- Built-in approval gates prevent unwanted actions by asking for user input on sensitive tasks, balancing automation with control
- Deep integration with Google Workspace allows automatic formatting, color coding, and structured document creation without manual setup
- Multi-threaded task management automatically subdivides complex requests into organized, actionable items with deadline and priority categorization
New capabilities coming to Figma Make
By Figma
Figma Make now lets designers edit code directly through familiar Figma panels and deploy to production via normal pull requests. No more prompting AI for tiny changes: select elements, change layouts, alter text, and annotate with voice feedback. This bridges the designer-developer gap by making code editable through design tools.
- Figma Make automatically spins up dev servers from your codebase, letting you preview and edit your actual production app rather than prototypes
- Designers can now make code changes through familiar Figma editing panels instead of prompting AI for tiny adjustments
- New annotation system allows voice feedback, image uploads, and linking Figma frames directly on rendered screens for better AI context
- All design edits and annotations get passed to AI models simultaneously, eliminating the need for iterative prompting
- Engineers receive normal pull requests with unchanged workflows, while CI checks and review processes remain identical
- MCP server integration allows engineering-level conflict resolution and CI check debugging within the design interface
The maturity phases of running evals — Phil Hetzel, Braintrust
By ai.engineer
Most teams are still vibes-checking their AI agents, when the maturity framework laid out in this 18-minute talk could prevent production disasters. Phil Hetzel outlines 4 evaluation phases, from human annotation with justification through to advanced techniques, that separate successful AI products from expensive proof-of-concepts. Critical for teams pushing agents to production right now.
- Start with vibes-checking but document everything—have humans give thumbs up/down AND justification to extract domain knowledge for eventual LLM-as-judge scaling
- Build evals around specific failure modes, not exhaustively like unit tests—infinite edge cases will keep you testing forever instead of shipping
- Eval results don't need 100% accuracy—directional trends from LLM-as-judge techniques are sufficient as long as you're improving
- Four maturity phases: just getting started, measuring to manage, accounting for complexity, and advanced eval techniques—complexity increases with agent sophistication
- Companies are prolific at AI proof-of-concepts but struggle bringing them to production—evals bridge this critical gap for real user deployment
- Three eval primitives: task (agent under test), dataset (example inputs), and scoring functions (quality judgment)—foundation for systematic evaluation
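The three primitives in the last bullet fit in a few lines of code. The sketch below is a generic illustration, not Braintrust's SDK: `Example`, `run_eval`, `exact_match`, `contains_expected`, and `toy_agent` are hypothetical names, and the two-row dataset is made up for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Example:
    """One dataset row: an input for the task and a reference answer."""
    input: str
    expected: str


Task = Callable[[str], str]               # the agent under test
Scorer = Callable[[Example, str], float]  # quality judgment in [0, 1]


def exact_match(example: Example, output: str) -> float:
    """Score 1.0 only when the output equals the reference (case-insensitive)."""
    return 1.0 if output.strip().lower() == example.expected.strip().lower() else 0.0


def contains_expected(example: Example, output: str) -> float:
    """Looser scorer: the reference answer appears somewhere in the output."""
    return 1.0 if example.expected.lower() in output.lower() else 0.0


def run_eval(task: Task, dataset: List[Example],
             scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Run the task on every example and average each scorer over the dataset."""
    outputs = [task(ex.input) for ex in dataset]
    return {
        name: mean(scorer(ex, out) for ex, out in zip(dataset, outputs))
        for name, scorer in scorers.items()
    }


# Hypothetical stand-in for a real agent: canned answers keyed by prompt.
def toy_agent(prompt: str) -> str:
    canned = {"2+2": "4", "capital of France": "It is Paris."}
    return canned.get(prompt, "")


DATASET = [Example("2+2", "4"), Example("capital of France", "Paris")]
SCORERS = {"exact_match": exact_match, "contains_expected": contains_expected}

if __name__ == "__main__":
    print(run_eval(toy_agent, DATASET, SCORERS))
```

An LLM-as-judge scorer would slot in as another `Scorer` function, and the human thumbs up/down plus justification from the early phase can seed both the dataset rows and the judge's instructions. Tracking the averaged scores across runs gives the directional trend the talk describes.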