Automates iOS devices through natural-language commands and screenshot analysis rather than traditional DOM scraping or accessibility APIs. You connect a device; Claude then takes a screenshot, analyzes it, and issues commands like "tap the login button" or "scroll to the bottom" based on what it sees. The workflow is strictly synchronous: one command at a time, wait for the new screenshot, analyze, then act. It's built on Midscene and requires a vision model such as Gemini or Qwen, configured via environment variables (see the sketch below the install command). Keep in mind that each command can take up to a minute because of AI inference, so this isn't built for speed. Good for testing apps where traditional automation hooks don't exist, or when you want to drive interactions purely from visual state.
npx skills add https://github.com/web-infra-dev/midscene-skills --skill ios-device-automation
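A minimal sketch of the vision-model configuration, assuming Midscene's OpenAI-compatible environment variables; the exact variable names, supported flags, and the model name shown here are assumptions that may differ across Midscene versions, so verify against the current Midscene model docs.

```bash
# Point Midscene at an OpenAI-compatible endpoint that serves a vision model.
# Variable names follow Midscene's model configuration as commonly documented; confirm before use.
export OPENAI_API_KEY="sk-..."                      # API key for your model provider (placeholder)
export OPENAI_BASE_URL="https://your-provider/v1"   # OpenAI-compatible endpoint URL (assumption: provider-specific)
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"     # vision model used for screenshot analysis (illustrative)
export MIDSCENE_USE_QWEN_VL=1                       # enable Qwen-VL-specific handling; omit for other model families
```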