This is vision-driven browser automation that works from screenshots instead of the DOM, powered by Midscene.js. It runs in headless Puppeteer by default, but can also connect to your existing Chrome via CDP or a browser extension to preserve login sessions. The skill handles clicks, form fills, scrolling, and multi-step workflows by analyzing what's actually visible on screen. One thing to know: it requires a vision model (Gemini, Qwen, Doubao) configured via environment variables, and commands run one at a time since each step involves AI inference on screenshots. Use this when you need to scrape data, test UI, or automate web tasks without wrestling with selectors or accessibility trees.
npx skills add https://github.com/web-infra-dev/midscene-skills --skill browser-automation
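Since the skill is driven by a vision model configured through environment variables, a minimal setup sketch looks like the following. The variable names here follow Midscene's usual OpenAI-compatible configuration, but they are assumptions; check the Midscene.js docs for the exact names your provider (Gemini, Qwen, Doubao) requires.

```shell
# Sketch of a typical configuration (names are assumptions, verify against Midscene docs).
# Midscene talks to an OpenAI-compatible endpoint serving a vision model.
export OPENAI_API_KEY="sk-..."              # API key from your model provider
export OPENAI_BASE_URL="https://..."        # OpenAI-compatible endpoint for that provider
export MIDSCENE_MODEL_NAME="qwen-vl-max"    # example vision model name; any supported VL model
```

Set these in the shell before invoking the skill; without a reachable vision model every step will fail, since each command is grounded by AI inference on a screenshot.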