Takes control of your actual mouse and keyboard to automate desktop apps through screenshots and natural language commands, powered by Midscene.js. Works on native apps (Electron, Qt, anything that won't run in a browser) across macOS, Windows, and Linux, plus remote Windows machines over RDP. The workflow is straightforward: run one command at a time, wait for the screenshot, analyze what happened, then decide the next move. No DOM scraping, no accessibility tree, just pixel vision and AI reasoning. One honest catch: in local mode it takes over your physical cursor and keyboard, so you can't use the machine for anything else while it runs. For web automation, use the browser skill instead. Only reach for this when you need to drive something that lives outside a browser window.
npx skills add https://github.com/web-infra-dev/midscene-skills --skill computer-automation
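
The one-command-at-a-time workflow described above can be sketched as a simple loop. This is only an illustration of the control flow, not Midscene.js code: `StepResult`, `run_one_at_a_time`, `execute`, and `decide_next` are hypothetical names, and the executor here is a stub that touches no real mouse, keyboard, or screen.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    """Outcome of one command (hypothetical structure, for illustration only)."""
    command: str
    screenshot: str  # stand-in for the screenshot taken after the command


def run_one_at_a_time(
    first_command: str,
    execute: Callable[[str], StepResult],
    decide_next: Callable[[StepResult], Optional[str]],
    max_steps: int = 10,
) -> List[StepResult]:
    """Run a command, wait for its result, analyze it, then pick the next one.

    The loop never issues a second command before the first has returned,
    and it stops when decide_next returns None or max_steps is reached.
    """
    history: List[StepResult] = []
    command: Optional[str] = first_command
    while command is not None and len(history) < max_steps:
        result = execute(command)      # blocks until the screenshot is available
        history.append(result)
        command = decide_next(result)  # analysis happens between commands
    return history


# Stub executor: pretends to act and returns a fake screenshot description.
def fake_execute(command: str) -> StepResult:
    return StepResult(command=command, screenshot=f"screen after: {command}")


# Stub decision step: replays a fixed plan, then signals completion with None.
_plan = iter(["type 'hello'", "press Enter"])


def fake_decide(result: StepResult) -> Optional[str]:
    return next(_plan, None)


history = run_one_at_a_time("open the text editor", fake_execute, fake_decide)
for step in history:
    print(step.command, "->", step.screenshot)
```

In real use, `execute` would be whatever issues a single natural language command to the skill and returns its screenshot, and `decide_next` would be the analysis step that reads that screenshot. The `max_steps` cap is a design choice of this sketch to keep a confused loop from holding the cursor indefinitely.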