Lets you automate desktop apps through screenshots and natural language instead of hunting for DOM elements or accessibility IDs. It works across macOS, Windows, and Linux by looking at what is actually on screen, then clicking, typing, and navigating, driven by vision models such as Gemini or Qwen. The workflow is strictly synchronous: take a screenshot, let the model decide what to do, execute one action, repeat. It takes over your real mouse and keyboard, so it is meant for native apps, Electron UIs, or anything that can't run headless; for web pages, use the browser automation skill instead. Commands can be slow, since each one involves AI inference, and you need to configure vision model credentials up front.
npx skills add https://github.com/web-infra-dev/midscene-skills --skill desktop-computer-automation
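The perception-act loop the description outlines can be pictured as a small TypeScript sketch. This is only an illustration of the one-screenshot, one-inference, one-action cadence: `captureScreen`, `planNextAction`, and `executeAction` are hypothetical placeholders, not part of Midscene's API, and the skill itself handles all of this wiring for you.

```ts
// Hypothetical sketch of the screenshot -> inference -> single-action loop.
// The three helpers are placeholders, NOT Midscene APIs.

interface UiAction {
  kind: "click" | "type" | "scroll" | "done";
  target?: { x: number; y: number }; // pixel coordinates chosen by the vision model
  text?: string;                     // text to type when kind === "type"
}

// Declared stubs so the sketch type-checks on its own.
declare function captureScreen(): Promise<Buffer>;                             // grab the current display
declare function planNextAction(goal: string, img: Buffer): Promise<UiAction>; // one vision-model call
declare function executeAction(action: UiAction): Promise<void>;               // drive the real mouse / keyboard

async function runTask(goal: string, maxSteps = 20): Promise<void> {
  for (let step = 0; step < maxSteps; step++) {
    const screenshot = await captureScreen();               // 1. look at the screen
    const action = await planNextAction(goal, screenshot);  // 2. one inference call (the slow part)
    if (action.kind === "done") return;                     //    model reports the goal is reached
    await executeAction(action);                            // 3. exactly one click or keystroke
  }
  throw new Error(`Gave up after ${maxSteps} steps: ${goal}`);
}
```

Because every iteration waits on a model round trip, long tasks are dominated by inference latency, which is why commands can feel slow compared with DOM-based automation.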