# native-devtools-mcp

MCP server for computer-use / desktop automation of native apps (screenshots, OCR, input).

```bash
npm install native-devtools-mcp
```
Give your AI agent "eyes" and "hands" for native desktop applications.
A Model Context Protocol (MCP) server that provides Computer Use capabilities: screenshots, OCR, input simulation, and window management.
[//]: # "Search keywords: MCP, Model Context Protocol, computer use, desktop automation, UI automation, RPA, screenshots, OCR, mouse, keyboard, screen reading, macOS, Windows, native-devtools-mcp"
Features • Installation • For AI Agents • Permissions
---
## Features
- 👀 Computer Vision: Capture screenshots of screens, windows, or specific regions. Includes built-in OCR (text recognition) to "read" the screen.
- 🖱️ Input Simulation: Click, drag, scroll, and type text naturally. Supports global coordinates and window-relative actions.
- 🪟 Window Management: List open windows, find applications, and bring them to focus.
- 🧩 Template Matching: Find non-text UI elements (icons, shapes) using load_image + find_image, returning precise click coordinates.
- 🔒 Local & Private: 100% local execution. No screenshots or data are ever sent to external servers.
- 🔌 Dual-Mode Interaction:
1. Visual/Native: Works with any app via screenshots & coordinates (Universal).
2. AppDebugKit: Deep integration for supported apps to inspect the UI tree (DOM-like structure).
## For AI Agents

This MCP server is designed to be highly discoverable and usable by AI models (Claude, Gemini, GPT).
- 📄 Read AGENTS.md: A compact, token-optimized technical reference designed specifically for ingestion by LLMs. It contains intent definitions, schema examples, and reasoning patterns.
Core Capabilities for System Prompts:
1. take_screenshot: The "eyes". Returns images + layout metadata + text locations (OCR).
2. click / type_text: The "hands". Interacts with the system based on visual feedback.
3. find_text: A shortcut to find text on screen and get its coordinates immediately.
4. load_image / find_image: Template matching for non-text UI elements (icons, shapes), returning screen coordinates for clicking.
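Clients invoke these tools over the standard MCP `tools/call` method. A sketch of one request follows; the argument shape is illustrative, not the server's exact schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "find_text",
    "arguments": { "text": "Submit" }
  }
}
```

The result carries the matched text's screen coordinates, which can be fed directly into a `click` call.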
## Installation

The install steps are identical on macOS and Windows.

Run without installing:

```bash
npx -y native-devtools-mcp
```

Or install globally:

```bash
npm install -g native-devtools-mcp
```

Build from source:

```bash
git clone https://github.com/sh3ll3x3c/native-devtools-mcp
cd native-devtools-mcp
cargo build --release
```

Binary: `./target/release/native-devtools-mcp`
Claude Desktop config file: `~/Library/Application Support/Claude/claude_desktop_config.json`

Claude Desktop requires the signed app bundle (npx/npm will not work due to Gatekeeper):
1. Download NativeDevtools-X.X.X.dmg from GitHub Releases
2. Open the DMG and drag NativeDevtools.app to /Applications
3. Configure Claude Desktop:
```json
{
  "mcpServers": {
    "native-devtools": {
      "command": "/Applications/NativeDevtools.app/Contents/MacOS/native-devtools-mcp"
    }
  }
}
```
4. Restart Claude Desktop - it will prompt for Screen Recording and Accessibility permissions for NativeDevtools
> Note: Claude Code (CLI) can use either the signed app or npx - both work.
Claude Desktop config file: `%APPDATA%\Claude\claude_desktop_config.json`
For Windows (or macOS with Claude Code CLI):
```json
{
  "mcpServers": {
    "native-devtools": {
      "command": "npx",
      "args": ["-y", "native-devtools-mcp"]
    }
  }
}
```
> Note: Requires Node.js 18+ installed.
To avoid approving every single tool call (clicks, screenshots), you can add this wildcard permission to your project's settings or global config:
File: .claude/settings.local.json (or similar)
```json
{
  "permissions": {
    "allow": ["mcp__native-devtools__*"]
  }
}
```
We provide two ways for agents to interact, allowing them to choose the best tool for the job.

1. Visual/Native (Universal)

* Best for: any application, via screenshots and coordinates.
* Tools: take_screenshot, find_text, click, type_text (plus load_image / find_image for icons and shapes).
* Example: "Click the button that looks like a gear icon." → use find_image with a gear template.
2. AppDebugKit (Deep Integration)

* Best for: apps specifically instrumented with our AppDebugKit library (mostly for developers testing their own apps).
* How it works: the agent connects to a debug port and queries the UI tree (like the HTML DOM).
* Tools: app_connect, app_query, app_click.
* Example: `app_click(element_id="submit-button")`

## 🧩 Template Matching (find_image)
Use `find_image` when the target is not text (icons, toggles, custom controls) and OCR or `find_text` cannot identify it.

Typical flow:

1. `take_screenshot(app_name="MyApp")` → `screenshot_id`
2. `load_image(path="/path/to/icon.png")` → `template_id`
3. `find_image(screenshot_id="...", template_id="...")` → matches with `screen_x`/`screen_y`
4. `click(x=..., y=...)`

Fast vs Accurate:
- `fast` (default): uses downscaling and early-exit for speed.
- `accurate`: uses full resolution, a wider scale search, and a smaller stride for thorough matching.
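The four-step flow above can be sketched end to end. `call_tool` below is a hypothetical stand-in for your MCP client, with canned responses so the sketch is runnable; a real client would send `tools/call` requests to native-devtools-mcp instead:

```python
# Hypothetical MCP client helper with canned responses (for illustration only).
def call_tool(name, **arguments):
    canned = {
        "take_screenshot": {"screenshot_id": "shot-1"},
        "load_image": {"template_id": "tmpl-1"},
        "find_image": {"matches": [{"screen_x": 412, "screen_y": 87, "score": 0.97}]},
        "click": {"ok": True},
    }
    return canned[name]

shot = call_tool("take_screenshot", app_name="MyApp")          # step 1
template = call_tool("load_image", path="/path/to/icon.png")   # step 2
found = call_tool("find_image",                                # step 3
                  screenshot_id=shot["screenshot_id"],
                  template_id=template["template_id"])
best = found["matches"][0]                                     # highest-scoring match
call_tool("click", x=best["screen_x"], y=best["screen_y"])     # step 4
```

The key point is that `find_image` hands back screen-space coordinates, so the agent can click the match without doing any coordinate math itself.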
Optional inputs like `mask_id`, `search_region`, `scales`, and `rotations` can improve precision and performance.

## 🏗️ Architecture
```mermaid
graph TD
    Client[Claude / LLM Client] <-->|JSON-RPC 2.0| Server[native-devtools-mcp]
    Server -->|Direct API| Sys[System APIs]
    Server -->|WebSocket| Debug[AppDebugKit]
    subgraph "Your Machine"
        Sys -->|Screen/OCR| macOS[CoreGraphics / Vision]
        Sys -->|Input| Win[Win32 / SendInput]
        Debug -.->|Inspect| App[Target App]
    end
```
## 🔧 Technical Details (Under the Hood)
| OS | Feature | API Used |
|----|---------|----------|
| macOS | Screenshots | `screencapture` (CLI) |
| | Input | `CGEvent` (CoreGraphics) |
| | OCR | `VNRecognizeTextRequest` (Vision framework) |
| Windows | Screenshots | `BitBlt` (GDI) |
| | Input | `SendInput` (Win32) |
| | OCR | `Windows.Media.Ocr` (WinRT) |
Screenshots include metadata for accurate coordinate conversion:

- `screenshot_origin_x/y`: Screen-space origin of the captured area (in points)
- `screenshot_scale`: Display scale factor (e.g., 2.0 for Retina displays)
- `screenshot_pixel_width/height`: Actual pixel dimensions of the image
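Applying the conversion formula below to this metadata can be wrapped in a small helper. A runnable sketch (the function name is ours, not one of the server's tools):

```python
def to_screen(pixel_x, pixel_y, meta):
    """Convert pixel coordinates inside a screenshot to global screen points,
    using the screenshot_* metadata fields described above."""
    scale = meta["screenshot_scale"]
    return (meta["screenshot_origin_x"] + pixel_x / scale,
            meta["screenshot_origin_y"] + pixel_y / scale)

# Example: a 2x (Retina) capture whose top-left corner sits at (100, 50) points.
meta = {"screenshot_origin_x": 100, "screenshot_origin_y": 50, "screenshot_scale": 2.0}
print(to_screen(400, 300, meta))  # (300.0, 200.0)
```

Dividing by the scale factor first matters: on a 2x display, a pixel offset of 400 in the image is only 200 points on screen.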
- `screenshot_window_id`: Window ID (for window captures)

Coordinate conversion:

```
screen_x = screenshot_origin_x + (pixel_x / screenshot_scale)
screen_y = screenshot_origin_y + (pixel_y / screenshot_scale)
```

Implementation notes:

- Window captures (macOS): Uses `screencapture -o`, which excludes the window shadow. The captured image dimensions match `kCGWindowBounds` × scale exactly, ensuring click coordinates derived from screenshots land on the intended UI elements.
- Region captures: Origin coordinates are aligned to integers to match the actual captured area.

## 🛡️ Privacy, Safety & Best Practices
* 100% Local: All processing (screenshots, OCR, logic) happens on your device.
* No Cloud: Images are never uploaded to any third-party server by this tool.
* Open Source: You can inspect the code to verify exactly what it does.
* Hands Off: When the agent is "driving" (clicking/typing), do not move your mouse or type. *Why?* Real hardware inputs can conflict with the simulated ones, causing clicks to land in the wrong place.
* Focus Matters: Ensure the window you want the agent to use is visible. If a popup steals focus, the agent might type into the wrong window unless it checks first.

## 🔐 Required Permissions (macOS)
On macOS, you must grant permissions to the host application (e.g., Terminal, VS Code, Claude Desktop) to allow screen recording and input control.
1. Screen Recording: Required for `take_screenshot`.
   *System Settings > Privacy & Security > Screen Recording*
2. Accessibility: Required for `click`, `type_text`, `scroll`.

> Restart Required: After granting permissions, you must fully quit and restart the host application.
**Windows:** Works out of the box on Windows 10/11.
* Uses standard Win32 APIs (GDI, SendInput).
* OCR uses the built-in Windows Media OCR engine (offline).
* Note: Cannot interact with "Run as Administrator" windows unless the MCP server itself is also running as Administrator.
MIT © sh3ll3x3c