At my previous company, I proposed a simple way to automate login flows: use an LLM to tell us where to click, based on a numbered grid. Below is how we arrived at that solution, why we avoided passing the entire DOM to the model, and how it compares to more robust systems.
The Need: Adaptive UI Automation
- We wanted to automate logins for tests and demos without dealing with frequent layout changes.
- Traditional approaches tend to fail when element IDs change or the design is adjusted.
The Grid-Based Quick Win
- We capture a screenshot of the login page.
- We overlay a grid, for example 10×10, labeling each cell from 1 through 100.
- We ask the LLM, “Which cell should we click for the username field?”
- The LLM returns a cell number, which we convert to a screen coordinate and click.
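The last step, turning a cell number into a click position, is simple arithmetic. Here is a minimal sketch, assuming a 10×10 grid numbered left to right, top to bottom, and a click aimed at the center of the cell. The function name `cell_to_point` and its parameters are illustrative, not the code we actually shipped.

```python
from __future__ import annotations


def cell_to_point(
    cell: int, width: int, height: int, rows: int = 10, cols: int = 10
) -> tuple[int, int]:
    """Return the pixel at the center of a 1-based, row-major grid cell.

    Assumes cell 1 is the top-left cell and numbering runs left to right,
    then top to bottom (an assumption; use whatever order your overlay draws).
    """
    if not 1 <= cell <= rows * cols:
        raise ValueError(f"cell {cell} is outside 1..{rows * cols}")
    # Convert the 1-based label into zero-based row and column indices.
    row, col = divmod(cell - 1, cols)
    # Aim for the middle of the cell rather than its top-left corner.
    x = int((col + 0.5) * width / cols)
    y = int((row + 0.5) * height / rows)
    return x, y


print(cell_to_point(45, width=1000, height=800))  # (450, 360)
```

Clicking the cell center keeps the click inside the field even when the model's answer is only approximately right, as long as the field covers most of that cell.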
The grid approach requires only a short text instruction in the prompt, rather than the full DOM or large images. It adapts well if the site’s layout changes: just regenerate the grid and ask again. It also avoids heavy GPU workloads on our side, since we don’t have to run a large vision model ourselves.
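To give a sense of how small the text side of the exchange can be, here is a sketch of building the question and reading the answer back. The prompt wording and the helper names `build_prompt` and `parse_cell` are assumptions for illustration; the call to the model itself is left out because it depends on your provider.

```python
import re


def build_prompt(target: str, rows: int = 10, cols: int = 10) -> str:
    """Build the short question that accompanies the gridded screenshot."""
    total = rows * cols
    return (
        f"The screenshot has a {rows}x{cols} grid drawn over it, with cells "
        f"numbered 1 to {total} from left to right, top to bottom. "
        f"Which cell should we click for the {target}? "
        "Reply with the cell number only."
    )


def parse_cell(reply: str, rows: int = 10, cols: int = 10) -> int:
    """Extract the first integer from the model's reply and range-check it."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no cell number found in reply: {reply!r}")
    cell = int(match.group())
    if not 1 <= cell <= rows * cols:
        raise ValueError(f"cell {cell} is outside 1..{rows * cols}")
    return cell


print(build_prompt("username field"))
print(parse_cell("Cell 45."))  # 45
```

Range-checking the reply matters in practice: a model that answers with an out-of-range number should trigger a retry, not a click in the wrong place.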
The Cost of Passing the Full DOM
- Large HTML trees often exceed the LLM’s context window.
- Prompt tokens become expensive for complex pages.
- Minor changes to the DOM can break the model’s element references.
The grid overlay method avoids these pitfalls by focusing on a minimal text prompt.
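A back-of-the-envelope comparison makes the size gap concrete. The sketch below assumes roughly four characters per token, which is only a rough rule of thumb (real tokenizers vary, and markup often tokenizes less efficiently), and it uses a made-up, repetitive HTML string in place of a real page.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is a heuristic, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


# Hypothetical stand-ins: a large page's HTML versus the short grid question.
dom_html = "<div class='field'><input id='user-1234' type='text'></div>" * 2000
grid_prompt = "Which cell should we click for the username field?"

print("DOM tokens (estimate):", estimate_tokens(dom_html))
print("Grid prompt tokens (estimate):", estimate_tokens(grid_prompt))
```

For real numbers, swap the heuristic for your model's own tokenizer and feed in the actual page source.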
Considering OmniParser
Microsoft’s OmniParser is a powerful system that detects clickable icons, captures semantics of UI elements, and guides vision-language models to perform accurate clicks. It is a strong choice for:
- Detailed understanding of icons and element roles.
- Complex multi-step flows and wide coverage.
We found OmniParser appealing but opted against it because:
- We had limited GPU capacity.
- We only needed a simple login-click solution.
- Deploying a specialized vision model was beyond our immediate scope.
Unexpected Benefits
- Multi-step logins work by repeating the screenshot-grid-prompt-click cycle.
- Sudden UI changes are handled by generating a new overlay.
- The approach is easy to prototype, requiring no advanced training or fine-tuning.
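The repeated screenshot-grid-prompt-click cycle can be sketched as a small loop. Everything here is hypothetical scaffolding: `capture`, `ask_cell`, and `click` are callables you would supply yourself (for example, wrappers around your browser driver and LLM client), and the grid is assumed to be 10×10 and numbered row by row.

```python
from typing import Callable, List, Sequence, Tuple


def run_login_steps(
    targets: Sequence[str],
    capture: Callable[[], bytes],          # returns a gridded screenshot
    ask_cell: Callable[[bytes, str], int],  # asks the LLM for a cell number
    click: Callable[[int, int], None],      # clicks at pixel (x, y)
    width: int,
    height: int,
    rows: int = 10,
    cols: int = 10,
) -> List[Tuple[int, int]]:
    """Run one screenshot-grid-prompt-click cycle per target, in order."""
    clicks: List[Tuple[int, int]] = []
    for target in targets:
        # Take a fresh screenshot every step so layout changes are picked up.
        screenshot = capture()
        cell = ask_cell(screenshot, target)
        if not 1 <= cell <= rows * cols:
            raise ValueError(f"cell {cell} is outside 1..{rows * cols}")
        # Convert the 1-based, row-major label to the cell's center pixel.
        row, col = divmod(cell - 1, cols)
        point = (int((col + 0.5) * width / cols), int((row + 0.5) * height / rows))
        click(*point)
        clicks.append(point)
    return clicks


# Dry run with fakes standing in for the browser and the model.
fake_answers = {"username field": 45, "password field": 55, "submit button": 65}
performed = run_login_steps(
    targets=list(fake_answers),
    capture=lambda: b"fake-screenshot",
    ask_cell=lambda _image, target: fake_answers[target],
    click=lambda x, y: None,
    width=1000,
    height=800,
)
print(performed)  # [(450, 360), (450, 440), (450, 520)]
```

Passing the three operations in as callables keeps the loop testable without a browser or an API key, which is handy while prototyping.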
If your primary goal is to handle logins or simple UI interactions without constant upkeep, a screenshot grid plus LLM instructions might be all you need. For complex, large-scale automation, a more specialized parser could be worth the investment.