Research Paper: Project Cartographer - Automated Visual Procurement

Abstract This paper outlines the architectural design and feasibility of "Project Cartographer," an autonomous system designed to translate handwritten physical shopping lists into structured digital procurement actions. By leveraging multi-modal Large Language Models (LLMs) for Optical Character Recognition (OCR) and high-level reasoning, the system reconciles ambiguous user intent with historical purchase data. The study further explores the implementation of stealth-based browser automation to navigate complex e-commerce environments while bypassing sophisticated bot-detection mechanisms.

Keywords: Multi-modal Agents, Browser Automation, OCR, Visual Procurement, Autonomous Systems.

1. Introduction

In the contemporary digital era, the interface between physical planning (handwritten notes) and digital execution (e-commerce) remains fragmented. Users frequently document needs manually but must perform redundant data entry to fulfill these needs online. Project Cartographer proposes a seamless "Vision-to-Cart" pipeline that automates this transition, significantly reducing cognitive load and procurement friction.

2. System Architecture

The system is comprised of three primary layers: The Perception Layer, the Reconciliation Layer, and the Execution Layer.

2.1 Perception Layer (Vision Intelligence)

Utilizing high-parameter multi-modal models (e.g., Gemini 3 Pro), the system performs non-linear OCR. Unlike traditional OCR, which merely transcribes text, this layer applies semantic parsing to identify quantities, units, and brand preferences even within inconsistent handwriting styles.

2.2 Reconciliation Layer (Historical Context)

Ambiguity is a primary challenge in procurement (e.g., "Milk" vs "Full Cream 1L"). The Reconciliation Layer cross-references parsed items against a JSON-structured purchase history or by scraping the user's "Previous Orders" on the target platform. This ensures high-fidelity matching with established user habits.

2.3 Execution Layer (Browser Automation)

The Execution Layer employs Playwright-based browser control. To ensure successful delivery, the agent must mimic human interaction patterns—incorporating variable delays, non-linear mouse movements, and header spoofing—to navigate anti-bot protections common in high-traffic retail domains.

3. Comparative Methodology

4. Technical Challenges & Ethical Considerations

The primary technical hurdle remains the volatility of e-commerce Document Object Models (DOM). To mitigate this, the system uses "Visual Anchoring," where the agent periodically takes screenshots to re-orient its navigational logic. Furthermore, the handling of user credentials and purchase history requires strict adherence to secure session management protocols within the OpenClaw framework.

5. Conclusion & Future Work

Project Cartographer demonstrates that the integration of vision and action-oriented agents can effectively bridge the physical-digital divide. Future iterations will explore multi-store price optimization and automated 2FA handling to further streamline the procurement lifecycle.

Feature	Manual Procurement	Traditional OCR	Project Cartographer
Data Entry	High (Manual)	Medium (Copy-Paste)	Zero (Autonomous)
Context Awareness	High (Human)	Low (Literal)	High (Predictive)
Platform Agnostic	Yes	No	Yes (Adaptive)

References

[1] OpenClaw Framework Documentation v2.2 (2026). "Agentic Browser Interaction Protocols."

[2] Anthropic Research (2025). "The Role of Multi-Modal Models in Knowledge Work."

[3] Playwright Community (2026). "Stealth Tactics for Automated Navigation."