WhatsApp Automation Interface

Application Design & Architecture Brief

1. Introduction

This document outlines the conceptual design for a desktop application that interfaces with the WhatsApp desktop client. The primary goal is to provide automation capabilities, supervised by AI agents, for tasks such as downloading media, summarising documents, and other business operations, by programmatically observing and interacting with the WhatsApp UI through screen capture, text recognition, and simulated input.

2. Features & Workflow

Core Features:

  • Real-time screen mirroring of the WhatsApp window.
  • Automated detection of media elements (images, videos, files).
  • Recognition of on-screen text instructions.
  • User-initiated commands to download specific or all detected media.
  • Activity logging for all operations.

Workflow:

  1. User starts the application and positions the WhatsApp window.
  2. The backend service continuously captures the WhatsApp window.
  3. The captured feed is streamed to this frontend interface.
  4. User clicks an action button (e.g., "Download All Images") or triggers another WhatsApp function.
  5. The command, or the captured text, is sent to the backend.
  6. The backend uses computer vision to decode the captured text, locate the media elements, and simulate "download" and other clicks on them.
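Steps 4–6 above amount to routing named commands from the frontend to backend handlers and recording the outcome in an activity log. A minimal sketch of that dispatch layer, using only the standard library (the `CommandBus` name and the action strings are illustrative, not part of the brief):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class CommandBus:
    """Routes frontend commands (workflow steps 4-5) to backend handlers (step 6)."""
    handlers: Dict[str, Callable[[], str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def register(self, action: str, handler: Callable[[], str]) -> None:
        self.handlers[action] = handler

    def dispatch(self, action: str) -> str:
        # Unknown actions are logged rather than raised, so the frontend
        # activity log always receives a result (workflow feedback).
        if action not in self.handlers:
            result = f"unknown action: {action}"
        else:
            result = self.handlers[action]()
        self.log.append(result)
        return result

bus = CommandBus()
bus.register("download_images", lambda: "queued: download_images")
print(bus.dispatch("download_images"))  # queued: download_images
```

In the real backend, the registered handlers would invoke the vision and automation services instead of returning placeholder strings.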

3. Architecture & Technology Stack

Frontend

React (with TypeScript) & Tailwind CSS

Responsible for user interface, command invocation, and displaying the screen feed & logs.

WebSockets/HTTP

Backend

Python (with FastAPI)

Handles screen capture, LLM and multi-modal inference, computer vision (e.g., OpenCV), automation (e.g., PyAutoGUI), and serves the API.

4. Information Flow

1. Screen Capture (Backend): A Python script captures the WhatsApp window frame by frame.
2. Data Streaming (Backend → Frontend): Frames are encoded (e.g., base64) and sent to the Frontend via WebSockets for a live feed.
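The frame messages in step 2 can be sketched as a small helper that wraps one encoded frame as a base64 JSON payload for the WebSocket. The message shape (`type`, `id`, `data` keys) is an assumption for illustration; the raw bytes stand in for a JPEG/PNG frame produced by the capture service:

```python
import base64
import json

def frame_to_message(frame_bytes: bytes, frame_id: int) -> str:
    """Wrap one captured, already-encoded frame as a JSON WebSocket
    message. The real bytes would come from the capture service; the
    frontend decodes `data` back into an image for the live feed."""
    return json.dumps({
        "type": "frame",
        "id": frame_id,
        "data": base64.b64encode(frame_bytes).decode("ascii"),
    })

msg = frame_to_message(b"\xff\xd8fake-jpeg-bytes", 1)
```

Base64 inflates the payload by roughly a third; sending binary WebSocket frames directly is a common alternative if bandwidth matters.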
3. User Instruction (Frontend → Backend): User clicks an action button, sending a command (e.g., { "action": "download_images" }) to the FastAPI backend via an HTTP POST request.
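Before acting on a command like `{ "action": "download_images" }`, the backend should validate it. A minimal sketch of that validation, assuming an illustrative allow-list of action names (the real set depends on the UI's buttons):

```python
import json

# Hypothetical action names; the actual list mirrors the frontend's buttons.
ALLOWED_ACTIONS = {"download_images", "download_videos", "download_all"}

def parse_command(body: str) -> dict:
    """Validate the JSON body of the command POST before dispatching it."""
    payload = json.loads(body)
    action = payload.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return payload

cmd = parse_command('{"action": "download_images"}')
```

In the FastAPI backend this check would typically live in a Pydantic schema (see `schemas.py` in the proposed file structure) rather than a hand-rolled function.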
4. Computer Vision & Automation (Backend): The backend processes the latest captured frame with a library like OpenCV to find coordinates of download buttons, then uses PyAutoGUI to move the mouse and click them.
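Conceptually, locating a download button is template matching: sliding a small reference image of the icon over the captured frame and reporting where it matches. OpenCV's `cv2.matchTemplate` does this with similarity scoring in optimized C; the toy pure-Python version below shows the idea with exact matching on 2-D lists of pixel values:

```python
def find_template(frame, template):
    """Return (row, col) of the top-left corner where `template`
    exactly matches inside `frame`, or None if absent. Both arguments
    are 2-D lists of pixel values; cv2.matchTemplate performs the same
    sliding-window search with a similarity score instead of equality."""
    fh, fw = len(frame), len(frame[0])
    th, tw = len(template), len(template[0])
    for r in range(fh - th + 1):
        for c in range(fw - tw + 1):
            if all(frame[r + i][c + j] == template[i][j]
                   for i in range(th) for j in range(tw)):
                return r, c
    return None

frame = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
icon = [[9, 9],
        [9, 9]]
print(find_template(frame, icon))  # (1, 1)
```

Once coordinates are found, `pyautogui.click(x, y)` performs the simulated click; the capture-window offset must be added to convert frame coordinates to screen coordinates.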
5. Feedback (Backend → Frontend): The backend sends status updates and results (e.g., "Download complete") back to the frontend, which are displayed in the activity log.

5. Proposed File Structure

A typical project structure keeps the frontend and backend code organized and maintainable, whether in a single monorepo or separate repositories.

Frontend (React)

/frontend
|-- /src
|   |-- /components
|   |   |-- ControlPanel.tsx
|   |   |-- ScreenCaptureView.tsx
|   |-- /hooks
|   |   |-- useWebSocket.ts
|   |-- /types
|   |   |-- index.ts
|   |-- App.tsx
|   |-- index.tsx
|-- /public
|   |-- index.html
|-- package.json

Backend (FastAPI)

/backend
|-- /app
|   |-- /api
|   |   |-- endpoints.py
|   |   |-- schemas.py
|   |-- /services
|   |   |-- automation.py
|   |   |-- vision.py
|   |   |-- capture.py
|   |-- main.py
|-- requirements.txt
|-- .env
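A plausible `requirements.txt` for the backend stack described above. Versions are deliberately unpinned, and `mss` is one common choice for fast screen capture, not something the brief mandates:

```
fastapi          # API framework
uvicorn          # ASGI server for FastAPI
opencv-python    # computer vision (cv2)
pyautogui        # mouse/keyboard automation
mss              # fast cross-platform screen capture (assumed choice)
python-dotenv    # load settings from .env
```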
