graphcap Inference Bridge Architecture#
Overview#
- The graphcap Inference Bridge is a sophisticated system designed to generate captions for images using different “perspectives.”
A perspective represents a specific way of analyzing and describing an image, tailored for particular use cases. The bridge is built on FastAPI and designed to be extensible, supporting various AI model providers and caption perspectives.
The Inference Bridge is completely stateless - it does not maintain information about jobs, caption configurations, or persist data to databases. It receives requests, processes them, and returns results without preserving state between requests. It serves as a bridge between the client and AI providers, transforming image caption requests into AI model calls and returning structured results.
Key Components#
Perspective System#
Perspectives are the core feature of graphcap. Each perspective:
Has a specific schema defining the structure of captions it generates
Contains a prompt template used to instruct the AI model
Can be grouped into modules for organization
Is configured via JSON files or programmatically
- The perspective system allows users to generate various descriptions of the same image,
each focusing on different aspects.
Provider Management#
The provider system abstracts away different AI model providers through:
A unified interface for vision capabilities
Support for structured output generation
Automatic API key management and rate limiting
Provider-specific formatting and processing
Currently, the system supports providers like Vllm, Ollama, Gemini, and OpenAI with an extensible design to add more.
Processing Pipeline#
The processing flow works as follows:
The bridge receives a REST request with an image and specified perspective
It prepares the image (download/convert as needed)
The selected perspective’s prompt is sent with the image to the configured AI provider
Response is parsed according to the perspective’s schema
Structured data is returned directly to the client via the REST API response
Architecture Diagram#
┌────────────────┐ ┌─────────────────────────────┐
│ │ REST API │ │
│ React Client │────────────────────▶│ Inference Bridge │
│ (Orchestrator) │◀────────────────────│ (Stateless) │
│ │ │ │
└───────┬────────┘ └─────────────┬───────────────┘
│ │
│ │
│ ▼
│ ┌─────────────────────┐
│ │ │
│ │ AI Providers │
▼ │ │
┌───────────────┐ └─────────────────────┘
│ Data Service │
│ │
└───────┬───────┘
│
▼
┌──────────────┐
│ PostgreSQL │
│ Database │
└──────────────┘
┌────────────────────────────────────────────┐
│ Workspace Volume │
│ (Accessed by all services independently) │
└────────────────────────────────────────────┘
Internal Bridge Components:
┌─────────────────────────────────────┐
│ HTTP API (FastAPI) │
└───────────────┬─────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Perspective Router │
└───────────────┬─────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Perspective Service │
├─────────────────┬───────────────────┤
│ ┌─────────────┐│ ┌───────────────┐ │
│ │ Perspective ││ │ Image │ │
│ │ Management ││ │ Processing │ │
│ └─────────────┘│ └───────────────┘ │
└─────────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Provider Management │
├─────────────────────────────────────┤
│ ┌─────────────┐ ┌───────────────┐ │
│ │ Provider │ │ Provider │ │
│ │ Factory │ │ Clients │ │
│ └─────────────┘ └───────────────┘ │
└─────────────────┬───────────────────┘
│
▼
┌─────────────────────────────────────┐
│ AI Model Providers │
├─────────────────────────────────────┤
│ ┌─────────────┐ ┌───────────────┐ │
│ │ Gemini │ │ Other │ │
│ │ │ │ Providers │ │
│ └─────────────┘ └───────────────┘ │
└─────────────────────────────────────┘
Stateless Design#
REST API Design#
The Inference Bridge maintains its stateless nature through a well-defined REST API:
Request-Response Pattern: Each caption request is a self-contained HTTP transaction.
Complete Request Packages: Each request contains all necessary information to process the caption: * Image data or references (file paths, URLs, or base64-encoded data) * Perspective specification * Provider selection * Processing parameters
Independent Processing: Each request is processed independently without knowledge of previous requests.
No Session State: The service doesn’t maintain session state between requests.
Maintaining Statelessness#
The Inference Bridge achieves complete statelessness through:
No Job History: The bridge doesn’t track previous requests or maintain any job state.
No Configuration Storage: All perspective and provider configurations are either: * Loaded at startup from read-only configuration files * Provided in the request payload
No User Sessions: Each request is treated independently, with all authentication and context provided per request.
Ephemeral Storage: Any temporary files created during processing are deleted once the response is sent.
No Cross-Request Dependencies: Each request is processed independently without relying on data from previous requests.
Division of Responsibilities#
The Inference Bridge maintains a clear separation of responsibilities:
React Client: * Acts as the system orchestrator * Sends caption requests to the Inference Bridge * Receives and processes caption results * Coordinates the overall workflow * Manages user interface and interactions
Data Service: * Stores caption results in PostgreSQL * Maintains relationships between images, captions, and perspectives * Provides data access APIs to the React Client * Manages job tracking and status in database tables * Serves as the system’s single source of truth for caption data
Inference Bridge: * Processes individual caption requests * Handles only single-image caption generation * Returns results directly to the client * Maintains no state between requests * Accesses the workspace volume for image files when needed
This separation of concerns ensures that the Inference Bridge remains completely stateless while the Data Service handles persistence and the React Client handles orchestration.
Client-Bridge Interaction Flow#
The typical flow for image processing is:
Client Request: The React Client sends a caption request directly to the Inference Bridge.
Image Processing: The bridge processes the image with the specified perspective and AI provider.
Result Generation: Structured caption data is generated by the AI provider.
Direct Response: Results are returned directly to the React Client in the API response.
Data Persistence: The React Client may optionally send the results to the Data Service for storage.
Job Tracking: For batch operations, the React Client coordinates with the Data Service to track progress.
Scaling and Load Balancing#
This architecture facilitates horizontal scaling:
Multiple Inference Bridge Instances: Multiple stateless bridges can be deployed behind a load balancer.
Independent Scaling: Inference bridges can be scaled independently of other system components.
Zero Downtime Deployment: Bridge instances can be added or removed without system disruption.
Load Balancer Distribution: A standard HTTP load balancer can distribute requests among instances.
Implementation Details#
Perspective Definition#
Perspectives are defined with:
A schema specifying the structure of the output (fields and types)
A prompt template that guides the AI model
Configuration for display and output formatting
Example perspective fields could include:
General description
Key elements
Aesthetic qualities
Technical details
Contextual information
Perspectives are organized into modules for better management, and each module can be enabled or disabled as needed.
Provider Integration#
The system uses an abstraction layer through the BaseClient class that:
Provides a consistent interface over different AI providers
Handles authentication and API communication
Formats prompts and images according to provider requirements
Processes and validates responses
The provider management system supports dynamic loading of provider configurations and handles rate limiting and error recovery.
Image Processing#
The bridge includes utilities for:
Accessing images from the workspace volume
Downloading images from URLs
Processing base64-encoded images
Creating temporary files for processing
Converting between image formats as needed
These utilities ensure that images can be processed regardless of how they’re provided in the request.
Structured Output#
A key feature of the system is the ability to generate structured output via:
Dynamic creation of structured outputs based on perspective schemas
Parsing and validation of AI model responses
Conversion to standardized JSON format
Error handling for malformed responses
API Structure#
REST API#
The Inference Bridge exposes several key API endpoints:
/perspectives/list: Get available perspectives/perspectives/{name}/caption: Generate a caption using a specific perspective/perspectives/analyze: Analyze an image with multiple perspectives/providers/list: Get available AI providers/providers/{name}/models: Get models available for a specific provider
These endpoints form an API for image captioning and analysis that the React Client uses directly.
WebSocket Support#
The bridge also supports WebSocket connections for:
Real-time updates during processing
Streaming responses from AI models that support it
Data Flow#
Request Reception: The bridge receives an HTTP request containing a complete caption payload.
Image Acquisition: The image is retrieved based on the information in the request (workspace path, URL, or base64 data).
Perspective Loading: The specified perspective is loaded from the configuration files or the request payload.
Provider Selection: The appropriate AI provider is selected based on the request.
Caption Generation: The perspective prompt and image are sent to the AI provider.
Response Processing: The response is parsed according to the perspective schema.
Result Delivery: The structured caption data is returned in the HTTP response.
Resource Cleanup: All temporary resources are released after processing.
Deployment#
The Inference Bridge is containerized using Docker, with the following key components:
Base image:
python:3.12-slimPackage management:
uvtoolCore dependencies:
fastapi,uvicorn,pydanticModel-specific dependencies configured per provider
Multiple instances can be deployed behind a load balancer for horizontal scaling.
Security Considerations#
Input Validation: All input payloads are validated before processing
No Persistent Secrets: API keys and credentials are provided at runtime and not stored
Isolated Processing: Each request is processed in isolation
Path Validation: Workspace file paths are validated to prevent directory traversal
Conclusion#
The graphcap Inference Bridge provides a powerful, flexible, and completely stateless system for generating structured captions from images using various AI providers. By maintaining a stateless REST API design, it achieves high scalability and resilience while eliminating the need for state management or database interactions. The clear separation of responsibilities between the client (orchestration), the Inference Bridge (stateless caption generation), and the Data Service (persistence) creates a robust architecture that allows for easy horizontal scaling and seamless deployment in various environments.