graphcap Data Service Architecture

Overview

The Data Service is a core component of the graphcap system with a dual role: it serves as the data persistence layer for all application content and as the database access service for the React client. It exposes a REST API for data access and job tracking, and persists all application data in a PostgreSQL database.

This document details the architecture, components, and interactions of the Data Service within the graphcap ecosystem.

Purpose

The Data Service fulfills two critical responsibilities:

Data Management
- Acts as the system’s single source of truth for all persistent data
- Stores and retrieves images, captions, and their relationships
- Maintains metadata for datasets and perspectives
- Provides APIs for data querying and manipulation
- Ensures data integrity and manages retention policies

Job Tracking
- Stores the metadata and status for batch caption jobs
- Maintains database tables for job queue prioritization
- Records individual task status and results
- Provides APIs for job status retrieval and management
- Implements database-level job status tracking

These responsibilities allow the Data Service to act as the persistence layer while the React Client serves as the system orchestrator.
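Database-level status tracking usually means the service rejects status changes that make no sense, such as restarting a completed job. The following is a minimal sketch of that idea, not the service's actual rules. The status names `pending` and `running`, the transition table, and the helper names `canTransition` and `applyTransition` are assumptions; this document only implies that jobs can be completed, cancelled, or have failed items.

```typescript
// Hypothetical status values: only "completed", "cancelled", and "failed"
// are implied by this document; "pending" and "running" are assumptions.
type JobStatus = "pending" | "running" | "completed" | "failed" | "cancelled";

// Assumed transition table, for illustration only.
const allowedTransitions: Record<JobStatus, readonly JobStatus[]> = {
  pending: ["running", "cancelled"],
  running: ["completed", "failed", "cancelled"],
  completed: [],
  failed: ["pending"], // a retry puts the job back in the queue
  cancelled: [],
};

// Returns true when moving from `from` to `to` is in the table above.
function canTransition(from: JobStatus, to: JobStatus): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}

// Returns the new status, or throws if the change is not allowed.
function applyTransition(current: JobStatus, next: JobStatus): JobStatus {
  if (!canTransition(current, next)) {
    throw new Error(`Invalid job status transition: ${current} -> ${next}`);
  }
  return next;
}
```

A check like this would typically run in the repository layer before the row is updated, so that every API route shares the same rules.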

Architecture Components

┌──────────────────────────────────────────────────────┐
│                    Data Service                      │
│                                                      │
│  ┌─────────────┐      ┌──────────────┐     ┌────────┐│
│  │             │      │              │     │        ││
│  │  Hono API   ├──────┤  Repository  ├─────┤ Drizzle││
│  │   Layer     │      │    Layer     │     │   ORM  ││
│  │             │      │              │     │        ││
│  └─────────────┘      └──────────────┘     └────┬───┘│
│                                                 │    │
│                                            ┌────┴───┐│
│                                            │Postgres││
│                                            │Database││
│                                            └────────┘│
│                                                      │
└──────────────────────────────────────────────────────┘
                         ▲
                         │
                         ▼
                 ┌───────────────┐
                 │  React Client │
                 │ (Orchestrator)│
                 └───────────────┘

Core Components

Hono API Layer
- Implements RESTful API endpoints
- Handles request validation and error responses
- Routes requests to the appropriate repository methods
- Implements pagination, filtering, and sorting

Repository Layer
- Provides abstractions for database operations
- Implements business logic for data access
- Manages transactions and data integrity

Drizzle ORM
- Type-safe ORM for PostgreSQL
- Handles SQL query generation
- Manages schema migrations
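The split between the API layer and the repository layer can be sketched without any framework. The example below is dependency-free on purpose: it does not use Hono or Drizzle, and every name in it (`JobRow`, `ListQuery`, `InMemoryJobRepository`, `parseListQuery`, `handleListJobs`, the `page`, `limit`, and `sortBy` parameters, and the limit of 100) is a hypothetical stand-in chosen for illustration.

```typescript
// Hypothetical row shape: only a few caption_jobs columns are modelled.
interface JobRow {
  jobId: string;
  status: string;
  priority: number;
}

interface ListQuery {
  page: number;
  limit: number;
  status: string | undefined;
  sortBy: "priority" | "jobId";
}

interface ListResult {
  items: JobRow[];
  total: number;
}

// Repository layer: hides how rows are stored and queried.
interface JobRepository {
  list(query: ListQuery): ListResult;
}

// In-memory stand-in; a real repository would query PostgreSQL via the ORM.
class InMemoryJobRepository implements JobRepository {
  private readonly rows: JobRow[];

  constructor(rows: JobRow[]) {
    this.rows = rows;
  }

  list(query: ListQuery): ListResult {
    const filtered = this.rows.filter(
      (row) => query.status === undefined || row.status === query.status,
    );
    const sorted = filtered.slice().sort((a, b) =>
      query.sortBy === "priority"
        ? b.priority - a.priority // assumed: higher priority first
        : a.jobId.localeCompare(b.jobId),
    );
    const start = (query.page - 1) * query.limit;
    return { items: sorted.slice(start, start + query.limit), total: filtered.length };
  }
}

type RawQuery = Record<string, string | undefined>;

// API layer, step 1: validate the raw query-string values.
function parseListQuery(raw: RawQuery): ListQuery {
  const page = Number(raw["page"] ?? "1");
  const limit = Number(raw["limit"] ?? "20");
  if (!Number.isInteger(page) || page < 1) {
    throw new Error("page must be a positive integer");
  }
  if (!Number.isInteger(limit) || limit < 1 || limit > 100) {
    throw new Error("limit must be an integer between 1 and 100");
  }
  const sortBy: ListQuery["sortBy"] = raw["sortBy"] === "jobId" ? "jobId" : "priority";
  return { page, limit, status: raw["status"], sortBy };
}

// API layer, step 2: route the validated request to the repository.
function handleListJobs(repo: JobRepository, raw: RawQuery): ListResult {
  return repo.list(parseListQuery(raw));
}
```

Because the handler depends only on the `JobRepository` interface, the same validation and routing code can be exercised in tests against the in-memory version and in production against a database-backed one.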

Database Schema

The Data Service manages several schemas within PostgreSQL:

Main Schemas

┌────────────────┐      ┌────────────────┐
│  core_schema   │      │  job_queue     │
│                │      │                │
│ - images       │      │ - caption_jobs │
│ - perspectives │      │ - job_items    │
│ - datasets     │      │ - job_archives │
│ - captions     │      │                │
└────────────────┘      └────────────────┘

Job Queue Schema

┌─────────────────────────┐
│      caption_jobs       │
├─────────────────────────┤
│ id: serial (PK)         │
│ job_id: text (unique)   │
│ status: text (enum)     │
│ created_at: timestamp   │
│ started_at: timestamp   │
│ completed_at: timestamp │
│ type: text              │
│ priority: integer       │
│ total_images: integer   │
│ processed_images: int   │
│ failed_images: integer  │
│ progress: integer       │
│ config: json            │
│ user_id: text           │
│ archived: boolean       │
│ archive_date: timestamp │
└─────────────────────────┘
         │
         │ 1:many
         ▼
┌─────────────────────────┐
│       job_items         │
├─────────────────────────┤
│ id: serial (PK)         │
│ job_id: text (FK)       │
│ image_path: text        │
│ perspective: text       │
│ status: text (enum)     │
│ result: json            │
│ error: text             │
│ processing_time: int    │
│ started_at: timestamp   │
│ completed_at: timestamp │
└─────────────────────────┘
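The counters and the `priority` column in `caption_jobs` suggest how `progress` and queue order might be derived. The sketch below is one plausible reading, not the service's confirmed logic. It assumes that `processed_images` counts successes only, that `progress` is a whole-number percentage of finished images, and that a higher `priority` value runs first; the names `CaptionJob`, `computeProgress`, and `compareQueueOrder` are hypothetical.

```typescript
// Subset of caption_jobs columns, in camelCase for illustration.
interface CaptionJob {
  jobId: string;           // job_id
  status: string;          // text (enum); values are not listed in this document
  createdAt: Date;         // created_at
  priority: number;
  totalImages: number;     // total_images
  processedImages: number; // processed_images (assumed: successes only)
  failedImages: number;    // failed_images
}

// Assumption: progress is the whole-number percentage of images that have
// finished, whether they succeeded or failed.
function computeProgress(job: CaptionJob): number {
  if (job.totalImages <= 0) return 0;
  const done = job.processedImages + job.failedImages;
  // Multiply before dividing to keep the arithmetic on integers.
  return Math.min(100, Math.floor((done * 100) / job.totalImages));
}

// Assumption: higher priority first; ties go to the oldest job.
function compareQueueOrder(a: CaptionJob, b: CaptionJob): number {
  return b.priority - a.priority || a.createdAt.getTime() - b.createdAt.getTime();
}
```

Sorting an array of jobs with `jobs.slice().sort(compareQueueOrder)` yields the order in which they would be picked up under these assumptions.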

REST API Endpoints

The Data Service exposes the following REST API endpoints:

Batch Captioning Queue

Endpoint	Method	Description
/api/perspectives/batch/create	POST	Create a new batch caption job record
/api/perspectives/batch/list	GET	List active jobs with pagination and filters
/api/perspectives/batch/status/:jobId	GET	Get detailed job status including items
/api/perspectives/batch/cancel/:jobId	POST	Mark a job as cancelled in the database
/api/perspectives/batch/reorder	POST	Change job queue order or priorities
/api/perspectives/batch/analyze-images	POST	Analyze images to determine missing perspectives
/api/perspectives/batch/archive/:jobId	POST	Archive a completed job
/api/perspectives/batch/restore/:jobId	POST	Restore an archived job
/api/perspectives/batch/retry-failed/:jobId	POST	Mark failed items for retry
/api/perspectives/batch/statistics	GET	Get queue statistics
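A client can keep these routes in one place by building request descriptions from the paths in the table above. The sketch below only constructs the method and URL for the routes whose shape the table fully specifies; request bodies and query parameter names are not described in this document, so none are assumed. The names `RequestSpec` and `createBatchApi` are hypothetical.

```typescript
type HttpMethod = "GET" | "POST";

interface RequestSpec {
  method: HttpMethod;
  url: string;
}

// Builds request descriptions for the batch queue routes listed above.
function createBatchApi(baseUrl: string) {
  // Drop trailing slashes so the joined URL has exactly one separator.
  const root = baseUrl.replace(/\/+$/, "") + "/api/perspectives/batch";

  const forJob = (method: HttpMethod, action: string, jobId: string): RequestSpec => ({
    method,
    url: `${root}/${action}/${encodeURIComponent(jobId)}`,
  });

  return {
    // Query parameter names are not specified here; callers supply their own.
    list(params: Record<string, string> = {}): RequestSpec {
      const query = Object.keys(params)
        .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(String(params[key]))}`)
        .join("&");
      return { method: "GET", url: query ? `${root}/list?${query}` : `${root}/list` };
    },
    status: (jobId: string): RequestSpec => forJob("GET", "status", jobId),
    cancel: (jobId: string): RequestSpec => forJob("POST", "cancel", jobId),
    archive: (jobId: string): RequestSpec => forJob("POST", "archive", jobId),
    restore: (jobId: string): RequestSpec => forJob("POST", "restore", jobId),
    retryFailed: (jobId: string): RequestSpec => forJob("POST", "retry-failed", jobId),
    statistics: (): RequestSpec => ({ method: "GET", url: `${root}/statistics` }),
  };
}
```

A caller could then pass a spec to `fetch(spec.url, { method: spec.method })`. The base URL is up to the deployment, for example `http://localhost:32550` when the default port is used.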

Job Item Operations

Endpoint	Method	Description
/api/perspectives/batch/items/:itemId	POST	Update an individual job item status
/api/perspectives/batch/items/:jobId/list	GET	List all items for a specific job
/api/perspectives/batch/items/:jobId/failed	GET	List only failed items for a job
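The failed-items listing and the retry-failed operation both come down to selecting the rows of one job whose status is failed. The sketch below shows that selection on plain objects, under the assumption that the item status values include `failed` and `pending` and that a retry clears the stored error. The names `JobItem`, `listFailedItems`, and `markFailedForRetry` are hypothetical, and a real implementation would perform the update in the database.

```typescript
// Subset of job_items columns, in camelCase for illustration.
interface JobItem {
  id: number;
  jobId: string;        // job_id
  imagePath: string;    // image_path
  perspective: string;
  status: string;       // "failed" and "pending" are assumed values
  error: string | null;
}

// Returns only the failed items that belong to the given job.
function listFailedItems(items: JobItem[], jobId: string): JobItem[] {
  return items.filter((item) => item.jobId === jobId && item.status === "failed");
}

// Returns a new array in which the job's failed items are queued again.
// The input array and its objects are left unchanged.
function markFailedForRetry(items: JobItem[], jobId: string): JobItem[] {
  return items.map((item) =>
    item.jobId === jobId && item.status === "failed"
      ? { ...item, status: "pending", error: null }
      : item,
  );
}
```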

WebSocket Endpoints

The Data Service may also provide WebSocket endpoints for real-time updates:

Endpoint	Description
/api/ws/job-updates	Provides real-time job status and progress updates
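The message format for `/api/ws/job-updates` is not specified in this document. Purely as an illustration, the sketch below assumes each message is a JSON object carrying `jobId`, `status`, and `progress`, and shows how a client might validate one before using it. The `JobUpdate` shape and the `parseJobUpdate` helper are hypothetical.

```typescript
// Hypothetical message shape; the real payload is not documented here.
interface JobUpdate {
  jobId: string;
  status: string;
  progress: number;
}

// Parses one raw message. Returns null for anything that is not valid JSON
// or does not match the assumed shape, so callers can ignore bad frames.
function parseJobUpdate(raw: string): JobUpdate | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const record = data as Record<string, unknown>;
  const jobId = record["jobId"];
  const status = record["status"];
  const progress = record["progress"];
  if (typeof jobId !== "string" || typeof status !== "string" || typeof progress !== "number") {
    return null;
  }
  return { jobId, status, progress };
}
```

In a browser client this would typically be called from the socket's `onmessage` handler with `event.data`.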

Implementation Stack

The Data Service is built using the following technologies:

Bun: Runtime environment
TypeScript: Programming language
Hono.js: Lightweight, high-performance API framework
Drizzle ORM: Type-safe SQL query builder
PostgreSQL: Relational database
Zod: Schema validation for API requests

Configuration

The Data Service is configured using environment variables:

Variable	Description	Default
PORT	Port to run the service on	32550
DATABASE_URL	PostgreSQL connection string	None
NODE_ENV	Environment (development/production)	development
WORKSPACE_PATH	Path to workspace directory	/workspace
MAX_CONCURRENT_JOBS	Maximum concurrent running jobs	2
MAX_CONCURRENT_ITEMS	Maximum concurrent items per job	4
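The table above maps naturally onto a small loader that applies the defaults and fails fast on bad values. This is a sketch rather than the service's actual startup code. It assumes that `DATABASE_URL` is required because it has no default, and that any `NODE_ENV` value other than `production` is treated as `development`; the names `DataServiceConfig`, `readPositiveInt`, and `loadConfig` are hypothetical. The environment is passed in as a plain object so the function can be tested without touching the real process environment.

```typescript
interface DataServiceConfig {
  port: number;
  databaseUrl: string;
  nodeEnv: "development" | "production";
  workspacePath: string;
  maxConcurrentJobs: number;
  maxConcurrentItems: number;
}

type Env = Record<string, string | undefined>;

// Reads a positive integer variable, falling back when it is unset or empty.
function readPositiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Builds the configuration using the defaults from the table above.
function loadConfig(env: Env): DataServiceConfig {
  const databaseUrl = env["DATABASE_URL"];
  if (!databaseUrl) {
    // Assumption: no default means the variable is mandatory.
    throw new Error("DATABASE_URL is required");
  }
  return {
    port: readPositiveInt(env, "PORT", 32550),
    databaseUrl,
    nodeEnv: env["NODE_ENV"] === "production" ? "production" : "development",
    workspacePath: env["WORKSPACE_PATH"] || "/workspace",
    maxConcurrentJobs: readPositiveInt(env, "MAX_CONCURRENT_JOBS", 2),
    maxConcurrentItems: readPositiveInt(env, "MAX_CONCURRENT_ITEMS", 4),
  };
}
```

At startup the service would call something like `loadConfig(process.env)` once and pass the result to the components that need it.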

Deployment

The Data Service is containerized using Docker:

graphcap_data_service:
  container_name: graphcap_data_service
  build:
    context: ./servers/data_service
    dockerfile: Dockerfile.data_service.dev
  ports:
    - "32550:32550"
  environment:
    - NODE_ENV=development
    - PORT=32550
    - DATABASE_URL=postgresql://user:password@graphcap_postgres:5432/graphcap
    - WORKSPACE_PATH=/workspace
    - MAX_CONCURRENT_JOBS=2
    - MAX_CONCURRENT_ITEMS=4
  volumes:
    - ./workspace:/workspace
    - ./servers/data_service/src:/app/src
  networks:
    - graphcap
  depends_on:
    graphcap_postgres:
      condition: service_healthy
  healthcheck:
    test: ["CMD", "wget", "--spider", "http://localhost:32550/health"]
    interval: 5m
    timeout: 10s
    retries: 3
    start_period: 30s