graphcap Data Service Architecture#

Overview#

The Data Service is a core component of the graphcap system with a dual role: it serves as both a data persistence layer

for all application content and a database access service for the React client. It provides a REST API for data access and job tracking while persisting all application data in a PostgreSQL database.

This document details the architecture, components, and interactions of the Data Service within the graphcap ecosystem.

Purpose#

The Data Service fulfills two critical responsibilities:

  1. Data Management - Acts as the system’s single source of truth for all persistent data - Stores and retrieves images, captions, and their relationships - Maintains metadata for datasets and perspectives - Provides APIs for data querying and manipulation - Ensures data integrity and manages retention policies

  2. Job Tracking - Stores the metadata and status for batch caption jobs - Maintains database tables for job queue prioritization - Records individual task status and results - Provides APIs for job status retrieval and management - Implements database-level job status tracking

These responsibilities allow the Data Service to act as the persistence layer while the React Client serves as the system orchestrator.

Architecture Components#

┌──────────────────────────────────────────────────────┐
│                    Data Service                      │
│                                                      │
│  ┌─────────────┐      ┌──────────────┐     ┌────────┐│
│  │             │      │              │     │        ││
│  │  Hono API   ├──────┤  Repository  ├─────┤ Drizzle││
│  │   Layer     │      │    Layer     │     │   ORM  ││
│  │             │      │              │     │        ││
│  └─────────────┘      └──────────────┘     └────┬───┘│
│                                                 │    │
│                                            ┌────┴───┐│
│                                            │Postgres││
│                                            │Database││
│                                            └────────┘│
│                                                      │
└──────────────────────────────────────────────────────┘
                         ▲
                         │
                         ▼
                 ┌───────────────┐
                 │  React Client │
                 │ (Orchestrator)│
                 └───────────────┘

Core Components#

  1. Hono API Layer - Implements RESTful API endpoints - Handles request validation and error responses - Routes requests to the appropriate repository methods - Implements pagination, filtering and sorting

  2. Repository Layer - Provides abstractions for database operations - Implements business logic for data access - Manages transactions and data integrity

  3. Drizzle ORM - Type-safe ORM for PostgreSQL - Handles SQL query generation - Manages schema migrations

Database Schema#

The Data Service manages several schemas within PostgreSQL:

Main Schemas#

┌────────────────┐      ┌────────────────┐
│  core_schema   │      │  job_queue     │
│                │      │                │
│ - images       │      │ - caption_jobs │
│ - perspectives │      │ - job_items    │
│ - datasets     │      │ - job_archives │
│ - captions     │      │                │
└────────────────┘      └────────────────┘

Job Queue Schema#

┌─────────────────────────┐
│      caption_jobs       │
├─────────────────────────┤
│ id: serial (PK)         │
│ job_id: text (unique)   │
│ status: text (enum)     │
│ created_at: timestamp   │
│ started_at: timestamp   │
│ completed_at: timestamp │
│ type: text              │
│ priority: integer       │
│ total_images: integer   │
│ processed_images: int   │
│ failed_images: integer  │
│ progress: integer       │
│ config: json            │
│ user_id: text           │
│ archived: boolean       │
│ archive_date: timestamp │
└─────────────────────────┘
         │
         │ 1:many
         ▼
┌─────────────────────────┐
│       job_items         │
├─────────────────────────┤
│ id: serial (PK)         │
│ job_id: text (FK)       │
│ image_path: text        │
│ perspective: text       │
│ status: text (enum)     │
│ result: json            │
│ error: text             │
│ processing_time: int    │
│ started_at: timestamp   │
│ completed_at: timestamp │
└─────────────────────────┘

REST API Endpoints#

The Data Service exposes the following REST API endpoints:

Batch Captioning Queue#

Endpoint

Method

Description

/api/perspectives/batch/create

POST

Create a new batch caption job record

/api/perspectives/batch/list

GET

List active jobs with pagination and filters

/api/perspectives/batch/status/:jobId

GET

Get detailed job status including items

/api/perspectives/batch/cancel/:jobId

POST

Mark a job as cancelled in the database

/api/perspectives/batch/reorder

POST

Change job queue order or priorities

/api/perspectives/batch/analyze-images

POST

Analyze images to determine missing perspectives

/api/perspectives/batch/archive/:jobId

POST

Archive a completed job

/api/perspectives/batch/restore/:jobId

POST

Restore an archived job

/api/perspectives/batch/retry-failed/:jobId

POST

Mark failed items for retry

/api/perspectives/batch/statistics

GET

Get queue statistics

Job Item Operations#

Endpoint

Method

Description

/api/perspectives/batch/items/:itemId

POST

Update an individual job item status

/api/perspectives/batch/items/:jobId/list

GET

List all items for a specific job

/api/perspectives/batch/items/:jobId/failed

GET

List only failed items for a job

WebSocket Endpoints#

The Data Service may also provide WebSocket endpoints for real-time updates:

Endpoint

Description

/api/ws/job-updates

Provides real-time job status and progress updates

Implementation Stack#

The Data Service is built using the following technologies:

  • Bun: Runtime environment

  • TypeScript: Programming language

  • Hono.js: Lightweight, high-performance API framework

  • Drizzle ORM: Type-safe SQL query builder

  • PostgreSQL: Relational database

  • zod: Schema validation for API requests

Configuration#

The Data Service is configured using environment variables:

Variable

Description

Default

PORT

Port to run the service on

32550

DATABASE_URL

PostgreSQL connection string

None

NODE_ENV

Environment (development/production)

development

WORKSPACE_PATH

Path to workspace directory

/workspace

MAX_CONCURRENT_JOBS

Maximum concurrent running jobs

2

MAX_CONCURRENT_ITEMS

Maximum concurrent items per job

4

Deployment#

The Data Service is containerized using Docker:

graphcap_data_service:
  container_name: graphcap_data_service
  build:
    context: ./servers/data_service
    dockerfile: Dockerfile.data_service.dev
  ports:
    - "32550:32550"
  environment:
    - NODE_ENV=development
    - PORT=32550
    - DATABASE_URL=postgresql://user:password@graphcap_postgres:5432/graphcap
    - WORKSPACE_PATH=/workspace
    - MAX_CONCURRENT_JOBS=2
    - MAX_CONCURRENT_ITEMS=4
  volumes:
    - ./workspace:/workspace
    - ./servers/data_service/src:/app/src
  networks:
    - graphcap
  depends_on:
    graphcap_postgres:
      condition: service_healthy
  healthcheck:
    test: ["CMD", "wget", "--spider", "http://localhost:32550/health"]
    interval: 5m
    timeout: 10s
    retries: 3
    start_period: 30s