.. _data_service:

==================================
graphcap Data Service Architecture
==================================

Overview
========

The Data Service is a core component of the graphcap system with a dual role: it serves as both a **data persistence layer**
for all application content and a **database access service** for the React client. It provides a REST API for
data access and job tracking while persisting all application data in a PostgreSQL database.

This document details the architecture, components, and interactions of the Data Service within the graphcap ecosystem.

Purpose
-------

The Data Service fulfills two critical responsibilities:

1. **Data Management**

   - Acts as the system's single source of truth for all persistent data
   - Stores and retrieves images, captions, and their relationships
   - Maintains metadata for datasets and perspectives
   - Provides APIs for data querying and manipulation
   - Ensures data integrity and manages retention policies

2. **Job Tracking**

   - Stores the metadata and status for batch caption jobs
   - Maintains database tables for job queue prioritization
   - Records individual task status and results
   - Provides APIs for job status retrieval and management
   - Implements database-level job status tracking

These responsibilities allow the Data Service to act as the persistence layer while the React Client serves as the system orchestrator.

Architecture Components
=======================

.. code-block:: text

   ┌──────────────────────────────────────────────────────┐
   │                    Data Service                      │
   │                                                      │
   │  ┌─────────────┐      ┌──────────────┐     ┌────────┐│
   │  │             │      │              │     │        ││
   │  │  Hono API   ├──────┤  Repository  ├─────┤ Drizzle││
   │  │   Layer     │      │    Layer     │     │   ORM  ││
   │  │             │      │              │     │        ││
   │  └─────────────┘      └──────────────┘     └────┬───┘│
   │                                                 │    │
   │                                            ┌────┴───┐│
   │                                            │Postgres││
   │                                            │Database││
   │                                            └────────┘│
   │                                                      │
   └──────────────────────────────────────────────────────┘
                            ▲
                            │
                            ▼
                    ┌───────────────┐
                    │  React Client │
                    │ (Orchestrator)│
                    └───────────────┘


Core Components
---------------

1. **Hono API Layer**

   - Implements RESTful API endpoints
   - Handles request validation and error responses
   - Routes requests to the appropriate repository methods
   - Implements pagination, filtering, and sorting

2. **Repository Layer**

   - Provides abstractions for database operations
   - Implements business logic for data access
   - Manages transactions and data integrity

3. **Drizzle ORM**

   - Type-safe ORM for PostgreSQL
   - Handles SQL query generation
   - Manages schema migrations
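
The split between the API layer and the repository layer can be sketched without any framework. This is a minimal illustration, not the service's actual code: ``JobRecord``, ``JobRepository``, ``InMemoryJobRepository``, and ``listJobsHandler`` are hypothetical names, the status values and the ``page``/``pageSize`` query parameters are assumptions, and an in-memory array stands in for a Drizzle-backed repository.

.. code-block:: typescript

   // Hypothetical sketch of the API -> repository layering (names are illustrative).
   type JobStatus = "pending" | "running" | "completed" | "failed" | "cancelled";
   const JOB_STATUSES: readonly string[] = [
     "pending", "running", "completed", "failed", "cancelled",
   ];

   interface JobRecord {
     jobId: string;
     status: JobStatus;
     priority: number;
   }

   interface ListOptions {
     status?: JobStatus;
     page: number; // 1-based page index
     pageSize: number; // records per page
   }

   // Repository layer: hides how records are stored and queried.
   interface JobRepository {
     list(options: ListOptions): JobRecord[];
   }

   // In-memory stand-in for a repository backed by Drizzle and PostgreSQL.
   class InMemoryJobRepository implements JobRepository {
     private readonly jobs: JobRecord[];

     constructor(jobs: JobRecord[]) {
       this.jobs = jobs;
     }

     list({ status, page, pageSize }: ListOptions): JobRecord[] {
       const filtered = this.jobs
         .filter((job) => status === undefined || job.status === status)
         // Assumption: a higher priority value is served first.
         .sort((a, b) => b.priority - a.priority);
       const start = (page - 1) * pageSize;
       return filtered.slice(start, start + pageSize);
     }
   }

   // API layer: validates raw query parameters, then delegates to the repository.
   function listJobsHandler(
     repo: JobRepository,
     query: Record<string, string | undefined>,
   ): { status: number; body: unknown } {
     const page = Number(query.page ?? "1");
     const pageSize = Number(query.pageSize ?? "20");
     if (!Number.isInteger(page) || page < 1 || !Number.isInteger(pageSize) || pageSize < 1) {
       return { status: 400, body: { error: "page and pageSize must be positive integers" } };
     }
     if (query.status !== undefined && !JOB_STATUSES.includes(query.status)) {
       return { status: 400, body: { error: `unknown status: ${query.status}` } };
     }
     const jobs = repo.list({ page, pageSize, status: query.status as JobStatus | undefined });
     return { status: 200, body: { jobs } };
   }

Because the handler depends only on the ``JobRepository`` interface, the same validation logic can be exercised against an in-memory store in tests and a database-backed store in the running service.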


Database Schema
===============

The Data Service manages two schemas within PostgreSQL, one for core application content and one for the job queue:

Main Schemas
------------

.. code-block:: text

   ┌────────────────┐      ┌────────────────┐
   │  core_schema   │      │  job_queue     │
   │                │      │                │
   │ - images       │      │ - caption_jobs │
   │ - perspectives │      │ - job_items    │
   │ - datasets     │      │ - job_archives │
   │ - captions     │      │                │
   └────────────────┘      └────────────────┘

Job Queue Schema
----------------

.. code-block:: text

   ┌─────────────────────────┐
   │      caption_jobs       │
   ├─────────────────────────┤
   │ id: serial (PK)         │
   │ job_id: text (unique)   │
   │ status: text (enum)     │
   │ created_at: timestamp   │
   │ started_at: timestamp   │
   │ completed_at: timestamp │
   │ type: text              │
   │ priority: integer       │
   │ total_images: integer   │
   │ processed_images: int   │
   │ failed_images: integer  │
   │ progress: integer       │
   │ config: json            │
   │ user_id: text           │
   │ archived: boolean       │
   │ archive_date: timestamp │
   └─────────────────────────┘
            │
            │ 1:many
            ▼
   ┌─────────────────────────┐
   │       job_items         │
   ├─────────────────────────┤
   │ id: serial (PK)         │
   │ job_id: text (FK)       │
   │ image_path: text        │
   │ perspective: text       │
   │ status: text (enum)     │
   │ result: json            │
   │ error: text             │
   │ processing_time: int    │
   │ started_at: timestamp   │
   │ completed_at: timestamp │
   └─────────────────────────┘
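
To make the relationship between the two tables concrete, the following sketch mirrors the ``job_items`` columns as a TypeScript type and derives the counter columns of ``caption_jobs`` from a job's items. It rests on several assumptions that the diagram does not state: the item status values, that ``progress`` is a percentage from 0 to 100, that ``processed_images`` counts only successfully completed items, and that each ``job_items`` row is one unit of work. ``computeCounters`` is a hypothetical helper.

.. code-block:: typescript

   // Assumed status values; the diagram only says "text (enum)".
   type ItemStatus = "pending" | "processing" | "completed" | "failed";

   // Mirrors the job_items columns shown above (snake_case kept on purpose).
   interface JobItemRow {
     id: number;
     job_id: string; // references caption_jobs.job_id
     image_path: string;
     perspective: string;
     status: ItemStatus;
     result: unknown;
     error: string | null;
     processing_time: number | null;
     started_at: Date | null;
     completed_at: Date | null;
   }

   // Counter columns of caption_jobs that can be derived from the items.
   interface JobCounters {
     total_images: number;
     processed_images: number;
     failed_images: number;
     progress: number; // assumed to be a percentage, 0-100
   }

   function computeCounters(
     items: ReadonlyArray<Pick<JobItemRow, "status">>,
   ): JobCounters {
     const total = items.length;
     const completed = items.filter((item) => item.status === "completed").length;
     const failed = items.filter((item) => item.status === "failed").length;
     // An item counts toward progress once it has finished, successfully or not.
     const progress = total === 0 ? 0 : Math.round((100 * (completed + failed)) / total);
     return {
       total_images: total,
       processed_images: completed,
       failed_images: failed,
       progress,
     };
   }

Whether the service stores these counters or recomputes them on read is not specified here; the sketch only shows how the columns could relate to one another.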

REST API Endpoints
==================

The Data Service exposes the following REST API endpoints:

Batch Captioning Queue
----------------------

.. list-table::
   :header-rows: 1
   :widths: 10 8 30

   * - Endpoint
     - Method
     - Description
   * - /api/perspectives/batch/create
     - POST
     - Create a new batch caption job record
   * - /api/perspectives/batch/list
     - GET
     - List active jobs with pagination and filters
   * - /api/perspectives/batch/status/:jobId
     - GET
     - Get detailed job status including items
   * - /api/perspectives/batch/cancel/:jobId
     - POST
     - Mark a job as cancelled in the database
   * - /api/perspectives/batch/reorder
     - POST
     - Change job queue order or priorities
   * - /api/perspectives/batch/analyze-images
     - POST
     - Analyze images to determine missing perspectives
   * - /api/perspectives/batch/archive/:jobId
     - POST
     - Archive a completed job
   * - /api/perspectives/batch/restore/:jobId
     - POST
     - Restore an archived job
   * - /api/perspectives/batch/retry-failed/:jobId
     - POST
     - Mark failed items for retry
   * - /api/perspectives/batch/statistics
     - GET
     - Get queue statistics
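
A client can build requests for these endpoints with a small typed helper. The sketch below takes the paths and HTTP methods from the table above; everything else is an assumption: the helper names (``jobRequest``, ``listRequest``), the default base URL (derived from the default port), and the ``page``, ``limit``, and ``status`` query parameter names.

.. code-block:: typescript

   const BATCH_BASE = "/api/perspectives/batch";
   const DEFAULT_BASE_URL = "http://localhost:32550"; // assumed local address

   // Endpoints that take a :jobId path parameter, with methods from the table above.
   const JOB_ENDPOINTS = {
     status: "GET",
     cancel: "POST",
     archive: "POST",
     restore: "POST",
     "retry-failed": "POST",
   } as const;

   type JobAction = keyof typeof JOB_ENDPOINTS;

   interface RequestSpec {
     method: "GET" | "POST";
     url: string;
   }

   // Builds the method and URL for a per-job endpoint, escaping the job ID.
   function jobRequest(
     action: JobAction,
     jobId: string,
     baseUrl: string = DEFAULT_BASE_URL,
   ): RequestSpec {
     return {
       method: JOB_ENDPOINTS[action],
       url: `${baseUrl}${BATCH_BASE}/${action}/${encodeURIComponent(jobId)}`,
     };
   }

   // Builds the list request; the query parameter names are hypothetical.
   function listRequest(
     filters: { page?: number; limit?: number; status?: string } = {},
     baseUrl: string = DEFAULT_BASE_URL,
   ): RequestSpec {
     const query = new URLSearchParams();
     if (filters.page !== undefined) query.set("page", String(filters.page));
     if (filters.limit !== undefined) query.set("limit", String(filters.limit));
     if (filters.status !== undefined) query.set("status", filters.status);
     const qs = query.toString();
     return { method: "GET", url: `${baseUrl}${BATCH_BASE}/list${qs ? `?${qs}` : ""}` };
   }

The returned ``RequestSpec`` can be passed to ``fetch`` as ``fetch(spec.url, { method: spec.method })``. Keeping URL construction separate from the network call makes the routing table easy to test.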

Job Item Operations
-------------------

.. list-table::
   :header-rows: 1
   :widths: 10 8 30

   * - Endpoint
     - Method
     - Description
   * - /api/perspectives/batch/items/:itemId
     - POST
     - Update an individual job item status
   * - /api/perspectives/batch/items/:jobId/list
     - GET
     - List all items for a specific job
   * - /api/perspectives/batch/items/:jobId/failed
     - GET
     - List only failed items for a job
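
The request body for the item update endpoint is not documented here, so the following is only a sketch of how such a payload might be validated before it reaches the repository. The payload shape (``status``, ``result``, ``error``), the status values, the rule that a failed item must carry an error message, and the ``parseItemUpdate`` helper are all assumptions. The service lists zod for request validation; plain TypeScript is used here to keep the example dependency-free.

.. code-block:: typescript

   // Assumed status values for a job item.
   type ItemStatus = "pending" | "processing" | "completed" | "failed";
   const ITEM_STATUSES: readonly string[] = ["pending", "processing", "completed", "failed"];

   // Hypothetical payload for POST /api/perspectives/batch/items/:itemId
   interface ItemUpdate {
     status: ItemStatus;
     result?: unknown;
     error?: string;
   }

   type ParseResult =
     | { ok: true; value: ItemUpdate }
     | { ok: false; message: string };

   function parseItemUpdate(body: unknown): ParseResult {
     if (typeof body !== "object" || body === null) {
       return { ok: false, message: "body must be a JSON object" };
     }
     const { status, result, error } = body as Record<string, unknown>;
     if (typeof status !== "string" || !ITEM_STATUSES.includes(status)) {
       return { ok: false, message: `status must be one of: ${ITEM_STATUSES.join(", ")}` };
     }
     if (error !== undefined && typeof error !== "string") {
       return { ok: false, message: "error must be a string when present" };
     }
     const errorText = typeof error === "string" ? error : undefined;
     // Assumption: a failed item should always explain why it failed.
     if (status === "failed" && errorText === undefined) {
       return { ok: false, message: "a failed item requires an error message" };
     }
     return { ok: true, value: { status: status as ItemStatus, result, error: errorText } };
   }

Returning a tagged result instead of throwing lets the API layer map a validation failure to a ``400`` response without a ``try``/``catch`` around every handler.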

WebSocket Endpoints
-------------------

The Data Service may also provide WebSocket endpoints for real-time updates:

.. list-table::
   :header-rows: 1
   :widths: 30 30

   * - Endpoint
     - Description
   * - /api/ws/job-updates
     - Provides real-time job status and progress updates
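
The message format for this endpoint is not specified. Purely as an illustration, the sketch below assumes each message is a JSON object with ``jobId``, ``status``, and ``progress`` fields and shows how a client might parse it defensively and fold it into local state. ``JobUpdateMessage``, ``parseJobUpdate``, and ``applyJobUpdate`` are hypothetical names.

.. code-block:: typescript

   // Hypothetical message shape; the real payload may differ.
   interface JobUpdateMessage {
     jobId: string;
     status: string;
     progress: number;
   }

   // Returns null for anything that is not a well-formed update.
   function parseJobUpdate(raw: string): JobUpdateMessage | null {
     let data: unknown;
     try {
       data = JSON.parse(raw);
     } catch {
       return null;
     }
     if (typeof data !== "object" || data === null) return null;
     const { jobId, status, progress } = data as Record<string, unknown>;
     if (
       typeof jobId !== "string" ||
       typeof status !== "string" ||
       typeof progress !== "number"
     ) {
       return null;
     }
     return { jobId, status, progress };
   }

   // Returns a new map so that UI state updates stay immutable.
   function applyJobUpdate(
     state: ReadonlyMap<string, JobUpdateMessage>,
     update: JobUpdateMessage,
   ): Map<string, JobUpdateMessage> {
     const next = new Map(state);
     next.set(update.jobId, update);
     return next;
   }

In a browser client, ``parseJobUpdate`` would typically be called from the ``onmessage`` handler of a ``WebSocket`` connected to ``/api/ws/job-updates``, with malformed messages ignored.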

Implementation Stack
====================

The Data Service is built using the following technologies:

- **Bun**: Runtime environment
- **TypeScript**: Programming language
- **Hono.js**: Lightweight, high-performance API framework
- **Drizzle ORM**: Type-safe SQL query builder
- **PostgreSQL**: Relational database
- **zod**: Schema validation for API requests

Configuration
=============

The Data Service is configured using environment variables:

.. list-table::
   :header-rows: 1
   :widths: 15 35 10

   * - Variable
     - Description
     - Default
   * - PORT
     - Port to run the service on
     - 32550
   * - DATABASE_URL
     - PostgreSQL connection string
     - None
   * - NODE_ENV
     - Environment (development/production)
     - development
   * - WORKSPACE_PATH
     - Path to workspace directory
     - /workspace
   * - MAX_CONCURRENT_JOBS
     - Maximum concurrent running jobs
     - 2
   * - MAX_CONCURRENT_ITEMS
     - Maximum concurrent items per job
     - 4
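
The table above can be turned into a small typed loader. This is a sketch under stated assumptions, not the service's actual startup code: ``ServiceConfig``, ``readPositiveInt``, and ``loadConfig`` are hypothetical names, and treating a missing ``DATABASE_URL`` as a fatal error is an assumption based on it having no default.

.. code-block:: typescript

   interface ServiceConfig {
     port: number;
     databaseUrl: string;
     nodeEnv: "development" | "production";
     workspacePath: string;
     maxConcurrentJobs: number;
     maxConcurrentItems: number;
   }

   type Env = Record<string, string | undefined>;

   // Reads a positive integer variable, falling back to the documented default.
   function readPositiveInt(env: Env, name: string, fallback: number): number {
     const raw = env[name];
     if (raw === undefined || raw === "") return fallback;
     const value = Number(raw);
     if (!Number.isInteger(value) || value < 1) {
       throw new Error(`${name} must be a positive integer, got "${raw}"`);
     }
     return value;
   }

   function loadConfig(env: Env): ServiceConfig {
     const databaseUrl = env.DATABASE_URL;
     if (!databaseUrl) {
       // Assumption: the service cannot start without a connection string.
       throw new Error("DATABASE_URL is required (it has no default)");
     }
     const nodeEnv = env.NODE_ENV ?? "development";
     if (nodeEnv !== "development" && nodeEnv !== "production") {
       throw new Error(`NODE_ENV must be "development" or "production", got "${nodeEnv}"`);
     }
     return {
       port: readPositiveInt(env, "PORT", 32550),
       databaseUrl,
       nodeEnv,
       workspacePath: env.WORKSPACE_PATH ?? "/workspace",
       maxConcurrentJobs: readPositiveInt(env, "MAX_CONCURRENT_JOBS", 2),
       maxConcurrentItems: readPositiveInt(env, "MAX_CONCURRENT_ITEMS", 4),
     };
   }

Calling ``loadConfig(process.env)`` once at startup surfaces a bad value such as ``PORT=abc`` immediately, rather than as a confusing failure later.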

Deployment
==========

The Data Service is containerized using Docker. The following Docker Compose service definition runs the development build:

.. code-block:: yaml

   graphcap_data_service:
     container_name: graphcap_data_service
     build:
       context: ./servers/data_service
       dockerfile: Dockerfile.data_service.dev
     ports:
       - "32550:32550"
     environment:
       - NODE_ENV=development
       - PORT=32550
       - DATABASE_URL=postgresql://user:password@graphcap_postgres:5432/graphcap
       - WORKSPACE_PATH=/workspace
       - MAX_CONCURRENT_JOBS=2
       - MAX_CONCURRENT_ITEMS=4
     volumes:
       - ./workspace:/workspace
       - ./servers/data_service/src:/app/src
     networks:
       - graphcap
     depends_on:
       graphcap_postgres:
         condition: service_healthy
     healthcheck:
       test: ["CMD", "wget", "--spider", "http://localhost:32550/health"]
       interval: 5m
       timeout: 10s
       retries: 3
       start_period: 30s
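
The health check above probes ``/health``, which is not listed among the REST endpoints. As an illustration of what such a handler might look like, the sketch below uses only the standard Fetch API ``Request`` and ``Response`` types; ``handleHealth`` and the response body are hypothetical, and the real service would presumably register this as a Hono route. ``wget --spider`` typically sends a ``HEAD`` request, so the sketch answers both ``GET`` and ``HEAD``.

.. code-block:: typescript

   // Hypothetical health handler built on the standard Request/Response types.
   function handleHealth(request: Request): Response {
     const { pathname } = new URL(request.url);
     if (pathname !== "/health") {
       return new Response("Not Found", { status: 404 });
     }
     if (request.method !== "GET" && request.method !== "HEAD") {
       return new Response("Method Not Allowed", {
         status: 405,
         headers: { Allow: "GET, HEAD" },
       });
     }
     // A HEAD response carries headers only, so the body is omitted.
     const body = request.method === "HEAD" ? null : JSON.stringify({ status: "ok" });
     return new Response(body, {
       status: 200,
       headers: { "Content-Type": "application/json" },
     });
   }

A fuller check might also verify that the database is reachable, so that the container is reported unhealthy when PostgreSQL is unavailable.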