Architecture Overview

This page gives a high-level overview of the DeepFix architecture: the main components, how they interact, and the guiding design principles.

System Overview

DeepFix is a distributed system for AI-powered ML artifact analysis. It follows a client–server architecture that separates artifact computation from intelligent analysis.

Core Components

DeepFix SDK (Client)

The SDK is responsible for:

  • Artifact computation (datasets, checks, metrics)
  • Artifact recording in MLflow
  • Workflow integration (PyTorch Lightning, ML pipelines)
  • Client communication with the DeepFix server

Location: deepfix-sdk/

See also: SDK API Reference.

DeepFix Server

The server is responsible for:

  • Running specialized analysis agents
  • Querying the knowledge base
  • Synthesizing and returning results

Location: deepfix-server/

See also: Server Architecture and Server API Reference.

DeepFix Core

Shared models and types:

  • Data models: APIRequest, APIResponse, artifact models
  • Type definitions: data types, artifact paths, enums

Location: deepfix-core/

See also: Core API Reference.
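
The request/response shapes shown later on this page suggest the following model layout. It is sketched here with stdlib dataclasses for brevity; the actual deepfix-core models are pydantic models, and the field defaults are assumptions:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the shared models, based on the JSON
# examples under "Communication Protocol" on this page.
@dataclass
class APIRequest:
    dataset_name: str
    dataset_artifacts: dict = field(default_factory=dict)
    deepchecks_artifacts: dict = field(default_factory=dict)
    model_checkpoint_artifacts: dict = field(default_factory=dict)
    training_artifacts: dict = field(default_factory=dict)
    language: str = "english"

@dataclass
class APIResponse:
    agent_results: dict = field(default_factory=dict)
    summary: str = ""
    additional_outputs: dict = field(default_factory=dict)
    error_messages: dict = field(default_factory=dict)

req = APIRequest(dataset_name="my-dataset")
```

Keeping these models in deepfix-core means the SDK and server validate against the same schema.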

Knowledge Base

Stores best practices and domain knowledge:

  • Architecture best practices
  • Data quality best practices
  • Training best practices

Location: deepfix-kb/, documents/

Architecture Principles

Separation of Concerns

  • Client handles computation and workflow integration.
  • Server focuses on AI-powered analysis and reasoning.
  • Clear boundaries between SDK, Server, and Core.

Stateless Server

  • No session state between requests.
  • Enables horizontal scaling and simpler deployment.

Artifact Storage

  • MLflow is the source of truth for artifacts.
  • Client generates artifacts and logs them to MLflow.
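
A minimal sketch of this flow, assuming a JSON metrics artifact (the file name and metrics values are illustrative; `mlflow.start_run` and `mlflow.log_artifact` are the standard MLflow calls):

```python
import json
import os
import tempfile

# The client serializes a computed artifact to a file, then logs it
# to MLflow, which remains the source of truth for artifacts.
metrics = {"accuracy": 0.93, "loss": 0.21}  # illustrative values

artifact_dir = tempfile.mkdtemp()
artifact_path = os.path.join(artifact_dir, "training_metrics.json")
with open(artifact_path, "w") as f:
    json.dump(metrics, f)

try:
    import mlflow
    with mlflow.start_run():                # one run per training session
        mlflow.log_artifact(artifact_path)  # artifact now tracked by MLflow
except ImportError:
    pass  # mlflow not installed in this illustration environment
```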

Agentic Analysis

  • Specialized agents for different artifact types (datasets, deepchecks, checkpoints, training).
  • Parallel agent execution where possible.
  • Cross-artifact reasoning for holistic insights.

Local-First Design

  • Designed for local deployment on a single machine.
  • Can scale out to cloud and container deployments.
  • Minimal external dependencies.

Data Flow

Analysis Request Flow

Agent Execution Flow

AnalyseArtifactsAPI
        ▼
AgentContext (decode request)
        ▼
ArtifactAnalysisCoordinator
        ▼
┌─────────────────────────────────────┐
│  Parallel Agent Execution           │
│  - DatasetArtifactsAnalyzer         │
│  - DeepchecksArtifactsAnalyzer      │
│  - ModelCheckpointArtifactsAnalyzer │
│  - TrainingArtifactsAnalyzer        │
└─────────────────────────────────────┘
        ▼
CrossArtifactReasoningAgent (sequential)
        ▼
Synthesize results
        ▼
APIResponse
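
The fan-out/fan-in pattern above can be sketched with asyncio. Agent names mirror the diagram, but their bodies here are stubs, not the server's actual implementation:

```python
import asyncio

# Artifact-specific agents run concurrently; the cross-artifact step
# then runs sequentially over their combined results.
async def run_agent(name: str, artifacts: dict) -> tuple[str, str]:
    await asyncio.sleep(0)  # stands in for an LLM/retrieval call
    return name, f"analyzed {len(artifacts)} artifact(s)"

async def coordinate(request: dict) -> dict:
    agents = {
        "dataset": request.get("dataset_artifacts", {}),
        "deepchecks": request.get("deepchecks_artifacts", {}),
        "model_checkpoint": request.get("model_checkpoint_artifacts", {}),
        "training": request.get("training_artifacts", {}),
    }
    # Parallel agent execution
    results = dict(await asyncio.gather(
        *(run_agent(name, arts) for name, arts in agents.items())
    ))
    # Cross-artifact reasoning runs sequentially on the merged results
    summary = f"cross-artifact summary over {len(results)} agents"
    return {"agent_results": results, "summary": summary}

response = asyncio.run(coordinate({"dataset_artifacts": {"stats": "..."}}))
```

Because each agent only sees its own artifact slice, adding a new artifact type means adding one entry to the fan-out, which is what makes new agents cheap to introduce.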

Technology Stack

Client (SDK)

  • Language: Python 3.11+
  • Key libraries:
      • requests for HTTP communication
      • mlflow for artifact tracking
      • pydantic for data validation

Server

  • Language: Python 3.11+
  • Framework: FastAPI (via LitServe)
  • Key libraries:
      • dspy for LLM orchestration
      • litserve for serving
      • pydantic v2 for validation
      • llama-index-retrievers-bm25 for retrieval

Core

  • Language: Python 3.11+
  • Key libraries: pydantic for data models

Communication Protocol

REST API

  • Protocol: HTTP/HTTPS
  • Format: JSON
  • Main endpoint: POST /v1/analyse

Example Request:

{
  "dataset_name": "my-dataset",
  "dataset_artifacts": {},
  "deepchecks_artifacts": {},
  "model_checkpoint_artifacts": {},
  "training_artifacts": {},
  "language": "english"
}

Example Response:

{
  "agent_results": {},
  "summary": "Cross-artifact summary",
  "additional_outputs": {},
  "error_messages": {}
}
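
A client call to this endpoint might look as follows, using `requests` (already an SDK dependency). The host and port are assumptions for a local deployment, and the request is only prepared here, not sent:

```python
import json
import requests

# Build the JSON body matching the request shape shown above.
payload = {
    "dataset_name": "my-dataset",
    "dataset_artifacts": {},
    "deepchecks_artifacts": {},
    "model_checkpoint_artifacts": {},
    "training_artifacts": {},
    "language": "english",
}

# Prepare the POST without sending it, to show the exact wire format.
req = requests.Request(
    "POST", "http://localhost:8000/v1/analyse", json=payload
).prepare()

# Sending would be: response = requests.Session().send(req)
# A successful response body matches the response shape shown above.
body = json.loads(req.body)
```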

Deployment Architecture

Local Deployment

┌─────────────────────────────────────┐
│           Local Machine             │
│                                     │
│  ┌──────────┐    ┌──────────────┐   │
│  │  Client  │───▶│    Server    │   │
│  └──────────┘    └──────────────┘   │
│       │                 │           │
│       ▼                 ▼           │
│  ┌──────────┐    ┌──────────────┐   │
│  │  MLflow  │    │Knowledge Base│   │
│  └──────────┘    └──────────────┘   │
│                                     │
└─────────────────────────────────────┘

Docker / Compose Deployment

See Docker Deployment for details on running DeepFix in containers alongside MLflow.

Design Decisions

Why Client–Server?

  • Scalability: independently scale analysis.
  • Separation: clear boundary between computation and analysis.
  • Flexibility: SDK can work in offline or degraded mode.

Why MLflow for Artifacts?

  • Standardized artifact storage and tracking.
  • Integration with existing ML workflows.
  • Versioning and reproducibility.

Why Agentic Architecture?

  • Specialization per artifact type.
  • Easy to add new agents.
  • Parallelizable execution.