Architecture Overview

This page gives a high-level overview of the DeepFix architecture: the main components, how they interact, and the guiding design principles.

System Overview

DeepFix is a distributed system for AI-powered ML artifact analysis. It follows a client–server architecture that separates artifact computation from intelligent analysis.

Core Components

DeepFix SDK (Client)

The SDK is responsible for:

  • Artifact computation (datasets, checks, metrics)
  • Artifact recording in MLflow
  • Workflow integration (PyTorch Lightning, ML pipelines)
  • Client communication with the DeepFix server

Location: deepfix-sdk/

See also: SDK API Reference.

DeepFix Server

The server is responsible for:

  • Running specialized analysis agents
  • Querying the knowledge base
  • Synthesizing and returning results

Location: deepfix-server/

See also: Server Architecture and Server API Reference.

DeepFix Core

Shared models and types:

  • Data models: APIRequest, APIResponse, artifact models
  • Type definitions: data types, artifact paths, enums

Location: deepfix-core/

See also: Core API Reference.
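
The request/response shapes shown later on this page suggest the following model layout. It is sketched here with stdlib dataclasses for brevity; the actual deepfix-core models are pydantic models, and the field defaults are assumptions:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the shared models, based on the JSON
# examples under "Communication Protocol" on this page.
@dataclass
class APIRequest:
    dataset_name: str
    dataset_artifacts: dict = field(default_factory=dict)
    deepchecks_artifacts: dict = field(default_factory=dict)
    model_checkpoint_artifacts: dict = field(default_factory=dict)
    training_artifacts: dict = field(default_factory=dict)
    language: str = "english"

@dataclass
class APIResponse:
    agent_results: dict = field(default_factory=dict)
    summary: str = ""
    additional_outputs: dict = field(default_factory=dict)
    error_messages: dict = field(default_factory=dict)

req = APIRequest(dataset_name="my-dataset")
```

Keeping these models in deepfix-core means the SDK and server validate against the same schema.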

Knowledge Base

Stores best practices and domain knowledge:

  • Architecture best practices
  • Data quality best practices
  • Training best practices

Location: deepfix-kb/, documents/

Architecture Principles

Separation of Concerns

  • Client handles computation and workflow integration.
  • Server focuses on AI-powered analysis and reasoning.
  • Clear boundaries between SDK, Server, and Core.

Stateless Server

  • No session state between requests.
  • Enables horizontal scaling and simpler deployment.

Artifact Storage

  • MLflow is the source of truth for artifacts.
  • Client generates artifacts and logs them to MLflow.
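
A minimal sketch of this flow, assuming a JSON metrics artifact (the file name and metrics values are illustrative; `mlflow.start_run` and `mlflow.log_artifact` are the standard MLflow calls):

```python
import json
import os
import tempfile

# The client serializes a computed artifact to a file, then logs it
# to MLflow, which remains the source of truth for artifacts.
metrics = {"accuracy": 0.93, "loss": 0.21}  # illustrative values

artifact_dir = tempfile.mkdtemp()
artifact_path = os.path.join(artifact_dir, "training_metrics.json")
with open(artifact_path, "w") as f:
    json.dump(metrics, f)

try:
    import mlflow
    with mlflow.start_run():                # one run per training session
        mlflow.log_artifact(artifact_path)  # artifact now tracked by MLflow
except ImportError:
    pass  # mlflow not installed in this illustration environment
```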

Agentic Analysis

  • Specialized agents for different artifact types (datasets, deepchecks, checkpoints, training).
  • Parallel agent execution where possible.
  • Cross-artifact reasoning for holistic insights.

Local-First Design

  • Designed for local deployment on a single machine.
  • Can scale out to cloud and container deployments.
  • Minimal external dependencies.

Data Flow

Analysis Request Flow

Agent Execution Flow

AnalyseArtifactsAPI
        ▼
AgentContext (decode request)
        ▼
ArtifactAnalysisCoordinator
        ▼
┌─────────────────────────────────────┐
│  Parallel Agent Execution           │
│  - DatasetArtifactsAnalyzer         │
│  - DeepchecksArtifactsAnalyzer      │
│  - ModelCheckpointArtifactsAnalyzer │
│  - TrainingArtifactsAnalyzer        │
└─────────────────────────────────────┘
        ▼
CrossArtifactReasoningAgent (sequential)
        ▼
Synthesize results
        ▼
APIResponse
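
The fan-out/fan-in pattern above can be sketched with asyncio. Agent names mirror the diagram, but their bodies here are stubs, not the server's actual implementation:

```python
import asyncio

# Artifact-specific agents run concurrently; the cross-artifact step
# then runs sequentially over their combined results.
async def run_agent(name: str, artifacts: dict) -> tuple[str, str]:
    await asyncio.sleep(0)  # stands in for an LLM/retrieval call
    return name, f"analyzed {len(artifacts)} artifact(s)"

async def coordinate(request: dict) -> dict:
    agents = {
        "dataset": request.get("dataset_artifacts", {}),
        "deepchecks": request.get("deepchecks_artifacts", {}),
        "model_checkpoint": request.get("model_checkpoint_artifacts", {}),
        "training": request.get("training_artifacts", {}),
    }
    # Parallel agent execution
    results = dict(await asyncio.gather(
        *(run_agent(name, arts) for name, arts in agents.items())
    ))
    # Cross-artifact reasoning runs sequentially on the merged results
    summary = f"cross-artifact summary over {len(results)} agents"
    return {"agent_results": results, "summary": summary}

response = asyncio.run(coordinate({"dataset_artifacts": {"stats": "..."}}))
```

Because each agent only sees its own artifact slice, adding a new artifact type means adding one entry to the fan-out, which is what makes new agents cheap to introduce.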

Technology Stack

Client (SDK)

  • Language: Python 3.11+
  • Key libraries:
      • requests for HTTP communication
      • mlflow for artifact tracking
      • pydantic for data validation

Server

  • Language: Python 3.11+
  • Framework: FastAPI (via LitServe)
  • Key libraries:
      • dspy for LLM orchestration
      • litserve for serving
      • pydantic v2 for validation
      • llama-index-retrievers-bm25 for retrieval

Core

  • Language: Python 3.11+
  • Key libraries: pydantic for data models

Communication Protocol

REST API

  • Protocol: HTTP/HTTPS
  • Format: JSON
  • Main endpoint: POST /v1/analyse

Example Request:

{
  "dataset_name": "my-dataset",
  "dataset_artifacts": {},
  "deepchecks_artifacts": {},
  "model_checkpoint_artifacts": {},
  "training_artifacts": {},
  "language": "english"
}

Example Response:

{
  "agent_results": {},
  "summary": "Cross-artifact summary",
  "additional_outputs": {},
  "error_messages": {}
}
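
A client call to this endpoint might look as follows, using `requests` (already an SDK dependency). The host and port are assumptions for a local deployment, and the request is only prepared here, not sent:

```python
import json
import requests

# Build the JSON body matching the request shape shown above.
payload = {
    "dataset_name": "my-dataset",
    "dataset_artifacts": {},
    "deepchecks_artifacts": {},
    "model_checkpoint_artifacts": {},
    "training_artifacts": {},
    "language": "english",
}

# Prepare the POST without sending it, to show the exact wire format.
req = requests.Request(
    "POST", "http://localhost:8000/v1/analyse", json=payload
).prepare()

# Sending would be: response = requests.Session().send(req)
# A successful response body matches the response shape shown above.
body = json.loads(req.body)
```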

Deployment Architecture

Local Deployment

┌─────────────────────────────────────┐
│           Local Machine             │
│                                     │
│  ┌──────────┐    ┌──────────────┐   │
│  │  Client  │───▶│    Server    │   │
│  └──────────┘    └──────────────┘   │
│       │                 │           │
│       ▼                 ▼           │
│  ┌──────────┐    ┌──────────────┐   │
│  │  MLflow  │    │Knowledge Base│   │
│  └──────────┘    └──────────────┘   │
│                                     │
└─────────────────────────────────────┘

Docker / Compose Deployment

See Docker Deployment for details on running DeepFix in containers alongside MLflow.

Design Decisions

Why Client–Server?

  • Scalability: independently scale analysis.
  • Separation: clear boundary between computation and analysis.
  • Flexibility: SDK can work in offline or degraded mode.

Why MLflow for Artifacts?

  • Standardized artifact storage and tracking.
  • Integration with existing ML workflows.
  • Versioning and reproducibility.

Why Agentic Architecture?

  • Specialization per artifact type.
  • Easy to add new agents.
  • Parallelizable execution.