Client–Server Architecture
This page details the client–server architecture of DeepFix: what each side is responsible for, how they communicate, and the main design constraints.
Overview
DeepFix separates artifact computation (client) from AI-powered analysis (server). This split lets the analysis service scale and evolve independently of user training code, keeping the system flexible and maintainable.

Architecture Decision
Server state management: hybrid (stateless + in-memory cache).
- Stateless core API for horizontal scalability.
- In-memory LRU cache for knowledge retrieval (upgradeable to Redis).
- MLflow as the persistent artifact store.
- Local-first deployment model.
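The in-memory LRU cache for knowledge retrieval can be sketched with the standard library alone. This is a minimal illustration, not the DeepFix implementation: `retrieve_knowledge` and the `CALLS` counter are hypothetical stand-ins for the lookup that would go through the knowledge base.

```python
from functools import lru_cache

# Hypothetical call counter so the caching effect is observable.
CALLS = {"count": 0}

@lru_cache(maxsize=256)  # in-memory LRU; upgradeable to Redis later
def retrieve_knowledge(query: str) -> str:
    """Stand-in for a knowledge-base lookup routed through the server."""
    CALLS["count"] += 1
    return f"results for {query!r}"

retrieve_knowledge("data drift")
retrieve_knowledge("data drift")  # second call is served from the cache
```

Because the cache lives in process memory, it survives across requests on one server instance but not across replicas, which is why the design note mentions Redis as the upgrade path.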
Key Design Choices
| Aspect | Decision | Rationale |
|---|---|---|
| Communication | REST API | Simple, HTTP-based, widely supported |
| Artifact Storage | Server pulls from MLflow | Simplifies client, centralizes access control |
| State Management | Stateless + cache | Enables scaling, simpler deployment |
| Deployment | Local-first | Matches expected usage, easier to run and debug |
Server Responsibilities
The DeepFix server is responsible for:
1. AI-Powered Analysis
- Execute the multi-agent analysis pipeline.
- Coordinate agent execution (parallel where possible).
- Aggregate agent results into unified output.
- Generate natural-language summaries and recommendations.
The server does not:
- Train models.
- Compute metrics or run training loops.
2. Knowledge Retrieval
- Query the knowledge base via a KnowledgeBridge.
- Cache knowledge retrieval results.
- Validate retrieved knowledge against the agent context.
- Attach knowledge citations to responses.
The server does not own:
- Knowledge base updates or curation.
3. Result Formatting
- Transform agent results into the API response schema.
- Prioritize findings by severity and confidence.
- Format recommendations as concrete steps.
4. Error Handling
- Validate incoming requests against the schema.
- Handle MLflow connection failures.
- Manage agent timeouts and execution errors.
- Return structured error messages.
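Schema validation with structured error messages can be sketched as follows. The required-field set here is an assumption for illustration (only fields shown in the request format below are used); the real DeepFix schema may differ.

```python
from typing import Optional

# Assumed minimal schema; the actual DeepFix request schema may require more.
REQUIRED_FIELDS = {"dataset_name", "language"}

def validate_request(payload: dict) -> Optional[dict]:
    """Return a structured 400-style error dict, or None when valid."""
    missing = sorted(REQUIRED_FIELDS - payload.keys())
    if missing:
        return {
            "status": 400,
            "error": "invalid_request",
            "detail": "missing fields: " + ", ".join(missing),
        }
    return None
```

Returning a machine-readable error body (rather than a bare status code) is what lets the client surface validation problems clearly, as described under client responsibilities.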
Server Boundaries
What the server does:
- ✅ Run AI analysis on artifacts.
- ✅ Query and cache knowledge.
- ✅ Return structured results.
What the server does not do:
- ❌ Compute or generate artifacts.
- ❌ Store artifacts permanently.
- ❌ Log to MLflow.
- ❌ Train models.
- ❌ Manage user sessions.
Client Responsibilities
The DeepFix SDK (client) is responsible for:
1. Artifact Computation
- Generate datasets, Deepchecks reports, and model checkpoints.
- Compute training metrics and logs.
- Run data quality checks and extract statistics.
2. Artifact Recording
- Store artifacts in an MLflow tracking server or local store.
- Tag artifacts with metadata (dataset name, run ID, etc.).
- Handle artifact upload errors and retries.
3. Workflow Integration
- Integrate with PyTorch Lightning (callbacks/hooks).
- Integrate with MLflow experiments.
- Support notebooks, scripts, and pipelines.
4. Client Communication
- Send analysis requests to the DeepFix server.
- Handle server responses and errors.
- Implement retry logic for transient failures.
- Provide clear error messages to users.
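Retry logic for transient failures typically uses exponential backoff. The sketch below is illustrative, not the SDK's actual code; `with_retries` and the `flaky` simulated call are hypothetical names.

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.01):
    """Call fn, retrying on transient (connection) errors with backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ...

# Simulated request that fails twice before succeeding.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

result = with_retries(flaky)
```

In practice the SDK would wrap the HTTP call to the server in such a helper, retrying only on errors it classifies as transient.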
5. Result Processing
- Parse server responses.
- Display results in notebooks, logs, or UIs.
- Optionally store results back into MLflow.
Client Boundaries
What the client does:
- ✅ Compute artifacts (datasets, checks, metrics).
- ✅ Store artifacts in MLflow.
- ✅ Send analysis requests.
- ✅ Render results to users.
What the client does not do:
- ❌ Run AI analysis.
- ❌ Query the knowledge base directly.
- ❌ Manage server state.
Communication Protocol
REST API
The client and server communicate via a JSON-over-HTTP API.
- Endpoint: POST /v1/analyse
- Protocol: HTTP/HTTPS
- Format: JSON
Request format:
```json
{
  "dataset_name": "my-dataset",
  "dataset_artifacts": {},
  "deepchecks_artifacts": {},
  "model_checkpoint_artifacts": {},
  "training_artifacts": {},
  "language": "english"
}
```
Response format:
```json
{
  "agent_results": {},
  "summary": "Cross-artifact summary...",
  "additional_outputs": {
    "recommendations": [],
    "citations": []
  },
  "error_messages": {}
}
```
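As a sketch of the round trip, the client serializes a request dict to JSON, POSTs it, and reads fields out of the parsed response. Field names follow the schemas above; the mock response string stands in for an actual server reply.

```python
import json

# Request payload, matching the documented request format.
request = {
    "dataset_name": "my-dataset",
    "dataset_artifacts": {},
    "deepchecks_artifacts": {},
    "model_checkpoint_artifacts": {},
    "training_artifacts": {},
    "language": "english",
}
body = json.dumps(request)  # what the SDK would POST to /v1/analyse

# Parsing a (mock) server response, matching the documented response format.
response = json.loads(
    '{"agent_results": {}, "summary": "ok", '
    '"additional_outputs": {"recommendations": [], "citations": []}, '
    '"error_messages": {}}'
)
summary = response["summary"]
recommendations = response["additional_outputs"]["recommendations"]
```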
Error Handling
Typical server errors:
- 400 Bad Request: invalid request.
- 404 Not Found: artifacts not found.
- 500 Internal Server Error: server error.
- 503 Service Unavailable: overloaded or offline.
Typical client-side handling:
- Retry with backoff on transient errors.
- Surface validation errors clearly.
- Log and expose server-side error messages.
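One way to decide which errors warrant a retry is a simple status-code classifier. Treating 500 as transient is an assumption made here for illustration; 503 (overloaded or offline) is the clear retry case, while 400 and 404 should be surfaced to the user immediately.

```python
# Assumption: 500 is retried on the guess that it may be intermittent.
TRANSIENT_STATUSES = {500, 503}

def should_retry(status: int) -> bool:
    """Transient server errors get a retry; client errors are surfaced as-is."""
    return status in TRANSIENT_STATUSES
```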
Workflow Patterns
Synchronous Analysis
Use when you want immediate feedback after training or data preparation.
Asynchronous Analysis
Use for long-running analyses or batch jobs.
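An asynchronous pattern can be sketched with `concurrent.futures`: submit the analysis call, continue other work, and collect the result later. The `analyse` function here is a hypothetical stand-in for a blocking call to the server.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse(run_id: str) -> str:
    # Stand-in for a blocking POST /v1/analyse call to the DeepFix server.
    return f"report for {run_id}"

with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(analyse, "run-42")  # returns immediately
    # ... training or other work continues here ...
    report = future.result()                 # block only when the result is needed
```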
Batch Analysis
Use to analyze many experiments or datasets together.
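For batch analysis, a per-run error map keeps one failed experiment from aborting the whole batch. The helper and the fake analysis function below are illustrative names, not SDK API.

```python
def analyse_batch(run_ids, analyse):
    """Analyse many runs, collecting per-run errors instead of aborting."""
    results, errors = {}, {}
    for run_id in run_ids:
        try:
            results[run_id] = analyse(run_id)
        except Exception as exc:
            errors[run_id] = str(exc)
    return results, errors

# Hypothetical analysis call that fails for one run.
def fake_analyse(run_id):
    if run_id == "bad":
        raise ValueError("artifacts not found")
    return "ok"

results, errors = analyse_batch(["a", "bad", "b"], fake_analyse)
```

This mirrors the server's own convention of returning partial results alongside an `error_messages` map.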
Design Rationale
Why Client–Server?
- Separation of concerns between computation and analysis.
- Independent scaling of the analysis service.
- Easier updates to analysis logic without touching training code.
Why Stateless Server?
- Easier horizontal scaling and load balancing.
- No session state to manage or migrate.
- More robust restarts and deployments.
Why MLflow for Artifacts?
- Integrates with existing ML workflows.
- Standard artifact tracking and versioning.
- Rich ecosystem and UI.