Files
dflash-server-docker/README.md
Evan Reichard ab19369966 feat: initial dflash-server docker packaging
Multi-stage CUDA build of the native dflash_server from
Luce-Org/lucebox-hub (pinned at 42f36f1). Models are not baked
into the image; mount /models at runtime.

- Dockerfile: nvidia/cuda:12.6.0 devel -> runtime, CUDA_ARCH build-arg
  (default sm_86), libcuda.so.1 stub symlink + -rpath-link fix
- docker-compose.yml: reference service with ./models:/models:ro
- Makefile: submodules / doctor / build / run / shell / up-down-logs /
  push / clean. push targets gitea.va.reichard.io/evan
- README + .dockerignore + .gitignore
2026-05-21 09:24:57 -04:00

109 lines
3.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# dflash-server-docker
Docker packaging for the native C++/CUDA `dflash_server` from
[Luce-Org/lucebox-hub](https://github.com/Luce-Org/lucebox-hub) (`dflash/`
subtree). Produces an OpenAI-compatible HTTP server image suitable for port
forwarding to OpenAI-compatible clients (Open WebUI, LM Studio, Cline, Codex,
etc.).
Models are **not** baked into the image — mount them as a volume at runtime.
## Prerequisites
- Host with an NVIDIA GPU + driver supporting CUDA 12.6.
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
configured for your container runtime.
- `git`, `make`, and a working `docker` (or podman with `docker` alias).
## Layout
```
dflash-server-docker/
├── Dockerfile # multi-stage CUDA build, copies lucebox-hub/dflash
├── docker-compose.yml # reference service (mounts ./models)
├── Makefile # submodule init, build, run, push, compose
├── .dockerignore
├── README.md
└── lucebox-hub/ # git submodule, pinned commit
└── dflash/ # source built into the image
└── deps/ # nested submodules (llama.cpp, Block-Sparse-Attention, cutlass)
```
## Quick start
```bash
git clone --recurse-submodules git@ssh.gitea.va.reichard.io:evan/dflash-server-docker.git
cd dflash-server-docker
# Build (slow: full CUDA compile, ~2040 min on a fast machine; defaults to sm_86)
make build
# Place models under ./models (target + Lucebox GGUF draft)
mkdir -p models/draft
# ... copy Qwen3.6-27B-Q4_K_M.gguf to models/
# ... copy dflash-draft-3.6-q8_0.gguf to models/draft/
# Run with the reference flag set
make run
```
Then point any OpenAI-compatible client at `http://<host>:18080/v1`.
## Targets
| Make target | What it does |
|---|---|
| `make doctor` | Sanity-check docker + submodules |
| `make submodules` | `git submodule update --init --recursive` |
| `make build` | Build `dflash-server:latest` for `CUDA_ARCH=86` (RTX 3090) |
| `make rebuild` | Build with `--no-cache` |
| `make run` | Run with the reference flag set, mounts `./models:/models:ro` |
| `make shell` | Interactive shell in the built image |
| `make up` / `down` / `logs` | docker compose lifecycle |
| `make push` | Tag and push to `gitea.va.reichard.io/evan/dflash-server:latest` |
| `make clean` | Remove built images |
Common overrides:
```bash
make build CUDA_ARCH=89 # RTX 4090
make build CUDA_VERSION=12.4.1 # match older host drivers
make run MODELS_DIR=/srv/models
make push REGISTRY=ghcr.io/evan
```
## Running on a GPU host
```bash
docker run --rm --gpus all \
-v /path/to/models:/models:ro \
-p 18080:18080 \
gitea.va.reichard.io/evan/dflash-server:latest \
/models/Qwen3.6-27B-Q4_K_M.gguf \
--draft /models/draft/dflash-draft-3.6-q8_0.gguf \
--host 0.0.0.0 --port 18080 \
--max-ctx 32768 --max-tokens 512 \
--fa-window 2048 \
--ddtree --ddtree-budget 22 \
--model-name luce-dflash
```
## Notes
- The `lucebox-hub` submodule is pinned to a specific commit. Bumping it:
```bash
cd lucebox-hub
git fetch
git checkout <new-ref>
git submodule update --init --recursive
cd ..
git add lucebox-hub
git commit -m "bump lucebox-hub to <new-ref>"
```
- `--host 0.0.0.0` inside the container is required for port forwarding.
- Mount `/models` read-only (`:ro`) — the server only reads model files.
- See [`lucebox-hub/dflash/README.md`](lucebox-hub/dflash/README.md) for the
full server flag reference, perf numbers, and architecture notes.