# dflash-server-docker Docker packaging for the native C++/CUDA `dflash_server` from [Luce-Org/lucebox-hub](https://github.com/Luce-Org/lucebox-hub) (`dflash/` subtree). Produces an OpenAI-compatible HTTP server image suitable for port forwarding to OpenAI-compatible clients (Open WebUI, LM Studio, Cline, Codex, etc.). Models are **not** baked into the image — mount them as a volume at runtime. ## Prerequisites - Host with an NVIDIA GPU + driver supporting CUDA 12.6. - [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) configured for your container runtime. - `git`, `make`, and a working `docker` (or podman with `docker` alias). ## Layout ``` dflash-server-docker/ ├── Dockerfile # multi-stage CUDA build, copies lucebox-hub/dflash ├── docker-compose.yml # reference service (mounts ./models) ├── Makefile # submodule init, build, run, push, compose ├── .dockerignore ├── README.md └── lucebox-hub/ # git submodule, pinned commit └── dflash/ # source built into the image └── deps/ # nested submodules (llama.cpp, Block-Sparse-Attention, cutlass) ``` ## Quick start ```bash git clone --recurse-submodules git@ssh.gitea.va.reichard.io:evan/dflash-server-docker.git cd dflash-server-docker # Build (slow: full CUDA compile, ~20–40 min on a fast machine; defaults to sm_86) make build # Place models under ./models (target + Lucebox GGUF draft) mkdir -p models/draft # ... copy Qwen3.6-27B-Q4_K_M.gguf to models/ # ... copy dflash-draft-3.6-q8_0.gguf to models/draft/ # Run with the reference flag set make run ``` Then point any OpenAI-compatible client at `http://:18080/v1`. ## Targets | Make target | What it does | |---|---| | `make doctor` | Sanity-check docker + submodules | | `make submodules` | `git submodule update --init --recursive` | | `make build` | Build `dflash-server:latest` for `CUDA_ARCH=86` (RTX 3090) | | `make rebuild` | Build with `--no-cache` | | `make run` | Run with the reference flag set, mounts `./models:/models:ro` | | `make shell` | Interactive shell in the built image | | `make up` / `down` / `logs` | docker compose lifecycle | | `make push` | Tag and push to `gitea.va.reichard.io/evan/dflash-server:latest` | | `make clean` | Remove built images | Common overrides: ```bash make build CUDA_ARCH=89 # RTX 4090 make build CUDA_VERSION=12.4.1 # match older host drivers make run MODELS_DIR=/srv/models make push REGISTRY=ghcr.io/evan ``` ## Running on a GPU host ```bash docker run --rm --gpus all \ -v /path/to/models:/models:ro \ -p 18080:18080 \ gitea.va.reichard.io/evan/dflash-server:latest \ /models/Qwen3.6-27B-Q4_K_M.gguf \ --draft /models/draft/dflash-draft-3.6-q8_0.gguf \ --host 0.0.0.0 --port 18080 \ --max-ctx 32768 --max-tokens 512 \ --fa-window 2048 \ --ddtree --ddtree-budget 22 \ --model-name luce-dflash ``` ## Notes - The `lucebox-hub` submodule is pinned to a specific commit. Bumping it: ```bash cd lucebox-hub git fetch git checkout git submodule update --init --recursive cd .. git add lucebox-hub git commit -m "bump lucebox-hub to " ``` - `--host 0.0.0.0` inside the container is required for port forwarding. - Mount `/models` read-only (`:ro`) — the server only reads model files. - See [`lucebox-hub/dflash/README.md`](lucebox-hub/dflash/README.md) for the full server flag reference, perf numbers, and architecture notes.