Multi-stage CUDA build of the native dflash_server from Luce-Org/lucebox-hub (pinned at 42f36f1). Models are not baked into the image; mount /models at runtime. - Dockerfile: nvidia/cuda:12.6.0 devel -> runtime, CUDA_ARCH build-arg (default sm_86), libcuda.so.1 stub symlink + -rpath-link fix - docker-compose.yml: reference service with ./models:/models:ro - Makefile: submodules / doctor / build / run / shell / up-down-logs / push / clean. push targets gitea.va.reichard.io/evan - README + .dockerignore + .gitignore
dflash-server-docker
Docker packaging for the native C++/CUDA dflash_server from
Luce-Org/lucebox-hub (dflash/
subtree). Produces an OpenAI-compatible HTTP server image suitable for port
forwarding to OpenAI-compatible clients (Open WebUI, LM Studio, Cline, Codex,
etc.).
Models are not baked into the image — mount them as a volume at runtime.
Prerequisites
- Host with an NVIDIA GPU + driver supporting CUDA 12.6.
- NVIDIA Container Toolkit configured for your container runtime.
git,make, and a workingdocker(or podman withdockeralias).
Layout
dflash-server-docker/
├── Dockerfile # multi-stage CUDA build, copies lucebox-hub/dflash
├── docker-compose.yml # reference service (mounts ./models)
├── Makefile # submodule init, build, run, push, compose
├── .dockerignore
├── README.md
└── lucebox-hub/ # git submodule, pinned commit
└── dflash/ # source built into the image
└── deps/ # nested submodules (llama.cpp, Block-Sparse-Attention, cutlass)
Quick start
git clone --recurse-submodules git@ssh.gitea.va.reichard.io:evan/dflash-server-docker.git
cd dflash-server-docker
# Build (slow: full CUDA compile, ~20–40 min on a fast machine; defaults to sm_86)
make build
# Place models under ./models (target + Lucebox GGUF draft)
mkdir -p models/draft
# ... copy Qwen3.6-27B-Q4_K_M.gguf to models/
# ... copy dflash-draft-3.6-q8_0.gguf to models/draft/
# Run with the reference flag set
make run
Then point any OpenAI-compatible client at http://<host>:18080/v1.
Targets
| Make target | What it does |
|---|---|
make doctor |
Sanity-check docker + submodules |
make submodules |
git submodule update --init --recursive |
make build |
Build dflash-server:latest for CUDA_ARCH=86 (RTX 3090) |
make rebuild |
Build with --no-cache |
make run |
Run with the reference flag set, mounts ./models:/models:ro |
make shell |
Interactive shell in the built image |
make up / down / logs |
docker compose lifecycle |
make push |
Tag and push to gitea.va.reichard.io/evan/dflash-server:latest |
make clean |
Remove built images |
Common overrides:
make build CUDA_ARCH=89 # RTX 4090
make build CUDA_VERSION=12.4.1 # match older host drivers
make run MODELS_DIR=/srv/models
make push REGISTRY=ghcr.io/evan
Running on a GPU host
docker run --rm --gpus all \
-v /path/to/models:/models:ro \
-p 18080:18080 \
gitea.va.reichard.io/evan/dflash-server:latest \
/models/Qwen3.6-27B-Q4_K_M.gguf \
--draft /models/draft/dflash-draft-3.6-q8_0.gguf \
--host 0.0.0.0 --port 18080 \
--max-ctx 32768 --max-tokens 512 \
--fa-window 2048 \
--ddtree --ddtree-budget 22 \
--model-name luce-dflash
Notes
-
The
lucebox-hubsubmodule is pinned to a specific commit. Bumping it:cd lucebox-hub git fetch git checkout <new-ref> git submodule update --init --recursive cd .. git add lucebox-hub git commit -m "bump lucebox-hub to <new-ref>" -
--host 0.0.0.0inside the container is required for port forwarding. -
Mount
/modelsread-only (:ro) — the server only reads model files. -
See
lucebox-hub/dflash/README.mdfor the full server flag reference, perf numbers, and architecture notes.
Description
Languages
Makefile
62.9%
Dockerfile
37.1%