Files

Evan Reichard ab19369966 feat: initial dflash-server docker packaging

Multi-stage CUDA build of the native dflash_server from
Luce-Org/lucebox-hub (pinned at 42f36f1). Models are not baked
into the image; mount /models at runtime.

- Dockerfile: nvidia/cuda:12.6.0 devel -> runtime, CUDA_ARCH build-arg
  (default sm_86), libcuda.so.1 stub symlink + -rpath-link fix
- docker-compose.yml: reference service with ./models:/models:ro
- Makefile: submodules / doctor / build / run / shell / up-down-logs /
  push / clean. push targets gitea.va.reichard.io/evan
- README + .dockerignore + .gitignore

2026-05-21 09:24:57 -04:00

3.6 KiB

Raw Permalink Blame History

dflash-server-docker

Docker packaging for the native C++/CUDA dflash_server from Luce-Org/lucebox-hub (dflash/ subtree). Produces an OpenAI-compatible HTTP server image suitable for port forwarding to OpenAI-compatible clients (Open WebUI, LM Studio, Cline, Codex, etc.).

Models are not baked into the image — mount them as a volume at runtime.

Prerequisites

Host with an NVIDIA GPU + driver supporting CUDA 12.6.
NVIDIA Container Toolkit configured for your container runtime.
git, make, and a working docker (or podman with docker alias).

Layout

dflash-server-docker/
├── Dockerfile              # multi-stage CUDA build, copies lucebox-hub/dflash
├── docker-compose.yml      # reference service (mounts ./models)
├── Makefile                # submodule init, build, run, push, compose
├── .dockerignore
├── README.md
└── lucebox-hub/            # git submodule, pinned commit
    └── dflash/             # source built into the image
        └── deps/           # nested submodules (llama.cpp, Block-Sparse-Attention, cutlass)

Quick start

git clone --recurse-submodules git@ssh.gitea.va.reichard.io:evan/dflash-server-docker.git
cd dflash-server-docker

# Build (slow: full CUDA compile, ~20–40 min on a fast machine; defaults to sm_86)
make build

# Place models under ./models (target + Lucebox GGUF draft)
mkdir -p models/draft
# ... copy Qwen3.6-27B-Q4_K_M.gguf to models/
# ... copy dflash-draft-3.6-q8_0.gguf to models/draft/

# Run with the reference flag set
make run

Then point any OpenAI-compatible client at http://<host>:18080/v1.

Targets

Make target	What it does
`make doctor`	Sanity-check docker + submodules
`make submodules`	`git submodule update --init --recursive`
`make build`	Build `dflash-server:latest` for `CUDA_ARCH=86` (RTX 3090)
`make rebuild`	Build with `--no-cache`
`make run`	Run with the reference flag set, mounts `./models:/models:ro`
`make shell`	Interactive shell in the built image
`make up` / `down` / `logs`	docker compose lifecycle
`make push`	Tag and push to `gitea.va.reichard.io/evan/dflash-server:latest`
`make clean`	Remove built images

Common overrides:

make build CUDA_ARCH=89          # RTX 4090
make build CUDA_VERSION=12.4.1   # match older host drivers
make run MODELS_DIR=/srv/models
make push REGISTRY=ghcr.io/evan

Running on a GPU host

docker run --rm --gpus all \
    -v /path/to/models:/models:ro \
    -p 18080:18080 \
    gitea.va.reichard.io/evan/dflash-server:latest \
        /models/Qwen3.6-27B-Q4_K_M.gguf \
        --draft /models/draft/dflash-draft-3.6-q8_0.gguf \
        --host 0.0.0.0 --port 18080 \
        --max-ctx 32768 --max-tokens 512 \
        --fa-window 2048 \
        --ddtree --ddtree-budget 22 \
        --model-name luce-dflash

Notes

The lucebox-hub submodule is pinned to a specific commit. Bumping it:

cd lucebox-hub
git fetch
git checkout <new-ref>
git submodule update --init --recursive
cd ..
git add lucebox-hub
git commit -m "bump lucebox-hub to <new-ref>"

--host 0.0.0.0 inside the container is required for port forwarding.
Mount /models read-only (:ro) — the server only reads model files.
See lucebox-hub/dflash/README.md for the full server flag reference, perf numbers, and architecture notes.

3.6 KiB Raw Permalink Blame History Unescape Escape