Files
dflash-server-docker/README.md
Evan Reichard ab19369966 feat: initial dflash-server docker packaging
Multi-stage CUDA build of the native dflash_server from
Luce-Org/lucebox-hub (pinned at 42f36f1). Models are not baked
into the image; mount /models at runtime.

- Dockerfile: nvidia/cuda:12.6.0 devel -> runtime, CUDA_ARCH build-arg
  (default sm_86), libcuda.so.1 stub symlink + -rpath-link fix
- docker-compose.yml: reference service with ./models:/models:ro
- Makefile: submodules / doctor / build / run / shell / up-down-logs /
  push / clean. push targets gitea.va.reichard.io/evan
- README + .dockerignore + .gitignore
2026-05-21 09:24:57 -04:00

3.6 KiB
Raw Permalink Blame History

dflash-server-docker

Docker packaging for the native C++/CUDA dflash_server from Luce-Org/lucebox-hub (dflash/ subtree). Produces an OpenAI-compatible HTTP server image suitable for port forwarding to OpenAI-compatible clients (Open WebUI, LM Studio, Cline, Codex, etc.).

Models are not baked into the image — mount them as a volume at runtime.

Prerequisites

  • Host with an NVIDIA GPU + driver supporting CUDA 12.6.
  • NVIDIA Container Toolkit configured for your container runtime.
  • git, make, and a working docker (or podman with docker alias).

Layout

dflash-server-docker/
├── Dockerfile              # multi-stage CUDA build, copies lucebox-hub/dflash
├── docker-compose.yml      # reference service (mounts ./models)
├── Makefile                # submodule init, build, run, push, compose
├── .dockerignore
├── README.md
└── lucebox-hub/            # git submodule, pinned commit
    └── dflash/             # source built into the image
        └── deps/           # nested submodules (llama.cpp, Block-Sparse-Attention, cutlass)

Quick start

git clone --recurse-submodules git@ssh.gitea.va.reichard.io:evan/dflash-server-docker.git
cd dflash-server-docker

# Build (slow: full CUDA compile, ~2040 min on a fast machine; defaults to sm_86)
make build

# Place models under ./models (target + Lucebox GGUF draft)
mkdir -p models/draft
# ... copy Qwen3.6-27B-Q4_K_M.gguf to models/
# ... copy dflash-draft-3.6-q8_0.gguf to models/draft/

# Run with the reference flag set
make run

Then point any OpenAI-compatible client at http://<host>:18080/v1.

Targets

Make target What it does
make doctor Sanity-check docker + submodules
make submodules git submodule update --init --recursive
make build Build dflash-server:latest for CUDA_ARCH=86 (RTX 3090)
make rebuild Build with --no-cache
make run Run with the reference flag set, mounts ./models:/models:ro
make shell Interactive shell in the built image
make up / down / logs docker compose lifecycle
make push Tag and push to gitea.va.reichard.io/evan/dflash-server:latest
make clean Remove built images

Common overrides:

make build CUDA_ARCH=89          # RTX 4090
make build CUDA_VERSION=12.4.1   # match older host drivers
make run MODELS_DIR=/srv/models
make push REGISTRY=ghcr.io/evan

Running on a GPU host

docker run --rm --gpus all \
    -v /path/to/models:/models:ro \
    -p 18080:18080 \
    gitea.va.reichard.io/evan/dflash-server:latest \
        /models/Qwen3.6-27B-Q4_K_M.gguf \
        --draft /models/draft/dflash-draft-3.6-q8_0.gguf \
        --host 0.0.0.0 --port 18080 \
        --max-ctx 32768 --max-tokens 512 \
        --fa-window 2048 \
        --ddtree --ddtree-budget 22 \
        --model-name luce-dflash

Notes

  • The lucebox-hub submodule is pinned to a specific commit. Bumping it:

    cd lucebox-hub
    git fetch
    git checkout <new-ref>
    git submodule update --init --recursive
    cd ..
    git add lucebox-hub
    git commit -m "bump lucebox-hub to <new-ref>"
    
  • --host 0.0.0.0 inside the container is required for port forwarding.

  • Mount /models read-only (:ro) — the server only reads model files.

  • See lucebox-hub/dflash/README.md for the full server flag reference, perf numbers, and architecture notes.