feat: initial dflash-server docker packaging
Multi-stage CUDA build of the native dflash_server from Luce-Org/lucebox-hub (pinned at 42f36f1). Models are not baked into the image; mount /models at runtime. - Dockerfile: nvidia/cuda:12.6.0 devel -> runtime, CUDA_ARCH build-arg (default sm_86), libcuda.so.1 stub symlink + -rpath-link fix - docker-compose.yml: reference service with ./models:/models:ro - Makefile: submodules / doctor / build / run / shell / up-down-logs / push / clean. push targets gitea.va.reichard.io/evan - README + .dockerignore + .gitignore
This commit is contained in:
108
README.md
Normal file
108
README.md
Normal file
@@ -0,0 +1,108 @@
|
||||
# dflash-server-docker
|
||||
|
||||
Docker packaging for the native C++/CUDA `dflash_server` from
|
||||
[Luce-Org/lucebox-hub](https://github.com/Luce-Org/lucebox-hub) (`dflash/`
|
||||
subtree). Produces an OpenAI-compatible HTTP server image suitable for port
|
||||
forwarding to OpenAI-compatible clients (Open WebUI, LM Studio, Cline, Codex,
|
||||
etc.).
|
||||
|
||||
Models are **not** baked into the image — mount them as a volume at runtime.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Host with an NVIDIA GPU + driver supporting CUDA 12.6.
|
||||
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
|
||||
configured for your container runtime.
|
||||
- `git`, `make`, and a working `docker` (or podman with `docker` alias).
|
||||
|
||||
## Layout
|
||||
|
||||
```
|
||||
dflash-server-docker/
|
||||
├── Dockerfile # multi-stage CUDA build, copies lucebox-hub/dflash
|
||||
├── docker-compose.yml # reference service (mounts ./models)
|
||||
├── Makefile # submodule init, build, run, push, compose
|
||||
├── .dockerignore
|
||||
├── README.md
|
||||
└── lucebox-hub/ # git submodule, pinned commit
|
||||
└── dflash/ # source built into the image
|
||||
└── deps/ # nested submodules (llama.cpp, Block-Sparse-Attention, cutlass)
|
||||
```
|
||||
|
||||
## Quick start
|
||||
|
||||
```bash
|
||||
git clone --recurse-submodules git@ssh.gitea.va.reichard.io:evan/dflash-server-docker.git
|
||||
cd dflash-server-docker
|
||||
|
||||
# Build (slow: full CUDA compile, ~20–40 min on a fast machine; defaults to sm_86)
|
||||
make build
|
||||
|
||||
# Place models under ./models (target + Lucebox GGUF draft)
|
||||
mkdir -p models/draft
|
||||
# ... copy Qwen3.6-27B-Q4_K_M.gguf to models/
|
||||
# ... copy dflash-draft-3.6-q8_0.gguf to models/draft/
|
||||
|
||||
# Run with the reference flag set
|
||||
make run
|
||||
```
|
||||
|
||||
Then point any OpenAI-compatible client at `http://<host>:18080/v1`.
|
||||
|
||||
## Targets
|
||||
|
||||
| Make target | What it does |
|
||||
|---|---|
|
||||
| `make doctor` | Sanity-check docker + submodules |
|
||||
| `make submodules` | `git submodule update --init --recursive` |
|
||||
| `make build` | Build `dflash-server:latest` for `CUDA_ARCH=86` (RTX 3090) |
|
||||
| `make rebuild` | Build with `--no-cache` |
|
||||
| `make run` | Run with the reference flag set, mounts `./models:/models:ro` |
|
||||
| `make shell` | Interactive shell in the built image |
|
||||
| `make up` / `down` / `logs` | docker compose lifecycle |
|
||||
| `make push` | Tag and push to `gitea.va.reichard.io/evan/dflash-server:latest` |
|
||||
| `make clean` | Remove built images |
|
||||
|
||||
Common overrides:
|
||||
|
||||
```bash
|
||||
make build CUDA_ARCH=89 # RTX 4090
|
||||
make build CUDA_VERSION=12.4.1 # match older host drivers
|
||||
make run MODELS_DIR=/srv/models
|
||||
make push REGISTRY=ghcr.io/evan
|
||||
```
|
||||
|
||||
## Running on a GPU host
|
||||
|
||||
```bash
|
||||
docker run --rm --gpus all \
|
||||
-v /path/to/models:/models:ro \
|
||||
-p 18080:18080 \
|
||||
gitea.va.reichard.io/evan/dflash-server:latest \
|
||||
/models/Qwen3.6-27B-Q4_K_M.gguf \
|
||||
--draft /models/draft/dflash-draft-3.6-q8_0.gguf \
|
||||
--host 0.0.0.0 --port 18080 \
|
||||
--max-ctx 32768 --max-tokens 512 \
|
||||
--fa-window 2048 \
|
||||
--ddtree --ddtree-budget 22 \
|
||||
--model-name luce-dflash
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- The `lucebox-hub` submodule is pinned to a specific commit. Bumping it:
|
||||
|
||||
```bash
|
||||
cd lucebox-hub
|
||||
git fetch
|
||||
git checkout <new-ref>
|
||||
git submodule update --init --recursive
|
||||
cd ..
|
||||
git add lucebox-hub
|
||||
git commit -m "bump lucebox-hub to <new-ref>"
|
||||
```
|
||||
|
||||
- `--host 0.0.0.0` inside the container is required for port forwarding.
|
||||
- Mount `/models` read-only (`:ro`) — the server only reads model files.
|
||||
- See [`lucebox-hub/dflash/README.md`](lucebox-hub/dflash/README.md) for the
|
||||
full server flag reference, perf numbers, and architecture notes.
|
||||
Reference in New Issue
Block a user