# First prompt

`dlm prompt` runs inference against the current adapter using the base
model. It's the fastest way to check "did the training actually stick?"
without involving Ollama or GGUF conversion.

## The happy path

```sh
$ uv run dlm prompt tutor.dlm "What is a Python decorator?"
A decorator is a function that takes another function as input…
```

Behind the scenes:

1. `dlm prompt` parses the `.dlm`, resolves the base model, and
   checks the hardware doctor's capability report.
2. It loads the base model plus the LoRA weights pointed to by
   `adapter/current.txt`, via PEFT (sketched below).
3. It calls `generate()` with your prompt and, by default,
   `--max-tokens 256` and `--temp 0.7`.
4. The response is streamed to stdout; the Rich reporter writes
   progress / plan info to stderr so you can pipe stdout cleanly.
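
Steps 2 and 3 are ordinary Transformers + PEFT plumbing. Here is a
minimal Python sketch of that load-and-generate path; the base model ID
and adapter directory are placeholders, since dlm resolves the real ones
from the `.dlm` file and `adapter/current.txt`:

```python
# Sketch only: load a base model, layer the current LoRA adapter on top,
# and generate with the documented defaults (256 new tokens, temp 0.7).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "your-org/your-base-model"              # placeholder, not what dlm picks
adapter_dir = "/path/to/adapter/versions/v0001"   # placeholder for the current.txt target

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_dir)  # LoRA weights attached via PEFT

inputs = tokenizer("What is a Python decorator?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```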

## Deterministic generation

For reproducible output (useful for comparing adapters), pin
temperature to 0:

```sh
$ uv run dlm prompt tutor.dlm --temp 0 --max-tokens 32 "Say hi"
```

Greedy decoding is deterministic when the weights are byte-identical —
which is the whole point of the [determinism contract](../determinism.md).
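
If you want to check the "byte-identical" precondition yourself before
comparing two adapters, hashing each adapter directory is enough. A
small sketch (the store paths here are illustrative placeholders):

```python
# Hash every file under an adapter version directory so two versions
# can be compared byte-for-byte before comparing their greedy outputs.
import hashlib
from pathlib import Path

def adapter_digest(adapter_dir: str) -> str:
    digest = hashlib.sha256()
    for path in sorted(Path(adapter_dir).rglob("*")):
        if path.is_file():
            digest.update(path.read_bytes())
    return digest.hexdigest()

v1 = adapter_digest("store/<model-id>/adapter/versions/v0001")
v2 = adapter_digest("store/<model-id>/adapter/versions/v0002")
print("byte-identical" if v1 == v2 else "weights differ")
```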

## Verbose plan

Pass `--verbose` to surface the inference plan before generation:

```sh
$ uv run dlm prompt tutor.dlm --verbose "Hello"
plan: {'device': 'mps', 'dtype': 'fp16', 'adapter_path': '...', 'quantization': 'none'}
adapter: ~/.dlm/store/01KC…/adapter/versions/v0001
Hello! How can I help you today?
```

The `plan` dict is the same object written into `manifest.json` during
training, so you can cross-reference the device, dtype, and quantization
the model used the last time it trained.
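
That makes the cross-check scriptable. A sketch of comparing the two
plans, assuming the manifest sits in the store directory and exposes the
plan under a `"plan"` key (both are assumptions about the store layout,
not a documented API):

```python
# Compare the plan recorded at training time against the one dlm prompt
# just printed. Paths and keys are assumed for illustration.
import json
from pathlib import Path

manifest = json.loads(Path("store/<model-id>/manifest.json").read_text())
trained_plan = manifest["plan"]

current_plan = {"device": "mps", "dtype": "fp16", "quantization": "none"}
for key, value in current_plan.items():
    if trained_plan.get(key) != value:
        print(f"{key} changed since training: {trained_plan.get(key)!r} -> {value!r}")
```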

## Piping and stdin

Prompt via stdin for long inputs:

```sh
$ cat long-prompt.txt | uv run dlm prompt tutor.dlm
```

If stdin is empty and no query argument is given, the command exits with
a non-zero code and a clear error rather than hanging.
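
The resolution order is easy to mirror in your own wrappers: an explicit
query argument wins, otherwise piped stdin is read, otherwise the command
errors out. An illustrative sketch (not dlm's actual CLI code):

```python
# Resolve a prompt from an argument or piped stdin; exit non-zero if
# neither is present instead of blocking on an interactive terminal.
import sys

def resolve_prompt(query_arg: str | None) -> str:
    if query_arg:
        return query_arg
    if not sys.stdin.isatty():        # something was piped in
        piped = sys.stdin.read().strip()
        if piped:
            return piped
    sys.exit("error: no prompt given (pass a query or pipe text on stdin)")
```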

## Next

Happy with inference? [Export to Ollama](first-export.md) for a real
standalone model.