# Section grammar

Everything after the closing `---` of the frontmatter is the document
body. DLM's body parser splits it into typed **sections** using fence
markers of the form `::<type>::` on a line by themselves.

## Section types

### Prose (default)

Any body text that isn't inside an explicit fence is a prose section.
Prose trains via **continued pretraining**: the model learns the
writing style and vocabulary but gets no "question → answer" pressure.

```dlm
# Heading

Prose paragraphs, markdown code blocks, whatever you'd normally write.

Another paragraph after a blank line stays in the same prose section.
```

Code fences (` ``` `) inside prose are preserved; the parser doesn't
interpret `::type::` lines that appear inside a code block.

### Instruction (`::instruction::`)

Open with `::instruction::` on its own line. Each Q&A pair uses
`### Q` and `### A` as grammar markers.

```dlm
::instruction::
### Q
What is a decorator?

### A
A function that takes a function and returns a new function.

### Q
When should I use functools.wraps?

### A
Always, inside decorators.
```

Trains via **supervised fine-tuning (SFT)**: the model sees `Q` text
as the prompt, `A` text as the target. This is the pattern that
produces "helpful assistant" behavior.
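
As a rough illustration of how the `### Q` / `### A` grammar maps to prompt/target rows, here is a minimal Python sketch. The function name `parse_qa_pairs` and the `prompt` / `target` row keys are hypothetical, chosen only for this example; DLM's actual parser and row schema may differ.

```python
def parse_qa_pairs(section_body: str) -> list[dict[str, str]]:
    """Turn `### Q` / `### A` blocks into prompt/target rows (illustrative only)."""
    rows: list[dict[str, str]] = []
    blocks: dict[str, list[str]] = {"Q": [], "A": []}
    role = None  # which block the current line belongs to, if any

    def emit() -> None:
        prompt = "\n".join(blocks["Q"]).strip()
        target = "\n".join(blocks["A"]).strip()
        # Assumption: incomplete pairs are silently dropped in this sketch.
        if prompt and target:
            rows.append({"prompt": prompt, "target": target})
        blocks["Q"].clear()
        blocks["A"].clear()

    for line in section_body.splitlines():
        marker = line.strip()
        if marker == "### Q":
            emit()  # a new question closes the previous pair
            role = "Q"
        elif marker == "### A":
            role = "A"
        elif role is not None:
            blocks[role].append(line)
    emit()
    return rows
```

Fed the example section above, this yields two rows, the first with `What is a decorator?` as the prompt.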

`dlm synth instructions` can also write synthesized instruction
sections back into the document. Those keep the same basic body grammar
but add an HTML provenance marker immediately after the fence. See the
[instruction section reference](instruction-section.md) for the full
marker shape and validation rules.

### Preference (`::preference::`)

Open with `::preference::`. Each record has three blocks:

```dlm
::preference::
### Prompt
Explain recursion to a beginner.

### Chosen
Recursion is when a function calls itself on a smaller piece of the
problem. Imagine matryoshka dolls.

### Rejected
A recursive function is any function that refers to itself in its own
definition using the stack frame protocol.
```

Trains via **DPO** (direct preference optimization) or **ORPO** (odds
ratio preference optimization): the model learns to prefer the `Chosen`
response over the `Rejected` one. The DPO / ORPO trainer lands in
Sprint 17/18.
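
The three-block record shape can be sketched in Python as follows. `parse_preference_records` and the lowercase record keys are hypothetical names, and raising `ValueError` on an incomplete record is an assumption of this sketch; see the preference section reference for the actual validation rules.

```python
import re

# Matches the three grammar markers, each alone on its line.
HEADER_RE = re.compile(r"^### (Prompt|Chosen|Rejected)[ \t]*$", re.MULTILINE)
REQUIRED = ("prompt", "chosen", "rejected")


def parse_preference_records(section_body: str) -> list[dict[str, str]]:
    """Split a preference section into prompt/chosen/rejected records."""
    # re.split with one capture group returns [preamble, name, text, name, text, ...]
    parts = HEADER_RE.split(section_body)
    records: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for name, text in zip(parts[1::2], parts[2::2]):
        key = name.lower()
        if key == "prompt" and current:
            records.append(current)  # a new Prompt starts the next record
            current = {}
        current[key] = text.strip()
    if current:
        records.append(current)
    # Assumption: every record must carry all three blocks.
    for index, record in enumerate(records):
        missing = [k for k in REQUIRED if k not in record]
        if missing:
            raise ValueError(f"record {index} is missing: {', '.join(missing)}")
    return records
```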

### Image (`::image path="..." alt="..."::`)

Schema v10 adds image sections for vision-language bases. The initial
launch covered PaliGemma; later follow-ups added Qwen2-VL,
InternVL2, and Mistral Small 3.1 registry rows. The fence uses
attribute syntax instead of the bare `::type::` form:

```dlm
::image path="figures/architecture.png" alt="training pipeline diagram"::
Caption text describing the figure. The caption body becomes the "text"
part of the training row; the placeholder expands to the base's image
tokens at collate time.
```

Required attribute: `path` (the image file, resolved relative to the
`.dlm`'s parent directory). Optional: `alt` (short description; defaults
to the filename stem on directive-ingested images).
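
A minimal Python sketch of reading the attribute fence and resolving `path` against the `.dlm` file's parent directory. The function name `parse_image_fence`, the regular expressions, and the returned dict shape are all hypothetical; the sketch leaves `alt` as `None` when it is absent because the default for hand-written fences is not specified here.

```python
import re
from pathlib import PurePosixPath

IMAGE_FENCE_RE = re.compile(r"^::image\s+(.*)::$")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_image_fence(line: str, dlm_path: PurePosixPath) -> dict:
    """Extract path/alt from an image fence line (illustrative only)."""
    match = IMAGE_FENCE_RE.match(line.strip())
    if match is None:
        raise ValueError("not an image fence")
    attrs = dict(ATTR_RE.findall(match.group(1)))
    if "path" not in attrs:
        raise ValueError("image fence requires a path attribute")
    return {
        # Resolved relative to the directory that contains the .dlm file.
        "path": dlm_path.parent / attrs["path"],
        "alt": attrs.get("alt"),
    }


fence = '::image path="figures/architecture.png" alt="training pipeline diagram"::'
print(parse_image_fence(fence, PurePosixPath("/docs/paper.dlm")))
```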

**Supported extensions.** `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`,
`.bmp`, `.tiff`. Other binary types (PDF, archives) stay out of the
training corpus by default.

**Content hash.** Image sections hash on `(type, path, blob_sha)`
rather than the body text. Two byte-identical images at different
paths produce different `section_id`s because paths carry meaning.
Changing the blob bytes flips the ID even if the path didn't move.
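
The behavior described above can be illustrated with a small Python sketch. The exact serialization of the `(type, path, blob_sha)` tuple is not documented here, so the NUL-joined key, the use of SHA-256 for `blob_sha`, and the function name `image_section_id` are assumptions made for illustration.

```python
import hashlib


def image_section_id(path: str, blob: bytes) -> str:
    """Derive a 16-hex-char ID from (type, path, blob_sha); encoding is assumed."""
    blob_sha = hashlib.sha256(blob).hexdigest()
    # Assumption: fields are joined with NUL so they cannot run together.
    key = "\x00".join(("image", path, blob_sha))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


same_bytes = b"\x89PNG placeholder bytes"
id_a = image_section_id("figures/a.png", same_bytes)
id_b = image_section_id("figures/b.png", same_bytes)
id_c = image_section_id("figures/a.png", b"edited bytes")
print(id_a != id_b, id_a != id_c)
```

Both comparisons come out different: moving the file changes the ID, and so does editing its bytes.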

**Directive ingest.** `training.sources` directives with image
extensions in their `include` globs ingest automatically:

```yaml
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
```

Each discovered image becomes an `::image::` section with
`alt=<filename-stem>` and flows through the same row-emission path.
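
To make the discovery step concrete, here is a hedged Python sketch using `pathlib` globbing. `discover_images` is a hypothetical helper, not DLM's implementation; matching extensions case-insensitively and reporting paths relative to the source root are assumptions of the sketch.

```python
from pathlib import Path

# The supported extensions listed above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}


def discover_images(root: Path, include: list[str]) -> list[dict[str, str]]:
    """Expand include globs under root and describe each image found."""
    found: set[Path] = set()
    for pattern in include:
        for candidate in root.glob(pattern):
            # Assumption: extension matching ignores case.
            if candidate.is_file() and candidate.suffix.lower() in IMAGE_EXTENSIONS:
                found.add(candidate)
    return [
        # alt falls back to the filename stem, as described above.
        {"path": p.relative_to(root).as_posix(), "alt": p.stem}
        for p in sorted(found)
    ]
```

Calling `discover_images(Path("./paper-figures"), ["**/*.png", "**/*.jpg"])` would mirror the directive shown above.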

**Current InternVL caveat.** InternVL-family rows stay visible in the
registry for planning and future work, but the current runtime still
needs a custom processor/collator path for their `<image>` expansion
and `image_flags` contract. See the [multi-modal training
cookbook](../cookbook/multimodal-training.md) and [VL memory
guide](../hardware/vl-memory.md) before picking `internvl2-2b`.

**Base-model requirements.** Only vision-language bases accept image
sections at training time. `dlm init --multimodal` scaffolds a VL
doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi)
refuse image sections at train start with a pointer to `--multimodal`.

## Fence rules

- A fence must occupy the full line: `::instruction::` with no leading
  or trailing content other than whitespace.
- Fences inside triple-backtick code blocks are **not** active — the
  parser is aware of the code-fence context.
- An unfenced heading (`# ...`, `## ...`) inside an open instruction or
  preference section does **not** close the section. Close with the
  next section fence or end-of-file.
- Section type is case-sensitive; `::Instruction::` is rejected.
- Sprint 20 introduces a `::type#adapter-name::` suffix for
  multi-adapter routing; the v1 parser accepts the suffix but ignores
  the `#...` tail.
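
The rules above can be sketched as a small line-oriented splitter in Python. This is an illustration under stated assumptions, not DLM's parser: `split_sections`, `FENCE_RE`, and `KNOWN_TYPES` are hypothetical names, image fences with attributes are left out, and rejecting any unrecognized type with `ValueError` is a guess at what "rejected" means in practice.

```python
import re

# Bare fence with an optional `#adapter-name` suffix, which is ignored.
FENCE_RE = re.compile(r"^::([A-Za-z]+)(#[\w-]+)?::$")
KNOWN_TYPES = {"instruction", "preference"}


def split_sections(body: str) -> list[tuple[str, str]]:
    """Split a document body into (type, text) pairs."""
    sections: list[tuple[str, str]] = []
    current_type = "prose"
    buffer: list[str] = []
    in_code_block = False

    def flush() -> None:
        text = "\n".join(buffer).strip("\n")
        if text:
            sections.append((current_type, text))
        buffer.clear()

    for line in body.splitlines():
        stripped = line.strip()  # surrounding whitespace is allowed
        if stripped.startswith("```"):
            in_code_block = not in_code_block
        if not in_code_block:
            match = FENCE_RE.match(stripped)
            if match:
                if match.group(1) not in KNOWN_TYPES:
                    # Case-sensitive: `::Instruction::` ends up here.
                    raise ValueError(f"unknown section type: {stripped}")
                flush()
                current_type = match.group(1)
                continue
        # Headings and fenced-off `::type::` lines are ordinary content.
        buffer.append(line)
    flush()
    return sections
```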

## Section IDs

Every section gets a content-addressed ID: the first 16 hex chars of
the SHA-256 of the section's canonical text (image sections hash on
`(type, path, blob_sha)` instead of body text). The manifest's
`content_hashes` records these IDs and their types so the next `dlm train`
can compute what's new, unchanged, or removed (Sprint 08's delta system).

You don't write these IDs in the document; they're derived and live
only in the manifest. But if you're debugging "why isn't this section
being picked up as new?", compare the ID reported by `dlm show --json`
before and after your edit.
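
A minimal Python sketch of the ID derivation and the new/unchanged/removed comparison. What counts as "canonical text" is not defined in this section, so the sketch hashes the text it is given as-is; `section_id` and `diff_sections` are hypothetical names, not part of DLM's API.

```python
import hashlib


def section_id(canonical_text: str) -> str:
    """First 16 hex chars of the SHA-256 of the section text."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]


def diff_sections(previous_ids: set[str], current_texts: list[str]) -> dict[str, set[str]]:
    """Classify section IDs against the IDs recorded by the previous run."""
    current_ids = {section_id(text) for text in current_texts}
    return {
        "new": current_ids - previous_ids,
        "unchanged": current_ids & previous_ids,
        "removed": previous_ids - current_ids,
    }


before = {section_id("Old paragraph."), section_id("Kept paragraph.")}
delta = diff_sections(before, ["Kept paragraph.", "Fresh paragraph."])
print({kind: len(ids) for kind, ids in delta.items()})
```

Because the ID depends only on content, even a one-character edit makes a section show up as one removed ID plus one new ID rather than as a modification.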

## What NOT to put in sections

- API keys, personal data, anything you wouldn't want baked into a
  model you'll share. The adapter learns from everything in the file.
- JSON / YAML config that the model should emit literally — use
  instruction Q&A pairs instead. Training on raw config produces
  noisy generation.
- Massive code dumps (>200 KB). The replay corpus retains everything,
  and `sequence_len` is bounded at 32 KB; a single enormous section
  trains for one step and wastes the remaining token budget.

## See also

- [Instruction section reference](instruction-section.md)
- [Preference section reference](preference-section.md)
- [First train walkthrough](../getting-started/first-train.md)
- [Cookbook: coding tutor](../cookbook/coding-tutor.md) — full
  example of instruction-heavy authoring